Running a Spark Batch Job on Google Cloud Dataproc

As data engineers, one of the most powerful capabilities we rely on is running batch Spark jobs on cloud clusters. Google Cloud Dataproc makes this seamless by letting us submit jobs directly to a managed Spark cluster.

Here’s how I recently submitted a batch Spark job on a Dataproc cluster:

👉 Step 1: Create a Dataproc Cluster

gcloud dataproc clusters create my-spark-cluster \
  --region=us-central1 \
  --zone=us-central1-a \
  --single-node \
  --master-machine-type=n1-standard-4 \
  --master-boot-disk-size=500
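
A quick sanity check before submitting anything is to describe the cluster (same name and region as in the create command above) and confirm it is in the RUNNING state:

gcloud dataproc clusters describe my-spark-cluster --region=us-central1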

👉 Step 2: Submit Spark Job

gcloud dataproc jobs submit spark \
  --cluster=my-spark-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 100
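
The same flow works for your own code as well. As a rough sketch, a PySpark script staged in Cloud Storage could be submitted like this (the bucket and file name are placeholders, not part of the example above):

# gs://my-bucket/etl_job.py is a placeholder path to your own PySpark script
gcloud dataproc jobs submit pyspark gs://my-bucket/etl_job.py \
  --cluster=my-spark-cluster \
  --region=us-central1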

👉 Step 3: Monitor Job

Check the status in the GCP Console (Dataproc → Jobs).

Or use CLI:

gcloud dataproc jobs describe JOB_ID --region=us-central1

(Replace JOB_ID with the job ID printed when the job was submitted.)
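
If the job ID isn't handy, you can list recent jobs, and gcloud can also stream the driver output until the job finishes:

gcloud dataproc jobs list --region=us-central1
gcloud dataproc jobs wait JOB_ID --region=us-central1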

⚡ Within a minute or two, the job completes and the estimated value of Pi shows up in the driver output.
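
If the cluster was spun up just for this job, one follow-up worth doing is deleting it afterward so you are not billed for idle VMs:

gcloud dataproc clusters delete my-spark-cluster --region=us-central1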

💡 Why this matters:

No cluster management overhead.

Scalability for large batch workloads.

Perfect for ETL, analytics, and ML pipelines.

As we scale our data engineering pipelines, tools like Dataproc make big data processing simpler, faster, and cloud-native.
