“If you have 100GB of data, how do you process it efficiently in PySpark?”
It’s a classic interview question — but also a real challenge every data engineer faces when working with big data in production.
Let’s break down the best practices step by step, using practical commands you can reuse in your projects.
✅ Step 1: Repartition wisely
Spark parallelizes work across partitions. Too few partitions → underutilized CPU; too many → overhead.
For 100GB, aim for ≈500 partitions (≈200MB each):
df = df.repartition(500)
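Rather than hard-coding 500, you can derive the number from the dataset size. A minimal sketch (the helper name and the 200MB target are illustrative, not a Spark API):

```python
import math

def target_partitions(dataset_bytes, partition_bytes=200 * 1024**2):
    """Rule of thumb: total size / ~200MB per partition."""
    return max(1, math.ceil(dataset_bytes / partition_bytes))

n = target_partitions(100 * 1024**3)  # 100GB -> 512 partitions
# df = df.repartition(n)  # applied to a real DataFrame in a Spark job
```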
✅ Step 2: Cache or persist if reused
If you reuse a DataFrame multiple times (e.g., joins, aggregations):
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
This prevents recomputation and speeds up pipelines.
✅ Step 3: Prefer narrow transformations
Operations like filter, map, select are narrow: data stays on the same partition.
Avoid excessive groupBy, distinct, and wide joins — they cause shuffles, which are costly.
✅ Step 4: Use broadcast joins
When joining 100GB with a much smaller dataset (small enough to fit comfortably in each executor's memory, typically well under 2GB):
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
Broadcasting avoids shuffle by copying the small DataFrame to all worker nodes.
✅ Step 5: Leverage built-in functions
Spark’s built-in functions (when, concat, agg, and friends) execute inside the JVM and are visible to the Catalyst optimizer.
Avoid plain Python UDFs unless truly necessary — they still run distributed, but every row is serialized between the JVM and a Python worker and back, which is slow.
✅ Step 6: Optimize file format and compression
Store intermediate & output data as Parquet (columnar) or ORC — both are compression-friendly and enable predicate pushdown:
df.write.parquet("s3://bucket/output", compression="snappy")
✅ Step 7: Coalesce before writing
If you want fewer output files (e.g., for consumption):
df.coalesce(10).write.parquet("path")
Use coalesce (narrow — it merges partitions without a full shuffle) instead of repartition before writing.
⚡ Bonus: Monitor and tune
Use Spark UI to check stages, tasks, and shuffles
Adjust executor memory, cores, and parallelism
Profile long-running jobs and test at smaller scale first
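As a starting point, here is a spark-submit invocation exercising those knobs (all numbers are illustrative and depend on your cluster; job.py is a placeholder):

```shell
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 25 \
  --conf spark.sql.shuffle.partitions=500 \
  --conf spark.sql.adaptive.enabled=true \
  job.py
```

With adaptive query execution (Spark 3+) enabled, Spark can coalesce shuffle partitions and switch join strategies at runtime, which softens the cost of an imperfect initial setting.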
🚀 Summary
Handling 100GB isn’t about big RAM; it’s about smart distributed design:
✅ Right partitions
✅ Caching
✅ Broadcast joins
✅ Built-in functions
✅ Columnar formats