“If you have 100GB of data, how do you process it efficiently in PySpark?”
It’s a classic interview question — but also a real challenge every data engineer faces when working with big data in production.
Let’s break down the best practices step by step, using practical commands you can reuse in your projects.
✅ Step 1: Repartition wisely
Spark parallelizes work across partitions. Too few partitions → underutilized CPU; too many → overhead.
For 100GB, aim for ≈500 partitions (≈200MB each):
df = df.repartition(500)
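Rather than hard-coding 500, you can derive the number from the dataset size. A minimal sketch (the helper name and the 200MB target are illustrative, not a Spark API):

```python
import math

def target_partitions(dataset_bytes, partition_bytes=200 * 1024**2):
    """Rule of thumb: total size / ~200MB per partition."""
    return max(1, math.ceil(dataset_bytes / partition_bytes))

n = target_partitions(100 * 1024**3)  # 100GB -> 512 partitions
# df = df.repartition(n)  # applied to a real DataFrame in a Spark job
```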
✅ Step 2: Cache or persist if reused
If you reuse a DataFrame multiple times (e.g., joins, aggregations):
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
This prevents recomputation and speeds up pipelines.
✅ Step 3: Prefer narrow transformations
Operations like filter, map, select are narrow: data stays on the same partition.
Avoid excessive groupBy, distinct, and wide joins — they cause shuffles, which are costly.
✅ Step 4: Use broadcast joins
When joining 100GB with a much smaller dataset (small enough to fit comfortably in each executor's memory, typically well under 2GB):
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
Broadcasting avoids shuffle by copying the small DataFrame to all worker nodes.
✅ Step 5: Leverage built-in functions
Spark’s built-in functions (when, concat, agg, and friends) execute inside the JVM and are visible to the Catalyst optimizer.
Avoid plain Python UDFs unless truly necessary — they still run distributed, but every row is serialized between the JVM and a Python worker and back, which is slow.
✅ Step 6: Optimize file format and compression
Store intermediate & output data as Parquet (columnar) or ORC — both are compression-friendly and enable predicate pushdown:
df.write.parquet("s3://bucket/output", compression="snappy")
✅ Step 7: Coalesce before writing
If you want fewer output files (e.g., for consumption):
df.coalesce(10).write.parquet("path")
Use coalesce (narrow — it merges partitions without a full shuffle) instead of repartition before writing.
⚡ Bonus: Monitor and tune
Use Spark UI to check stages, tasks, and shuffles
Adjust executor memory, cores, and parallelism
Profile long-running jobs and test at smaller scale first
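As a starting point, here is a spark-submit invocation exercising those knobs (all numbers are illustrative and depend on your cluster; job.py is a placeholder):

```shell
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 25 \
  --conf spark.sql.shuffle.partitions=500 \
  --conf spark.sql.adaptive.enabled=true \
  job.py
```

With adaptive query execution (Spark 3+) enabled, Spark can coalesce shuffle partitions and switch join strategies at runtime, which softens the cost of an imperfect initial setting.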
🚀 Summary
Handling 100GB isn’t about big RAM; it’s about smart distributed design:
✅ Right partitions
✅ Caching
✅ Broadcast joins
✅ Built-in functions
✅ Columnar formats