PySpark Interview Questions and Answers

Below are 20 PySpark interview questions that appear in almost every data engineering interview — along with concise, practical answers you can use to demonstrate mastery.


1. Difference Between RDD, DataFrame, and Dataset

  • RDD (Resilient Distributed Dataset): Low-level API providing fine-grained control. Immutable and distributed but lacks optimization.
  • DataFrame: Higher-level API built on RDDs, supports Catalyst Optimizer and Tungsten for better performance.
  • Dataset: Typed structure (Scala/Java only). Combines RDD’s type safety and DataFrame’s optimization.

👉 In Python, we mostly use DataFrames.


2. Lazy Evaluation

Spark doesn’t execute transformations immediately. It builds a logical DAG (Directed Acyclic Graph) and waits until an action (like collect(), count(), show()) is called.
✅ Improves performance through pipeline optimization.


3. Catalyst Optimizer

Spark’s query optimization engine that analyzes logical plans, rewrites queries, and generates the most efficient execution plan using rules like predicate pushdown and constant folding.


4. Narrow vs Wide Transformations

  • Narrow: No data movement between partitions (e.g., map, filter, union).
  • Wide: Requires a data shuffle across nodes (e.g., groupBy, join).

Wide transformations are costly and can slow down jobs.

5. Checkpointing vs Caching

  • Caching: Stores data in memory for repeated use (df.cache()).
  • Checkpointing: Saves RDDs/DataFrames to disk for fault tolerance and truncates lineage.

6. Data Shuffling

Occurs when Spark redistributes data across partitions (during joins, groupBy).
Optimization tips:

  • Use broadcast joins for small tables.
  • Repartition wisely.
  • Avoid wide transformations if possible.

7. Schema Evolution

Handle changing data schemas by:

  • Setting mergeSchema=True when reading Parquet.
  • Enforcing schemas with explicit StructType definitions.
  • Using Delta Lake for automatic schema evolution.

8. Broadcast vs Shuffle Joins

  • Broadcast Join: Copies small dataset to all worker nodes — faster.
  • Shuffle Join: Redistributes large datasets across partitions — slower.

👉 Use broadcast(df2) for small lookups.


9. Spark UI

Accessed via port 4040 on the driver, the Spark UI shows:

  • Stages & Tasks
  • Shuffle Reads/Writes
  • Executor memory
  • Job duration

It helps analyze performance bottlenecks.

10. Spark Job Optimization Tips

  • Use cache() only when reused.
  • Reduce shuffle operations.
  • Optimize partition size (~128 MB).
  • Avoid UDFs where possible.
  • Use explain() to understand query plans.

11. Handling Skewed Data

  • Use salting (add random key to distribute load).
  • Apply broadcast joins.
  • Use repartition() on skewed columns.

12. repartition() vs coalesce()

  • repartition(n): Shuffles data for even distribution (expensive).
  • coalesce(n): Reduces partitions without full shuffle (efficient for decreasing partitions).

13. Window Functions Example

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# df is assumed to have "region" and "sales" columns
windowSpec = Window.partitionBy("region").orderBy("sales")
df.withColumn("rank", rank().over(windowSpec)).show()

Used for ranking, cumulative sums, etc.


14. UDFs (User Defined Functions)

Custom Python functions used in Spark.
❗ Downside: Breaks Spark’s optimization, slower due to serialization.
Prefer Spark SQL built-ins or pandas UDFs for vectorization.


15. Reading/Writing from Cloud Storage

df = spark.read.parquet("s3://bucket/data/")
df.write.mode("overwrite").parquet("gs://bucket/output/")

Use partitioned writes, compression (e.g., Snappy), and parallelism tuning for better performance.


16. Checkpoint vs Persist vs Cache

Method       | Storage     | Purpose
cache()      | Memory      | Reuse within the same session
persist()    | Memory/Disk | Control the storage level
checkpoint() | Disk        | Fault tolerance, lineage truncation

17. Handling Missing/Corrupted Records

df.na.fill(0)             # replace nulls with 0
df.na.drop()              # drop rows containing nulls
df.na.replace("?", None)  # treat "?" values as null

Use the badRecordsPath option (a Databricks feature) to store corrupt rows while reading; in open-source Spark, use the PERMISSIVE read mode with columnNameOfCorruptRecord.


18. Spark Submit Modes

  • Local: Single JVM for testing.
  • Client: Driver runs on submit node.
  • Cluster: Driver runs inside cluster (preferred for production).
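The three modes can be sketched as spark-submit invocations (the job.py script, YARN master, and resource sizes are placeholders):

```shell
# Local mode: everything in one JVM on the submitting machine (testing)
spark-submit --master "local[4]" job.py

# Client mode: driver stays on the machine that ran spark-submit
spark-submit --master yarn --deploy-mode client job.py

# Cluster mode: driver launched inside the cluster (production)
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 10 --executor-memory 4g job.py
```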

19. Lineage Graph

Tracks transformations applied to RDD/DataFrame.
If a node fails, Spark recomputes lost data using lineage — ensuring fault tolerance.


20. Debugging Failed Spark Jobs

  • Check Spark UI → Stages → Failed tasks.
  • Look at executor logs.
  • Enable spark.eventLog.enabled for detailed traces.
  • Use try/except with logging in PySpark scripts.

Conclusion:
If you’ve gone through all 20 and understood why each concept matters — congratulations, you’re not just using PySpark; you think like Spark.

