Below are 20 PySpark interview questions that come up in almost every data engineering interview, along with concise, practical answers you can use to demonstrate mastery.
1. Difference Between RDD, DataFrame, and Dataset
RDD (Resilient Distributed Dataset): Low-level API providing fine-grained control. Immutable and distributed, but receives none of Spark's automatic query optimization.
DataFrame: Higher-level API built on RDDs, supports Catalyst Optimizer and Tungsten for better performance.
Dataset: Typed structure (Scala/Java only). Combines RDD’s type safety and DataFrame’s optimization.
👉 In Python, we mostly use DataFrames.
2. Lazy Evaluation
Spark doesn’t execute transformations immediately. It builds a logical DAG (Directed Acyclic Graph) and waits until an action (like collect(), count(), show()) is called. ✅ Improves performance through pipeline optimization.
3. Catalyst Optimizer
Spark’s query optimization engine that analyzes logical plans, rewrites queries, and generates the most efficient execution plan using rules like predicate pushdown and constant folding.
4. Narrow vs Wide Transformations
Narrow: No data movement (e.g., map, filter, union).
Wide: Requires data shuffle across nodes (e.g., groupBy, join). Wide transformations are costly and can slow jobs.
5. Checkpointing vs Caching
Caching: Stores data in memory for repeated use (df.cache()).
Checkpointing: Saves RDDs/DataFrames to disk for fault tolerance and truncates lineage.
6. Data Shuffling
Occurs when Spark redistributes data across partitions (during joins, groupBy). Optimization tips:
Use broadcast joins for small tables.
Repartition wisely.
Avoid wide transformations if possible.
7. Schema Evolution
Handle changing data schemas using:
mergeSchema=True when reading Parquet.
StructType definitions to enforce schema.
Use Delta Lake for automatic schema evolution.
8. Broadcast vs Shuffle Joins
Broadcast Join: Copies small dataset to all worker nodes — faster.
Shuffle Join: Redistributes large datasets across partitions — slower.
repartition() vs coalesce()
repartition(n): Shuffles data for even distribution (expensive).
coalesce(n): Reduces partitions without a full shuffle (efficient for decreasing partitions).
13. Window Functions Example
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("region").orderBy("sales")
df.withColumn("rank", rank().over(windowSpec)).show()
```
Used for ranking, cumulative sums, etc.
14. UDFs (User Defined Functions)
Custom Python functions registered for use in DataFrame expressions. ❗ Downside: opaque to Catalyst, so they break Spark's optimization, and row-by-row serialization between the JVM and Python workers makes them slow. Prefer Spark SQL built-ins, or pandas UDFs for vectorized execution.
Handling Corrupt Records
Use badRecordsPath in the reader options (a Databricks-specific feature) to divert corrupt rows to a separate location while reading. In open-source Spark, the equivalent is PERMISSIVE mode with columnNameOfCorruptRecord.
18. Spark Submit Modes
Local: Single JVM for testing.
Client: Driver runs on submit node.
Cluster: Driver runs inside cluster (preferred for production).
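A hedged sketch of the three submissions; my_job.py and the YARN master are placeholders for your script and cluster manager.

```shell
# Local mode: driver and executors share a single JVM (good for testing).
spark-submit --master "local[4]" my_job.py

# Client mode: driver runs on the machine you submit from; executors on the cluster.
spark-submit --master yarn --deploy-mode client my_job.py

# Cluster mode: driver runs inside the cluster (preferred for production).
spark-submit --master yarn --deploy-mode cluster my_job.py
```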
19. Lineage Graph
Tracks transformations applied to RDD/DataFrame. If a node fails, Spark recomputes lost data using lineage — ensuring fault tolerance.
20. Debugging Failed Spark Jobs
Check Spark UI → Stages → Failed tasks.
Look at executor logs.
Enable spark.eventLog.enabled for detailed traces.
Use try/except with logging in PySpark scripts.
Conclusion: If you’ve gone through all 20 and understood why each concept matters — congratulations, you’re not just using PySpark; you think like Spark.