PySpark Interview Q&As for Data Engineers
✅ 15 PySpark Interview Q&As for Data Engineers:
- What is PySpark?
🔹 PySpark is the Python API for Apache Spark. It allows Python developers to harness the power of distributed computing with Spark’s in-memory processing.
- How is PySpark different from Pandas?
🔹 PySpark handles distributed data and is built for big data, while Pandas works on single-node memory — best for small datasets.
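A minimal sketch of the same aggregation in both libraries; the file path data/sales.csv and the region/amount columns are made up for illustration:
```python
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Pandas: the whole file must fit in the memory of a single machine
pdf = pd.read_csv("data/sales.csv")
print(pdf.groupby("region")["amount"].sum())

# PySpark: the same logic, but the work is split across executors
spark = SparkSession.builder.appName("pandas-vs-pyspark").getOrCreate()
sdf = spark.read.csv("data/sales.csv", header=True, inferSchema=True)
sdf.groupBy("region").agg(F.sum("amount")).show()
```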
- What are RDDs and DataFrames?
🔹 RDD (Resilient Distributed Dataset): low-level, immutable distributed data.
🔹 DataFrame: high-level abstraction, optimized execution using Catalyst engine.
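A quick sketch contrasting the two abstractions on the same toy records (names and ages are made up):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD: low-level, you describe *how* to transform each record
rdd = sc.parallelize([("alice", 34), ("bob", 29)])
adults = rdd.filter(lambda row: row[1] >= 30)
print(adults.collect())

# DataFrame: high-level, you describe *what* you want; Catalyst optimizes the plan
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.filter(df.age >= 30).show()
```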
- What is lazy evaluation in PySpark?
🔹 Operations are not executed immediately. Transformations are queued, and execution happens only when an action (like .collect(), .show()) is triggered.
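A small illustration, assuming only a local SparkSession: every line below except the last builds up the plan without running anything.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

df = spark.range(1_000_000)                    # transformation: nothing runs yet
doubled = df.selectExpr("id * 2 AS doubled")   # still no job submitted
filtered = doubled.filter("doubled % 10 = 0")  # still nothing

filtered.show(5)   # action: only now does Spark build and run the physical plan
```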
- Explain wide vs narrow transformations.
🔹 Narrow: no shuffle needed; each output partition depends on a single input partition (e.g., map, filter)
🔹 Wide: data is shuffled across partitions (e.g., groupBy, join) — expensive.
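A short sketch that contrasts the two; the physical plan printed by explain() shows an Exchange (shuffle) only for the wide transformation:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
df = spark.range(100).withColumn("bucket", F.col("id") % 5)

# Narrow: filter works partition by partition, no data moves between executors
narrow = df.filter(F.col("id") > 50)

# Wide: groupBy needs all rows with the same key together, so Spark shuffles
wide = df.groupBy("bucket").count()

narrow.explain()  # no Exchange step in the plan
wide.explain()    # the plan includes an Exchange (shuffle) for the groupBy
```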
- Difference between map() and flatMap()?
🔹 map() returns one output per input.
🔹 flatMap() can return zero or more outputs per input, and the results are flattened into a single RDD.
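The classic word-split example makes the difference concrete (the sentences are arbitrary):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["hello world", "spark is fast"])

# map: one output element per input element (here, a list per line)
print(lines.map(lambda line: line.split(" ")).collect())
# [['hello', 'world'], ['spark', 'is', 'fast']]

# flatMap: the per-line lists are flattened into a single RDD of words
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'spark', 'is', 'fast']
```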
- How do you handle skewed data in PySpark?
🔹 Use salting, broadcast joins, repartitioning, or custom partitioners to handle data skew.
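As a hedged sketch of salting, assume two existing DataFrames, facts (skewed on customer_id) and dims; the names and the salt factor of 8 are illustrative:
```python
from pyspark.sql import functions as F

# Assumes `facts` (skewed on customer_id) and `dims` already exist;
# names and the salt factor are illustrative.
SALT_BUCKETS = 8

# Add a random salt to the skewed side so one hot key spreads over 8 partitions
facts_salted = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Duplicate the small side once per salt value so every salted key still matches
dims_salted = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = facts_salted.join(dims_salted, on=["customer_id", "salt"], how="inner")
```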
- What is Catalyst Optimizer?
🔹 It’s Spark SQL’s internal query optimizer — it rewrites and optimizes the query plan for faster execution.
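You can watch Catalyst at work with explain(); a minimal example:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.range(1000).withColumn("even", F.col("id") % 2 == 0)

# Catalyst rewrites the logical plan, e.g. pushing the filter down and pruning columns
query = df.select("id").filter(F.col("id") > 500)

# extended=True prints the parsed, analyzed, optimized logical plans and the physical plan
query.explain(extended=True)
```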
- What is Tungsten in Spark?
🔹 It’s Spark’s physical execution engine that improves memory management and CPU efficiency.
- How do you improve PySpark performance?
✅ Cache/repartition wisely
✅ Avoid shuffles
✅ Use broadcast joins
✅ Select only the columns you need with .select() instead of *
✅ Persist data smartly (a short sketch follows below)
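A sketch combining several of these tips; the Parquet paths and column names are hypothetical:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("perf-tips").getOrCreate()

# Hypothetical tables: a large fact table and a small dimension table
orders = spark.read.parquet("warehouse/orders")        # large
countries = spark.read.parquet("warehouse/countries")  # small lookup table

# Select only the columns you need instead of '*'
orders_slim = orders.select("order_id", "country_code", "amount")

# Broadcast the small table so the join avoids a full shuffle
enriched = orders_slim.join(F.broadcast(countries), on="country_code")

# Cache only what you will reuse across several actions
enriched.cache()
enriched.groupBy("country_code").agg(F.sum("amount")).show()
```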
- How does join work in PySpark?
🔹 Joins cause shuffles unless the smaller side is broadcast. Spark supports inner, left, right, full outer, semi, anti, and cross joins.
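A small example of the how argument with toy employee/department data (all values made up):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types").getOrCreate()

employees = spark.createDataFrame(
    [(1, "alice", 10), (2, "bob", 20), (3, "carol", 99)],
    ["emp_id", "name", "dept_id"],
)
departments = spark.createDataFrame(
    [(10, "engineering"), (20, "sales")],
    ["dept_id", "dept_name"],
)

# Shuffle join by default; the 'how' argument picks the join type
employees.join(departments, on="dept_id", how="inner").show()
employees.join(departments, on="dept_id", how="left").show()   # carol kept, dept_name null
```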
- What is checkpointing in PySpark?
🔹 It saves the RDD (or DataFrame) itself to stable storage (e.g., HDFS) and truncates its lineage, so long dependency chains don't have to be recomputed after a failure.
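A minimal sketch, assuming a local checkpoint directory under /tmp (in production this would usually be an HDFS or S3 path):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext

# Directory where checkpointed data is written
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.checkpoint()   # marks the RDD to be saved to stable storage
rdd.count()        # the action triggers the checkpoint and truncates the lineage
```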
- Explain .cache() vs .persist().
🔹 .cache() = shorthand for .persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)
🔹 .persist() = more flexible; lets you choose the storage level explicitly (memory, disk, serialized, replicated)
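A short sketch of both calls; the row counts are arbitrary:
```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

# cache(): shorthand for persist() with the default storage level
df = spark.range(1_000_000).selectExpr("id", "id % 7 AS bucket")
df.cache()

# persist(): pick the level explicitly, e.g. spill to disk if memory is tight
df2 = spark.range(1_000_000).persist(StorageLevel.MEMORY_AND_DISK)

df.count()    # an action is needed before the data is actually materialized
df2.count()

df.unpersist()
df2.unpersist()
```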
- What’s the difference between collect() and take()?
🔹 collect() brings the entire dataset to the driver
🔹 take(n) brings only first n elements — safer and more efficient
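A quick illustration; the 10 million rows are arbitrary, and collect() is left commented out because it would pull everything to the driver:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-take").getOrCreate()
df = spark.range(10_000_000)

# take(n) only computes enough partitions to return n rows
first_rows = df.take(5)
print(first_rows)

# collect() pulls every row to the driver; on a large dataset this can
# exhaust driver memory, so prefer take()/show() for inspection
# all_rows = df.collect()
```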
- How do you write a UDF in PySpark?
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def convert_upper(text):
    # Guard against nulls so the UDF doesn't fail on None values
    return text.upper() if text is not None else None

upper_udf = udf(convert_upper, StringType())

# DataFrames are immutable, so assign the result back
df = df.withColumn("upper_name", upper_udf(df["name"]))
```