15 PySpark Interview Q&As for Data Engineers:
- What is PySpark?
🔹 PySpark is the Python API for Apache Spark. It allows Python developers to harness the power of distributed computing with Spark's in-memory processing.
- How is PySpark different from Pandas?
🔹 PySpark distributes data across a cluster and is built for big data, while Pandas works in single-node memory, making it best for small datasets.
- What are RDDs and DataFrames?
🔹 RDD (Resilient Distributed Dataset): low-level, immutable distributed collection.
🔹 DataFrame: high-level abstraction with optimized execution via the Catalyst engine.
- What is lazy evaluation in PySpark?
🔹 Operations are not executed immediately. Transformations are queued into a plan, and execution happens only when an action (like .collect() or .show()) is triggered.
- Explain wide vs narrow transformations.
🔹 Narrow: each output partition depends on a single input partition, so no shuffle is needed (e.g., map, filter).
🔹 Wide: data is shuffled across partitions (e.g., groupBy, join), which is expensive.
- Difference between map() and flatMap()?
🔹 map() returns exactly one output element per input element.
🔹 flatMap() can return zero or more outputs per input, then flattens the results into a single collection.
- How do you handle skewed data in PySpark?
🔹 Use salting, broadcast joins, repartitioning, or custom partitioners to handle data skew.
- What is Catalyst Optimizer?
🔹 It's Spark SQL's internal query optimizer; it rewrites and optimizes the query plan for faster execution.
- What is Tungsten in Spark?
🔹 It's Spark's physical execution engine that improves memory management and CPU efficiency.
- How do you improve PySpark performance?
✅ Cache/repartition wisely
✅ Avoid unnecessary shuffles
✅ Use broadcast joins for small tables
✅ Use .select() with explicit columns instead of select *
✅ Persist data smartly
- How does join work in PySpark?
🔹 Joins cause shuffles unless a broadcast join is used. Spark supports inner, left, right, and full outer joins, plus semi and anti joins.
- What is checkpointing in PySpark?
🔹 It saves the RDD to stable storage (e.g., HDFS) and truncates its lineage, avoiding long recomputation chains after failures.
- Explain .cache() vs .persist().
🔹 .cache() = persist with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)
🔹 .persist() = more flexible; accepts any StorageLevel (memory, disk, serialized, replicated)
- What's the difference between collect() and take()?
🔹 collect() brings the entire dataset to the driver (risking out-of-memory errors on large data)
🔹 take(n) brings only the first n elements, which is safer and more efficient
- How do you write a UDF in PySpark?
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def convert_upper(text):
    # Guard against null values, which arrive as Python None.
    return text.upper() if text is not None else None

upper_udf = udf(convert_upper, StringType())
df = df.withColumn("upper_name", upper_udf(df["name"]))
```