PySpark Interview Q&As for Data Engineers

✅ 15 PySpark Interview Q&As for Data Engineers:

  1. What is PySpark?
    🔹 PySpark is the Python API for Apache Spark. It allows Python developers to harness the power of distributed computing with Spark’s in-memory processing.
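
A minimal getting-started sketch; the app name here is arbitrary:

```python
from pyspark.sql import SparkSession

# Entry point for DataFrame and SQL functionality
spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.range(5)  # tiny distributed DataFrame with ids 0..4
df.show()
```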
  2. How is PySpark different from Pandas?
    🔹 PySpark handles distributed data and is built for big data, while Pandas works on single-node memory — best for small datasets.
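
For intuition, a small interop sketch; assumes an active SparkSession named spark and a result small enough to pull back to the driver:

```python
import pandas as pd

# Pandas: the whole table lives in one machine's memory
pdf = pd.DataFrame({"name": ["Ana", "Bo"], "age": [30, 25]})

# PySpark: the same data as a distributed DataFrame
sdf = spark.createDataFrame(pdf)

# Convert back only when the result comfortably fits on the driver
small_result = sdf.filter(sdf.age > 26).toPandas()
```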
  3. What are RDDs and DataFrames?
    🔹 RDD (Resilient Distributed Dataset): low-level, immutable distributed data.
    🔹 DataFrame: high-level abstraction, optimized execution using Catalyst engine.
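
Both abstractions side by side, assuming an active SparkSession named spark:

```python
# RDD: low-level, schema-less, manipulated with plain functions
rdd = spark.sparkContext.parallelize([("Ana", 30), ("Bo", 25)])

# DataFrame: named columns that Catalyst can optimize
df = spark.createDataFrame(rdd, ["name", "age"])
df.printSchema()
```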
  4. What is lazy evaluation in PySpark?
    🔹 Operations are not executed immediately. Transformations are queued, and execution happens only when an action (like .collect(), .show()) is triggered.
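
A sketch of the idea; nothing executes until the action on the last line:

```python
df = spark.range(1_000_000)                 # transformation: builds a plan only
doubled = df.withColumn("x2", df.id * 2)    # still lazy
filtered = doubled.filter(doubled.x2 > 10)  # still lazy

filtered.show(5)  # action: Spark now optimizes and runs the whole chain
```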
  5. Explain wide vs narrow transformations.
    🔹 Narrow: each output partition depends on a single input partition, so no shuffle is needed (e.g., map, filter)
    🔹 Wide: data is shuffled across partitions (e.g., groupBy, join) — expensive.
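
Both kinds in code, assuming a DataFrame df with dept and salary columns:

```python
# Narrow: filter runs partition by partition, no data movement
high_paid = df.filter(df.salary > 50_000)

# Wide: groupBy must shuffle rows so each dept ends up together
avg_by_dept = df.groupBy("dept").avg("salary")
```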
  6. Difference between map() and flatMap()?
    🔹 map() returns one output per input.
    🔹 flatMap() can return zero or more outputs per input, then flattens the results into a single collection.
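
The classic contrast on an RDD of lines, assuming sc is the SparkContext:

```python
lines = sc.parallelize(["hello world", "foo bar"])

# map: exactly one output per input
word_counts = lines.map(lambda s: len(s.split()))  # [2, 2]

# flatMap: zero or more outputs per input, flattened together
words = lines.flatMap(lambda s: s.split())  # ['hello', 'world', 'foo', 'bar']
```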
  7. How do you handle skewed data in PySpark?
    🔹 Use salting, broadcast joins, repartitioning, or custom partitioners to handle data skew.
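
A minimal salting sketch for a skewed aggregation; df, its key and value columns, and SALT_BUCKETS are all assumptions here:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # tune to the observed skew

# Spread each hot key across several synthetic sub-keys
salted = df.withColumn(
    "salted_key",
    F.concat_ws("_", "key", (F.rand() * SALT_BUCKETS).cast("int").cast("string")),
)

# Aggregate per salted key first, then combine the partial results
partial = salted.groupBy("salted_key", "key").agg(F.sum("value").alias("part"))
totals = partial.groupBy("key").agg(F.sum("part").alias("total"))
```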
  8. What is Catalyst Optimizer?
    🔹 It’s Spark SQL’s internal query optimizer — it rewrites and optimizes the query plan for faster execution.
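
You can watch the optimizer's work with explain(); df is an assumed DataFrame with name and age columns:

```python
# Catalyst will, for example, push the filter down below the projection
df.select("name", "age").filter("age > 30").explain(True)
# True prints the parsed, analyzed, optimized, and physical plans
```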
  9. What is Tungsten in Spark?
    🔹 It’s Spark’s physical execution engine that improves memory management and CPU efficiency.
  10. How do you improve PySpark performance?
    ✅ Cache/repartition wisely
    ✅ Avoid shuffles
    ✅ Use broadcast joins
    ✅ Select only the columns you need with .select() instead of *
    ✅ Persist data smartly
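
Two of the tips above in code form; large_df and small_df are assumed DataFrames sharing an id column, and amount is likewise assumed:

```python
from pyspark.sql.functions import broadcast

# Broadcast the small side so the join needs no shuffle
joined = large_df.join(broadcast(small_df), "id")

# Project only the columns you need, then cache the reused result
hot = joined.select("id", "amount").cache()
hot.count()  # first action materializes the cache
```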
  11. How does join work in PySpark?
    🔹 Joins cause shuffles unless the smaller side is broadcast. Spark supports inner, full outer, left, right, left semi, left anti, and cross joins.
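
Join syntax in brief, assuming DataFrames orders and customers that share a cust_id column:

```python
# Inner join (the default)
inner = orders.join(customers, "cust_id")

# Left outer join: keep every order, null customer fields where unmatched
left = orders.join(customers, "cust_id", "left")

# Left anti join: orders whose cust_id never appears in customers
orphans = orders.join(customers, "cust_id", "left_anti")
```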
  12. What is checkpointing in PySpark?
    🔹 It materializes the RDD or DataFrame to stable storage (e.g., HDFS) and truncates the lineage, so recovery after a failure does not require recomputing the whole chain.
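
A minimal checkpoint sketch; the directory path is a placeholder and df is assumed:

```python
# Must be fault-tolerant storage (e.g., an HDFS path) in production
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

# Materializes df to the checkpoint dir and truncates its lineage
df_checkpointed = df.checkpoint()
```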
  13. Explain .cache() vs .persist().
    🔹 .cache() = persist with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)
    🔹 .persist() = more flexible; you choose the storage level (memory, memory + disk, serialized, replicated)
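
Both in one sketch, assuming a DataFrame df with an amount column:

```python
from pyspark import StorageLevel

reused = df.filter("amount > 0")

# Default storage level:
reused.cache()
# Or pick one explicitly, e.g. keep in memory and spill to disk:
# reused.persist(StorageLevel.MEMORY_AND_DISK)

reused.count()      # an action materializes the cache
reused.unpersist()  # release it when no longer needed
```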
  14. What’s the difference between collect() and take()?
    🔹 collect() brings the entire dataset to the driver
    🔹 take(n) brings only first n elements — safer and more efficient
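
The contrast in code; df is assumed:

```python
preview = df.take(5)  # only five rows travel to the driver

# rows = df.collect()  # the whole dataset to the driver: OOM risk at scale
```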
  15. How do you write a UDF in PySpark?

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Plain Python function; nulls arrive as None, so guard against them
def convert_upper(text):
    return text.upper() if text is not None else None

# Wrap it as a UDF, declaring the return type
upper_udf = udf(convert_upper, StringType())

# Apply it column-wise (df is an existing DataFrame with a "name" column)
df = df.withColumn("upper_name", upper_udf(df["name"]))
```
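
Note: Python UDFs are opaque to the Catalyst optimizer and pay Python-to-JVM serialization costs, so prefer a built-in function (here pyspark.sql.functions.upper) whenever one exists.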
