PySpark Interview Q&As for Data Engineers

✅ 15 PySpark Interview Q&As for Data Engineers:

  1. What is PySpark?
    🔹 PySpark is the Python API for Apache Spark. It allows Python developers to harness the power of distributed computing with Spark's in-memory processing.
  2. How is PySpark different from Pandas?
    🔹 PySpark distributes data across a cluster and is built for big data, while Pandas works in single-node memory and is best for small datasets.
  3. What are RDDs and DataFrames?
    🔹 RDD (Resilient Distributed Dataset): a low-level, immutable distributed collection.
    🔹 DataFrame: a high-level abstraction with named columns, optimized by the Catalyst engine.
  4. What is lazy evaluation in PySpark?
    🔹 Operations are not executed immediately. Transformations are queued, and execution happens only when an action (like .collect() or .show()) is triggered.
  5. Explain wide vs narrow transformations.
    🔹 Narrow: each output partition depends on a single input partition, so no shuffle is needed (e.g., map, filter).
    🔹 Wide: data is shuffled across partitions (e.g., groupBy, join); this is expensive.
  6. Difference between map() and flatMap()?
    🔹 map() returns exactly one output element per input.
    🔹 flatMap() can return zero or more outputs per input, then flattens them into one sequence.
  7. How do you handle skewed data in PySpark?
    🔹 Use salting, broadcast joins, repartitioning, or custom partitioners to spread skewed keys more evenly.
  8. What is Catalyst Optimizer?
    🔹 It's Spark SQL's internal query optimizer; it rewrites and optimizes the query plan for faster execution.
  9. What is Tungsten in Spark?
    🔹 It's Spark's physical execution engine that improves memory management and CPU efficiency.
  10. How do you improve PySpark performance?
    ✅ Cache/repartition wisely
    ✅ Avoid unnecessary shuffles
    ✅ Use broadcast joins for small tables
    ✅ Select only the columns you need instead of *
    ✅ Persist data at an appropriate storage level
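The column-pruning tip above in one line, with a made-up three-column DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("prune").getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0)], ["id", "name", "score"]
)
slim = df.select("id", "score")  # carry only the columns you need, not *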
  11. How does join work in PySpark?
    🔹 Joins cause shuffles unless one side is small enough for a broadcast join. Spark supports inner, outer, left, and right joins.
  12. What is checkpointing in PySpark?
    🔹 It saves the RDD's data to stable storage (e.g., HDFS) and truncates its lineage, avoiding expensive recomputation after failures.
  13. Explain .cache() vs .persist().
    🔹 .cache() = persist with the default level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)
    🔹 .persist() = more flexible; lets you choose a StorageLevel such as memory + disk or other levels
  14. Whatโ€™s the difference between collect() and take()?
    🔹 collect() brings the entire dataset to the driver and can exhaust driver memory
    🔹 take(n) brings only the first n elements; safer and more efficient
  15. How do you write a UDF in PySpark?

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def convert_upper(text):
    # Guard against nulls: UDFs receive None for null column values
    return text.upper() if text is not None else None

upper_udf = udf(convert_upper, StringType())

# withColumn returns a new DataFrame, so assign the result
df = df.withColumn("upper_name", upper_udf(df["name"]))
```
