👉 Use staging tables and commit data only after full batch success. For Spark streaming, rely on checkpointing. Idempotent operations like MERGE ensure no duplicates.
👉 Idempotency means re-running the pipeline multiple times produces the same result; it matters for avoiding duplicates after a failure/restart. Implemented using UPSERT/MERGE, unique keys, or partition overwrite.
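The idempotency property can be shown with a minimal plain-Python sketch: a MERGE-style upsert keyed on a unique id, where re-running the same batch leaves the target unchanged (the dict stands in for the target table; `id` is an assumed unique key).

```python
def upsert_batch(target, batch, key="id"):
    """MERGE-like behavior: insert new keys, overwrite existing ones."""
    for row in batch:
        target[row[key]] = row
    return target

target = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

upsert_batch(target, batch)   # first run
upsert_batch(target, batch)   # re-run after a simulated failure/restart
assert len(target) == 2       # no duplicates: same result as a single run
```

An append-only insert, by contrast, would leave 4 rows after the re-run, which is exactly the duplicate problem MERGE avoids.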
👉 Track batch metadata (batch ID, processed files, row counts). Use retries for failed jobs. Store unprocessed files in a “pending” bucket and rerun only failed files.
👉 Spark Structured Streaming checkpoints save offsets and state to HDFS/GCS/S3. If the pipeline fails, Spark resumes from the last checkpoint → avoids reprocessing and data loss.
At-least-once: Messages may be processed multiple times (duplicates possible).
Exactly-once: Messages processed only once (requires idempotent sinks like BigQuery/Snowflake with MERGE).
👉 Use deduplication strategies like primary keys, hash columns, or window functions (ROW_NUMBER() OVER(PARTITION BY key ORDER BY timestamp)).
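The ROW_NUMBER()-over-partition pattern (keep the latest record per key, ordered by timestamp) can be sketched in plain Python; the column names `key` and `timestamp` are assumptions for illustration.

```python
def dedupe_latest(rows, key="key", ts="timestamp"):
    """Keep one row per key: the one with the greatest timestamp
    (equivalent to ROW_NUMBER() OVER(PARTITION BY key ORDER BY ts DESC) = 1)."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"key": "a", "timestamp": 1, "v": "old"},
    {"key": "a", "timestamp": 2, "v": "new"},
    {"key": "b", "timestamp": 1, "v": "only"},
]
deduped = dedupe_latest(rows)
assert len(deduped) == 2  # one row per key survives
```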
👉 Options:
PySpark: Use the reader option mode="DROPMALFORMED" or badRecordsPath.
Airflow DAGs: Validate schema before ingestion.
DLQ (Dead Letter Queue): Push bad records to a separate bucket/table.
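The DLQ option above can be sketched in plain Python: records failing a schema check are routed to a dead-letter list instead of failing the batch (the required-field check stands in for real schema validation; field names are assumptions).

```python
def route_records(records, required_fields=("id", "amount")):
    """Split a batch into (good, dead_letter): records missing a
    required field go to the DLQ instead of blocking the pipeline."""
    good, dead_letter = [], []
    for record in records:
        if all(f in record for f in required_fields):
            good.append(record)
        else:
            dead_letter.append(record)
    return good, dead_letter

batch = [{"id": 1, "amount": 10}, {"id": 2}]  # second record is malformed
good, dlq = route_records(batch)
assert len(good) == 1 and len(dlq) == 1
```

In production the `dead_letter` list would be written to a separate bucket/table (or a Kafka/Pub/Sub dead-letter topic) for later inspection and replay.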
👉 Airflow allows retries with exponential backoff. If a task fails, it can retry 3-5 times before being marked as failed, reducing the impact of transient errors.
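Airflow configures this declaratively via task arguments (`retries`, `retry_delay`, `retry_exponential_backoff`); the underlying retry loop looks roughly like this plain-Python sketch (function and parameter names are illustrative, not Airflow internals):

```python
import time

def retry_with_backoff(fn, retries=4, base_delay=1.0):
    """Call fn; on failure wait base_delay * 2**attempt and retry,
    re-raising only after the final attempt fails."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

calls = {"n": 0}
def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

assert retry_with_backoff(flaky_task, base_delay=0.0) == "ok"  # succeeds on 3rd try
```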
👉 Use schema registry or allow schema evolution in Spark (mergeSchema option). In Snowflake, handle via ALTER TABLE ADD COLUMN. Store metadata in a data catalog (Glue/BigLake).
👉 Use watermarking in Spark (withWatermark) or event-time windows. In batch, rerun with partition overwrite (e.g., BigQuery MERGE).
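Spark's withWatermark semantics (drop events older than the max observed event time minus the allowed delay) can be modeled with a small stateful sketch; this is a simplified illustration of the concept, not Spark's actual implementation.

```python
class Watermark:
    """Simplified event-time watermark: accept an event only if it is no
    older than (max event time seen so far - allowed lateness)."""

    def __init__(self, delay):
        self.delay = delay
        self.max_seen = float("-inf")

    def accept(self, event_time):
        self.max_seen = max(self.max_seen, event_time)
        return event_time >= self.max_seen - self.delay

wm = Watermark(delay=10)
assert wm.accept(100) is True   # advances the watermark to 100 - 10 = 90
assert wm.accept(95) is True    # late, but within the 10-unit tolerance
assert wm.accept(80) is False   # too late: older than the watermark, dropped
```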
👉 Use task-level retries. Airflow can rerun only the failed tasks instead of the whole DAG by clearing them (Clear in the UI, or the airflow tasks clear CLI command with a task regex).
👉 By enabling checkpointing, using Parquet/ORC (splittable formats), and leveraging Spark speculative execution to retry slow or failing tasks.
👉 Use file compaction:
Ingest small files → batch them into larger Parquet files (~128 MB each).
Use Spark repartitioning (coalesce, repartition).
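The compaction planning step above can be sketched in plain Python: group many small files into batches that each stay near the ~128 MB target (sizes in MB here for readability; the function name is illustrative).

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches whose total size stays
    at or under target_mb; each batch becomes one larger Parquet file."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)        # flush the full batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 10 small 16 MB files → two output files (8 x 16 MB = 128 MB, then 2 x 16 MB)
plan = plan_compaction([16] * 10)
assert [len(b) for b in plan] == [8, 2]
```

In Spark the equivalent is reading the small files and writing them back with `coalesce(n)`/`repartition(n)` so each output partition lands near the target size.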
👉
Airflow: Email/SMS/Slack alerts.
GCP: Cloud Monitoring + Error Reporting.
AWS: CloudWatch alarms on Lambda/EMR/S3 events.
👉 Load data into staging tables first, validate, then MERGE INTO final table. If batch fails → staging table can be dropped/reloaded safely.
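The staging pattern can be sketched in plain Python: the batch lands in a staging area, validation runs against staging only, and the final table is touched only if the whole batch passes (dicts/lists stand in for tables; `id` is an assumed key).

```python
def load_via_staging(final, batch, validate):
    """Stage → validate → MERGE. A failed validation discards staging
    and leaves the final table untouched."""
    staging = list(batch)                        # 1) load into staging table
    if not all(validate(row) for row in staging):
        return final                             # 2) batch fails → drop staging safely
    for row in staging:                          # 3) MERGE INTO final
        final[row["id"]] = row
    return final

is_valid = lambda row: row["amount"] >= 0

final = {}
load_via_staging(final, [{"id": 1, "amount": -5}], is_valid)
assert final == {}                               # bad batch: final untouched
load_via_staging(final, [{"id": 1, "amount": 5}], is_valid)
assert final[1]["amount"] == 5                   # good batch: merged
```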
👉 A DLQ stores invalid or unprocessed messages/events instead of blocking the pipeline. Used in Kafka, Pub/Sub, Dataflow, etc.
👉 Maintain metadata-driven pipelines:
Track processed file names (in DB or metadata table).
Check before ingestion → skip already processed.
Use unique batch IDs.
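The metadata-driven check above reduces to a few lines of plain Python: consult the processed-files record before loading, and record the file only after a successful load (the set stands in for a metadata table; names are illustrative).

```python
def ingest_file(file_name, processed, load):
    """Skip files already recorded in the metadata store; otherwise
    load the file and record it so a rerun will not ingest it twice."""
    if file_name in processed:
        return False          # already processed → skip
    load(file_name)
    processed.add(file_name)  # record only after a successful load
    return True

loaded = []
processed = set()
ingest_file("a.csv", processed, loaded.append)
ingest_file("a.csv", processed, loaded.append)  # rerun after restart
assert loaded == ["a.csv"]                      # file loaded exactly once
```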
👉 BigQuery load jobs are atomic → either the full load succeeds or nothing is written. No partial data.
👉 By tuning:
Kafka → consumer lag monitoring.
Spark → increasing partitions, enabling backpressure (spark.streaming.backpressure.enabled=true in the DStream API; in Structured Streaming, cap intake with source options like maxOffsetsPerTrigger for Kafka).
👉 Store processed data in a cloud bucket/staging layer temporarily. Once DB is back online, replay/reload only pending batches.
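The stage-and-replay pattern can be sketched in plain Python: writes that fail while the database is down are buffered (the staging layer), then replayed in order once it is back; class and method names are illustrative.

```python
class BufferedWriter:
    """Stage batches when the target DB is unavailable; replay later."""

    def __init__(self, db_write):
        self.db_write = db_write  # callable that raises ConnectionError while the DB is down
        self.pending = []         # stands in for the cloud bucket / staging layer

    def write(self, batch):
        try:
            self.db_write(batch)
        except ConnectionError:
            self.pending.append(batch)      # stage for later replay

    def replay(self):
        while self.pending:
            self.db_write(self.pending[0])  # raises if still down; batch stays pending
            self.pending.pop(0)             # remove only after a successful write
```

Removing a batch from `pending` only after the write succeeds keeps the replay itself safe to rerun, which ties back to the idempotency points above.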