Scenario-Based Data Engineering Questions
Data Engineering interviews are no longer about definitions or SQL puzzles. Companies want engineers who can think like owners, handle uncertainty, and solve messy real-world problems. That’s why scenario-based questions dominate every major tech interview today — from fintech to SaaS to product companies.
Below is a breakdown of twelve core scenarios that expose whether a candidate can actually build, scale, and operate reliable data pipelines.
1. Corrupted Files in S3 Every Morning
The real test isn’t detecting corruption — tools like checksums and schema validation can do that.
The real test is:
- How do you prevent pipeline failure?
- How do you isolate bad data?
- How do you notify upstream?
The best answers include quarantine buckets, schema enforcement, and alerting.
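The quarantine idea can be sketched in a few lines: validate each record against an expected schema and split the batch so bad rows are set aside instead of failing the run. The schema and field names here are hypothetical; in production the quarantined rows would land in a separate S3 prefix and trigger an alert.

```python
# Sketch: route records that fail schema validation into a quarantine
# list instead of failing the whole pipeline. Schema is hypothetical.
EXPECTED_SCHEMA = {"id": int, "amount": float, "ts": str}

def validate(record: dict) -> bool:
    """True if the record has every expected field with the right type."""
    return all(
        isinstance(record.get(field), ftype)
        for field, ftype in EXPECTED_SCHEMA.items()
    )

def partition_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (clean, quarantined) so bad rows never block good ones."""
    clean = [r for r in records if validate(r)]
    quarantined = [r for r in records if not validate(r)]
    if quarantined:
        # In production: copy these objects to a quarantine bucket/prefix
        # and notify the upstream team (e.g. SNS topic, Slack webhook).
        pass
    return clean, quarantined
```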
2. PySpark Job Suddenly Slow
Every engineer faces this.
Your answer should show structured debugging:
- Check data volume changes
- Review skew
- Inspect stages in Spark UI
- Validate cluster autoscaling patterns
- Identify UDF hotspots
Interviewers want the thought process, not magic fixes.
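The "review skew" step can be approximated before opening the Spark UI: sample the join/group keys and compare the hottest key's count to the average. This is a plain-Python sketch of the diagnostic, not a Spark API call.

```python
from collections import Counter

def skew_ratio(keys) -> float:
    """Ratio of the hottest key's count to the mean count per key.
    A ratio far above 1 suggests one partition is doing most of the work."""
    counts = Counter(keys)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean
```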
3. Downstream Seeing Duplicates
This scenario tests your understanding of:
- Primary keys
- CDC merge logic
- Idempotency
- Late-arriving data
People who jump to DELETE statements fail. Strong candidates fix the root cause: broken merge semantics.
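Fixing the merge semantics means the pipeline must be idempotent and version-aware. A minimal in-memory sketch (the `id`/`version` column names are assumptions) shows the two properties interviewers probe for: replaying a batch changes nothing, and a late-arriving older version never overwrites a newer one.

```python
def merge_cdc(target: dict, changes: list[dict]) -> dict:
    """Idempotently apply change records keyed by primary key.
    Replaying the same batch twice yields the same result, and an
    older version never overwrites a newer one (late-arriving data)."""
    for change in changes:
        pk, version = change["id"], change["version"]
        current = target.get(pk)
        if current is None or version >= current["version"]:
            target[pk] = change
    return target
```

In a warehouse this is the same logic a `MERGE ... WHEN MATCHED AND source.version >= target.version` statement expresses.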
4. Intermittent Job Failures Without Logs
This evaluates resilience.
You should bring up:
- Retries + exponential backoff
- Better logging
- Capturing stderr
- Metric instrumentation
- Debug DAG separation
Random failures always have a pattern. Show how you find it.
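The retries-with-backoff point is easy to demonstrate concretely. A minimal sketch: exponential delay with jitter, re-raising only after the final attempt so the failure still surfaces in logs and metrics.

```python
import random
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn, retrying on failure with exponential backoff plus jitter.
    Re-raises the last exception if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Jitter avoids synchronized retry storms across workers.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```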
5. CDC Without Updated Timestamps
This is where creativity matters.
Options include:
- Hash-based comparisons
- Version columns
- Change tables
- Log-based ingestion
Lazy answers like “impossible without timestamp” get rejected instantly.
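The hash-based option can be sketched quickly: hash each row's full content with sorted keys, store the hashes as the snapshot, and on the next run any key whose hash differs (or is new) is an insert or update, with no `updated_at` column needed.

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    """Deterministic hash of a row's content (sorted keys, so field
    order doesn't matter)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def changed_rows(source: dict, snapshot_hashes: dict) -> list:
    """Primary keys whose content hash differs from the last snapshot,
    i.e. inserted or updated rows."""
    return [
        pk for pk, row in source.items()
        if snapshot_hashes.get(pk) != row_hash(row)
    ]
```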
6. Sudden BigQuery Cost Spike
You must show cost-awareness:
- Audit query history
- Identify high-scan queries
- Check whether BI dashboards are generating runaway queries
- Review materialized view refresh costs
- Check table partitions/clustering
Interviewers love engineers who treat cost like performance.
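Treating cost like performance starts with attributing dollars to queries. A sketch of the arithmetic, assuming on-demand pricing of $6.25/TiB billed (a common published rate, but verify your region and edition):

```python
def query_cost_usd(bytes_billed: int, price_per_tib: float = 6.25) -> float:
    """Approximate on-demand cost of a query from bytes billed.
    The $6.25/TiB rate is an assumption; check current pricing."""
    return bytes_billed / (1024 ** 4) * price_per_tib

def top_offenders(jobs: list[dict], n: int = 3) -> list[dict]:
    """Sort audited jobs (e.g. rows pulled from the query-history audit)
    by bytes billed, worst first."""
    return sorted(jobs, key=lambda j: j["total_bytes_billed"], reverse=True)[:n]
```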
7. Inheriting a Messy Airflow DAG
This is about engineering discipline.
Your approach should include:
- Breaking DAG into smaller DAGs
- Removing circular dependencies
- Consistent naming
- Adding SLAs and alerts
- Implementing modular operators
If you say “rewrite everything,” you fail. Real systems need refactoring, not demolition.
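The "removing circular dependencies" step is mechanical once the task graph is written down. A minimal depth-first cycle check (the task names are hypothetical; Airflow itself rejects cyclic DAGs, but cycles hidden behind cross-DAG triggers and sensors are easy to miss):

```python
def has_cycle(deps: dict) -> bool:
    """True if the task dependency graph contains a cycle.
    deps maps task -> list of downstream tasks."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in deps}

    def visit(task) -> bool:
        color[task] = GRAY  # on the current DFS path
        for nxt in deps.get(task, []):
            if color.get(nxt, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[task] = BLACK  # fully explored
        return False

    return any(color[t] == WHITE and visit(t) for t in deps)
```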
8. Missing 13M Records in Production Table
Hiring managers look for:
- Check upstream file counts
- Validate schema drift
- Compare checkpoint offsets
- Reprocess only missing partitions
- Backfill logic
They want someone who protects data integrity first, speed second.
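The "reprocess only missing partitions" idea reduces to a set comparison of per-partition counts. A sketch, assuming upstream and downstream counts are already collected (e.g. from file manifests and a `GROUP BY partition` query):

```python
def missing_partitions(upstream_counts: dict, loaded_counts: dict) -> list:
    """Partitions to reprocess: absent downstream, or present with a
    row-count mismatch. Backfilling only these avoids a full rebuild."""
    return sorted(
        p for p, n in upstream_counts.items()
        if loaded_counts.get(p, 0) != n
    )
```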
9. Glue Job Failing Because of Skew
You must mention:
- Salting
- Repartitioning
- Avoiding wide transformations
- Using groupByKey alternatives
- Broadcasting small datasets
Skew kills performance. You should show a clear diagnosis path.
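Salting itself is a one-liner worth being able to write on a whiteboard. A plain-Python sketch of the key transformation (in the actual Glue/Spark job this would run inside a `withColumn`, and the other join side must be expanded to all N salted variants of each hot key):

```python
import random

def salt_key(key: str, hot_keys: set, buckets: int = 10) -> str:
    """Spread hot keys across N salted variants so one partition doesn't
    receive most of the rows; cold keys pass through unchanged."""
    if key in hot_keys:
        return f"{key}#{random.randrange(buckets)}"
    return key
```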
10. ML Team Wants Hourly Instead of Daily
Now it’s about architecture.
Options include:
- Micro-batching
- Structured Streaming / Auto Loader
- EMR incremental loads
- Queue-based ingestion
Your answer must talk about cost vs latency trade-offs.
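The micro-batching option boils down to bucketing events by event-time hour and processing each bucket incrementally. A framework-free sketch of that grouping (in practice Structured Streaming's windowing or Auto Loader would do this for you):

```python
from collections import defaultdict
from datetime import datetime

def hourly_batches(events: list[dict]) -> dict:
    """Group events into hourly micro-batches keyed by event-time hour,
    so each hour can be processed on its own instead of one daily run."""
    batches = defaultdict(list)
    for e in events:
        ts = datetime.fromisoformat(e["ts"])
        hour = ts.replace(minute=0, second=0, microsecond=0)
        batches[hour].append(e)
    return dict(batches)
```

The trade-off to narrate: 24 small runs cost more in scheduling and warm-up overhead than one daily run, but cut data latency from up to 24 hours to about one.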