Scenario-Based Data Engineering Questions
Data Engineering interviews are no longer about definitions or SQL puzzles. Companies want engineers who can think like owners, handle uncertainty, and solve messy real-world problems. That’s why scenario-based questions dominate every major tech interview today — from fintech to SaaS to product companies.
Below is a breakdown of twelve core scenarios that expose whether a candidate can actually build, scale, and operate reliable data pipelines.
1. Corrupted Files in S3 Every Morning
The real test isn’t detecting corruption — tools like checksums and schema validation can do that.
The real test is:
- How do you prevent pipeline failure?
- How do you isolate bad data?
- How do you notify upstream?
The best answers include quarantine buckets, schema enforcement, and alerting.
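The quarantine idea can be sketched in a few lines: validate each record against an expected schema and split the batch so bad rows are set aside instead of failing the run. The schema and field names here are hypothetical; in production the quarantined rows would land in a separate S3 prefix and trigger an alert.

```python
# Sketch: route records that fail schema validation into a quarantine
# list instead of failing the whole pipeline. Schema is hypothetical.
EXPECTED_SCHEMA = {"id": int, "amount": float, "ts": str}

def validate(record: dict) -> bool:
    """True if the record has every expected field with the right type."""
    return all(
        isinstance(record.get(field), ftype)
        for field, ftype in EXPECTED_SCHEMA.items()
    )

def partition_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (clean, quarantined) so bad rows never block good ones."""
    clean = [r for r in records if validate(r)]
    quarantined = [r for r in records if not validate(r)]
    if quarantined:
        # In production: copy these objects to a quarantine bucket/prefix
        # and notify the upstream team (e.g. SNS topic, Slack webhook).
        pass
    return clean, quarantined
```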
2. PySpark Job Suddenly Slow
Every engineer faces this.
Your answer should show structured debugging:
- Check data volume changes
- Review skew
- Inspect stages in Spark UI
- Validate cluster autoscaling patterns
- Identify UDF hotspots
Interviewers want the thought process, not magic fixes.
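The "review skew" step can be approximated before opening the Spark UI: sample the join/group keys and compare the hottest key's count to the average. This is a plain-Python sketch of the diagnostic, not a Spark API call.

```python
from collections import Counter

def skew_ratio(keys) -> float:
    """Ratio of the hottest key's count to the mean count per key.
    A ratio far above 1 suggests one partition is doing most of the work."""
    counts = Counter(keys)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean
```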
3. Downstream Seeing Duplicates
This scenario tests your understanding of:
- Primary keys
- CDC merge logic
- Idempotency
- Late-arriving data
People who jump to DELETE statements fail. Strong candidates fix the root cause: broken merge semantics.
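Fixing the merge semantics means the pipeline must be idempotent and version-aware. A minimal in-memory sketch (the `id`/`version` column names are assumptions) shows the two properties interviewers probe for: replaying a batch changes nothing, and a late-arriving older version never overwrites a newer one.

```python
def merge_cdc(target: dict, changes: list[dict]) -> dict:
    """Idempotently apply change records keyed by primary key.
    Replaying the same batch twice yields the same result, and an
    older version never overwrites a newer one (late-arriving data)."""
    for change in changes:
        pk, version = change["id"], change["version"]
        current = target.get(pk)
        if current is None or version >= current["version"]:
            target[pk] = change
    return target
```

In a warehouse this is the same logic a `MERGE ... WHEN MATCHED AND source.version >= target.version` statement expresses.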
4. Intermittent Job Failures Without Logs
This evaluates resilience.
You should bring up:
- Retries + exponential backoff
- Better logging
- Capturing stderr
- Metric instrumentation
- Debug DAG separation
Random failures always have a pattern. Show how you find it.
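The retries-with-backoff point is easy to demonstrate concretely. A minimal sketch: exponential delay with jitter, re-raising only after the final attempt so the failure still surfaces in logs and metrics.

```python
import random
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn, retrying on failure with exponential backoff plus jitter.
    Re-raises the last exception if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Jitter avoids synchronized retry storms across workers.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```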
5. CDC Without Updated Timestamps
This is where creativity matters.
Options include:
- Hash-based comparisons
- Version columns
- Change tables
- Log-based ingestion
Lazy answers like “impossible without timestamp” get rejected instantly.
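The hash-based option can be sketched quickly: hash each row's full content with sorted keys, store the hashes as the snapshot, and on the next run any key whose hash differs (or is new) is an insert or update, with no `updated_at` column needed.

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    """Deterministic hash of a row's content (sorted keys, so field
    order doesn't matter)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def changed_rows(source: dict, snapshot_hashes: dict) -> list:
    """Primary keys whose content hash differs from the last snapshot,
    i.e. inserted or updated rows."""
    return [
        pk for pk, row in source.items()
        if snapshot_hashes.get(pk) != row_hash(row)
    ]
```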
6. Sudden BigQuery Cost Spike
You must show cost-awareness:
- Audit query history
- Identify high-scan queries
- Check whether BI dashboards are generating runaway queries
- Review materialized view refresh costs
- Check table partitions/clustering
Interviewers love engineers who treat cost like performance.
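Treating cost like performance starts with attributing dollars to queries. A sketch of the arithmetic, assuming on-demand pricing of $6.25/TiB billed (a common published rate, but verify your region and edition):

```python
def query_cost_usd(bytes_billed: int, price_per_tib: float = 6.25) -> float:
    """Approximate on-demand cost of a query from bytes billed.
    The $6.25/TiB rate is an assumption; check current pricing."""
    return bytes_billed / (1024 ** 4) * price_per_tib

def top_offenders(jobs: list[dict], n: int = 3) -> list[dict]:
    """Sort audited jobs (e.g. rows pulled from the query-history audit)
    by bytes billed, worst first."""
    return sorted(jobs, key=lambda j: j["total_bytes_billed"], reverse=True)[:n]
```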
7. Inheriting a Messy Airflow DAG
This is about engineering discipline.
Your approach should include:
- Breaking DAG into smaller DAGs
- Removing circular dependencies
- Consistent naming
- Adding SLAs and alerts
- Implementing modular operators
If you say “rewrite everything,” you fail. Real systems need refactoring, not demolition.
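The "removing circular dependencies" step is mechanical once the task graph is written down. A minimal depth-first cycle check (the task names are hypothetical; Airflow itself rejects cyclic DAGs, but cycles hidden behind cross-DAG triggers and sensors are easy to miss):

```python
def has_cycle(deps: dict) -> bool:
    """True if the task dependency graph contains a cycle.
    deps maps task -> list of downstream tasks."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in deps}

    def visit(task) -> bool:
        color[task] = GRAY  # on the current DFS path
        for nxt in deps.get(task, []):
            if color.get(nxt, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[task] = BLACK  # fully explored
        return False

    return any(color[t] == WHITE and visit(t) for t in deps)
```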
8. Missing 13M Records in Production Table
Hiring managers look for:
- Check upstream file counts
- Validate schema drift
- Compare checkpoint offsets
- Reprocess only missing partitions
- Backfill logic
They want someone who protects data integrity first, speed second.
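The "reprocess only missing partitions" idea reduces to a set comparison of per-partition counts. A sketch, assuming upstream and downstream counts are already collected (e.g. from file manifests and a `GROUP BY partition` query):

```python
def missing_partitions(upstream_counts: dict, loaded_counts: dict) -> list:
    """Partitions to reprocess: absent downstream, or present with a
    row-count mismatch. Backfilling only these avoids a full rebuild."""
    return sorted(
        p for p, n in upstream_counts.items()
        if loaded_counts.get(p, 0) != n
    )
```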
9. Glue Job Failing Because of Skew
You must mention:
- Salting
- Repartitioning
- Avoiding wide transformations
- Using groupByKey alternatives
- Broadcasting small datasets
Skew kills performance. You should show a clear diagnosis path.
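Salting itself is a one-liner worth being able to write on a whiteboard. A plain-Python sketch of the key transformation (in the actual Glue/Spark job this would run inside a `withColumn`, and the other join side must be expanded to all N salted variants of each hot key):

```python
import random

def salt_key(key: str, hot_keys: set, buckets: int = 10) -> str:
    """Spread hot keys across N salted variants so one partition doesn't
    receive most of the rows; cold keys pass through unchanged."""
    if key in hot_keys:
        return f"{key}#{random.randrange(buckets)}"
    return key
```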
10. ML Team Wants Hourly Instead of Daily
Now it’s about architecture.
Options include:
- Micro-batching
- Structured Streaming / Auto Loader
- EMR incremental loads
- Queue-based ingestion
Your answer must talk about cost vs latency trade-offs.
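The micro-batching option boils down to bucketing events by event-time hour and processing each bucket incrementally. A framework-free sketch of that grouping (in practice Structured Streaming's windowing or Auto Loader would do this for you):

```python
from collections import defaultdict
from datetime import datetime

def hourly_batches(events: list[dict]) -> dict:
    """Group events into hourly micro-batches keyed by event-time hour,
    so each hour can be processed on its own instead of one daily run."""
    batches = defaultdict(list)
    for e in events:
        ts = datetime.fromisoformat(e["ts"])
        hour = ts.replace(minute=0, second=0, microsecond=0)
        batches[hour].append(e)
    return dict(batches)
```

The trade-off to narrate: 24 small runs cost more in scheduling and warm-up overhead than one daily run, but cut data latency from up to 24 hours to about one.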