Scenario-Based Data Engineering Questions

Data Engineering interviews are no longer about definitions or SQL puzzles. Companies want engineers who can think like owners, handle uncertainty, and solve messy real-world problems. That’s why scenario-based questions dominate every major tech interview today — from fintech to SaaS to product companies.

Below is a breakdown of ten core scenarios that expose whether a candidate can actually build, scale, and operate reliable data pipelines.


1. Corrupted Files in S3 Every Morning

The real test isn’t detecting corruption — tools like checksums and schema validation can do that.
The real test is:

  • How do you prevent pipeline failure?
  • How do you isolate bad data?
  • How do you notify upstream?

The best answers include quarantine buckets, schema enforcement, and alerting.
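The quarantine idea can be sketched in a few lines. This is a hedged illustration using local paths in place of S3 objects; `validate_and_quarantine` is a hypothetical helper, not a real AWS API, but the pattern (verify checksum, isolate on mismatch, keep the pipeline alive) is the one interviewers want to hear:

```python
import hashlib
from pathlib import Path

def validate_and_quarantine(file_path: Path, expected_md5: str, quarantine_dir: Path) -> bool:
    """Return True if the checksum matches; otherwise move the file to quarantine.

    Bad files are isolated instead of crashing the pipeline; an alert
    (e.g. to the upstream team) would be fired where the move happens.
    """
    actual = hashlib.md5(file_path.read_bytes()).hexdigest()
    if actual == expected_md5:
        return True
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    file_path.rename(quarantine_dir / file_path.name)  # isolate the bad data
    return False
```

In a real setup the same logic runs against `s3://.../landing/` and a dedicated quarantine bucket, with the alert wired into the move.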

2. PySpark Job Suddenly Slow

Every engineer faces this.
Your answer should show structured debugging:

  • Check data volume changes
  • Review skew
  • Inspect stages in Spark UI
  • Validate cluster autoscaling patterns
  • Identify UDF hotspots

Interviewers want the thought process, not magic fixes.
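Skew in particular is easy to quantify before touching the Spark UI: compare the largest partition to the average. In PySpark you can collect per-partition record counts with `df.rdd.glom().map(len).collect()`; the helper below (`skew_ratio` is an illustrative name, not a library function) turns those counts into a single number:

```python
def skew_ratio(partition_sizes: list[int]) -> float:
    """Largest partition divided by the mean partition size.

    A ratio near 1 means balanced partitions; values well above ~2
    suggest a hot key or bad partitioning is slowing the stage down.
    """
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean
```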

3. Downstream Seeing Duplicates

This scenario tests your understanding of:

  • Primary keys
  • CDC merge logic
  • Idempotency
  • Late-arriving data

People who jump to DELETE statements fail. Strong candidates fix the root cause: broken merge semantics.
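Correct merge semantics boil down to an idempotent upsert: keep the latest version per key, so replaying the same batch changes nothing. A minimal in-memory sketch (the dict stands in for the target table; `merge_latest` is illustrative):

```python
def merge_latest(existing: dict, incoming: list[dict], key: str, version: str) -> dict:
    """Idempotent upsert: keep the highest-version record per key.

    Re-running the same batch yields the same state, so replays and
    late-arriving data never produce downstream duplicates.
    """
    for row in incoming:
        k = row[key]
        if k not in existing or row[version] >= existing[k][version]:
            existing[k] = row
    return existing
```

The same rule is what a SQL `MERGE ... WHEN MATCHED AND source.version >= target.version` expresses in a warehouse.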

4. Intermittent Job Failures Without Logs

This evaluates resilience.
You should bring up:

  • Retries + exponential backoff
  • Better logging
  • Capturing stderr
  • Metric instrumentation
  • Debug DAG separation

Random failures always have a pattern. Show how you find it.
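Retries with exponential backoff are simple enough to sketch from the standard library alone. Jitter and the logged attempt number are what later reveal the "random" pattern:

```python
import functools
import random
import time

def retry(max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff plus jitter.

    Each failure is logged with its attempt number, which is exactly
    the breadcrumb trail you need when failures seem random.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise
                    delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
                    print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.2f}s")
                    time.sleep(delay)
        return wrapper
    return decorator
```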

5. CDC Without Updated Timestamps

This is where creativity matters.
Options include:

  • Hash-based comparisons
  • Version columns
  • Change tables
  • Log-based ingestion

Lazy answers like “impossible without timestamp” get rejected instantly.
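The hash-based option deserves a concrete sketch: hash every non-key column, store the hash alongside the key in the target, and a row has changed whenever the hashes differ. The helper names here are illustrative:

```python
import hashlib
import json

def row_hash(row: dict, key_cols: set) -> str:
    """Deterministic hash of all non-key columns of a row."""
    payload = {k: v for k, v in sorted(row.items()) if k not in key_cols}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def changed_rows(source: list[dict], target_hashes: dict, key: str) -> list[dict]:
    """Rows whose hash differs from the target's stored hash.

    Catches both new keys and updated rows -- no updated_at column needed.
    """
    return [r for r in source if target_hashes.get(r[key]) != row_hash(r, {key})]
```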

6. Sudden BigQuery Cost Spike

You must show cost-awareness:

  • Audit query history
  • Identify high-scan queries
  • Check whether BI tools are generating runaway queries
  • Review materialized view refresh costs
  • Check table partitions/clustering

Interviewers love engineers who treat cost like performance.
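In BigQuery the raw material for this audit comes from the `INFORMATION_SCHEMA.JOBS` view, which exposes per-job fields such as `total_bytes_billed`. Once those rows are exported, ranking the offenders is trivial; this sketch (with a hypothetical `top_scanners` helper operating on plain dicts) shows the shape of the analysis:

```python
def top_scanners(jobs: list[dict], n: int = 3) -> list[tuple]:
    """Rank query texts by total bytes billed.

    The query at the top of this list is almost always the cost spike --
    often a BI dashboard or an unpartitioned full-table scan.
    """
    totals: dict[str, int] = {}
    for j in jobs:
        totals[j["query"]] = totals.get(j["query"], 0) + j["total_bytes_billed"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```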

7. Inheriting a Messy Airflow DAG

This is about engineering discipline.
Your approach should include:

  • Breaking DAG into smaller DAGs
  • Removing circular dependencies
  • Consistent naming
  • Adding SLAs and alerts
  • Implementing modular operators

If you say “rewrite everything,” you fail. Real systems need refactoring, not demolition.
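"Removing circular dependencies" is worth being able to demonstrate: Airflow itself rejects cyclic DAGs, but you can vet a task graph before touching the scheduler with a standard depth-first search. A self-contained sketch (the graph maps each task to its downstream tasks; `has_cycle` is an illustrative name):

```python
def has_cycle(deps: dict[str, list[str]]) -> bool:
    """DFS with three-color marking: a back edge to an in-progress node is a cycle."""
    color: dict[str, int] = {}  # absent/0 = unvisited, 1 = in progress, 2 = done

    def visit(node: str) -> bool:
        color[node] = 1
        for child in deps.get(node, []):
            state = color.get(child, 0)
            if state == 1 or (state == 0 and visit(child)):
                return True
        color[node] = 2
        return False

    return any(color.get(n, 0) == 0 and visit(n) for n in deps)
```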

8. Missing 13M Records in Production Table

Hiring managers look for:

  • Check upstream file counts
  • Validate schema drift
  • Compare checkpoint offsets
  • Reprocess only missing partitions
  • Backfill logic

They want someone who protects data integrity first, speed second.
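The "reprocess only missing partitions" step comes down to diffing expected counts (from upstream manifests or file listings) against the production table, partition by partition. A minimal illustrative helper:

```python
def missing_partitions(expected: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Partitions whose production row count differs from the upstream count.

    Only these partitions need backfilling -- far cheaper and safer than
    reprocessing the whole table.
    """
    return sorted(p for p, n in expected.items() if actual.get(p, 0) != n)
```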

9. Glue Job Failing Because of Skew

You must mention:

  • Salting
  • Repartitioning
  • Avoiding wide transformations
  • Replacing groupByKey with reduceByKey or aggregateByKey
  • Broadcasting small datasets

Skew kills performance. You should show a clear diagnosis path.
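Salting is the one technique candidates most often name without being able to show. The core trick is tiny: append a random suffix to the hot key so its rows spread across several partitions. This is a plain-Python sketch of the key transformation only (`salt_key` is illustrative; in Spark you would apply it inside a column expression):

```python
import random

def salt_key(key: str, buckets: int = 8) -> str:
    """Append a random salt so one hot key fans out across `buckets` partitions.

    The other side of the join must be expanded with every salt value
    (key_0 .. key_7) so matches still line up.
    """
    return f"{key}_{random.randrange(buckets)}"
```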

10. ML Team Wants Hourly Instead of Daily

Now it’s about architecture.
Options include:

  • Micro-batching
  • Structured Streaming / Auto Loader
  • EMR incremental loads
  • Queue-based ingestion

Your answer must talk about cost vs latency trade-offs.
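Micro-batching is often the cheapest answer: keep the existing batch job but feed it hourly windows. As a hedged sketch of the windowing logic (events as `(epoch_seconds, payload)` pairs; `hourly_batches` is an illustrative name, not a streaming API):

```python
from datetime import datetime, timezone

def hourly_batches(events: list[tuple[float, dict]]) -> dict[str, list[dict]]:
    """Bucket timestamped events into hourly micro-batches.

    Each bucket can then be handed to the existing daily job's logic,
    trading a little latency for far lower cost than true streaming.
    """
    batches: dict[str, list[dict]] = {}
    for ts, payload in events:
        window = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:00")
        batches.setdefault(window, []).append(payload)
    return batches
```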
