Databricks interviews are brutally simple: either you understand the core building blocks, or you don’t. And the gap shows immediately. These 20 questions represent what actually matters — the logic behind Delta Lake, pipeline reliability, ingestion patterns, and performance.
Let’s break them down with clarity and precision.
A Delta table is still Parquet underneath, plus a transaction log that provides ACID transactions, time travel, schema enforcement, and schema evolution. If you don’t know the internals, you can’t debug production issues.
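Those internals are concrete: each commit in `_delta_log` is a JSON file of actions, and the current table state is just a replay of `add` and `remove` actions. A minimal sketch, with made-up file names and a simplified action schema:

```python
import json

# Hypothetical commit contents: each line in a _delta_log/<version>.json
# file is one action (add, remove, metaData, ...). Schema simplified here.
commit_000 = "\n".join([
    json.dumps({"add": {"path": "part-0000.parquet", "size": 1024}}),
    json.dumps({"add": {"path": "part-0001.parquet", "size": 2048}}),
])
commit_001 = "\n".join([
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet", "size": 512}}),
])

def replay(commits):
    """Replay add/remove actions in order to get the live file set."""
    live = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

print(sorted(replay([commit_000, commit_001])))
# ['part-0001.parquet', 'part-0002.parquet']
```

Note that a `remove` action doesn’t delete the Parquet file; it only drops it from the live snapshot, which is what makes time travel (and VACUUM) possible.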
VACUUM only removes data files that are no longer referenced by the table and are older than the retention threshold (7 days by default). If you VACUUM aggressively without accounting for time travel and streaming checkpoints, you’ll break pipelines.
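The eligibility rule is a conjunction, which is exactly what interviewers probe. A toy model (function and file names are illustrative, not the real API):

```python
from datetime import datetime, timedelta

# A data file is deletable only if it is BOTH unreferenced by the
# current snapshot AND older than the retention threshold.
def vacuum_candidates(files, live_paths, retention_hours, now):
    cutoff = now - timedelta(hours=retention_hours)
    return [
        path for path, modified in files.items()
        if path not in live_paths and modified < cutoff
    ]

now = datetime(2024, 6, 1)
files = {
    "part-0000.parquet": now - timedelta(hours=200),  # old, unreferenced
    "part-0001.parquet": now - timedelta(hours=200),  # old, but still live
    "part-0002.parquet": now - timedelta(hours=1),    # unreferenced, too recent
}
live = {"part-0001.parquet"}
print(vacuum_candidates(files, live, retention_hours=168, now=now))
# ['part-0000.parquet']
```

Shrinking `retention_hours` below the window a streaming consumer or time-travel query still needs is how “aggressive VACUUM” breaks things.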
Every N commits (10 by default), Delta writes a Parquet checkpoint that aggregates the accumulated _delta_log JSON actions, so readers don’t have to replay the entire log. Without understanding this, you’ll never explain why Delta scales better than plain Parquet.
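The payoff is in the read path: with a checkpoint, a reader loads one Parquet file plus only the JSON commits after it. A sketch of that shortcut (file-name format mirrors the real zero-padded log names):

```python
# Which log files must a reader load to reconstruct table state?
def files_to_read(latest_version, checkpoint_version=None):
    if checkpoint_version is None:
        # No checkpoint: replay every JSON commit from version 0.
        return [f"{v:020d}.json" for v in range(latest_version + 1)]
    # Checkpoint: one Parquet file + only the commits after it.
    return [f"{checkpoint_version:020d}.checkpoint.parquet"] + [
        f"{v:020d}.json"
        for v in range(checkpoint_version + 1, latest_version + 1)
    ]

print(len(files_to_read(12)))                    # 13 JSON files, no checkpoint
print(len(files_to_read(12, checkpoint_version=10)))  # 3: 1 checkpoint + 2 commits
```

At thousands of commits, that difference is why Delta metadata reads stay fast while a naive log replay would not.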
COPY INTO is batch-triggered and idempotent. Auto Loader is incremental and scales via directory listing or file notification services. Use COPY INTO for backfills and one-off loads, Auto Loader for continuous, streaming-style ingestion.
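The shape of each, as a hedged sketch (table names, paths, and checkpoint locations are placeholders; assumes a Databricks runtime where `spark` is in scope):

```python
# COPY INTO: idempotent batch ingestion, good for backfills.
spark.sql("""
  COPY INTO main.bronze.events
  FROM '/mnt/raw/events'
  FILEFORMAT = JSON
""")

# Auto Loader: incremental, checkpointed discovery of new files.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/events/_schema")
    .load("/mnt/raw/events")
    .writeStream
    .option("checkpointLocation", "/chk/events")
    .toTable("main.bronze.events"))
```

COPY INTO re-run on the same files is a no-op; Auto Loader’s checkpoint gives the same guarantee for a continuously arriving stream of files.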
Delta supports both schema evolution and schema enforcement. Evolution is opt-in via mergeSchema; enforcement blocks writes whose schema doesn’t match the table. Sloppy configuration lets bad schemas silently corrupt downstream data.
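The decision logic is simple enough to model directly. A toy version of write-time schema handling (illustrative, not the Delta API):

```python
# Enforcement rejects unknown columns unless evolution (mergeSchema-style)
# is explicitly enabled, in which case the table schema widens.
def validate_write(table_schema, batch_schema, merge_schema=False):
    extra = set(batch_schema) - set(table_schema)
    if extra and not merge_schema:
        raise ValueError(f"schema mismatch, unexpected columns: {sorted(extra)}")
    return sorted(set(table_schema) | set(batch_schema))  # evolved schema

table = ["id", "amount"]
print(validate_write(table, ["id", "amount", "channel"], merge_schema=True))
# ['amount', 'channel', 'id']
try:
    validate_write(table, ["id", "amount", "channel"])  # enforcement blocks it
except ValueError as e:
    print(e)
```

The interview-relevant point: enforcement is the default, and that default is what protects Silver and Gold tables from upstream drift.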
Bronze = raw, Silver = cleansed, Gold = business-ready aggregations. The point is lineage, incremental design, and auditability — not aesthetic diagrams.
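The layers are easiest to defend with a concrete flow. A minimal sketch in plain Python (records and rules are made up for illustration):

```python
raw = [  # Bronze: ingested as-is, everything kept for auditability
    {"order_id": "1", "amount": "10.5", "country": "DE"},
    {"order_id": "2", "amount": "bad", "country": "DE"},
    {"order_id": "3", "amount": "4.0", "country": "FR"},
]

def to_silver(rows):
    """Silver: typed and validated; bad rows dropped (or quarantined)."""
    clean = []
    for r in rows:
        try:
            clean.append({**r, "amount": float(r["amount"])})
        except ValueError:
            pass  # in production, route to a quarantine table instead
    return clean

def to_gold(rows):
    """Gold: business-ready aggregate, here revenue per country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

print(to_gold(to_silver(raw)))  # {'DE': 10.5, 'FR': 4.0}
```

Because Bronze keeps the bad row, you can always trace why `order_id` 2 is missing from Gold, which is the lineage and auditability argument in miniature.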
OPTIMIZE compacts small files; ZORDER improves data skipping. If your data isn’t highly selective, ZORDER is a waste of compute.
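Why ZORDER helps is worth being able to sketch: readers prune files using per-file min/max statistics, and clustering correlated values into the same files tightens those ranges. A toy model (stats are made up for illustration):

```python
file_stats = {
    "part-0.parquet": {"min": 0,    "max": 999},   # well clustered
    "part-1.parquet": {"min": 1000, "max": 1999},  # well clustered
    "part-2.parquet": {"min": 0,    "max": 1999},  # poorly clustered: never pruned
}

def files_to_scan(stats, value):
    """Data skipping: only scan files whose min/max range contains the value."""
    return [f for f, s in stats.items() if s["min"] <= value <= s["max"]]

print(files_to_scan(file_stats, 1500))  # ['part-1.parquet', 'part-2.parquet']
```

This also shows the failure mode from the answer above: if the filter column has low selectivity, every file’s range contains every value and ZORDER buys nothing.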
You need to know how to analyze skew, spill, shuffle writes, executor memory usage, and stage-level metrics. Logging into a cluster and guessing doesn’t cut it.
Photon uses vectorized execution in C++. It’s faster for SQL and Delta-heavy workloads. Not knowing when to enable it is a missed opportunity.
Delta uses optimistic concurrency control. Understanding conflict detection is essential before explaining why a MERGE job randomly fails.
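The mechanism behind a “random” MERGE failure can be modeled in a few lines. A simplified sketch (real Delta reasons about action types and partitions, not just file sets; the exception name mirrors the one Delta raises):

```python
# Each transaction records the snapshot version it read; at commit time
# it checks whether any later commit touched files it depended on.
def try_commit(read_version, read_files, log):
    for version, written in log.items():
        if version > read_version and written & read_files:
            raise RuntimeError(
                f"ConcurrentModificationException at version {version}")
    return max(log) + 1  # no conflict: commit as the next version

log = {1: {"part-a.parquet"}, 2: {"part-b.parquet"}}
print(try_commit(read_version=2, read_files={"part-a.parquet"}, log=log))  # 3
try:
    # Read at v1, but v2 rewrote a file we depended on -> conflict.
    try_commit(read_version=1, read_files={"part-b.parquet"}, log=log)
except RuntimeError as e:
    print(e)
```

Two MERGE jobs touching disjoint partitions commit happily; two touching the same files race, and the loser retries or fails. That is the “random” failure.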
Auto Loader tracks already-processed files in RocksDB inside the stream’s checkpoint, and discovers new files via directory listing or cloud notification services. If you don’t know this, you can’t explain why it scales better than manual file-based streaming.
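The core idea is exactly-once file discovery: a persistent key-value store remembers what was already ingested, so relisting a directory never reprocesses a file. A toy version (a dict standing in for the RocksDB state):

```python
processed = {}  # stand-in for the RocksDB state in the stream checkpoint

def discover(listing):
    """Return only files not seen before, and remember them."""
    new = [f for f in listing if f not in processed]
    for f in new:
        processed[f] = True  # persisted with the checkpoint in real life
    return new

print(discover(["a.json", "b.json"]))            # ['a.json', 'b.json']
print(discover(["a.json", "b.json", "c.json"]))  # ['c.json']
```

This is also why deleting a stream’s checkpoint is dangerous: the processed-file state goes with it, and the next run re-ingests everything.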
Unity Catalog centralizes governance, lineage, and fine-grained permissions. Running the legacy Hive metastore alongside UC creates confusion, so you must know the migration considerations.
Every Databricks DE job eventually becomes a MERGE-heavy pipeline. Good engineers know how to handle updates, deletes, late-arriving data, and dedupe.
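What a MERGE-heavy pipeline actually has to get right can be shown without Spark at all. A sketch of an upsert that dedupes the batch to the latest event per key, handles deletes, and skips late-arriving records older than the current row (all record shapes are illustrative):

```python
def merge(target, batch):
    latest = {}
    for row in batch:  # dedupe: keep only the newest record per key
        key = row["id"]
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    for key, row in latest.items():
        current = target.get(key)
        if current and current["ts"] > row["ts"]:
            continue  # late-arriving data older than what we have: skip
        if row.get("deleted"):
            target.pop(key, None)  # tombstone: remove if present
        else:
            target[key] = row      # insert or update
    return target

target = {"1": {"id": "1", "ts": 5, "v": "old"}}
batch = [
    {"id": "1", "ts": 7, "v": "new"},
    {"id": "1", "ts": 6, "v": "stale"},                # deduped away
    {"id": "2", "ts": 1, "v": "x", "deleted": True},   # delete of absent key: no-op
]
print(merge(target, batch))  # {'1': {'id': '1', 'ts': 7, 'v': 'new'}}
```

In Delta this becomes `MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED`, but the dedupe-before-merge step is yours to write, and skipping it is the classic source of “multiple source rows matched” errors.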
Unity Catalog supports row- and column-level security natively through dynamic views. If your company handles PII, this is non-negotiable.
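The shape of a column-masking dynamic view, as a hedged sketch (catalog, schema, and group names are placeholders; assumes a Databricks runtime where `spark` is in scope; `is_account_group_member` is the built-in group check in Databricks SQL):

```python
spark.sql("""
  CREATE OR REPLACE VIEW main.gold.customers_masked AS
  SELECT
    id,
    CASE WHEN is_account_group_member('pii_readers')
         THEN email
         ELSE sha2(email, 256)   -- non-members see only a hash
    END AS email
  FROM main.silver.customers
""")
```

Grant users access to the view, not the underlying table, and the masking rule travels with every query path.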
Workflows is simply Jobs with orchestration superpowers — task dependencies, conditionals, triggers, and retries.
You need to know how task dependencies, conditionals, triggers, and retries interact, and what happens on partial failure. Half of bad pipelines come from bad orchestration.
You fix skew by salting hot keys, repartitioning on a better-distributed key, caching strategically, and avoiding unnecessary wide transformations.
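Salting is the technique interviewers most often ask you to whiteboard: one hot key sends every row to the same partition, so you append a small salt to spread it out (and explode the salts on the other side of the join). A toy model with a stand-in partitioner (in Spark you would add a salt column and repartition on key plus salt):

```python
def partition_of(key, num_partitions=8):
    # Stand-in partitioner for illustration; Spark uses its own hash.
    return sum(ord(c) for c in key) % num_partitions

hot_rows = ["hot_key"] * 1000
plain = {partition_of(k) for k in hot_rows}
salted = {partition_of(f"{k}#{i % 8}") for i, k in enumerate(hot_rows)}
print(len(plain))   # 1 -> the whole hot key lands on one partition
print(len(salted))  # 8 -> spread across all partitions
```

The cost is the exploded join side and a final re-aggregation, which is why salting is a targeted fix for known hot keys, not a default.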
DLT is not “another ETL tool.” It enforces data quality expectations, handles retries, auto-scaling, lineage, and guarantees data reliability.
A serious team uses those guarantees as the default for production pipelines instead of rebuilding them by hand.