Recently one of my friends went through the CGI Data Engineer interview process, and trust me, it was a mix of real-world data problems, PySpark logic, and Azure ecosystem understanding.
If you’re preparing for Azure + Databricks roles, these questions will hit the exact level you’ll face.
Role: Azure Data Engineer
CTC: 25 LPA
Experience: 5+ years
Difficulty Level: Medium
1️⃣ PySpark Medium-Level Coding Question:
👉 Given a dataset of transactions, find the top 3 customers with the highest total spend per region.
(Focus on groupBy, agg, window functions, and performance tuning.)
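One way to reason about this question, sketched in plain Python so it runs anywhere (the sample transactions are hypothetical). In PySpark, the same logic maps to `groupBy("region", "customer").agg(F.sum("amount"))` followed by `row_number()` over `Window.partitionBy("region").orderBy(F.desc("total_spend"))`:

```python
from collections import defaultdict

def top_spenders_per_region(transactions, n=3):
    """Total spend per (region, customer), then rank within each region."""
    totals = defaultdict(float)
    for region, customer, amount in transactions:
        totals[(region, customer)] += amount
    by_region = defaultdict(list)
    for (region, customer), total in totals.items():
        by_region[region].append((customer, total))
    # Sort each region's customers by total spend, keep the top n.
    return {
        region: sorted(pairs, key=lambda p: -p[1])[:n]
        for region, pairs in by_region.items()
    }

# Hypothetical sample data: (region, customer, amount)
txns = [
    ("EU", "alice", 100.0), ("EU", "bob", 250.0), ("EU", "carol", 80.0),
    ("EU", "dan", 40.0), ("EU", "alice", 200.0),
    ("US", "erin", 500.0), ("US", "frank", 120.0),
]
print(top_spenders_per_region(txns))
```

For the performance-tuning angle: aggregate before applying the window (so the window operates on one row per customer, not per transaction), and consider salting if a single region is heavily skewed.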
2️⃣ SQL Query:
👉 Write a query to find users who made more than 3 purchases in the last 7 days but didn’t purchase in the previous week.
(They check your ability to use CTEs, window functions, and date filtering.)
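A sketch of one possible answer, run against SQLite from Python with a fixed reference date so the example is reproducible (table name, column names, and the data are all hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (user_id TEXT, purchased_at TEXT);
-- Hypothetical data; the reference date is 2024-06-15.
INSERT INTO purchases VALUES
  ('u1','2024-06-10'),('u1','2024-06-11'),('u1','2024-06-12'),('u1','2024-06-14'),
  ('u2','2024-06-10'),('u2','2024-06-11'),('u2','2024-06-12'),('u2','2024-06-13'),
  ('u2','2024-06-05'),
  ('u3','2024-06-14');
""")

query = """
WITH last_week AS (
  SELECT user_id, COUNT(*) AS cnt
  FROM purchases
  WHERE purchased_at >  date('2024-06-15', '-7 days')
    AND purchased_at <= '2024-06-15'
  GROUP BY user_id
),
prev_week AS (
  SELECT DISTINCT user_id
  FROM purchases
  WHERE purchased_at >  date('2024-06-15', '-14 days')
    AND purchased_at <= date('2024-06-15', '-7 days')
)
SELECT l.user_id
FROM last_week l
LEFT JOIN prev_week p ON p.user_id = l.user_id
WHERE l.cnt > 3 AND p.user_id IS NULL;
"""
print(conn.execute(query).fetchall())
```

Here `u1` qualifies (4 purchases in the last 7 days, none in the week before), while `u2` is excluded by the anti-join because of the 2024-06-05 purchase. In a warehouse you would replace the literal date with `CURRENT_DATE` and its own date functions.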
3️⃣ Explain your end-to-end data ingestion architecture in your recent project.
4️⃣ How did you handle schema drift in your pipelines (ADF or Databricks)?
5️⃣ Describe a performance optimization you did in PySpark – what changed and what was the impact?
6️⃣ How did you manage data partitioning and bucketing in Delta Lake?
7️⃣ What was your data validation and quality check approach before loading data to Azure Synapse or Snowflake?
8️⃣ How do you orchestrate jobs between ADF and Databricks notebooks?
9️⃣ How do you version control your notebooks in Databricks?
10️⃣ Have you implemented CI/CD pipelines for Databricks? How? (mention Azure DevOps or GitHub Actions)
11️⃣ What are the best practices for optimizing Delta tables (vacuum, ZORDER, OPTIMIZE, caching)?
12️⃣ How do you manage secrets and credentials securely within Databricks?
13️⃣ How did you implement incremental loading (CDC) from source systems?
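A common answer is watermark-based incremental extraction, sketched here in plain Python (the rows, the `updated_at` column name, and the in-memory state are illustrative; in ADF you'd persist the watermark and inject it into the source query, and in Databricks you might use Auto Loader or Delta Change Data Feed instead):

```python
def incremental_load(source_rows, watermark):
    """Pull only rows changed since the last watermark, then advance it."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Hypothetical source table snapshot.
source = [
    {"id": 1, "updated_at": "2024-06-01"},
    {"id": 2, "updated_at": "2024-06-03"},
    {"id": 3, "updated_at": "2024-06-05"},
]
batch, wm = incremental_load(source, "2024-06-02")
print(len(batch), wm)  # only rows newer than the stored watermark
```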
14️⃣ What’s your approach to handling late arriving data in streaming or batch ETL?
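The core of most late-data answers is an idempotent upsert keyed by event time, so a record arriving in a later batch can never overwrite a newer state. A minimal sketch (field names hypothetical; in Structured Streaming you'd pair this idea with `withWatermark()` and a MERGE into the Delta target):

```python
def apply_batch(target, batch):
    """Upsert records by key, keeping the latest event_time even when a
    late-arriving record lands in a later batch."""
    for rec in batch:
        cur = target.get(rec["id"])
        if cur is None or rec["event_time"] > cur["event_time"]:
            target[rec["id"]] = rec
    return target

target = {}
apply_batch(target, [{"id": 1, "event_time": "2024-06-10T12:00", "v": "a"}])
# A late record for an earlier event time must not overwrite the newer one.
apply_batch(target, [{"id": 1, "event_time": "2024-06-10T09:00", "v": "late"}])
print(target[1]["v"])  # still "a"
```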
15️⃣ How do you ensure data consistency and reliability in your pipelines?
16️⃣ Explain Star vs Snowflake schema – which one did you choose and why?
17️⃣ How do you model a data mart for sales analytics in Azure Synapse or Databricks SQL?
18️⃣ What’s your approach to dimension management (SCD Type 2) in PySpark?
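The SCD Type 2 pattern interviewers usually want: detect changes, expire the open row, append a new current version. Sketched in plain Python with hypothetical field names; in PySpark/Delta this maps to a `MERGE` that updates matched current rows (setting `end_date`/`is_current`) plus an insert of the new versions:

```python
from datetime import date

def scd2_upsert(history, incoming, today):
    """Close changed current rows and append new versions (SCD Type 2)."""
    current = {r["key"]: r for r in history if r["is_current"]}
    for row in incoming:
        cur = current.get(row["key"])
        if cur and cur["attrs"] == row["attrs"]:
            continue                      # unchanged: no new version
        if cur:                           # changed: expire the open row
            cur["end_date"], cur["is_current"] = today, False
        history.append({"key": row["key"], "attrs": row["attrs"],
                        "start_date": today, "end_date": None,
                        "is_current": True})
    return history

hist = [{"key": "c1", "attrs": {"city": "Pune"},
         "start_date": date(2024, 1, 1), "end_date": None, "is_current": True}]
hist = scd2_upsert(hist, [{"key": "c1", "attrs": {"city": "Mumbai"}}],
                   date(2024, 6, 1))
print(len(hist))  # 2: the expired Pune row plus the current Mumbai row
```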
19️⃣ How do you optimize joins and aggregations in a large-scale Azure Databricks environment?
20️⃣ If your data lake grew 10x overnight, what’s your scaling strategy (compute, storage, cost, and partitioning)?