Deloitte Azure Data Engineer Interview Q/A

  1. Write a SQL query to find the third highest salary from an employee table.

I would use DENSE_RANK() with descending salary order and filter where rank equals 3. This approach handles duplicate salaries correctly compared to simple TOP or LIMIT queries. Window functions are preferred because they are scalable and easier to maintain.

  1. Difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN with use cases.

INNER JOIN returns matching records from both tables and is mostly used in transactional reporting. LEFT JOIN returns all records from the left table even if no match exists, which is useful for finding missing data. FULL JOIN returns all records from both tables and is commonly used in reconciliation scenarios.

  1. How do you handle duplicate records in PySpark DataFrame?

I use dropDuplicates() for simple duplicate removal and window functions for advanced deduplication logic. In production, I usually define business keys and timestamps to keep only the latest valid record. Deduplication is typically performed before loading curated layers.

  1. Write a PySpark transformation to filter and aggregate large datasets.

I generally use filter(), groupBy(), and agg() transformations for large-scale processing. Predicate pushdown and partition pruning help reduce scanned data for better performance. I also avoid unnecessary shuffles and cache intermediate DataFrames when reused multiple times.

  1. What are lazy evaluation and actions in Spark?

Spark follows lazy evaluation, meaning transformations are not executed immediately until an action is triggered. This helps Spark optimize the execution plan before processing data. Actions like collect(), count(), and write() trigger actual computation.

  1. How do you optimize a slow-running Spark job?

I optimize Spark jobs using partitioning, broadcast joins, caching, and reducing shuffle operations. I also analyze the Spark UI to identify bottlenecks like skewed joins or excessive memory usage. Using native Spark functions instead of UDFs improves execution performance significantly.

  1. Explain how you built an ETL pipeline using Azure Data Factory.

I built ETL pipelines where ADF orchestrated ingestion from SQL databases and APIs into Azure Data Lake. Databricks notebooks handled transformation logic, validation, and loading into curated Delta tables. Logging, monitoring, and parameterization were implemented for scalability and maintainability.

  1. How do you implement incremental load using watermark or timestamp?

I maintain a watermark column such as last_updated_timestamp to identify newly inserted or modified records. During each pipeline run, only records greater than the previous watermark value are processed. This reduces processing time and avoids reloading the entire dataset.

  1. What is Azure Databricks architecture?

Azure Databricks consists of a control plane and a data plane integrated with Azure services. The control plane manages clusters, notebooks, and job orchestration, while the data plane processes workloads inside the customerโ€™s cloud environment. It provides scalable distributed processing using Apache Spark.

  1. Difference between cache() and persist() in Spark.

cache() stores DataFrames in memory using the default storage level, while persist() allows choosing different storage options like memory and disk combinations. persist() is more flexible for handling large datasets that may not fit entirely in memory. Both improve performance when DataFrames are reused.

  1. How do you handle schema drift in ADF pipelines?

I use schema mapping, validation checks, and flexible ingestion logic to handle schema drift. In Databricks, schema evolution features help accommodate new columns dynamically. Monitoring and alerts are also configured to identify unexpected schema changes before impacting downstream systems.

  1. How do you manage secrets and credentials securely in Azure?

I use Azure Key Vault to securely store secrets, connection strings, and credentials. ADF and Databricks access secrets through managed identities instead of hardcoding sensitive information. This improves security, governance, and compliance across environments.

  1. Tell me about a project where you handled large-scale data processing.

In one project, we processed hundreds of GBs of transactional and customer data daily using Databricks and ADLS. I worked on ingestion, transformation, optimization, and monitoring of Spark jobs. Partitioning and Delta Lake optimization significantly improved processing efficiency.

  1. How do you deal with tight deadlines in client projects?

I prioritize high-impact tasks first and break work into manageable deliverables with realistic timelines. Regular communication with stakeholders helps avoid surprises and align expectations early. During critical phases, I focus on structured execution instead of rushing blindly.

  1. Describe a situation where you improved pipeline performance.

One pipeline was taking more than three hours due to inefficient joins and excessive shuffling. I optimized partitioning, introduced broadcast joins, and reduced unnecessary transformations. After optimization, the execution time reduced to less than one hour.

  1. How do you communicate with non-technical stakeholders?

I explain solutions using business-oriented language instead of deep technical details. Rather than discussing Spark internals, I focus on how the solution improves reporting speed, accuracy, or operational efficiency. This helps stakeholders understand business impact clearly.

  1. What will you do if your data pipeline fails in production?

First, I would analyze logs and monitoring dashboards to identify the root cause quickly. After fixing the issue, I would rerun failed stages using checkpoints or restart mechanisms. I would also communicate status updates proactively to minimize business impact.

  1. Why Deloitte?

Deloitte works on enterprise-scale digital transformation projects using modern cloud and analytics technologies. I believe the role aligns well with my experience in Azure, Databricks, and Data Engineering. The opportunity to work on diverse client projects and solve large-scale business problems makes Deloitte a strong fit for my career growth.
๐—œ ๐—ต๐—ฎ๐˜ƒ๐—ฒ ๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜๐—ฒ๐—ฑ ๐—ฎ ๐—–๐—ผ๐—บ๐—ฝ๐—น๐—ฒ๐˜๐—ฒ ๐—ฃ๐—ฟ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—š๐˜‚๐—ถ๐—ฑ๐—ฒ ๐—ณ๐—ผ๐—ฟ ๐——๐—ฎ๐˜๐—ฎ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ๐˜€.

๐—š๐—ฒ๐˜ ๐˜๐—ต๐—ฒ ๐—š๐˜‚๐—ถ๐—ฑ๐—ฒ ๐—ต๐—ฒ๐—ฟ๐—ฒ – ๐Ÿ‘‰https://topmate.io/kasi_v/1823412?utm_source=public_profile&utm_campaign=kasi_v

If you’ve read this far, LIKE ๐Ÿ‘ and RESHARE ๐Ÿ” to help more engineers prepare confidently.

Leave a Reply

Your email address will not be published. Required fields are marked *