I would use DENSE_RANK() with descending salary order and filter where rank equals 3. This approach handles duplicate salaries correctly compared to simple TOP or LIMIT queries. Window functions are preferred because they are scalable and easier to maintain.
INNER JOIN returns matching records from both tables and is mostly used in transactional reporting. LEFT JOIN returns all records from the left table even if no match exists, which is useful for finding missing data. FULL JOIN returns all records from both tables and is commonly used in reconciliation scenarios.
I use dropDuplicates() for simple duplicate removal and window functions for advanced deduplication logic. In production, I usually define business keys and timestamps to keep only the latest valid record. Deduplication is typically performed before loading curated layers.
I generally use filter(), groupBy(), and agg() transformations for large-scale processing. Predicate pushdown and partition pruning help reduce scanned data for better performance. I also avoid unnecessary shuffles and cache intermediate DataFrames when reused multiple times.
Spark follows lazy evaluation, meaning transformations are not executed immediately until an action is triggered. This helps Spark optimize the execution plan before processing data. Actions like collect(), count(), and write() trigger actual computation.
I optimize Spark jobs using partitioning, broadcast joins, caching, and reducing shuffle operations. I also analyze the Spark UI to identify bottlenecks like skewed joins or excessive memory usage. Using native Spark functions instead of UDFs improves execution performance significantly.
I built ETL pipelines where ADF orchestrated ingestion from SQL databases and APIs into Azure Data Lake. Databricks notebooks handled transformation logic, validation, and loading into curated Delta tables. Logging, monitoring, and parameterization were implemented for scalability and maintainability.
I maintain a watermark column such as last_updated_timestamp to identify newly inserted or modified records. During each pipeline run, only records greater than the previous watermark value are processed. This reduces processing time and avoids reloading the entire dataset.
Azure Databricks consists of a control plane and a data plane integrated with Azure services. The control plane manages clusters, notebooks, and job orchestration, while the data plane processes workloads inside the customerโs cloud environment. It provides scalable distributed processing using Apache Spark.
cache() stores DataFrames in memory using the default storage level, while persist() allows choosing different storage options like memory and disk combinations. persist() is more flexible for handling large datasets that may not fit entirely in memory. Both improve performance when DataFrames are reused.
I use schema mapping, validation checks, and flexible ingestion logic to handle schema drift. In Databricks, schema evolution features help accommodate new columns dynamically. Monitoring and alerts are also configured to identify unexpected schema changes before impacting downstream systems.
I use Azure Key Vault to securely store secrets, connection strings, and credentials. ADF and Databricks access secrets through managed identities instead of hardcoding sensitive information. This improves security, governance, and compliance across environments.
In one project, we processed hundreds of GBs of transactional and customer data daily using Databricks and ADLS. I worked on ingestion, transformation, optimization, and monitoring of Spark jobs. Partitioning and Delta Lake optimization significantly improved processing efficiency.
I prioritize high-impact tasks first and break work into manageable deliverables with realistic timelines. Regular communication with stakeholders helps avoid surprises and align expectations early. During critical phases, I focus on structured execution instead of rushing blindly.
One pipeline was taking more than three hours due to inefficient joins and excessive shuffling. I optimized partitioning, introduced broadcast joins, and reduced unnecessary transformations. After optimization, the execution time reduced to less than one hour.
I explain solutions using business-oriented language instead of deep technical details. Rather than discussing Spark internals, I focus on how the solution improves reporting speed, accuracy, or operational efficiency. This helps stakeholders understand business impact clearly.
First, I would analyze logs and monitoring dashboards to identify the root cause quickly. After fixing the issue, I would rerun failed stages using checkpoints or restart mechanisms. I would also communicate status updates proactively to minimize business impact.
Deloitte works on enterprise-scale digital transformation projects using modern cloud and analytics technologies. I believe the role aligns well with my experience in Azure, Databricks, and Data Engineering. The opportunity to work on diverse client projects and solve large-scale business problems makes Deloitte a strong fit for my career growth.
๐ ๐ต๐ฎ๐๐ฒ ๐ฐ๐ฟ๐ฒ๐ฎ๐๐ฒ๐ฑ ๐ฎ ๐๐ผ๐บ๐ฝ๐น๐ฒ๐๐ฒ ๐ฃ๐ฟ๐ฒ๐ฝ๐ฎ๐ฟ๐ฎ๐๐ถ๐ผ๐ป ๐๐๐ถ๐ฑ๐ฒ ๐ณ๐ผ๐ฟ ๐๐ฎ๐๐ฎ ๐๐ป๐ด๐ถ๐ป๐ฒ๐ฒ๐ฟ๐.
๐๐ฒ๐ ๐๐ต๐ฒ ๐๐๐ถ๐ฑ๐ฒ ๐ต๐ฒ๐ฟ๐ฒ – ๐https://topmate.io/kasi_v/1823412?utm_source=public_profile&utm_campaign=kasi_v
If you’ve read this far, LIKE ๐ and RESHARE ๐ to help more engineers prepare confidently.