I handle data skew using techniques like salting, broadcast joins, and repartitioning skewed keys evenly across executors. I also analyze the Spark UI to identify skewed stages that cause long execution times. In production pipelines, I optimized joins and reduced shuffle operations to improve overall processing performance. repartition() increases or decreases partitions with a full shuffle and provides […]
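The salting idea mentioned in this excerpt can be sketched without a cluster. The following is a minimal plain-Python illustration, not Spark code: the names `salted_sum`, `largest_group`, and `SALT_BUCKETS` are hypothetical, and the round-robin salt stands in for the random salt a Spark job would typically use. It shows the two-stage pattern: aggregate per (key, salt) first, then strip the salt and combine.

```python
from collections import defaultdict

SALT_BUCKETS = 4  # hypothetical number of salt values per key


def salted_sum(records, salt_buckets=SALT_BUCKETS):
    """Sum amounts per key in two stages, splitting each key into salted sub-keys."""
    # Stage 1: partial sums per (key, salt). Each salted sub-key is a smaller group.
    partial = defaultdict(int)
    for i, (key, amount) in enumerate(records):
        salt = i % salt_buckets  # round-robin salt (assumption; often random in practice)
        partial[(key, salt)] += amount

    # Stage 2: drop the salt and combine the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)


def largest_group(records, salt_buckets=None):
    """Return the row count of the biggest group, with or without salting."""
    sizes = defaultdict(int)
    for i, (key, _amount) in enumerate(records):
        group = key if salt_buckets is None else (key, i % salt_buckets)
        sizes[group] += 1
    return max(sizes.values())


# One hot key dominates the data set, which is the skew scenario.
records = [("hot", 1)] * 1000 + [("cold_a", 5)] * 10 + [("cold_b", 7)] * 10

print(largest_group(records))                # biggest group without salting
print(largest_group(records, SALT_BUCKETS))  # biggest group with salting
print(salted_sum(records))                   # totals are unchanged by salting
```

The point of the sketch is that the final totals are identical while the largest single group shrinks, which is what lets the work spread across more tasks.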
I would use DENSE_RANK() with descending salary order and filter where rank equals 3. This approach handles duplicate salaries correctly compared to simple TOP or LIMIT queries. Window functions are preferred because they are scalable and easier to maintain. INNER JOIN returns matching records from both tables and is mostly used in transactional reporting. LEFT […]
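The DENSE_RANK() approach described in this excerpt can be tried locally. This is a minimal sketch using Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions). The `employees` table, its columns, and the sample rows are made up for illustration.

```python
import sqlite3


def third_highest_salary(rows):
    """Return (name, salary) rows whose salary is the third-highest distinct value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    query = """
        SELECT name, salary
        FROM (
            SELECT name, salary,
                   DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
            FROM employees
        ) AS ranked
        WHERE salary_rank = 3
        ORDER BY name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical data with duplicate salaries at ranks 1 and 3.
sample = [
    ("Asha", 900), ("Bala", 900),
    ("Chen", 800),
    ("Dev", 700), ("Esha", 700),
    ("Farid", 600),
]
print(third_highest_salary(sample))  # [('Dev', 700), ('Esha', 700)]
```

Because DENSE_RANK() assigns the same rank to ties and leaves no gaps, both employees earning 700 are returned, whereas a plain LIMIT with OFFSET 2 would return one of the 900 or 800 rows depending on tie order.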
I worked on Azure-based Data Engineering projects involving ADF, Databricks, ADLS, and PySpark. My role included building ingestion pipelines, developing transformation logic, optimizing Spark jobs, and handling deployments through Azure DevOps. In my recent project, we processed large-scale transactional data and built reporting-ready Gold layer datasets for business teams. ADF mainly supports Schedule Trigger, Tumbling […]
Data Engineering interviews are no longer about definitions or SQL puzzles. Companies want engineers who can think like owners, handle uncertainty, and solve messy real-world problems. That’s why scenario-based questions dominate every major tech interview today, from fintech to SaaS to product companies. Below is a breakdown of twelve core scenarios that expose whether […]
Below are the 20 most commonly asked Snowflake interview topics, explained clearly and practically. 1. What is Snowflake’s architecture and why is it unique? Snowflake uses a multi-cluster shared data architecture, separating compute, storage, and cloud services. 2. Explain Virtual Warehouses and how scaling works. A Virtual Warehouse is the compute used for query processing. Supports: […]
Databricks interviews are brutally simple: either you understand the core building blocks, or you don’t. And the gap shows immediately. These 20 questions represent what actually matters: the logic behind Delta Lake, pipeline reliability, ingestion patterns, and performance. Let’s break them down with clarity and precision. 1. Delta Lake vs Parquet A Delta table […]
In most data engineering setups, teams rely on scheduled batch jobs that reload entire datasets daily, even if only 1% of the files have changed. That’s inefficient, slow, and expensive. Databricks Auto Loader solves this with one key idea: incremental ingestion. It continuously tracks and processes only new files, enabling near real-time ingestion without […]
Recently one of my friends went through the CGI Data Engineer interview process, and trust me, it was a mix of real-world data problems, PySpark logic, and Azure ecosystem understanding. If you’re preparing for Azure + Databricks roles, these questions will hit the exact level you’ll face. Role: Azure Data Engineer. CTC: 25 LPA. Exp […]
Role: Data Engineer. CTC: 25 LPA. Exp: 5+ years. Difficulty Level: Medium.
1️⃣ PySpark: You have a dataset with user_id, timestamp, and transaction_amount. Write PySpark code to calculate each user’s average transaction in the last 30 days using window functions.
2️⃣ SQL: Write a query to find the second highest salary in each department and handle cases […]