Running a Spark Batch Job on Google Cloud Dataproc

As Data Engineers, one of the most powerful capabilities we use regularly is running batch Spark jobs on cloud clusters. Google Cloud Dataproc makes this seamless by letting us submit jobs directly to a managed Spark cluster. Here’s how I recently submitted a batch Spark job […]
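The submission itself is a single `gcloud` command. A minimal sketch, assuming a cluster named `my-cluster` in `us-central1` and a job script staged in a `gs://my-bucket` bucket (all of these names are placeholders):

```shell
# Submit a PySpark batch job to an existing Dataproc cluster.
# Arguments after the bare "--" are passed through to the job script.
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- --input gs://my-bucket/input/ --output gs://my-bucket/output/
```

Dataproc handles driver placement, dependency distribution, and log collection on the cluster for you.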

Read More

Truecaller’s PySpark ETL Challenge

Problem Statement: Truecaller deals with millions of user settings change events daily. Each event looks like this: id (long), name (string), value (string), timestamp (long). The goal: group events by id, convert the (name, value) pairs into a Map, always picking the value with the latest timestamp for each key, and output a partitioned table for faster downstream queries. Example: id name […]
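In PySpark this is typically done with a window or `max_by`-style aggregation, but the latest-value-wins logic itself can be sketched in plain Python (event field names follow the schema above):

```python
from collections import defaultdict

def latest_settings(events):
    """Collapse (id, name, value, timestamp) events into one map per id,
    keeping, for each name, the value with the latest timestamp."""
    latest = defaultdict(dict)  # id -> {name: (timestamp, value)}
    for ev in events:
        cur = latest[ev["id"]].get(ev["name"])
        if cur is None or ev["timestamp"] > cur[0]:
            latest[ev["id"]][ev["name"]] = (ev["timestamp"], ev["value"])
    # Drop the timestamps, keeping only the winning values.
    return {uid: {n: v for n, (_, v) in names.items()}
            for uid, names in latest.items()}

events = [
    {"id": 1, "name": "lang", "value": "en", "timestamp": 100},
    {"id": 1, "name": "lang", "value": "sv", "timestamp": 200},
    {"id": 1, "name": "theme", "value": "dark", "timestamp": 150},
]
print(latest_settings(events))  # {1: {'lang': 'sv', 'theme': 'dark'}}
```

The distributed version would express the same comparison as an aggregation over a `groupBy("id", "name")` so Spark can shuffle once and resolve conflicts per key.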

Read More

Real-Time Data Streaming with GCP Pub/Sub

Introduction: Retail today is not just about selling products – it’s about instant insights. Customers expect personalized offers, faster checkouts, and always-available inventory. For that, retailers need real-time data processing. In this tutorial, we’ll build a real-time data streaming pipeline for a retail company using Google Cloud Pub/Sub. Use Case: A retail chain with 500+ stores wants […]
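Pub/Sub message payloads are raw bytes, so the first step in any publisher is serializing each event. A minimal sketch of encoding a retail sale event (the field names here are illustrative assumptions, not a fixed schema):

```python
import json

def encode_sale_event(store_id, sku, qty, ts):
    """Serialize a retail sale event into the bytes payload Pub/Sub expects."""
    payload = {"store_id": store_id, "sku": sku, "qty": qty, "ts": ts}
    # sort_keys gives a stable byte representation for easier debugging.
    return json.dumps(payload, sort_keys=True).encode("utf-8")

msg = encode_sale_event("store-042", "SKU-123", 2, 1700000000)
print(msg)
```

With the `google-cloud-pubsub` client, this payload would then be handed to `publisher.publish(topic_path, data=msg)`, which requires exactly this kind of bytes object.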

Read More

Catalyst Optimizer in Spark 3.0: Types, Use Cases, and Real-World Commands

If you’re working with big data, speed matters. Spark 3.0 introduced major internal improvements, none more powerful than the Catalyst Optimizer and Adaptive Query Execution (AQE). 🧠 What is the Catalyst Optimizer? It’s Spark’s rule-based and cost-based optimizer. Every time you write PySpark code, it: converts it to […]
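Catalyst runs automatically, but AQE is opt-in in Spark 3.0. A minimal configuration sketch for enabling it (and its partition-coalescing and skew-join features) when building the session:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # AQE re-optimizes the plan at runtime using shuffle statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Merge small post-shuffle partitions into fewer, larger ones.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed join partitions to avoid straggler tasks.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```

The same keys can also be passed via `--conf` on `spark-submit` instead of in code.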

Read More

PySpark vs Pandas – When to Use What for Big Data

“Should I use Pandas or PySpark for my data processing?” Let’s break this down across key dimensions with real examples:

1️⃣ Performance & Scalability

Feature          Pandas            PySpark
Execution        Single-threaded   Distributed (multi-node)
In-memory limit  Limited to RAM    Designed for TBs+
File handling    Local files only  HDFS, S3, GCS, JDBC, […]
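The APIs mirror each other closely, which is why the choice comes down to scale rather than syntax. A small sketch of the same aggregation in Pandas, with the PySpark equivalent noted in a comment (the data here is made up for illustration):

```python
import pandas as pd

# Pandas: fine while the data fits comfortably in one machine's RAM.
df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 5]})
totals = df.groupby("store", as_index=False)["sales"].sum()
print(totals)

# The PySpark equivalent (same logic, but distributed across executors):
# spark.read.parquet(path).groupBy("store").agg(F.sum("sales"))
```

Same mental model, different execution engine: Pandas runs the groupby in a single process, while Spark shuffles the data across the cluster first.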

Read More

How to handle 100GB data in PySpark: A real-world guide with commands

“If you have 100GB of data, how do you process it efficiently in PySpark?” It’s a classic interview question — but also a real challenge every data engineer faces when working with big data in production. Let’s break down the best practices step by step, using practical commands you can reuse in your projects. ✅ […]
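One of the first numbers to get right is the partition count. A widely used rule of thumb (an assumption here, not a hard rule) is to target roughly 128 MB per partition, which matches the common HDFS block size:

```python
def target_partitions(total_bytes, partition_bytes=128 * 1024**2):
    """Estimate a partition count aiming at ~128 MB per partition."""
    # Ceiling division so the last partial partition is still counted.
    return max(1, -(-total_bytes // partition_bytes))

gb = 1024**3
print(target_partitions(100 * gb))  # 800
```

For a 100GB dataset this suggests about 800 partitions, which you would apply with something like `df.repartition(800)` before a wide transformation.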

Read More