Running a Spark Batch Job on Google Cloud Dataproc

As Data Engineers, one of the most powerful capabilities we use regularly is running batch Spark jobs on cloud clusters. Google Cloud Dataproc makes this seamless by letting us submit jobs directly to a managed Spark cluster. Here’s how I recently submitted a batch Spark job […]
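The submission itself is a single `gcloud` command. A minimal sketch, assuming a cluster named `my-cluster` in `us-central1` and a job script staged in a `gs://my-bucket` bucket (all of these names are placeholders):

```shell
# Submit a PySpark batch job to an existing Dataproc cluster.
# Arguments after the bare "--" are passed through to the job script.
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- --input gs://my-bucket/input/ --output gs://my-bucket/output/
```

Dataproc handles driver placement, dependency distribution, and log collection on the cluster for you.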

Read More

Truecaller’s PySpark ETL Challenge

Problem Statement: Truecaller deals with millions of user settings change events daily. Each event looks like this: id (long), name (string), value (string), timestamp (long). The goal: group events by id, convert the (name, value) pairs into a Map, always picking the value with the latest timestamp for each key, and output a partitioned table for faster downstream queries. Example: id name […]
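In PySpark this is typically done with a window or `max_by`-style aggregation, but the latest-value-wins logic itself can be sketched in plain Python (event field names follow the schema above):

```python
from collections import defaultdict

def latest_settings(events):
    """Collapse (id, name, value, timestamp) events into one map per id,
    keeping, for each name, the value with the latest timestamp."""
    latest = defaultdict(dict)  # id -> {name: (timestamp, value)}
    for ev in events:
        cur = latest[ev["id"]].get(ev["name"])
        if cur is None or ev["timestamp"] > cur[0]:
            latest[ev["id"]][ev["name"]] = (ev["timestamp"], ev["value"])
    # Drop the timestamps, keeping only the winning values.
    return {uid: {n: v for n, (_, v) in names.items()}
            for uid, names in latest.items()}

events = [
    {"id": 1, "name": "lang", "value": "en", "timestamp": 100},
    {"id": 1, "name": "lang", "value": "sv", "timestamp": 200},
    {"id": 1, "name": "theme", "value": "dark", "timestamp": 150},
]
print(latest_settings(events))  # {1: {'lang': 'sv', 'theme': 'dark'}}
```

The distributed version would express the same comparison as an aggregation over a `groupBy("id", "name")` so Spark can shuffle once and resolve conflicts per key.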

Read More

Real-Time Data Streaming with GCP Pub/Sub

Introduction: Retail today is not just about selling products – it’s about instant insights. Customers expect personalized offers, faster checkouts, and always-available inventory. For that, retailers need real-time data processing. In this tutorial, we’ll build a real-time data streaming pipeline for a retail company using Google Cloud Pub/Sub. Use Case: A retail chain with 500+ stores wants […]
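Pub/Sub message payloads are raw bytes, so the first step in any publisher is serializing each event. A minimal sketch of encoding a retail sale event (the field names here are illustrative assumptions, not a fixed schema):

```python
import json

def encode_sale_event(store_id, sku, qty, ts):
    """Serialize a retail sale event into the bytes payload Pub/Sub expects."""
    payload = {"store_id": store_id, "sku": sku, "qty": qty, "ts": ts}
    # sort_keys gives a stable byte representation for easier debugging.
    return json.dumps(payload, sort_keys=True).encode("utf-8")

msg = encode_sale_event("store-042", "SKU-123", 2, 1700000000)
print(msg)
```

With the `google-cloud-pubsub` client, this payload would then be handed to `publisher.publish(topic_path, data=msg)`, which requires exactly this kind of bytes object.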

Read More

Catalyst Optimizer in Spark 3.0: Types, Use Cases, and Real-World Commands

If you’re working with big data, speed matters. Spark 3.0 introduced major internal improvements, none more powerful than the Catalyst Optimizer and Adaptive Query Execution (AQE). 🧠 What is the Catalyst Optimizer? It’s Spark’s rule-based and cost-based optimizer. Every time you write PySpark code, it: converts it to […]
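Catalyst runs automatically, but AQE is opt-in in Spark 3.0. A minimal configuration sketch for enabling it (and its partition-coalescing and skew-join features) when building the session:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # AQE re-optimizes the plan at runtime using shuffle statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Merge small post-shuffle partitions into fewer, larger ones.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed join partitions to avoid straggler tasks.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```

The same keys can also be passed via `--conf` on `spark-submit` instead of in code.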

Read More

PySpark vs Pandas – When to Use What for Big Data

“Should I use Pandas or PySpark for my data processing?” Let’s break this down across key dimensions with real examples:

1️⃣ Performance & Scalability

Feature          Pandas            PySpark
Execution        Single-threaded   Distributed (multi-node)
In-memory limit  Limited to RAM    Designed for TBs+
File handling    Local files only  HDFS, S3, GCS, JDBC, […]
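The APIs mirror each other closely, which is why the choice comes down to scale rather than syntax. A small sketch of the same aggregation in Pandas, with the PySpark equivalent noted in a comment (the data here is made up for illustration):

```python
import pandas as pd

# Pandas: fine while the data fits comfortably in one machine's RAM.
df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 5]})
totals = df.groupby("store", as_index=False)["sales"].sum()
print(totals)

# The PySpark equivalent (same logic, but distributed across executors):
# spark.read.parquet(path).groupBy("store").agg(F.sum("sales"))
```

Same mental model, different execution engine: Pandas runs the groupby in a single process, while Spark shuffles the data across the cluster first.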

Read More

How to handle 100GB data in PySpark: A real-world guide with commands

“If you have 100GB of data, how do you process it efficiently in PySpark?” It’s a classic interview question — but also a real challenge every data engineer faces when working with big data in production. Let’s break down the best practices step by step, using practical commands you can reuse in your projects. ✅ […]
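One of the first numbers to get right is the partition count. A widely used rule of thumb (an assumption here, not a hard rule) is to target roughly 128 MB per partition, which matches the common HDFS block size:

```python
def target_partitions(total_bytes, partition_bytes=128 * 1024**2):
    """Estimate a partition count aiming at ~128 MB per partition."""
    # Ceiling division so the last partial partition is still counted.
    return max(1, -(-total_bytes // partition_bytes))

gb = 1024**3
print(target_partitions(100 * gb))  # 800
```

For a 100GB dataset this suggests about 800 partitions, which you would apply with something like `df.repartition(800)` before a wide transformation.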

Read More