Running a Spark Batch Job on Google Cloud Dataproc

As data engineers, one of the most powerful capabilities we rely on is running batch Spark jobs on cloud clusters. Google Cloud Dataproc makes this seamless by letting us submit jobs directly to a managed Spark cluster. Here’s how I recently submitted a batch Spark job […]
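The excerpt cuts off before the actual submission step, but here is a minimal sketch of one way to submit a PySpark batch job to an existing Dataproc cluster from Python, using the google-cloud-dataproc client. The project ID, region, cluster name, and GCS script path are placeholders for illustration, not values from the post.

# Sketch: submit a PySpark batch job to an existing Dataproc cluster.
# Project ID, region, cluster name, and the GCS script URI are hypothetical.
from google.cloud import dataproc_v1

project_id = "my-project"            # assumption: your GCP project
region = "us-central1"               # assumption: the cluster's region
cluster_name = "spark-batch-cluster" # assumption: an existing cluster

# The JobControllerClient must point at the regional Dataproc endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/batch_job.py"},
}

# submit_job returns the job resource immediately; poll it (or use
# submit_job_as_operation) if you want to block until completion.
result = client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
print(f"Submitted job {result.reference.job_id}")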
Problem Statement

Truecaller deals with millions of user settings change events daily. Each event looks like this:

id (long)
name (string)
value (string)
timestamp (long)

The goal: Group events by id. Convert (name, value) pairs into a Map. Always pick the value for each key that has the latest timestamp. Output a partitioned table for faster downstream queries.

Example: id name […]
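The example table is truncated above, but here is a minimal PySpark sketch of the aggregation described: keep the latest value per (id, name), collect the pairs into a map, and write a partitioned table. The DataFrame name, output path, millisecond-epoch timestamps, and the choice of a date partition column are assumptions for illustration.

# Sketch of the described aggregation (names, paths, units are assumptions).
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# events: DataFrame with columns id, name, value, timestamp
w = Window.partitionBy("id", "name").orderBy(F.col("timestamp").desc())

latest = (
    events
    .withColumn("rn", F.row_number().over(w))  # rank changes per (id, name)
    .filter(F.col("rn") == 1)                  # keep only the latest value
    .drop("rn")
)

settings = (
    latest
    .groupBy("id")
    .agg(
        F.map_from_entries(F.collect_list(F.struct("name", "value"))).alias("settings"),
        F.max("timestamp").alias("last_updated"),
    )
)

# Partition column is an assumption; the post only says "partitioned table".
# from_unixtime expects seconds, so this assumes epoch-millisecond timestamps.
(settings
 .withColumn("dt", F.to_date(F.from_unixtime(F.col("last_updated") / 1000)))
 .write.mode("overwrite")
 .partitionBy("dt")
 .parquet("gs://example-bucket/user_settings/"))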
Introduction

Retail today is not just about selling products – it’s about instant insights. Customers expect personalized offers, faster checkouts, and always-available inventory. For that, retailers need real-time data processing. In this tutorial, we’ll build a real-time data streaming pipeline for a retail company using Google Cloud Pub/Sub.

Use Case

A retail chain with 500+ stores wants […]
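The tutorial is cut off here, but a core building block of any Pub/Sub pipeline is publishing store events to a topic. Below is a minimal sketch using the google-cloud-pubsub Python client; the project ID, topic name, and event fields are illustrative assumptions, not details from the tutorial.

# Sketch: publish a store transaction event to a Pub/Sub topic.
# Project, topic, and event fields are hypothetical placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-retail-project", "store-transactions")

event = {
    "store_id": "store-0421",
    "sku": "SKU-9981",
    "quantity": 2,
    "amount": 39.98,
    "event_time": "2025-06-16T10:15:00Z",
}

# Pub/Sub messages are raw bytes, so JSON-encode the event payload.
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print(f"Published message ID: {future.result()}")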
Catalyst Optimizer in Spark 3.0: Types, Use Cases, and Real-World Commands

If you’re working with big data, speed matters. Spark 3.0 introduced major internal improvements, none more powerful than the Catalyst Optimizer and Adaptive Query Execution (AQE).

🧠 What is the Catalyst Optimizer?

It’s Spark’s rule-based and cost-based optimizer. Every time you write PySpark code, it: Converts it to […]
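The list of steps is truncated, but the "real-world commands" around Catalyst and AQE usually amount to toggling the AQE settings and inspecting the plans Catalyst produces. A small sketch of that, assuming an existing SparkSession named spark and a DataFrame df:

# Sketch: enable AQE and inspect the plan Catalyst produces.
# Assumes an existing SparkSession `spark` and DataFrame `df`.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan that Catalyst/AQE finally execute.
df.groupBy("country").count().explain(True)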
PySpark vs Pandas – When to Use What for Big Data

“Should I use Pandas or PySpark for my data processing?” Let’s break this down across key dimensions with real examples:

1️⃣ Performance & Scalability

Feature | Pandas | PySpark
Execution | Single-threaded | Distributed (multi-node)
In-memory limit | Limited to RAM | Designed for TBs+
File handling | Local files only | HDFS, S3, GCS, JDBC, […]
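The comparison table is cut off above, but the practical difference shows up immediately in how each library loads data. A minimal side-by-side sketch; the file paths and SparkSession setup are illustrative, not taken from the post.

# Sketch: the same CSV load in Pandas vs PySpark (paths are placeholders).
import pandas as pd
from pyspark.sql import SparkSession

# Pandas: single machine, whole file pulled into local RAM.
pdf = pd.read_csv("sales.csv")

# PySpark: distributed read; the path can be local, HDFS, S3, or GCS.
spark = SparkSession.builder.appName("pandas-vs-pyspark").getOrCreate()
sdf = spark.read.option("header", "true").csv("gs://example-bucket/sales/*.csv")

print(len(pdf), sdf.count())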
“If you have 100GB of data, how do you process it efficiently in PySpark?” It’s a classic interview question, but it’s also a real challenge every data engineer faces when working with big data in production. Let’s break down the best practices step by step, using practical commands you can reuse in your projects. ✅ […]
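The checklist itself is truncated, but answers to this question typically center on columnar formats, sensible partitioning, and broadcast joins. A short sketch of those levers, under the assumption that the data lives in Parquet on cloud storage; paths, partition counts, and table names are illustrative.

# Sketch of common levers for processing ~100GB efficiently in PySpark.
# File paths, partition counts, and table names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("100gb-batch").getOrCreate()

# 1. Prefer a columnar, splittable format over raw CSV.
events = spark.read.parquet("gs://example-bucket/events/")

# 2. Size shuffle partitions to the data volume and cluster cores.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# 3. Broadcast the small side of a join to avoid shuffling the large side.
countries = spark.read.parquet("gs://example-bucket/dim_country/")
joined = events.join(F.broadcast(countries), "country_code")

# 4. Write partitioned output so downstream queries can prune files.
joined.write.mode("overwrite").partitionBy("event_date").parquet(
    "gs://example-bucket/events_enriched/"
)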
Source: Real-time CSV files with stock prices
Target: JSON-format files consumable by Data Analysts
Goal: Automate the transformation and cataloging for query-ready analytics

🛠️ Step-by-Step Pipeline with AWS Services

🔹 1. CSV Files Drop into S3

Incoming files: stock data like stock_data_2025-06-16.csv
S3 Source Bucket: s3://reedx-stock-raw/
These files were pushed by upstream providers or batched ingestion […]
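The remaining steps are truncated, but the core transformation (CSV in S3 in, JSON in S3 out) can be sketched as a small PySpark job. Only the source bucket s3://reedx-stock-raw/ comes from the post; the target bucket, schema options, and the s3:// path scheme (as used on EMR or Glue) are assumptions.

# Sketch: convert incoming stock CSVs to JSON for analysts.
# Source bucket is from the post; the target bucket name is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stock-csv-to-json").getOrCreate()

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://reedx-stock-raw/stock_data_*.csv")
)

# Append each load as JSON so the dataset stays query-ready for analysts.
(raw.write
 .mode("append")
 .json("s3://reedx-stock-processed/stocks_json/"))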
✅ 15 PySpark Interview Q&As for Data Engineers:

# Example: define and apply a UDF that upper-cases a string column.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def convert_upper(text):
    return text.upper()

upper_udf = udf(convert_upper, StringType())
df = df.withColumn("upper_name", upper_udf(df["name"]))
🔹 Looking to replicate your on-prem Oracle database to Google Cloud with real-time changes? Oracle GoldenGate (OGG) provides a seamless solution for heterogeneous replication with minimal latency. 👉 In this post, I’ll walk you through a step-by-step process to configure Oracle GoldenGate replication from an on-premises Oracle database to Google Cloud (Cloud SQL / Bare […]
Essential PostgreSQL Queries Every Data Engineer Should Know 🚀

As a Data Engineer, mastering PostgreSQL queries can help you optimize database performance and troubleshoot issues efficiently. Here are some essential queries to keep in your toolkit! 🛠️

1️⃣ Check Tablespace Size

Monitor the disk space used by your tablespaces:

SELECT spcname AS tablespace,
       pg_size_pretty(pg_tablespace_size(spcname))
FROM pg_tablespace; […]