👉 Use staging tables and commit data only after full batch success. For Spark streaming, rely on checkpointing. Idempotent operations like MERGE ensure no duplicates.
👉 Idempotency means re-running the pipeline multiple times produces the same result; it matters for avoiding duplicates after a failure/restart. Implemented using UPSERT/MERGE, unique keys, or partition overwrite.
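The idempotency property can be shown with a minimal plain-Python sketch: a MERGE-style upsert keyed on a unique id, where re-running the same batch leaves the target unchanged (the dict stands in for the target table; `id` is an assumed unique key).

```python
def upsert_batch(target, batch, key="id"):
    """MERGE-like behavior: insert new keys, overwrite existing ones."""
    for row in batch:
        target[row[key]] = row
    return target

target = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

upsert_batch(target, batch)   # first run
upsert_batch(target, batch)   # re-run after a simulated failure/restart
assert len(target) == 2       # no duplicates: same result as a single run
```

An append-only insert, by contrast, would leave 4 rows after the re-run, which is exactly the duplicate problem MERGE avoids.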
👉 Track batch metadata (batch ID, processed files, row counts). Use retries for failed jobs. Store unprocessed files in a “pending” bucket and rerun only failed files.
👉 Spark Structured Streaming checkpoints save offsets and state to HDFS/GCS/S3. If the pipeline fails, Spark resumes from the last checkpoint → avoids reprocessing and data loss.
At-least-once: Messages may be processed multiple times (duplicates possible).
Exactly-once: Messages processed only once (requires idempotent sinks like BigQuery/Snowflake with MERGE).
👉 Use deduplication strategies like primary keys, hash columns, or window functions (ROW_NUMBER() OVER(PARTITION BY key ORDER BY timestamp)).
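The ROW_NUMBER()-over-partition pattern (keep the latest record per key, ordered by timestamp) can be sketched in plain Python; the column names `key` and `timestamp` are assumptions for illustration.

```python
def dedupe_latest(rows, key="key", ts="timestamp"):
    """Keep one row per key: the one with the greatest timestamp
    (equivalent to ROW_NUMBER() OVER(PARTITION BY key ORDER BY ts DESC) = 1)."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"key": "a", "timestamp": 1, "v": "old"},
    {"key": "a", "timestamp": 2, "v": "new"},
    {"key": "b", "timestamp": 1, "v": "only"},
]
deduped = dedupe_latest(rows)
assert len(deduped) == 2  # one row per key survives
```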
👉 Options:
PySpark: Use the reader option mode="DROPMALFORMED" or badRecordsPath.
Airflow DAGs: Validate schema before ingestion.
DLQ (Dead Letter Queue): Push bad records to a separate bucket/table.
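The DLQ option above can be sketched in plain Python: records failing a schema check are routed to a dead-letter list instead of failing the batch (the required-field check stands in for real schema validation; field names are assumptions).

```python
def route_records(records, required_fields=("id", "amount")):
    """Split a batch into (good, dead_letter): records missing a
    required field go to the DLQ instead of blocking the pipeline."""
    good, dead_letter = [], []
    for record in records:
        if all(f in record for f in required_fields):
            good.append(record)
        else:
            dead_letter.append(record)
    return good, dead_letter

batch = [{"id": 1, "amount": 10}, {"id": 2}]  # second record is malformed
good, dlq = route_records(batch)
assert len(good) == 1 and len(dlq) == 1
```

In production the `dead_letter` list would be written to a separate bucket/table (or a Kafka/Pub/Sub dead-letter topic) for later inspection and replay.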
👉 Airflow allows retries with exponential backoff. If a task fails, it can retry 3-5 times before being marked as failed, reducing the impact of transient errors.
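Airflow configures this declaratively via task arguments (`retries`, `retry_delay`, `retry_exponential_backoff`); the underlying retry loop looks roughly like this plain-Python sketch (function and parameter names are illustrative, not Airflow internals):

```python
import time

def retry_with_backoff(fn, retries=4, base_delay=1.0):
    """Call fn; on failure wait base_delay * 2**attempt and retry,
    re-raising only after the final attempt fails."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

calls = {"n": 0}
def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

assert retry_with_backoff(flaky_task, base_delay=0.0) == "ok"  # succeeds on 3rd try
```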
👉 Use schema registry or allow schema evolution in Spark (mergeSchema option). In Snowflake, handle via ALTER TABLE ADD COLUMN. Store metadata in a data catalog (Glue/BigLake).
👉 Use watermarking in Spark (withWatermark) or event-time windows. In batch, rerun with partition overwrite (e.g., BigQuery MERGE).
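Spark's withWatermark semantics (drop events older than the max observed event time minus the allowed delay) can be modeled with a small stateful sketch; this is a simplified illustration of the concept, not Spark's actual implementation.

```python
class Watermark:
    """Simplified event-time watermark: accept an event only if it is no
    older than (max event time seen so far - allowed lateness)."""

    def __init__(self, delay):
        self.delay = delay
        self.max_seen = float("-inf")

    def accept(self, event_time):
        self.max_seen = max(self.max_seen, event_time)
        return event_time >= self.max_seen - self.delay

wm = Watermark(delay=10)
assert wm.accept(100) is True   # advances the watermark to 100 - 10 = 90
assert wm.accept(95) is True    # late, but within the 10-unit tolerance
assert wm.accept(80) is False   # too late: older than the watermark, dropped
```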
👉 Use task-level retries. Airflow can rerun only the failed tasks instead of the whole DAG by clearing them (Clear in the UI, or the airflow tasks clear CLI command with a task regex).
👉 By enabling checkpointing, using Parquet/ORC (splittable formats), and leveraging Spark speculative execution to retry slow or failing tasks.
👉 Use file compaction:
Ingest small files → batch them into larger Parquet files (~128 MB each).
Use Spark repartitioning (coalesce, repartition).
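The compaction planning step above can be sketched in plain Python: group many small files into batches that each stay near the ~128 MB target (sizes in MB here for readability; the function name is illustrative).

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches whose total size stays
    at or under target_mb; each batch becomes one larger Parquet file."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)        # flush the full batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 10 small 16 MB files → two output files (8 x 16 MB = 128 MB, then 2 x 16 MB)
plan = plan_compaction([16] * 10)
assert [len(b) for b in plan] == [8, 2]
```

In Spark the equivalent is reading the small files and writing them back with `coalesce(n)`/`repartition(n)` so each output partition lands near the target size.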
👉
Airflow: Email/SMS/Slack alerts.
GCP: Cloud Monitoring + Error Reporting.
AWS: CloudWatch alarms on Lambda/EMR/S3 events.
👉 Load data into staging tables first, validate, then MERGE INTO final table. If batch fails → staging table can be dropped/reloaded safely.
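The staging pattern can be sketched in plain Python: the batch lands in a staging area, validation runs against staging only, and the final table is touched only if the whole batch passes (dicts/lists stand in for tables; `id` is an assumed key).

```python
def load_via_staging(final, batch, validate):
    """Stage → validate → MERGE. A failed validation discards staging
    and leaves the final table untouched."""
    staging = list(batch)                        # 1) load into staging table
    if not all(validate(row) for row in staging):
        return final                             # 2) batch fails → drop staging safely
    for row in staging:                          # 3) MERGE INTO final
        final[row["id"]] = row
    return final

is_valid = lambda row: row["amount"] >= 0

final = {}
load_via_staging(final, [{"id": 1, "amount": -5}], is_valid)
assert final == {}                               # bad batch: final untouched
load_via_staging(final, [{"id": 1, "amount": 5}], is_valid)
assert final[1]["amount"] == 5                   # good batch: merged
```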
👉 A DLQ stores invalid or unprocessed messages/events instead of blocking the pipeline. Used in Kafka, Pub/Sub, Dataflow, etc.
👉 Maintain metadata-driven pipelines:
Track processed file names (in DB or metadata table).
Check before ingestion → skip already processed.
Use unique batch IDs.
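The metadata-driven check above reduces to a few lines of plain Python: consult the processed-files record before loading, and record the file only after a successful load (the set stands in for a metadata table; names are illustrative).

```python
def ingest_file(file_name, processed, load):
    """Skip files already recorded in the metadata store; otherwise
    load the file and record it so a rerun will not ingest it twice."""
    if file_name in processed:
        return False          # already processed → skip
    load(file_name)
    processed.add(file_name)  # record only after a successful load
    return True

loaded = []
processed = set()
ingest_file("a.csv", processed, loaded.append)
ingest_file("a.csv", processed, loaded.append)  # rerun after restart
assert loaded == ["a.csv"]                      # file loaded exactly once
```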
👉 BigQuery load jobs are atomic → either the full load succeeds or nothing is written. No partial data.
👉 By tuning:
Kafka → consumer lag monitoring.
Spark → increasing partitions, enabling backpressure (spark.streaming.backpressure.enabled=true in the DStream API; in Structured Streaming, cap intake with source options like maxOffsetsPerTrigger for Kafka).
👉 Store processed data in a cloud bucket/staging layer temporarily. Once DB is back online, replay/reload only pending batches.
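The stage-and-replay pattern can be sketched in plain Python: writes that fail while the database is down are buffered (the staging layer), then replayed in order once it is back; class and method names are illustrative.

```python
class BufferedWriter:
    """Stage batches when the target DB is unavailable; replay later."""

    def __init__(self, db_write):
        self.db_write = db_write  # callable that raises ConnectionError while the DB is down
        self.pending = []         # stands in for the cloud bucket / staging layer

    def write(self, batch):
        try:
            self.db_write(batch)
        except ConnectionError:
            self.pending.append(batch)      # stage for later replay

    def replay(self):
        while self.pending:
            self.db_write(self.pending[0])  # raises if still down; batch stays pending
            self.pending.pop(0)             # remove only after a successful write
```

Removing a batch from `pending` only after the write succeeds keeps the replay itself safe to rerun, which ties back to the idempotency points above.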