Every data pipeline makes a fundamental choice before any code is written: does it process data in scheduled chunks, or continuously process data as it arrives?
This choice—batch versus streaming—determines the structure of everything. The tools you use, the guarantees you can make about data freshness, the complexity of your error handling, and the infrastructure you need to run it all flow directly from this decision.
Getting it wrong is expensive. Teams that build streaming pipelines when batch would have sufficed end up maintaining complex infrastructure for a problem that never needed it.
Teams that build batch pipelines when their use cases demand real-time action discover this gap at the worst possible moment—when a stakeholder asks why the dashboard is six hours out of date.
In this article, you’ll learn what batch and streaming pipelines actually are, how they differ architecturally and operationally, and how to implement both paradigms in Python. By the end, you’ll have a clear framework for choosing the right approach for any data engineering problem.
Prerequisites
To follow comfortably, make sure you have:
Experience writing Python functions and working with modules
Familiarity with pandas data frames and basic data manipulation
A general understanding of what ETL pipelines do — extract, transform, load
What is a Batch Pipeline?
A batch pipeline processes a finite, bounded collection of records at once—a file, a snapshot of a database, a day’s worth of transactions. It runs on a schedule (hourly, nightly, weekly), reads all the data for that period, transforms it, and writes the result somewhere. Then it stops and waits until the next run.
The mental model is simple: Collect, then act. Nothing happens between runs.
In a retail ETL context, a typical batch pipeline might look like this:
At midnight, pull all orders placed in the last 24 hours from the transactional database.
Join with product catalog and customer dimension tables.
Calculate daily revenue totals by region and product category.
Load results into a data warehouse for reporting.
The pipeline runs, completes, and produces a full, consistent snapshot of yesterday’s business. By the time the analysts arrive in the morning, the warehouse is up to date.
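Step 2 in the outline above—joining orders with dimension tables—is worth sketching on its own, since the simplified implementation in this article skips it. This is a minimal illustration using pandas with made-up table and column names:

```python
import pandas as pd

# Hypothetical daily orders and a small product dimension table
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product_id": ["A", "B", "A"],
    "order_value_gbp": [120.0, 45.0, 80.0],
})
products = pd.DataFrame({
    "product_id": ["A", "B"],
    "category": ["electronics", "books"],
})

# A left join keeps every order even if a dimension row is missing
enriched = orders.merge(products, on="product_id", how="left")

# Revenue by product category, as in step 3 of the outline
revenue = enriched.groupby("category")["order_value_gbp"].sum()
print(revenue)
```

The `how="left"` choice matters in ETL: an inner join would silently drop orders whose product is missing from the catalog, which is usually a data-quality problem you want to see, not hide.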
Batch pipeline implementation in Python
A batch pipeline in its simplest form is a Python script with three clearly distinct steps: extract, transform, load.
import pandas as pd


def extract(filepath: str) -> pd.DataFrame:
    """Load raw orders from a daily export file."""
    df = pd.read_csv(filepath, parse_dates=["order_timestamp"])
    return df


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and aggregate orders into daily revenue by region."""
    # Filter to completed orders only
    df = df[df["status"] == "completed"].copy()

    # Extract date from timestamp for grouping
    df["order_date"] = df["order_timestamp"].dt.date

    # Aggregate: total revenue and order count per region per day
    summary = (
        df.groupby(["order_date", "region"])
        .agg(
            total_revenue=("order_value_gbp", "sum"),
            order_count=("order_id", "count"),
            avg_order_value=("order_value_gbp", "mean"),
        )
        .reset_index()
    )
    return summary


def load(df: pd.DataFrame, output_path: str) -> None:
    """Write the aggregated result to the warehouse (here, a CSV)."""
    df.to_csv(output_path, index=False)
    print(f"Loaded {len(df)} rows to {output_path}")


# Run the pipeline
raw = extract("orders_2024_06_01.csv")
aggregated = transform(raw)
load(aggregated, "warehouse/daily_revenue_2024_06_01.csv")
Let’s see what this code is doing:
extract reads the CSV file representing the daily order export. The parse_dates argument asks pandas to interpret the order_timestamp column as datetime objects rather than plain strings—this matters for the date-extraction step in the transform.
transform does two things: it filters out any orders that weren’t completed (returns, cancellations), then groups the remaining orders by date and region to produce revenue totals. The .agg() call computes three metrics per group in a single pass.
load writes the result to a destination. In production this would be a database insert or a cloud storage upload, but the pattern is the same regardless.
The three functions are deliberately kept separate. This separation—extract, transform, load—makes each step independently testable, changeable, and debuggable. If the transform logic changes, you don’t need to modify the extract or load code.
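That testability is easy to demonstrate. Here is a minimal sketch of a unit test for the transform step, using a hand-built input small enough to reason about by eye (the transform logic is repeated here so the example is self-contained):

```python
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Same filtering and aggregation logic as the pipeline's transform step."""
    df = df[df["status"] == "completed"].copy()
    df["order_date"] = df["order_timestamp"].dt.date
    return (
        df.groupby(["order_date", "region"])
        .agg(
            total_revenue=("order_value_gbp", "sum"),
            order_count=("order_id", "count"),
        )
        .reset_index()
    )


# Three orders, one cancelled, all in one region on one day
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["completed", "cancelled", "completed"],
    "order_timestamp": pd.to_datetime(
        ["2024-06-01 09:00", "2024-06-01 10:00", "2024-06-01 11:00"]
    ),
    "region": ["north", "north", "north"],
    "order_value_gbp": [100.0, 50.0, 25.0],
})

result = transform(sample)
assert len(result) == 1                         # one region, one day
assert result.loc[0, "total_revenue"] == 125.0  # cancelled order excluded
assert result.loc[0, "order_count"] == 2
print("transform tests passed")
```

Because transform takes a DataFrame and returns a DataFrame, no files or databases are involved in the test—exactly the payoff of keeping the three steps separate.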
When batch works well
Batch pipelines are the right choice when:
Data freshness requirements are measured in hours, not seconds. The daily sales report does not need to be updated every minute. The weekly marketing attribution model certainly doesn’t.
You are processing large historical datasets. Backfilling two years of transaction history into a new data warehouse is inherently a batch job—the data is already there, it’s bounded, and you want to process it as efficiently as possible in a single run.
Consistency matters more than latency. Batch pipelines produce complete, point-in-time snapshots. Every row in the output was computed from the same input state. This consistency is valuable for financial reporting, regulatory compliance, and any downstream process that requires a stable, reproducible dataset.
What is a streaming pipeline?
A streaming pipeline processes data continuously, record by record or in small micro-batches, as it arrives. The dataset has no “end” — the pipeline runs indefinitely, consuming events from sources such as a message queue, Kafka topic, or webhook, and processing each one as it arrives.
The mental model is: act as you collect. The pipeline is always running.
In the same retail ETL context, a streaming pipeline can handle order events as they are placed:
An order is placed on the website and an event is published to the message queue.
The streaming pipeline consumes the event in milliseconds.
It validates, enriches, and routes the event to downstream systems.
A fraud detection service, an inventory system, and a real-time dashboard all receive up-to-date information instantly.
The difference from batch is fundamental: the data isn’t sitting in a file waiting to be processed. It’s flowing, and the pipeline has to keep up.
Implementation of Streaming Pipeline in Python
Python’s generator functions are a natural building block for streaming pipelines. A generator produces one value at a time and pauses between outputs – which maps directly onto the idea of processing records as they arrive without loading everything into memory.
import json
import time
from typing import Generator, Dict


def event_source(filepath: str) -> Generator[Dict, None, None]:
    """
    Simulate a stream of order events from a file.
    In production, this would consume from Kafka or a message queue.
    """
    with open(filepath, "r") as f:
        for line in f:
            event = json.loads(line.strip())
            yield event
            time.sleep(0.01)  # simulate arrival delay between events


def validate(event: Dict) -> bool:
    """Check that the event has the required fields and valid values."""
    required_fields = ("order_id", "customer_id", "order_value_gbp", "region")
    if not all(field in event for field in required_fields):
        return False
    if event["order_value_gbp"] <= 0:
        return False
    return True


def enrich(event: Dict) -> Dict:
    """Add derived fields to the event before routing downstream."""
    event["processed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    event["value_tier"] = (
        "high" if event["order_value_gbp"] >= 500
        else "mid" if event["order_value_gbp"] >= 100
        else "low"
    )
    return event


def run_streaming_pipeline(source_file: str) -> None:
    """Process each event as it arrives from the source."""
    processed = 0
    skipped = 0
    for raw_event in event_source(source_file):
        if not validate(raw_event):
            skipped += 1
            continue
        enriched_event = enrich(raw_event)
        # In production: publish to a downstream topic or write to a sink
        print(f"[{enriched_event['processed_at']}] "
              f"Order {enriched_event['order_id']} | "
              f"£{enriched_event['order_value_gbp']:.2f} | "
              f"tier={enriched_event['value_tier']}")
        processed += 1
    print(f"\nDone. Processed: {processed} | Skipped: {skipped}")


run_streaming_pipeline("order_events.jsonl")
Here’s what’s going on:
event_source is a generator function—note the yield keyword instead of return. Each yield pauses the function and hands one event to the caller; the pipeline processes that event before the generator resumes and produces the next one. This means only one event is in memory at a time, no matter how large the stream is. The time.sleep(0.01) simulates the real-world delay between events arriving from a message queue.
validate checks for required fields and valid values before anything else happens to the event. In a streaming context, bad events are common—network problems, upstream bugs, and schema changes all produce malformed records. It is safer to validate up front and skip bad events than to propagate them to downstream systems.
enrich adds derived fields to the event: here, a processing timestamp and a value tier. In production this step might also join against a lookup table, call an external API, or apply model predictions.
run_streaming_pipeline ties it together. The for loop consumes events from event_source one at a time, runs each through validate and enrich, routes valid events onward, and keeps a running count of processed and skipped events.
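In production you would usually do more with a skipped event than count it: the common pattern is to route it to a dead-letter sink for later inspection. Here is a minimal sketch of that idea, using a local JSONL file as the sink (the file path and field names are illustrative, not part of the pipeline above):

```python
import json
import os
import tempfile


def route_invalid(event: dict, dead_letter_path: str, reason: str) -> None:
    """Append the bad event, with the failure reason, to a dead-letter file."""
    record = {"reason": reason, "event": event}
    with open(dead_letter_path, "a") as f:
        f.write(json.dumps(record) + "\n")


# Demo: a bad event (non-positive value) is captured instead of silently dropped
dlq_path = os.path.join(tempfile.gettempdir(), "dead_letter.jsonl")
open(dlq_path, "w").close()  # start with an empty dead-letter file

bad_event = {"order_id": "o-42", "order_value_gbp": -5}
route_invalid(bad_event, dlq_path, reason="non-positive order value")

# Later, an engineer can replay or inspect the captured events
with open(dlq_path) as f:
    records = [json.loads(line) for line in f]
print(records)
```

In a Kafka-based system the sink would be a dedicated dead-letter topic rather than a file, but the principle is the same: never lose the bad record, and always record why it failed.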
When streaming works well
Streaming pipelines are the right choice when:
Data freshness is measured in seconds or milliseconds. Fraud detection, real-time inventory updates, live dashboards, and alerting systems all require immediate data processing—an hourly batch job would render them useless.
The data volume is too large to accumulate before processing. High-frequency IoT sensor data, clickstream events, and financial tick data can generate millions of records per hour. Collecting them all before processing is often impractical: you’d need enormous storage, and the processing would take too long to be useful.
You need to react, not just report. Streaming pipelines can trigger downstream actions in response to individual events: send a notification, block a transaction, update a recommendation. Batch pipelines can only report what has already happened.
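That reactive quality is easy to sketch. The example below triggers an alert for any high-value order as it streams through; the £500 threshold and the in-memory alert sink are illustrative stand-ins for whatever your alerting system actually is:

```python
from typing import Callable, Dict, List


def react(event: Dict, alert: Callable[[str], None]) -> None:
    """Trigger an immediate downstream action for qualifying events."""
    if event.get("order_value_gbp", 0) >= 500:
        alert(f"High-value order {event['order_id']}: "
              f"£{event['order_value_gbp']:.2f}")


# Demo with an in-memory alert sink (production: email, Slack, a pager)
alerts: List[str] = []
for event in [
    {"order_id": "a1", "order_value_gbp": 120.0},
    {"order_id": "a2", "order_value_gbp": 750.0},
]:
    react(event, alerts.append)

print(alerts)
```

The key point is timing: the action fires within milliseconds of the event arriving. A nightly batch job could produce the same alert, but twelve hours too late to matter.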
Key differences at a glance
Here’s an overview of the differences between batch and stream processing that we’ve discussed so far:
| Dimension | Batch | Streaming |
|---|---|---|
| Data model | Bounded, finite dataset | Unbounded, continuous flow |
| Processing trigger | Schedule (time or event) | Each record arrival |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | High (optimized for bulk processing) | Lower (per-record overhead) |
| Complexity | Lower | Higher |
| State management | Stateless per run | Often stateful |
| Error handling | Retry the entire run | Per-event dead-letter queues |
| Consistency | Strong (point-in-time snapshot) | Eventually consistent |
| Best for | Reporting, ML training, backfills | Alerting, real-time features, event routing |
Choosing between batch and streaming
All of this is useful context, but how do you actually choose between batch and stream processing? The decision comes down to three questions:
How fresh does the data need to be? If stakeholders can tolerate results that are hours old, batch is simpler and more cost-effective. If they need results in seconds, streaming is unavoidable.
How complex is your processing logic? Batch jobs can ingest large data sets, perform expensive aggregations, and apply complex business logic without worrying about latency. Streaming pipelines must process each event quickly, which limits how much you can do per record.
What is your operational capacity? Streaming infrastructure—Kafka clusters, Flink or Spark Streaming jobs, dead-letter queues, exactly-once delivery guarantees—is significantly more complex to operate than a scheduled Python script. If your team is small or your use case doesn’t demand real-time results, that complexity is cost without benefit.
Start with batch. It is easier to build, easier to test, easier to debug, and easier to maintain. Move to streaming when a specific, concrete need—not a hypothetical future one—makes batch insufficient. Most data problems are batch problems, and the ones that genuinely require streaming are usually obvious when you run into them.
And as you might have guessed, some data processing systems need to combine both. This is why hybrid approaches exist.
Hybrid Patterns: Lambda and Kappa Architectures
In practice, many production data systems use both models together. The two most common hybrid designs are the Lambda and Kappa architectures.
Lambda architecture runs a batch layer and a streaming layer in parallel. The batch layer processes the complete historical data and produces accurate, consistent results with some delay. The streaming layer processes live data and produces approximate results instantly. Downstream consumers merge both outputs—the streaming result for freshness, the batch result for accuracy.
The trade-off is operational complexity: you’re maintaining two separate processing codebases that produce essentially equivalent results.
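The merge step in a Lambda serving layer can be sketched in a few lines. This is a simplified illustration, assuming both layers produce per-day revenue keyed by date; real serving layers are more involved, but the precedence rule is the core idea:

```python
from typing import Dict


def merge_views(batch_view: Dict[str, float],
                speed_view: Dict[str, float]) -> Dict[str, float]:
    """Serving-layer merge: accurate batch results win where they exist;
    the streaming (speed) layer fills in keys the batch hasn't covered yet."""
    merged = dict(speed_view)   # start with fresh but approximate numbers
    merged.update(batch_view)   # overwrite with authoritative batch numbers
    return merged


# The batch layer has processed through yesterday; the speed layer covers
# both yesterday (approximately) and today (which batch hasn't reached)
batch_view = {"2024-06-01": 10_000.0}
speed_view = {"2024-06-01": 9_800.0, "2024-06-02": 1_250.0}
print(merge_views(batch_view, speed_view))
```

Yesterday’s figure comes from the batch layer (authoritative), while today’s comes from the speed layer (fresh). When tonight’s batch run completes, its result for today will supersede the approximate streaming figure.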
Kappa architecture simplifies this by using a single streaming layer, with the ability to replay historical data through the same pipeline whenever you need batch-style reprocessing. It works well when your streaming framework—such as Apache Kafka with Apache Flink—supports log retention and replay. You get one codebase, one set of logic, and the ability to reprocess history when your pipeline changes.
Neither architecture is universally better. Lambda is more common in organizations that adopted batch processing first and gradually added streaming; Kappa is more common in systems designed with streaming as the primary pattern.
Wrapping up
Batch and streaming are tools with different tradeoffs, each suited to a different class of problems. Batch pipelines excel in consistency, simplicity, and bulk throughput. Streaming pipelines excel at latency, reactivity, and continuous processing.
Understanding both patterns at an architectural level—before reaching for specific frameworks like Apache Spark, Kafka, or Flink—equips you to make the right choice and to explain that choice clearly. Frameworks implement these patterns; deciding which pattern fits your problem is up to you.