Batch vs streaming pipelines explained.

by SkillAiNest

Every data pipeline makes a fundamental choice before any code is written: does it process data in scheduled chunks, or continuously process data as it arrives?

This choice, batch versus streaming, determines the structure of everything. The tools you use, the guarantees you can make about data freshness, the complexity of your error handling, and the infrastructure you need to run it all flow directly from this decision.

Getting it wrong is expensive. Teams that build streaming pipelines when batch would have sufficed end up maintaining complex infrastructure for a problem that never needed it.

Teams that build batch pipelines when their use cases demand real-time action discover the gap at the worst possible moment: when a stakeholder asks why the dashboard is six hours out of date.

In this article, you’ll learn what batch and streaming pipelines actually are, how they differ architecturally and operationally, and how to implement both paradigms in Python. By the end, you’ll have a clear framework for choosing the right approach to any data engineering problem.

Prerequisites

To follow comfortably, make sure you have:

  • Experience writing Python functions and working with modules

  • Familiarity with pandas data frames and basic data manipulation

  • A general understanding of what ETL pipelines do — extract, transform, load


What is a Batch Pipeline?

A batch pipeline processes a finite, bounded collection of records at once: a file, a database snapshot, a day’s worth of transactions. It runs on a schedule (hourly, nightly, weekly), reads all the data for that period, transforms it, and writes the result somewhere. Then it stops and waits until the next run.

The mental model is simple: collect, then act. Nothing happens between runs.

In a retail ETL context, a typical batch pipeline might look like this:

  1. At midnight, pull all orders placed in the last 24 hours from the transactional database.

  2. Join with product catalog and customer dimension tables.

  3. Calculate daily revenue totals by region and product category.

  4. Load results into a data warehouse for reporting.

The pipeline runs, terminates, and leaves behind a complete, consistent snapshot of the previous day’s business. By the time analysts arrive in the morning, the warehouse is up to date.

Batch pipeline implementation in Python

A batch pipeline in its simplest form is a Python script with three clearly distinct steps: extract, transform, load.

import pandas as pd
from datetime import datetime, timedelta

def extract(filepath: str) -> pd.DataFrame:
    """Load raw orders from a daily export file."""
    df = pd.read_csv(filepath, parse_dates=["order_timestamp"])
    return df

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and aggregate orders into daily revenue by region."""
    # Filter to completed orders only
    df = df[df["status"] == "completed"].copy()

    # Extract date from timestamp for grouping
    df["order_date"] = df["order_timestamp"].dt.date

    # Aggregate: total revenue and order count per region per day
    summary = (
        df.groupby(["order_date", "region"])
        .agg(
            total_revenue=("order_value_gbp", "sum"),
            order_count=("order_id", "count"),
            avg_order_value=("order_value_gbp", "mean"),
        )
        .reset_index()
    )
    return summary

def load(df: pd.DataFrame, output_path: str) -> None:
    """Write the aggregated result to the warehouse (here, a CSV)."""
    df.to_csv(output_path, index=False)
    print(f"Loaded {len(df)} rows to {output_path}")

# Run the pipeline
raw = extract("orders_2024_06_01.csv")
aggregated = transform(raw)
load(aggregated, "warehouse/daily_revenue_2024_06_01.csv")

Let’s see what this code is doing:

  • extract reads the CSV file representing the daily order export. The parse_dates argument tells pandas to interpret the order_timestamp column as datetimes rather than plain strings, which matters for the date extraction step in the transform.

  • transform does two things: it filters out any orders that weren’t completed (returns, cancellations), then groups the remaining orders by date and region to produce revenue totals. The .agg() call computes three metrics per group in a single pass.

  • load writes the result to a destination. In production this would be a database insert or a cloud storage upload, but the pattern is the same regardless.

The three functions are deliberately kept separate. This separation—extract, transform, load—makes each step independently testable, changeable, and debuggable. If the transform logic changes, you don’t need to modify the extract or load code.
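Because transform takes a DataFrame in and returns a DataFrame out, it can be tested with a hand-built input and no files or databases. Here is a minimal sketch of such a test; the transform body is inlined from the pipeline above so the example is self-contained:

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the pipeline's transform step, inlined for a self-contained test
    df = df[df["status"] == "completed"].copy()
    df["order_date"] = df["order_timestamp"].dt.date
    return (
        df.groupby(["order_date", "region"])
        .agg(
            total_revenue=("order_value_gbp", "sum"),
            order_count=("order_id", "count"),
            avg_order_value=("order_value_gbp", "mean"),
        )
        .reset_index()
    )

# Hand-built input: two completed orders in one region, one cancelled order
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_timestamp": pd.to_datetime([
        "2024-06-01 09:00", "2024-06-01 10:00", "2024-06-01 11:00",
    ]),
    "status": ["completed", "completed", "cancelled"],
    "region": ["north", "north", "south"],
    "order_value_gbp": [100.0, 50.0, 999.0],
})

result = transform(raw)

# The cancelled order is dropped, leaving a single (date, region) group
assert len(result) == 1
assert result.loc[0, "total_revenue"] == 150.0
assert result.loc[0, "order_count"] == 2
assert result.loc[0, "avg_order_value"] == 75.0
print("transform tests passed")
```

In a real project these assertions would live in a pytest test module, but the principle is the same: the transform step can be exercised in isolation, with no I/O.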

When batch works well

Batch pipelines are the right choice when:

  • Data freshness requirements are measured in hours, not seconds. The daily sales report doesn’t need to be updated every minute. The weekly marketing attribution model certainly doesn’t.

  • You are processing large historical datasets. Backfilling two years of transaction history into a new data warehouse is inherently a batch job: the data is already there, it’s bounded, and you want to process it as efficiently as possible in a single run.

  • Consistency matters more than latency. Batch pipelines produce complete, point-in-time snapshots: every row in the output was computed from the same input state. This consistency is valuable for financial reporting, regulatory compliance, and any downstream process that requires a stable, reproducible dataset.

What is a streaming pipeline?

A streaming pipeline processes data continuously, record by record or in small micro-batches, as it arrives. The dataset has no “end” — the pipeline runs indefinitely, consuming events from sources such as a message queue, Kafka topic, or webhook, and processing each one as it arrives.

The mental model is: act as you collect. The pipeline is always running.

In the same retail ETL context, a streaming pipeline handles order events as they are placed:

  1. An order is placed on the website and an event is published to the message queue.

  2. The streaming pipeline consumes the event in milliseconds.

  3. It validates, enriches, and routes the event to downstream systems.

  4. A fraud detection service, the inventory system, and a real-time dashboard all receive up-to-date information within moments.

The difference from batch is fundamental: the data isn’t sitting in a file waiting to be processed. It’s flowing, and the pipeline has to keep up.

Streaming pipeline implementation in Python

Python’s generator functions are a natural building block for streaming pipelines. A generator produces one value at a time and pauses between outputs, which maps directly onto the idea of processing records as they arrive without loading everything into memory.

import json
import time
from typing import Generator, Dict

def event_source(filepath: str) -> Generator[Dict, None, None]:
    """
    Simulate a stream of order events from a file.
    In production, this would consume from Kafka or a message queue.
    """
    with open(filepath, "r") as f:
        for line in f:
            event = json.loads(line.strip())
            yield event
            time.sleep(0.01)  # simulate arrival delay between events

def validate(event: Dict) -> bool:
    """Check that the event has the required fields and valid values."""
    required_fields = ("order_id", "customer_id", "order_value_gbp", "region")
    if not all(field in event for field in required_fields):
        return False
    if event["order_value_gbp"] <= 0:
        return False
    return True

def enrich(event: Dict) -> Dict:
    """Add derived fields to the event before routing downstream."""
    event["processed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    event["value_tier"] = (
        "high" if event["order_value_gbp"] >= 500
        else "mid" if event["order_value_gbp"] >= 100
        else "low"
    )
    return event

def run_streaming_pipeline(source_file: str) -> None:
    """Process each event as it arrives from the source."""
    processed = 0
    skipped = 0

    for raw_event in event_source(source_file):
        if not validate(raw_event):
            skipped += 1
            continue

        enriched_event = enrich(raw_event)

        # In production: publish to downstream topic or write to sink
        print(f"[{enriched_event['processed_at']}] "
              f"Order {enriched_event['order_id']} | "
              f"£{enriched_event['order_value_gbp']:.2f} | "
              f"tier={enriched_event['value_tier']}")
        processed += 1

    print(f"\nDone. Processed: {processed} | Skipped: {skipped}")

run_streaming_pipeline("order_events.jsonl")

Here’s what’s going on:

  • event_source is a generator function; note the yield keyword in place of return. Each yield event pauses the function and hands one event to the caller. The pipeline processes that event before the generator resumes and produces the next one, so only one event is in memory at a time, regardless of how large the stream is. The time.sleep(0.01) simulates the real-world delay between events arriving from a message queue.

  • validate checks for required fields and valid values before anything else touches the event. In a streaming context, malformed events are common: network problems, upstream bugs, and schema changes all produce bad records. It is safer to validate early and skip bad events than to propagate them to downstream systems.

  • enrich adds derived fields to the event; here, a processing timestamp and a value tier. In production, this step might also join against a lookup table, call an external API, or apply model predictions.

  • run_streaming_pipeline ties it all together. The for loop consumes events from event_source one at a time, passes each through validate → enrich → route, and maintains running counts of processed and skipped events.
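The validate-and-skip pattern can be extended with a dead-letter sink, so bad events are kept for inspection rather than silently dropped. Below is a minimal sketch; the in-memory lists and the route_event name are illustrative stand-ins for what would be Kafka topics or queues in production:

```python
from typing import Dict, List

def route_event(event: Dict, good_sink: List[Dict], dead_letter: List[Dict]) -> None:
    """Route a valid event downstream, or capture it for later inspection.
    In production the sinks would be topics or queues; lists stand in here."""
    required = ("order_id", "customer_id", "order_value_gbp", "region")
    if all(f in event for f in required) and event["order_value_gbp"] > 0:
        good_sink.append(event)
    else:
        # Keep the raw event plus a reason, so upstream bugs can be diagnosed later
        dead_letter.append({"event": event, "reason": "failed validation"})

good: List[Dict] = []
dead: List[Dict] = []
route_event({"order_id": 1, "customer_id": 7, "order_value_gbp": 25.0, "region": "north"},
             good, dead)
route_event({"order_id": 2, "order_value_gbp": -5.0}, good, dead)  # missing fields, bad value
print(f"routed={len(good)} dead_lettered={len(dead)}")  # routed=1 dead_lettered=1
```

The key point is that nothing is lost: the dead-letter sink gives you a record of every event the pipeline rejected and why.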

When streaming works well

Streaming pipelines are the right choice when:

  • Data freshness is measured in seconds or milliseconds. Fraud detection, real-time inventory updates, live dashboards, and alerting systems all require immediate data processing—an hourly batch job would render them useless.

  • The data volume is too large to accumulate before processing. High-frequency IoT sensor data, clickstream events, and financial tick data can generate millions of records per hour. Collecting them all before processing is often impractical: you’d need enormous storage, and the processing would take too long for the results to be useful.

  • You need to react, not just report. Streaming pipelines can trigger downstream actions in response to individual events: send a notification, block a transaction, update a recommendation. Batch pipelines can only report what has already happened.
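The “react, not just report” point can be made concrete: each event can trigger side effects the moment it is processed. A minimal sketch follows; the thresholds and action names are invented for illustration:

```python
from typing import Dict, List

def react(event: Dict, actions: List[str]) -> None:
    """Trigger per-event side effects instead of reporting after the fact.
    Thresholds and action names here are illustrative assumptions."""
    value = event.get("order_value_gbp", 0)
    if value >= 1000:
        # e.g. send to a fraud-review queue before fulfilment proceeds
        actions.append(f"flag_for_review:{event['order_id']}")
    if event.get("region") == "north" and value >= 500:
        # e.g. alert the regional sales team about a large order
        actions.append(f"notify_sales:{event['order_id']}")

triggered: List[str] = []
react({"order_id": 42, "order_value_gbp": 1200.0, "region": "north"}, triggered)
print(triggered)  # ['flag_for_review:42', 'notify_sales:42']
```

A batch job could only have listed this order in tomorrow’s report; the streaming path acts on it while it is still in flight.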

Key differences at a glance

Here’s an overview of the differences between batch and stream processing that we’ve discussed so far:

| Dimension | Batch | Streaming |
|---|---|---|
| Data model | Bounded, finite dataset | Unbounded, continuous flow |
| Processing trigger | Schedule (time or event) | Each record’s arrival |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | High (optimized for bulk processing) | Lower, with per-record overhead |
| Complexity | Lower | Higher |
| State management | Stateless per run | Often stateful |
| Error handling | Retry the entire run | Per-event dead-letter queues |
| Consistency | Strong (point-in-time snapshot) | Eventually consistent |
| Best for | Reporting, ML training, backfills | Alerting, real-time features, event routing |

Choosing between batch and streaming

All of this is useful background, but how do you actually choose between batch and stream processing? The decision comes down to three questions:

How fresh does the data need to be? If stakeholders can tolerate results that are hours old, batch is simpler and cheaper. If they need results in seconds, streaming is unavoidable.

How complex is your processing logic? Batch jobs can ingest large data sets, perform expensive aggregations, and apply complex business logic without worrying about latency. Streaming pipelines must process each event quickly, which limits how much you can do per record.

What is your operational capacity? Streaming infrastructure (Kafka clusters, Flink or Spark Streaming jobs, dead-letter queues, exactly-once delivery guarantees) is significantly more complex to operate than a scheduled Python script. If your team is small or your use case doesn’t demand real-time results, that complexity is a cost without a benefit.

Start with batch. It is easy to build, test, debug, and maintain. Move to streaming when a specific, concrete need, not a hypothetical future one, makes batch insufficient. Most data problems are batch problems, and the ones that genuinely require streaming are usually obvious when you hit them.

And as you might have guessed, you may need to combine them for some data processing systems. This is why the hybrid approach exists.

Hybrid Patterns: Lambda and Kappa Architectures

In practice, many production data systems use both models together. The two most common hybrid patterns are the Lambda and Kappa architectures.

The Lambda architecture runs a batch layer and a streaming layer in parallel. The batch layer processes the complete historical data and produces accurate, consistent results with some delay. The streaming layer processes live data and produces approximate results immediately. Downstream consumers merge the two outputs: the streaming result for freshness and the batch result for accuracy.

The trade-off is operational complexity: you’re maintaining two separate processing codebases that must produce essentially equivalent results.

The Kappa architecture simplifies this by using a single streaming layer, with the ability to replay historical data through the same pipeline whenever you need batch-style reprocessing. It works well when your streaming framework, such as Apache Kafka paired with Apache Flink, supports log retention and replay. You get one codebase, one set of logic, and the ability to reprocess history when your pipeline changes.
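The core idea of Kappa, one processing path shared by live and historical data, can be sketched in a few lines. Here the retained log is simulated with JSON lines, and all function names are illustrative:

```python
import json
from typing import Dict, Iterable, List

def process(event: Dict) -> Dict:
    """The single processing path: used for both live and replayed events."""
    event["value_tier"] = "high" if event["order_value_gbp"] >= 500 else "low"
    return event

def run(events: Iterable[Dict]) -> List[Dict]:
    return [process(e) for e in events]

# Live path: events arriving from a queue (simulated with a list)
live = run([{"order_id": 1, "order_value_gbp": 600.0}])

# Replay path: re-consume the retained log through the SAME code when logic changes
retained_log = [json.dumps({"order_id": i, "order_value_gbp": 100.0 * i})
                for i in range(1, 4)]
replayed = run(json.loads(line) for line in retained_log)

print(live[0]["value_tier"], [e["value_tier"] for e in replayed])
```

The design point is that there is no second codebase: a logic change means editing process once and replaying the log, not keeping a batch implementation in sync.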

Neither architecture is universally better. Lambda is more common in organizations that adopted batch first and added streaming gradually. Kappa is more common in systems designed with streaming as the primary pattern.

Conclusion

Batch and streaming are tools with different tradeoffs, each suited to a different class of problems. Batch pipelines excel in consistency, simplicity, and bulk throughput. Streaming pipelines excel at latency, reactivity, and continuous processing.

Understanding both patterns at an architectural level, before reaching for specific frameworks like Apache Spark, Kafka, or Flink, equips you to make the right choice and to explain it clearly. Frameworks implement these patterns; deciding which pattern fits your problem is up to you.
