
# Introduction
Python and data projects have a dependency problem. Between differences in Python versions, virtual environments, system-level packages, and operating systems, running someone else’s code on your machine can sometimes take longer than understanding the code itself.
Docker solves this by packaging your code and its entire environment (Python version, dependencies, system libraries) into an artifact called an image. From an image you can start containers that run the same way on your laptop, your colleague’s machine, and a cloud server. You stop debugging environments and start shipping.
In this article, you’ll learn Docker through practical examples focused on data projects: containerizing a script, serving a machine learning model with FastAPI, wiring up a multi-service pipeline with Docker Compose, and scheduling a job with a cron container.
# Prerequisites
Before working through the examples, you will need:
- Docker and Docker Compose installed for your operating system. Follow the official installation guide for your platform.
- Familiarity with the command line and Python.
- Familiarity with writing a Dockerfile, building an image, and running a container from that image.
You don’t need deep Docker knowledge to follow along. Each example explains what is going on.
# Containerizing a Python script with pinned dependencies
Let’s start with the most common use case: you have a Python script and a requirements.txt, and you want it to run reliably anywhere.
We’ll create a data cleaning script that reads the raw sales CSV file, removes duplicates, fills in missing values, and writes a cleaned version to disk.
// Project structure
The project is structured as follows:
data-cleaner/
├── Dockerfile
├── requirements.txt
├── clean_data.py
└── data/
    └── raw_sales.csv
// Writing the script
Here is the data cleaning script, which uses pandas for the heavy lifting:
# clean_data.py
import pandas as pd
import os
INPUT_PATH = "data/raw_sales.csv"
OUTPUT_PATH = "data/cleaned_sales.csv"
print("Reading data...")
df = pd.read_csv(INPUT_PATH)
print(f"Rows before cleaning: {len(df)}")
# Drop duplicate rows
df = df.drop_duplicates()
# Fill missing numeric values with column median
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())
# Fill missing text values with 'Unknown'
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna('Unknown')
print(f"Rows after cleaning: {len(df)}")
df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned file saved to {OUTPUT_PATH}")
// Pinning dependencies
Exact versions must be pinned. Without pinning, pip install pandas can install different versions on different machines. Pinned versions guarantee that everyone gets the same packages. You specify them in the requirements.txt file like this:
pandas==2.2.0
openpyxl==3.1.2
// Writing the Dockerfile
This Dockerfile creates a minimal, cache-friendly image for the cleaning script:
# Use a slim Python 3.11 base image
FROM python:3.11-slim
# Set the working directory inside the container
WORKDIR /app
# Copy and install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the script into the container
COPY clean_data.py .
# Default command to run when the container starts
CMD ["python", "clean_data.py"]
A few things are worth mentioning here. We use python:3.11-slim rather than the full Python image because it’s significantly smaller and leaves out packages you don’t need.
We copy requirements.txt before copying the rest of the code, and this is intentional. Docker builds images in layers and caches each one. If you only change clean_data.py, Docker will not reinstall all your dependencies on the next build; it reuses the cached pip layer and jumps straight to copying your updated script. This small ordering decision can save you minutes of rebuild time.
// Building and running
With the Dockerfile in place, you can build the image and run the container, mounting your local data folder:
# Build the image and tag it
docker build -t data-cleaner .
# Run it, mounting your local data/ folder into the container
docker run --rm -v $(pwd)/data:/app/data data-cleaner
The -v $(pwd)/data:/app/data flag mounts your local data/ folder into the container at /app/data. This is how the script reads your CSV and how the cleaned output is written back to your machine. Nothing is baked into the image; the data stays on your file system.
The --rm flag automatically removes the container after it finishes. Since this is a one-off script, there is no reason to keep the stopped container around.
# Serving Machine Learning Models with FastAPI
You’ve trained a model and want to make it available over HTTP so that other services can send data and get predictions back. FastAPI works great for this: it’s fast, lightweight, and handles input validation with Pydantic.
// Project structure
The project separates the model artifact from the application code:
ml-api/
├── Dockerfile
├── requirements.txt
├── app.py
└── model.pkl
// Writing the app
The following app loads the model once at startup and exposes a /predict endpoint:
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pickle
import numpy as np
app = FastAPI(title="Sales Forecast API")
# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    region: str
    month: int
    marketing_spend: float
    units_in_stock: int

class PredictResponse(BaseModel):
    region: str
    predicted_revenue: float

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    try:
        features = [[
            request.month,
            request.marketing_spend,
            request.units_in_stock
        ]]
        prediction = model.predict(features)
        return PredictResponse(
            region=request.region,
            predicted_revenue=round(float(prediction[0]), 2)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
The PredictRequest class validates the input for you. If someone sends a request with a missing field, or a string where a number is expected, FastAPI rejects it with an explicit error message before your model code runs. The model is loaded once at startup, not on every request, which keeps response times fast.
The /health endpoint is a small but important addition: Docker, load balancers, and cloud platforms use it to check whether your service is actually up and ready.
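The app assumes a model.pkl already exists. If you need one to experiment with, a sketch like the following produces a compatible artifact. The training data is fabricated, and scikit-learn is an assumption on my part; any pickled estimator with a predict method would work the same way:

```python
# train_model.py -- hypothetical script for producing model.pkl
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
# Feature order must match the API: month, marketing_spend, units_in_stock
X = np.column_stack([
    rng.integers(1, 13, size=n),
    rng.uniform(1_000, 10_000, size=n),
    rng.integers(0, 500, size=n),
])
# Fabricated target: revenue loosely driven by spend and stock
y = 2.5 * X[:, 1] + 10 * X[:, 2] + rng.normal(0, 500, size=n)

model = LinearRegression().fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
print("Saved model.pkl")
```

Drop the resulting model.pkl next to app.py and the container build picks it up.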
// Writing the Dockerfile
This Dockerfile bakes the model directly into the image so that the container is completely self-contained:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the model and the app together
COPY model.pkl .
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
The model.pkl file is baked into the image at build time, so the container is completely self-contained and you don’t need to mount anything when you run it. The --host 0.0.0.0 flag tells Uvicorn to listen on all network interfaces inside the container, not just localhost. Without it, you won’t be able to reach the API from outside the container.
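One small addition worth considering (not shown in the original project) is a .dockerignore file, which keeps local clutter out of the build context and speeds up builds. A hypothetical example:

```
__pycache__/
*.pyc
.venv/
.git/
notebooks/
```

Anything listed here is never sent to the Docker daemon, so it can’t accidentally end up in an image layer either.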
// Building and running
Build the image and start the API server:
docker build -t ml-api .
docker run --rm -p 8000:8000 ml-api
Test it with curl:
curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"region": "North", "month": 3, "marketing_spend": 5000.0, "units_in_stock": 320}'
# Building a Multi-Service Pipeline with Docker Compose
Real data projects rarely involve just one process. You might need a database, a script that loads data into it, and a dashboard that reads from it — all running together.
Docker Compose lets you define and run multiple containers as a single application. Each service has its own container, but they all share a private network so they can talk to each other.
// Project structure
The pipeline splits each service into its own subdirectory:
pipeline/
├── docker-compose.yml
├── loader/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── load_data.py
└── dashboard/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py
// Defining the Compose file
This Compose file declares all three services and wires them together with a health check and a shared DATABASE_URL environment variable:
# docker-compose.yml
version: "3.9"
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: analytics
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d analytics"]
      interval: 5s
      retries: 5
  loader:
    build: ./loader
    depends_on:
      db:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics
  dashboard:
    build: ./dashboard
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "8501:8501"
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics
volumes:
  pgdata:
// Writing the loader script
This script waits briefly for the database, then loads the CSV into the sales table using SQLAlchemy:
# loader/load_data.py
import pandas as pd
from sqlalchemy import create_engine
import os
import time
DATABASE_URL = os.environ["DATABASE_URL"]
# Give the DB a moment to be fully ready
time.sleep(3)
engine = create_engine(DATABASE_URL)
df = pd.read_csv("sales_data.csv")
df.to_sql("sales", engine, if_exists="replace", index=False)
print(f"Loaded {len(df)} rows into the sales table.")
Let’s take a closer look at the Compose file. Each service runs in its own container, but they’re all on the same Docker-managed network, so they can reach each other using the service name as the hostname. The loader connects to db:5432, not localhost, because db is the service name and Docker handles DNS resolution automatically.
The health check on the PostgreSQL service is important. depends_on alone only waits for the container to start, not for PostgreSQL to be ready to accept connections. The health check uses pg_isready to verify that the database is actually accepting connections before the loader attempts to connect. The pgdata volume persists the database between runs; stopping and restarting the pipeline will not wipe your data.
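The loader’s fixed time.sleep(3) works, but a retry loop is more robust if the database takes longer than expected. Here’s a sketch; wait_for is a helper name I’m introducing, and it accepts any zero-argument callable that raises until the service is ready (e.g. lambda: create_engine(DATABASE_URL).connect()):

```python
import time

def wait_for(connect, retries=10, delay=1.0):
    """Call `connect` until it succeeds or `retries` attempts are used up."""
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"Not ready (attempt {attempt}/{retries}): {exc}")
            time.sleep(delay)

# Demo with a stand-in that fails twice before succeeding
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database is starting up")
    return "connection"

result = wait_for(flaky_connect, retries=5, delay=0.01)
print(result)  # → connection
```

Combined with the Compose health check, this covers both slow startups and brief restarts without hard-coding a sleep.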
// Starting everything
Bring up all services with one command:
docker compose up --build
To stop everything, run:
docker compose down
# Scheduling Jobs with a Cron Container
Sometimes you need a script to run on a schedule. Maybe it fetches data from an API every hour and writes it to a database or file. You don’t want to set up a full orchestration system like Airflow for something this simple. A cron container works cleanly.
// Project structure
This project includes the script, a crontab file, and a Dockerfile:
data-fetcher/
├── Dockerfile
├── requirements.txt
├── fetch_data.py
└── crontab
// Writing the fetch script
The script uses requests to hit an API endpoint and save the results as a time-stamped CSV:
# fetch_data.py
import requests
import pandas as pd
from datetime import datetime
import os
API_URL = ""  # set this to your API endpoint
OUTPUT_DIR = "/app/output"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"[{datetime.now()}] Fetching data...")
response = requests.get(API_URL, timeout=10)
response.raise_for_status()
data = response.json()
df = pd.DataFrame(data["records"])
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
output_path = f"{OUTPUT_DIR}/sales_{timestamp}.csv"
df.to_csv(output_path, index=False)
print(f"[{datetime.now()}] Saved {len(df)} records to {output_path}")
// Defining the crontab
The crontab schedules the script to run every hour and redirects all output to a log file.
# Run every hour, on the hour
0 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
Note the full path to the interpreter: cron runs jobs with a minimal PATH, so a bare python may not be found. The >> /var/log/fetch.log 2>&1 part redirects both standard output and standard error to a log file. This is how you can see what happened after the fact.
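Cron’s five fields are minute, hour, day-of-month, month, and day-of-week. A few common variations on the schedule (same script, illustrative timings, using the interpreter’s full path since cron’s default PATH may not include /usr/local/bin):

```
# Every 15 minutes
*/15 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
# Daily at 02:00
0 2 * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
# Weekdays at 09:30
30 9 * * 1-5 /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
```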
// Writing the Dockerfile
This Dockerfile installs cron, registers the schedule, and runs cron in the foreground:
FROM python:3.11-slim
# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY fetch_data.py .
COPY crontab /etc/cron.d/fetch-job
# Set correct permissions and register the crontab
RUN chmod 0644 /etc/cron.d/fetch-job && crontab /etc/cron.d/fetch-job
# cron -f runs cron in the foreground, which is required for Docker
CMD ["cron", "-f"]
The -f flag is important here. Docker keeps a container alive only as long as its main process is running. If cron ran in the background (its default), the main process would exit immediately and the container would stop. Running cron -f keeps it in the foreground and the container alive.
// Building and running
Build the image and start the container in detached mode:
docker build -t data-fetcher .
docker run -d --name fetcher -v $(pwd)/output:/app/output data-fetcher
Check the logs anytime:
docker exec fetcher cat /var/log/fetch.log
The output folder is mounted from your local machine, so the CSV files land on your file system even though the script runs inside the container.
# Wrapping Up
I hope you found this Docker article helpful. Docker doesn’t need to be complicated. Start with the first example, swap in your scripts and dependencies, and get comfortable with the build cycle. Once you do this, other patterns follow naturally. Docker is a good fit when:
- You need a reproducible environment across machines or team members.
- You are sharing scripts or models that have specific dependency requirements.
- You are building multi-service systems that need to run together reliably.
- You want to deploy anywhere without setup friction.
That said, you don’t always need Docker for your Python work. It’s probably overkill when:
- You’re just doing a quick, exploratory analysis for yourself.
- Your script has no external dependencies outside of the standard library.
- You are early in a project and your needs are changing rapidly.
If you’re interested in going further, check out 5 Easy Steps to Mastering Docker for Data Science.
Happy coding!
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.