
# Introduction
Python and data projects have a dependency problem. Between differences in Python versions, virtual environments, system-level packages, and operating systems, running someone else’s code on your machine can sometimes take longer than understanding the code itself.
Docker solves this by packaging your code and its entire environment (Python version, dependencies, system libraries) into an artifact called an image. From an image you can start containers that run the same way on your laptop, your colleague’s machine, and a cloud server. You stop debugging environments and start shipping.
In this article, you’ll learn Docker through practical examples focused on data projects: containerizing a script, serving a machine learning model with FastAPI, wiring up a multi-service pipeline with Docker Compose, and scheduling a job with a cron container.
# Prerequisites
Before working through the examples, you will need:
- Docker and Docker Compose installed for your operating system. Follow the official installation guide for your platform.
- Familiarity with the command line and Python.
- Familiarity with writing a Dockerfile, building an image, and running a container from that image.
You don’t need deep Docker knowledge to follow along. Each example explains what is going on.
# Containerizing a Python script with pinned dependencies
Let’s start with the most common use case: you have a Python script and a requirements.txt, and you want it to run reliably anywhere.
We’ll create a data cleaning script that reads the raw sales CSV file, removes duplicates, fills in missing values, and writes a cleaned version to disk.
// Project structure
The project is structured as follows:
data-cleaner/
├── Dockerfile
├── requirements.txt
├── clean_data.py
└── data/
    └── raw_sales.csv
// Writing the script
Here is the data cleaning script, which uses pandas for the heavy lifting:
# clean_data.py
import pandas as pd
import os
INPUT_PATH = "data/raw_sales.csv"
OUTPUT_PATH = "data/cleaned_sales.csv"
print("Reading data...")
df = pd.read_csv(INPUT_PATH)
print(f"Rows before cleaning: {len(df)}")
# Drop duplicate rows
df = df.drop_duplicates()
# Fill missing numeric values with column median
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())
# Fill missing text values with 'Unknown'
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna('Unknown')
print(f"Rows after cleaning: {len(df)}")
df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned file saved to {OUTPUT_PATH}")
// Pinning dependencies
Exact versions must be pinned. Without pinning, pip install pandas can install different versions on different machines. Pinned versions guarantee that everyone gets the same packages. You specify them in the requirements.txt file like this:
pandas==2.2.0
openpyxl==3.1.2
// Writing the Dockerfile
This Dockerfile creates a minimal, cache-friendly image for the cleaning script:
# Use a slim Python 3.11 base image
FROM python:3.11-slim
# Set the working directory inside the container
WORKDIR /app
# Copy and install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the script into the container
COPY clean_data.py .
# Default command to run when the container starts
CMD ["python", "clean_data.py"]
A few things are worth mentioning here. We use python:3.11-slim rather than the full Python image because it’s significantly smaller and leaves out packages you don’t need.
We copy requirements.txt before copying the rest of the code, and this is intentional. Docker builds images in layers and caches each one. If you only change clean_data.py, Docker will not reinstall all your dependencies on the next build; it reuses the cached pip layer and jumps straight to copying your updated script. This small ordering decision can save you minutes of rebuild time.
// Building and running
With the Dockerfile in place, you can build the image and run the container, mounting your local data folder:
# Build the image and tag it
docker build -t data-cleaner .
# Run it, mounting your local data/ folder into the container
docker run --rm -v $(pwd)/data:/app/data data-cleaner
The -v $(pwd)/data:/app/data flag mounts your local data/ folder into the container at /app/data. This is how the script reads your CSV and how the cleaned output is written back to your machine. Nothing is baked into the image; the data stays on your file system.
The --rm flag automatically removes the container after it finishes. Since this is a one-off script, there is no reason to keep the stopped container around.
# Serving Machine Learning Models with FastAPI
You’ve trained a model and want to make it available over HTTP so that other services can send data and get predictions back. FastAPI works great for this: it’s fast, lightweight, and handles input validation with Pydantic.
// Project structure
The project separates the model artifact from the application code:
ml-api/
├── Dockerfile
├── requirements.txt
├── app.py
└── model.pkl
// Writing the app
The following app loads the model once at startup and exposes a /predict endpoint:
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pickle
import numpy as np
app = FastAPI(title="Sales Forecast API")
# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    region: str
    month: int
    marketing_spend: float
    units_in_stock: int

class PredictResponse(BaseModel):
    region: str
    predicted_revenue: float

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    try:
        features = [[
            request.month,
            request.marketing_spend,
            request.units_in_stock
        ]]
        prediction = model.predict(features)
        return PredictResponse(
            region=request.region,
            predicted_revenue=round(float(prediction[0]), 2)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
The PredictRequest class validates the input for you. If someone sends a request with a missing field, or a string where a number is expected, FastAPI rejects it with an explicit error message before your model code runs. The model is loaded once at startup, not on every request, which keeps response times fast.
The /health endpoint is a small but important addition: Docker, load balancers, and cloud platforms use it to check whether your service is actually up and ready.
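The app assumes a model.pkl already exists. If you need one to experiment with, a sketch like the following produces a compatible artifact. The training data is fabricated, and scikit-learn is an assumption on my part; any pickled estimator with a predict method would work the same way:

```python
# train_model.py -- hypothetical script for producing model.pkl
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
# Feature order must match the API: month, marketing_spend, units_in_stock
X = np.column_stack([
    rng.integers(1, 13, size=n),
    rng.uniform(1_000, 10_000, size=n),
    rng.integers(0, 500, size=n),
])
# Fabricated target: revenue loosely driven by spend and stock
y = 2.5 * X[:, 1] + 10 * X[:, 2] + rng.normal(0, 500, size=n)

model = LinearRegression().fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
print("Saved model.pkl")
```

Drop the resulting model.pkl next to app.py and the container build picks it up.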
// Writing the Dockerfile
This Dockerfile bakes the model directly into the image so that the container is completely self-contained:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the model and the app together
COPY model.pkl .
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
The model.pkl file is baked into the image at build time, so the container is completely self-contained and you don’t need to mount anything when you run it. The --host 0.0.0.0 flag tells Uvicorn to listen on all network interfaces inside the container, not just localhost. Without it, you won’t be able to reach the API from outside the container.
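One small addition worth considering (not shown in the original project) is a .dockerignore file, which keeps local clutter out of the build context and speeds up builds. A hypothetical example:

```
__pycache__/
*.pyc
.venv/
.git/
notebooks/
```

Anything listed here is never sent to the Docker daemon, so it can’t accidentally end up in an image layer either.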
// Building and running
Build the image and start the API server:
docker build -t ml-api .
docker run --rm -p 8000:8000 ml-api
Test it with curl:
curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"region": "North", "month": 3, "marketing_spend": 5000.0, "units_in_stock": 320}'
# Building a Multi-Service Pipeline with Docker Compose
Real data projects rarely involve just one process. You might need a database, a script that loads data into it, and a dashboard that reads from it — all running together.
Docker Compose lets you define and run multiple containers as a single application. Each service has its own container, but they all share a private network so they can talk to each other.
// Project structure
The pipeline splits each service into its own subdirectory:
pipeline/
├── docker-compose.yml
├── loader/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── load_data.py
└── dashboard/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py
// Defining the Compose file
This Compose file declares all three services and wires them together with a health check and a shared DATABASE_URL environment variable:
# docker-compose.yml
version: "3.9"
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: analytics
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d analytics"]
      interval: 5s
      retries: 5
  loader:
    build: ./loader
    depends_on:
      db:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics
  dashboard:
    build: ./dashboard
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "8501:8501"
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics
volumes:
  pgdata:
// Writing the loader script
This script waits briefly for the database, then loads the CSV into the sales table using SQLAlchemy:
# loader/load_data.py
import pandas as pd
from sqlalchemy import create_engine
import os
import time
DATABASE_URL = os.environ["DATABASE_URL"]
# Give the DB a moment to be fully ready
time.sleep(3)
engine = create_engine(DATABASE_URL)
df = pd.read_csv("sales_data.csv")
df.to_sql("sales", engine, if_exists="replace", index=False)
print(f"Loaded {len(df)} rows into the sales table.")
Let’s take a closer look at the Compose file. Each service runs in its own container, but they’re all on the same Docker-managed network, so they can reach each other using the service name as the hostname. The loader connects to db:5432, not localhost, because db is the service name and Docker handles DNS resolution automatically.
The health check on the PostgreSQL service is important. depends_on alone only waits for the container to start, not for PostgreSQL to be ready to accept connections. The health check uses pg_isready to verify that the database is actually accepting connections before the loader attempts to connect. The pgdata volume persists the database between runs; stopping and restarting the pipeline will not wipe your data.
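The loader’s fixed time.sleep(3) works, but a retry loop is more robust if the database takes longer than expected. Here’s a sketch; wait_for is a helper name I’m introducing, and it accepts any zero-argument callable that raises until the service is ready (e.g. lambda: create_engine(DATABASE_URL).connect()):

```python
import time

def wait_for(connect, retries=10, delay=1.0):
    """Call `connect` until it succeeds or `retries` attempts are used up."""
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"Not ready (attempt {attempt}/{retries}): {exc}")
            time.sleep(delay)

# Demo with a stand-in that fails twice before succeeding
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database is starting up")
    return "connection"

result = wait_for(flaky_connect, retries=5, delay=0.01)
print(result)  # → connection
```

Combined with the Compose health check, this covers both slow startups and brief restarts without hard-coding a sleep.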
// Starting everything
Bring up all services with one command:
docker compose up --build
To stop everything, run:
docker compose down
# Scheduling Jobs with a Cron Container
Sometimes you need a script to run on a schedule. Maybe it fetches data from an API every hour and writes it to a database or file. You don’t want to set up a full orchestration system like Airflow for something this simple. A cron container works cleanly.
// Project structure
This project includes the script, a crontab file, and a Dockerfile:
data-fetcher/
├── Dockerfile
├── requirements.txt
├── fetch_data.py
└── crontab
// Writing the fetch script
The script uses requests to hit an API endpoint and save the results as a time-stamped CSV:
# fetch_data.py
import requests
import pandas as pd
from datetime import datetime
import os
API_URL = ""  # set this to your API endpoint
OUTPUT_DIR = "/app/output"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"[{datetime.now()}] Fetching data...")
response = requests.get(API_URL, timeout=10)
response.raise_for_status()
data = response.json()
df = pd.DataFrame(data["records"])
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
output_path = f"{OUTPUT_DIR}/sales_{timestamp}.csv"
df.to_csv(output_path, index=False)
print(f"[{datetime.now()}] Saved {len(df)} records to {output_path}")
// Defining the crontab
The crontab schedules the script to run every hour and redirects all output to a log file.
# Run every hour, on the hour
0 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
Note the full path to the interpreter: cron runs jobs with a minimal PATH, so a bare python may not be found. The >> /var/log/fetch.log 2>&1 part redirects both standard output and standard error to a log file. This is how you can see what happened after the fact.
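Cron’s five fields are minute, hour, day-of-month, month, and day-of-week. A few common variations on the schedule (same script, illustrative timings, using the interpreter’s full path since cron’s default PATH may not include /usr/local/bin):

```
# Every 15 minutes
*/15 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
# Daily at 02:00
0 2 * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
# Weekdays at 09:30
30 9 * * 1-5 /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
```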
// Writing the Dockerfile
This Dockerfile installs cron, registers the schedule, and runs cron in the foreground:
FROM python:3.11-slim
# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY fetch_data.py .
COPY crontab /etc/cron.d/fetch-job
# Set correct permissions and register the crontab
RUN chmod 0644 /etc/cron.d/fetch-job && crontab /etc/cron.d/fetch-job
# cron -f runs cron in the foreground, which is required for Docker
CMD ["cron", "-f"]
The -f flag is important here. Docker keeps a container alive only as long as its main process is running. If cron ran in the background (its default), the main process would exit immediately and the container would stop. Running cron -f keeps it in the foreground and the container alive.
// Building and running
Build the image and start the container in detached mode:
docker build -t data-fetcher .
docker run -d --name fetcher -v $(pwd)/output:/app/output data-fetcher
Check the logs anytime:
docker exec fetcher cat /var/log/fetch.log
The output folder is mounted from your local machine, so the CSV files land on your file system even though the script runs inside the container.
# Wrapping Up
I hope you found this Docker article helpful. Docker doesn’t need to be complicated. Start with the first example, swap in your scripts and dependencies, and get comfortable with the build cycle. Once you do this, other patterns follow naturally. Docker is a good fit when:
- You need a reproducible environment across machines or team members.
- You are sharing scripts or models that have specific dependency requirements.
- You are building multi-service systems that need to run together reliably.
- You want to deploy anywhere without setup friction.
That said, you don’t always need Docker for your Python work. It’s probably overkill when:
- You’re just doing a quick, exploratory analysis for yourself.
- Your script has no external dependencies outside of the standard library.
- You are early in a project and your needs are changing rapidly.
If you’re interested in going further, check out 5 Easy Steps to Mastering Docker for Data Science.
Happy coding!
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.