5 Useful Python Scripts for Synthetic Data Generation

by SkillAiNest


# Introduction

Synthetic data, as the name suggests, is artificially generated rather than collected from real-world sources. It looks like real data but avoids privacy issues and high data collection costs. It allows you to easily test software and models by running experiments to simulate performance after release.

While libraries like Faker, SDV, and Synthcity exist, and even large language models (LLMs) are widely used to generate synthetic data, my focus in this article is to avoid relying on these external libraries or AI tools. Instead, you’ll learn how to get the same results by writing your own Python scripts, which gives you a better understanding of how a dataset is structured and how bias or errors are introduced. We’ll start with simple toy scripts to understand the options available. Once you understand these basics, you can comfortably move to specialized libraries.

# 1. Generating simple random data

The easiest place to start is with a table. For example, if you need a fake customer dataset for an internal demo, you can run a script to generate comma-separated values (CSV) data:

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

countries = ("Canada", "UK", "UAE", "Germany", "USA")
plans = ("Free", "Basic", "Pro", "Enterprise")

def random_signup_date():
    start = datetime(2024, 1, 1)
    end = datetime(2026, 1, 1)
    delta_days = (end - start).days
    return (start + timedelta(days=random.randint(0, delta_days))).date().isoformat()

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    country = random.choice(countries)
    plan = random.choice(plans)
    monthly_spend = round(random.uniform(0, 500), 2)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "country": country,
        "plan": plan,
        "monthly_spend": monthly_spend,
        "signup_date": random_signup_date()
    })

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved customers.csv")

Output:
Simple random data generation
This script is straightforward: you define the fields, select the ranges, and write the rows. The random module supports integer generation, floating-point values, random choice, and sampling. The csv module is designed to read and write row-based tabular data. This type of dataset is suitable for:

  • Front-end demos
  • Dashboard testing
  • API development
  • Learning Structured Query Language (SQL)
  • Unit testing input pipelines

However, this approach has a fundamental weakness: everything is completely random. This often results in data that looks flat or unnatural. An Enterprise user can spend only $2, while a Free user can spend $400. Older users behave just like younger ones because there is no underlying structure.

In real-world scenarios, data rarely behaves this way. Instead of generating values independently, we can introduce relationships and rules. This makes the dataset feel more realistic while remaining completely synthetic. For example:

  • Enterprise users should almost never spend zero
  • Spending ranges should depend on the chosen plan
  • Older users may spend a little more on average
  • Some plans should be more common than others

Let’s add these controls to the script:

import csv
import random

random.seed(42)

plans = ("Free", "Basic", "Pro", "Enterprise")

def choose_plan():
    roll = random.random()
    if roll < 0.45:
        return "Free"
    if roll < 0.75:
        return "Basic"
    if roll < 0.93:
        return "Pro"
    return "Enterprise"

def generate_spend(age, plan):
    if plan == "Free":
        base = random.uniform(0, 10)
    elif plan == "Basic":
        base = random.uniform(10, 60)
    elif plan == "Pro":
        base = random.uniform(50, 180)
    else:
        base = random.uniform(150, 500)

    if age >= 40:
        base *= 1.15

    return round(base, 2)

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    plan = choose_plan()
    spend = generate_spend(age, plan)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": spend
    })

with open("controlled_customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved controlled_customers.csv")

Output:
Simple Random Data Generation-2

Now the dataset preserves meaningful patterns. Instead of generating random noise, you’re simulating behavior. Effective controls include:

  • Weighted category selection
  • Realistic minimum and maximum limits
  • Conditional logic between columns
  • Intentionally adding rare edge cases
  • Injecting missing values at low rates
  • Correlated attributes rather than independent ones
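As a quick illustration of two of these controls, here is a minimal sketch (separate from the article’s main scripts, with illustrative weights and a 2% missing-value rate chosen for the example) that uses random.choices for weighted category selection and injects missing values at a low rate:

```python
import random

random.seed(42)

plans = ["Free", "Basic", "Pro", "Enterprise"]
weights = [0.45, 0.30, 0.18, 0.07]  # weighted category selection

rows = []
for _ in range(1000):
    plan = random.choices(plans, weights=weights, k=1)[0]
    # Inject missing ages at a low rate (~2%)
    age = None if random.random() < 0.02 else random.randint(18, 70)
    rows.append({"plan": plan, "age": age})

free_share = sum(r["plan"] == "Free" for r in rows) / len(rows)
missing_share = sum(r["age"] is None for r in rows) / len(rows)
print(f"Free share: {free_share:.2f}, missing ages: {missing_share:.2%}")
```

With 1,000 rows, the observed Free share lands close to the 0.45 weight, and the missing-value rate stays small, which is exactly the kind of controlled imperfection real datasets exhibit.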

# 2. Simulation process for synthetic data

Simulation-based generation is one of the best ways to create realistic synthetic datasets. Instead of filling in columns directly, you simulate a process. For example, consider a small warehouse where orders come in, stock is drawn down, low stock levels trigger restocks, and shortfalls become backorders.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

inventory = {
    "A": 120,
    "B": 80,
    "C": 50
}

rows = []
current_time = datetime(2026, 1, 1)

for day in range(30):
    for product in inventory:
        daily_orders = random.randint(0, 12)

        for _ in range(daily_orders):
            qty = random.randint(1, 5)
            before = inventory[product]

            if inventory[product] >= qty:
                inventory[product] -= qty
                status = "fulfilled"
            else:
                status = "backorder"

            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": qty,
                "stock_before": before,
                "stock_after": inventory(product),
                "status": status
            })

        if inventory[product] < 20:
            restock = random.randint(30, 80)
            inventory[product] += restock
            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": restock,
                "stock_before": inventory(product) - restock,
                "stock_after": inventory(product),
                "status": "restock"
            })

    current_time += timedelta(days=1)

with open("warehouse_sim.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved warehouse_sim.csv")

Output:
Synthetic data based on simulation
This method works well because the data is a byproduct of the system’s behavior, which typically produces more realistic relationships than generating rows directly at random. Other simulation ideas include:

  • Call center queues
  • Ride requests and driver matching
  • Loan applications and approvals
  • Subscriptions and churn
  • Patient appointment scheduling
  • Website traffic and conversion
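For instance, the subscriptions-and-churn idea can be sketched in a few lines. The 5% monthly churn probability and 12-month horizon below are illustrative assumptions, not figures from the article:

```python
import random

random.seed(42)

# Simulate 500 subscribers over 12 months with a 5% monthly churn chance
rows = []
for user_id in range(1, 501):
    active = True
    for month in range(1, 13):
        if not active:
            break
        churned = random.random() < 0.05
        rows.append({
            "user_id": f"USER{user_id:04d}",
            "month": month,
            "status": "churned" if churned else "active"
        })
        if churned:
            active = False  # no more rows for this user after churning

churned_users = sum(r["status"] == "churned" for r in rows)
print(f"{churned_users} of 500 users churned within a year")
```

As in the warehouse example, each row is a byproduct of the simulated process: once a user churns, they simply stop generating rows.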

# 3. Generating time series synthetic data

Synthetic data is not limited to static tables. Many systems generate patterns over time, such as app traffic, sensor readings, orders per hour, or server response times. Here’s a simple time series generator for hourly website visits with a day-of-the-week pattern.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1, 0, 0, 0)
hours = 24 * 30
rows = []

for i in range(hours):
    ts = start + timedelta(hours=i)
    weekday = ts.weekday()

    base = 120
    if weekday >= 5:
        base = 80

    hour = ts.hour
    if 8 <= hour <= 11:
        base += 60
    elif 18 <= hour <= 21:
        base += 40
    elif 0 <= hour <= 5:
        base -= 30

    visits = max(0, int(random.gauss(base, 15)))

    rows.append({
        "timestamp": ts.isoformat(),
        "visits": visits
    })

with open("traffic_timeseries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=("timestamp", "visits"))
    writer.writeheader()
    writer.writerows(rows)

print("Saved traffic_timeseries.csv")

Output:
Time series synthetic data
This approach works well because it incorporates trends, noise, and cyclical behavior while remaining easy to define and debug.
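One natural extension, sketched below under illustrative assumptions (a small hourly growth factor and a simplified morning peak), is to add a gradual trend on top of the cyclical pattern so that traffic grows over the month:

```python
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1)
rows = []
for i in range(24 * 30):
    ts = start + timedelta(hours=i)
    base = 120 + 0.05 * i                      # slow upward trend over the month
    if 8 <= ts.hour <= 11:
        base += 40                             # simplified morning peak
    visits = max(0, int(random.gauss(base, 15)))
    rows.append({"timestamp": ts.isoformat(), "visits": visits})

first_week = sum(r["visits"] for r in rows[:168]) / 168
last_week = sum(r["visits"] for r in rows[-168:]) / 168
print(f"avg visits first week: {first_week:.1f}, last week: {last_week:.1f}")
```

Because the trend term dominates the Gaussian noise when averaged over a week, the last week’s mean reliably exceeds the first week’s.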

# 4. Creating an event log

Event logs are another useful script style, ideal for product analytics and workflow testing. Instead of one row per user, you create one row per event.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

events = ("signup", "login", "view_page", "add_to_cart", "purchase", "logout")

rows = []
start = datetime(2026, 1, 1)

for user_id in range(1, 201):
    event_count = random.randint(5, 30)
    current_time = start + timedelta(days=random.randint(0, 10))

    for _ in range(event_count):
        event = random.choice(events)

        if event == "purchase" and random.random() < 0.6:
            value = round(random.uniform(10, 300), 2)
        else:
            value = 0.0

        rows.append({
            "user_id": f"USER{user_id:04d}",
            "event_time": current_time.isoformat(),
            "event_name": event,
            "event_value": value
        })

        current_time += timedelta(minutes=random.randint(1, 180))

with open("event_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved event_log.csv")

Output:
Event log generation
This format is useful for:

  • Funnel analysis
  • Analytical pipeline testing
  • Business intelligence (BI) dashboards
  • Session reconstruction
  • Anomaly detection experiments

A useful technique here is to make events depend on previous actions. For example, a purchase should typically follow a login or page view, which makes the simulated log more realistic.
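One way to sketch that dependency is a small transition table, where the next event is drawn conditioned on the current one. The probabilities below are illustrative assumptions chosen so that a purchase can only follow an add-to-cart:

```python
import random

random.seed(42)

# Next-event probabilities conditioned on the current event
transitions = {
    "signup":      [("login", 0.9), ("logout", 0.1)],
    "login":       [("view_page", 0.8), ("logout", 0.2)],
    "view_page":   [("view_page", 0.5), ("add_to_cart", 0.3), ("logout", 0.2)],
    "add_to_cart": [("purchase", 0.6), ("view_page", 0.3), ("logout", 0.1)],
    "purchase":    [("view_page", 0.4), ("logout", 0.6)],
}

def next_event(current):
    events, probs = zip(*transitions[current])
    return random.choices(events, weights=probs, k=1)[0]

# Generate one session; cap the length so it always terminates
session = ["signup"]
while session[-1] != "logout" and len(session) < 30:
    session.append(next_event(session[-1]))

print(" -> ".join(session))
```

Because "purchase" only appears in the add_to_cart row of the table, every purchase in the generated session is guaranteed to follow an add-to-cart event.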

# 5. Generating synthetic text data with templates

Synthetic data is also valuable for natural language processing (NLP). You don’t always need an LLM to get started. You can create efficient text datasets using templates and controlled variables. For example, you can create support ticket training data:

import json
import random

random.seed(42)

issues = (
    ("billing", "I was charged twice for my subscription"),
    ("login", "I cannot log into my account"),
    ("shipping", "My order has not arrived yet"),
    ("refund", "I want to request a refund"),
)

tones = ("Please help", "This is urgent", "Can you check this", "I need support")

records = []

for _ in range(100):
    label, message = random.choice(issues)
    tone = random.choice(tones)

    text = f"{tone}. {message}."
    records.append({
        "text": text,
        "label": label
    })

with open("support_tickets.jsonl", "w", encoding="utf-8") as f:
    for item in records:
        f.write(json.dumps(item) + "\n")

print("Saved support_tickets.jsonl")

Output:
Synthetic text data using templates
This approach works well for:

  • Text classification demos
  • Intent detection
  • Chatbot testing
  • Quick diagnostics
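To get more variety out of the same template idea, you can add slot placeholders that are filled independently. The product names and order-ID range below are made up for illustration:

```python
import random

random.seed(42)

templates = [
    "I was charged twice for my {product} subscription",
    "My {product} order #{order_id} has not arrived yet",
]
products = ["Pro plan", "starter kit", "annual bundle"]  # illustrative values

records = []
for _ in range(50):
    template = random.choice(templates)
    # str.format ignores unused keyword arguments, so both templates work
    text = template.format(product=random.choice(products),
                           order_id=random.randint(1000, 9999))
    records.append(text)

print(records[0])
```

Two templates with three products and a four-digit ID slot already yield thousands of distinct surface forms, which helps keep a template-based dataset from becoming repetitive.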

# Final thoughts

Synthetic data scripts are powerful tools, but they can be implemented incorrectly. Be sure to avoid these common mistakes:

  • Randomizing all values uniformly
  • Forgetting dependencies between fields
  • Creating values that defy business logic
  • Assuming that synthetic data is inherently private by default
  • Creating data that is too “clean” to exercise real-world edge cases
  • Reusing the same patterns so often that the dataset becomes predictable and unrealistic

Privacy is the most important consideration. Although synthetic data reduces the exposure of real records, it is not risk-free. If a generator mirrors the original sensitive data too closely, leakage can still occur. This is why privacy-preserving methods, such as differentially private synthetic data, are essential.
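As a tiny illustration of the idea (a sketch of one building block, not a full differential-privacy implementation), Laplace noise can be added to an aggregate count before it is used to drive a generator. The count, epsilon, and sensitivity values here are assumptions for the example:

```python
import math
import random

random.seed(42)

true_count = 130   # e.g. the real number of customers in a segment
epsilon = 1.0      # privacy budget; smaller epsilon means more noise
sensitivity = 1    # adding/removing one person changes the count by at most 1

# Sample Laplace(0, sensitivity/epsilon) noise via inverse-CDF sampling
u = random.random() - 0.5
scale = sensitivity / epsilon
noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

noisy_count = true_count + noise
print(f"noisy count: {noisy_count:.1f}")
```

A generator calibrated to the noisy count rather than the true one no longer depends deterministically on any single real record, which is the intuition behind differentially private synthetic data.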

Kanwal Mehreen is a machine learning engineer and a technical writer with a deep passion for AI along with data science and medicine. She co-authored the e-book “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she is a champion of diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.
