

Photo by author
# Introduction
Working with large datasets in Python often leads to a common problem: you load your data with Pandas, and your program slows to a crawl or crashes completely. This usually happens because you are trying to load everything into memory at once.
Most memory problems come from how you load and process data. With a handful of practical techniques, you can handle datasets much larger than your available memory.
In this article, you’ll learn seven techniques for working efficiently with large datasets in Python. We’ll start simple and build up, so by the end you’ll know exactly which approach fits your use case.
🔗 You can find the code on GitHub. If you want to follow along, you can run the sample data generator Python script to create the sample CSV files used in the code snippets.
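If you prefer to create test data yourself, a minimal sketch like the one below also works. The column names, row count, and value ranges here are illustrative assumptions, not the exact contents of the repository’s script.
import numpy as np
import pandas as pd

# Illustrative generator: writes a large_sales_data.csv with made-up values
# (column names, sizes, and ranges are assumptions, not the repository script)
n_rows = 1_000_000  # increase this to stress-test the techniques below
rng = np.random.default_rng(42)
pd.DataFrame({
    'order_id': np.arange(n_rows),
    'year': rng.integers(2020, 2025, size=n_rows),
    'revenue': rng.uniform(5, 500, size=n_rows).round(2),
}).to_csv('large_sales_data.csv', index=False)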
# 1. Read the data in chunks
The most beginner-friendly approach is to process your data in small chunks rather than loading everything at once.
Consider a scenario where you have a large dataset of sales and want to find the total revenue. The following code demonstrates this approach:
import pandas as pd
# Define chunk size (number of rows per chunk)
chunk_size = 100000
total_revenue = 0
# Read and process the file in chunks
for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    # Process each chunk
    total_revenue += chunk['revenue'].sum()
print(f"Total Revenue: ${total_revenue:,.2f}")Instead of loading all 10 million rows at once, we are loading 100,000 rows at a time. We calculate the amount for each part and add it to our run. You only have 100,000 rows in RAM, no matter how big the file is.
When to use it: When you need to perform aggregation (sum, count, average) or filtering operations on large files.
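One subtlety worth noting: sums and counts combine directly across chunks, but an average does not (you can’t just average the per-chunk averages). Instead, accumulate a running sum and a running row count, then divide at the end. A minimal sketch, reusing the same file and column as above:
import pandas as pd

total_revenue = 0.0
total_rows = 0
for chunk in pd.read_csv('large_sales_data.csv', chunksize=100_000):
    total_revenue += chunk['revenue'].sum()
    total_rows += len(chunk)

# Divide once at the end -- averaging per-chunk averages would be wrong
# whenever chunks have different sizes
average_revenue = total_revenue / total_rows
print(f"Average Revenue: ${average_revenue:,.2f}")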
# 2. Load only the columns you need
Often, you don’t need every column in your dataset. Loading only what you need can significantly reduce memory usage.
Suppose you are analyzing customer data, but out of dozens of columns you only need the customer ID, age, and purchase amount.
import pandas as pd
# Only load the columns you actually need
columns_to_use = ['customer_id', 'age', 'purchase_amount']
df = pd.read_csv('customers.csv', usecols=columns_to_use)
# Now work with a much lighter dataframe
average_purchase = df.groupby('age')['purchase_amount'].mean()
print(average_purchase)
By specifying usecols, Pandas loads only these three columns into memory. If your original file has 50 columns, you’ve just cut memory usage by roughly 94%.
When to use it: When you know exactly which columns you need before loading the data.
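If you are not sure which columns a file contains, you can read just the header row before deciding what to load; nrows=0 returns an empty DataFrame whose columns tell you what is available.
import pandas as pd

# Peek at the header only -- no data rows are loaded into memory
available_columns = pd.read_csv('customers.csv', nrows=0).columns.tolist()
print(available_columns)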
# 3. Optimize data types
By default, Pandas often uses more memory than necessary. A column of integers may be stored as 64-bit when 8-bit would work fine.
For example, if you are loading a dataset with product ratings (1-5 stars) and user IDs:
import pandas as pd
# First, let's see the default memory usage
df = pd.read_csv('ratings.csv')
print("Default memory usage:")
print(df.memory_usage(deep=True))
# Now optimize the data types
df['rating'] = df['rating'].astype('int8')    # Ratings are 1-5, so int8 is enough
df['user_id'] = df['user_id'].astype('int32') # Assuming user IDs fit in int32
print("\nOptimized memory usage:")
print(df.memory_usage(deep=True))
By changing the rating column from int64 (8 bytes per value) to int8 (1 byte per value), we get an 8x memory reduction for that column.
Common conversions include:
- int64 → int8, int16, or int32 (depending on the range of values).
- float64 → float32 (if you don’t need extreme precision).
- object → category (for columns with repeated values).
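You can also apply these conversions at load time, or let Pandas pick the smallest safe numeric type for you. A short sketch using the same ratings.csv columns as above:
import pandas as pd

# Declare smaller dtypes while reading, so the 64-bit versions never exist
df = pd.read_csv('ratings.csv', dtype={'rating': 'int8', 'user_id': 'int32'})

# Or let Pandas downcast an existing numeric column automatically
df['user_id'] = pd.to_numeric(df['user_id'], downcast='integer')
print(df.dtypes)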
# 4. Use categorical data types
When a column contains repeated text values (such as country names or product categories), Pandas stores each value separately. The category dtype stores each unique value once and uses compact integer codes to refer to it.
Suppose you’re working with a product inventory file where the category column only has 20 unique values, but they repeat across all the rows in the dataset:
import pandas as pd
df = pd.read_csv('products.csv')
# Check memory before conversion
print(f"Before: {df('category').memory_usage(deep=True) / 1024**2:.2f} MB")
# Convert to category
df('category') = df('category').astype('category')
# Check memory after conversion
print(f"After: {df('category').memory_usage(deep=True) / 1024**2:.2f} MB")
# It still works like normal text
print(df('category').value_counts())This change can significantly reduce memory usage for columns with low cardinality (few unique values). Columns still work just like standard text data: you can filter, group and sort as usual.
When to use it: for any text column where values repeat frequently (categories, states, countries, departments, and the like).
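As with numeric types, you can request the category dtype while reading, so the repeated strings never occupy memory in their expanded form. A small sketch with the same products.csv:
import pandas as pd

# Convert to category at load time instead of after the fact
df = pd.read_csv('products.csv', dtype={'category': 'category'})
print(df['category'].dtype)           # category
print(df['category'].cat.categories)  # the unique values, stored once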
# 5. Filter while reading
Sometimes you know you only need a subset of rows. Instead of loading everything and then filtering, you can filter during the load process.
For example, if you only care about transactions from the year 2024:
import pandas as pd
# Read in chunks and filter
chunk_size = 100000
filtered_chunks = []
for chunk in pd.read_csv('transactions.csv', chunksize=chunk_size):
    # Filter each chunk before storing it
    filtered = chunk[chunk['year'] == 2024]
    filtered_chunks.append(filtered)
# Combine the filtered chunks
df_2024 = pd.concat(filtered_chunks, ignore_index=True)
print(f"Loaded {len(df_2024)} rows from 2024")We are combining chunking with filtering. Each part is filtered before being added to our list, so we never keep the entire dataset in memory, only the rows we actually want.
When to use it: When you need only a subset of rows based on a condition.
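These techniques stack naturally: you can drop unneeded columns and filter rows in the same chunked pass. In the sketch below, customer_id and amount are assumed column names used only for illustration.
import pandas as pd

filtered_chunks = []
reader = pd.read_csv(
    'transactions.csv',
    usecols=['year', 'customer_id', 'amount'],  # assumed columns, for illustration
    chunksize=100_000,
)
for chunk in reader:
    # Keep only the 2024 rows from each chunk before storing it
    filtered_chunks.append(chunk[chunk['year'] == 2024])

df_2024 = pd.concat(filtered_chunks, ignore_index=True)
print(f"Loaded {len(df_2024)} rows from 2024")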
# 6. Use Dask for parallel processing
For datasets that are really large, Dask provides a Pandas-like API but handles all the chunking and parallel processing automatically.
Here’s how you would calculate the average of a column in a very large dataset:
import dask.dataframe as dd
# Read with Dask (it handles chunking automatically)
df = dd.read_csv('huge_dataset.csv')
# Operations look just like pandas
result = df['sales'].mean()
# Dask is lazy - compute() actually executes the calculation
average_sales = result.compute()
print(f"Average Sales: ${average_sales:,.2f}")Disk does not load the entire file into memory. Instead, it creates a plan to process the data in chunks when you call and executes that plan. .compute(). It can even use multiple CPU cores to speed up calculations.
When to use it: when your dataset is too big for pandas, even with chunking, or when you want parallel processing without writing complex code.
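Dask also lets you tune how the file is split and supports familiar groupby operations, all still lazy until you call .compute(). In the sketch below, the region column is an assumption for illustration.
import dask.dataframe as dd

# blocksize controls how large each partition is (smaller = less memory per task)
df = dd.read_csv('huge_dataset.csv', blocksize='64MB')

# Pandas-style groupby; nothing runs until .compute()
sales_by_region = df.groupby('region')['sales'].mean().compute()
print(sales_by_region)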
# 7. Sample your data for exploration
When you’re just exploring or testing code, you don’t need the full dataset. Load a sample first.
Suppose you are building a machine learning model and want to test your preprocessing pipeline. You can sample your dataset as shown:
import pandas as pd
# Read just the first 50,000 rows
df_sample = pd.read_csv('huge_dataset.csv', nrows=50000)
# Or read a random sample using skiprows
import random
skip_rows = lambda x: x > 0 and random.random() > 0.01 # Keep ~1% of rows
df_random_sample = pd.read_csv('huge_dataset.csv', skiprows=skip_rows)
print(f"Sample size: {len(df_random_sample)} rows")The first approach loads the first N rows, which is suitable for fast search. The second approach randomly generates rows throughout the file, which is better for statistical analysis or when the file is arranged in a way that makes the top rows unusual.
When to use it: During development, testing, or exploratory analysis before running your code on the full dataset.
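One small habit that helps during development: seed the random generator so the sample stays the same across runs, which keeps your results reproducible while you iterate. A minimal sketch:
import random
import pandas as pd

random.seed(42)  # same "random" sample on every run
keep_fraction = 0.01
df_sample = pd.read_csv(
    'huge_dataset.csv',
    skiprows=lambda i: i > 0 and random.random() > keep_fraction,
)
print(f"Reproducible sample size: {len(df_sample)} rows")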
# Wrapping up
Handling large datasets does not require expert-level skills. Here’s a quick summary of the techniques we’ve discussed:
| Technique | When to use it |
|---|---|
| Chunking | For aggregating, filtering, or processing data that can’t fit in RAM. |
| Column selection | When you need only a few columns from a large dataset. |
| Data type optimization | Always; do this after loading to save memory. |
| Categorical types | For text columns with repeated values (categories, states, etc.). |
| Filter while reading | When you only need a subset of rows. |
| Dask | For very large datasets or when you want parallel processing. |
| Sampling | During development and exploration. |
The first step is to understand both your data and your workload. Most of the time, a combination of chunking and smart column selection will get you 90% of the way there.
As your needs grow, move to more advanced tools like Dask, or consider converting your data to more efficient file formats such as Parquet or HDF5.
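For example, converting a CSV to Parquet is a one-time step, and later reads can pull only the columns you need. The sketch below assumes the pyarrow (or fastparquet) package is installed.
import pandas as pd

# One-time conversion (needs pyarrow or fastparquet installed)
df = pd.read_csv('customers.csv')
df.to_parquet('customers.parquet', index=False)

# Later: faster reads that load only the columns you ask for
df_small = pd.read_parquet('customers.parquet', columns=['age', 'purchase_amount'])
print(df_small.head())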
Now go ahead and start working with those massive datasets. Happy analyzing!
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates resource overviews and coding tutorials.