

# Introduction
Developers use Pandas for data manipulation, but it can be slow, especially with large datasets. Because of this, many are looking for faster and lighter alternatives. These options retain the basic features needed for analysis while focusing on speed, low memory usage, and simplicity. In this article, we look at five lightweight alternatives to Pandas that you can try.
# 1. DuckDB
DuckDB is like SQLite for analytics. You can run SQL queries directly on comma-separated values (CSV) files, which is useful if you already know SQL or work with machine learning pipelines. Install with:
pip install duckdb
We’ll use the Titanic dataset and run a simple SQL query on it:
import duckdb
url = "
# Run SQL query on the CSV
result = duckdb.query(f"""
SELECT sex, age, survived
FROM read_csv_auto('{url}')
WHERE age > 18
""").to_df()
print(result.head())
Output:
      sex   age  survived
0    male  22.0         0
1  female  38.0         1
2  female  26.0         1
3  female  35.0         1
4    male  35.0         0

DuckDB runs the SQL query directly on the CSV file and then converts the output into a DataFrame. You get the speed of SQL with the flexibility of Python.
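Because DuckDB speaks full SQL, aggregations need no extra Python code. Below is a minimal sketch reusing the `url` variable from above; the query itself is our own illustration, not from the official examples:
import duckdb
# Survival rate by sex: AVG works because survived is coded 0/1
rates = duckdb.query(f"""
SELECT sex, AVG(survived) AS survival_rate, COUNT(*) AS passengers
FROM read_csv_auto('{url}')
GROUP BY sex
ORDER BY survival_rate DESC
""").to_df()
print(rates)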
# 2. Polars
Polars is one of the most popular DataFrame libraries available today. It is implemented in Rust, which makes it exceptionally fast with minimal memory requirements, and the syntax is very clean. Let's install it using pip:
pip install polars
Now, let’s use the Titanic dataset as a simple example:
import polars as pl
# Load dataset
url = "
df = pl.read_csv(url)
result = df.filter(pl.col("age") > 40).select(["sex", "age", "survived"])
print(result)
Output:
shape: (150, 3)
┌────────┬──────┬──────────┐
│ sex    ┆ age  ┆ survived │
│ ---    ┆ ---  ┆ ---      │
│ str    ┆ f64  ┆ i64      │
╞════════╪══════╪══════════╡
│ male   ┆ 54.0 ┆ 0        │
│ female ┆ 58.0 ┆ 1        │
│ female ┆ 55.0 ┆ 1        │
│ male   ┆ 66.0 ┆ 0        │
│ male   ┆ 42.0 ┆ 0        │
│ …      ┆ …    ┆ …        │
│ female ┆ 48.0 ┆ 1        │
│ female ┆ 42.0 ┆ 1        │
│ female ┆ 47.0 ┆ 1        │
│ male   ┆ 47.0 ┆ 0        │
│ female ┆ 56.0 ┆ 1        │
└────────┴──────┴──────────┘

Polars reads the CSV, filters rows on the age condition, and selects a subset of columns.
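Polars also ships a lazy API that can speed things up further: you describe the whole query first, and the optimizer runs it in a single pass. A minimal sketch, assuming the CSV has been saved locally as titanic.csv (a hypothetical filename):
import polars as pl
# scan_csv only builds a query plan; nothing is read from disk yet
lazy = (
    pl.scan_csv("titanic.csv")
    .filter(pl.col("age") > 40)
    .group_by("sex")
    .agg(pl.col("survived").mean().alias("survival_rate"))
)
# collect() triggers the optimized scan
print(lazy.collect())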
# 3. PyArrow
PyArrow is a lightweight library for columnar data. It is the Python interface to Apache Arrow, the in-memory columnar format that tools like Polars also build on for speed and memory efficiency. It is not a complete replacement for Pandas but is great for reading and preprocessing files. Install with:
pip install pyarrow
For our example, let's use the Iris dataset in CSV form as follows:
import pyarrow.csv as csv
import pyarrow.compute as pc
import urllib.request
# Download the Iris CSV
url = "
local_file = "iris.csv"
urllib.request.urlretrieve(url, local_file)
# Read with PyArrow
table = csv.read_csv(local_file)
# Filter rows
filtered = table.filter(pc.greater(table['sepal_length'], 5.0))
print(filtered.slice(0, 5))
Output:
pyarrow.Table
sepal_length: double
sepal_width: double
petal_length: double
petal_width: double
species: string
----
sepal_length: [[5.1,5.4,5.4,5.8,5.7]]
sepal_width: [[3.5,3.9,3.7,4,4.4]]
petal_length: [[1.4,1.7,1.5,1.2,1.5]]
petal_width: [[0.2,0.4,0.2,0.2,0.4]]
species: [["setosa","setosa","setosa","setosa","setosa"]]

PyArrow reads the CSV and stores it in a columnar format. Each column name and type is listed in a clear schema, which makes it fast to inspect and filter large datasets.
# 4. Modin
Modin is for anyone who wants faster performance without having to learn a new library. It uses the same Pandas API but runs operations in parallel, so you don't need to change your existing code: just update the import and everything else works like normal Pandas. Install it with pip:
pip install "modin[all]"
For better understanding, let’s try a small example using the same Titanic dataset as follows:
import modin.pandas as pd
url = "
# Load the dataset
df = pd.read_csv(url)
# Filter the dataset
adults = df[df["age"] > 18]
# Select only a few columns to display
adults_small = adults[["survived", "sex", "age", "class"]]
# Display result
adults_small.head()
Output:
   survived     sex   age  class
0         0    male  22.0  Third
1         1  female  38.0  First
2         1  female  26.0  Third
3         1  female  35.0  First
4         0    male  35.0  Third

Modin spreads the work across CPU cores, which means you get better performance without doing anything extra.
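The parallelism pays off most on heavier operations such as groupby aggregations over large files. A minimal sketch reusing the `df` loaded above; the grouping choice is ours, purely for illustration:
# Same Pandas syntax, but the work is distributed across cores
print(df.groupby("sex")["survived"].mean())
Modin picks an execution engine (such as Ray or Dask) automatically, and its config module lets you set one explicitly if you need to.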
# 5. Dask
How do you handle large data without buying more RAM? Dask is built for files that are larger than your computer's random access memory (RAM). It uses lazy evaluation, so it doesn't load the entire dataset into memory, and it lets you process millions of rows easily. Install with:
pip install "dask[complete]"
To test this, we can use the Chicago crime dataset, as follows:
import dask.dataframe as dd
import urllib.request
url = "
local_file = "chicago_crime.csv"
urllib.request.urlretrieve(url, local_file)
# Read CSV with Dask (lazy evaluation)
df = dd.read_csv(local_file, dtype=str) # all columns as string
# Filter crimes classified as 'THEFT'
thefts = df[df['Primary Type'] == 'THEFT']
# Select a few relevant columns
thefts_small = thefts[["ID", "Date", "Primary Type", "Description", "District"]]
print(thefts_small.head())
Output:
          ID                    Date Primary Type     Description District
5   13204489  09/06/2023 11:00:00 AM        THEFT       OVER $500      001
50  13179181  08/17/2023 03:15:00 PM        THEFT    RETAIL THEFT      014
51  13179344  08/17/2023 07:25:00 PM        THEFT    RETAIL THEFT      014
53  13181885  08/20/2023 06:00:00 AM        THEFT  $500 AND UNDER      025
56  13184491  08/22/2023 11:44:00 AM        THEFT    RETAIL THEFT      014

The filter (`Primary Type == 'THEFT'`) and the column selection are lazy operations: nothing is computed until `.head()` is called. Even then the result appears almost instantly, because Dask processes the data in chunks instead of loading everything at once.
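To see the lazy model explicitly, you can build an aggregation and trigger it yourself. A minimal sketch reusing the Dask DataFrame `df` from above:
# value_counts() only builds a task graph; no data is read yet
counts = df["Primary Type"].value_counts()
# compute() runs the chunked pass over the file and returns a Pandas Series
print(counts.compute().sort_values(ascending=False).head(10))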
# Conclusion
We covered five lightweight alternatives to Pandas and how to use them, keeping things simple and focused. Check the official documentation for each library for complete details:
- DuckDB: https://duckdb.org/docs/
- Polars: https://docs.pola.rs/
- PyArrow: https://arrow.apache.org/docs/python/
- Modin: https://modin.readthedocs.io/
- Dask: https://docs.dask.org/
If you run into any issues, leave a comment and I’ll help.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the eBook "Maximizing Productivity with ChatGPT." As a 2022 Google Generation Scholar for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a MITACS Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.