

Image by author
. Introduction
There are numerous tools for working with data today. They all claim, of course, that they are the best and right choice for you. But are they? There are two important requirements these tools should satisfy: they should handle everyday data analysis tasks with ease, and they should hold up under the pressure of large datasets.
To determine the best tool among DuckDB, SQLite, and Pandas, we tested them under exactly these conditions.
First, we gave them everyday analytics tasks: summing values, grouping by category, filtering with conditions, and multi-field aggregation. This mirrors how analysts actually work with real datasets, rather than a scenario designed to show off one tool's best features.
Second, we ran these operations on a Kaggle dataset with more than 1 million rows. This is a realistic tipping point: small enough to run on a single machine, yet large enough that memory pressure and query speed start to show clear differences between the tools.
Let's see how they performed.
. The Dataset We Used
!! Dataset Overview
We used a bank transactions dataset from Kaggle. It contains more than 1 million rows across five columns:
| Column Name | Description |
|---|---|
| Date | The date the transaction occurred |
| Domain | Business category or type (e.g., retail, restaurant) |
| Location | Geographic area (e.g., Goa, Mathura) |
| Value | The monetary value of the transaction |
| Transaction_count | The total number of transactions that day |
The dataset is synthetically generated. Although it does not fully resemble real-life data, its size and structure are sufficient for testing and comparing performance differences between the tools.
!! A Peek at the Data With Pandas
We used Pandas to load the dataset into a Jupyter notebook and check its general structure, dimensions, and missing values. Here is the code.
import pandas as pd
df = pd.read_excel('bankdataset.xlsx')
print("Dataset shape:", df.shape)
df.head()
Here is the output.
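The prose above mentions checking for missing values, while the snippet only prints the shape and a preview. A quick hedged sketch of that extra inspection, assuming the same df:
# Structural check: dtypes, non-null counts, and missing values per column
df.info()
print(df.isna().sum())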
If you want a quick reference for common operations when exploring datasets, see this handy Pandas cheat sheet.
Before benchmarking, let's see how to set up the environment.
. Setting Up a Fair Testing Environment
All three tools, DuckDB, SQLite, and Pandas, were installed and run in the same Jupyter notebook environment to keep the test fair. This ensured that runtime conditions and memory usage stayed consistent across runs.
First, we installed and loaded the necessary packages. Here are the tools we need:
- Pandas: for DataFrame operations
- DuckDB: for running SQL directly against a DataFrame
- sqlite3: for embedded SQL database management
- time: for capturing execution timing
- memory_profiler: for measuring memory allocation
# Install if any of them are not in your environment
!pip install duckdb --quiet
import pandas as pd
import duckdb
import sqlite3
import time
from memory_profiler import memory_usage
Now let's get the data into a form that all three tools can share.
!! Loading the Data Into Pandas
We will load the dataset once with Pandas, then share or register it with DuckDB and SQLite.
df = pd.read_excel('bankdataset.xlsx')
df.head()
Here is the output to confirm the load.
!! Registering the Data With DuckDB
DuckDB can query Pandas DataFrames directly. You don't need to convert anything; just register the DataFrame and query it. Here is the code.
# Register DataFrame as a DuckDB table
duckdb.register("bank_data", df)
# Query via DuckDB
duckdb.query("SELECT * FROM bank_data LIMIT 5").to_df()Here is an output.
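As a side note, recent DuckDB versions can also resolve a local DataFrame by its variable name through replacement scans, so the explicit registration step may be optional; a hedged sketch:
# Assuming a recent DuckDB version: the replacement scan finds the local
# variable `df` by name, so no explicit registration is needed
duckdb.query("SELECT COUNT(*) AS n_rows FROM df").to_df()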
!! Preparing the Data for SQLite
Since SQLite cannot read Excel files directly, we start by writing the Pandas DataFrame into an in-memory database. After that, we run a simple query to verify the data loaded correctly.
conn_sqlite = sqlite3.connect(":memory:")
df.to_sql("bank_data", conn_sqlite, index=False, if_exists="replace")
pd.read_sql_query("SELECT * FROM bank_data LIMIT 5", conn_sqlite)Here is an output.
. How We Benchmarked the Tools
We ran the same four queries on DuckDB, SQLite, and Pandas to compare their performance. Each query was designed to solve a common analytical task, mirroring how data analysis is done in the real world.
!! Ensuring a Consistent Setup
All three tools worked from the same in-memory dataset:
- Pandas queried the DataFrame directly
- DuckDB ran SQL queries directly against the registered DataFrame
- SQLite stored a copy of the DataFrame in an in-memory database and ran SQL queries on it
This setup ensured that all three tools used the same data and ran under the same system settings.
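Before timing anything, a quick sanity check, our own addition rather than part of the original benchmark, can confirm that all three engines see identical data:
# Sanity check: all three engines should report the same row count
n_pandas = len(df)
n_duckdb = duckdb.query("SELECT COUNT(*) AS n FROM bank_data").to_df()["n"][0]
n_sqlite = pd.read_sql_query("SELECT COUNT(*) AS n FROM bank_data", conn_sqlite)["n"][0]
assert n_pandas == n_duckdb == n_sqlite, "Engines disagree on row count"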
!! Measuring Execution Time
To track query duration, the time module wrapped each query in a simple start/stop timer. Only query execution time was recorded; data loading and setup steps were excluded.
!! Tracking Memory Usage
Along with execution time, memory usage indicates how well each engine copes with large datasets.
Memory usage is sampled before and after each query so that the additional RAM consumed can be estimated.
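The timer-plus-memory pattern is repeated verbatim for every engine and query below. If you prefer, it can be factored into a small helper; a minimal sketch, where the benchmark function name is our own rather than from the original code:
def benchmark(engine, query_name, fn, results):
    # Sample RAM, run the query once, sample RAM again, record both deltas
    mem_before = memory_usage(-1)[0]
    start = time.time()
    fn()
    end = time.time()
    mem_after = memory_usage(-1)[0]
    results.append({
        "engine": engine,
        "query": query_name,
        "time": round(end - start, 4),
        "memory": round(mem_after - mem_before, 4)
    })

# Example usage: benchmark("Pandas", "Total transaction value", pandas_q1, pandas_results)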
!! Benchmark Queries
We tested each engine on the same four everyday analytics tasks:
- Total transaction value: summing a numeric column
- Group by domain: aggregating transaction counts
- Filter by location: filtering rows on a condition before aggregating
- Group by domain and location: multi-field aggregation with an average
. Benchmark Results
!! Query 1: Total Transaction Value
Here we measure how Pandas, DuckDB, and SQLite perform when summing the Value column of the dataset.
!! Pandas Performance
We calculate the total transaction value by calling .sum() on the Value column. Here is the code.
pandas_results = []

def pandas_q1():
    return df['Value'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
pandas_results
Here is the output.
!! DuckDB Performance
We calculate the total transaction value with a full-column SQL aggregation. Here is the code.
duckdb_results = []

def duckdb_q1():
    return duckdb.query("SELECT SUM(value) FROM bank_data").to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
duckdb_results
Here is the output.
!! SQLite Performance
We sum the value column on the in-memory table to get the total transaction value. Here is the code.
sqlite_results = []

def sqlite_q1():
    return pd.read_sql_query("SELECT SUM(value) FROM bank_data", conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
sqlite_results
Here is the output.
!! Overall Performance Analysis
Let's now compare execution time and memory usage. Here is the code.
import matplotlib.pyplot as plt

all_q1 = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
all_q1.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
all_q1.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
Pandas is the fastest and most memory-efficient here, completing almost instantly with negligible extra RAM, likely because a single-column sum maps directly to a vectorized NumPy reduction with no query-parsing overhead. DuckDB is slightly slower and uses more memory but remains efficient, while SQLite is the slowest and the heaviest in memory consumption.
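The two-panel comparison chart is redrawn with nearly identical code after each query. A small helper, where plot_engine_comparison is our own name rather than from the original, could stand in for those repeated blocks:
def plot_engine_comparison(results_df):
    # Horizontal bars: execution time on the left, memory delta on the right
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    results_df.plot(x="engine", y="time", kind="barh", ax=axes[0],
                    legend=False, title="Execution Time (s)")
    results_df.plot(x="engine", y="memory", kind="barh", color="salmon",
                    ax=axes[1], legend=False, title="Memory Usage (MB)")
    plt.tight_layout()
    plt.show()

# Example usage: plot_engine_comparison(all_q1)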
!! Query 2: Group by Domain
Here we measure performance when grouping transactions by Domain and summing their transaction counts.
!! Pandas Performance
We calculate the total transaction count per domain using .groupby() on the Domain column. Here is the code.
def pandas_q2():
    return df.groupby('Domain')['Transaction_count'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain"]
Here is the output.
!! DuckDB Performance
We calculate the total transaction count per domain using SQL GROUP BY on the domain column. Here is the code.
def duckdb_q2():
    return duckdb.query("""
        SELECT domain, SUM(transaction_count)
        FROM bank_data
        GROUP BY domain
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain"]
Here is the output.
!! SQLite Performance
We calculate the total transaction count per domain using SQL GROUP BY on the in-memory table. Here is the code.
def sqlite_q2():
    return pd.read_sql_query("""
        SELECT domain, SUM(transaction_count) AS total_txn
        FROM bank_data
        GROUP BY domain
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain"]
Here is the output.
!! Overall Performance Analysis
Let's now compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt

# Exact match, so this does not also pick up "Group by domain & location" later
groupby_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
                   if r["query"] == "Group by domain"]
df_groupby = pd.DataFrame(groupby_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df_groupby.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_groupby.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB is the fastest, Pandas trades a little extra time for lower memory use, while SQLite is the slowest and the most memory-hungry.
!! Query 3: Filter by Location (Goa)
Here we measure how Pandas, DuckDB, and SQLite perform when filtering the dataset to Location = 'Goa' and summing the transaction values.
!! Pandas Performance
We filter the rows where Location == 'Goa' and sum their values. Here is the code.
def pandas_q3():
    return df[df['Location'] == 'Goa']['Value'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Filter by location"]
Here is the output.
!! DuckDB Performance
We filter the transactions where location = 'Goa' and compute their total value. Here is the code.
def duckdb_q3():
    return duckdb.query("""
        SELECT SUM(value)
        FROM bank_data
        WHERE location = 'Goa'
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Filter by location"]
Here is the output.
!! SQLite Performance
We filter the transactions where location = 'Goa' and sum their values. Here is the code.
def sqlite_q3():
    return pd.read_sql_query("""
        SELECT SUM(value) AS total_value
        FROM bank_data
        WHERE location = 'Goa'
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Filter by location"]
Here is the output.
!! Overall Performance Analysis
Let's now compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt

filter_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
                  if r["query"] == "Filter by location"]
df_filter = pd.DataFrame(filter_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df_filter.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_filter.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB is the fastest and most efficient here. Pandas is slower and uses more memory, while SQLite is the slowest but lighter on memory.
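One caveat: SQLite's filter numbers reflect a full table scan. If SQLite were your tool of choice, an index on the filtered column would likely narrow the gap. A hedged sketch of that tuning experiment, our own addition that would change the like-for-like setup above:
# An index lets SQLite seek matching rows instead of scanning all 1M+
conn_sqlite.execute("CREATE INDEX IF NOT EXISTS idx_location ON bank_data(location)")
pd.read_sql_query(
    "SELECT SUM(value) AS total_value FROM bank_data WHERE location = 'Goa'",
    conn_sqlite
)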
!! Query 4: Group by Domain and Location
!! Pandas Performance
We group by both Domain and Location and calculate the average transaction value. Here is the code.
def pandas_q4():
    return df.groupby(['Domain', 'Location'])['Value'].mean()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain & location"]
Here is the output.
!! DuckDB Performance
We group by both domain and location and calculate the average transaction value in SQL. Here is the code.
def duckdb_q4():
    return duckdb.query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain & location"]
Here is the output.
!! SQLite Performance
We group by both domain and location and calculate the average transaction value on the in-memory table. Here is the code.
def sqlite_q4():
    return pd.read_sql_query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain & location"]
Here is the output.
!! Overall Performance Analysis
Let's now compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt

gdl_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
               if r["query"] == "Group by domain & location"]
df_gdl = pd.DataFrame(gdl_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df_gdl.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False,
            title="Execution Time (s)")
df_gdl.plot(x="engine", y="memory", kind="barh", ax=axes[1], legend=False,
            title="Memory Usage (MB)", color="salmon")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB handles the multi-field grouping fastest with moderate memory use, likely helped by its parallel, vectorized hash aggregation. Pandas is slower with higher memory use, and SQLite is the slowest with substantial memory consumption.
. Final Comparison Across All Queries
We have compared the three engines on speed and memory for each query individually. Let's now look at execution time across all four queries at once. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt

all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

measure_order = [
    "Total transaction value",
    "Group by domain",
    "Filter by location",
    "Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()

for i, q in enumerate(measure_order):
    d = all_results[all_results["query"] == q]
    axes[i].barh(d["engine"], d["time"],
                 color=[engine_colors[e] for e in d["engine"]])
    for y, v in enumerate(d["time"]):
        axes[i].text(v, y, f" {v:.3f}", va="center")
    axes[i].set_title(q, fontsize=10)
    axes[i].set_xlabel("Seconds")

fig.suptitle("Per-Measure Comparison — Execution Time", fontsize=14)
plt.tight_layout()
plt.show()
Here is the output.


This chart shows that DuckDB maintains the lowest execution time on almost every query, except total transaction value, where Pandas edges it out. SQLite is the slowest across the board by a wide margin. Let's check memory next. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt

all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

measure_order = [
    "Total transaction value",
    "Group by domain",
    "Filter by location",
    "Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()

for i, q in enumerate(measure_order):
    d = all_results[all_results["query"] == q]
    axes[i].barh(d["engine"], d["memory"],
                 color=[engine_colors[e] for e in d["engine"]])
    for y, v in enumerate(d["memory"]):
        axes[i].text(v, y, f" {v:.1f}", va="center")
    axes[i].set_title(q, fontsize=10)
    axes[i].set_xlabel("MB")

fig.suptitle("Per-Measure Comparison — Memory Usage", fontsize=14)
plt.tight_layout()
plt.show()
Here is the output.


This chart shows that SQLite's memory usage swings between best and worst, and Pandas sits at the extremes with two best and two worst cases, while DuckDB stays consistently in the middle across all queries. As a result, DuckDB proves to be the most balanced choice, delivering consistently fast performance with moderate memory use. Pandas shows extremes, sometimes the fastest, sometimes heavy, while SQLite struggles on speed and often lands on the inefficient side for memory.
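For a compact tabular view of the same numbers, a sketch reusing the all_results frame built above pivots them into engines-by-query tables:
# One row per query, one column per engine
summary_time = all_results.pivot(index="query", columns="engine", values="time")
summary_memory = all_results.pivot(index="query", columns="engine", values="memory")
print(summary_time.round(4))
print(summary_memory.round(1))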
Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.