We benchmarked DuckDB, SQLite, and Pandas on 1M rows: here's what happened

by SkillAiNest

DuckDB vs SQLite vs Pandas
Image by the author

. Introduction

There are numerous tools for working with data today. They all claim – of course they do – to be the best and the right choice for you. But are they? There are two important requirements these tools should satisfy: they should handle everyday data analysis with ease, and they should hold up under the pressure of large datasets.

To determine the best tool among DuckDB, SQLite, and Pandas, we tested them under these conditions.

First, we gave them everyday analytics tasks: summing values, grouping by category, filtering with conditions, and multi-field aggregation. This mirrors how analysts actually work with real datasets, as opposed to scenarios designed to show off a single tool's best features.

Second, we ran these operations on a Kaggle dataset with more than 1 million rows. This is a realistic tipping point – small enough to run on a single machine, yet large enough that memory pressure and query speed start to show clear differences between the tools.

Let's see how the testing went.

. The dataset we used

!! Dataset overview

We used a bank dataset from Kaggle. This dataset contains more than 1 million rows across five columns:

  • Date: the date the transaction occurred
  • Domain: business category or type (retail, restaurant)
  • Location: geographical area (Goa, Mathura)
  • Value: the transaction amount
  • Transaction_count: the total number of transactions that day

The dataset is synthetically generated. Although it does not fully resemble real-life data, its size and structure are sufficient for testing and comparing performance differences between the tools.
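
As a rough illustration, a synthetic frame with the same shape and column names can be generated like this. This is only a sketch: the column names follow the code later in the article, while the category and location values below are invented placeholders, not the dataset's actual ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000

# Hypothetical generator mirroring the dataset's shape and columns
df_synth = pd.DataFrame({
    "Date": pd.Timestamp("2022-01-01")
            + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "Domain": rng.choice(["RETAIL", "RESTAURANT", "INVESTMENT"], n),
    "Location": rng.choice(["Goa", "Mathura", "Delhi"], n),
    "Value": rng.integers(100, 100_000, n),
    "Transaction_count": rng.integers(1, 500, n),
})
print(df_synth.shape)
```

A frame like this is enough to reproduce the benchmarks below if you cannot download the original file.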

!! A peek at the data with Pandas

We used Pandas to load the dataset into a Jupyter notebook and check its general structure, dimensions, and missing values. Here is the code.

import pandas as pd
df = pd.read_excel('bankdataset.xlsx')

print("Dataset shape:", df.shape)

df.head()

Here is the output.

If you want a quick reference for common operations when exploring datasets, see this handy Pandas cheat sheet.
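
Beyond shape and head, a couple of standard Pandas calls cover the structure and missing-value checks mentioned above. A minimal sketch, shown on a tiny stand-in frame rather than the actual dataset:

```python
import pandas as pd

# Stand-in frame; the same calls work on the full dataset
sample_df = pd.DataFrame({"Domain": ["RETAIL", "RESTAURANT"],
                          "Value": [120, None]})

sample_df.info()                  # dtypes and non-null counts per column
print(sample_df.isna().sum())     # missing values per column
print("Shape:", sample_df.shape)  # (rows, columns)
```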

Before benchmarking, let's set up the environment.

. Setting up a fair testing environment

All three tools – DuckDB, SQLite, and Pandas – were installed and run in the same Jupyter notebook environment to keep the test fair. This ensured that runtime conditions and memory usage remained consistent across runs.

First, we installed and loaded the necessary packages.

Here are the tools we need:

  • Pandas: for standard DataFrame operations
  • DuckDB: for executing SQL directly on DataFrames
  • sqlite3: an embedded SQL database engine
  • time: to capture query execution timing
  • memory_profiler: to measure memory allocation
# Install if any of them are not in your environment
!pip install duckdb --quiet

import pandas as pd
import duckdb
import sqlite3
import time
from memory_profiler import memory_usage

Now let's get the data into a form that all three tools can share.

!! Loading data into pandas

We will load the dataset once with Pandas, then share or register it with DuckDB and SQLite.

df = pd.read_excel('bankdataset.xlsx')

df.head()

Here is the output to confirm.

!! Registering data with DuckDB

DuckDB can query Pandas DataFrames directly. You don't need to change anything – just register and query. Here is the code.

# Register DataFrame as a DuckDB table
duckdb.register("bank_data", df)

# Query via DuckDB
duckdb.query("SELECT * FROM bank_data LIMIT 5").to_df()

Here is the output.

!! Preparing data for SQLite

Since SQLite cannot read Excel files directly, we started by writing the Pandas DataFrame into an in-memory database. Then we ran a simple query to verify the data.

conn_sqlite = sqlite3.connect(":memory:")

df.to_sql("bank_data", conn_sqlite, index=False, if_exists="replace")

pd.read_sql_query("SELECT * FROM bank_data LIMIT 5", conn_sqlite)

Here is the output.

. How we benchmarked the tools

We ran the same four queries on DuckDB, SQLite, and Pandas to compare their performance. Each query was designed to represent a common analytical task, mirroring how data analysis is applied in the real world.

!! Ensuring a consistent setup

All three tools worked from the same in-memory dataset.

  • Pandas queried the DataFrame directly
  • DuckDB executed SQL queries directly against the DataFrame
  • SQLite stored a copy of the DataFrame in an in-memory database and ran SQL queries against it

This setup ensured that all three tools used the same data and ran under the same system conditions.

!! Measuring execution time

To track query duration, the time module wrapped each query in a simple start/stop timer. Only query execution time was recorded; data loading and setup steps were excluded.
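
The timer pattern used in every benchmark cell below can be condensed into a small helper. A sketch only; the article's cells inline the same start/stop calls instead of using a function:

```python
import time

def time_query(fn):
    """Return (result, elapsed_seconds) for a zero-argument query function."""
    start = time.time()              # time.perf_counter() gives finer resolution
    result = fn()
    elapsed = time.time() - start
    return result, elapsed

# Example: time a trivial computation
_, secs = time_query(lambda: sum(range(1_000_000)))
print(f"{secs:.4f} s")
```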

!! Tracking memory usage

Alongside processing time, memory usage indicates how well each engine copes with large datasets.

If desired, memory usage can be sampled before and after each query to estimate the extra RAM consumed.
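
As one stdlib-only alternative for such sampling, tracemalloc can report peak Python-level allocations around a call. Note this is a rougher proxy than the process-RSS readings that memory_profiler gives, since it misses allocations made outside the Python allocator:

```python
import tracemalloc

def run_with_alloc(fn):
    """Return (result, peak_python_alloc_mb) using the stdlib tracemalloc."""
    tracemalloc.start()
    result = fn()
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, peak / 1024 / 1024

res, peak_mb = run_with_alloc(lambda: [0] * 1_000_000)
print(f"peak allocations: {peak_mb:.1f} MB")
```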

!! Benchmark queries

We tested each engine on the same four everyday analytics tasks:

  1. Total transaction value: summing a numeric column
  2. Group by domain: counting aggregated transactions
  3. Filter by location: filtering rows on a condition before aggregating
  4. Group by domain and location: multi-field aggregation with an average
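
The per-query bookkeeping repeated in the sections below can be sketched as one helper. This is the timing side only, with the memory sampling omitted; the engine and query names here are just example inputs:

```python
import time

def benchmark(engine, query_name, fn, results):
    """Run fn once and record engine, query name, and elapsed time."""
    start = time.time()
    fn()
    results.append({
        "engine": engine,
        "query": query_name,
        "time": round(time.time() - start, 4),
    })

results = []
benchmark("Pandas", "Total transaction value", lambda: sum(range(10)), results)
print(results)
```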

. Benchmark results

!! Query 1: Total transaction value

Here we measure how Pandas, DuckDB, and SQLite perform when summing the Value column of the dataset.

!! Pandas performance

We calculate the total transaction value using .sum() on the Value column. Here is the code.

pandas_results = []

def pandas_q1():
    return df['Value'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
pandas_results

Here is the output.

!! DuckDB performance

We calculate the total transaction value using a full-column SQL aggregation. Here is the code.

duckdb_results = []

def duckdb_q1():
    return duckdb.query("SELECT SUM(value) FROM bank_data").to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
duckdb_results

Here is the output.

!! SQLite performance

We sum the total transaction value over the value column. Here is the code.

sqlite_results = []

def sqlite_q1():
    return pd.read_sql_query("SELECT SUM(value) FROM bank_data", conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
sqlite_results

Here is the output.

!! Overall performance analysis

Let's now compare execution time and memory usage. Here is the code.

import matplotlib.pyplot as plt

all_q1 = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

all_q1.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
all_q1.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")

plt.tight_layout()
plt.show()

Here is the output.

Pandas is the fastest and most memory-efficient here, completing almost instantly with barely any extra RAM. DuckDB is slightly slower and uses more memory but remains efficient, while SQLite is the slowest and the heaviest in memory consumption.

!! Query 2: Group by domain

Here we measure performance when grouping transactions by Domain and summing their counts.

!! Pandas performance

We calculate the total transaction count per domain using .groupby() on the Domain column.

def pandas_q2():
    return df.groupby('Domain')['Transaction_count'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain"]

Here is the output.

!! DuckDB performance

We calculate the total transaction count per domain using SQL GROUP BY on the domain column.

def duckdb_q2():
    return duckdb.query("""
        SELECT domain, SUM(transaction_count)
        FROM bank_data
        GROUP BY domain
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in duckdb_results if p["query"] == "Group by domain"]

Here is the output.

!! SQLite performance

We calculate the total transaction count per domain using SQL GROUP BY on the in-memory table.

def sqlite_q2():
    return pd.read_sql_query("""
        SELECT domain, SUM(transaction_count) AS total_txn
        FROM bank_data
        GROUP BY domain
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in sqlite_results if p["query"] == "Group by domain"]

Here is the output.

!! Overall performance analysis

Let's now compare execution time and memory usage. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

groupby_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
                   if "Group by" in r["query"]]

df_groupby = pd.DataFrame(groupby_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

df_groupby.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_groupby.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")

plt.tight_layout()
plt.show()

Here is the output.

DuckDB is the fastest, Pandas trades a little extra time for lower memory use, while SQLite is the slowest and the most memory-hungry.

!! Query 3: Filter by location (Goa)

Here we measure how Pandas, DuckDB, and SQLite perform when filtering the dataset on Location = 'Goa' and summing transaction values.

!! Pandas performance

We filter rows where Location == 'Goa' and sum their values. Here is the code.

def pandas_q3():
    return df[df['Location'] == 'Goa']['Value'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in pandas_results if p["query"] == "Filter by location"]

Here is the output.

!! DuckDB performance

We filter transactions where location = 'Goa' and calculate their total value. Here is the code.

def duckdb_q3():
    return duckdb.query("""
        SELECT SUM(value)
        FROM bank_data
        WHERE location = 'Goa'
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in duckdb_results if p["query"] == "Filter by location"]

Here is the output.

!! SQLite performance

We filter transactions where location = 'Goa' and sum their values. Here is the code.

def sqlite_q3():
    return pd.read_sql_query("""
        SELECT SUM(value) AS total_value
        FROM bank_data
        WHERE location = 'Goa'
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in sqlite_results if p["query"] == "Filter by location"]

Here is the output.

!! Overall performance analysis

Let's now compare execution time and memory usage. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

filter_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
                  if r["query"] == "Filter by location"]

df_filter = pd.DataFrame(filter_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

df_filter.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_filter.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")

plt.tight_layout()
plt.show()

Here is the output.

DuckDB is the fastest and most efficient, Pandas is slower with higher memory use, and SQLite is the slowest but lighter on memory.

!! Query 4: Group by domain and location

!! Pandas performance

We calculate the average transaction value grouped by both Domain and Location. Here is the code.

def pandas_q4():
    return df.groupby(['Domain', 'Location'])['Value'].mean()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in pandas_results if p["query"] == "Group by domain & location"]

Here is the output.

!! DuckDB performance

We calculate the average transaction value grouped by both domain and location. Here is the code.

def duckdb_q4():
    return duckdb.query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in duckdb_results if p["query"] == "Group by domain & location"]

Here is the output.

!! SQLite performance

We calculate the average transaction value grouped by both domain and location. Here is the code.

def sqlite_q4():
    return pd.read_sql_query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in sqlite_results if p["query"] == "Group by domain & location"]

Here is the output.

!! Overall performance analysis

Let's now compare execution time and memory usage. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

gdl_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
               if r["query"] == "Group by domain & location"]

df_gdl = pd.DataFrame(gdl_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

df_gdl.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False,
            title="Execution Time (s)")
df_gdl.plot(x="engine", y="memory", kind="barh", ax=axes[1], legend=False,
            title="Memory Usage (MB)", color="salmon")

plt.tight_layout()
plt.show()

Here is the output.

DuckDB handles the multi-field grouping fastest with moderate memory use, Pandas is slower with higher memory use, and SQLite is the slowest with substantial memory consumption.

. Final comparison across all queries

We have compared the three engines on speed and memory for each query. Let's now look at execution time across all queries together. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

measure_order = [
    "Total transaction value",
    "Group by domain",
    "Filter by location",
    "Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()

for i, q in enumerate(measure_order):
    d = all_results[all_results["query"] == q]
    axes[i].barh(d["engine"], d["time"],
                 color=[engine_colors[e] for e in d["engine"]])
    for y, v in enumerate(d["time"]):
        axes[i].text(v, y, f" {v:.3f}", va="center")
    axes[i].set_title(q, fontsize=10)
    axes[i].set_xlabel("Seconds")

fig.suptitle("Per-Measure Comparison — Execution Time", fontsize=14)
plt.tight_layout()
plt.show()

Here is the output.

This chart shows that DuckDB maintains the lowest execution time on almost every query, except total transaction value, where Pandas edges it out. SQLite is the slowest across the board by a wide margin. Next, let's check memory. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

measure_order = [
    "Total transaction value",
    "Group by domain",
    "Filter by location",
    "Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()

for i, q in enumerate(measure_order):
    d = all_results[all_results["query"] == q]
    axes[i].barh(d["engine"], d["memory"],
                 color=[engine_colors[e] for e in d["engine"]])
    for y, v in enumerate(d["memory"]):
        axes[i].text(v, y, f" {v:.1f}", va="center")
    axes[i].set_title(q, fontsize=10)
    axes[i].set_xlabel("MB")

fig.suptitle("Per-Measure Comparison — Memory Usage", fontsize=14)
plt.tight_layout()
plt.show()

Here is the output.

This chart shows that SQLite swings between the best and worst memory usage, Pandas sits at the extremes with two best and two worst cases, while DuckDB stays consistently in the middle across all queries. Overall, DuckDB proves to be the most balanced choice, delivering consistently fast performance with moderate memory use. Pandas shows extremes, sometimes the fastest and sometimes heavy, while SQLite struggles on speed and often lands on the inefficient side for memory.

Nate Rosidi is a data scientist and in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
