How DuckDB Complements Pandas for Large-Scale Analytics

Published on December 5, 2025

Introduction

Pandas has been the standard choice for data manipulation in Python for more than a decade. If you’ve been cleaning features for a machine learning model, exploring a CSV in a notebook, or writing a lightweight ETL pipeline, it’s likely that your code has revolved around the Pandas DataFrame. It has occupied a central place in Python’s data landscape.

However, larger datasets and more compute-intensive analytical workloads have long been stretching Pandas’ in-memory, single-threaded limits. Enter DuckDB, an in-process SQL OLAP (Online Analytical Processing) engine whose “SQLite for analytics” tagline reflects its promise of:

  • High-performance, scalable analytical queries
  • Native Parquet and Arrow support
  • Parallel execution
  • Handling of datasets larger than memory

This article presents these two tools through a practitioner-focused, on-the-ground comparison. We examine how the tools differ in architecture, performance, scalability, and developer experience. We show where DuckDB can meaningfully replace Pandas and where Pandas still has the edge.

Key Takeaways

  • DuckDB is a complement to Pandas, not a replacement. DuckDB was designed for large-scale analytical workloads, while Pandas remains a great choice for small to medium-sized datasets, feature engineering, and general Python-native data manipulation tasks.
  • Architecture fundamentally differs: Pandas is a Python library that implements an in-memory DataFrame object; DuckDB is a vectorized, SQL-first, columnar OLAP engine with support for parallel execution and querying data larger than memory.
  • Performance and scalability favor DuckDB, especially on large Parquet/Arrow datasets. Query execution can be orders of magnitude faster than Pandas, thanks to projection and filter pushdown, multi-threading, and disk-backed query execution.
  • A key part of this is DuckDB’s ability to integrate with hybrid workflows: DuckDB can query Pandas, Polars, and Arrow DataFrames directly, join them with large external Parquet/CSV/SQLite files, and return results in a variety of formats. This enables a seamless “SQL inside your notebook” workflow.
  • Use the right tool for the job: DuckDB is ideal for OLAP-style queries, heavy joins, and large datasets; Pandas shines in interactive data cleaning, custom logic, tight integration with ML/visualization libraries, and scenarios where everything fits comfortably in RAM.

What Are Pandas and DuckDB?

Understanding what each tool represents will better prepare you for comparing performance and workflow capabilities. Pandas is a DataFrame library that is very tightly coupled to Python. DuckDB, on the other hand, is an embedded SQL analytics engine that runs in your Python process.

Pandas as a Python DataFrame Library

Pandas is the de facto standard library for data wrangling in Python. It brought the DataFrame to Python back in 2008 and has since become one of the key tools in the data science ecosystem. A Pandas DataFrame is a 2D, in-memory tabular data structure with labeled row and column axes, providing fast, indexed access through a Python API. Operations on Pandas DataFrames (e.g., slicing, filtering, transformations) are backed by C/NumPy routines for vectorized execution.

Key characteristics of Pandas:

  • In-memory operation: Pandas DataFrames live in RAM. This allows very fast random access to small data, but all data must fit in memory. As a dataset grows to tens of millions of rows or several gigabytes, you quickly hit significant slowdowns and, eventually, out-of-memory errors.
  • Rich API and ecosystem: Pandas offers a rich, Pythonic API for filtering, aggregating, merging, and handling time series, among other features. It’s very flexible – you can also write custom Python code for complex logic in the DataFrame context. Pandas also provides great interoperability with other libraries (visualization libraries, scikit-learn, etc.), so it’s easy to pass a DataFrame along and either visualize the data or feed it to an ML model (see the short example after this list).
  • Ease of use: Most data scientists find Pandas intuitive. The code is plain Python, so it’s relatively easy to debug and iterate on, and basic operations have a shallow learning curve.
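
As a quick illustration of that style of API, here is a toy filter–aggregate–sort chain in Pandas (the data is made up for the example, not taken from any particular workflow):

import pandas as pd

# Toy sales data
sales = pd.DataFrame(
    {
        "region": ["EU", "EU", "US", "US", "APAC"],
        "amount": [120.0, 80.0, 200.0, 150.0, 90.0],
    }
)

# Filter, aggregate, and sort in one short method chain
summary = (
    sales[sales["amount"] > 100]
    .groupby("region", as_index=False)["amount"]
    .sum()
    .sort_values("amount", ascending=False)
)
print(summary)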

DuckDB as an In-Process SQL Engine

DuckDB is sometimes described as the “SQLite for analytics”. Like Pandas, it lets you query data inside your Python process. Under the hood, however, DuckDB is very different:

  • Columnar, vectorized engine: DuckDB stores data in a columnar format and executes queries in a vectorized model, processing chunks of column values (“vectors”) at a time.
  • SQL-centric interface: You interact with DuckDB primarily through SQL (SELECT, JOIN, GROUP BY, and so on). It’s a full-featured SQL engine with a rich dialect similar to PostgreSQL’s. Operations that require many lines of Pandas code can often be expressed as a single SQL query in DuckDB.
  • Disk-based and memory-friendly: DuckDB can query data directly from disk-based columnar formats (Parquet, CSV, JSON, etc.) without having to load it into memory first. In practice, this means you can point DuckDB at a 10 GB Parquet file and run SELECT * FROM 'file.parquet' LIMIT 10 almost instantly, with very little memory overhead (see the sketch after this list).
  • Embedded OLAP database features: Since DuckDB is a database engine, it has the features you would expect from one: concurrent queries, transactions, advanced SQL features (window functions, CTEs, etc.), and parallel execution.
  • Lightweight and zero-dependency: DuckDB is a ~10 MB package with zero external dependencies and no server to run – you pip install duckdb and you’re good to go; everything it needs is statically linked in.
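
To make the disk-based access concrete, here is a minimal sketch; the data/big_events.parquet path is just a placeholder for any large Parquet file you have on disk:

import duckdb

# Preview a (potentially huge) Parquet file without loading it into pandas first.
# DuckDB reads only the row groups and columns it needs to answer the query.
preview = duckdb.sql("""
    SELECT *
    FROM 'data/big_events.parquet'
    LIMIT 10
""").df()

print(preview)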

Architectural differences

Comparing DuckDB and pandas reveals several critical architectural differences:

Feature | pandas (DataFrame library) | DuckDB (in-process SQL engine)
Core abstraction | Python DataFrame and Series objects with an imperative API. | Relational tables queried with declarative SQL; exposed through a Python API.
Storage format | In-memory DataFrame with columns backed by NumPy/Arrow arrays; not laid out for large analytical scans, and some operations fall back to row-wise Python. | Columnar storage: data stored in contiguous column blocks, optimized for scans and aggregations.
Execution model | Eager execution via vectorized NumPy/Arrow operations and Python loops; single-threaded by default (multi-threaded only in specific cases). | Vectorized query engine that processes batches of values (vectors) per operator; C++ engine with multi-threading across all available CPU cores.
Memory & scale | Entire DataFrame (or groups in a group-by) must fit in RAM; out-of-core patterns require manual workarounds or other tools. | Designed for larger-than-memory analytics; can spill to disk transparently and stream data from columnar files (e.g., Parquet).
Concurrency & transactions | No ACID transaction layer; operations mutate in-memory objects within a single Python process. | ACID transactions with MVCC; safe, transactional updates and consistent reads, even when persisting to a .duckdb file.
Query language & ergonomics | Python DataFrame API: method chains (.groupby(), .agg(), .merge(), etc.); very flexible but can become verbose for complex pipelines. | Standard SQL with joins, window functions, CTEs, and macros; often more concise and readable for complex analytics if you are comfortable with SQL.
Integration with DataFrames | Operates directly on in-memory DataFrames; ecosystem built around the pandas API. | Can query pandas/Polars DataFrames via replacement scans (zero-copy where possible) and return results as pandas, Polars, Arrow, or NumPy.
File & external data access | Relies on separate Python libraries (e.g., pyarrow, boto3) for Parquet, cloud storage, and other formats; I/O patterns are usually explicit Python code. | Built-in connectors and extensions: query CSV/Parquet/JSON and many other formats, including over S3 and HTTP, directly from SQL without extra Python packages.
Ecosystem & tooling | Large, mature ecosystem; first-class support in many visualization and ML libraries (e.g., Matplotlib, Seaborn, scikit-learn, statsmodels). | Growing ecosystem focused on analytics; integrates with tools via Arrow and connectors, and can act as a fast local engine under BI tools, dbt, or custom applications.
Typical sweet spot | Interactive analysis, feature engineering, and small-to-medium datasets that fit comfortably in memory; quick experimentation in notebooks. | Heavy analytical queries (joins, aggregations, window functions) on large local datasets; “SQL-first” workflows, local OLAP, and offloading expensive operations from pandas.

DuckDB is designed as a columnar database, so an aggregation such as SELECT SUM(value) reads only the relevant column instead of entire rows. Operations are applied to large blocks of values at once using vectorized execution, which reduces per-operation overhead. Pandas, by contrast, executes each operation eagerly over whole in-memory columns, materializing intermediate results and running on a single thread by default (and dropping to row-by-row Python for apply-style logic). The end result is that DuckDB is often faster and more memory-efficient for operations such as aggregation, filtering, and joins, particularly on larger datasets.
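
A rough sketch of what this means in practice, assuming a hypothetical data/sales.parquet file with an amount column:

import duckdb
import pandas as pd

# DuckDB reads only the 'amount' column from the file (projection pushdown)
total_duckdb = duckdb.sql(
    "SELECT SUM(amount) AS total FROM 'data/sales.parquet'"
).df()["total"].iloc[0]

# Pandas loads the whole file into memory before summing,
# unless you remember to pass columns=['amount'] yourself
total_pandas = pd.read_parquet("data/sales.parquet")["amount"].sum()

print(total_duckdb, total_pandas)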

Benchmarks

The following is a recent benchmark comparison between DuckDB, Polars, and Pandas, run on a MacBook Pro M2 with 16 GB of RAM. The synthetic benchmark dataset contains 1 million persons and 10 million position records. DuckDB was the fastest of the three, completing in 3.84 seconds on average. Polars came in a close second at an average of 5.77 seconds. Pandas trailed both, with an average runtime of 19.57 seconds, over five times slower than DuckDB.

This result can inform discussions about the use cases and limitations of Pandas, especially for large datasets or performance-sensitive data processing tasks. It also highlights the growing importance of newer, more specialized engines such as DuckDB and Polars, which can be better suited to certain workloads and offer significant performance advantages over Pandas.
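
The exact numbers depend on the benchmark, so treat them as indicative. If you want to sanity-check the trend on your own machine, a minimal timing sketch on synthetic data (not the benchmark above) could look like this:

import time

import duckdb
import numpy as np
import pandas as pd

# Synthetic data: 10 million rows with a grouping key and a value column
n = 10_000_000
df = pd.DataFrame(
    {
        "key": np.random.randint(0, 1_000, size=n),
        "value": np.random.rand(n),
    }
)

start = time.perf_counter()
pandas_result = df.groupby("key", as_index=False)["value"].sum()
print(f"pandas groupby: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
duckdb_result = duckdb.sql(
    "SELECT key, SUM(value) AS value FROM df GROUP BY key"
).df()
print(f"DuckDB groupby: {time.perf_counter() - start:.2f}s")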

Integrating DuckDB and pandas

One of DuckDB’s biggest advantages is that it does not force you to choose between SQL and pandas. Instead, it plugs into your existing Python workflow:

  • Run SQL queries directly against in-memory pandas DataFrames.
  • Fetch the results back as Pandas (or Polars / Arrow) objects.
  • Mix external files (CSV/Parquet/JSON) with DataFrames in the same query.
  • The same style of integration is possible with Polars and Apache Arrow.

DuckDB becomes a “SQL engine inside your notebook” rather than a separate system to manage.

1. Querying pandas DataFrames with SQL (replacement scans)

With DuckDB’s replacement scan mechanism, you can treat a pandas DataFrame like a SQL table by just referencing the variable name in the query. No explicit registration required.

Example: filter and transform a pandas DataFrame with SQL

# Install first (from your shell): pip install duckdb
import duckdb
import pandas as pd

# A small DataFrame in pandas
df = pd.DataFrame(
    {
        "city": ["Paris", "Berlin", "Lagos", "Delhi"],
        "temp_c": [21, 19, 30, 35],
    }
)

# DuckDB sees 'df' as a table via replacement scans
result = duckdb.sql("""
    SELECT
        city,
        temp_c,
        temp_c * 9.0/5.0 + 32 AS temp_f
    FROM df
    WHERE temp_c >= 25
""").df()

print(result)

Key points:

  • df should be a plain pandas DataFrame.
  • duckdb.sql(...) implicitly reads df as a table called df.
  • DuckDB scans the existing DataFrame in memory (zero-copy where possible) instead of making a full copy.
  • The result is materialized as another pandas DataFrame at the end.

You may also use an explicit connection and registration if you want more control:

con = duckdb.connect()
con.register("cities", df)

hot_cities = con.sql("""
    SELECT city, temp_c
    FROM cities
    WHERE temp_c >= 25
""").df()

2. Fetching results back as pandas / Polars / Arrow

After running a SQL query, you typically want to continue working in your favorite Python data library. For this purpose, DuckDB’s Python API exposes convenient conversion methods:

  • .df() → pandas DataFrame
  • .pl() → Polars DataFrame
  • .arrow() → Apache Arrow Table
  • .fetchnumpy() → NumPy arrays

Example: one SQL query, multiple result formats

import duckdb
import pandas as pd
import polars as pl
import pyarrow as pa

# Simple demo dataset
df = pd.DataFrame(
    {
        "product": ["A", "A", "B", "B"],
        "amount":  [100, 150,  80, 120],
    }
)

rel = duckdb.sql("""
    SELECT
        product,
        SUM(amount) AS total_amount
    FROM df
    GROUP BY product
    ORDER BY total_amount DESC
""")

# 1) Get results as pandas
pandas_result = rel.df()

# 2) Get results as Polars
polars_result = rel.pl()

# 3) Get results as an Arrow table
arrow_result = rel.arrow()

print("Pandas:\n", pandas_result)
print("Polars:\n", polars_result)
print("Arrow schema:\n", arrow_result.schema)

In a real workflow, you might:

  • Use .df() if you are feeding the results into scikit-learn or other Python libraries for plotting.
  • Use .pl() if your downstream processing code is also written in Polars.
  • Use .arrow() when you need to pass your data into systems that speak Arrow (or for zero-copy interchange).

3. Hybrid workflows: joining files and DataFrames in one query

Where DuckDB really excels, however, is in hybrid workflows that involve:

  • Large datasets in file-based storage (for example, Parquet on disk or in object storage).
  • Smaller reference tables loaded in memory via pandas.

You can join them in a single SQL statement, without having to write custom ETL steps.

Example: joining a large Parquet fact table with a small pandas dimension table

For example, you might have:

  • data/orders.parquet – tens of millions of orders (too large to work with comfortably in pandas).
  • A small pandas DataFrame with customer segments, computed from a CRM export.

import duckdb
import pandas as pd
from pathlib import Path

# 1. Ensure the 'data' directory exists
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# 2. Create a small demo "orders" dataset in pandas
orders = pd.DataFrame(
    {
        "order_id":    [101, 102, 103, 104, 105],
        "customer_id": [1,   2,   1,   3,   2],
        "order_date":  pd.to_datetime(
            ["2025-01-02", "2025-01-03", "2025-01-05", "2025-02-01", "2025-02-10"]
        ),
        "order_amount": [200.0, 150.0, 300.0, 80.0, 120.0],
    }
)

parquet_path = data_dir / "orders.parquet"

# 3. Save as Parquet (requires pyarrow or fastparquet installed)
orders.to_parquet(parquet_path, index=False)

print(f"Saved demo file at: {parquet_path.resolve()}")

# 4. Create a small customer segments DataFrame
customer_segments = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "segment":     ["Enterprise", "SMB", "Consumer"],
    }
)

# 5. Connect DuckDB and register the dimension table
con = duckdb.connect()
con.register("customer_segments", customer_segments)

# 6. Run the hybrid query joining Parquet + pandas
query = f"""
    SELECT
        o.customer_id,
        s.segment,
        SUM(o.order_amount) AS total_spend
    FROM '{parquet_path.as_posix()}' AS o
    JOIN customer_segments AS s
      ON o.customer_id = s.customer_id
    WHERE o.order_date >= DATE '2025-01-01'
    GROUP BY o.customer_id, s.segment
    ORDER BY total_spend DESC
"""

top_customers = con.sql(query).df()
print(top_customers)

This pattern illustrates the “best-of-both-worlds” approach:

  • Pandas for ad-hoc data munging, feature engineering, or building lookup tables.
  • DuckDB for efficiently scanning and aggregating large columnar files.
  • All within a single Python process/notebook with no Spark, Dask, or warehouse to spin up.

4. Replacement scans with Polars and Arrow

DuckDB plays well with pandas, but that is not the end of the story: it also integrates very nicely with Polars and Apache Arrow, which is especially useful if you are working with a heterogeneous stack.

There are two common scenarios:

  1. Pass a Polars DataFrame or Arrow table to DuckDB as a query input.
  2. Retrieve DuckDB query results as Polars or Arrow.

Example A: querying a Polars DataFrame with SQL

import duckdb
import polars as pl

# Polars DataFrame
pl_df = pl.DataFrame(
    {
        "id":    [1, 2, 3, 4],
        "value": [10, 20, 30, 40],
    }
)

# DuckDB can see 'pl_df' similarly to pandas via replacement scans
rel = duckdb.sql("""
    SELECT
        id,
        value,
        value * 2 AS value_x2
    FROM pl_df
    WHERE value >= 20
""")

# Get the result as Polars again
result_pl = rel.pl()
print(result_pl)

Example B: working with Arrow tables

import duckdb
import pyarrow as pa

# Create a small Arrow table
arrow_table = pa.table(
    {
        "event_type": ["click", "view", "click", "purchase"],
        "user_id":    [10, 10, 11, 10],
        "amount":     [None, None, None, 99.0],
    }
)

# Register the Arrow table explicitly (optional but clear)
con = duckdb.connect()
con.register("events", arrow_table)

# Run SQL on the Arrow table
agg = con.sql("""
    SELECT
        user_id,
        COUNT(*)                    AS num_events,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS revenue
    FROM events
    GROUP BY user_id
""").arrow()

print(agg)

In both cases:

  • You avoid unnecessary copies by staying within the Arrow/Polars memory layouts.
  • DuckDB acts as a high-performance SQL layer on top of your existing DataFrame/Arrow data.
  • You are free to convert between pandas, Polars, and Arrow at the edges of your pipeline as each step demands.

When DuckDB Can Replace Pandas (and When It Shouldn’t)

Should you drop Pandas entirely in favor of DuckDB? In some cases, the answer is yes, but not always. To clarify, let’s look at the types of problems DuckDB can (and should) solve, and those that are better left to Pandas:

Tool | Use case | When to prefer it / description
DuckDB | Huge datasets / “big data on a laptop” | Analyzing datasets too large for Pandas (tens of millions of rows, multi-GB files). Ideal for log files, telemetry, and other big tables where Pandas would require sampling/downcasting or fail. Often handles 100M+ rows on a typical 16 GB RAM laptop.
DuckDB | Analytical queries & OLAP-style reporting | Workflows that look like SQL: grouping, joining, filtering, aggregates, window functions. DuckDB can fully replace df.groupby().agg() + merge chains with concise SQL. Commonly used for reports, exploratory analysis, and transformations in notebooks.
DuckDB | Columnar file formats (Parquet / Arrow) | Pipelines where data is stored mainly as Parquet/Arrow. DuckDB can query these files in place without loading entire DataFrames, and can also write Parquet (e.g., COPY TO). Acts as a local query engine over “data lake” files, faster and more memory-efficient.
DuckDB | Interactive data exploration with SQL | Users who prefer exploring data via SQL (SELECT ... WHERE ... LIMIT 100). DuckDB allows iterative querying in notebooks/scripts, similar to a SQL IDE but local and offline. Can replace many Pandas-based exploratory steps for SQL-oriented users.
DuckDB | Memory-bound computations | Operations that are very memory-intensive in Pandas (large group-bys with many categories, multiple big joins, wide tables). DuckDB uses streaming aggregation and disk-spilling join strategies, making these workloads feasible where Pandas might blow up.
DuckDB | Production / data engineering pipelines | Recurring analytics in scripts/apps on large data. More robust and predictable at scale than Pandas, which may fail as data grows. Good for nightly jobs aggregating tens of millions of rows, batch ETL, or embedded analytics; usable from Python, R, Java, and more.
Pandas | Small data and quick scripts | Tiny or small datasets (e.g., a 5,000-row CSV) and quick one-off analyses. Loading and manipulating data in Pandas is near-instantaneous, and writing a short snippet is often faster than formulating a SQL query, especially for Python-fluent users.
Pandas | Complex data manipulation & feature engineering | Tasks needing arbitrary Python logic, external libraries, or iterative algorithms. Easy to generate new columns via loops, vectorization, or apply. Ideal for custom text processing, complex ranking rules, bespoke pivots, and advanced feature engineering.
Pandas | Ecosystem and library compatibility | Many libraries expect Pandas DataFrames or NumPy arrays (e.g., scikit-learn, Matplotlib, seaborn). Even if DuckDB prepares data, you typically end up in Pandas at the edges (modeling, visualization). Staying in Pandas can be simpler when every step needs it.
Pandas | Real-time interactive tweaks in notebooks | Very interactive data cleaning and inspection: df.head(), boolean filters, manual edits like df.iloc[5, 3] = None. Data is fully in memory and directly editable, making ad-hoc tweaks more convenient than re-running SQL queries.
Pandas | Team habits and existing codebase | Large existing Pandas codebase or a team skilled in Pandas but not SQL. Switching everything to DuckDB has a cost; incremental adoption (using DuckDB only for pain points) is usually wiser. Pandas’ maturity and familiarity remain strong advantages.

Pandas retains an edge for flexibility and for the long tail of small or medium data workloads where its overhead is unnoticeable, and its native Python integration is a benefit. DuckDB doesn’t aim to be a drop-in replacement for that; it’s designed for cases where Pandas struggles (large data, intensive analytics, SQL users).
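
In practice, “the right tool for the job” often means letting DuckDB do the heavy scan and handing only a small result back to pandas, or writing it straight back to Parquet. Here is a sketch of that pattern, assuming a hypothetical data/events.parquet file with an event_time column:

import duckdb

con = duckdb.connect()

# The heavy aggregation runs in DuckDB; only the small result lands in pandas
daily = con.sql("""
    SELECT
        CAST(event_time AS DATE) AS day,
        COUNT(*)                 AS events
    FROM 'data/events.parquet'
    GROUP BY day
    ORDER BY day
""").df()

# Or skip pandas entirely and write the aggregate back out as Parquet
con.sql("""
    COPY (
        SELECT CAST(event_time AS DATE) AS day, COUNT(*) AS events
        FROM 'data/events.parquet'
        GROUP BY day
    ) TO 'data/daily_events.parquet' (FORMAT PARQUET)
""")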

Limitations and considerations

Despite its advantages, DuckDB is not a silver bullet. Some limitations include:

  • Concurrency: DuckDB has a single-writer model. It supports concurrent readers but only one write transaction at a time. For multi-writer workloads or heavy OLTP, a server database such as PostgreSQL is more suitable (see the sketch after this list for working with a persistent DuckDB file).
  • Maturity of ecosystem: Pandas has existed since 2008 and has a well‑developed ecosystem of associated visualization and machine‑learning libraries. While DuckDB’s ecosystem is rapidly developing, some higher‑level convenience functions found in pandas are still missing.
  • Learning curve: The syntax of pandas may be more intuitive to analysts with no SQL background, but SQL has the benefits of long‑term stability, portability, and universal familiarity.
  • No built‑in plotting: DuckDB is a query engine. To plot results, you’ll still need to use pandas, matplotlib, or other libraries.
  • Experimental features: DuckDB is still actively being developed, and some functionality may be considered experimental (e.g., installing extensions, support for new file formats, etc.) and may change across versions.
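
If you do run into the memory or concurrency considerations above, the relevant knobs are easy to reach from Python. A small sketch, with the analytics.duckdb filename and the limits purely illustrative:

import duckdb

# A persistent database file; DuckDB allows many readers but only one writer at a time
con = duckdb.connect("analytics.duckdb")

# Cap memory and threads; DuckDB spills to temporary disk files when the limit is hit
con.sql("SET memory_limit = '4GB'")
con.sql("SET threads = 4")

con.sql("CREATE OR REPLACE TABLE metrics AS SELECT 42 AS answer")
print(con.sql("SELECT * FROM metrics").df())
con.close()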

Conclusion

DuckDB is not here to kill pandas; it is here to complement it. As data workloads outgrow a single in-memory DataFrame, DuckDB brings a modern, high-performance engine right into your Python workflow: it reads Parquet files, pushes filters down to storage, scales across all CPU cores, and works with datasets much larger than RAM. And it does all of this without deploying a server, changing your environment, or taking on the other hassles of distributed systems.

Pandas remains a great fit for day-to-day exploration, prototyping, or the kind of quick feature engineering that can benefit from Python’s flexibility. However, when the job requires scanning millions of rows or joining large Parquet files, or simply working with Arrow-backed datasets, DuckDB provides the speed and scale that pandas does not.

DuckDB gives data scientists the speed and power of a local analytical database from within a notebook, while integrating seamlessly with pandas, Polars, Arrow, and dbt.

In practice, the future is not “DuckDB versus pandas,” but DuckDB and pandas together. Each has a role to play in the modern analytics stack. Together, they compose a workflow that is fast, flexible, and fit for the scale of today’s data.
