Skip to content

ENH: Integration with DuckDB #45678

Open
Open
@pdet

Description

@pdet

Is your feature request related to a problem?

DuckDB supports the consumption and production of data frames. In this issue, I discuss how we could leverage these on two different ends.

  1. Loading data frames as DuckDB Tables:
    The to_sql() function can be extended to leverage the consumption to efficiently load data frames as tables in DuckDB (See Adding DuckDB as a SQL Database for the to_sql() function #45675).

  2. Run SQL on data frames:

Another possibility that I would like to propose, is to use the DuckDB engine to directly run queries on a data frame. Since DuckDB is able to consume and produce those, a user could get a data frame from running SQL directly on that data frame.

e.g.,

d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)

d = {'col1': [2, 3]}
df_2 = pd.DataFrame(data=d)

df_result = pandas.sql("select df_1.col2 from df_1 inner join df_2 on (df_1.col1 = df_2.col2)")

This should be possible since it would run Duckdb's default connection under the hood. With the advantages of consuming/producing df, directly running SQL, and parallel query execution.

If this is something that the core-dev team/ pandas community would be interested in. I'm happy to give a go to a PR :-)

Additional context

DuckDB's blogpost of SQL on Pandas, including benchmarks. https://duckdb.org/2021/05/14/sql-on-pandas.html

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions