Description
Is your feature request related to a problem?
DuckDB supports the consumption and production of data frames. In this issue, I discuss how we could leverage these on two different ends.
-
Loading data frames as DuckDB Tables:
Theto_sql()
function can be extended to leverage the consumption to efficiently load data frames as tables in DuckDB (See Adding DuckDB as a SQL Database for the to_sql() function #45675). -
Run SQL on data frames:
Another possibility that I would like to propose, is to use the DuckDB engine to directly run queries on a data frame. Since DuckDB is able to consume and produce those, a user could get a data frame from running SQL directly on that data frame.
e.g.,
d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)
d = {'col1': [2, 3]}
df_2 = pd.DataFrame(data=d)
df_result = pandas.sql("select df_1.col2 from df_1 inner join df_2 on (df_1.col1 = df_2.col2)")
This should be possible since it would run Duckdb's default connection under the hood. With the advantages of consuming/producing df, directly running SQL, and parallel query execution.
If this is something that the core-dev team/ pandas community would be interested in. I'm happy to give a go to a PR :-)
Additional context
DuckDB's blogpost of SQL on Pandas, including benchmarks. https://duckdb.org/2021/05/14/sql-on-pandas.html