From 8986592b478974f8bcaccbe9fecd68411332e285 Mon Sep 17 00:00:00 2001
From: Greg Reda
Date: Thu, 5 Dec 2013 20:30:26 -0600
Subject: [PATCH] DOC: SQL to pandas comparison (#4524)

---
 doc/source/comparison_with_sql.rst | 380 +++++++++++++++++++++++++++++
 doc/source/index.rst               |   1 +
 doc/source/merging.rst             |   5 +-
 doc/source/v0.13.0.txt             |   3 +
 4 files changed, 388 insertions(+), 1 deletion(-)
 create mode 100644 doc/source/comparison_with_sql.rst

diff --git a/doc/source/comparison_with_sql.rst b/doc/source/comparison_with_sql.rst
new file mode 100644
index 0000000000000..3d8b85e9460c4
--- /dev/null
+++ b/doc/source/comparison_with_sql.rst
@@ -0,0 +1,380 @@
+.. currentmodule:: pandas
+.. _compare_with_sql:
+
+Comparison with SQL
+********************
+Since many potential pandas users have some familiarity with
+`SQL <http://en.wikipedia.org/wiki/SQL>`_, this page is meant to provide some examples of how
+various SQL operations would be performed using pandas.
+
+If you're new to pandas, you might want to first read through :ref:`10 Minutes to Pandas <10min>`
+to familiarize yourself with the library.
+
+As is customary, we import pandas and numpy as follows:
+
+.. ipython:: python
+
+    import pandas as pd
+    import numpy as np
+
+Most of the examples will utilize the ``tips`` dataset found within the pandas tests. We'll read
+the data into a DataFrame called ``tips`` and assume we have a database table of the same name and
+structure.
+
+.. ipython:: python
+
+    url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
+    tips = pd.read_csv(url)
+    tips.head()
+
+SELECT
+------
+In SQL, selection is done using a comma-separated list of columns you'd like to select (or a ``*``
+to select all columns):
+
+.. code-block:: sql
+
+    SELECT total_bill, tip, smoker, time
+    FROM tips
+    LIMIT 5;
+
+With pandas, column selection is done by passing a list of column names to your DataFrame:
+
+.. ipython:: python
+
+    tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
+
+Calling the DataFrame without the list of column names would display all columns (akin to SQL's
+``*``).
+
+WHERE
+-----
+Filtering in SQL is done via a WHERE clause.
+
+.. code-block:: sql
+
+    SELECT *
+    FROM tips
+    WHERE time = 'Dinner'
+    LIMIT 5;
+
+DataFrames can be filtered in multiple ways; the most intuitive of which is using
+:ref:`boolean indexing <indexing.boolean>`.
+
+.. ipython:: python
+
+    tips[tips['time'] == 'Dinner'].head(5)
+
+The above statement is simply passing a ``Series`` of True/False objects to the DataFrame,
+returning all rows with True.
+
+.. ipython:: python
+
+    is_dinner = tips['time'] == 'Dinner'
+    is_dinner.value_counts()
+    tips[is_dinner].head(5)
+
+Just like SQL's OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and &
+(AND).
+
+.. code-block:: sql
+
+    -- tips of more than $5.00 at Dinner meals
+    SELECT *
+    FROM tips
+    WHERE time = 'Dinner' AND tip > 5.00;
+
+.. ipython:: python
+
+    # tips of more than $5.00 at Dinner meals
+    tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
+
+.. code-block:: sql
+
+    -- tips by parties of at least 5 diners OR bill total was more than $45
+    SELECT *
+    FROM tips
+    WHERE size >= 5 OR total_bill > 45;
+
+.. ipython:: python
+
+    # tips by parties of at least 5 diners OR bill total was more than $45
+    tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
+
+NULL checking is done using the :meth:`~pandas.Series.notnull` and :meth:`~pandas.Series.isnull`
+methods.
+
+.. ipython:: python
+
+    frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
+                          'col2': ['F', np.NaN, 'G', 'H', 'I']})
+    frame
+
+Assume we have a table of the same structure as our DataFrame above. We can see only the records
+where ``col2`` IS NULL with the following query:
+
+.. code-block:: sql
+
+    SELECT *
+    FROM frame
+    WHERE col2 IS NULL;
+
+.. ipython:: python
+
+    frame[frame['col2'].isnull()]
+
+Getting items where ``col1`` IS NOT NULL can be done with :meth:`~pandas.Series.notnull`.
+
+.. code-block:: sql
+
+    SELECT *
+    FROM frame
+    WHERE col1 IS NOT NULL;
+
+.. ipython:: python
+
+    frame[frame['col1'].notnull()]
+
+
+GROUP BY
+--------
+In pandas, SQL's GROUP BY operations are performed using the similarly named
+:meth:`~pandas.DataFrame.groupby` method. :meth:`~pandas.DataFrame.groupby` typically refers to a
+process where we'd like to split a dataset into groups, apply some function (typically an
+aggregation), and then combine the groups together.
+
+A common SQL operation would be getting the count of records in each group throughout a dataset.
+For instance, a query getting us the number of tips left by sex:
+
+.. code-block:: sql
+
+    SELECT sex, count(*)
+    FROM tips
+    GROUP BY sex;
+    /*
+    Female     87
+    Male      157
+    */
+
+
+The pandas equivalent would be:
+
+.. ipython:: python
+
+    tips.groupby('sex').size()
+
+Notice that in the pandas code we used :meth:`~pandas.DataFrameGroupBy.size` and not
+:meth:`~pandas.DataFrameGroupBy.count`. This is because :meth:`~pandas.DataFrameGroupBy.count`
+applies the function to each column, returning the number of ``not null`` records within each.
+
+.. ipython:: python
+
+    tips.groupby('sex').count()
+
+Alternatively, we could have applied the :meth:`~pandas.DataFrameGroupBy.count` method to an
+individual column:
+
+.. ipython:: python
+
+    tips.groupby('sex')['total_bill'].count()
+
+Multiple functions can also be applied at once. For instance, say we'd like to see how tip amount
+differs by day of the week. :meth:`~pandas.DataFrameGroupBy.agg` allows you to pass a dictionary
+to your grouped DataFrame, indicating which functions to apply to specific columns.
+
+.. code-block:: sql
+
+    SELECT day, AVG(tip), COUNT(*)
+    FROM tips
+    GROUP BY day;
+    /*
+    Fri   2.734737   19
+    Sat   2.993103   87
+    Sun   3.255132   76
+    Thur  2.771452   62
+    */
+
+.. ipython:: python
+
+    tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
+
+Grouping by more than one column is done by passing a list of columns to the
+:meth:`~pandas.DataFrame.groupby` method.
+
+.. code-block:: sql
+
+    SELECT smoker, day, COUNT(*), AVG(tip)
+    FROM tips
+    GROUP BY smoker, day;
+    /*
+    smoker day
+    No     Fri      4  2.812500
+           Sat     45  3.102889
+           Sun     57  3.167895
+           Thur    45  2.673778
+    Yes    Fri     15  2.714000
+           Sat     42  2.875476
+           Sun     19  3.516842
+           Thur    17  3.030000
+    */
+
+.. ipython:: python
+
+    tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
+
+.. _compare_with_sql.join:
+
+JOIN
+----
+JOINs can be performed with :meth:`~pandas.DataFrame.join` or :func:`~pandas.merge`. By default,
+:meth:`~pandas.DataFrame.join` will join the DataFrames on their indices. Each method has
+parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the
+columns to join on (column names or indices).
+
+.. ipython:: python
+
+    df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
+                        'value': np.random.randn(4)})
+    df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
+                        'value': np.random.randn(4)})
+
+Assume we have two database tables of the same name and structure as our DataFrames.
+
+Now let's go over the various types of JOINs.
+
+INNER JOIN
+~~~~~~~~~~
+.. code-block:: sql
+
+    SELECT *
+    FROM df1
+    INNER JOIN df2
+      ON df1.key = df2.key;
+
+.. ipython:: python
+
+    # merge performs an INNER JOIN by default
+    pd.merge(df1, df2, on='key')
+
+:func:`~pandas.merge` also offers parameters for cases when you'd like to join one DataFrame's
+column with another DataFrame's index.
+
+.. ipython:: python
+
+    indexed_df2 = df2.set_index('key')
+    pd.merge(df1, indexed_df2, left_on='key', right_index=True)
+
+LEFT OUTER JOIN
+~~~~~~~~~~~~~~~
+.. code-block:: sql
+
+    -- show all records from df1
+    SELECT *
+    FROM df1
+    LEFT OUTER JOIN df2
+      ON df1.key = df2.key;
+
+.. ipython:: python
+
+    # show all records from df1
+    pd.merge(df1, df2, on='key', how='left')
+
+RIGHT JOIN
+~~~~~~~~~~
+.. code-block:: sql
+
+    -- show all records from df2
+    SELECT *
+    FROM df1
+    RIGHT OUTER JOIN df2
+      ON df1.key = df2.key;
+
+.. ipython:: python
+
+    # show all records from df2
+    pd.merge(df1, df2, on='key', how='right')
+
+FULL JOIN
+~~~~~~~~~
+pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the
+joined columns find a match. As of this writing, FULL JOINs are not supported by all RDBMS
+(e.g., MySQL).
+
+.. code-block:: sql
+
+    -- show all records from both tables
+    SELECT *
+    FROM df1
+    FULL OUTER JOIN df2
+      ON df1.key = df2.key;
+
+.. ipython:: python
+
+    # show all records from both frames
+    pd.merge(df1, df2, on='key', how='outer')
+
+
+UNION
+-----
+UNION ALL can be performed using :func:`~pandas.concat`.
+
+.. ipython:: python
+
+    df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
+                        'rank': range(1, 4)})
+    df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
+                        'rank': [1, 4, 5]})
+
+.. code-block:: sql
+
+    SELECT city, rank
+    FROM df1
+    UNION ALL
+    SELECT city, rank
+    FROM df2;
+    /*
+             city  rank
+          Chicago     1
+    San Francisco     2
+    New York City     3
+          Chicago     1
+           Boston     4
+      Los Angeles     5
+    */
+
+.. ipython:: python
+
+    pd.concat([df1, df2])
+
+SQL's UNION is similar to UNION ALL; however, UNION will remove duplicate rows.
+
+.. code-block:: sql
+
+    SELECT city, rank
+    FROM df1
+    UNION
+    SELECT city, rank
+    FROM df2;
+    -- notice that there is only one Chicago record this time
+    /*
+             city  rank
+          Chicago     1
+    San Francisco     2
+    New York City     3
+           Boston     4
+      Los Angeles     5
+    */
+
+In pandas, you can use :func:`~pandas.concat` in conjunction with
+:meth:`~pandas.DataFrame.drop_duplicates`.
+
+.. ipython:: python
+
+    pd.concat([df1, df2]).drop_duplicates()
+
+
+UPDATE
+------
+In SQL, UPDATE statements modify values in a table in place.
+
+.. code-block:: sql
+
+    UPDATE tips
+    SET tip = tip*2
+    WHERE tip < 2;
+
+In pandas, the equivalent is boolean indexing combined with assignment.
+
+.. ipython:: python
+
+    tips.loc[tips['tip'] < 2, 'tip'] *= 2
+
+DELETE
+------
+.. code-block:: sql
+
+    DELETE FROM tips
+    WHERE tip > 9;
+
+In pandas we select the rows that should remain, instead of deleting them.
+
+.. ipython:: python
+
+    tips = tips.loc[tips['tip'] <= 9]
\ No newline at end of file
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 02e193cdb6938..c406c4f2cfa27 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -132,5 +132,6 @@ See the package overview for more detail about what's in the library.
     r_interface
     related
     comparison_with_r
+    comparison_with_sql
     api
     release
diff --git a/doc/source/merging.rst b/doc/source/merging.rst
index a68fc6e0739d5..8f1eb0dc779be 100644
--- a/doc/source/merging.rst
+++ b/doc/source/merging.rst
@@ -305,7 +305,10 @@ better) than other open source implementations (like
 ``base::merge.data.frame`` in R). The reason for this is careful algorithmic
 design and internal layout of the data in DataFrame.
 
-See the :ref:`cookbook` for some advanced strategies
+See the :ref:`cookbook` for some advanced strategies.
+
+Users who are familiar with SQL but new to pandas might be interested in a
+:ref:`comparison with SQL <compare_with_sql>`.
 
 pandas provides a single function, ``merge``, as the entry point for all
 standard database join operations between DataFrame objects:
diff --git a/doc/source/v0.13.0.txt b/doc/source/v0.13.0.txt
index a39e415abe519..9f410cf7b4f8b 100644
--- a/doc/source/v0.13.0.txt
+++ b/doc/source/v0.13.0.txt
@@ -10,6 +10,9 @@ Highlights include support for a new index type ``Float64Index``, support for ne
 Several experimental features are added, including new ``eval/query`` methods for expression
 evaluation, support for ``msgpack`` serialization, and an io interface to Google's ``BigQuery``.
 
+The docs also received a new section, :ref:`Comparison with SQL <compare_with_sql>`, which should
+be useful for those familiar with SQL but still learning pandas.
+
 .. warning::
 
    In 0.13.0 ``Series`` has internally been refactored to no longer sub-class ``ndarray``