
WIP - Additional documentation for Performance Improvements using Pandas #16439


Closed
wants to merge 7 commits into from
133 changes: 133 additions & 0 deletions doc/source/enhancingperf.rst
@@ -11,14 +11,147 @@
import numpy as np
import pandas as pd
pd.options.display.max_rows = 15

import sys
import os
import csv

import matplotlib.pyplot as plt
plt.style.use('default')


*********************
Enhancing Performance
*********************

.. _enhancingperf.overview:


Pandas, for most use cases, will be fast; however, there are a few
anti-patterns to avoid. The following is a guide to achieving better
performance with pandas. These are the low-hanging optimizations that
should be considered before rewriting your code in Cython or using Numba.

Overview of Common Optimizations
--------------------------------
Contributor comment: needs to be the same length as text


1) Using pandas when it is the wrong tool.

* Avoid using pandas for tasks it isn't designed for.
* Pandas is very fast at joins, reindexing, and factorization.
* It is not as good at, say, matrix multiplication or problems that aren't
  vectorizable.
Contributor comment: can u be more specific

Member comment: I think this first 1) part is too generic in its entirety. For me, now it just raises more questions than it answers.

E.g., "At least not for things it's not meant for." What is it not meant for?

One option is to turn this around; list some things for which pandas is meant and then say for other problems it is less suited.




2) Make sure to use native NumPy dtypes instead of python object dtype (in other words, avoid object dtype for anything but strings).
Contributor comment: what you mean is avoid object types for non-strings


.. ipython:: python

   s1 = pd.Series(range(10000), dtype=object)
   s2 = pd.Series(range(10000), dtype=np.int64)

.. ipython:: python

   %timeit s1.sum()

.. ipython:: python

   %timeit s2.sum()


Summing the NumPy `int64` Series is much faster than summing the object Series, because pandas can use vectorized C code instead of iterating over python objects one at a time.
Contributor comment: what does this mean?

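Where a column ended up as object dtype but really holds numbers (a
hypothetical but common situation, e.g. numbers parsed as strings from a
messy file), an explicit conversion with `pandas.to_numeric` restores a fast
native dtype:

.. ipython:: python

   # numbers that arrived as strings, stored as python objects
   s_obj = pd.Series(['1', '2', '3'], dtype=object)
   # convert to a native numeric dtype
   pd.to_numeric(s_obj)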

For other datatypes:

* Strings - This is usually unavoidable: pandas currently stores strings
  using the python object dtype. Pandas 2 will have a specialized string
  type, but for now you're stuck with python objects. If you have few
  distinct values (relative to the number of rows), you can use a
  Categorical (see the sketch after this list).

Contributor comment: much more specific here. and formatting.

Member comment: We shouldn't talk yet about pandas 2 here IMO (or at least not start with it). This should mainly explain how to currently deal with strings. I would explain how strings are handled in pandas -> uses object dtype. And then mention Categoricals as a way to speed up certain use cases when having a string column with many re-occurring strings.

* Dates, Times - Pandas has implemented a specialized version of `datetime.datetime`
  and `datetime.timedelta`, but not `datetime.date` or `datetime.time`. Depending on your
  application, you might be able to treat dates as datetimes at midnight (see below).

* Decimal Types - Pandas uses floating-point arrays; there isn't a
  native arbitrary-precision Decimal type.
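
A hedged illustration of the Categorical suggestion above (the data here is
made up for the example): converting a low-cardinality string column to
category dtype stores each distinct string once plus small integer codes,
which saves memory and speeds up many operations.

.. ipython:: python

   n = 100000
   colors = pd.Series(np.random.choice(['red', 'green', 'blue'], size=n))
   colors_cat = colors.astype('category')

   # the categorical version stores 3 strings plus small integer codes,
   # so it uses far less memory than the object version
   colors.memory_usage(deep=True)
   colors_cat.memory_usage(deep=True)

.. ipython:: python

   %timeit colors == 'red'

.. ipython:: python

   %timeit colors_cat == 'red'

And a sketch of treating dates as datetimes at midnight, as suggested in the
Dates, Times bullet:

.. ipython:: python

   import datetime

   dates = pd.Series([datetime.date(2017, 1, 1), datetime.date(2017, 1, 2)])
   # datetime.date is not a native pandas type, so this is object dtype
   dates.dtype

   # converting gives a fast native dtype; each date becomes a
   # datetime at midnight
   pd.to_datetime(dates).dtype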


In certain cases, some things will unavoidably be object type:

* Reading messy Excel files - read_excel will preserve the dtype
  of each cell in the spreadsheet. If you have a single column with
  an int, a float, and a datetime, pandas will have to store all of
  those as objects. This dataset probably isn't tidy though (a minimal
  illustration follows the examples below).

* Integer NA - Unfortunately, pandas doesn't have real nullable integer
  types. To represent missingness, pandas uses NaN (not a number), which is
  a special floating-point value, so integer data with missing values is
  almost always cast to float (see the missing data section of the docs).
  If you have to keep the values as actual integers, you can use object
  dtype.

Contributor comment: no, read the section on this. We almost always cast to float.

Contributor comment: if you add things like this, there are many references in the docs, pls add links.

.. ipython:: python

   # the NaN forces the otherwise-integer values to be cast to float
   s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9])
   s

.. ipython:: python

   type(s[0])

.. ipython:: python

   # with object dtype the values stay python ints, at a performance cost
   s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9], dtype=object)
   type(s[0])
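
A minimal illustration of the mixed-type case from the Excel bullet above
(constructed data standing in for a messy spreadsheet column): a single
column holding an int, a float, and a datetime necessarily falls back to
object dtype.

.. ipython:: python

   mixed = pd.Series([1, 2.5, pd.Timestamp('2017-01-01')])
   mixed.dtype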

3) Building a DataFrame from many pieces of data
Contributor comment: this is not a good name for this section

Member comment: Yeah, this is a useful section (often occurring anti-pattern), but maybe something with "Combining dataframes"? (but we want something for which people don't think it is about merging ..)


There are two ways to get data into a DataFrame: make a single empty DataFrame
and repeatedly append to it, or make a list of many DataFrames
and concatenate them at the end.

The following code makes 100 sets of 50 records each. This could represent
any data source, with any number of items in each set. A DataFrame is built
from the first list of records.

.. ipython:: python

   import random
   import string

   records = [[(random.choice(string.ascii_letters),
                random.choice(string.ascii_letters),
                random.choice(range(10)))
               for i in range(50)]
              for j in range(100)]

   pd.DataFrame(records[0], columns=['A', 'B', 'C'])

DataFrame.append is not efficient, as seen below: each call to append copies the entire frame, so the loop does quadratic work.

.. ipython:: python

   %%timeit
   # make an empty dataframe with the correct columns
   df = pd.DataFrame(columns=['A', 'B', 'C'])

   for set_ in records:
       subdf = pd.DataFrame(set_, columns=['A', 'B', 'C'])
       # append to that original df
       df = df.append(subdf, ignore_index=True)

A better solution is to build a list of DataFrames and pass it to pd.concat.

.. ipython:: python

   pd.concat([pd.DataFrame(set_, columns=['A', 'B', 'C']) for set_ in records],
             ignore_index=True)
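
To make the comparison concrete, the concat version can be timed the same way
(a sketch; absolute numbers will vary by machine, but it avoids the repeated
copying done by the append loop):

.. ipython:: python

   %%timeit
   pd.concat([pd.DataFrame(set_, columns=['A', 'B', 'C']) for set_ in records],
             ignore_index=True)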

4) Doing too much work with Pandas

5) Reduce the amount of iteration required

Using .apply with axis=1 is still row-by-row iteration under the hood; avoid
it where a vectorized alternative exists (a sketch follows).
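
A sketch of the difference (made-up data): row-wise .apply runs a python
function once per row, while the vectorized expression operates on whole
columns at once.

.. ipython:: python

   df = pd.DataFrame({'a': np.random.randn(10000),
                      'b': np.random.randn(10000)})

.. ipython:: python

   # python-level function call for each of the 10000 rows
   %timeit df.apply(lambda row: row['a'] + row['b'], axis=1)

.. ipython:: python

   # one vectorized operation over whole columns
   %timeit df['a'] + df['b']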


.. _enhancingperf.cython:

Cython (Writing C extensions for pandas)