WIP - Additional documentation for Performance Improvements using Pandas #16439
.. ipython:: python
   :suppress:

   import numpy as np
   import pandas as pd
   pd.options.display.max_rows = 15

   import sys
   import os
   import csv

   import matplotlib.pyplot as plt
   plt.style.use('default')

*********************
Enhancing Performance
*********************

.. _enhancingperf.overview:

Pandas, for most use cases, will be fast; however, there are a few
anti-patterns to avoid. The following is a guide to achieving better
performance with pandas. These are the low-hanging optimizations that
should be considered before rewriting your code in Cython or Numba.

Overview of Common Optimizations
--------------------------------

1) Using pandas when it is the wrong tool.

   * Pandas is designed for labeled, tabular data: it is very fast at
     joins, reindexing, and factorization.
   * It is less suited to, say, matrix multiplication or problems that
     aren't vectorizable; for those, working directly with NumPy is
     usually a better fit.

2) Use native (NumPy) dtypes instead of python objects; in particular,
   avoid ``object`` dtype where possible.

.. ipython:: python

   s1 = pd.Series(range(10000), dtype=object)
   s2 = pd.Series(range(10000), dtype=np.int64)

.. ipython:: python

   %timeit s1.sum()

.. ipython:: python

   %timeit s2.sum()

Summing the ``int64`` Series is much faster because the operation runs as a
vectorized loop in C, whereas the ``object`` Series has to dispatch a
Python-level addition for every element.
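If a numeric column has ended up as ``object`` dtype, it can usually be
converted back with ``astype``; a minimal sketch (the variable names here are
illustrative, not from the example above):

```python
import numpy as np
import pandas as pd

# A Series that arrived with object dtype, e.g. from a messy data source.
s_obj = pd.Series(range(10000), dtype=object)

# Convert to a native NumPy dtype so operations run as vectorized C loops
# instead of per-element Python arithmetic.
s_int = s_obj.astype(np.int64)

print(s_int.dtype)
```

The converted Series gives the same results, just faster, so the conversion
is usually a safe first step when a column's dtype is unexpectedly ``object``.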

For other datatypes:

* Strings - Pandas stores strings using ``object`` dtype, i.e. as python
  string objects. This is usually unavoidable, but if a column has few
  distinct values relative to the number of rows, consider converting it
  to a ``Categorical``, which stores one small integer code per row plus
  the set of distinct values.

* Dates, Times - Pandas has implemented specialized versions of
  ``datetime.datetime`` and ``datetime.timedelta``, but not
  ``datetime.date`` or ``datetime.time``. Depending on your application,
  you might be able to treat dates as datetimes at midnight.

* Decimal Types - Pandas uses floating-point arrays; there isn't a
  native arbitrary-precision Decimal type.
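The ``Categorical`` suggestion for low-cardinality string columns can be
sketched as follows (a minimal example; the column contents and sizes are
made up for illustration):

```python
import pandas as pd

# A string column with few distinct values relative to its length.
s = pd.Series(["red", "green", "blue"] * 10000)

# Categorical stores one small integer code per row plus the set of
# distinct categories, rather than one python string object per row.
cat = s.astype("category")

mem_object = s.memory_usage(deep=True)
mem_category = cat.memory_usage(deep=True)
print(mem_object, mem_category)
```

The values are unchanged; only the storage (and the speed of operations such
as groupby and comparisons) differs.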

In certain cases, some things will unavoidably be object dtype:

* Reading messy Excel files - ``read_excel`` will preserve the dtype
  of each cell in the spreadsheet. If a single column contains an int,
  a float, and a datetime, pandas will have to store all of those as
  python objects. This dataset probably isn't tidy though.

* Integer NA - Pandas doesn't have a native nullable integer type. To
  represent missingness it uses NaN (not a number), a special
  floating-point value, so an integer column that contains missing values
  is cast to ``float64``. If you need to hold true integers alongside
  missing values, you can fall back to ``object`` dtype; see the
  documentation on missing data for details.

.. ipython:: python

   s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9])
   s

.. ipython:: python

   type(s[0])

.. ipython:: python

   s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9], dtype=object)
   type(s[0])

3) Combining many pieces into a single DataFrame.

There are two ways to get data into a DataFrame: make a single empty
DataFrame and repeatedly append to it, or build a list of many DataFrames
and concatenate them once at the end.

The following code builds 100 sets of 50 records each. This could represent
any data source, with any number of items in each set. A DataFrame is built
out of the first list of records.

.. ipython:: python

   import random
   import string

   records = [[(random.choice(string.ascii_letters),
                random.choice(string.ascii_letters),
                random.choice(range(10)))
               for i in range(50)]
              for j in range(100)]

   pd.DataFrame(records[0], columns=['A', 'B', 'C'])

``DataFrame.append`` is not efficient, as seen below: each call copies the
entire accumulated DataFrame, so the loop does quadratic work overall.

.. ipython:: python

   %%timeit
   # Make an empty dataframe with the correct columns
   df = pd.DataFrame(columns=['A', 'B', 'C'])

   for set_ in records:
       subdf = pd.DataFrame(set_, columns=['A', 'B', 'C'])
       # append to that original df
       df = df.append(subdf, ignore_index=True)

A better solution is to build all the pieces first and combine them with a
single ``pd.concat`` at the end.

.. ipython:: python

   pd.concat([pd.DataFrame(set_, columns=['A', 'B', 'C']) for set_ in records],
             ignore_index=True)
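The same build-a-list-then-concatenate pattern, sketched independently of the
``records`` example above (the piece contents here are hypothetical
placeholders):

```python
import pandas as pd

pieces = []
for j in range(100):
    # Each piece could come from a file, an API response, etc.
    pieces.append(pd.DataFrame({"A": range(50), "B": range(50)}))

# One concatenation at the end instead of 100 incremental appends.
df = pd.concat(pieces, ignore_index=True)
print(df.shape)
```

Accumulating in a plain python list is cheap; only the final ``pd.concat``
allocates the full result, so the total work is linear rather than quadratic.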

4) Doing too much work with pandas.

5) Reduce the amount of iteration required.

   Avoid row-wise iteration, including ``.apply`` with ``axis=1``; prefer
   vectorized operations on whole columns.
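As a hedged sketch of why row-wise ``.apply`` is worth avoiding: it calls a
python function once per row, while the equivalent vectorized expression
operates on whole columns at once (the column names and sizes below are
illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10000), "b": np.arange(10000)})

# Row-wise apply: one python function call per row (slow).
applied = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized: a single operation over whole columns (fast).
vectorized = df["a"] + df["b"]
```

Both produce identical results; wrapping each in ``%timeit`` shows the
vectorized version is orders of magnitude faster as the row count grows.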

.. _enhancingperf.cython:

Cython (Writing C extensions for pandas)
----------------------------------------