WIP - Additional documentation for Performance Improvements using Pandas #16439
doc/source/index.rst.template
Outdated
@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
comparison_with_r
comparison_with_sql
comparison_with_sas
performance
these should go as a section in enhancingperf.rst
You can start a new section at the same level as Cython (Writing C extensions for pandas) though. And maybe even put it first, since people should be fixing these issues before moving to cython.
yes putting this at the top of enhancing perf is just fine
doc/source/performance.rst
Outdated
******************
Performance Considerations
******************
I think the length of the * line has to match exactly the length of the header, otherwise sphinx complains.
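One way to avoid the mismatch flagged here is to generate the rule lines from the title text. The helper below is a hypothetical sketch (`rst_title` is not part of pandas or Sphinx); it only illustrates that the overline and underline should be as long as the title, since docutils warns when an underline is shorter than its title.

```python
def rst_title(text, char="*"):
    """Return an RST title with an overline and underline sized to the text.

    Hypothetical helper for illustration only.
    """
    rule = char * len(text)  # same length as the title, so docutils accepts it
    return "\n".join([rule, text, rule])


print(rst_title("Performance Considerations"))
```

For the heading in this diff, the rules would need 26 asterisks rather than 18.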
doc/source/performance.rst
Outdated
******************

Overview
--------
May need a blank line below this and before the text (could be wrong)
doc/source/performance.rst
Outdated
--------
Pandas, for most use cases, will be fast; however, there are a few
anti-patterns to avoid. The following is a guide on achieving better
performance with pandas.
Maybe add a note like, "these are the low-hanging optimizations that should be considered before rewriting your code in Cython or using Numba"
doc/source/performance.rst
Outdated
import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10
Changing this may mess with the formatting elsewhere (I think it's all a global state). Maybe just remove it for now and see how things look.
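If a shorter display is still wanted for one example, a scoped setting is one possible alternative to mutating the global option. The sketch below uses `pd.option_context`, which restores the previous value on exit; the frame `df` is made up for illustration and is not from the PR.

```python
import pandas as pd

df = pd.DataFrame({"a": range(100)})  # illustrative data, not from the PR
default_max_rows = pd.get_option("display.max_rows")

# The option is changed only inside the with-block.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    print(df)  # truncated repr

# Outside the block the previous value is back in effect.
after = pd.get_option("display.max_rows")
```

Whether this fits the ipython directive used in the docs build is an open question; removing the line, as suggested above, is the simpler route.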
doc/source/performance.rst
Outdated
2) Another optimization is to make sure to use native types instead of pandas types.
Native types instead of python types (or python objects).
doc/source/performance.rst
Outdated
* Strings - This is usually unavoidable. Pandas 2 will have a specialized
string type, but for now you're stuck with python objects.
If you have few distinct values (relative to the number of rows), you could
use a Categorical.
Add a link to the categorical docs.
doc/source/performance.rst
Outdated
If you have few distinct values (relative to the number of rows), you could
use a Categorical.

* Dates, Times - Pandas has implemented a specialized version of datetime.datetime,
You can put the objects like datetime.datetime in double backticks to make them format nicely in the HTML
doc/source/performance.rst
Outdated
* Dates, Times - Pandas has implemented a specialized version of datetime.datetime,
and datetime.timedelta, but not datetime.date or datetime.time. Depending on your
application, you might be able to treat dates as datetimess, at midnight.
typo: datetimess. Probably don't need the comma before "at midnight" either.
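The "dates as datetimes at midnight" idea in the hunk above could be illustrated roughly as follows. This is a sketch with made-up dates, not text from the PR: a Series of `datetime.date` objects is stored as object dtype, while `pd.to_datetime` converts it to a native datetime dtype with the time set to midnight.

```python
import datetime

import pandas as pd

# Illustrative values only.
dates = [datetime.date(2017, 5, 22), datetime.date(2017, 5, 23)]

as_objects = pd.Series(dates)             # Python date objects, object dtype
as_datetimes = pd.to_datetime(as_objects)  # native datetime dtype, midnight

print(as_objects.dtype)
print(as_datetimes.dtype)
print(as_datetimes.dt.hour.tolist())  # every timestamp sits at hour 0
```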
doc/source/performance.rst
Outdated
native arbitrary-precision Decimal type.

In certain cases, some things will automatically be object type:
Maybe use "unavoidably" instead of "automatically"
Codecov Report
@@ Coverage Diff @@
## master #16439 +/- ##
==========================================
- Coverage 90.42% 90.42% -0.01%
==========================================
Files 161 161
Lines 51023 51023
==========================================
- Hits 46138 46137 -1
- Misses 4885 4886 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16439 +/- ##
==========================================
+ Coverage 90.42% 90.43% +<.01%
==========================================
Files 161 161
Lines 51023 51048 +25
==========================================
+ Hits 46138 46164 +26
+ Misses 4885 4884 -1
Continue to review full report at Codecov.
|
Thank you for the feedback, will incorporate it now.
On May 23, 2017, at 3:51 AM, Jeff Reback ***@***.***> wrote:
@jreback commented on this pull request.
In doc/source/index.rst.template:
> @@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
comparison_with_r
comparison_with_sql
comparison_with_sas
+ performance
yes putting this at the top of enhancing perf is just fine
Still need to add notes from the pycon 2017 lecture, and bullets 3-5.
.. ipython:: python

Overview of Common Optimizations
----------------------------------------
needs to be the same length as text
* At least not for things it's not meant for.
* Pandas is very fast at joins, reindex, factorization
* Not as great at, say, matrix multiplications or problems that aren't vectorizable
can u be more specific
I think this first 1) part is too generic in its entirety. For me, now it just raises more questions than it answers.
Eg "At least not for things it's not meant for." What is it not meant for?
One option is to turn this around; list some things for which pandas is meant and then say for other problems it is less suited.
2) Another optimization is to make sure to use native types instead of python types.
what you mean is avoid object types for non-strings
%timeit s2.sum()

NumPy int64 datatype arrays are much faster than the python object version.
what does this mean?
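A possible reading of the sentence under review: the same integers summed from an int64-backed Series versus an object-dtype Series holding boxed Python objects. The sketch below is an assumption about what the `%timeit` comparison in the hunk was meant to show; the names `s_int` and `s_obj` and the array size are illustrative, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

s_int = pd.Series(np.arange(100_000, dtype="int64"))  # native int64 storage
s_obj = s_int.astype(object)                          # same values, object dtype

# Both give the same answer; only the storage differs.
total_int = s_int.sum()
total_obj = s_obj.sum()

t_int = timeit.timeit(s_int.sum, number=5)
t_obj = timeit.timeit(s_obj.sum, number=5)
print(f"int64: {t_int:.5f}s  object: {t_obj:.5f}s")
```

The int64 sum runs over a contiguous typed buffer, whereas the object version has to handle each element as a separate Python object, which is usually where the difference comes from.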
For other datatypes:

* Strings - This is usually unavoidable. Pandas 2 will have a specialized
much more specific here. and formatting.
We shouldn't talk yet about pandas 2 here IMO (or at least not start with it). This should mainly explain how to currently deal with strings.
I would explain how strings are handled in pandas -> uses object dtype. And then mention Categoricals as a way to speed up certain use cases when having a string column with many recurring strings.
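The suggestion above might be illustrated along these lines: a string column with few distinct values, held first as object dtype and then converted to a Categorical. The data and variable names are made up for the sketch; `memory_usage(deep=True)` is used so that the Python string objects are counted.

```python
import pandas as pd

# Three distinct strings repeated many times (illustrative data).
s_obj = pd.Series(["red", "green", "blue"] * 10_000, dtype=object)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

print(s_obj.dtype, obj_bytes)
print(s_cat.dtype, cat_bytes)
print(list(s_cat.cat.categories))
```

The Categorical stores each distinct string once plus small integer codes per row, so the benefit shrinks as the number of distinct values approaches the number of rows.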
an int, a float, and a datetime, pandas will have to store all of
those as objects. This dataset probably isn't tidy though.

* Integer NA - Unfortunately, pandas doesn't have real nullable types.
no, read the section on this. We almost always cast to float.
* Integer NA - Unfortunately, pandas doesn't have real nullable types.
To represent missingness, pandas uses NaN (not a number) which is a special
floating point value. If you have to represent nullable integers, you can
if you add things like this, there are many references in the docs, pls add links.
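The cast-to-float behaviour mentioned in this thread can be seen in a short sketch. It contrasts the default construction, where integers mixed with NaN become float64, with the explicit `dtype=object` construction used in the diff; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, np.nan, 5]

s_float = pd.Series(values)                 # integers plus NaN are cast to float64
s_object = pd.Series(values, dtype=object)  # Python objects are kept as given

print(s_float.dtype, type(s_float.iloc[0]))
print(s_object.dtype, type(s_object.iloc[0]))
print(s_float.isna().sum(), s_object.isna().sum())
```

The float version keeps fast native storage at the cost of the values no longer being integers; the object version keeps the integers but gives up the native-type speed discussed earlier in the section.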
s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9], dtype=object)
type(s[0])

3) Optimize loading dataframes
this is not a good name for this section
Yeah, this is a useful section (often occurring anti-pattern), but maybe something with "Combining dataframes"? (but we want something for which people don't think it is about merging ..)
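The thread does not spell out which anti-pattern this section covers. Assuming it is the common one of growing a DataFrame piece by piece inside a loop, a sketch of the contrast could look like this; the `chunks` list stands in for frames read from separate files and is purely illustrative.

```python
import pandas as pd

# Stand-in for frames loaded one at a time (illustrative data).
chunks = [pd.DataFrame({"x": range(i, i + 3)}) for i in range(0, 9, 3)]

# Assumed anti-pattern: re-concatenating inside the loop copies all
# previously accumulated rows on every iteration.
grown = chunks[0]
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Alternative: collect the pieces in a list and concatenate once.
combined = pd.concat(chunks, ignore_index=True)

print(combined.equals(grown))
```

Both approaches give the same frame; the single `pd.concat` call avoids the repeated copying.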
Thanks will fix today!
if you can update and respond to comments.
closing as stale. if you'd like to continue, pls ping
Building off #16310, to start discussing ways to be more performant with pandas.