WIP - Additional documentation for Performance Improvements using Pandas #16439

Closed
wants to merge 7 commits into from

andymaheshw
Contributor

Building off #16310, to start discussing ways to be more performant with pandas.

andymaheshw changed the title from "Tutorial update2" to "Adding documentation for Performance Improvements using Pandas" on May 22, 2017
@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
comparison_with_r
comparison_with_sql
comparison_with_sas
performance
Contributor

these should go as a section in enhancingperf.rst

Contributor

You can start a new section at the same level as Cython (Writing C extensions for pandas) though. And maybe even put it first, since people should be fixing these issues before moving to cython.

Contributor

yes putting this at the top of enhancing perf is just fine

jreback added the "Docs" and "Performance" (memory or execution speed performance) labels on May 22, 2017

******************
Performance Considerations
******************
Contributor

I think length of the * has to match exactly the length of the header, otherwise sphinx complains.

******************

Overview
--------
Contributor

May need a blank line below this and before the text (could be wrong)

--------
Pandas, for most use cases, will be fast; however, there are a few
anti-patterns to avoid. The following is a guide on achieving better
performance with pandas.
Contributor

Maybe add a note like, "these are the low-hanging optimizations that should be considered before rewriting your code in Cython or using Numba"

import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10
Contributor

Changing this may mess with the formatting elsewhere (I think it's all global state). Maybe just remove it for now and see how things look.




2) Another optimization is to make sure to use native types instead of pandas types.
Contributor

Native types instead of python types (or python objects).

* Strings - This is usually unavoidable. Pandas 2 will have a specialized
string type, but for now you're stuck with python objects.
If you have few distinct values (relative to the number of rows), you could
use a Categorical.
Contributor

Add a link to the categorical docs.

If you have few distinct values (relative to the number of rows), you could
use a Categorical.
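
As a rough illustration of the Categorical suggestion in the bullet above, the sketch below compares the memory footprint of an object-dtype string column with its categorical equivalent. The data and the variable names (`s_obj`, `s_cat`) are made up for this example and are not part of the PR.

```python
import pandas as pd

# Hypothetical data: 300,000 rows but only three distinct string values.
values = ["red", "green", "blue"] * 100_000

# Force object dtype so every element is stored as a Python str object.
s_obj = pd.Series(values, dtype=object)

# A Categorical stores each distinct value once plus a small integer code per row.
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

print(f"object dtype:   {obj_bytes:,} bytes")
print(f"category dtype: {cat_bytes:,} bytes")
```

The saving depends on the ratio of distinct values to rows; with mostly unique strings a Categorical may not help at all.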

* Dates, Times - Pandas has implemented a specialized version of datetime.datetime,
Contributor

You can put the objects like datetime.datetime in double backticks to make them format nicely in the HTML


* Dates, Times - Pandas has implemented a specialized version of datetime.datetime,
and datetime.timedelta, but not datetime.date or datetime.time. Depending on your
application, you might be able to treat dates as datetimess, at midnight.
Contributor

typo: datetimess. Probably don't need the comma before "at midnight" either.
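
For readers following the thread, here is a minimal sketch of what "treating dates as datetimes at midnight" could look like. It assumes a column of plain ``datetime.date`` objects; the sample dates and variable names are illustrative only.

```python
import datetime

import pandas as pd

# Hypothetical column of datetime.date objects, stored as Python objects.
dates = [datetime.date(2017, 5, 22), datetime.date(2017, 5, 23)]
s_obj = pd.Series(dates, dtype=object)

# Converting gives a native datetime64 column; each date becomes
# a timestamp at midnight of that day.
s_dt = pd.to_datetime(s_obj)

print(s_obj.dtype)  # object
print(s_dt.dtype)   # a datetime64 dtype
print(s_dt.dt.hour.tolist())
```

Whether midnight timestamps are an acceptable stand-in for dates depends on the application, for example on whether time zones are involved.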

native arbitrary-precision Decimal type.


In certain cases, some things will automatically be object type:
Contributor

Maybe use "unavoidably" instead of "automatically"

@codecov
codecov bot commented May 23, 2017

Codecov Report

Merging #16439 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #16439      +/-   ##
==========================================
- Coverage   90.42%   90.42%   -0.01%     
==========================================
  Files         161      161              
  Lines       51023    51023              
==========================================
- Hits        46138    46137       -1     
- Misses       4885     4886       +1
Flag        Coverage Δ
#multiple   88.26% <ø> (-0.01%) ⬇️
#single     40.17% <ø> (ø) ⬆️

Impacted Files          Coverage Δ
pandas/core/common.py   91.05% <0%> (-0.34%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 49ec31b...e2ac61a.

@codecov
codecov bot commented May 23, 2017

Codecov Report

Merging #16439 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #16439      +/-   ##
==========================================
+ Coverage   90.42%   90.43%   +<.01%     
==========================================
  Files         161      161              
  Lines       51023    51048      +25     
==========================================
+ Hits        46138    46164      +26     
+ Misses       4885     4884       -1
Flag        Coverage Δ
#multiple   88.27% <ø> (ø) ⬆️
#single     40.16% <ø> (-0.01%) ⬇️

Impacted Files                   Coverage Δ
pandas/io/feather_format.py      85.71% <0%> (-2.17%) ⬇️
pandas/core/frame.py             97.66% <0%> (-0.03%) ⬇️
pandas/core/reshape/merge.py     94.18% <0%> (-0.01%) ⬇️
pandas/errors/__init__.py        100% <0%> (ø) ⬆️
pandas/core/missing.py           84.27% <0%> (ø) ⬆️
pandas/core/reshape/reshape.py   99.28% <0%> (ø) ⬆️
pandas/io/formats/style.py       95.92% <0%> (+0.01%) ⬆️
pandas/io/parsers.py             95.33% <0%> (+0.01%) ⬆️
pandas/core/ops.py               91.69% <0%> (+0.01%) ⬆️
pandas/plotting/_core.py         81.92% <0%> (+0.02%) ⬆️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 49ec31b...16467cb.

@andymaheshw
Contributor Author

andymaheshw commented May 23, 2017 via email

andymaheshw changed the title from "Adding documentation for Performance Improvements using Pandas" to "WIP - Additional documentation for Performance Improvements using Pandas" on May 24, 2017
@andymaheshw
Contributor Author

Still need to add notes from the pycon 2017 lecture, and bullets 3-5.

.. ipython:: python

Overview of Common Optimizations
----------------------------------------
Contributor

needs to be the same length as text


* At least not for things it's not meant for.
* Pandas is very fast at joins, reindex, factorization
* Not as great at, say, matrix multiplications or problems that aren't vectorizable
Contributor

can u be more specific

Member

I think this first 1) part is too generic in its entirety. For me, now it just raises more questions than it answers.

Eg "At least not for things it's not meant for." What is it not meant for?

One option is to turn this around; list some things for which pandas is meant and then say for other problems it is less suited.




2) Another optimization is to make sure to use native types instead of python types.
Contributor

what you mean is avoid object types for non-strings

%timeit s2.sum()


NumPy int64 datatype arrays are much faster than the python object version.
Contributor

what does this mean?
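
One way to make the int64-versus-object claim concrete is a small timing comparison like the sketch below. The series names (`s_int`, `s_obj`) and sizes are assumptions for illustration and are not the `s2` from the PR; absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same integers, stored two ways.
s_int = pd.Series(np.arange(100_000, dtype="int64"))  # native int64 buffer
s_obj = s_int.astype(object)                           # one Python int object per element

# Summing int64 runs in compiled code over a contiguous buffer;
# summing object dtype has to add Python objects one at a time.
t_int = timeit.timeit(s_int.sum, number=20)
t_obj = timeit.timeit(s_obj.sum, number=20)

print(f"int64 sum:  {t_int:.4f} s for 20 runs")
print(f"object sum: {t_obj:.4f} s for 20 runs")
```

Both calls return the same total; only the storage and the code path differ.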


For other datatypes:

* Strings - This is usually unavoidable. Pandas 2 will have a specialized
Contributor

much more specific here. and formatting.

Member

We shouldn't talk yet about pandas 2 here IMO (or at least not start with it). This should mainly explain how to currently deal with strings.

I would explain how strings are handled in pandas -> uses object dtype. And then mention Categoricals as a way to speed up certain use cases when having a string column with many recurring strings.

an int, a float, and a datetime, pandas will have to store all of
those as objects. This dataset probably isn't tidy though.

* Integer NA - Unfortunately, pandas doesn't have real nullable types.
Contributor

no, read the section on this. We almost always cast to float.


* Integer NA - Unfortunately, pandas doesn't have real nullable types.
To represent missingness, pandas uses NaN (not a number) which is a special
floating point value. If you have to represent nullable integers, you can
Contributor

if you add things like this, there are many references in the docs, pls add links.

s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9], dtype=object)
type(s[0])
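
To illustrate the two behaviours discussed here (the default cast to float when a missing value appears, versus forcing object dtype to keep Python integers), a small sketch follows. The variable names are hypothetical and the data is shortened from the snippet above.

```python
import numpy as np
import pandas as pd

# Default behaviour: a NaN among integers makes the whole column float64.
s_float = pd.Series([1, 2, 3, np.nan, 5])

# Forcing object dtype keeps the original Python ints, at the cost of
# losing the fast native-dtype code paths.
s_obj = pd.Series([1, 2, 3, np.nan, 5], dtype=object)

# The same cast happens when missing values are introduced later,
# for example by reindexing an integer series onto a new label.
s_int = pd.Series([1, 2, 3])
s_reindexed = s_int.reindex([0, 1, 2, 3])

print(s_float.dtype, type(s_float.iloc[0]))
print(s_obj.dtype, type(s_obj.iloc[0]))
print(s_int.dtype, "->", s_reindexed.dtype)
```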

3) Optimize loading dataframes
Contributor

this is not a good name for this section

Member

Yeah, this is a useful section (an often-occurring anti-pattern), but maybe something with "Combining dataframes"? (But we want a name that people won't take to be about merging ..)

@andymaheshw
Contributor Author

Thanks will fix today!

@jreback
Contributor

jreback commented Aug 17, 2017

if you can update and respond to comments.

@jreback
Contributor

jreback commented Oct 28, 2017

closing as stale. if you'd like to continue, pls ping

jreback closed this on Oct 28, 2017
jreback added this to the No action milestone on Oct 28, 2017
4 participants