WIP - Additional documentation for Performance Improvements using Pandas #16439

andymaheshw · 2017-05-22T22:17:51Z

Building off #16310, to start discussing ways to be more performant with pandas.

jreback · 2017-05-22T22:20:08Z

doc/source/index.rst.template

@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
    comparison_with_r
    comparison_with_sql
    comparison_with_sas
+    performance


these should go as a section in enhancingperf.rst

You can start a new section at the same level as Cython (Writing C extensions for pandas) though. And maybe even put it first, since people should be fixing these issues before moving to cython.

yes putting this at the top of enhancing perf is just fine

TomAugspurger · 2017-05-23T01:51:28Z

doc/source/index.rst.template

@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
    comparison_with_r
    comparison_with_sql
    comparison_with_sas
+    performance


You can start a new section at the same level as Cython (Writing C extensions for pandas) though. And maybe even put it first, since people should be fixing these issues before moving to cython.

TomAugspurger · 2017-05-23T01:52:00Z

doc/source/performance.rst

+
+******************
+Performance Considerations
+******************


I think length of the * has to match exactly the length of the header, otherwise sphinx complains.

TomAugspurger · 2017-05-23T01:52:19Z

doc/source/performance.rst

+******************
+
+Overview
+--------


May need a blank line below this and before the text (could be wrong)

TomAugspurger · 2017-05-23T01:53:17Z

doc/source/performance.rst

+--------
+Pandas, for most use cases, will be fast; however, there are a few
+anti-patterns to avoid. The following is a guide on achieving better
+performance with pandas.


Maybe add a note like, "these are the low-hanging optimizations that should be considered before rewriting your code in Cython or to use Numba"

TomAugspurger · 2017-05-23T01:54:22Z

doc/source/performance.rst

+   import pandas as pd
+   import matplotlib.pyplot as plt
+
+   pd.options.display.max_rows = 10


Changing this may mess with the formatting elsewhere (I think it's all global state). Maybe just remove it for now and see how things look.

TomAugspurger · 2017-05-23T01:55:39Z

doc/source/performance.rst

+
+
+
+2) Another optimization is to make sure to use native types instead of pandas types.


Native types instead of python types (or python objects).

TomAugspurger · 2017-05-23T01:56:23Z

doc/source/performance.rst

+   * Strings - This is usually unavoidable. Pandas 2 will have a specialized
+     string type, but for now you're stuck with python objects.
+     If you have few distinct values (relative to the number of rows), you could
+     use a Categorical.


Add a link to the categorical docs.

TomAugspurger · 2017-05-23T01:56:56Z

doc/source/performance.rst

+     If you have few distinct values (relative to the number of rows), you could
+     use a Categorical.
+
+   * Dates, Times - Pandas has implemented a specialized version of datetime.datetime,


You can put the objects like datetime.datetime in double backticks to make them format nicely in the HTML

TomAugspurger · 2017-05-23T01:57:08Z

doc/source/performance.rst

+
+   * Dates, Times - Pandas has implemented a specialized version of datetime.datetime,
+     and datetime.timedelta, but not datetime.date or datetime.time. Depending on your
+     application, you might be able to treat dates as datetimess, at midnight.


typo: datetimess. Probably don't need the comma before "at midnight" either.
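For reference, a minimal sketch of what "treat dates as datetimes at midnight" could look like. The series names and sample dates are made up for illustration and are not taken from the PR.

```python
import datetime

import pandas as pd

# A Series of plain ``datetime.date`` objects is stored with object dtype.
dates = pd.Series([datetime.date(2017, 5, 22), datetime.date(2017, 5, 23)])
print(dates.dtype)

# Converting gives a datetime64 Series; each date becomes a timestamp at midnight.
as_datetimes = pd.to_datetime(dates)
print(as_datetimes.dtype)

# Every value is already normalized (time component is 00:00:00).
at_midnight = (as_datetimes == as_datetimes.dt.normalize()).all()
print(at_midnight)
```

Once the column is datetime64, the vectorized ``.dt`` accessors are available, which is the practical benefit over keeping ``datetime.date`` objects.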

TomAugspurger · 2017-05-23T01:58:02Z

doc/source/performance.rst

+     native arbitrary-precision Decimal type.
+
+
+In certain cases, some things will automatically be object type:


Maybe use "unavoidably" instead of "automatically"

codecov · 2017-05-23T02:21:44Z

Codecov Report

Merging #16439 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #16439      +/-   ##
==========================================
- Coverage   90.42%   90.42%   -0.01%     
==========================================
  Files         161      161              
  Lines       51023    51023              
==========================================
- Hits        46138    46137       -1     
- Misses       4885     4886       +1

Flag	Coverage Δ
#multiple	`88.26% <ø> (-0.01%)`	⬇️
#single	`40.17% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/common.py	`91.05% <0%> (-0.34%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 49ec31b...e2ac61a. Read the comment docs.

codecov · 2017-05-23T02:22:03Z

Codecov Report

Merging #16439 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #16439      +/-   ##
==========================================
+ Coverage   90.42%   90.43%   +<.01%     
==========================================
  Files         161      161              
  Lines       51023    51048      +25     
==========================================
+ Hits        46138    46164      +26     
+ Misses       4885     4884       -1

Flag	Coverage Δ
#multiple	`88.27% <ø> (ø)`	⬆️
#single	`40.16% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/feather_format.py	`85.71% <0%> (-2.17%)`	⬇️
pandas/core/frame.py	`97.66% <0%> (-0.03%)`	⬇️
pandas/core/reshape/merge.py	`94.18% <0%> (-0.01%)`	⬇️
pandas/errors/__init__.py	`100% <0%> (ø)`	⬆️
pandas/core/missing.py	`84.27% <0%> (ø)`	⬆️
pandas/core/reshape/reshape.py	`99.28% <0%> (ø)`	⬆️
pandas/io/formats/style.py	`95.92% <0%> (+0.01%)`	⬆️
pandas/io/parsers.py	`95.33% <0%> (+0.01%)`	⬆️
pandas/core/ops.py	`91.69% <0%> (+0.01%)`	⬆️
pandas/plotting/_core.py	`81.92% <0%> (+0.02%)`	⬆️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 49ec31b...16467cb. Read the comment docs.

andymaheshw · 2017-05-23T11:29:12Z

Thank you for the feedback, will incorporate it now.


andymaheshw · 2017-05-24T11:07:25Z

Still need to add notes from the pycon 2017 lecture, and bullets 3-5.

jreback · 2017-05-26T11:57:23Z

doc/source/enhancingperf.rst

+.. ipython:: python
+
+Overview of Common Optimizations
+----------------------------------------


needs to be the same length as text

jreback · 2017-05-26T11:57:54Z

doc/source/enhancingperf.rst

+
+   * At least not for things it's not meant for.
+   * Pandas is very fast at joins, reindex, factorization
+   * Not as great at, say, matrix multiplications or problems that aren't vectorizable


can you be more specific

I think this first 1) part is too generic in its entirety. For me, now it just raises more questions than it answers.

Eg "At least not for things it's not meant for." What is it not meant for?

One option is to turn this around; list some things for which pandas is meant and then say for other problems it is less suited.

jreback · 2017-05-26T11:58:21Z

doc/source/enhancingperf.rst

+
+
+
+2) Another optimization is to make sure to use native types instead of python types.


what you mean is avoid object types for non-strings

jreback · 2017-05-26T11:58:36Z

doc/source/enhancingperf.rst

+   %timeit s2.sum()
+
+
+  NumPy int64 datatype arrays are much faster than the python object version.


what does this mean?
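For context, a minimal sketch of the comparison the diff seems to be making. Only ``s2.sum()`` is visible in the hunk, so the construction of ``s1`` and ``s2`` here is an assumption, and no timing numbers are claimed.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed setup: the same integers stored once as int64 and once as Python objects.
s1 = pd.Series(np.arange(100_000), dtype="int64")
s2 = s1.astype(object)

print(s1.dtype, s2.dtype)

# Both give the same answer; only the storage (and therefore the speed) differs.
total_int = s1.sum()
total_obj = s2.sum()

# Plain-Python equivalent of ``%timeit``; absolute numbers depend on the machine.
t_int = timeit.timeit(s1.sum, number=20)
t_obj = timeit.timeit(s2.sum, number=20)
print(f"int64: {t_int:.4f}s  object: {t_obj:.4f}s")
```

The int64 sum runs in a single compiled loop over a contiguous buffer, while the object version has to add boxed Python integers one at a time, which is presumably what the sentence in the diff is trying to say.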

jreback · 2017-05-26T11:58:57Z

doc/source/enhancingperf.rst

+
+  For other datatypes:
+
+     * Strings - This is usually unavoidable. Pandas 2 will have a specialized


much more specific here. and formatting.

We shouldn't talk yet about pandas 2 here IMO (or at least not start with it). This should mainly explain how to currently deal with strings.

I would explain how strings are handled in pandas -> uses object dtype. And then mention Categoricals as a way to speed up certain use cases when having a string column with many recurring strings.
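A minimal sketch of the suggestion above, assuming a column with only a few distinct strings. The data and variable names are invented for illustration, and ``dtype=object`` is passed explicitly so the example matches the behaviour being discussed.

```python
import pandas as pd

# 30,000 rows but only three distinct values, stored as Python string objects.
colors = pd.Series(["red", "green", "blue"] * 10_000, dtype=object)

# Categorical storage keeps each distinct string once plus small integer codes.
colors_cat = colors.astype("category")

mem_obj = colors.memory_usage(deep=True)
mem_cat = colors_cat.memory_usage(deep=True)
print(f"object: {mem_obj} bytes, category: {mem_cat} bytes")

# The distinct values and the per-row integer codes are both accessible.
print(list(colors_cat.cat.categories))
print(colors_cat.cat.codes.head())
```

The saving shrinks as the number of distinct values approaches the number of rows, so this is only worthwhile for columns with many repeated strings.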

jreback · 2017-05-26T11:59:26Z

doc/source/enhancingperf.rst

+       an int, a float, and a datetime, pandas will have to store all of
+       those as objects. This dataset probably isn't tidy though.
+
+     * Integer NA - Unfortunately, pandas doesn't have real nullable types.


no, read the section on this. We almost always cast to float.

jreback · 2017-05-26T11:59:45Z

doc/source/enhancingperf.rst

+
+     * Integer NA - Unfortunately, pandas doesn't have real nullable types.
+       To represent missingness, pandas uses NaN (not a number) which is a special
+       floating point value. If you have to represent nullable integers, you can


if you add things like this, there are many references in the docs, pls add links.
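To make the point about casting concrete, a small sketch of what happens when a missing value meets integer data. The series names are illustrative only.

```python
import numpy as np
import pandas as pd

# Integers without missing values keep an integer dtype.
s_int = pd.Series([1, 2, 3])

# Introducing NaN upcasts the whole Series to float64 rather than object.
s_nan = pd.Series([1, 2, np.nan])

# The same upcast happens when reindexing creates a missing row.
s_reindexed = s_int.reindex([0, 1, 2, 3])

# Object dtype only appears if it is requested explicitly, as in the diff.
s_obj = pd.Series([1, 2, np.nan], dtype=object)

print(s_int.dtype, s_nan.dtype, s_reindexed.dtype, s_obj.dtype)
```

So the default cost of a missing integer is a float column, which is still a fast native type; the slow object path is opt-in.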

jreback · 2017-05-26T12:00:30Z

doc/source/enhancingperf.rst

+   s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9], dtype=object)
+   type(s[0])
+
+3) Optimize loading dataframes


this is not a good name for this section

Yeah, this is a useful section (a frequently occurring anti-pattern), but maybe something with "Combining dataframes"? (but we want something for which people don't think it is about merging ..)
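The section body is not visible in this hunk, so the following is only a guess at the anti-pattern being discussed: growing a DataFrame inside a loop instead of collecting the pieces and concatenating once. ``make_chunk`` is a hypothetical stand-in for whatever loads one piece (for example, one file).

```python
import pandas as pd


def make_chunk(i):
    """Hypothetical loader: returns one small piece of the final frame."""
    return pd.DataFrame({"chunk": [i] * 3, "value": range(3)})


# Anti-pattern (assumed): every iteration copies all rows accumulated so far.
grown = make_chunk(0)
for i in range(1, 5):
    grown = pd.concat([grown, make_chunk(i)], ignore_index=True)

# Alternative: build a list of pieces, then concatenate a single time.
pieces = [make_chunk(i) for i in range(5)]
combined = pd.concat(pieces, ignore_index=True)

print(combined.shape)
print(grown.equals(combined))
```

Both approaches produce the same frame; the difference is that the loop version re-copies the accumulated data on each pass, so its cost grows roughly quadratically with the number of pieces.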

andymaheshw · 2017-06-08T11:11:43Z

Thanks will fix today!

jreback · 2017-08-17T10:27:29Z

if you can update and respond to comments.

jreback · 2017-10-28T00:24:00Z

closing as stale. if you'd like to continue, pls ping

andymaheshw added 3 commits May 22, 2017 10:19

update

8d42658

update performance.rst file

d7192f2

update performance location

e2ac61a

andymaheshw changed the title ~~Tutorial update2~~ Adding documentation for Performance Improvements using Pandas May 22, 2017

jreback requested changes May 22, 2017


jreback added Docs Performance Memory or execution speed performance labels May 22, 2017

TomAugspurger reviewed May 23, 2017


andymaheshw added 3 commits May 23, 2017 07:51

update for requested changes

7f08e3c

update to enhancing

39ded19

added to enhancing, removing from performance

803f2da

andymaheshw changed the title ~~Adding documentation for Performance Improvements using Pandas~~ WIP - Additional documentation for Performance Improvements using Pandas May 24, 2017

update for enhancing

16467cb

jreback reviewed May 26, 2017


jreback closed this Oct 28, 2017

jreback added this to the No action milestone Oct 28, 2017



