diff --git a/doc/source/enhancingperf.rst b/doc/source/enhancingperf.rst
index 685a8690a53d5..7d4dca9ac645d 100644
--- a/doc/source/enhancingperf.rst
+++ b/doc/source/enhancingperf.rst
@@ -11,14 +11,147 @@
    import pandas as pd
    pd.options.display.max_rows=15
+   import sys
    import os
    import csv
+   plt.style.use('default')
+
 *********************
 Enhancing Performance
 *********************
+
+.. _enhancingperf.overview:
+
+Pandas, for most use cases, will be fast; however, there are a few
+anti-patterns to avoid. The following is a guide to achieving better
+performance with pandas. These are the low-hanging optimizations to
+consider before rewriting your code in Cython or using Numba.
+
+Overview of Common Optimizations
+--------------------------------
+
+1) Using pandas when it is the wrong tool.
+
+   * Pandas is not the right tool for every job.
+   * It is very fast at joins, reindexing, and factorization.
+   * It is not as good at, say, matrix multiplication or problems that
+     aren't vectorizable.
+
+2) Use native types instead of python object types.
+
+.. ipython:: python
+
+   s1 = pd.Series(range(10000), dtype=object)
+   s2 = pd.Series(range(10000), dtype=np.int64)
+
+.. ipython:: python
+
+   %timeit s1.sum()
+
+.. ipython:: python
+
+   %timeit s2.sum()
+
+   The NumPy ``int64`` array is much faster than the python object version.
+
+   For other dtypes:
+
+   * Strings - This is usually unavoidable. Pandas 2 will have a specialized
+     string type, but for now you're stuck with python objects.
+     If you have few distinct values (relative to the number of rows), you
+     can use a ``Categorical``.
+
+   * Dates, Times - Pandas has implemented specialized versions of
+     ``datetime.datetime`` and ``datetime.timedelta``, but not
+     ``datetime.date`` or ``datetime.time``. Depending on your application,
+     you might be able to treat dates as datetimes at midnight.
+
+   * Decimal Types - Pandas uses floating-point arrays; there isn't a
+     native arbitrary-precision ``Decimal`` type.
+
+   In certain cases, some things will unavoidably be object dtype:
+
+   * Reading messy Excel files - ``read_excel`` will preserve the dtype
+     of each cell in the spreadsheet. If you have a single column with
+     an int, a float, and a datetime, pandas will have to store all of
+     those as objects. This dataset probably isn't tidy though.
+
+   * Integer NA - Unfortunately, pandas doesn't have true nullable dtypes.
+     To represent missingness, pandas uses ``NaN`` (not a number), a
+     special floating-point value. If you have to represent nullable
+     integers, you can use object dtype.
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9])
+   s
+
+.. ipython:: python
+
+   type(s[0])
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, 3, np.nan, 5, 6, 7, 8, 9], dtype=object)
+   type(s[0])
+
+3) Optimize loading DataFrames.
+
+   There are two ways to get data into a DataFrame: make a single empty
+   DataFrame and repeatedly append to it, or make a list of many smaller
+   DataFrames and concatenate them at the end.
+
+   The following code makes 100 sets of 50 records each. This could
+   represent any data source, with any number of items in each set. A
+   DataFrame is built from the first set of records.
+
+.. ipython:: python
+
+   import random
+   import string
+
+   records = [[(random.choice(string.ascii_letters),
+                random.choice(string.ascii_letters),
+                random.choice(range(10)))
+               for i in range(50)]
+              for j in range(100)]
+
+   pd.DataFrame(records[0], columns=['A', 'B', 'C'])
+
+   Appending with ``DataFrame.append`` is not efficient, because each call
+   copies the entire DataFrame, as seen below.
+
+.. ipython:: python
+
+   %%timeit
+   # Make an empty DataFrame with the correct columns
+   df = pd.DataFrame(columns=['A', 'B', 'C'])
+
+   for set_ in records:
+       subdf = pd.DataFrame(set_, columns=['A', 'B', 'C'])
+       # append to that original DataFrame
+       df = df.append(subdf, ignore_index=True)
+
+   A better solution is to use ``pandas.concat``.
+
+.. ipython:: python
+
+   pd.concat([pd.DataFrame(set_, columns=['A', 'B', 'C']) for set_ in records],
+             ignore_index=True)
+
+4) Avoid doing too much work with Pandas.
+
+5) Reduce the amount of iteration required.
+
+   Avoid row-wise iteration, including ``.apply`` with ``axis=1``; prefer
+   vectorized operations.
+
+
 .. _enhancingperf.cython:
 
 Cython (Writing C extensions for pandas)
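The ``Categorical`` suggestion for low-cardinality string columns can be sketched as follows; the values and row count here are illustrative, not from the diff:

```python
import pandas as pd

# A low-cardinality string column: many rows, few distinct values
# (the values and row count are illustrative).
s_obj = pd.Series(["low", "medium", "high"] * 200000, dtype=object)

# Stored as a Categorical: one small integer code per row,
# plus a single copy of each distinct label.
s_cat = s_obj.astype("category")

# The categorical version uses far less memory than the object version.
mem_obj = s_obj.memory_usage(deep=True)
mem_cat = s_cat.memory_usage(deep=True)
print(mem_cat < mem_obj)  # prints True
```

Operations such as comparisons and groupbys on the categorical column work on the integer codes rather than on python string objects, which is also why they tend to be faster.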
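The idea of treating dates as datetimes at midnight can be sketched like this; the sample dates are illustrative:

```python
import datetime
import pandas as pd

# datetime.date objects are stored as python objects
dates = [datetime.date(2024, 1, 1), datetime.date(2024, 1, 2)]
s_obj = pd.Series(dates)
print(s_obj.dtype)  # object

# Treating the dates as datetimes at midnight gives a native dtype
s_dt = pd.to_datetime(s_obj)
print(s_dt.dtype)  # datetime64[ns]
print(s_dt.iloc[0])  # 2024-01-01 00:00:00
```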
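A minimal sketch of replacing ``.apply`` with ``axis=1`` by a vectorized operation; the frame and column names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000)})

# Row-wise iteration: calls a python function once per row
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized: a single operation over whole columns
fast = df["a"] + df["b"]

# Both produce the same result; the vectorized version is much faster
print((slow == fast).all())  # prints True
```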