Skip to content

Cythonized GroupBy mad #20024

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from
Closed
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -791,6 +791,7 @@ Performance Improvements
- Improved performance of variable ``.rolling()`` on ``.min()`` and ``.max()`` (:issue:`19521`)
- Improved performance of :func:`pandas.core.groupby.GroupBy.ffill` and :func:`pandas.core.groupby.GroupBy.bfill` (:issue:`11296`)
- Improved performance of :func:`pandas.core.groupby.GroupBy.any` and :func:`pandas.core.groupby.GroupBy.all` (:issue:`15435`)
- Improved performance of :func:`pandas.core.groupby.GroupBy.mad` (:issue:`19165`)

.. _whatsnew_0230.docs:

Expand Down
20 changes: 20 additions & 0 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1353,6 +1353,26 @@ def var(self, ddof=1, *args, **kwargs):
f = lambda x: x.var(ddof=ddof, **kwargs)
return self._python_agg_general(f)

@Substitution(name='groupby')
@Appender(_doc_template)
def mad(self, skipna=True):
if not skipna:
raise NotImplementedError("'skipna=False' not yet implemented")

if self.axis != 0:
return self.apply(lambda x: x.mad(axis=self.axis))

# Wrap in a try..except to catch a TypeError with bool data
# Ideally this would be implemented in `mean` instead of here
try:
Copy link
Member Author

@WillAyd WillAyd Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As touched on in the comments I ideally would not want this try...except and think instead that the mean application should be throwing the error. While mean does raise for object types, something like pd.Series([True, False, True]).mean() is entirely valid and therefore this ends up throwing a TypeError in subsequent operation

I think mean should raise on boolean data and have that pass through. Will open separate issue if you agree

demeaned = np.abs(self.shift(0) - self.transform('mean'))
result = demeaned.groupby(self.grouper.labels).mean()
result.index = self.grouper.result_index
except TypeError:
raise DataError('No numeric types to aggregate')

return result

@Substitution(name='groupby')
@Appender(_doc_template)
def sem(self, ddof=1):
Expand Down
58 changes: 47 additions & 11 deletions pandas/tests/groupby/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
from pandas import (date_range, bdate_range, Timestamp,
Index, MultiIndex, DataFrame, Series,
concat, Panel, DatetimeIndex, read_csv)
from pandas.core.base import DataError
from pandas.core.dtypes.missing import isna
from pandas.errors import UnsupportedFunctionCall, PerformanceWarning
from pandas.util.testing import (assert_frame_equal, assert_index_equal,
Expand Down Expand Up @@ -1300,17 +1301,6 @@ def test_non_cython_api(self):
g = df.groupby('A')
gni = df.groupby('A', as_index=False)

# mad
Copy link
Member Author

@WillAyd WillAyd Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was failing with this change because it allowed for mad operations on object types, returning NaN rather than raising. I figured the explicit tests added elsewhere made more sense and that raising is preferable to NaN

There's also some code that tests any further down within this test. Didn't check in detail but can probably be removed on account of changes done in #19722. Assuming I have updates I'll include in the next batch or alternately open up a separate issue

expected = DataFrame([[0], [np.nan]], columns=['B'], index=[1, 3])
expected.index.name = 'A'
result = g.mad()
assert_frame_equal(result, expected)

expected = DataFrame([[0., 0.], [0, np.nan]], columns=['A', 'B'],
index=[0, 1])
result = gni.mad()
assert_frame_equal(result, expected)

# describe
expected_index = pd.Index([1, 3], name='A')
expected_col = pd.MultiIndex(levels=[['B'],
Expand Down Expand Up @@ -2141,6 +2131,52 @@ def test_groupby_bool_aggs(self, agg_func, skipna, vals):
result = getattr(df.groupby('key'), agg_func)(skipna=skipna)
assert_frame_equal(result, exp_df)

@pytest.mark.parametrize("klass", [Series, DataFrame])
@pytest.mark.parametrize("test_mi", [True, False])
@pytest.mark.parametrize("dtype", ['int', 'float'])
def test_groupby_mad(self, klass, test_mi, dtype):
vals = np.array(range(10)).astype(dtype)
df = DataFrame({'key': ['a'] * 5 + ['b'] * 5, 'val': vals})

idx = pd.Index(['a', 'b'], name='key')
exp = klass([1.2, 1.2], index=idx)
grping = ['key']

if test_mi:
df = df.append(df) # Double the size of the frame
df['newcol'] = ['foo'] * 10 + ['bar'] * 10
grping.append('newcol')

mi = pd.MultiIndex.from_product((exp.index.values,
['bar', 'foo']),
names=['key', 'newcol'])
exp = exp.append(exp)
exp.index = mi

if klass is Series:
exp.name = 'val'
result = df.groupby(grping)['val'].mad()
tm.assert_series_equal(result, exp)
else:
exp = exp.rename(columns={0: 'val'})
result = df.groupby(grping).mad()
tm.assert_frame_equal(result, exp)

@pytest.mark.parametrize("vals", [
['foo'] * 10, [True] * 10])
def test_groupby_mad_raises(self, vals):
df = DataFrame({'key': ['a'] * 5 + ['b'] * 5, 'val': vals})

with tm.assert_raises_regex(DataError,
"No numeric types to aggregate"):
df.groupby('key').mad()

def test_groupby_mad_skipna(self):
df = DataFrame({'key': ['a'] * 5 + ['b'] * 5, 'val': range(10)})
with tm.assert_raises_regex(
NotImplementedError, "'skipna=False' not yet implemented"):
df.groupby('key').mad(skipna=False)

def test_dont_clobber_name_column(self):
df = DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
'name': ['foo', 'bar', 'baz'] * 2})
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/groupby/test_whitelist.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@

AGG_FUNCTIONS = ['sum', 'prod', 'min', 'max', 'median', 'mean', 'skew',
'mad', 'std', 'var', 'sem']
AGG_FUNCTIONS_WITH_SKIPNA = ['skew', 'mad']
AGG_FUNCTIONS_WITH_SKIPNA = ['skew']
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking to why I did this - since mad uses the mean behind the scenes I figured it made sense to rely on mean to do as much of the heavy lifting as possible. Unfortunately mean doesn't currently handle the skipna parameter but I think it's worth addressing within that function and allowing that to pass through to mad rather than implementing specifically within mad.

As always glad to open an issue for that if you agree on approach


df_whitelist = frozenset([
'last',
Expand Down