Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({'A': [7, -1, 4, 5], 'B': [10, 4, 2, 8]}, index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'))
################################
# For transforms, like lambda x: x
################################
# With group_keys=True, apply() respects as_index=False. The same holds when grouping by 'i0' or by ['i0', 'A'].
print(df.groupby('A', as_index=True, group_keys=True).apply(lambda x: x, include_groups=False))
print(df.groupby('A', as_index=False, group_keys=True).apply(lambda x: x, include_groups=False))
# With group_keys=False, apply() does not respect as_index=False. The same holds when grouping by 'i0' or by ['i0', 'A'].
print(df.groupby('A', as_index=True, group_keys=False).apply(lambda x: x, include_groups=False))
print(df.groupby('A', as_index=False, group_keys=False).apply(lambda x: x, include_groups=False))
################################
# For non-transforms, like lambda x: pd.DataFrame([x.iloc[0].sum()])
################################
# With group_keys=True, grouping by a data column respects as_index=False. The same holds when grouping by 'i0' or by ['i0', 'A'].
print(df.groupby('A', as_index=True, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', as_index=False, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
# With group_keys=False, grouping by a data column does not respect as_index=False. The same holds when grouping by 'i0' or by ['i0', 'A'].
print(df.groupby('A', as_index=True, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', as_index=False, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
Issue Description
groupby.apply respects as_index=False if and only if group_keys=True, but the documentation suggests that it should only respect as_index if group_keys=False.
My apologies in advance if I'm duplicating an issue or misunderstanding the intended behavior here. I know there has been some relevant discussion in #49543.
Expected Behavior
I don't know what the correct behavior is here. A simple and easily explainable behavior would be to always respect as_index=False. However, to be consistent with the documentation quoted below, transform-like applies should never respect as_index=False, and I suppose that non-transform-like applies should respect it:
Since transformations do not include the groupings that are used to split the result, the arguments as_index and sort in DataFrame.groupby() and Series.groupby() have no effect.
When group_keys=True, the result does include the "groupings that are used to split the result", so for the same reason that this note gives, as_index should have no effect. The current behavior is the opposite, though: as_index has an effect only when group_keys=True. (Despite the description of group_keys, it appears that apply includes the group keys in the index if and only if group_keys=True, regardless of whether func is a transform.)
Installed Versions
INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.9.18.final.0
python-bits : 64
OS : Darwin
OS-release : 23.3.0
Version : Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None