BUG: pandas >= 2.0 parses 'May' in ambiguous way. #57521

Closed
3 tasks done
ffyring opened this issue Feb 20, 2024 · 8 comments
Labels
Bug; datetime.date (stdlib datetime.date support); Needs Discussion (Requires discussion from core team before further action)

Comments

@ffyring
ffyring commented Feb 20, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

csv_ok = """
REF_DATE, VAL
20-Apr-2024, 120
20-May-2024, 100
"""

csv_error = """
REF_DATE, VAL
20-May-2024, 120
20-Apr-2024, 100
"""

# This works fine
df_ok = pd.read_csv(StringIO(csv_ok))
df_ok['REF_DATE'] = pd.to_datetime(df_ok['REF_DATE'])
df_ok['NEXT_DATE'] = df_ok['REF_DATE'] + pd.Timedelta(days=1)

# This gives an error
df_error = pd.read_csv(StringIO(csv_error))
df_error['REF_DATE'] = pd.to_datetime(df_error['REF_DATE'])
df_error['NEXT_DATE'] = df_error['REF_DATE'] + pd.Timedelta(days=1)

Issue Description

Code above works with pandas<2.0 and fails with pandas>=2.0

The failure occurs because pandas >= 2.0 infers the date format in the second example as "%d-%B-%Y" (full month name) instead of "%d-%b-%Y" (three-letter month abbreviation), since "May" is both the full name and the three-letter abbreviation of the 5th month of the year. The second row, "20-Apr-2024", then does not match the inferred format.
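
To make the ambiguity concrete, here is a minimal sketch using only the standard library's datetime.strptime rather than pandas' own parser, and assuming English month names (the default C locale). The helper name parses is made up for this illustration.

```python
from datetime import datetime


def parses(value: str, fmt: str) -> bool:
    """Return True if `value` matches `fmt` under datetime.strptime."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# "May" is both the abbreviated and the full month name, so both formats match.
may_abbrev = parses("20-May-2024", "%d-%b-%Y")
may_full = parses("20-May-2024", "%d-%B-%Y")

# "Apr" is only an abbreviation, so the full-name format rejects it.
apr_abbrev = parses("20-Apr-2024", "%d-%b-%Y")
apr_full = parses("20-Apr-2024", "%d-%B-%Y")

print(may_abbrev, may_full, apr_abbrev, apr_full)
```

A format guessed from "20-May-2024" alone can therefore be either one, and only "%d-%b-%Y" also fits "20-Apr-2024".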

The cause is the removal of the default argument infer_datetime_format=False in core.tools.datetimes._convert_listlike_datetimes.
pandas >= 2.0 then always uses

def _guess_datetime_format_for_array(arr, dayfirst: bool | None = False) -> str | None:
    # Try to guess the format based on the first non-NaN element, return None if can't
    if (first_non_null := tslib.first_non_null(arr)) != -1:
        if type(first_non_nan_element := arr[first_non_null]) is str:  # noqa: E721
            # GH#32264 np.str_ object
            guessed_format = guess_datetime_format(
                first_non_nan_element, dayfirst=dayfirst
            )
            if guessed_format is not None:
                return guessed_format
            # If there are multiple non-null elements, warn about
            # how parsing might not be consistent
            if tslib.first_non_null(arr[first_non_null + 1 :]) != -1:
                warnings.warn(
                    "Could not infer format, so each element will be parsed "
                    "individually, falling back to `dateutil`. To ensure parsing is "
                    "consistent and as-expected, please specify a format.",
                    UserWarning,
                    stacklevel=find_stack_level(),
                )
    return None

which guesses the format from the ambiguous first element alone.

Expected Behavior

pandas should infer the datetime format correctly, using several entries when the format is ambiguous.
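
As a rough sketch of what "using several entries" could look like, the hypothetical helper below (pick_format, not a pandas function) tries a list of candidate formats against every value and returns the first one that parses them all. It uses the standard library's datetime.strptime and assumes English month names; it is not how pandas implements inference.

```python
from datetime import datetime
from typing import Iterable, Optional, Sequence


def pick_format(values: Sequence[str], candidates: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: first candidate format that parses every value."""
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            # At least one value does not match this candidate; try the next.
            continue
        return fmt
    return None


candidates = ["%d-%B-%Y", "%d-%b-%Y"]

# The "Apr" row rules out the full-month-name format.
chosen = pick_format(["20-May-2024", "20-Apr-2024"], candidates)

# With only "May" rows, both candidates fit and the first one wins.
all_may = pick_format(["20-May-2024", "21-May-2024"], candidates)

print(chosen, all_may)
```

Note that when every row says "May", looking at more entries does not resolve the ambiguity; the result just depends on the order of the candidates.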

Installed Versions

INSTALLED VERSIONS

commit : f538741
python : 3.11.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Swedish_Sweden.1252
pandas : 2.2.0
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.1.0
pip : 24.0
Cython : None
pytest : 8.0.0
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.27
tables : None
tabulate : 0.9.0
xarray : 2024.1.1

@ffyring ffyring added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 20, 2024
@ffyring ffyring changed the title BUG: BUG: pandas >= 2.0 parses 'May' in ambiguous way. Feb 20, 2024
@santhoshbethi
Contributor

It is working fine for me. I am using the installed versions.

@ffyring
Author

ffyring commented Feb 20, 2024

It is working fine for me. I am using the installed versions.

And the installed version is above 2.0?

@santhoshbethi
Contributor

santhoshbethi commented Feb 20, 2024

Yes

My bad. I can reproduce the error. I am using pandas 2.2.0 now.

@santhoshbethi
Contributor

I did some debugging. The input below also fails on pandas >= 2.0, while it works fine on pandas < 2.0.

csv_error = """
REF_DATE, VAL
20-05-2024, 120
20-May-2024, 100
"""

When we pass format='mixed' it works as expected. I guess this would be the solution.

df_error = pd.read_csv(StringIO(csv_error))
df_error['REF_DATE'] = pd.to_datetime(df_error['REF_DATE'], format='mixed')
df_error['NEXT_DATE'] = df_error['REF_DATE'] + pd.Timedelta(days=1)

@ffyring
Author

ffyring commented Feb 21, 2024

I did some debugging. The input below also fails on pandas >= 2.0, while it works fine on pandas < 2.0.

csv_error = """
REF_DATE, VAL
20-05-2024, 120
20-May-2024, 100
"""

When we pass format='mixed' it works as expected. I guess this would be the solution.

df_error = pd.read_csv(StringIO(csv_error))
df_error['REF_DATE'] = pd.to_datetime(df_error['REF_DATE'], format='mixed')
df_error['NEXT_DATE'] = df_error['REF_DATE'] + pd.Timedelta(days=1)

Thanks, passing format='mixed' works fine.

I still think it is a bug or a regression though, possibly breaking old code bases, since the format really isn't mixed in this case.

@rhshadrach
Member

rhshadrach commented Feb 25, 2024

Thanks for the report!

pandas should infer the datetime format correctly, using several entries when the format is ambiguous.

How many? And if they are all "May"?

I still think it is a bug or a regression though, possibly breaking old code bases, since the format really isn't mixed in this case.

The breaking change was announced here: https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#datetimes-are-now-parsed-with-a-consistent-format

Thanks, passing format='mixed' works fine.

I'd recommend format="%d-%b-%Y" instead, if the format is indeed not mixed.

It seems to me that inference is a guess, and sometimes guesses can be wrong. I'd recommend not relying on inference when possible.

cc @MarcoGorelli
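
For reference, the explicit-format recommendation above could be applied to the failing example roughly as follows. This is a sketch: the date strings are the ones from the report, while the variable names are only illustrative, and English month names are assumed.

```python
import pandas as pd

# Same order as the failing CSV in the report: the ambiguous "May" row comes first.
dates = pd.Series(["20-May-2024", "20-Apr-2024"])

# An explicit format skips inference, so "May" is read as an abbreviation (%b).
parsed = pd.to_datetime(dates, format="%d-%b-%Y")
next_day = parsed + pd.Timedelta(days=1)

print(parsed.tolist())
print(next_day.tolist())
```

With the format spelled out, the result no longer depends on which row happens to come first.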

@rhshadrach rhshadrach added datetime.date stdlib datetime.date support Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 25, 2024
@ffyring
Author

ffyring commented Feb 25, 2024

Thanks for the report!

pandas should infer the datetime format correctly, using several entries when the format is ambiguous.

How many? And if they are all "May"?

I still think it is a bug or a regression though, possibly breaking old code bases, since the format really isn't mixed in this case.

The breaking change was announced here:
I'd recommend format="%d-%b-%Y" instead, if the format is indeed not mixed.

It seems to me that inference is a guess, and sometimes guesses can be wrong. I'd recommend not relying on inference when possible.

cc @MarcoGorelli

Thanks, definitely not a bug then. I think your viewpoint is reasonable and this can be closed.

@ffyring
Author

ffyring commented Feb 25, 2024

Not a bug.


3 participants