Skip to content

BUG GH#43414 Infer and coerce datetimes am pm #43416

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

BUG GH#43414 Infer and coerce datetimes am pm #43416

wants to merge 3 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Sep 5, 2021

@ghost
Copy link
Author

ghost commented Sep 5, 2021

Something interesting I found is that when we infer_datetime_format we actually only use the first record. This is important because in some of the tests we generate data with pd.date_range where by default the first record will start at midnight. This can lead to different behavior than if the data did not start with a record at midnight, which is probably more common in practice.

@simonjayhawkins
Copy link
Member

Thanks @mikephung122 for the PR.

tests are failing https://github.com/pandas-dev/pandas/pull/43416/checks?check_run_id=3517046932

Something interesting I found is that when we infer_datetime_format we actually only use the first record.

This is clearly stated in the docs.

infer_datetime_format : bool, default False
        If True and no `format` is given, attempt to infer the format of the
        datetime strings based on the first non-NaN element,
        and if it can be inferred, switch to a faster method of parsing them.
        In some cases this can increase the parsing speed by ~5-10x.

There is an underlying issue with infer_datetime_format=True and errors='coerce' specified together, #25143

The fact that specifying errors='raise' or not specifying infer_datetime_format=True gives the correct result with the code sample in the issue OP #43414, would fixing the underlying issue not fix this issue without the code added here?

@simonjayhawkins simonjayhawkins added Bug Datetime Datetime data dtype labels Sep 5, 2021
@ghost ghost changed the title Infer datetimes am pm BUG GH#43414 Infer datetimes am pm Sep 5, 2021
@ghost ghost changed the title BUG GH#43414 Infer datetimes am pm BUG GH#43414 Infer and coerce datetimes am pm Sep 5, 2021
@ghost
Copy link
Author

ghost commented Sep 5, 2021

@simonjayhawkins When I found this problem I also thought it was related to #25143. However, the more I looked into it the more I feel like it's unclear and now I believe fixing this as a separate issue might make more sense.

I think this issue and fix by itself is straight-forward. When converting we always try to guess the datetime format first, regardless of the errors argument and error handling. Ideally, the guess_datetime_format function can and should be able to guess date and time formats with AM/PM. However, right now it will return something like this '%m/%d/%Y %H:%M:%S AM' for AM/PM. So when we try to use that to parse with our best guess it never works and the subsequent issue of inconsistent error handling and the fallback is surfaced.
Screen Shot 2021-09-04 at 10 08 45 PM

We might be able to completely ignore this issue since the fallback method works. However, I'm wondering why it was including in the first place if that's true. The other option would be trying the fallback before accepting the coerced NaT, but then I would say this is still an issue worth fixing. Overall, I think cleaning this up and ensuring that no existing functionality is broken would be significantly more difficult. What do you think?

@ghost
Copy link
Author

ghost commented Sep 5, 2021

@simonjayhawkins I checked the test, it looks like its the new unit tests that are failing for one posix environment while another posix environment shows INTERNALERROR and 'Error: Process completed with exit code 3.'. However, the unit tests don't fail locally or on the other posix environments. I'm wondering if this is an issue with how these are built and ran but don't see anything immediately obvious from the logs during the build or test? This is actually my first time touching any of the cython, are there any additional steps that need to be done outside of changing the .pyx file?
Screen Shot 2021-09-07 at 4 48 23 PM

@ghost ghost closed this Sep 18, 2021
@ghost ghost deleted the infer-datetimes-am-pm branch September 30, 2021 03:03
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.to_datetime cannot infer and coerce dates with AM/PM and infer_datetime_format=True and errors="coerce".
1 participant