coerce OutOfBoundsDatetime into 2199 #602


Closed
rainermensing opened this issue May 22, 2024 · 10 comments

Comments

@rainermensing
Contributor

rainermensing commented May 22, 2024

The problem is that currently the sasdata2dataframe method involves a step where timestamps are silently coerced using df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], errors='coerce'). Pandas cannot handle timestamps beyond a certain value ('2262-04-11 23:47:16.854775807'); SAS, however, does support larger timestamps.
This means all timestamps beyond this limit are silently coerced into NaT within pandas, even though such values are not uncommon, e.g. 9999-12-31 as a default 'valid_to' value.
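To make the limit concrete, here is a minimal illustration (my own, not from the issue) of the pandas bounds and the silent coercion:

```python
import pandas as pd

# pandas stores timestamps as 64-bit nanosecond counts, which bounds
# the representable range to roughly 1677-09-21 .. 2262-04-11:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# A common SAS 'valid_to' sentinel silently becomes NaT:
result = pd.to_datetime(pd.Series(['9999-12-31']), errors='coerce')
print(result.isna().all())  # True
```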

My proposal is to change the timestamp logic in all relevant methods along these lines:

import pandas as pd

# Sample date string
coerce_into_2199_ts = False
timestamp_format = 'mixed'  # avoids warnings about an unspecified timestamp format
date_string = '9999-01-01'  # try None vs '9999-01-01' vs '2000-01-01'

# Create a DataFrame with the timestamp
df = pd.DataFrame({'timestamp': [date_string]})

if coerce_into_2199_ts:
    # Convert the date string to a timestamp, replacing out-of-bounds values
    # while preserving existing null values.
    # credit: https://stackoverflow.com/questions/77690640/pandas-to-datetime-when-dates-are-9999-01-01
    df["timestamp2"] = (pd.to_datetime(df["timestamp"], errors='coerce', format=timestamp_format)
                        .mask(lambda x: x.isna() & df["timestamp"].notna(),  # keep existing nulls as NaT
                              pd.Timestamp('2199-12-31')))
else:
    try:
        df["timestamp2"] = pd.to_datetime(df["timestamp"], format=timestamp_format)
    except pd.errors.OutOfBoundsDatetime:
        raise pd.errors.OutOfBoundsDatetime(
            "The column contains timestamps that are too large. "
            "Consider using coerce_into_2199_ts = True")

The default of coerce_into_2199_ts would be False, so that the coercion is no longer silent. At the same time, the information of the maximum timestamp would be preserved, as would any existing NaT values in the timestamp column.
Hence, e.g., for sasioiom.py:

               if k_dts is None:  # don't override these if user provided their own dtypes
                  for i in range(nvars):
                     if vartype[i] == 'N':
                        if varcat[i] in self._sb.sas_date_fmts + self._sb.sas_time_fmts + self._sb.sas_datetime_fmts:
                           if coerce_into_2199_ts:
                              # Replace out-of-bounds values while preserving existing nulls.
                              # credit: https://stackoverflow.com/questions/77690640/pandas-to-datetime-when-dates-are-9999-01-01
                              df[dvarlist[i]] = (pd.to_datetime(df[dvarlist[i]], errors='coerce', format=timestamp_format)
                                                   .mask(lambda x: x.isna() & df[dvarlist[i]].notna(),
                                                         pd.Timestamp('2199-12-31')))
                           else:
                              try:
                                 df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], format=timestamp_format)
                              except pd.errors.OutOfBoundsDatetime:
                                 raise pd.errors.OutOfBoundsDatetime("The column contains timestamps that are too large. Consider using coerce_into_2199_ts = True")
@rainermensing
Contributor Author

Actually, I just realized that the ideal solution would be if you could somehow do the preprocessing of the date/timestamp columns in SAS already, i.e. take all date and timestamp columns and replace too-large values with 2199-12-31. Then you would not need to use coerce at all.
For sasdata2parquet, the best solution would be to not rely on pandas anymore but go straight to pyarrow... but that comes with its own pitfalls, as we have seen.

@tomweber-sas
Collaborator

Hey @rainermensing let me take a look at this when I get a chance. It would be good to have an option to address this.

@tomweber-sas
Collaborator

Hey @rainermensing , I have to apologize for taking so long to get to this. I've been swamped w/ internal SAS work, but I'm looking at this now. I hope to have this and the parquet support from the other issue finished up this week. Did you still have some changes for that other parquet issue to see about integrating? Or is it good as it is?

Thanks!
Tom

@rainermensing
Contributor Author

Hi @tomweber-sas . Thank you for getting back to this.
As for this issue, one thing I did not consider when opening it is that out-of-bounds values can be very large positive AND negative dates. So you don't want to cast both to the same value, but to an upper and a lower value respectively.
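A quick illustration of that point (a sketch of mine, not from the thread): overflow in either direction collapses to the same NaT, so the direction of the overflow is lost.

```python
import pandas as pd

# One date below pd.Timestamp.min (1677-09-21) and one above
# pd.Timestamp.max (2262-04-11): both coerce to the same NaT,
# so you can no longer tell which bound was violated.
s = pd.to_datetime(pd.Series(['1600-01-01', '9999-12-31', '2000-06-15']),
                   errors='coerce')
print(s.isna().tolist())  # [True, True, False]
```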

For the parquet issue, I can only test our full workflow during weekends, and we discovered one more bug last week that I hope I have now finally fixed (and that there won't be anything else). I will create a new merge request with the final logic some time this or next week, so please be patient... I hope this will not cause you any issues.

@tomweber-sas
Collaborator

Yes, I started playing with the code sample you provided, and I started thinking of adding a couple of options to sd2df for max/min timestamp values. If either or both are provided, do the replacement (I'm going to look at doing it in SAS code and in the Python code, just to see which works better). That way users can provide the replacement values they want, and without specifying either, the behavior remains the same so as to not break anyone. I hope to finish looking at that tomorrow.
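The Python-side variant of that idea could be sketched roughly like this (a minimal sketch of mine, not saspy code: the helper name, its signature, and the ISO-string comparison trick are all my own illustration, and it assumes pandas 2.x for format='mixed'):

```python
import pandas as pd

def to_datetime_clipped(col, tsmin=None, tsmax=None):
    # Hypothetical helper: convert a string column to timestamps,
    # substituting user-supplied values for out-of-bounds dates.
    # With neither tsmin nor tsmax set, behavior is unchanged apart
    # from coercing out-of-bounds values to NaT.
    converted = pd.to_datetime(col, errors='coerce', format='mixed')
    # Rows that parsed to NaT even though the source was not null
    # are the out-of-bounds ones; genuine nulls stay NaT.
    overflow = converted.isna() & col.notna()
    # Assumes zero-padded ISO-8601 strings, so a lexical comparison
    # against the pandas bounds tells us the overflow direction.
    col_str = col.fillna('').astype(str)
    if tsmin is not None:
        low = overflow & (col_str < str(pd.Timestamp.min))
        converted = converted.mask(low, pd.Timestamp(tsmin))
    if tsmax is not None:
        high = overflow & (col_str > str(pd.Timestamp.max))
        converted = converted.mask(high, pd.Timestamp(tsmax))
    return converted
```

Doing the replacement in SAS code before transfer (the other option mentioned above) avoids the string heuristics entirely, since SAS still has the real values at that point.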

And yes, as for the parquet, once you have those last changes, I can try them out and see about finishing that up too. No problem with any of this, that's what I'm here for :)

Thanks,
Tom

@tomweber-sas
Collaborator

@rainermensing, I've finally had time to prototype this. I just pushed to a new branch named datetime. I was able to do this on the SAS side, generating SAS code so the timestamps come across w/out need of coercion if you provide a replacement timestamp for high or low values. I've done it for both SAS datetimes and SAS dates, as they both go to a pandas timestamp, so both need this.

I added two parms to sd2df, tsmin and tsmax. They provide replacement values. Here's a test case I've used and the output from the sd2df methods. I also added this to the sd2pq method, as it's the same implementation.

If you have a chance to play with this and see what you think, that would be great!

import saspy; sas = saspy.SASsession(cfgname='iomj', results='text'); sas
sd  = sas.sasdata('b')
sda = sas.sasdata('a')


sas.submitLST('''
data a; format x best32.10;
   x='1677-09-21 00:12:42'dt;output;
   x='1677-09-21 00:12:43.145224193'dt;output;
   x='1677-09-21 00:12:43.145225'dt;output;
   x='1677-09-21 00:12:43.146'dt;output;
   x='1866-01-03 12:00:00.101010101'dt;output;
   x='1966-01-03 12:00:00.101010101'dt;output;
   x='2066-01-03 12:00:00.101010101'dt;output;
   x='2262-04-11 23:47:16.854'dt;output;
   x='2262-04-11 23:47:16.854775'dt;output;
   x='2262-04-11 23:47:16.9'dt;output;
   x='2262-04-11 23:47:17'dt;output;
run;
proc print;run;

data b; set a;
     format x E8601DT26.6;
put x;
run;
''', method='listandlog')

sd.to_df()
sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')

sd.to_df(tsmax='1966-01-03 23:59:59.111111')
sd.to_df(tsmin='1966-01-03 00:00:00.000000')



sas.submitLST('''
data a; format x best32.10;
   x=datepart('1677-09-20 00:12:42'dt);output;
   x=datepart('1677-09-21 00:12:43.145224193'dt);output;
   x=datepart('1677-09-22 00:12:43.145225'dt);output;
   x=datepart('1866-01-03 12:00:00.101010101'dt);output;
   x=datepart('1966-01-03 12:00:00.101010101'dt);output;
   x=datepart('2066-01-03 12:00:00.101010101'dt);output;
   x=datepart('2262-04-10 23:47:16.854'dt);output;
   x=datepart('2262-04-11 23:47:16.854775'dt);output;
   x=datepart('2262-04-12 23:47:16.9'dt);output;
run;
proc print;run;

data b; set a;
     format x E8601DA10.;
put x;
run;
''', method='listandlog')

sd.to_df()
sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')

sd.to_df(tsmax='1966-01-03 23:59:59.111111')
sd.to_df(tsmin='1966-01-03 00:00:00.000000')

for the first, datetimes:

>>> sd.to_df()
                            x
0                         NaT
1                         NaT
2  1677-09-21 00:12:43.145226
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  2262-04-11 23:47:16.854774
9                         NaT
10                        NaT
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')
                            x
0  1966-01-03 00:00:00.000000
1  1966-01-03 00:00:00.000000
2  1966-01-03 00:00:00.000000
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  1966-01-03 23:59:59.111111
9  1966-01-03 23:59:59.111111
10 1966-01-03 23:59:59.111111
>>>
>>> sd.to_df(tsmax='1966-01-03 23:59:59.111111')
                            x
0                         NaT
1                         NaT
2  1677-09-21 00:12:43.145226
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  1966-01-03 23:59:59.111111
9  1966-01-03 23:59:59.111111
10 1966-01-03 23:59:59.111111
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000')
                            x
0  1966-01-03 00:00:00.000000
1  1966-01-03 00:00:00.000000
2  1966-01-03 00:00:00.000000
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  2262-04-11 23:47:16.854774
9                         NaT
10                        NaT
>>>

and for dates:

>>> sd.to_df()
           x
0        NaT
1        NaT
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8        NaT
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')
           x
0 1966-01-03
1 1966-01-03
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8 1966-01-03
>>>
>>> sd.to_df(tsmax='1966-01-03 23:59:59.111111')
           x
0        NaT
1        NaT
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8 1966-01-03
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000')
           x
0 1966-01-03
1 1966-01-03
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8        NaT
>>>

@tomweber-sas
Collaborator

#609 merged this in. I believe it allows for what you wanted and more. You can set the high and low timestamps and it works for SAS dates as well as datetimes (timestamps). I've merged this in to main now. I need to write doc and test some more, but then I'll build this into a new release. I'll close this after that's all done.
Thanks!
Tom

@rainermensing
Contributor Author

rainermensing commented Jul 10, 2024 via email

@tomweber-sas
Collaborator

Thanks @rainermensing , and no problem, enjoy your vacation! This will all be there when you get back; should be in the latest production version then :)
Tom

@tomweber-sas
Collaborator

I've built this into the latest production version: V5.100.0 - SASPy's 100th release! I'll close this, but if you see anything once you have a chance to try it out, just reopen or open a new one. Thanks again for all of your help with these enhancements! They all help make SASPy better for all!
Thanks,
Tom
