coerce OutOfBoundsDatetime into 2199 #602


Closed
rainermensing opened this issue May 22, 2024 · 10 comments

Comments

@rainermensing
Contributor

rainermensing commented May 22, 2024

The problem is that currently the sasdata2dataframe method involves a step where timestamps are silently coerced using df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], errors='coerce'). Pandas cannot handle timestamps beyond a certain value ('2262-04-11 23:47:16.854775807'); SAS, however, does support larger timestamps.
This means all timestamps beyond this limit are silently coerced into NaT within pandas, even though such values are not uncommon, e.g. 9999-12-31 as a default 'valid_to' value.
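To make the limit concrete, here is a minimal illustration (my own, not from the issue) of the pandas bounds and the silent coercion:

```python
import pandas as pd

# pandas stores timestamps as 64-bit nanosecond counts, which bounds
# the representable range to roughly 1677-09-21 .. 2262-04-11:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# A common SAS 'valid_to' sentinel silently becomes NaT:
result = pd.to_datetime(pd.Series(['9999-12-31']), errors='coerce')
print(result.isna().all())  # True
```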

My proposal is to change the timestamp logic in all relevant methods along these lines:

import pandas as pd

# Sample date string
coerce_into_2199_ts = False
timestamp_format = 'mixed'  # avoids warnings about an unspecified timestamp format
date_string = '9999-01-01'  # try None vs '9999-01-01' vs '2000-01-01'

# Create a DataFrame with the timestamp
df = pd.DataFrame({'timestamp': [date_string]})

if coerce_into_2199_ts:
    # Convert the date string to a timestamp, replacing out-of-bounds values
    # while preserving existing null values.
    # credit: https://stackoverflow.com/questions/77690640/pandas-to-datetime-when-dates-are-9999-01-01
    df["timestamp2"] = (pd.to_datetime(df["timestamp"], errors='coerce', format=timestamp_format)
                        .mask(lambda x: x.isna() & df["timestamp"].notna(),  # keep existing nulls as NaT
                              pd.Timestamp('2199-12-31')))
else:
    try:
        df["timestamp2"] = pd.to_datetime(df["timestamp"], format=timestamp_format)
    except pd.errors.OutOfBoundsDatetime:
        raise pd.errors.OutOfBoundsDatetime(
            "The column contains timestamps that are too large. "
            "Consider using coerce_into_2199_ts = True")

The default of coerce_into_2199_ts would be False, so that the coercion is no longer silent. At the same time, the information of the maximum timestamp would be preserved, as would any existing NaT values in the timestamp column.
Hence, e.g., for sasioiom.py:

               if k_dts is None:  # don't override these if user provided their own dtypes
                  for i in range(nvars):
                     if vartype[i] == 'N':
                        if varcat[i] in self._sb.sas_date_fmts + self._sb.sas_time_fmts + self._sb.sas_datetime_fmts:
                           if coerce_into_2199_ts:
                              # Replace out-of-bounds values while preserving existing nulls.
                              # credit: https://stackoverflow.com/questions/77690640/pandas-to-datetime-when-dates-are-9999-01-01
                              df[dvarlist[i]] = (pd.to_datetime(df[dvarlist[i]], errors='coerce', format=timestamp_format)
                                                   .mask(lambda x: x.isna() & df[dvarlist[i]].notna(),
                                                         pd.Timestamp('2199-12-31')))
                           else:
                              try:
                                 df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], format=timestamp_format)
                              except pd.errors.OutOfBoundsDatetime:
                                 raise pd.errors.OutOfBoundsDatetime("The column contains timestamps that are too large. Consider using coerce_into_2199_ts = True")
@rainermensing
Contributor Author

Actually, I just realized that the ideal solution would be if you could somehow do the preprocessing of the date/timestamp columns in SAS already, i.e. take all date and timestamp columns and replace too-large values with 2199-12-31. Then you would not need to use coerce at all.
For sasdata2parquet, the best solution would be to not rely on pandas anymore but go straight to pyarrow... but that comes with its own pitfalls, as we have seen.

@tomweber-sas
Collaborator

Hey @rainermensing let me take a look at this when I get a chance. It would be good to have an option to address this.

@tomweber-sas
Collaborator

Hey @rainermensing , I have to apologize for taking so long to get to this. I've been swamped w/ internal SAS work, but I'm looking at this now. I hope to have this and the parquet support from the other issue finished up this week. Did you still have some changes for that other parquet issue to see about integrating? Or is it good as it is?

Thanks!
Tom

@rainermensing
Contributor Author

Hi @tomweber-sas . Thank you for getting back to this.
As for this issue, one thing I did not consider when opening it is that out-of-bounds values can be very large positive AND negative dates. So you don't want to cast both to the same value, but to an upper and a lower value respectively.
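A quick illustration of that point (a sketch of mine, not from the thread): overflow in either direction collapses to the same NaT, so the direction of the overflow is lost.

```python
import pandas as pd

# One date below pd.Timestamp.min (1677-09-21) and one above
# pd.Timestamp.max (2262-04-11): both coerce to the same NaT,
# so you can no longer tell which bound was violated.
s = pd.to_datetime(pd.Series(['1600-01-01', '9999-12-31', '2000-06-15']),
                   errors='coerce')
print(s.isna().tolist())  # [True, True, False]
```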

For the parquet issue, I can only test our full workflow during weekends, and we discovered one more bug last week that I hope I have now finally fixed (and that there won't be anything else). I will create a new merge request with the final logic some time this or next week, so please be patient... I hope this will not cause you any issues.

@tomweber-sas
Collaborator

Yes, I started playing with the code sample you provided, and I started thinking of adding a couple of options to sd2df for max/min timestamp values. If either or both are provided, do the replacement (I'm going to look at doing it in SAS code and in the Python code, just to see which works better). That way users can provide the replacement values they want, and without specifying either, the behavior remains the same so as to not break anyone. I hope to finish looking at that tomorrow.
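The Python-side variant of that idea could be sketched roughly like this (a minimal sketch of mine, not saspy code: the helper name, its signature, and the ISO-string comparison trick are all my own illustration, and it assumes pandas 2.x for format='mixed'):

```python
import pandas as pd

def to_datetime_clipped(col, tsmin=None, tsmax=None):
    # Hypothetical helper: convert a string column to timestamps,
    # substituting user-supplied values for out-of-bounds dates.
    # With neither tsmin nor tsmax set, behavior is unchanged apart
    # from coercing out-of-bounds values to NaT.
    converted = pd.to_datetime(col, errors='coerce', format='mixed')
    # Rows that parsed to NaT even though the source was not null
    # are the out-of-bounds ones; genuine nulls stay NaT.
    overflow = converted.isna() & col.notna()
    # Assumes zero-padded ISO-8601 strings, so a lexical comparison
    # against the pandas bounds tells us the overflow direction.
    col_str = col.fillna('').astype(str)
    if tsmin is not None:
        low = overflow & (col_str < str(pd.Timestamp.min))
        converted = converted.mask(low, pd.Timestamp(tsmin))
    if tsmax is not None:
        high = overflow & (col_str > str(pd.Timestamp.max))
        converted = converted.mask(high, pd.Timestamp(tsmax))
    return converted
```

Doing the replacement in SAS code before transfer (the other option mentioned above) avoids the string heuristics entirely, since SAS still has the real values at that point.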

And yes, as for the parquet, once you have those last changes, I can try them out and see about finishing that up too. No problem with any of this, that's what I'm here for :)

Thanks,
Tom

@tomweber-sas
Collaborator

@rainermensing, I've finally had time to prototype this. I just pushed to a new branch named datetime. I was able to do this on the SAS side, generating SAS code so the timestamps come across w/out need of coercion if you provide a replacement timestamp for high or low values. I've done it for both SAS datetimes and SAS dates, as they both go to a pandas timestamp, so both need this.

I added two parms to sd2df, tsmin and tsmax. They provide replacement values. Here's a test case I've used and the output from the sd2df methods. I also added this to the sd2pq method, as it's the same implementation.

If you have a chance to play with this and see what you think, that would be great!

import saspy; sas = saspy.SASsession(cfgname='iomj', results='text'); sas
sd  = sas.sasdata('b')
sda = sas.sasdata('a')


sas.submitLST('''
data a; format x best32.10;
   x='1677-09-21 00:12:42'dt;output;
   x='1677-09-21 00:12:43.145224193'dt;output;
   x='1677-09-21 00:12:43.145225'dt;output;
   x='1677-09-21 00:12:43.146'dt;output;
   x='1866-01-03 12:00:00.101010101'dt;output;
   x='1966-01-03 12:00:00.101010101'dt;output;
   x='2066-01-03 12:00:00.101010101'dt;output;
   x='2262-04-11 23:47:16.854'dt;output;
   x='2262-04-11 23:47:16.854775'dt;output;
   x='2262-04-11 23:47:16.9'dt;output;
   x='2262-04-11 23:47:17'dt;output;
run;
proc print;run;

data b; set a;
     format x E8601DT26.6;
put x;
run;
''', method='listandlog')

sd.to_df()
sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')

sd.to_df(tsmax='1966-01-03 23:59:59.111111')
sd.to_df(tsmin='1966-01-03 00:00:00.000000')



sas.submitLST('''
data a; format x best32.10;
   x=datepart('1677-09-20 00:12:42'dt);output;
   x=datepart('1677-09-21 00:12:43.145224193'dt);output;
   x=datepart('1677-09-22 00:12:43.145225'dt);output;
   x=datepart('1866-01-03 12:00:00.101010101'dt);output;
   x=datepart('1966-01-03 12:00:00.101010101'dt);output;
   x=datepart('2066-01-03 12:00:00.101010101'dt);output;
   x=datepart('2262-04-10 23:47:16.854'dt);output;
   x=datepart('2262-04-11 23:47:16.854775'dt);output;
   x=datepart('2262-04-12 23:47:16.9'dt);output;
run;
proc print;run;

data b; set a;
     format x E8601DA10.;
put x;
run;
''', method='listandlog')

sd.to_df()
sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')

sd.to_df(tsmax='1966-01-03 23:59:59.111111')
sd.to_df(tsmin='1966-01-03 00:00:00.000000')

for the first, datetimes:

>>> sd.to_df()
                            x
0                         NaT
1                         NaT
2  1677-09-21 00:12:43.145226
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  2262-04-11 23:47:16.854774
9                         NaT
10                        NaT
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')
                            x
0  1966-01-03 00:00:00.000000
1  1966-01-03 00:00:00.000000
2  1966-01-03 00:00:00.000000
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  1966-01-03 23:59:59.111111
9  1966-01-03 23:59:59.111111
10 1966-01-03 23:59:59.111111
>>>
>>> sd.to_df(tsmax='1966-01-03 23:59:59.111111')
                            x
0                         NaT
1                         NaT
2  1677-09-21 00:12:43.145226
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  1966-01-03 23:59:59.111111
9  1966-01-03 23:59:59.111111
10 1966-01-03 23:59:59.111111
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000')
                            x
0  1966-01-03 00:00:00.000000
1  1966-01-03 00:00:00.000000
2  1966-01-03 00:00:00.000000
3  1677-09-21 00:12:43.146000
4  1866-01-03 12:00:00.101010
5  1966-01-03 12:00:00.101010
6  2066-01-03 12:00:00.101010
7  2262-04-11 23:47:16.854000
8  2262-04-11 23:47:16.854774
9                         NaT
10                        NaT
>>>

and for dates:

>>> sd.to_df()
           x
0        NaT
1        NaT
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8        NaT
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000', tsmax='1966-01-03 23:59:59.111111')
           x
0 1966-01-03
1 1966-01-03
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8 1966-01-03
>>>
>>> sd.to_df(tsmax='1966-01-03 23:59:59.111111')
           x
0        NaT
1        NaT
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8 1966-01-03
>>> sd.to_df(tsmin='1966-01-03 00:00:00.000000')
           x
0 1966-01-03
1 1966-01-03
2 1677-09-22
3 1866-01-03
4 1966-01-03
5 2066-01-03
6 2262-04-10
7 2262-04-11
8        NaT
>>>

@tomweber-sas
Collaborator

#609 merged this in. I believe it allows for what you wanted and more. You can set the high and low timestamps and it works for SAS dates as well as datetimes (timestamps). I've merged this in to main now. I need to write doc and test some more, but then I'll build this into a new release. I'll close this after that's all done.
Thanks!
Tom

@rainermensing
Contributor Author

rainermensing commented Jul 10, 2024 via email

@tomweber-sas
Collaborator

Thanks @rainermensing , and no problem, enjoy your vacation! This will all be there when you get back; should be in the latest production version then :)
Tom

@tomweber-sas
Collaborator

I've built this into the latest production version: V5.100.0 - SASPy's 100th release! I'll close this, but if you see anything once you have a chance to try it out, just reopen or open a new one. Thanks again for all of your help with these enhancements! They all help make SASPy better for all!
Thanks,
Tom
