[BUG] cuDF and Pandas return different results for ... #16507
Comments
Generally for This is tough because if the passed objects are strings, we want to accelerate the operation with cudf; otherwise, it can never be accelerated by cudf. There might need to be introspection of the first element to see whether
#16516) xref #16507

`date_range` generates its dates via `range`, and the end of this range was calculated via `math.ceil((end - start) / freq)`. If `(end - start) / freq` did not produce a remainder, `math.ceil` would not increment this value by `1` to capture the last date. Instead, this PR uses `math.floor((end - start) / freq) + 1` to always ensure the last date is captured.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #16516
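The off-by-one described in this commit message can be reproduced with the standard library alone. The sketch below is not cudf's implementation: `count_ceil` and `count_floor_plus_one` are hypothetical helper names, and `datetime`/`timedelta` stand in for the timestamps and the frequency.

```python
import math
from datetime import datetime, timedelta


def count_ceil(start, end, freq):
    """Number of points using the old formula from the description."""
    return math.ceil((end - start) / freq)


def count_floor_plus_one(start, end, freq):
    """Number of points using the formula the PR switches to."""
    return math.floor((end - start) / freq) + 1


start = datetime(2011, 1, 1)
end = datetime(2011, 1, 2)
freq = timedelta(hours=1)

# (end - start) / freq is exactly 24.0, so ceil does not round up
# and the endpoint 2011-01-02 00:00 is dropped.
print(count_ceil(start, end, freq))            # 24
print(count_floor_plus_one(start, end, freq))  # 25
```

When the division does leave a remainder (for example, an end of `2011-01-02 00:30`), both formulas agree on 25 in this example, which is why the problem only shows up when `end` falls exactly on the frequency grid.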
More comments:
Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in its
Do your observations persist if you use a fixed random seed?
This is tracked in #14149
xref #16507

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Matthew Murray (https://github.com/Matt711)

URL: #16515
This is tracked in #7478, but IMO cudf is doing the right thing here
xref #16507

I would say this was a bug before because we would silently return a new DataFrame with just `len(set(column_labels))` columns when selecting by column. Now this operation raises since duplicate column labels are generally not supported.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- https://github.com/brandon-b-miller

URL: #16514
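To see why de-duplicating the requested labels produces a one-column result, here is a minimal standard-library sketch. `check_unique` is a hypothetical guard, and the `ValueError` it raises is an assumption; the commit message does not say which exception cudf raises.

```python
column_labels = ["a", "a"]

# De-duplicating the requested labels collapses the selection to one column,
print(len(set(column_labels)))  # 1
# whereas keeping every requested label gives the two columns pandas returns.
print(len(column_labels))       # 2


def check_unique(labels):
    """Hypothetical guard: reject duplicate labels instead of dropping them silently."""
    if len(set(labels)) != len(labels):
        raise ValueError(f"duplicate column labels are not supported: {labels}")
    return list(labels)
```

Raising early makes the unsupported case visible to the caller rather than returning a frame whose shape differs from what was requested.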
xref #16507

Raising a `NotImplementedError` gives a chance for this to work in `cudf.pandas`.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16525
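The general idea of raising so that a slower path can take over might be sketched as follows. All three functions are hypothetical and only illustrate the try-then-fall-back pattern; they do not reflect how `cudf.pandas` is actually implemented.

```python
def fast_rename(labels, mapper):
    """Hypothetical accelerated path that only supports dict mappers."""
    if callable(mapper):
        raise NotImplementedError("callable mappers are not supported")
    return [mapper.get(label, label) for label in labels]


def slow_rename(labels, mapper):
    """Hypothetical reference path that supports dicts and callables."""
    if callable(mapper):
        return [mapper(label) for label in labels]
    return [mapper.get(label, label) for label in labels]


def rename_with_fallback(labels, mapper):
    """Try the fast path first and fall back when it reports no support."""
    try:
        return fast_rename(labels, mapper)
    except NotImplementedError:
        return slow_rename(labels, mapper)


print(rename_with_fallback(["a", "b"], str.upper))   # ['A', 'B']
print(rename_with_fallback(["a", "b"], {"a": "x"}))  # ['x', 'b']
```

The point of the pattern is that a silent wrong answer from the fast path can never be corrected by a wrapper, while an explicit error can.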
…ions (#16523) xref #16507

In non pandas compat mode, I think this still makes sense to return a `dict` since that's the "scalar" type of a cudf struct/interval type, but in pandas compat mode we should match pandas and return an Interval.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16523
…es (#16527) xref #16507

This turned into a little bit of a refactor that also fixes the following:

* `cudf.DataFrame.from_pandas` not preserving the `pandas.DataFrame.column.dtype`
* `cudf.DataFrame.<reduction>(axis=0)` not preserving the `.column` properties in the resulting `.index`

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- Matthew Murray (https://github.com/Matt711)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #16527
Sounds good. The dfs were:
df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})
cdf = cudf.from_pandas(df)
They do. I'm setting the seed like:
import random
random.seed(2)
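As a small illustration of what fixing the seed guarantees, the sketch below reseeds Python's `random` module and checks that the same draws come back. It only covers the standard-library generator; whether the libraries under test draw from it is not stated in the thread.

```python
import random

random.seed(2)
first = [random.randrange(2) for _ in range(5)]

random.seed(2)  # reseeding restarts the same pseudo-random sequence
second = [random.randrange(2) for _ in range(5)]

print(first == second)  # True
```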
Thanks!
xref #16507

Similar to what is done in `IntervalIndex.from_breaks`, `interval_index` generates the right edges by slicing a range of fencepost edges. However, we don't want to maintain the new `offset` (`1`) on the right edge after slicing as it adversely impacts subsequent indexing operations.

~~Additionally, I noticed that `Index(struct_data)` would automatically convert it to an `IntervalIndex`, but `IntervalIndex` has a strict requirement that the data have `left/right` keys, so making this raise a `NotImplementedError` instead~~

^ Will tackle this in a follow up, looks like there are cases where this is valid

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #16651
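The fencepost construction mentioned in this commit message can be sketched with plain Python lists. `interval_edges` is a hypothetical helper, not cudf code, and list slicing copies its data, so the column `offset` problem the message describes is not modelled here; the sketch only shows how left and right edges come from one sequence of breaks.

```python
def interval_edges(start, end, step=1):
    """Build left/right edges by slicing one list of fencepost values."""
    breaks = list(range(start, end + 1, step))  # n + 1 fenceposts for n intervals
    left = breaks[:-1]   # every fencepost except the last
    right = breaks[1:]   # every fencepost except the first
    return list(zip(left, right))


print(interval_edges(0, 3))      # [(0, 1), (1, 2), (2, 3)]

# Repeating an interval should copy both edges unchanged.
intervals = interval_edges(0, 1)
repeated = [iv for iv in intervals for _ in range(3)]
print(repeated)                  # [(0, 1), (0, 1), (0, 1)]
```

The repeated result matches the pandas output listed in the issue for `interval_range(start=0, end=1).repeat(3)`, where every right edge stays at `1`.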
Most of the differences here have been closed except for a few (which I'll open separate issues for if need be).
Describe the bug
This issue is for documenting differences found between cudf and pandas.
- `cudf.Index([True],dtype=object)` vs. `pd.Index([True],dtype=object)`: the `inferred_type` values are different. [cudf]: `string` [pandas]: `bool`
- `cudf.date_range('2011-01-01', '2011-01-02', freq='h')` vs. `pd.date_range('2011-01-01', '2011-01-02', freq='h')`: the results have different lengths. [cudf]: `24` [pandas]: `25`
- `cdf[["a","a"]].shape` vs. `df[["a","a"]].shape`, where `df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})` and `cdf = cudf.from_pandas(df)`. [cudf]: `(3,1)` [pandas]: `(3,2)`
- `cdf.agg({'a': 'unique', 'b': 'unique'}).dtype` vs. `df.aggregate({'a': 'unique', 'b': 'unique'}).dtype`. [cudf]: `ListDtype(int64)` [pandas]: `dtype('O')`
- `cudf.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)` vs. `pd.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)`. [cudf]: Starts with `'2016-01-01 01:01:00'` [pandas]: Starts with `'2016-01-03 01:01:00'`
- `cudf.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()` vs. `pd.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()`. [cudf]: `dict` [pandas]: `pd.Interval`. `dict` is the "scalar" type of a cudf struct/interval type.
- `cudf.MultiIndex.from_arrays([cudf.Index([1],name="foo"),cudf.Index([2], name="bar")])` vs. `pd.MultiIndex.from_arrays([pd.Index([1],name="foo"),pd.Index([2], name="bar")])`: the `names` attribute is empty in cudf.
- `cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype` vs. `pd.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype`. [cudf]: `dtype('int64')` [pandas]: `Int64Dtype()`
- `cudf.Series(range(2), index=["a", "b"]).rename(str.upper).index` vs. `pd.Series(range(2), index=["a", "b"]).rename(str.upper).index`. [cudf]: `Index(['a', 'b'], dtype='object')` [pandas]: `Index(['A', 'B'], dtype='object')`. `NotImplementedError`.
- `cudf.DataFrame({"A":[1,2]}).median()` vs. `pd.DataFrame({"A":[1,2]}).median()`. [cudf]: `np.float64` [pandas]: `pd.Series`
- `cudf.DataFrame({"A":[1]})**cudf.Series([0])` vs. `pd.DataFrame({"A":[1]})**pd.Series([0])`. [cudf]: `NA` [pandas]: `1.0`
- `cudf.interval_range(start=0, end=1).repeat(3)` vs. `pd.interval_range(start=0, end=1).repeat(3)`. [cudf]: `IntervalIndex([(0, 0], (0, 0], (0, 0]], dtype='interval[int64, right]')` [pandas]: `IntervalIndex([(0, 1], (0, 1], (0, 1]], dtype='interval[int64, right]')`
Steps/Code to reproduce bug
I'll add a repro for each one I find.
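As a starting point for such repros, the pandas side of a few of the listed cases can be collected in one place. This is only a sketch: `pandas_side` is a hypothetical helper, the cudf calls are left out so the snippet runs without a GPU, and the import guard is an assumption added so the function returns `None` where pandas is not installed.

```python
try:
    import pandas as pd
except ImportError:  # assumption: treat pandas as optional in this sketch
    pd = None


def pandas_side():
    """Return the pandas results for a few cases from the issue, or None."""
    if pd is None:
        return None
    df = pd.DataFrame({"a": [0, 1, 2], "b": [1, 2, 3]})
    weekly = pd.date_range("2016-01-01 01:01:00", periods=5, freq="W", tz=None)
    return {
        "hourly_len": len(pd.date_range("2011-01-01", "2011-01-02", freq="h")),
        "dup_shape": df[["a", "a"]].shape,
        "weekly_start": str(weekly[0]),
        "renamed": pd.Series(range(2), index=["a", "b"]).rename(str.upper).index.tolist(),
        "median_type": type(pd.DataFrame({"A": [1, 2]}).median()).__name__,
    }


results = pandas_side()
if results is not None:
    for name, value in results.items():
        print(f"{name}: {value!r}")
```

Comparing each entry against the corresponding cudf call, with `cudf` substituted for `pd`, would give one repro per listed difference.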
Expected behavior
It should probably match pandas.