[BUG] cuDF and Pandas return different results for ... #16507

Matt711 · 2024-08-07T16:13:56Z

Describe the bug
This issue is for documenting differences found between cudf and pandas.

cudf	Pandas	Difference	Fixed
`cudf.Index([True],dtype=object)`	`pd.Index([True],dtype=object)`	Attribute `inferred_type` are different. [cudf]: `string` [pandas]: `bool`	Won't Fix (see this comment)
`cudf.date_range('2011-01-01', '2011-01-02', freq='h')`	`pd.date_range('2011-01-01', '2011-01-02', freq='h')`	Index length are different. [cudf]: `24` [pandas]: `25`	closed by #16516
`cdf[["a","a"]].shape # cdf = cudf.from_pandas(df)`	`df[["a","a"]].shape # df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})`	Shapes are different. [cudf]: `(3,1)` [pandas]: `(3,2)`	closed by #16514 because duplicate column labels are generally not supported.
`cdf.agg({'a': 'unique', 'b': 'unique'}).dtype`	`df.aggregate({'a': 'unique', 'b': 'unique'}).dtype`	dtypes aren't the same. [cudf]: `ListDtype(int64)` [pandas]: `dtype('O')`	Won't fix because the cudf result is better than the pandas one. See this comment.
`cudf.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)`	`pd.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)`	Index values are different [cudf]: Starts with `'2016-01-01 01:01:00'` [pandas]: Starts with `'2016-01-03 01:01:00'`
`cudf.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()`	`pd.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()`	Types are different [cudf]: `dict` [pandas]: `pd.Interval`	Won't fix because a `dict` is the "scalar" type of a cudf struct/interval type
`cudf.MultiIndex.from_arrays([cudf.Index([1],name="foo"),cudf.Index([2], name="bar")])`	`pd.MultiIndex.from_arrays([pd.Index([1],name="foo"),pd.Index([2], name="bar")])`	`names` attribute is empty in cudf	closed by #16515
`cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype`	`pd.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype`	dtypes are different. [cudf]: `dtype('int64')` [pandas]: `Int64Dtype()`	This is should be fixed by closing #14149
`cudf.Series(range(2), index=["a", "b"]).rename(str.upper).index`	`pd.Series(range(2), index=["a", "b"]).rename(str.upper).index`	Index values are different [cudf]: `Index(['a', 'b'], dtype='object')` [pandas]: `Index(['A', 'B'], dtype='object')`	#16525 now causes cudf to raise `NotImplementedError`.
`cudf.DataFrame({"A":[1,2]}).median()`	`pd.DataFrame({"A":[1,2]}).median()`	Types are different [cudf]: `np.float64` [pandas]: `pd.Series`	closed by #16527
`cudf.DataFrame({"A":[1]})**cudf.Series([0])`	`pd.DataFrame({"A":[1]})**pd.Series([0])`	At positional index 0, first diff: nan != 1.0 [cudf]: `NA` [pandas]: `1.0`	Tracked in #7478
`cudf.interval_range(start=0, end=1).repeat(3)`	`pd.interval_range(start=0, end=1).repeat(3)`	All Index values are all different [cudf]: `IntervalIndex([(0, 0], (0, 0], (0, 0]], dtype='interval[int64, right]')` [pandas]: `IntervalIndex([(0, 1], (0, 1], (0, 1]], dtype='interval[int64, right]')`	closed by #16651

Steps/Code to reproduce bug
I'll add a repro for each one I find.

Expected behavior
It should probably match pandas.

The text was updated successfully, but these errors were encountered:

mroeschke · 2024-08-07T17:42:37Z

cudf.Index([True],dtype=object)

Generally for dtype=object findings, there will always be a discrepancy since in pandas dtype=object mean "as pyobject" while in cudf it means "as string".

This is tough because if the passed objects are strings, we want to accelerate the operation with cudf; otherwise, it can never be accelerated by cudf. There might need to be introspection of the first element to see whether cudf.pandas falls back or not based on whether that element is a string.

#16516) xref #16507 `date_range` generates its dates via `range`, and the end of this range was calculated via `math.ceil((end - start) / freq)`. If `(end - start) / freq` did not produce a remainder, `math.ceil` would not correctly increment this value by `1` to capture the last date. Instead, this PR uses `math.floor((end - start) / freq) + 1` to always ensure the last date is captured Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Bradley Dice (https://github.com/bdice) URL: #16516

mroeschke · 2024-08-09T20:07:32Z

More comments:

cdf.agg({'a': 'unique', 'b': 'unique'}).dtype

Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in it's object data type)

cudf.Series(range(2)).sample(n=2, replace=True).index

Do your observations persist if you use a fixed random seed?

cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype

This is tracked in #14149

xref #16507 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16515

mroeschke · 2024-08-09T23:29:57Z

cudf.DataFrame({"A":[1]})**cudf.Series([0])

This is tracked in #7478, but IMO cudf is doing the right thing here

xref #16507 I would say this was a bug before because we would silently return a new DataFrame with just `len(set(column_labels))` when selecting by column. Now this operation raises since duplicate column labels are generally not supported. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - https://github.com/brandon-b-miller URL: #16514

xref #16507 Raising a `NotImplementedError` gives a chance for this work in `cudf.pandas` Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16525

…ions (#16523) xref #16507 In non pandas compat mode, I think this still makes sense to return a `dict` since that's the "scalar" type of a cudf struct/interval type, but in pandas compat mode we should match pandas and return an Interval. Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16523

…es (#16527) xref #16507 This turned into a little bit of a refactor that also fixes the following: * `cudf.DataFrame.from_pandas` not preserving the `pandas.DataFrame.column.dtype` * `cudf.DataFrame.<reduction>(axis=0)` not preserving the `.column` properties in the resulting `.index` Authors: - Matthew Roeschke (https://github.com/mroeschke) - Matthew Murray (https://github.com/Matt711) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #16527

Matt711 · 2024-08-22T01:50:28Z

More comments:

cdf.agg({'a': 'unique', 'b': 'unique'}).dtype

Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in it's object data type)

Sounds good. The dfs were.

df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})
cdf = cudf.from_pandas(df)

cudf.Series(range(2)).sample(n=2, replace=True).index

Do your observations persist if you use a fixed random seed?

They do. I'm setting the seed like

import random
random.seed(2)

cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype

This is tracked in #14149

Thanks!

xref #16507 Similar to what is done in `IntervalIndex.from_breaks`, `interval_index` generates the right edges by slicing a range of fencepost edges. However, we don't want to maintain the new `offset` (`1`) on the right edge after slicing as it adversely impacts subsequent indexing operations. ~~Additionally, I noticed that `Index(struct_data)` would automatically convert it to an `IntervalIndex`, but `IntervalIndex` has a strict requirement on the data have `left/right` keys, so making this raise a `NotImplementedError` instead~~ ^ Will tackle this in a follow up, looks like there are cases where this is valid Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #16651

Matt711 · 2024-12-02T19:39:00Z

Most of the differences here have been closed except for a few (which I'll open separate issues for if need be).

Matt711 added bug Something isn't working Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Aug 7, 2024

Matt711 self-assigned this Aug 7, 2024

github-project-automation bot added this to cuDF Python Aug 7, 2024

github-project-automation bot moved this to Todo in cuDF Python Aug 7, 2024

This was referenced Aug 8, 2024

Disallow indexing by selecting duplicate labels #16514

Merged

Preserve array name in MultiIndex.from_arrays #16515

Merged

Fix date_range(start, end, freq) when end-start is divisible by freq #16516

Merged

Matt711 mentioned this issue Aug 9, 2024

[BUG] sort_values fails on groupby aggregation #14794

Closed

mroeschke mentioned this issue Aug 9, 2024

Return Interval object in pandas compat mode for IntervalIndex reductions #16523

Merged

3 tasks

mroeschke mentioned this issue Aug 9, 2024

Raise NotImplementedError for Series.rename that's not a scalar #16525

Merged

3 tasks

rapids-bot bot pushed a commit that referenced this issue Aug 9, 2024

Preserve array name in MultiIndex.from_arrays (#16515)

45b20d1

xref #16507 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16515

mroeschke mentioned this issue Aug 9, 2024

Fix DataFrame reductions with median returning scalar instead of Series #16527

Merged

3 tasks

mroeschke mentioned this issue Aug 23, 2024

Fix interval_range right child non-zero offset #16651

Merged

3 tasks

Matt711 closed this as completed Dec 2, 2024

github-project-automation bot moved this from Todo to Done in cuDF Python Dec 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] cuDF and Pandas return different results for ... #16507

[BUG] cuDF and Pandas return different results for ... #16507

Matt711 commented Aug 7, 2024 •

edited

Loading

mroeschke commented Aug 7, 2024

Uh oh!

mroeschke commented Aug 9, 2024

Uh oh!

mroeschke commented Aug 9, 2024

Uh oh!

Matt711 commented Aug 22, 2024

Uh oh!

Matt711 commented Dec 2, 2024

Uh oh!

[BUG] cuDF and Pandas return different results for ... #16507

[BUG] cuDF and Pandas return different results for ... #16507

Comments

Matt711 commented Aug 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

mroeschke commented Aug 7, 2024

Uh oh!

mroeschke commented Aug 9, 2024

Uh oh!

mroeschke commented Aug 9, 2024

Uh oh!

Matt711 commented Aug 22, 2024

Uh oh!

Matt711 commented Dec 2, 2024

Uh oh!

Matt711 commented Aug 7, 2024 •

edited

Loading