Skip to content

[BUG] cuDF and Pandas return different results for ... #16507

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Matt711 opened this issue Aug 7, 2024 · 5 comments
Closed

[BUG] cuDF and Pandas return different results for ... #16507

Matt711 opened this issue Aug 7, 2024 · 5 comments
Assignees
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas Python Affects Python cuDF API.

Comments

@Matt711
Copy link
Contributor

Matt711 commented Aug 7, 2024

Describe the bug
This issue is for documenting differences found between cudf and pandas.

cudf Pandas Difference Fixed
cudf.Index([True],dtype=object) pd.Index([True],dtype=object) Attribute inferred_type are different. [cudf]: string [pandas]: bool Won't Fix (see this comment)
cudf.date_range('2011-01-01', '2011-01-02', freq='h') pd.date_range('2011-01-01', '2011-01-02', freq='h') Index length are different. [cudf]: 24 [pandas]: 25 closed by #16516
cdf[["a","a"]].shape # cdf = cudf.from_pandas(df) df[["a","a"]].shape # df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]}) Shapes are different. [cudf]: (3,1) [pandas]: (3,2) closed by #16514 because duplicate column labels are generally not supported.
cdf.agg({'a': 'unique', 'b': 'unique'}).dtype df.aggregate({'a': 'unique', 'b': 'unique'}).dtype dtypes aren't the same. [cudf]: ListDtype(int64) [pandas]: dtype('O') Won't fix because the cudf result is better than the pandas one. See this comment.
cudf.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None) pd.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None) Index values are different [cudf]: Starts with '2016-01-01 01:01:00' [pandas]: Starts with '2016-01-03 01:01:00'
cudf.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min() pd.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min() Types are different [cudf]: dict [pandas]: pd.Interval Won't fix because a dict is the "scalar" type of a cudf struct/interval type
cudf.MultiIndex.from_arrays([cudf.Index([1],name="foo"),cudf.Index([2], name="bar")]) pd.MultiIndex.from_arrays([pd.Index([1],name="foo"),pd.Index([2], name="bar")]) names attribute is empty in cudf closed by #16515
cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype pd.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype dtypes are different. [cudf]: dtype('int64') [pandas]: Int64Dtype() This is should be fixed by closing #14149
cudf.Series(range(2), index=["a", "b"]).rename(str.upper).index pd.Series(range(2), index=["a", "b"]).rename(str.upper).index Index values are different [cudf]: Index(['a', 'b'], dtype='object') [pandas]: Index(['A', 'B'], dtype='object') #16525 now causes cudf to raise NotImplementedError.
cudf.DataFrame({"A":[1,2]}).median() pd.DataFrame({"A":[1,2]}).median() Types are different [cudf]: np.float64 [pandas]: pd.Series closed by #16527
cudf.DataFrame({"A":[1]})**cudf.Series([0]) pd.DataFrame({"A":[1]})**pd.Series([0]) At positional index 0, first diff: nan != 1.0 [cudf]: NA [pandas]: 1.0 Tracked in #7478
cudf.interval_range(start=0, end=1).repeat(3) pd.interval_range(start=0, end=1).repeat(3) All Index values are all different [cudf]: IntervalIndex([(0, 0], (0, 0], (0, 0]], dtype='interval[int64, right]') [pandas]: IntervalIndex([(0, 1], (0, 1], (0, 1]], dtype='interval[int64, right]') closed by #16651

Steps/Code to reproduce bug
I'll add a repro for each one I find.

Expected behavior
It should probably match pandas.

@Matt711 Matt711 added bug Something isn't working Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Aug 7, 2024
@Matt711 Matt711 self-assigned this Aug 7, 2024
@mroeschke
Copy link
Contributor

cudf.Index([True],dtype=object)

Generally for dtype=object findings, there will always be a discrepancy since in pandas dtype=object mean "as pyobject" while in cudf it means "as string".

This is tough because if the passed objects are strings, we want to accelerate the operation with cudf; otherwise, it can never be accelerated by cudf. There might need to be introspection of the first element to see whether cudf.pandas falls back or not based on whether that element is a string.

rapids-bot bot pushed a commit that referenced this issue Aug 9, 2024
#16516)

xref #16507

`date_range` generates its dates via `range`, and the end of this range was calculated via `math.ceil((end - start) / freq)`. If `(end - start) / freq` did not produce a remainder, `math.ceil` would not correctly increment this value by `1` to capture the last date.

Instead, this PR uses `math.floor((end - start) / freq) + 1` to always ensure the last date is captured

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #16516
@mroeschke
Copy link
Contributor

More comments:

cdf.agg({'a': 'unique', 'b': 'unique'}).dtype

Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in it's object data type)

cudf.Series(range(2)).sample(n=2, replace=True).index

Do your observations persist if you use a fixed random seed?

cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype

This is tracked in #14149

@mroeschke
Copy link
Contributor

cudf.DataFrame({"A":[1]})**cudf.Series([0])

This is tracked in #7478, but IMO cudf is doing the right thing here

rapids-bot bot pushed a commit that referenced this issue Aug 12, 2024
xref #16507

I would say this was a bug before because we would silently return a new DataFrame with just `len(set(column_labels))` when selecting by column. Now this operation raises since duplicate column labels are generally not supported.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - https://github.com/brandon-b-miller

URL: #16514
rapids-bot bot pushed a commit that referenced this issue Aug 14, 2024
xref #16507

Raising a `NotImplementedError` gives a chance for this work in `cudf.pandas`

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16525
rapids-bot bot pushed a commit that referenced this issue Aug 15, 2024
…ions (#16523)

xref #16507

In non pandas compat mode, I think this still makes sense to return a `dict` since that's the "scalar" type of a cudf struct/interval type, but in pandas compat mode we should match pandas and return an Interval.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16523
rapids-bot bot pushed a commit that referenced this issue Aug 16, 2024
…es (#16527)

xref #16507

This turned into a little bit of a refactor that also fixes the following:

* `cudf.DataFrame.from_pandas` not preserving the `pandas.DataFrame.column.dtype`
* `cudf.DataFrame.<reduction>(axis=0)` not preserving the `.column` properties in the resulting `.index`

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #16527
@Matt711
Copy link
Contributor Author

Matt711 commented Aug 22, 2024

More comments:

cdf.agg({'a': 'unique', 'b': 'unique'}).dtype

Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in it's object data type)

Sounds good. The dfs were.

df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})
cdf = cudf.from_pandas(df)

cudf.Series(range(2)).sample(n=2, replace=True).index

Do your observations persist if you use a fixed random seed?

They do. I'm setting the seed like

import random
random.seed(2)

cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype

This is tracked in #14149

Thanks!

rapids-bot bot pushed a commit that referenced this issue Aug 27, 2024
xref #16507

Similar to what is done in `IntervalIndex.from_breaks`, `interval_index` generates the right edges by slicing a range of fencepost edges. However, we don't want to maintain the new `offset` (`1`) on the right edge after slicing as it adversely impacts subsequent indexing operations.

~~Additionally, I noticed that `Index(struct_data)` would automatically convert it to an `IntervalIndex`, but `IntervalIndex` has a strict requirement on the data have `left/right` keys, so making this raise a `NotImplementedError` instead~~
^ Will tackle this in a follow up, looks like there are cases where this is valid

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #16651
@Matt711
Copy link
Contributor Author

Matt711 commented Dec 2, 2024

Most of the differences here have been closed except for a few (which I'll open separate issues for if need be).

@Matt711 Matt711 closed this as completed Dec 2, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in cuDF Python Dec 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas Python Affects Python cuDF API.
Projects
Archived in project
Development

No branches or pull requests

2 participants