ENH: Added DataFrame.compare and Series.compare (GH30429) #30852
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5755,6 +5755,116 @@ def _construct_result(self, result) -> "DataFrame": | |
out.index = self.index | ||
return out | ||
|
||
@Appender( | ||
|
||
""" | ||
Returns | ||
------- | ||
DataFrame | ||
DataFrame that shows the differences stacked side by side. | ||
|
||
|
||
The resulting index will be a MultiIndex with 'self' and 'other' | ||
stacked alternately at the inner level. | ||
|
||
See Also | ||
-------- | ||
Series.compare : Compare with another Series and show differences. | ||
|
||
Notes | ||
----- | ||
Matching NaNs will not appear as a difference. | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame( | ||
... { | ||
... "col1": ["a", "a", "b", "b", "a"], | ||
... "col2": [1.0, 2.0, 3.0, np.nan, 5.0], | ||
... "col3": [1.0, 2.0, 3.0, 4.0, 5.0] | ||
... }, | ||
... columns=["col1", "col2", "col3"], | ||
... ) | ||
>>> df | ||
col1 col2 col3 | ||
0 a 1.0 1.0 | ||
1 a 2.0 2.0 | ||
2 b 3.0 3.0 | ||
3 b NaN 4.0 | ||
4 a 5.0 5.0 | ||
|
||
>>> df2 = df.copy() | ||
>>> df2.loc[0, 'col1'] = 'c' | ||
>>> df2.loc[2, 'col3'] = 4.0 | ||
>>> df2 | ||
col1 col2 col3 | ||
0 c 1.0 1.0 | ||
1 a 2.0 2.0 | ||
2 b 3.0 4.0 | ||
3 b NaN 4.0 | ||
4 a 5.0 5.0 | ||
|
||
Align the differences on columns | ||
|
||
>>> df.compare(df2) | ||
col1 col3 | ||
self other self other | ||
|
||
0 a c NaN NaN | ||
2 NaN NaN 3.0 4.0 | ||
|
||
Stack the differences on rows | ||
|
||
>>> df.compare(df2, align_axis=0) | ||
col1 col3 | ||
0 self a NaN | ||
other c NaN | ||
2 self NaN 3.0 | ||
other NaN 4.0 | ||
|
||
Keep the equal values | ||
|
||
>>> df.compare(df2, keep_equal=True) | ||
col1 col3 | ||
self other self other | ||
0 a c 1.0 1.0 | ||
2 b b 3.0 4.0 | ||
|
||
Keep all original rows and columns | ||
|
||
>>> df.compare(df2, keep_shape=True) | ||
col1 col2 col3 | ||
self other self other self other | ||
0 a c NaN NaN NaN NaN | ||
1 NaN NaN NaN NaN NaN NaN | ||
2 NaN NaN NaN NaN 3.0 4.0 | ||
3 NaN NaN NaN NaN NaN NaN | ||
4 NaN NaN NaN NaN NaN NaN | ||
|
||
Keep all original rows and columns and also all original values | ||
|
||
>>> df.compare(df2, keep_shape=True, keep_equal=True) | ||
Review comment: Does keep_equal make sense without keep_shape=True? In other words, does it stand on its own? Can you add an example that uses it by itself?

Reply: Yes, I would say so. It helps identify which side could be the "anomaly", and I have come across such use cases myself. For example, say I'm looking at the dates on which a company publishes two types of its quarterly report.

>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=['Q1_filing', 'Q2_filing', 'Q3_filing', 'Q4_filing'], index=list(range(2010, 2020)))
>>> df1['Q1_filing'] = 'Mar'
>>> df1['Q2_filing'] = 'Jun'
>>> df1['Q3_filing'] = 'Sep'
>>> df1['Q4_filing'] = 'Dec'
>>> df1
     Q1_filing Q2_filing Q3_filing Q4_filing
2010       Mar       Jun       Sep       Dec
2011       Mar       Jun       Sep       Dec
2012       Mar       Jun       Sep       Dec
2013       Mar       Jun       Sep       Dec
2014       Mar       Jun       Sep       Dec
2015       Mar       Jun       Sep       Dec
2016       Mar       Jun       Sep       Dec
2017       Mar       Jun       Sep       Dec
2018       Mar       Jun       Sep       Dec
2019       Mar       Jun       Sep       Dec
>>> df2 = df1.copy()
>>> df2.loc[2015, 'Q1_filing'] = 'Apr'
>>> df2.loc[2016, 'Q2_filing'] = 'Jul'

By comparing the two, I can see that the discrepancies are in 2015 and 2016, but I don't know which frame deviated from the norm.

>>> df1.compare(df2)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       NaN   NaN
2016       NaN   NaN       Jun   Jul

The natural next step is to look at 2015 Q2_filing and 2016 Q1_filing, where the two frames agree. (You can of course look at the whole thing, but sometimes the data is too big and I just want to look at the relevant parts first.)

>>> df1.compare(df2, keep_equal=True)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       Jun   Jun
2016       Mar   Mar       Jun   Jul

With this result I know that something is probably off in the second type of report.

Reply: Have added the example in frame.py.
||
col1 col2 col3 | ||
self other self other self other | ||
0 a c 1.0 1.0 1.0 1.0 | ||
1 a a 2.0 2.0 2.0 2.0 | ||
2 b b 3.0 3.0 3.0 4.0 | ||
3 b b NaN NaN 4.0 4.0 | ||
4 a a 5.0 5.0 5.0 5.0 | ||
""" | ||
) | ||
@Appender(_shared_docs["compare"] % _shared_doc_kwargs) | ||
def compare( | ||
self, | ||
other: "DataFrame", | ||
align_axis: Axis = 1, | ||
keep_shape: bool = False, | ||
keep_equal: bool = False, | ||
) -> "DataFrame": | ||
return super().compare( | ||
other=other, | ||
align_axis=align_axis, | ||
keep_shape=keep_shape, | ||
keep_equal=keep_equal, | ||
) | ||
|
||
def combine( | ||
self, other: "DataFrame", func, fill_value=None, overwrite=True | ||
) -> "DataFrame": | ||
|
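The docstring states two behaviours that are easy to check in isolation: matching NaNs are not reported as differences, and the default result carries 'self' and 'other' interleaved at the inner column level. The sketch below is one way this could be reproduced with existing pandas calls. `compare_sketch` is a hypothetical helper written for illustration, not the implementation added in this PR, and it assumes both frames have identical row and column labels.

```python
import numpy as np
import pandas as pd


def compare_sketch(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the default (align_axis=1) output."""
    # A cell counts as different unless the values are equal
    # or both values are missing.
    both_missing = left.isna() & right.isna()
    mask = ~((left == right) | both_missing)

    # Keep only the rows and columns that contain at least one difference.
    rows = mask.any(axis=1).to_numpy()
    cols = mask.any(axis=0).to_numpy()
    mask = mask.loc[rows, cols]

    # Blank out equal cells with NaN, as in the keep_equal=False examples.
    self_part = left.loc[rows, cols].where(mask)
    other_part = right.loc[rows, cols].where(mask)

    # Put 'self'/'other' at the inner level and interleave them per column.
    out = pd.concat([self_part, other_part], axis=1, keys=["self", "other"])
    out = out.swaplevel(axis=1)
    order = pd.MultiIndex.from_product([self_part.columns, ["self", "other"]])
    return out.reindex(columns=order)


df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

result = compare_sketch(df, df2)
print(result)
```

With the frames from the docstring, only rows 0 and 2 and columns col1 and col3 survive; col2 is dropped even though row 3 holds NaN on both sides. On a pandas version that includes this PR, the output can be checked against `df.compare(df2)`.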