Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

Merged 43 commits on Dec 1, 2017
Changes from 14 commits
Commits
da94fdb
Support merging frames on a combo of columns and index levels (GH 14355)
Jul 19, 2017
f8c8c53
Cleanup for review
Sep 10, 2017
368844a
revert implementation (but keep documentation and tests)
Sep 11, 2017
1c4699e
Simplify and refactor column/level logic in merge
Sep 11, 2017
ac1189b
PEP8 cleanup
Sep 11, 2017
d90ed78
Extract column/level ambiguity warning logic into utility method
Sep 11, 2017
27b2d25
Add newline and add :ref: entry for new doc section
Sep 11, 2017
de6f4b1
docstring / comment cleanup
Sep 11, 2017
39d0bba
Merge branch 'master' into enh_14355
Oct 2, 2017
5b1b100
Documentation updates
Oct 9, 2017
dfc6cf7
Fix errors in _drop_columns_or_levels
Oct 9, 2017
03e3c2e
Refactor and parametrize test cases
Oct 9, 2017
bf5d349
Moved label/level helpers up to NDFrame, added axis support, and adde…
Oct 10, 2017
7da39aa
PEP8
Oct 11, 2017
f5a16ff
Revert accidental change to merging.rst
Oct 12, 2017
aa099ea
Use fixtures for new TestMergeColumnAndIndex tests
Oct 13, 2017
3be43a4
Merge branch 'master' into enh_14355
Oct 13, 2017
b655e30
Merge branch 'master' into enh_14355
Oct 20, 2017
b7e2cc2
Merge branch 'master' into enh_14355
Nov 1, 2017
e9f02b1
Update documentation for a 0.22 release
Nov 1, 2017
0cd4ef5
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 2, 2017
e029f7b
Documentation updates
Nov 6, 2017
fdddbd3
Moved test_label_or_level_utils to pandas/tests/generic
Nov 6, 2017
89061b9
Refactored level_or_level test cases to use fixtures
Nov 6, 2017
090b3e8
Moved label_or_level utils on Series and DataFrame to NDFrame
Nov 6, 2017
47ff8b8
fix test comment typo
Nov 6, 2017
59f2dce
PEP8ify
Nov 6, 2017
4c4dbd0
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 6, 2017
1d7e570
Moved column and index tests to new file
Nov 6, 2017
dd289a6
Remove test class and convert to using fixtures
Nov 6, 2017
313d2c3
Rename new test file
Nov 6, 2017
0b0397b
Documentation and testing review updates
Nov 7, 2017
bc53bef
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 7, 2017
cd17c42
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 23, 2017
1a4e3e4
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 23, 2017
a49012c
Fix generator/list lint issues
Nov 23, 2017
6fd9760
Allow non-None hashable objects to reference index levels (not just s…
Nov 26, 2017
f7e04f5
Reduce parameterized test case count by removing how fixture
Nov 26, 2017
cf8e654
Refactor warning code and add stacklevel
Nov 26, 2017
e874f04
Use single backticks to reference method params in docstrings
Nov 26, 2017
13ce87c
Add tests and docstring updates for using index levels as `on` param …
Nov 26, 2017
b5cb4c1
PEP8
Nov 26, 2017
f3b95fe
Fixed Note->Notes in docstring
Dec 1, 2017
66 changes: 59 additions & 7 deletions doc/source/merging.rst
@@ -504,7 +504,7 @@ the data in DataFrame.
See the :ref:`cookbook<cookbook.merge>` for some advanced strategies.

Users who are familiar with SQL but new to pandas might be interested in a
:ref:`comparison with SQL<compare_with_sql.join>`.
:ref:`comparison with SQL<F>`.
Member

What does "F" refer to?

Contributor Author

Oh, nothing. Looks like I added that error back when I was adding the :ref: entry for my new section. It will be reverted in my next push. Thanks!


pandas provides a single function, ``merge``, as the entry point for all
standard database join operations between DataFrame objects:
@@ -518,14 +518,16 @@ standard database join operations between DataFrame objects:

- ``left``: A DataFrame object
- ``right``: Another DataFrame object
- ``on``: Columns (names) to join on. Must be found in both the left and
right DataFrame objects. If not passed and ``left_index`` and
- ``on``: Column or index level names to join on. Must be found in both the left
and right DataFrame objects. If not passed and ``left_index`` and
Contributor

Can you add a comment (here and for left_on/right_on) that index level merging is new in 0.22.0 (or maybe in a Note section below)?

``right_index`` are ``False``, the intersection of the columns in the
DataFrames will be inferred to be the join keys
- ``left_on``: Columns from the left DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame
- ``right_on``: Columns from the right DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame
- ``left_on``: Columns or index levels from the left DataFrame to use as
keys. Can either be column names, index level names, or arrays with length
equal to the length of the DataFrame
- ``right_on``: Columns or index levels from the right DataFrame to use as
keys. Can either be column names, index level names, or arrays with length
equal to the length of the DataFrame
- ``left_index``: If ``True``, use the index (row labels) from the left
DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex
(hierarchical), the number of levels must match the number of join keys
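
As a rough illustration of the updated ``left_on``/``right_on`` descriptions above, the sketch below joins an index level in one frame against a column in the other. The frames and the names `key1`, `k`, `A`, and `C` are made up for this example, and it assumes a pandas version that includes this change.

```python
import pandas as pd

# Hypothetical frames: 'key1' is an index level on the left,
# while the matching keys live in the column 'k' on the right.
left = pd.DataFrame({'A': [1, 2, 3]},
                    index=pd.Index(['K0', 'K1', 'K2'], name='key1'))
right = pd.DataFrame({'k': ['K0', 'K2'], 'C': [10, 30]})

# left_on names an index level, right_on names a column.
result = left.merge(right, left_on='key1', right_on='k')

# Only the keys present in both frames survive the default inner join.
matched_a = list(result['A'])
matched_c = list(result['C'])
```

The sketch deliberately inspects only the value columns; what the resulting index looks like when a level is matched against a column is not covered by the text above.
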
@@ -1125,6 +1127,56 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
labels=['left', 'right'], vertical=False);
plt.close('all');

.. _merging.merge_on_columns_and_levels:

Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.21
Contributor

need a blank line here

add a :ref: entry before the sub-section

Contributor Author

done


Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
may refer to either column names or index level names. This enables merging
``DataFrame`` instances on a combination of index levels and columns without
resetting indexes.

.. ipython:: python

   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'key2': ['K0', 'K1', 'K0', 'K1']},
                       index=left_index)

   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3'],
                         'key2': ['K0', 'K0', 'K0', 'K1']},
                        index=right_index)

   result = left.merge(right, on=['key1', 'key2'])

.. ipython:: python
   :suppress:

   @savefig merge_on_index_and_column.png
   p.plot([left, right], result,
          labels=['left', 'right'], vertical=False);
   plt.close('all');

.. note::

   When DataFrames are merged on a string that matches an index level in both
   frames, the index level is preserved as an index level in the resulting
   DataFrame.
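
The note above can be checked with a small sketch. The two-row frames here are invented for illustration, and the behaviour shown assumes a pandas version that includes this change.

```python
import pandas as pd

# Hypothetical frames: 'key1' is an index level in both, 'key2' a column in both.
left = pd.DataFrame({'A': ['A0', 'A1'], 'key2': ['K0', 'K1']},
                    index=pd.Index(['K0', 'K1'], name='key1'))
right = pd.DataFrame({'C': ['C0', 'C1'], 'key2': ['K0', 'K1']},
                     index=pd.Index(['K0', 'K1'], name='key1'))

result = left.merge(right, on=['key1', 'key2'])

# 'key1' was an index level in both inputs, so it should still be one;
# 'key2' was a column in both inputs and should stay a column.
index_names = list(result.index.names)
has_key2_column = 'key2' in result.columns
```
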

.. note::

   If a string matches both a column name and an index level name, then a
   warning is issued and the column takes precedence. This will result in an
   ambiguity error in a future version.
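
A minimal sketch of the ambiguous case described in the note above. The frames are invented for illustration. Because the note says the warning is meant to become an error, the sketch accepts either outcome rather than assuming a particular pandas version.

```python
import warnings

import pandas as pd

# Hypothetical frame in which 'key1' names both a column and the index.
ambiguous = pd.DataFrame({'key1': ['K0', 'K1'], 'A': [1, 2]},
                         index=pd.Index(['K0', 'K1'], name='key1'))
other = pd.DataFrame({'key1': ['K0', 'K1'], 'B': [3, 4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    try:
        ambiguous.merge(other, on='key1')
        # Behaviour at the time of this change: a FutureWarning is issued
        # and the column takes precedence over the index level.
        warned = any(issubclass(w.category, FutureWarning) for w in caught)
        outcome = 'warned' if warned else 'silent'
    except ValueError:
        # Behaviour announced for a future version: an ambiguity error.
        outcome = 'raised'
```

Renaming either the column or the index level before merging avoids the ambiguity altogether.
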

Overlapping value columns
~~~~~~~~~~~~~~~~~~~~~~~~~

33 changes: 33 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -165,6 +165,37 @@ and new ``CategoricalDtype``.

See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.


.. _whatsnew_0210.enhancements.merge_on_columns_and_levels:

Merging on a combination of columns and index levels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Strings passed to :meth:`DataFrame.merge` as the ``on``, ``left_on``, and ``right_on``
parameters may now refer to either column names or index level names. This enables
merging ``DataFrame`` instances on a combination of index levels and columns
without resetting indexes. See the :ref:`Merge on columns and levels
<merging.merge_on_columns_and_levels>` documentation section.

.. ipython:: python

   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'key2': ['K0', 'K1', 'K0', 'K1']},
                       index=left_index)

   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3'],
                         'key2': ['K0', 'K0', 'K0', 'K1']},
                        index=right_index)

   left.merge(right, on=['key1', 'key2'])
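
To make "without resetting indexes" concrete, the sketch below compares the direct merge with the older workaround of resetting both indexes, merging on plain columns, and restoring the index afterwards. It reuses the frames from the example above; the variable names `direct` and `via_reset` are made up here, and the comparison assumes a pandas version that includes this change.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key2': ['K0', 'K1', 'K0', 'K1']},
                    index=pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1'))

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'key2': ['K0', 'K0', 'K0', 'K1']},
                     index=pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1'))

# New style: 'key1' is resolved as an index level, 'key2' as a column.
direct = left.merge(right, on=['key1', 'key2'])

# Workaround: turn the index into a column, merge, then restore the index.
via_reset = (left.reset_index()
                 .merge(right.reset_index(), on=['key1', 'key2'])
                 .set_index('key1'))

same_values = direct.to_dict('list') == via_reset.to_dict('list')
same_index = list(direct.index) == list(via_reset.index)
```
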


.. _whatsnew_0210.enhancements.other:

Other Enhancements
@@ -187,6 +218,8 @@ Other Enhancements
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. (:issue:`15838`, :issue:`17438`)
- :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`)
- Read/write methods that infer compression (:func:`read_csv`, :func:`read_table`, :func:`read_pickle`, and :meth:`~DataFrame.to_pickle`) can now infer from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`DataFrame.merge` now accepts index level names as `on`, `left_on`, and `right_on` parameters, allowing frames to be merged on a combination of columns and index levels (:issue:`14355`)
Contributor

We need a sub-section for this; it's a pretty big change. Give the example from the whatsnew.

Contributor Author

I added a new subsection. Should I remove this one?

- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)
- :func:`Styler.where` has been implemented. It is as a convenience for :func:`Styler.applymap` and enables simple DataFrame styling on the Jupyter notebook (:issue:`17474`).
172 changes: 163 additions & 9 deletions pandas/core/frame.py
@@ -68,7 +68,7 @@
standardize_mapping)
from pandas.core.generic import NDFrame, _shared_docs
from pandas.core.index import (Index, MultiIndex, _ensure_index,
_ensure_index_from_sequences)
_ensure_index_from_sequences, RangeIndex)
from pandas.core.indexing import (maybe_droplevels, convert_to_index_sliceable,
check_bool_indexer)
from pandas.core.internals import (BlockManager,
@@ -139,16 +139,17 @@
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys
on : label or list
Field names to join on. Must be found in both DataFrames. If on is
None and not merging on indexes, then it merges on the intersection of
the columns by default.
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults to
Member

Can you put single backticks around 'on'? (That is typically done for parameters of the function.)

the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of
vectors of the length of the DataFrame to use a particular vector as
the join key instead of columns
Column or index level names to join on in the left DataFrame. Can also
be a vector or list of vectors of the length of the left DataFrame.
Member

Since you are changing this anyway, can you replace 'vector' with 'array'?

These vectors are treated as though they are columns.
Member

I think 'as if' is easier language than 'as though' for non-native speakers

right_on : label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per
left_on docs
Column or index level names to join on in the right DataFrame. Can also
be a vector or list of vectors of the length of the right DataFrame.
These vectors are treated as though they are columns.
left_index : boolean, default False
Contributor

same as above, maybe in a Note section

Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
@@ -2160,6 +2161,159 @@ def _getitem_frame(self, key):
raise ValueError('Must pass DataFrame with boolean values only')
return self.where(key)

    # -------------------------------------------------------------------------
    # Label or Level Combination Helpers

    @Appender(_shared_docs['_is_level_reference'])
    def _is_level_reference(self, key, axis=0):
        axis = self._get_axis_number(axis)
        if axis == 0:
            return (isinstance(key, compat.string_types) and
                    key not in self.columns and
                    key in self.index.names)
        elif axis == 1:
            return (isinstance(key, compat.string_types) and
                    key not in self.index and
                    key in self.columns.names)

    @Appender(_shared_docs['_is_label_reference'])
    def _is_label_reference(self, key, axis=0):
        axis = self._get_axis_number(axis)
        if axis == 0:
            return (isinstance(key, compat.string_types) and
                    key in self.columns)
        elif axis == 1:
            return (isinstance(key, compat.string_types) and
                    key in self.index)
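
The two predicates above can be approximated with public pandas API only. The sketch below is a hypothetical helper (`classify_key` is not part of pandas) restricted to `axis=0`. It mirrors the string-only check in this revision of the diff and the rule that a column label shadows an index level of the same name.

```python
import pandas as pd


def classify_key(df, key):
    """Hypothetical helper: report how `key` would resolve along axis 0."""
    if not isinstance(key, str):
        # This revision of the diff only treats strings as references.
        return 'neither'
    if key in df.columns:
        # A column label wins over an index level with the same name.
        return 'label'
    if key in df.index.names:
        return 'level'
    return 'neither'


df = pd.DataFrame({'A': [1, 2], 'key2': ['K0', 'K1']},
                  index=pd.Index(['K0', 'K1'], name='key1'))

kinds = {key: classify_key(df, key) for key in ['key1', 'key2', 'missing']}
```
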

    @Appender(_shared_docs['_check_label_or_level_ambiguity'])
    def _check_label_or_level_ambiguity(self, key, axis=0):

        axis = self._get_axis_number(axis)

        def raise_warning():

            # Build an informative and grammatical warning
            level_article, level_type = (('an', 'index')
                                         if axis == 0 else
                                         ('a', 'column'))

            label_article, label_type = (('a', 'column')
                                         if axis == 0 else
                                         ('an', 'index'))

            warnings.warn(
                ("'{key}' is both {level_article} {level_type} level and "
                 "{label_article} {label_type} label.\n"
                 "Defaulting to {label_type}, but this will raise an "
                 "ambiguity error in a future version"
                 ).format(key=key,
                          level_article=level_article,
                          level_type=level_type,
                          label_article=label_article,
                          label_type=label_type), FutureWarning)

        if axis == 0:
            if (isinstance(key, compat.string_types) and
                    key in self.columns and
                    key in self.index.names):

                raise_warning()
                return True
            else:
                return False
        else:
            if (isinstance(key, compat.string_types) and
                    key in self.index and
                    key in self.columns.names):

                raise_warning()
                return True
            else:
                return False
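
To see what text the warning above produces for each axis, the formatting can be lifted into a stand-alone function. `ambiguity_message` is a hypothetical name used only for this sketch; the template string is copied from the diff.

```python
def ambiguity_message(key, axis=0):
    """Hypothetical stand-alone copy of the message formatting in the diff."""
    if axis == 0:
        level_article, level_type = 'an', 'index'
        label_article, label_type = 'a', 'column'
    else:
        level_article, level_type = 'a', 'column'
        label_article, label_type = 'an', 'index'

    return ("'{key}' is both {level_article} {level_type} level and "
            "{label_article} {label_type} label.\n"
            "Defaulting to {label_type}, but this will raise an "
            "ambiguity error in a future version"
            ).format(key=key,
                     level_article=level_article,
                     level_type=level_type,
                     label_article=label_article,
                     label_type=label_type)


first_line_axis0 = ambiguity_message('key1', axis=0).splitlines()[0]
first_line_axis1 = ambiguity_message('key1', axis=1).splitlines()[0]
```
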

    @Appender(_shared_docs['_get_label_or_level_values'])
    def _get_label_or_level_values(self, key, axis=0):
        axis = self._get_axis_number(axis)
        if axis == 0:
            if key in self:
                self._check_label_or_level_ambiguity(key, axis=axis)
                values = self[key]._values
            elif self._is_level_reference(key, axis=axis):
                values = self.index.get_level_values(key)._values
            else:
                raise KeyError(key)
        else:
            if key in self.index:
                self._check_label_or_level_ambiguity(key, axis=axis)
                values = self.loc[key]._values
            elif self._is_level_reference(key, axis=axis):
                values = self.columns.get_level_values(key)._values
            else:
                raise KeyError(key)

        # Check for duplicates
        if values.ndim > 1:
            label_axis_name = 'column' if axis == 0 else 'index'
            raise ValueError(("The {label_axis_name} label '{key}' "
                              "is not unique")
                             .format(key=key,
                                     label_axis_name=label_axis_name))

        return values
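
The lookup order above (column first, then index level, then `KeyError`) and the duplicate-label check can be sketched with public API. `label_or_level_values` is a hypothetical helper limited to `axis=0`; it uses `to_numpy()` in place of the private `_values` attribute, which assumes pandas 0.24 or later, and it omits the ambiguity warning.

```python
import pandas as pd


def label_or_level_values(df, key):
    """Hypothetical axis-0 helper built only on public pandas API."""
    if key in df.columns:
        values = df[key].to_numpy()
    elif key in df.index.names:
        values = df.index.get_level_values(key).to_numpy()
    else:
        raise KeyError(key)

    if values.ndim > 1:
        # A duplicated column label selects a 2-D block, not a single vector.
        raise ValueError("The column label '{}' is not unique".format(key))
    return values


df = pd.DataFrame({'A': [1, 2], 'key2': ['K0', 'K1']},
                  index=pd.Index(['X0', 'X1'], name='key1'))

from_column = list(label_or_level_values(df, 'key2'))
from_level = list(label_or_level_values(df, 'key1'))
```
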

    @Appender(_shared_docs['_drop_labels_or_levels'])
    def _drop_labels_or_levels(self, keys, axis=0):
        axis = self._get_axis_number(axis)
        keys = com._maybe_make_list(keys)

        # Validate keys
        invalid_keys = [k for k in keys if not
                        self._is_label_or_level_reference(k, axis=axis)]

        if invalid_keys:
            raise ValueError(("The following keys are not valid labels or "
                              "levels for {axis}: {invalid_keys}")
                             .format(axis=axis,
                                     invalid_keys=invalid_keys))

        # Compute levels and labels to drop
        levels_to_drop = [k for k in keys
                          if self._is_level_reference(k, axis=axis)]

        labels_to_drop = [k for k in keys
                          if not self._is_level_reference(k, axis=axis)]

        # Perform copy upfront and then use inplace operations below.
        # This ensures that we always perform exactly one copy.
        # ``copy`` and/or ``inplace`` options could be added in the future.
        dropped = self.copy()

        if axis == 0:
            # Handle dropping index levels
            if levels_to_drop:
                dropped.reset_index(levels_to_drop, drop=True, inplace=True)

            # Handle dropping columns labels
            if labels_to_drop:
                dropped.drop(labels_to_drop, axis=1, inplace=True)
        else:
            # Handle dropping column levels
            if levels_to_drop:
                if isinstance(dropped.columns, MultiIndex):
                    # Drop the specified levels from the MultiIndex
                    dropped.columns = dropped.columns.droplevel(levels_to_drop)
                else:
                    # Drop the last level of Index by replacing with
                    # a RangeIndex
                    dropped.columns = RangeIndex(dropped.columns.size)

            # Handle dropping index labels
            if labels_to_drop:
                dropped.drop(labels_to_drop, axis=0, inplace=True)

        return dropped
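
For `axis=0`, the dropping logic above amounts to a `reset_index(..., drop=True)` for level names followed by a column drop for labels. The sketch below is a hypothetical public-API approximation (`drop_labels_or_levels` is not a pandas function). Unlike the diff, it relies on the non-inplace methods returning new objects rather than copying up front.

```python
import pandas as pd


def drop_labels_or_levels(df, keys):
    """Hypothetical axis-0 helper: drop `keys` whether they name
    columns or index levels."""
    keys = [keys] if isinstance(keys, str) else list(keys)

    labels = [k for k in keys if k in df.columns]
    levels = [k for k in keys
              if k not in df.columns and k in df.index.names]

    invalid = [k for k in keys if k not in labels and k not in levels]
    if invalid:
        raise ValueError("The following keys are not valid labels or "
                         "levels: {}".format(invalid))

    out = df
    if levels:
        # drop=True discards the level values instead of adding columns.
        out = out.reset_index(level=levels, drop=True)
    if labels:
        out = out.drop(columns=labels)
    return out


idx = pd.MultiIndex.from_tuples([('K0', 'X'), ('K1', 'Y')],
                                names=['key1', 'key2'])
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=idx)

trimmed = drop_labels_or_levels(df, ['key1', 'A'])
```
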

def query(self, expr, inplace=False, **kwargs):
"""Query the columns of a frame with a boolean expression.
