From 62a4713aaca00407fdd9d4c883e09fd605b9a82b Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Wed, 11 Aug 2021 14:36:31 +0200 Subject: [PATCH 1/2] update dosctrings on use of kind_sel --- .../_neighbourhood_cleaning_rule.py | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py index 702a022ac..e81270af4 100644 --- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py +++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py @@ -28,7 +28,7 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler): """Undersample based on the neighbourhood cleaning rule. - This class uses ENN and a k-NN to remove noisy samples from the datasets. + This class uses ENN and a k-NN to remove noisy samples from the dataset. Read more in the :ref:`User Guide `. @@ -40,19 +40,24 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler): If ``int``, size of the neighbourhood to consider to compute the nearest neighbors. If object, an estimator that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to - find the nearest-neighbors. By default, it will be a 3-NN. + find the nearest-neighbors. By default, it explores the 3 closest + neighbors. kind_sel : {{"all", "mode"}}, default='all' Strategy to use in order to exclude samples in the ENN sampling. - - If ``'all'``, all neighbours will have to agree with the samples of - interest to not be excluded. - - If ``'mode'``, the majority vote of the neighbours will be used in - order to exclude a sample. + - If ``'all'``, all neighbours will have to agree with a sample in order + not to be excluded. + - If ``'mode'``, the majority of the neighbours will have to agree with + a sample in order not to be excluded. The strategy `"all"` will be less conservative than `'mode'`. 
Thus, - more samples will be removed when `kind_sel="all"` generally. + more samples will be removed when `kind_sel="all"`, generally. + Note that this parameter only applies to the cleaning step of the NCL. + The ENN is done using majority vote, as described in the original article. + + #TODO: this is not the originally described threshold, fix threshold_cleaning : float, default=0.5 Threshold used to whether consider a class or not during the cleaning after applying ENN. A class will be considered during cleaning when: From 6c2e943e457d0080aeda5fb56818376d77eca715 Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Wed, 11 Aug 2021 14:56:12 +0200 Subject: [PATCH 2/2] update user guide NCL --- doc/under_sampling.rst | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index 13798ad78..bf55947f5 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -347,10 +347,29 @@ place. The class can be used as:: Our implementation offer to set the number of seeds to put in the set :math:`C` originally by setting the parameter ``n_seeds_S``. -:class:`NeighbourhoodCleaningRule` will focus on cleaning the data than -condensing them :cite:`laurikkala2001improving`. Therefore, it will used the -union of samples to be rejected between the :class:`EditedNearestNeighbours` -and the output a 3 nearest neighbors classifier. The class can be used as:: +:class:`NeighbourhoodCleaningRule` focuses more on cleaning the data than on +reducing the number of samples :cite:`laurikkala2001improving`. It extends +:class:`EditedNearestNeighbours` in that it further eliminates samples from +the majority classes if they are among the 3 closest neighbours of a sample from +the minority class and the majority, or all, of those neighbours disagree with the +minority class. The procedure for the :class:`NeighbourhoodCleaningRule` is as follows: + +1. 
Split the dataset into the class of interest C (minority) and the rest of the data O. +2. Identify noisy data A1 in O with the edited nearest neighbor rule. +3. For each (majority) class in O, if one of its observations is among the 3 closest +neighbors of a minority sample where all or most of those neighbors are not minority, +add the observation to group A2. +4. Reduce the original data T to S = T - (A1 union A2). + +The first step is an ENN, where most of the neighbours need to disagree with a sample +from the majority to remove it. The second step is a cleaning step that further removes samples +from the majority classes. To carry out the cleaning step there is one condition: +it only cleans samples from classes that contain a minimum number of observations. +The minimum number is regulated by the ``threshold_cleaning`` parameter. In the original +article :cite:`laurikkala2001improving`, samples would be removed if the class had at +least half as many observations as the minority class. + +The class can be used as:: >>> from imblearn.under_sampling import NeighbourhoodCleaningRule >>> ncr = NeighbourhoodCleaningRule()
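The four-step procedure described in the user-guide text of this patch can be sketched in plain Python. This is an illustration only and not the imblearn implementation. The function names `nearest` and `ncl_sketch` are hypothetical, the neighbour search is brute force, and both steps use a majority vote. The class-size condition is assumed to follow the user-guide sentence: a class is cleaned when it has at least `threshold_cleaning` times as many samples as the minority class.

```python
from collections import Counter
import math


def nearest(points, i, k=3):
    """Indices of the k points closest to points[i] (brute force, Euclidean)."""
    others = [j for j in range(len(points)) if j != i]
    # sorted() is stable, so equal distances keep their index order.
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def ncl_sketch(X, y, k=3, threshold_cleaning=0.5):
    """Return the indices kept after an NCL-style cleaning (hypothetical helper)."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)

    # Step 1: split into the class of interest C and the rest O.
    C = [i for i, label in enumerate(y) if label == minority]
    O = [i for i, label in enumerate(y) if label != minority]

    # Step 2: ENN on O. A sample is noisy (A1) when the majority vote of its
    # k neighbours disagrees with its own label. Vote ties go to the label
    # that Counter saw first.
    A1 = set()
    for i in O:
        votes = Counter(y[j] for j in nearest(X, i, k))
        if votes.most_common(1)[0][0] != y[i]:
            A1.add(i)

    # Step 3: only classes that are large enough may be cleaned (assumption:
    # size >= threshold_cleaning * minority size). For every minority sample
    # that its neighbours out-vote, collect those neighbours from such classes.
    cleanable = {
        label
        for label, n in counts.items()
        if label != minority and n >= threshold_cleaning * counts[minority]
    }
    A2 = set()
    for i in C:
        neighbours = nearest(X, i, k)
        votes = Counter(y[j] for j in neighbours)
        if votes.most_common(1)[0][0] != minority:
            A2.update(j for j in neighbours if y[j] in cleanable)

    # Step 4: S = T - (A1 union A2).
    removed = A1 | A2
    return [i for i in range(len(y)) if i not in removed]


# Toy 1-D data: a majority cluster near 0-5, a minority cluster near 20-22,
# one majority point inside the minority cluster (index 9) and one minority
# point inside the majority cluster (index 10).
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,),
     (20.0,), (21.0,), (22.0,), (21.4,), (2.4,)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
print(ncl_sketch(X, y))  # [0, 4, 5, 6, 7, 8, 10]
```

In this toy run, index 9 is dropped by the ENN step and indices 1, 2 and 3 are dropped by the cleaning step because they surround the minority point at index 10. Raising `threshold_cleaning` above the majority-to-minority size ratio disables the cleaning step, so only the ENN removal remains.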