
Commit bcb675e

solegalli and glemaitre authored
DOC improve the documentation of CNN and OSS (#1025)
Co-authored-by: Guillaume Lemaitre <[email protected]>
1 parent 1fb69ca commit bcb675e


doc/under_sampling.rst

Lines changed: 43 additions & 17 deletions
@@ -306,20 +306,25 @@ impact by cleaning noisy samples next to the boundaries of the classes.
 
 .. _condensed_nearest_neighbors:
 
-Condensed nearest neighbors and derived algorithms
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Condensed nearest neighbors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 :class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
-iteratively decide if a sample should be removed or not
-:cite:`hart1968condensed`. The algorithm is running as followed:
+iteratively decide if a sample should be removed
+:cite:`hart1968condensed`. The algorithm runs as follows:
 
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
    :math:`C` and all other samples of this class in a set :math:`S`.
-3. Go through the set :math:`S`, sample by sample, and classify each sample
-   using a 1 nearest neighbor rule.
-4. If the sample is misclassified, add it to :math:`C`, otherwise do nothing.
-5. Reiterate on :math:`S` until there is no samples to be added.
+3. Train a 1-Nearest Neighbour on :math:`C`.
+4. Go through the samples in set :math:`S`, sample by sample, and classify each one
+   using the 1 nearest neighbor rule (trained in 3).
+5. If the sample is misclassified, add it to :math:`C`, and go to step 6.
+6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.
+
+The final dataset contains all observations from the minority class and
+those from the majority that were misclassified by the successive
+1-Nearest Neighbour algorithms.
 
 The :class:`CondensedNearestNeighbour` can be used in the following manner::
 
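The steps added in the hunk above can be sketched with a brute-force 1-NN in plain NumPy. This is an illustrative toy, not the :class:`CondensedNearestNeighbour` implementation; the function name and the Euclidean 1-NN are assumptions made here for the sketch:

```python
import numpy as np

def condensed_nearest_neighbour(X, y, minority, seed=0):
    """Hart's CNN rule sketch: keep the minority class plus the majority
    samples that a 1-NN trained on the kept set misclassifies."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y != minority)
    C = [int(i) for i in np.flatnonzero(y == minority)]  # step 1: all minority samples
    first = int(rng.choice(maj_idx))                     # step 2: one random majority sample
    C.append(first)
    S = [int(i) for i in maj_idx if i != first]
    for i in S:                                          # steps 3-6: classify S one by one
        pool = np.array(C)                               # the 1-NN "retrains" as C grows
        nearest = int(pool[np.argmin(np.linalg.norm(X[pool] - X[i], axis=1))])
        if y[nearest] != y[i]:                           # misclassified -> add to C
            C.append(i)
    return np.array(sorted(C))

# toy data: class 0 is the minority
X = np.array([[0.0], [0.1], [5.0], [5.1], [5.2], [9.0]])
y = np.array([0, 0, 1, 1, 1, 1])
kept = condensed_nearest_neighbour(X, y, minority=0)
```

Because `C` is re-used as the training pool at every step, each misclassified majority sample immediately influences the next classification, which is the "successive 1-Nearest Neighbour algorithms" behaviour described above.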
@@ -329,23 +334,44 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::
   >>> print(sorted(Counter(y_resampled).items()))
   [(0, 64), (1, 24), (2, 115)]
 
-However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
-is sensitive to noise and will add noisy samples.
+:class:`CondensedNearestNeighbour` is sensitive to noise and may add noisy samples
+(see figure later on).
+
+One Sided Selection
+~~~~~~~~~~~~~~~~~~~
+
+In an attempt to remove the noisy observations introduced by
+:class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+will first find the observations that are hard to classify, and then will use
+:class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.
+:class:`OneSidedSelection` runs as follows:
+
+1. Get all minority samples in a set :math:`C`.
+2. Add a sample from the targeted class (class to be under-sampled) in
+   :math:`C` and all other samples of this class in a set :math:`S`.
+3. Train a 1-Nearest Neighbors on :math:`C`.
+4. Using the 1 nearest neighbor rule trained in 3, classify all samples in
+   set :math:`S`.
+5. Add all misclassified samples to :math:`C`.
+6. Remove Tomek Links from :math:`C`.
+
+The final dataset contains all observations from the minority class,
+plus the observations from the majority that were added at random, plus all
+those from the majority that were misclassified by the 1-Nearest Neighbors algorithm.
 
-In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
-remove noisy samples :cite:`hart1968condensed`. In addition, the 1 nearest
-neighbor rule is applied to all samples and the one which are misclassified
-will be added to the set :math:`C`. No iteration on the set :math:`S` will take
-place. The class can be used as::
+Note that differently from :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+does not retrain a nearest-neighbors model after each sample is misclassified. It uses the
+1-Nearest Neighbors from step 3 to classify all samples from the majority in one pass.
+The class can be used as::
 
   >>> from imblearn.under_sampling import OneSidedSelection
   >>> oss = OneSidedSelection(random_state=0)
   >>> X_resampled, y_resampled = oss.fit_resample(X, y)
   >>> print(sorted(Counter(y_resampled).items()))
   [(0, 64), (1, 174), (2, 4404)]
 
-Our implementation offer to set the number of seeds to put in the set :math:`C`
-originally by setting the parameter ``n_seeds_S``.
+Our implementation offers the possibility to set the number of observations
+to put at random in the set :math:`C` through the parameter ``n_seeds_S``.
 
 :class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
 condensing them :cite:`laurikkala2001improving`. Therefore, it will used the
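The one-pass variant added in this hunk can be sketched the same way. Again an illustrative toy under stated assumptions (a fixed 1-NN pool for the single classification pass, and a mutual-nearest-neighbor test for Tomek links), not the :class:`OneSidedSelection` implementation:

```python
import numpy as np

def nn_index(X, pool, x):
    # index, taken from ``pool``, of the 1-nearest neighbour of point ``x``
    pool = np.asarray(pool)
    return int(pool[np.argmin(np.linalg.norm(X[pool] - x, axis=1))])

def one_sided_selection(X, y, minority, seed=0):
    rng = np.random.default_rng(seed)
    maj = [int(i) for i in np.flatnonzero(y != minority)]
    first = int(rng.choice(maj))
    C = [int(i) for i in np.flatnonzero(y == minority)] + [first]  # steps 1-2
    trained = list(C)                      # step 3: the 1-NN is fit only once
    for i in maj:                          # steps 4-5: a single pass over S
        if i != first and y[nn_index(X, trained, X[i])] != y[i]:
            C.append(i)
    kept = []                              # step 6: drop Tomek-linked majority samples
    for i in sorted(C):
        j = nn_index(X, [k for k in C if k != i], X[i])
        mutual = nn_index(X, [k for k in C if k != j], X[j]) == i
        if y[i] != minority and y[j] != y[i] and mutual:
            continue                       # i is the majority side of a Tomek link
        kept.append(i)
    return np.array(kept)

# toy data: class 0 is the minority
X = np.array([[0.0], [0.2], [4.0], [4.2], [4.4], [8.0]])
y = np.array([0, 0, 1, 1, 1, 1])
kept = one_sided_selection(X, y, minority=0)
```

The contrast with the previous sketch is the `trained` pool: it is frozen after step 3, so misclassified samples enlarge `C` without changing later predictions, matching the "one pass" note in the text.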
