@@ -306,20 +306,25 @@ impact by cleaning noisy samples next to the boundaries of the classes.
.. _condensed_nearest_neighbors:

- Condensed nearest neighbors and derived algorithms
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Condensed nearest neighbors
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
- iteratively decide if a sample should be removed or not
- :cite:`hart1968condensed`. The algorithm is running as followed:
+ iteratively decide if a sample should be removed
+ :cite:`hart1968condensed`. The algorithm runs as follows:

1. Get all minority samples in a set :math:`C`.
2. Add a sample from the targeted class (class to be under-sampled) in
   :math:`C` and all other samples of this class in a set :math:`S`.
- 3. Go through the set :math:`S`, sample by sample, and classify each sample
-    using a 1 nearest neighbor rule.
- 4. If the sample is misclassified, add it to :math:`C`, otherwise do nothing.
- 5. Reiterate on :math:`S` until there is no samples to be added.
+ 3. Train a 1-Nearest Neighbour on :math:`C`.
+ 4. Go through the samples in set :math:`S`, sample by sample, and classify
+    each one using the 1 nearest neighbour rule (trained in 3).
+ 5. If the sample is misclassified, add it to :math:`C`, and go to step 6.
+ 6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.
+
+ The final dataset is :math:`C`, containing all observations from the minority
+ class and those from the majority that were misclassified by the successive
+ 1-Nearest Neighbour models.
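+
+ As a companion to the description above, here is a minimal sketch of the
+ condensation loop in plain Python (an illustration with a hypothetical helper
+ name, not the actual :class:`CondensedNearestNeighbour` implementation)::
+
+     import numpy as np
+     from sklearn.neighbors import KNeighborsClassifier
+
+     def condense(X_min, X_maj):
+         # Illustrative binary case: minority labelled 0, majority labelled 1.
+         C_X = [*X_min, X_maj[0]]      # steps 1-2: all minority + one seed
+         C_y = [0] * len(X_min) + [1]
+         for x in X_maj[1:]:           # the set S, examined sample by sample
+             # step 3: (re)train a 1-NN on the current C
+             knn = KNeighborsClassifier(n_neighbors=1).fit(C_X, C_y)
+             # steps 4-5: keep the sample only if the 1-NN misclassifies it
+             if knn.predict([x])[0] != 1:
+                 C_X.append(x)
+                 C_y.append(1)
+         return np.asarray(C_X), np.asarray(C_y)
+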

The :class:`CondensedNearestNeighbour` can be used in the following manner::

@@ -329,23 +334,44 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::

    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 24), (2, 115)]

- However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
- is sensitive to noise and will add noisy samples.
+ :class:`CondensedNearestNeighbour` is sensitive to noise and may add noisy
+ samples (see the figure further below).
+
+ One Sided Selection
+ ~~~~~~~~~~~~~~~~~~~
+
+ In an attempt to remove the noisy observations introduced by
+ :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+ will first find the observations that are hard to classify, and then will use
+ :class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.
+ :class:`OneSidedSelection` runs as follows:
+
+ 1. Get all minority samples in a set :math:`C`.
+ 2. Add a sample from the targeted class (class to be under-sampled) in
+    :math:`C` and all other samples of this class in a set :math:`S`.
+ 3. Train a 1-Nearest Neighbour on :math:`C`.
+ 4. Using the 1 nearest neighbour rule trained in 3, classify all samples in
+    set :math:`S`.
+ 5. Add all misclassified samples to :math:`C`.
+ 6. Remove Tomek Links from :math:`C`.
+
+ The final dataset is :math:`C`, containing all observations from the minority
+ class, plus the majority observations that were added at random, plus all
+ those from the majority that were misclassified by the 1-Nearest Neighbour
+ model, minus the samples removed as Tomek links in step 6.
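+
+ Under the same illustrative conventions as the sketch above (hypothetical
+ helper name, minority labelled 0, majority labelled 1), the single 1-NN pass
+ followed by the Tomek-links cleaning could look like this; the actual
+ :class:`OneSidedSelection` implementation differs in its details::
+
+     import numpy as np
+     from sklearn.neighbors import KNeighborsClassifier
+     from imblearn.under_sampling import TomekLinks
+
+     def one_sided_selection(X_min, X_maj, n_seeds=1, seed=0):
+         rng = np.random.RandomState(seed)
+         picked = rng.choice(len(X_maj), size=n_seeds, replace=False)
+         C_X = np.vstack([X_min, X_maj[picked]])   # steps 1-2
+         C_y = np.hstack([np.zeros(len(X_min)), np.ones(n_seeds)])
+         S = np.delete(X_maj, picked, axis=0)
+         # step 3: a single 1-NN model, trained only once
+         knn = KNeighborsClassifier(n_neighbors=1).fit(C_X, C_y)
+         # steps 4-5: classify S in one pass, keep the misclassified samples
+         missed = S[knn.predict(S) != 1]
+         C_X = np.vstack([C_X, missed])
+         C_y = np.hstack([C_y, np.ones(len(missed))])
+         # step 6: remove the Tomek links from C
+         return TomekLinks().fit_resample(C_X, C_y)
+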

- In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
- remove noisy samples :cite:`hart1968condensed`. In addition, the 1 nearest
- neighbor rule is applied to all samples and the one which are misclassified
- will be added to the set :math:`C`. No iteration on the set :math:`S` will take
- place. The class can be used as::
+ Note that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+ does not retrain a nearest neighbour model each time a sample is misclassified:
+ it uses the 1-Nearest Neighbour from step 3 to classify all majority samples
+ in a single pass. The class can be used as::

    >>> from imblearn.under_sampling import OneSidedSelection
    >>> oss = OneSidedSelection(random_state=0)
    >>> X_resampled, y_resampled = oss.fit_resample(X, y)
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 174), (2, 4404)]

- Our implementation offer to set the number of seeds to put in the set :math:`C`
- originally by setting the parameter ``n_seeds_S``.
+ Our implementation offers the possibility to set the number of observations
+ to put at random in the set :math:`C` through the parameter ``n_seeds_S``.
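+
+ For instance, :math:`C` can be seeded with 10 majority observations drawn at
+ random (a usage sketch; the resulting class counts depend on the data)::
+
+     >>> oss = OneSidedSelection(random_state=0, n_seeds_S=10)
+     >>> X_resampled, y_resampled = oss.fit_resample(X, y)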

:class:`NeighbourhoodCleaningRule` will focus on cleaning the data rather than
condensing it :cite:`laurikkala2001improving`. Therefore, it will use the