
Commit 1be039e: "final edits" (1 parent: dae3e2e)

2 files changed: +60 additions, -55 deletions

doc/under_sampling.rst

Lines changed: 32 additions & 28 deletions
@@ -198,7 +198,7 @@ Cleaning under-sampling techniques
 ----------------------------------
 
 Cleaning under-sampling techniques do not allow to specify the number of
-samples to have in each class. In fact, each algorithm implements an heuristic
+samples to have in each class. In fact, each algorithm implement an heuristic
 which will clean the dataset.
 
 .. _tomek_links:
@@ -237,20 +237,18 @@ figure illustrates this behaviour.
 
 .. _edited_nearest_neighbors:
 
-Edited data set using nearest neighbours
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Edited data set using nearest neighbors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-:class:`EditedNearestNeighbours` trains a nearest-neighbors algorithm and
-then looks at the closest neighbours of each data point of the class to be
+:class:`EditedNearestNeighbours` trains a nearest neighbors algorithm and
+then looks at the closest neighbors of each data point of the class to be
 under-sampled, and "edits" the dataset by removing samples which do not agree
 "enough" with their neighborhood :cite:`wilson1972asymptotic`. In short,
-a KNN algorithm is trained on the data. Then, for each sample in the class
-to be under-sampled, the (K-1) nearest-neighbours are identified. Note that
-if a 4-KNN algorithm is trained, only 3 neighbours will be examined, because
-the sample being inspected is the fourth neighbour returned by the algorithm.
-Once the neighbours are identified, if all the neighbours or most of the
-neighbours agree with the class of the sample being inspected, the sample is
-kept, otherwise removed. Check the selection criteria below::
+a nearest neighbors algorithm algorithm is trained on the data. Then, for each
+sample in the class to be under-sampled, the nearest neighbors are identified.
+Once the neighbors are identified, if all the neighbors or most of the neighbors
+agree with the class of the sample being inspected, the sample is kept, otherwise
+removed::
 
     >>> sorted(Counter(y).items())
    [(0, 64), (1, 262), (2, 4674)]
@@ -261,8 +259,8 @@ kept, otherwise removed. Check the selection criteria below::
     [(0, 64), (1, 213), (2, 4568)]
 
 Two selection criteria are currently available: (i) the majority (i.e.,
-``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
-nearest-neighbors must belong to the same class than the sample inspected to
+``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) of the
+nearest neighbors must belong to the same class than the sample inspected to
 keep it in the dataset. This means that `kind_sel='all'` will be less
 conservative than `kind_sel='mode'`, and more samples will be excluded::
 
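As background for the ``kind_sel`` wording touched above, the two selection criteria can be sketched in plain Python. This is an illustrative stand-in, not imbalanced-learn's implementation; ``keep_sample`` and its arguments are hypothetical names:

```python
from collections import Counter

def keep_sample(sample_class, neighbor_classes, kind_sel="all"):
    """Return True if the inspected sample survives the ENN edit."""
    if kind_sel == "all":
        # every neighbor must share the sample's class
        return all(c == sample_class for c in neighbor_classes)
    if kind_sel == "mode":
        # the majority class among the neighbors must match
        majority = Counter(neighbor_classes).most_common(1)[0][0]
        return majority == sample_class
    raise ValueError("kind_sel must be 'all' or 'mode'")

# 'all' excludes more samples: a single disagreeing neighbor is enough
print(keep_sample(1, [1, 1, 2], kind_sel="all"))   # False
print(keep_sample(1, [1, 1, 2], kind_sel="mode"))  # True
```

The same sample can thus be dropped under ``'all'`` but kept under ``'mode'``, which matches the documented claim that ``'all'`` is the less conservative strategy.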
@@ -277,11 +275,12 @@ conservative than `kind_sel='mode'`, and more samples will be excluded::
 
 The parameter ``n_neighbors`` can take a classifier subclassed from
 ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
-Alternatively, an integer can be passed to indicate the size of the
-neighborhood to examine to make a decision. Note that if ``n_neighbors=3``
-this means that the edited nearest neighbors will look at the 3 closest
-neighbours of each sample, thus a 4-KNN algorithm will be trained
-on the data.
+Note that if a 4-KNN classifier is passed, 3 neighbors will be
+examined for the selection criteria, because the sample being inspected
+is the fourth neighbor returned by the algorithm. Alternatively, an integer
+can be passed to ``n_neighbors`` to indicate the size of the neighborhood
+to examine to make a decision. Thus, if ``n_neighbors=3`` the edited nearest
+neighbors will look at the 3 closest neighbors of each sample.
 
 :class:`RepeatedEditedNearestNeighbours` extends
 :class:`EditedNearestNeighbours` by repeating the algorithm multiple times
@@ -295,16 +294,21 @@ through the parameter ``max_iter``::
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 64), (1, 208), (2, 4551)]
 
+Note that :class:`RepeatedEditedNearestNeighbours` will end before reaching
+``max_iter`` if no more samples are removed from the data, or one of the
+majority classes ends up disappearing or with less samples than the minority
+after being "edited".
+
 :class:`AllKNN` extends :class:`EditedNearestNeighbours` by repeating
-the algorithm multiple times, each time with an additional neighbour
+the algorithm multiple times, each time with an additional neighbor
 :cite:`tomek1976experiment`. In other words, :class:`AllKNN` differs
 from :class:`RepeatedEditedNearestNeighbours` in that the number of
 neighbors of the internal nearest neighbors algorithm increases at
 each iteration. In short, in the first iteration, a 2-KNN algorithm
-is trained on the data to examine the 1 closest neighbour of each
+is trained on the data to examine the 1 closest neighbor of each
 sample from the class to be under-sampled. In each subsequent
-iteration, the neighbourhood examined is increased by 1, until the
-number of neighbours to examine indicated in the parameter ``n_neighbors``::
+iteration, the neighborhood examined is increased by 1, until the
+number of neighbors indicated in the parameter ``n_neighbors``::
 
     >>> from imblearn.under_sampling import AllKNN
     >>> allknn = AllKNN()
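The increasing-neighborhood schedule described in the :class:`AllKNN` paragraph above can be sketched as follows; ``all_knn_schedule`` is a hypothetical helper illustrating the iteration plan, not the library's internal code:

```python
def all_knn_schedule(n_neighbors=3):
    """Yield (neighbors_examined, knn_size) for each AllKNN iteration.

    Iteration 1 examines the single closest neighbor, so a 2-KNN is
    trained (the inspected sample is its own first neighbor); each
    later iteration grows the neighborhood by 1, up to ``n_neighbors``.
    """
    for k in range(1, n_neighbors + 1):
        # a (k + 1)-KNN is trained to obtain k usable neighbors
        yield k, k + 1

print(list(all_knn_schedule(3)))  # [(1, 2), (2, 3), (3, 4)]
```

With ``n_neighbors=3`` this reproduces the schedule described above: 1 neighbor in the first round, 2 in the second, and 3 in the final round.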
@@ -314,16 +318,16 @@ number of neighbours to examine indicated in the parameter ``n_neighbors``::
 
 
 The parameter ``n_neighbors`` can take an integer to indicate the size
-of the neighborhood to examine to make a decision in the last iteration.
-Thus, if ``n_neighbors=3``, AlKNN will examine the 1 closest neighbour
-in the first iteration, the 2 closest neighbours in the second iteration
+of the neighborhood to examine in the last iteration. Thus, if
+``n_neighbors=3``, AlKNN will examine the 1 closest neighbor in the
+first iteration, the 2 closest neighbors in the second iteration
 and the 3 closest neighbors in the third iteration. The parameter
 ``n_neighbors`` can also take a classifier subclassed from
 ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
 Again, this will be the KNN used in the last iteration.
 
-In the example below, it can be seen that the three algorithms have similar
-impact by cleaning noisy samples next to the boundaries of the classes.
+In the example below, we can see that the three algorithms have a similar
+impact on cleaning noisy samples at the boundaries of the classes.
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
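The early-stopping behaviour noted for :class:`RepeatedEditedNearestNeighbours` in the documentation changes above can be illustrated with a small plain-Python loop. ``repeated_enn``, ``enn_round``, and ``toy_round`` are hypothetical names, and this is only a sketch of the stopping logic under those assumptions, not the library code:

```python
from collections import Counter

def repeated_enn(class_counts, enn_round, minority_class, max_iter=100):
    """Repeat a one-pass ENN edit with two early-stop conditions.

    Stops before ``max_iter`` when (i) a pass removes no samples, or
    (ii) a majority class disappears or falls below the minority size.
    """
    counts = Counter(class_counts)
    minority_size = counts[minority_class]
    for _ in range(max_iter):
        new_counts = Counter(enn_round(counts))
        if new_counts == counts:
            break  # no more samples removed
        counts = new_counts
        if any(counts[c] < minority_size
               for c in counts if c != minority_class):
            break  # a majority class vanished or undershot the minority
    return counts

# toy one-pass edit: always trims 2300 samples from class 2
def toy_round(counts):
    edited = Counter(counts)
    edited[2] = max(edited[2] - 2300, 0)
    return edited

print(repeated_enn({0: 64, 1: 262, 2: 4674}, toy_round, minority_class=0))
```

Starting from the class counts used in the doctests above, the loop halts as soon as class 2 drops below the minority class 0, well before ``max_iter``.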

imblearn/under_sampling/_prototype_selection/_edited_nearest_neighbours.py

Lines changed: 28 additions & 27 deletions
@@ -1,4 +1,4 @@
-"""Classes to perform under-sampling based on the edited nearest neighbour
+"""Classes to perform under-sampling based on the edited nearest neighbor
 method."""
 
 # Authors: Guillaume Lemaitre <[email protected]>
@@ -27,7 +27,7 @@
     n_jobs=_n_jobs_docstring,
 )
 class EditedNearestNeighbours(BaseCleaningSampler):
-    """Undersample based on the edited nearest neighbour method.
+    """Undersample based on the edited nearest neighbor method.
 
     This method will clean the data set by removing samples close to the
     decision boundary.
@@ -39,17 +39,17 @@ class EditedNearestNeighbours(BaseCleaningSampler):
     {sampling_strategy}
 
     n_neighbors : int or object, default=3
-        If ``int``, size of the neighbourhood to consider to compute the
-        nearest neighbours. If object, an estimator that inherits from
+        If ``int``, size of the neighborhood to consider to compute the
+        nearest neighbors. If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the nearest-neighbours.
+        find the nearest-neighbors.
 
     kind_sel : {{'all', 'mode'}}, default='all'
         Strategy to use in order to exclude samples.
 
-        - If ``'all'``, all neighbours will have to agree with a sample in order
+        - If ``'all'``, all neighbors will have to agree with a sample in order
           not to be excluded.
-        - If ``'mode'``, the majority of the neighbours will have to agree with
+        - If ``'mode'``, the majority of the neighbors will have to agree with
           a sample in order not to be excluded.
 
         The strategy `"all"` will be less conservative than `'mode'`. Thus,
@@ -70,7 +70,7 @@ class EditedNearestNeighbours(BaseCleaningSampler):
 
     RepeatedEditedNearestNeighbours : Undersample by repeating ENN algorithm.
 
-    AllKNN : Undersample using ENN and various number of neighbours.
+    AllKNN : Undersample using ENN and various number of neighbors.
 
     Notes
     -----
@@ -172,7 +172,7 @@ def _more_tags(self):
     n_jobs=_n_jobs_docstring,
 )
 class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
-    """Undersample based on the repeated edited nearest neighbour method.
+    """Undersample based on the repeated edited nearest neighbor method.
 
     This method will repeat the ENN algorithm several times. The repetitions
     will stop when i) the maximum number of iterations is reached, or ii) no
@@ -187,20 +187,20 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
     {sampling_strategy}
 
     n_neighbors : int or object, default=3
-        If ``int``, size of the neighbourhood to consider to compute the
-        nearest neighbours. If object, an estimator that inherits from
+        If ``int``, size of the neighborhood to consider to compute the
+        nearest neighbors. If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the nearest-neighbours.
+        find the nearest-neighbors.
 
     max_iter : int, default=100
-        Maximum number of repetitions of the edited nearest neighbours algorithm.
+        Maximum number of repetitions of the edited nearest neighbors algorithm.
 
     kind_sel : {{'all', 'mode'}}, default='all'
         Strategy to use in order to exclude samples.
 
-        - If ``'all'``, all neighbours will have to agree with a sample in order
+        - If ``'all'``, all neighbors will have to agree with a sample in order
           not to be excluded.
-        - If ``'mode'``, the majority of the neighbours will have to agree with
+        - If ``'mode'``, the majority of the neighbors will have to agree with
           a sample in order not to be excluded.
 
         The strategy `"all"` will be less conservative than `'mode'`. Thus,
@@ -226,7 +226,7 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
 
     EditedNearestNeighbours : Undersample by editing samples.
 
-    AllKNN : Undersample using ENN and various number of neighbours.
+    AllKNN : Undersample using ENN and various number of neighbors.
 
     Notes
     -----
@@ -364,8 +364,8 @@ class AllKNN(BaseCleaningSampler):
     """Undersample based on the AllKNN method.
 
     This method will apply ENN several times, starting by looking at the
-    1 closest neighbour, and increasing the number of nearest neighbours
-    by 1 at each round, up to the number of neighbours specified in
+    1 closest neighbor, and increasing the number of nearest neighbors
+    by 1 at each round, up to the number of neighbors specified in
     `n_neighbors`.
 
     The repetitions will stop when i) one of the majority classes
@@ -379,23 +379,24 @@ class AllKNN(BaseCleaningSampler):
     {sampling_strategy}
 
     n_neighbors : int or estimator object, default=3
-        If ``int``, the maximum size of the neighbourhood to evaluate.
-        The method will start by looking at the 1 closest neighbour, and
-        then repeat the edited nearest neighbours increasing
-        the neighbourhood by 1, until examining a neighbourhood of
+        If ``int``, the maximum size of the the neighborhood to evaluate.
+        The method will start by looking at the 1 closest neighbor, and
+        then repeat the edited nearest neighbors increasing
+        the neighborhood by 1, until examining a neighborhood of
         `n_neighbors` in the final iteration.
+
         If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the nearest-neighbours in the final round. In this case,
-        AllKNN will repeat edited nearest neighbours starting from a 2-KNN
+        find the nearest-neighbors in the final round. In this case,
+        AllKNN will repeat edited nearest neighbors starting from a 2-KNN
         up to the specified KNN in the object.
 
     kind_sel : {{'all', 'mode'}}, default='all'
         Strategy to use in order to exclude samples.
 
-        - If ``'all'``, all neighbours will have to agree with a sample in order
+        - If ``'all'``, all neighbors will have to agree with a sample in order
           not to be excluded.
-        - If ``'mode'``, the majority of the neighbours will have to agree with
+        - If ``'mode'``, the majority of the neighbors will have to agree with
           a sample in order not to be excluded.
 
         The strategy `"all"` will be less conservative than `'mode'`. Thus,
@@ -434,7 +435,7 @@ class without early stopping.
     References
     ----------
     .. [1] I. Tomek, "An Experiment with the Edited Nearest-Neighbor
-       Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
+       Rule", IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
        pp. 448-452, June 1976.
 
     Examples