Commit ef2e75b

DOC improve introduction to undersampling methods (#1018)
1 parent 87ef4fc commit ef2e75b

File tree

1 file changed (+35, -11 lines)

doc/under_sampling.rst

Lines changed: 35 additions & 11 deletions
@@ -6,7 +6,25 @@ Under-sampling
 
 .. currentmodule:: imblearn.under_sampling
 
-You can refer to
+One way of handling imbalanced datasets is to reduce the number of observations from
+all classes but the minority class. The minority class is that with the least number
+of observations. The most well known algorithm in this group is random
+undersampling, where samples from the targeted classes are removed at random.
+
+But there are many other algorithms to help us reduce the number of observations in the
+dataset. These algorithms can be grouped based on their undersampling strategy into:
+
+- Prototype generation methods.
+- Prototype selection methods.
+
+And within the latter, we find:
+
+- Controlled undersampling
+- Cleaning methods
+
+We will discuss the different algorithms throughout this document.
+
+Check also
 :ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
 
 .. _cluster_centroids:
@@ -16,7 +34,7 @@ Prototype generation
 
 Given an original data set :math:`S`, prototype generation algorithms will
 generate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \not\subset
-S`. In other words, prototype generation technique will reduce the number of
+S`. In other words, prototype generation techniques will reduce the number of
 samples in the targeted classes but the remaining samples are generated --- and
 not selected --- from the original set.
 
@@ -61,16 +79,22 @@ original one.
 Prototype selection
 ===================
 
-On the contrary to prototype generation algorithms, prototype selection
-algorithms will select samples from the original set :math:`S`. Therefore,
-:math:`S'` is defined such as :math:`|S'| < |S|` and :math:`S' \subset S`.
+Prototype selection algorithms will select samples from the original set :math:`S`,
+generating a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \subset S`. In
+other words, :math:`S'` is a subset of :math:`S`.
+
+Prototype selection algorithms can be divided into two groups: (i) controlled
+under-sampling techniques and (ii) cleaning under-sampling techniques.
+
+Controlled under-sampling methods reduce the number of observations in the majority
+class or classes to an arbitrary number of samples specified by the user. Typically,
+they reduce the number of observations to the number of samples observed in the
+minority class.
 
-In addition, these algorithms can be divided into two groups: (i) the
-controlled under-sampling techniques and (ii) the cleaning under-sampling
-techniques. The first group of methods allows for an under-sampling strategy in
-which the number of samples in :math:`S'` is specified by the user. By
-contrast, cleaning under-sampling techniques do not allow this specification
-and are meant for cleaning the feature space.
+In contrast, cleaning under-sampling techniques "clean" the feature space by removing
+either "noisy" or "too easy to classify" observations, depending on the method. The
+final number of observations in each class varies with the cleaning method and can't be
+specified by the user.
 
 .. _controlled_under_sampling:
 
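Controlled under-sampling as described in this hunk — reducing every other class to the minority-class size by dropping samples at random — can be sketched with NumPy. The helper name `random_undersample` is hypothetical and for illustration only; imbalanced-learn itself provides this strategy as `RandomUnderSampler`, used via `fit_resample(X, y)`.

```python
import numpy as np

def random_undersample(X, y, rng=None):
    """Randomly drop samples from every class except the minority class
    until all classes match the minority class size (a controlled
    under-sampling strategy)."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()  # size of the minority class
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # keep n_min samples of this class, chosen at random
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# toy imbalanced dataset: 9 samples of class 0, 3 of class 1
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 9 + [1] * 3)
X_res, y_res = random_undersample(X, y, rng=0)
```

Every retained row is taken unchanged from `X`, so :math:`S' \subset S` holds, and the user-controlled target size (here, the minority count) is what distinguishes this from the cleaning methods described above.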