@@ -6,7 +6,25 @@ Under-sampling

.. currentmodule:: imblearn.under_sampling

- You can refer to
+ One way of handling imbalanced datasets is to reduce the number of observations from
+ all classes but the minority class. The minority class is the one with the fewest
+ observations. The best-known algorithm in this group is random
+ undersampling, where samples from the targeted classes are removed at random.
+
+ But there are many other algorithms to help us reduce the number of observations in the
+ dataset. These algorithms can be grouped based on their undersampling strategy into:
+
+ - Prototype generation methods.
+ - Prototype selection methods.
+
+ And within the latter, we find:
+
+ - Controlled undersampling
+ - Cleaning methods
+
+ We will discuss the different algorithms throughout this document.
+
+ Check also
:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.

.. _cluster_centroids:
@@ -16,7 +34,7 @@ Prototype generation

Given an original data set :math:`S`, prototype generation algorithms will
generate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \not \subset
- S`. In other words, prototype generation technique will reduce the number of
+ S`. In other words, prototype generation techniques will reduce the number of
samples in the targeted classes but the remaining samples are generated --- and
not selected --- from the original set.

@@ -61,16 +79,22 @@ original one.
Prototype selection
===================

- On the contrary to prototype generation algorithms, prototype selection
- algorithms will select samples from the original set :math:`S`. Therefore,
- :math:`S'` is defined such as :math:`|S'| < |S|` and :math:`S' \subset S`.
+ Prototype selection algorithms select samples from the original set :math:`S`,
+ producing a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \subset S`. In
+ other words, :math:`S'` is a proper subset of :math:`S`.
+
+ Prototype selection algorithms can be divided into two groups: (i) controlled
+ under-sampling techniques and (ii) cleaning under-sampling techniques.
+
+ Controlled under-sampling methods reduce the number of observations in the majority
+ class or classes to an arbitrary number of samples specified by the user. Typically,
+ they reduce the number of observations to the number of samples observed in the
+ minority class.
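
To make the controlled case concrete, here is a minimal sketch of random undersampling down to the minority-class size. It uses only the Python standard library and is not imbalanced-learn's implementation; the helper name ``random_undersample`` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Hypothetical helper: keep as many samples per class as the minority class has."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())  # size of the minority class
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw n_min indices without replacement from this class.
        keep.extend(rng.sample(idx, n_min))
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 9 samples of class 0 (majority) and 3 of class 1 (minority).
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

Every retained sample comes from the original data, so the result is a subset of the input, and the final class sizes are fixed in advance by the chosen target (here, the minority-class count).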

- In addition, these algorithms can be divided into two groups: (i) the
- controlled under-sampling techniques and (ii) the cleaning under-sampling
- techniques. The first group of methods allows for an under-sampling strategy in
- which the number of samples in :math:`S'` is specified by the user. By
- contrast, cleaning under-sampling techniques do not allow this specification
- and are meant for cleaning the feature space.
+ In contrast, cleaning under-sampling techniques "clean" the feature space by removing
+ either "noisy" or "too easy to classify" observations, depending on the method. The
+ final number of observations in each class varies with the cleaning method and cannot be
+ specified by the user.
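
The following toy sketch illustrates why the final class sizes cannot be chosen by the user in a cleaning method. It drops samples of a target class whose nearest neighbour carries a different label, a simplified rule loosely in the spirit of nearest-neighbour cleaning and not any specific imbalanced-learn sampler. The function name ``clean_noisy``, the one-dimensional data, and the distance rule are all assumptions made for illustration.

```python
def clean_noisy(X, y, target):
    """Hypothetical helper: drop `target` samples whose nearest neighbour disagrees."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == target:
            # Index of the nearest other sample (absolute distance on 1-D data).
            j = min(
                (k for k in range(len(X)) if k != i),
                key=lambda k: abs(X[k] - xi),
            )
            if y[j] != yi:
                continue  # treated as "noisy": removed
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: class 0 is the majority; the point at 10.0 sits among class-1 points.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 9.0, 10.0, 10.4, 11.0]
y = [0, 0, 0, 0, 0, 1, 0, 1, 1]

X_res, y_res = clean_noisy(X, y, target=0)
print(len(X_res), y_res)
```

How many samples are removed depends entirely on where the points lie, which is why no target count appears among the arguments.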


.. _controlled_under_sampling:
