
Commit d52ec25

update docs nearmiss
1 parent 438c19c commit d52ec25

File tree: 2 files changed (+40, -37 lines)


doc/under_sampling.rst

Lines changed: 19 additions & 20 deletions
@@ -125,22 +125,24 @@ It would also work with pandas dataframe::
     >>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
     >>> df_resampled.head()  # doctest: +SKIP
 
-:class:`NearMiss` adds some heuristic rules to select samples
-:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of
-heuristic which can be selected with the parameter ``version``::
+:class:`NearMiss` undersamples data based on heuristic rules to select the
+observations :cite:`mani2003knn`. :class:`NearMiss` implements 3 different
+methods to undersample, which can be selected with the parameter ``version``::
 
     >>> from imblearn.under_sampling import NearMiss
     >>> nm1 = NearMiss(version=1)
     >>> X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 64), (1, 64), (2, 64)]
 
-As later stated in the next section, :class:`NearMiss` heuristic rules are
-based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
-and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin``
-from scikit-learn. The former parameter is used to compute the average distance
-to the neighbors while the latter is used for the pre-selection of the samples
-of interest.
+
+:class:`NearMiss` heuristic rules are based on the nearest neighbors algorithm.
+Therefore, the parameters ``n_neighbors`` and ``n_neighbors_ver3`` accept either
+integers with the size of the neighbourhood to explore or a classifier derived
+from the ``KNeighborsMixin`` from scikit-learn. The parameter ``n_neighbors`` is
+used to compute the average distance to the neighbors while ``n_neighbors_ver3``
+is used for the pre-selection of the samples from the majority class, only in
+version 3. More details about NearMiss are given in the next section.
 
 Mathematical formulation
 ^^^^^^^^^^^^^^^^^^^^^^^^
@@ -175,19 +177,16 @@ is the largest.
    :scale: 60
    :align: center
 
-In the next example, the different :class:`NearMiss` variant are applied on the
-previous toy example. It can be seen that the decision functions obtained in
+In the next example, the different :class:`NearMiss` variants are applied on the
+previous toy example. We can see that the decision functions obtained in
 each case are different.
 
-When under-sampling a specific class, NearMiss-1 can be altered by the presence
-of noise. In fact, it will implied that samples of the targeted class will be
-selected around these samples as it is the case in the illustration below for
-the yellow class. However, in the normal case, samples next to the boundaries
-will be selected. NearMiss-2 will not have this effect since it does not focus
-on the nearest samples but rather on the farthest samples. We can imagine that
-the presence of noise can also altered the sampling mainly in the presence of
-marginal outliers. NearMiss-3 is probably the version which will be less
-affected by noise due to the first step sample selection.
+When under-sampling a specific class, NearMiss-1 can be affected by noise. In
+fact, samples of the targeted class located around observations from the minority
+class tend to be selected, as shown in the illustration below (see yellow class).
+NearMiss-2 might be less affected by noise as it does not focus on the nearest
+samples but rather on the farthest samples. NearMiss-3 is probably the version
+which will be less affected by noise due to the first step of sample selection.
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html

imblearn/under_sampling/_prototype_selection/_nearmiss.py

Lines changed: 21 additions & 17 deletions
@@ -36,20 +36,24 @@ class NearMiss(BaseUnderSampler):
 
     n_neighbors : int or estimator object, default=3
         If ``int``, size of the neighbourhood to consider to compute the
-        average distance to the minority point samples. If object, an
-        estimator that inherits from
-        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the k_neighbors.
-        By default, it will be a 3-NN.
+        average distance to the minority samples. If object, an estimator
+        that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin`
+        that will be used to find the k_neighbors. By default, it considers
+        the 3 closest neighbours.
 
     n_neighbors_ver3 : int or estimator object, default=3
-        If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This
-        parameter correspond to the number of neighbours selected create the
-        subset in which the selection will be performed. If object, an
-        estimator that inherits from
+        NearMiss version 3 starts by a phase of under-sampling where it selects
+        those observations from the majority class that are closest neighbors
+        to the minority class.
+
+        If ``int``, indicates the number of neighbours to be selected in the
+        first step, i.e. the subset in which the selection will be performed.
+        If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the k_neighbors.
-        By default, it will be a 3-NN.
+        find the k_neighbors. By default, the 3 closest neighbours to the
+        minority observations will be selected.
+
+        Only used in version 3.
 
     {n_jobs}
 
@@ -75,7 +79,7 @@ class NearMiss(BaseUnderSampler):
     References
     ----------
     .. [1] I. Mani, I. Zhang. "kNN approach to unbalanced data distributions:
-           a case study involving information extraction," In Proceedings of
+           a case study involving information extraction", in Proceedings of
            workshop on learning from imbalanced datasets, 2003.
 
     Examples
@@ -125,15 +129,15 @@ def _selection_dist_based(
            Associated label to X.
 
        dist_vec : ndarray, shape (n_samples, )
-            The distance matrix to the nearest neigbour.
+            The distance matrix to the nearest neighbor.
 
        num_samples: int
            The desired number of samples to select.
 
        key : str or int,
            The target class.
 
-        sel_strategy : str, optional (default='nearest')
+        sel_strategy : str, default='nearest'
            Strategy to select the samples. Either 'nearest' or 'farthest'
 
        Returns
@@ -169,13 +173,13 @@ def _selection_dist_based(
            reverse=sort_way,
        )
 
-        # Throw a warning to tell the user that we did not have enough samples
-        # to select and that we just select everything
+        # Raise a warning to tell the user that there were not enough samples
+        # to select from and thus, that all samples will be selected
        if len(sorted_idx) < num_samples:
            warnings.warn(
                "The number of the samples to be selected is larger"
                " than the number of samples available. The"
-                " balancing ratio cannot be ensure and all samples"
+                " balancing ratio cannot be ensured and all samples"
                " will be returned."
            )
 
