@@ -125,22 +125,24 @@ It would also work with pandas dataframe::
>>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
>>> df_resampled.head() # doctest: +SKIP
-:class:`NearMiss` adds some heuristic rules to select samples
-:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of
-heuristic which can be selected with the parameter ``version``::
+:class:`NearMiss` undersamples data based on heuristic rules to select the
+observations :cite:`mani2003knn`. :class:`NearMiss` implements 3 different
+methods to undersample, which can be selected with the parameter ``version``::
>>> from imblearn.under_sampling import NearMiss
>>> nm1 = NearMiss(version=1)
>>> X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]
-As later stated in the next section, :class:`NearMiss` heuristic rules are
-based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
-and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin``
-from scikit-learn. The former parameter is used to compute the average distance
-to the neighbors while the latter is used for the pre-selection of the samples
-of interest.
+
+:class:`NearMiss` heuristic rules are based on the nearest neighbors algorithm.
+Therefore, the parameters ``n_neighbors`` and ``n_neighbors_ver3`` accept either
+an integer giving the size of the neighborhood to explore or an estimator
+derived from ``KNeighborsMixin`` from scikit-learn. The parameter ``n_neighbors``
+is used to compute the average distance to the neighbors, while
+``n_neighbors_ver3`` is used for the pre-selection of samples from the majority
+class (only in version 3). More details about :class:`NearMiss` are given in the
+next section.
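
As a brief illustration of these parameters, one could pass either an integer or
a pre-configured ``NearestNeighbors`` estimator. This is a minimal sketch only:
it assumes the ``X`` and ``y`` arrays from the example above, and the variable
names ``knn`` and ``nm3`` are illustrative::

    >>> from sklearn.neighbors import NearestNeighbors
    >>> from imblearn.under_sampling import NearMiss
    >>> # assumed setup: a 3-neighbors estimator used to compute the average distance
    >>> knn = NearestNeighbors(n_neighbors=3)
    >>> nm3 = NearMiss(version=3, n_neighbors=knn, n_neighbors_ver3=3)
    >>> X_resampled_nm3, y_resampled_nm3 = nm3.fit_resample(X, y)  # doctest: +SKIP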
Mathematical formulation
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -175,19 +177,16 @@ is the largest.
:scale: 60
:align: center
-In the next example, the different :class:`NearMiss` variant are applied on the
-previous toy example. It can be seen that the decision functions obtained in
+In the next example, the different :class:`NearMiss` variants are applied on the
+previous toy example. We can see that the decision functions obtained in
each case are different.
-When under-sampling a specific class, NearMiss-1 can be altered by the presence
-of noise. In fact, it will implied that samples of the targeted class will be
-selected around these samples as it is the case in the illustration below for
-the yellow class. However, in the normal case, samples next to the boundaries
-will be selected. NearMiss-2 will not have this effect since it does not focus
-on the nearest samples but rather on the farthest samples. We can imagine that
-the presence of noise can also altered the sampling mainly in the presence of
-marginal outliers. NearMiss-3 is probably the version which will be less
-affected by noise due to the first step sample selection.
+When under-sampling a specific class, NearMiss-1 can be affected by noise. In
+fact, samples of the targeted class located around observations from the minority
+class tend to be selected, as shown in the illustration below (see yellow class).
+NearMiss-2 might be less affected by noise as it does not focus on the nearest
+samples but rather on the farthest samples. NearMiss-3 is probably the version
+which will be less affected by noise due to the first step of sample selection.
.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
:target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
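
To compare how the three heuristics behave on the same data, a rough sketch is to
loop over the ``version`` parameter. This assumes the ``X`` and ``y`` arrays from
the earlier examples; the resulting class counts depend on the dataset, so no
output is shown::

    >>> from collections import Counter
    >>> from imblearn.under_sampling import NearMiss
    >>> for version in (1, 2, 3):
    ...     nm = NearMiss(version=version)
    ...     X_res, y_res = nm.fit_resample(X, y)
    ...     print(version, sorted(Counter(y_res).items()))  # doctest: +SKIP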