
Commit d52ec25

update docs nearmiss
1 parent 438c19c commit d52ec25

File tree: 2 files changed (+40, -37 lines)


doc/under_sampling.rst

Lines changed: 19 additions & 20 deletions
@@ -125,22 +125,24 @@ It would also work with pandas dataframe::
     >>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
     >>> df_resampled.head()  # doctest: +SKIP
 
-:class:`NearMiss` adds some heuristic rules to select samples
-:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of
-heuristic which can be selected with the parameter ``version``::
+:class:`NearMiss` undersamples data based on heuristic rules to select the
+observations :cite:`mani2003knn`. :class:`NearMiss` implements 3 different
+methods to undersample, which can be selected with the parameter ``version``::
 
     >>> from imblearn.under_sampling import NearMiss
     >>> nm1 = NearMiss(version=1)
     >>> X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 64), (1, 64), (2, 64)]
 
-As later stated in the next section, :class:`NearMiss` heuristic rules are
-based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``
-and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin``
-from scikit-learn. The former parameter is used to compute the average distance
-to the neighbors while the latter is used for the pre-selection of the samples
-of interest.
+
+:class:`NearMiss` heuristic rules are based on the nearest neighbors algorithm.
+Therefore, the parameters ``n_neighbors`` and ``n_neighbors_ver3`` accept either
+integers with the size of the neighbourhood to explore or a classifier derived
+from the ``KNeighborsMixin`` from scikit-learn. The parameter ``n_neighbors`` is
+used to compute the average distance to the neighbors while ``n_neighbors_ver3``
+is used for the pre-selection of the samples from the majority class, only in
+version 3. More details about NearMiss are given in the next section.
 
 Mathematical formulation
 ^^^^^^^^^^^^^^^^^^^^^^^^
@@ -175,19 +177,16 @@ is the largest.
    :scale: 60
    :align: center
 
-In the next example, the different :class:`NearMiss` variant are applied on the
-previous toy example. It can be seen that the decision functions obtained in
+In the next example, the different :class:`NearMiss` variants are applied on the
+previous toy example. We can see that the decision functions obtained in
 each case are different.
 
-When under-sampling a specific class, NearMiss-1 can be altered by the presence
-of noise. In fact, it will implied that samples of the targeted class will be
-selected around these samples as it is the case in the illustration below for
-the yellow class. However, in the normal case, samples next to the boundaries
-will be selected. NearMiss-2 will not have this effect since it does not focus
-on the nearest samples but rather on the farthest samples. We can imagine that
-the presence of noise can also altered the sampling mainly in the presence of
-marginal outliers. NearMiss-3 is probably the version which will be less
-affected by noise due to the first step sample selection.
+When under-sampling a specific class, NearMiss-1 can be affected by noise. In
+fact, samples of the targeted class located around observations from the minority
+class tend to be selected, as shown in the illustration below (see yellow class).
+NearMiss-2 might be less affected by noise as it does not focus on the nearest
+samples but rather on the farthest samples. NearMiss-3 is probably the version
+which will be less affected by noise due to the first step of sample selection.
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html

imblearn/under_sampling/_prototype_selection/_nearmiss.py

Lines changed: 21 additions & 17 deletions
@@ -36,20 +36,24 @@ class NearMiss(BaseUnderSampler):
 
     n_neighbors : int or estimator object, default=3
         If ``int``, size of the neighbourhood to consider to compute the
-        average distance to the minority point samples. If object, an
-        estimator that inherits from
-        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the k_neighbors.
-        By default, it will be a 3-NN.
+        average distance to the minority samples. If object, an estimator
+        that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin`
+        that will be used to find the k_neighbors. By default, it considers
+        the 3 closest neighbours.
 
     n_neighbors_ver3 : int or estimator object, default=3
-        If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This
-        parameter correspond to the number of neighbours selected create the
-        subset in which the selection will be performed. If object, an
-        estimator that inherits from
+        NearMiss version 3 starts by a phase of under-sampling where it selects
+        those observations from the majority class that are closest neighbors
+        to the minority class.
+
+        If ``int``, indicates the number of neighbours to be selected in the
+        first step, i.e. the subset in which the selection will be performed.
+        If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the k_neighbors.
-        By default, it will be a 3-NN.
+        find the k_neighbors. By default, the 3 closest neighbours to the
+        minority observations will be selected.
+
+        Only used in version 3.
 
     {n_jobs}
 
@@ -75,7 +79,7 @@ class NearMiss(BaseUnderSampler):
     References
     ----------
     .. [1] I. Mani, I. Zhang. "kNN approach to unbalanced data distributions:
-           a case study involving information extraction," In Proceedings of
+           a case study involving information extraction", in Proceedings of
            workshop on learning from imbalanced datasets, 2003.
 
     Examples
@@ -125,15 +129,15 @@ def _selection_dist_based(
            Associated label to X.
 
        dist_vec : ndarray, shape (n_samples, )
-            The distance matrix to the nearest neigbour.
+            The distance matrix to the nearest neighbor.
 
        num_samples: int
            The desired number of samples to select.
 
        key : str or int,
            The target class.
 
-        sel_strategy : str, optional (default='nearest')
+        sel_strategy : str, default='nearest'
            Strategy to select the samples. Either 'nearest' or 'farthest'
 
        Returns
@@ -169,13 +173,13 @@ def _selection_dist_based(
            reverse=sort_way,
        )
 
-        # Throw a warning to tell the user that we did not have enough samples
-        # to select and that we just select everything
+        # Raise a warning to tell the user that there were not enough samples
+        # to select from and thus, that all samples will be selected
        if len(sorted_idx) < num_samples:
            warnings.warn(
                "The number of the samples to be selected is larger"
                " than the number of samples available. The"
-                " balancing ratio cannot be ensure and all samples"
+                " balancing ratio cannot be ensured and all samples"
                " will be returned."
            )
 
