Skip to content

statistics module: NaN handling in median, median_high an median_low #77265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dcasmr mannequin opened this issue Mar 16, 2018 · 16 comments
Closed

statistics module: NaN handling in median, median_high an median_low #77265

dcasmr mannequin opened this issue Mar 16, 2018 · 16 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dcasmr
Copy link
Mannequin

dcasmr mannequin commented Mar 16, 2018

BPO 33084
Nosy @rhettinger, @mdickinson, @ericvsmith, @stevendaprano, @DavidMertz, @maheshwark97, @dcasmr, @jfine2358, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = 'https://github.com/stevendaprano'
closed_at = None
created_at = <Date 2018-03-16.05:58:02.759>
labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
title = 'statistics module: NaN handling in median, median_high an median_low'
updated_at = <Date 2021-08-28.02:16:38.517>
user = 'https://github.com/dcasmr'

bugs.python.org fields:

activity = <Date 2021-08-28.02:16:38.517>
actor = 'steven.daprano'
assignee = 'steven.daprano'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2018-03-16.05:58:02.759>
creator = 'dcasmr'
dependencies = []
files = []
hgrepos = []
issue_num = 33084
keywords = []
message_count = 15.0
messages = ['313933', '313938', '313940', '313949', '313950', '313953', '313956', '313965', '313972', '327281', '333135', '333151', '399968', '399995', '400456']
nosy_count = 9.0
nosy_names = ['rhettinger', 'mark.dickinson', 'eric.smith', 'steven.daprano', 'DavidMertz', 'maheshwark97', 'dcasmr', 'jfine2358', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue33084'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

@dcasmr
Copy link
Mannequin Author

dcasmr mannequin commented Mar 16, 2018

When a list or dataframe serie contains NaN(s), the median, median_low and median_high are computed in Python 3.6.4 statistics library, however, the results are wrong.
Either, it should return a NaN just like when we try to compute a mean or point the user to drop the NaNs before computing those statistics.
Example:
import numpy as np
import statistics as stats

data = [75, 90,85, 92, 95, 80, np.nan]
Median  = stats.median(data)
Median_low = stats.median_low(data)
Median_high = stats.median_high(data)
The results from above return ALL 90 which are incorrect.

Correct answers should be:
Median = 87.5
Median_low = 85
Median_high = 92
Thanks,
Luc

@dcasmr dcasmr mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 16, 2018
@maheshwark97
Copy link
Mannequin

maheshwark97 mannequin commented Mar 16, 2018

Will just removing all np.nan values do the job? Btw the values will be:
median = 88.5
median_low = 85
median_high = 90
I can correct it and send a pull request.

@mdickinson
Copy link
Member

Will just removing all np.nan values do the job?

Unfortunately, I don't think it's that simple. You want consistency across the various library calls, so if the various median functions are changed to treat NaNs as missing data, then the other functions should be, too.

@maheshwark97
Copy link
Mannequin

maheshwark97 mannequin commented Mar 16, 2018

Well if i dont consider np.nan as missing data and consider all other values then the answer being 90 is correct,right?

@mdickinson
Copy link
Member

then the answer being 90 is correct,right?

How do you deduce that? Why 90 rather than 85 (or 87.5, or some other value)?

For what it's worth, NumPy gives a result of NaN for the median of an array that contains NaNs:

>>> np.median([1.0, 2.0, 3.0, np.nan])
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median
  r = func(a, **kwargs)
nan

@stevendaprano
Copy link
Member

On Fri, Mar 16, 2018 at 02:32:36PM +0000, Mark Dickinson wrote:

For what it's worth, NumPy gives a result of NaN for the median of an array that contains NaNs:

By default, R gives the median of a list containing either NaN or NA
("not available", intended as a missing value) as NA:

median(c(1, 2, 3, 4, NA))
[1] NA
median(c(1, 2, 3, 4, NaN))
[1] NA

but you can ignore them too:

median(c(1, 2, 3, 4, NA), na.rm=TRUE)
[1] 2.5
median(c(1, 2, 3, 4, NaN), na.rm=TRUE)
[1] 2.5

@dcasmr
Copy link
Mannequin Author

dcasmr mannequin commented Mar 16, 2018

Just to make sure we are focused on the issue, the reported bug is with the statistics library (not with numpy). It happens, when there is at least one missing value in the data and involves the computation of the median, median_low and median_high using the statistics library.
The test was performed on Python 3.6.4.

When there is no missing values (NaNs) in the data, computing the median, median_high and median_low from the statistics library work fine.
So, yes, removing the NaNs (or imputing for them) before computing the median(s) resolve the issue.
Also, just like statistics.mean(data) when data has missing return a nan, the median, median_high and median_low should behave the same way.

import numpy
import statistics as stats

data = [75, 90,85, 92, 95, 80, np.nan]

Median = stats.median(data) 
Median_high = stats.median_high(data)
Median_low = stats.median_low(data)
print("The incorrect Median is", Median)
The incorrect Median is, 90
print("The incorrect median high is", Median_high)
The incorrect median high is, 90
print("The incorrect median low is", Median_low)
The incorrect median low is, 90

## Mean returns nan
Mean = stats.mean(data)
prin("The mean is", Mean)
The mean is, nan

Now, when we drop the missing values, we have:
data2 = [75, 90,85, 92, 95, 80]
stats.median(data2)
87.5
stats.median_high(data2)
90
stats.median_low(data2)
85

@maheshwark97
Copy link
Mannequin

maheshwark97 mannequin commented Mar 16, 2018

So From the above i am to conclude that removing np.nan is the best path to be taken? Also the above step is to be included in median_grouped as well right?

@dcasmr
Copy link
Mannequin Author

dcasmr mannequin commented Mar 16, 2018

If we are trying to fix this, the behavior should be like computing the mean or harmonic mean with the statistics library when there are missing values in the data. At least that way, it is consistent with how the statistics library works when computing with NaNs in the data. Then again, it should be mentioned somewhere in the docs.

import statistics as stats
import numpy as np
import pandas as pd
data = [75, 90,85, 92, 95, 80, np.nan]
stats.mean(data)
nan
stats.harmonic_mean(data)
nan
stats.stdev(data)
nan
As you can see, when there is a missing value, computing the mean, harmonic mean and sample standard deviation with the statistics library 
return a nan.
However, with the median, median_high and median_low, it computes those statistics incorrectly with the missing values present in the data.
It is better to return a nan, then let the user drop (or resolve) any missing values before computing.
## Another example using pandas serie
df = pd.DataFrame(data, columns=['data'])
df.head()
        data
0	75.0
1	90.0
2	85.0
3	92.0
4	95.0
5	80.0
6	NaN

### Use the statistics library to compute the median of the serie
stats.median(df1['data'])
90
 
## Pandas returns the correct median by dropping the missing values
## Now use pandas to compute the median of the serie with missing value
df['data'].median()
87.5

I did not test the median_grouped in statistics library, but will let you know afterwards if its affected as well.

@stevendaprano
Copy link
Member

I want to revisit this for 3.8.

I agree that the current implementation-dependent behaviour when there are NANs in the data is troublesome. But I don't think that there is a single right answer.

I also agree with Mark that if we change median, we ought to change the other functions so that people can get consistent behaviour. It wouldn't be good for median to ignore NANs and mean to process them.

I'm inclined to add a parameter to the statistics functions to deal with NANs, that allow the caller to select from:

  • implementation-dependent, i.e. what happens now;
    (for speed, and backwards compatibility, this would be the default)

  • raise an exception;

  • return a NAN;

  • skip any NANs (treat them as missing values to be ignored).

I think that raise/return/ignore will cover most use-cases for NANs, and the default will be suitable for the "easy cases" where there are no NANs, without paying any performance penalty if you already know your data has no NANs.

Thoughts?

I'm especially looking for ideas on what to call the first option.

@stevendaprano stevendaprano added the 3.8 (EOL) end of life label Oct 7, 2018
@stevendaprano stevendaprano self-assigned this Oct 7, 2018
@DavidMertz
Copy link
Mannequin

DavidMertz mannequin commented Jan 7, 2019

I believe that the current behavior of statistics.median[|_low|_high\] is simply broken. It relies on the particular behavior of Python sorting, which only utilizes .__lt__() between objects, and hence does not require a total order.

I can think of absolutely no way to characterize these as reasonable results:

Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
>>> statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
1
>>> statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
nan

@jfine2358
Copy link
Mannequin

jfine2358 mannequin commented Jan 7, 2019

Based on a quick review of the python docs, the bug report, PEP-450
and this thread, I suggest

  1. More carefully draw attention to the NaN feature, in the
    documentation for existing Python versions.
  2. Consider revising statistics.py so that it raises an exception,
    when passed NaN data.

This implies dividing this issue into two parts: legacy and future.

For more information, see:
https://mail.python.org/pipermail/python-ideas/2019-January/054872.html

@iritkatriel
Copy link
Member

Reproduced in 3.11:

>>> import numpy as np
>>> import statistics as stats
>>> data = [75, 90,85, 92, 95, 80, np.nan]
>>> stats.median(data)
90
>>> stats.median_low(data)
90
>>> stats.median_high(data)
90

@iritkatriel iritkatriel added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes and removed 3.8 (EOL) end of life labels Aug 20, 2021
@iritkatriel iritkatriel changed the title Computing median, median_high an median_low in statistics library statistics module: NaN handling in median, median_high an median_low Aug 20, 2021
@rhettinger
Copy link
Contributor

[Steven]

Thoughts?

  1. Document that results are undefined if a NaN is present in the data.

  2. Add function to strip NaNs from the data:

    def remove_nans(iterable):
        "Remove float('NaN') and other objects not equal to themselves"
        return [x for x in iterable if x == x]

@stevendaprano
Copy link
Member

stevendaprano pushed a commit that referenced this issue Jul 10, 2022
…count (#94676)

* Document NaN handling in functions that sort or count

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <[email protected]>

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <[email protected]>

* Fix trailing whitespace and rewrap text

Co-authored-by: Erlend Egeberg Aasland <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 10, 2022
…rt or count (pythonGH-94676)

* Document NaN handling in functions that sort or count

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <[email protected]>

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <[email protected]>

* Fix trailing whitespace and rewrap text

Co-authored-by: Erlend Egeberg Aasland <[email protected]>
(cherry picked from commit ef61b25)

Co-authored-by: Raymond Hettinger <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 10, 2022
…rt or count (pythonGH-94676)

* Document NaN handling in functions that sort or count

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <[email protected]>

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <[email protected]>

* Fix trailing whitespace and rewrap text

Co-authored-by: Erlend Egeberg Aasland <[email protected]>
(cherry picked from commit ef61b25)

Co-authored-by: Raymond Hettinger <[email protected]>
@hauntsaninja
Copy link
Contributor

Closing, since it looks like the documentation change has been made. Raymond additionally suggested adding a function to remove NaNs from a list; judging by the fact that the documentation change uses filterfalse + isnan, it appears we maybe elected not to make this addition (which would probably be deserving of its own issue anyway).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants