-
Notifications
You must be signed in to change notification settings - Fork 1
Sample name uniqueness check added to accucor data loader #170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This resolves issue #66. I also added: - A raised exception when the study name column can't be found - A raised exception when the header cells are empty - A more descriptive exception when the number of column headers differ in the two sheets It's notable that without the uniqueness check, the read_excel method silently modifies the column header and this results in a cryptic error stating "Could not find sample <sample name>.1 in the database.", where the ".1" appendation was the modification to the second occurrence of the duplicate header. Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/tests/test_models.py DataRepo/utils.py DataRepo/example_data/obob_maven_6eaas_inf_sample_dupe.xlsx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should work, and may catch some edge cases for user input.
Files checked in: DataRepo/management/commands/load_accucor_msruns.py
Files checked in: DataRepo/utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR catches duplicate sample names in the files, and provides a clear error message when found. For this reason, I think we should merge it.
I think it's worth noting, however, that if a file had duplicate sample names, I do not believe it would not have loaded before these changes since the sample names wouldn't be found in the database (they would have had an extra suffix added). The same is true for the extra check on columns without names and the test explicit check for STUDY_NAME
column, (neither of which have explicit tests, btw).
Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/utils.py
Since this is all approved and I addressed all outstanding issues in the manner suggested, I will merge. |
Summary Change Description
Added:
It's notable that without the uniqueness check, the read_excel method silently modifies the column header and this results in a cryptic error stating "Could not find sample .1 in the database.", where the appended ".1" was the modification to the second occurrence of the duplicate header.
Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/tests/test_models.py DataRepo/utils.py DataRepo/example_data/obob_maven_6eaas_inf_sample_dupe.xlsx
Affected Issue Numbers
Code Review Notes
Note that adding
mangle_dupe_cols=False
followed bydf.columns.duplicated()
in the validation code would have been preferable, but it turns out thatmangle_dupe_cols=False
as an option topandas.read_excel
is not yet supported.The uniqueness validation happens inside
load_accucor_data.py
since what is passed to theAccuCorDataLoader
is the sheets and the check needs to be a different read of the sheets (withheader=None
). Ideally, this check would be done inAccuCorDataLoader.validate_dataframes
, but sincepandas.read_excel
always mangles non-unique column headers, there's no way to do the check without a separate read of the file with headers turned off. I'm open to suggestions on reorganizing that check.References:
header=None
read_excel
mangle_dupe_cols=False
unsupportedChecklist