Sample name uniqueness check added to accucor data loader #170

hepcat72 · 2021-08-20T21:34:59Z

Summary Change Description

Added:

A raised exception for duplicate sample headers in the accucor data loader
A raised exception when the study name column can't be found
A raised exception when the header cells are empty
A more descriptive exception when the number of column headers differs between the two sheets
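As a rough illustration of the empty-header and header-count checks listed above, here is a minimal sketch. It is not the code from this PR: `HeaderValidationError`, `check_headers`, and the assumption that header cells arrive as plain Python values (with `None` or blank strings for empty cells) are all hypothetical.

```python
class HeaderValidationError(Exception):
    """Hypothetical exception type; the loader's real exception class may differ."""


def check_headers(first_sheet_headers, second_sheet_headers):
    """Raise HeaderValidationError for empty header cells or mismatched counts."""
    for sheet_label, headers in (
        ("first", first_sheet_headers),
        ("second", second_sheet_headers),
    ):
        # A header cell counts as empty if it is None or only whitespace.
        empty_positions = [
            index
            for index, cell in enumerate(headers)
            if cell is None or str(cell).strip() == ""
        ]
        if empty_positions:
            raise HeaderValidationError(
                f"Empty header cell(s) in the {sheet_label} sheet at column "
                f"position(s) {empty_positions}."
            )

    # Report both counts so the user can see which sheet is off.
    if len(first_sheet_headers) != len(second_sheet_headers):
        raise HeaderValidationError(
            f"Header counts differ: {len(first_sheet_headers)} in the first "
            f"sheet, {len(second_sheet_headers)} in the second sheet."
        )
```

Naming the sheet and the offending positions in the message is the point of the exercise: the user can fix the spreadsheet without reading a traceback.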

Notably, without the uniqueness check, the read_excel method silently modifies duplicate column headers. The result is a cryptic error stating "Could not find sample <sample name>.1 in the database.", where the appended ".1" is the suffix added to the second occurrence of the duplicate header.

Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/tests/test_models.py DataRepo/utils.py DataRepo/example_data/obob_maven_6eaas_inf_sample_dupe.xlsx

Affected Issue Numbers

Resolves Check sample uniqueness in the accucor loader script #66
Partially resolves MSRUN Data loading #45

Code Review Notes

Passing mangle_dupe_cols=False to pandas.read_excel and then calling df.columns.duplicated() in the validation code would have been preferable, but it turns out that pandas.read_excel does not yet support mangle_dupe_cols=False.

The uniqueness validation happens inside load_accucor_data.py because the AccuCorDataLoader is passed the already-parsed sheets, while the check needs a separate read of the sheets (with header=None). Ideally, this check would be done in AccuCorDataLoader.validate_dataframes, but since pandas.read_excel always mangles non-unique column headers, there is no way to do the check without a second read of the file with headers turned off. I'm open to suggestions on reorganizing that check.
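The separate header-less read described above could be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the function names are hypothetical, and the header is assumed to be the first row of the sheet.

```python
from collections import Counter


def find_duplicate_headers(header_cells):
    """Return header labels that occur more than once, in first-seen order."""
    # Skip missing cells: None, or NaN (the only value that is not equal to itself).
    labels = (
        str(cell) for cell in header_cells if cell is not None and cell == cell
    )
    counts = Counter(labels)
    return [label for label, count in counts.items() if count > 1]


def read_raw_header(path, sheet_name):
    """Read only the first row of a sheet as data, so pandas cannot rename it."""
    import pandas as pd  # imported lazily; the helper above needs no dependencies

    # header=None makes pandas treat row 0 as data; nrows=1 keeps the read small.
    raw = pd.read_excel(path, sheet_name=sheet_name, header=None, nrows=1)
    return raw.iloc[0].tolist()


def check_sample_uniqueness(path, sheet_name):
    """Raise ValueError if the sheet's header row contains repeated labels."""
    duplicates = find_duplicate_headers(read_raw_header(path, sheet_name))
    if duplicates:
        raise ValueError(
            f"Duplicate column headers in sheet {sheet_name!r}: {duplicates}"
        )
```

Keeping the duplicate detection separate from the file read means the detection logic can be unit-tested with plain lists, without an Excel fixture.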

References:

Checklist

All issue requirements satisfied (or no linked issues)
Linting passes.
Migrations created & committed (or no model changes)
Tests implemented (or no code changes)
All tests pass

jcmatese

I think this should work, and may catch some edge cases for user input.

lparsons

This PR catches duplicate sample names in the files, and provides a clear error message when found. For this reason, I think we should merge it.

I think it's worth noting, however, that if a file had duplicate sample names, I do not believe it would have loaded before these changes, since the sample names wouldn't be found in the database (they would have had an extra suffix added). The same is true for the extra check on columns without names and the explicit check for the STUDY_NAME column (neither of which has explicit tests, btw).

hepcat72 · 2021-08-27T16:41:47Z

Since this is all approved and I addressed all outstanding issues in the manner suggested, I will merge.

hepcat72 requested review from lparsons, jcmatese and fkang-pu August 20, 2021 21:35

jcmatese approved these changes Aug 23, 2021

hepcat72 added 2 commits August 24, 2021 11:31

Removed the numpy dependency as per a review issue.

38b8513

Files checked in: DataRepo/management/commands/load_accucor_msruns.py

Changed string concatenation to use f"", per review issue.

b5267eb

Files checked in: DataRepo/utils.py

lparsons approved these changes Aug 26, 2021

fkang-pu approved these changes Aug 27, 2021

Addressed review issues.

1b9e822

Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/utils.py

hepcat72 merged commit 0f85bf4 into main Aug 27, 2021

hepcat72 deleted the dupe_accucor_samples branch January 7, 2022 15:37
