Skip to content

Sample name uniqueness check added to accucor data loader #170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 27, 2021

Conversation

hepcat72
Copy link
Collaborator

@hepcat72 hepcat72 commented Aug 20, 2021

Summary Change Description

Added:

  • A raised exception for duplicate sample headers in the accucor data loader
  • A raised exception when the study name column can't be found
  • A raised exception when the header cells are empty
  • A more descriptive exception when the number of column headers differ in the two sheets

It's notable that without the uniqueness check, the read_excel method silently modifies the column header and this results in a cryptic error stating "Could not find sample .1 in the database.", where the appended ".1" was the modification to the second occurrence of the duplicate header.

Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/tests/test_models.py DataRepo/utils.py DataRepo/example_data/obob_maven_6eaas_inf_sample_dupe.xlsx

Affected Issue Numbers

Code Review Notes

Note that adding mangle_dupe_cols=False followed by df.columns.duplicated() in the validation code would have been preferable, but it turns out that mangle_dupe_cols=False as an option to pandas.read_excel is not yet supported.

The uniqueness validation happens inside load_accucor_data.py since what is passed to the AccuCorDataLoader is the sheets and the check needs to be a different read of the sheets (with header=None). Ideally, this check would be done in AccuCorDataLoader.validate_dataframes, but since pandas.read_excel always mangles non-unique column headers, there's no way to do the check without a separate read of the file with headers turned off. I'm open to suggestions on reorganizing that check.

References:

Checklist

This resolves issue #66.

I also added:
- A raised exception when the study name column can't be found
- A raised exception when the header cells are empty
- A more descriptive exception when the number of column headers differ in the two sheets

It's notable that without the uniqueness check, the read_excel method silently modifies the column header and this results in a cryptic error stating "Could not find sample <sample name>.1 in the database.", where the ".1" appendation was the modification to the second occurrence of the duplicate header.

Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/tests/test_models.py DataRepo/utils.py DataRepo/example_data/obob_maven_6eaas_inf_sample_dupe.xlsx
Copy link
Collaborator

@jcmatese jcmatese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work, and may catch some edge cases for user input.

Files checked in: DataRepo/management/commands/load_accucor_msruns.py
Copy link
Contributor

@lparsons lparsons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR catches duplicate sample names in the files, and provides a clear error message when found. For this reason, I think we should merge it.

I think it's worth noting, however, that if a file had duplicate sample names, I do not believe it would not have loaded before these changes since the sample names wouldn't be found in the database (they would have had an extra suffix added). The same is true for the extra check on columns without names and the test explicit check for STUDY_NAME column, (neither of which have explicit tests, btw).

Files checked in: DataRepo/management/commands/load_accucor_msruns.py DataRepo/utils.py
@hepcat72
Copy link
Collaborator Author

Since this is all approved and I addressed all outstanding issues in the manner suggested, I will merge.

@hepcat72 hepcat72 merged commit 0f85bf4 into main Aug 27, 2021
@hepcat72 hepcat72 deleted the dupe_accucor_samples branch January 7, 2022 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check sample uniqueness in the accucor loader script MSRUN Data loading
4 participants