
Load AccuCor data into MsRun, PeakGroup, PeakData #55


Merged
22 commits merged into main on May 7, 2021

Conversation

jcmatese (Collaborator) commented Apr 28, 2021

Summary Change Description

add researcher to MsRun model
tweak labeled_element_count validator in PeakData model
add command to load_accucor_msruns
add AccuCorDataLoader to utils
add openpyxl to requirement for Excel reading via pandas

Affected Issue Numbers

Code Review Notes

A lot of this is new to me, so everything deserves some close attention.
I attempted to use a transaction so the load is all-or-nothing. It has been tested some, but not exhaustively.
There is some hard-coding of data file expectations (column numbers) that is probably not future- or past-proof (it is tuned to the example file).
I did not implement many post-load tests, but I did try to perform some basic validations prior to load.
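The load-all-or-nothing behavior can be sketched in miniature. This is not the loader's actual code (which would use Django's transaction.atomic); it only shows the same rollback semantics, using stdlib sqlite3:

```python
import sqlite3

# Sketch: a mid-load validation failure rolls back everything inserted so far.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msrun (id INTEGER PRIMARY KEY, researcher TEXT)")
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO msrun (researcher) VALUES ('mneistat')")
        raise ValueError("simulated validation failure mid-load")
except ValueError:
    pass

# The partial insert was rolled back: the table is still empty.
count = conn.execute("SELECT COUNT(*) FROM msrun").fetchone()[0]
assert count == 0
```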

EDIT: We believe the known issue documented below may have been fixed by @lparsons' later model edit.

Known issue: It appears that the method for inserting uniquely into MsRun could be tightened up, probably because of time formatting.
example duplicate load:

{'_state': <django.db.models.base.ModelState object at 0x7fe5d820d760>,
 'date': datetime.datetime(2021, 4, 28, 17, 1, 12, 215368, tzinfo=<UTC>),
 'id': 560,
 'protocol_id': 1,
 'researcher': 'mneistat',
 'sample_id': 87}
{'_state': <django.db.models.base.ModelState object at 0x7fe5d820d370>,
 'date': datetime.datetime(2021, 4, 28, 17, 39, 57, 605820, tzinfo=<UTC>),
 'id': 617,
 'protocol_id': 1,
 'researcher': 'mneistat',
 'sample_id': 87}

resulted from
python manage.py load_accucor_msruns --accucor_filename "DataRepo/example_data/obob_maven_6eaas_inf.xlsx" --protocol 1 --date "2021-04-23" --researcher "mneistat"

I must not be setting/formatting the date correctly from the args.

Checklist

add researcher to MsRun model
tweak labeled_element_count validator in PeakData model
add command to load_accucor_msruns
add AccuCorDataLoader to utils
add openpyxl to requirement for Excel reading via pandas
jcmatese (Collaborator, Author) commented Apr 28, 2021

I suspect the date issue is due to the auto_now_add behavior in the model

date = models.DateTimeField(auto_now=False, auto_now_add=True, editable=True)

where I would prefer it to be "date the run was performed", not "date the record was entered", but that raised other issues of provenance and database auditing that we have not discussed in detail.
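The duplicate rows above illustrate the problem: auto_now_add stamps each record with its creation time, so two loads of the same run never collide on this field. A minimal demonstration using the timestamps from the duplicate load:

```python
from datetime import datetime, timezone

# The two duplicate MsRun rows: auto_now_add stamped each with its creation
# time, so the values differ even though the run itself is the same.
first_load = datetime(2021, 4, 28, 17, 1, 12, 215368, tzinfo=timezone.utc)
second_load = datetime(2021, 4, 28, 17, 39, 57, 605820, tzinfo=timezone.utc)

assert first_load != second_load  # a uniqueness check on this field never fires
assert first_load.date() == second_load.date()  # yet the run date is identical
```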

I give up on the linting issues, because Django migrations and Flake8 will always run afoul of each other, when it comes to line length.

lparsons (Contributor)

@jcmatese Looks like there is a missing migration (forgot to check it in perhaps?)

I give up on the linting issues, because Django migrations and Flake8 will always run afoul of each other, when it comes to line length.

This is exactly why Rob (and I) had initially suggested we not lint the generated code. If you now agree that linting the migrations is more trouble than it's worth, I'd happily agree. Feel free to submit a PR or comment your approval here and I'll do it.

jcmatese (Collaborator, Author)

@jcmatese Looks like there is a missing migration (forgot to check it in perhaps?)

I give up on the linting issues, because Django migrations and Flake8 will always run afoul of each other, when it comes to line length.

This is exactly why Rob (and I) had initially suggested we not lint the generated code. If you now agree that linting the migrations is more trouble than it's worth, I'd happily agree. Feel free to submit a PR or comment your approval here and I'll do it.

Yes, that is fine by me. I am tired of fighting with it. Skipping the migrations gets a green light from me.

jcmatese (Collaborator, Author) commented Apr 28, 2021

The problem is that pylint thinks this is too long

help_text="the M+ value (i.e. Label) for this observation. '1' means one atom is labeled. '3' means 3 atoms are labeled",

But if I edit it to dodge that bullet
help_text="the M+ value (i.e. Label) for this observation. "
"'1' means one atom is labeled. '3' means 3 atoms are labeled",

Then makemigrations complains "that is not what I would have written", presumably.
https://github.com/Princeton-LSI-ResearchComputing/tracebase/runs/2460869637

It is probably all whitespace differences, and a waste of our collective time.
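For reference, Python's implicit concatenation of adjacent string literals makes the wrapped form identical to the original, provided the space at the split point is preserved:

```python
one_line = "the M+ value (i.e. Label) for this observation. '1' means one atom is labeled. '3' means 3 atoms are labeled"

# Adjacent string literals are concatenated by the parser; the trailing
# space before the split must be kept for the two forms to match.
wrapped = (
    "the M+ value (i.e. Label) for this observation. "
    "'1' means one atom is labeled. '3' means 3 atoms are labeled"
)
assert one_line == wrapped
```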

lparsons (Contributor)

The problem is that pylint thinks this is too long

help_text="the M+ value (i.e. Label) for this observation. '1' means one atom is labeled. '3' means 3 atoms are labeled",

But if I edit it to dodge that bullet

help_text="the M+ value (i.e. Label) for this observation. "
"'1' means one atom is labeled. '3' means 3 atoms are labeled",

Then makemigrations complains "that is not what I would have written", presumably.
https://github.com/Princeton-LSI-ResearchComputing/tracebase/runs/2460869637

You removed a space when you made that change.

Running the load script for the example data via GitHub Actions helps to ensure that piece of code is still functioning as intended.
Having the time as a component of the field made the implementation of the unique constraint ineffective. Also, setting the date to now on creation is more suited to a timestamp field than to a field that is tracking when the mass spec was run.
Some Excel files can be formatted such that pandas read_excel puts rows in the data frame that are empty (all NaN).
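The all-NaN-row behavior noted in the commit message above can be handled with dropna; a small sketch (the actual loader code may differ):

```python
import numpy as np
import pandas as pd

# Rows that read_excel produces as entirely NaN can be dropped with how="all";
# rows with at least one real value are kept.
df = pd.DataFrame(
    {"compound": ["glucose", np.nan, np.nan], "intensity": [1.0, np.nan, 2.0]}
)
cleaned = df.dropna(axis=0, how="all")
assert len(cleaned) == 2  # only the all-NaN middle row was removed
```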
jcmatese mentioned this pull request Apr 30, 2021
hepcat72 (Collaborator) left a comment

I want to resume my review, but since there were pushes, I don't know what would happen if I refresh this page, so I'm submitting a partial review so as not to lose my work.

hepcat72 (Collaborator)

So 4 of my issues became "outdated" due to the pushes. The code the issues are about still exists and thus the issues are still relevant. They are just disconnected. I'm not sure what the best way is to deal with those disconnected issues. Probably we should establish a policy to not push until all reviewers have responded, but on the grand scale, it seems an unnecessary delay. Ideally, there would be an easy way to "fix" these "outdated" issues by re-attaching them to the new line numbers. Unfortunately, there's no mechanism to do that. (Is that correct?)

The only viable workaround is somewhat labor-intensive: Copy the issue and comments, locate the new location of the code, create a new issue, and paste in the issue content.

Do you guys have an opinion?

hepcat72 self-requested a review April 30, 2021 17:57
lparsons (Contributor)

So 4 of my issues became "outdated" due to the pushes.

I seem to see the comments and the code in the "Conversation" tab. I don't think it will be an issue to address them as is, but I'll let @jcmatese have a go at things first. Once this round of comments is addressed we can reassess, but I think this is pretty close.

fkang-pu (Collaborator) left a comment

The transaction.atomic seems to be working. I checked out this pull request yesterday morning and was able to load data from obob_maven_6eaas_inf.xlsx (after fixing the row for "citrate" in the compound table), but got an error loading "obob_maven_c160_inf.xlsx" due to the missing sample "bkbk1123".

One concern I have is that we only store the AccuCor file name(s) in the log for msrun/peakdata loading. Querying the database after loading, we wouldn't know the source file(s) for a specific peakgroup. I think it's important to be able to trace back to the source file, which would be helpful for data verification and search/grouping. My proposal is adding dataset and archive_file tables, or at least dataset for now (peakgroup has a foreign key to dataset; dataset has an M:M relationship to archive_file). Not sure if it makes sense to you. I raise the question without approving this pull request, as it could affect the PeakGroup model structure and loading script.

P.S. We could add code to check the formula in addition to verifying/linking the peakgroup to compound(s), to make sure the formula for a peakgroup matches that in the compound table.
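A rough sketch of the proposed provenance schema, written here as plain SQL with illustrative (not final) table names, column names, and sample values:

```python
import sqlite3

# Hypothetical sketch of the proposed tables: peak_group -> dataset (FK),
# dataset <-> archive_file (M:M). Names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE archive_file (id INTEGER PRIMARY KEY, filename TEXT NOT NULL);
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE dataset_archive_file (
        dataset_id INTEGER REFERENCES dataset(id),
        archive_file_id INTEGER REFERENCES archive_file(id)
    );
    CREATE TABLE peak_group (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        dataset_id INTEGER REFERENCES dataset(id)
    );
    INSERT INTO archive_file VALUES (1, 'obob_maven_6eaas_inf.xlsx');
    INSERT INTO dataset VALUES (1, 'example dataset');
    INSERT INTO dataset_archive_file VALUES (1, 1);
    INSERT INTO peak_group VALUES (1, 'glucose', 1);
    """
)
# With these links in place, a peak group can be traced back to its source file.
row = conn.execute(
    """
    SELECT af.filename FROM peak_group pg
    JOIN dataset_archive_file daf ON daf.dataset_id = pg.dataset_id
    JOIN archive_file af ON af.id = daf.archive_file_id
    WHERE pg.name = 'glucose'
    """
).fetchone()
assert row[0] == "obob_maven_6eaas_inf.xlsx"
```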

hepcat72 (Collaborator) commented May 4, 2021

Let me know when this is ready for a second-round review.

hepcat72 (Collaborator) commented May 4, 2021

Finishing up my initial review. Note, I haven't yet read any comments on the first half of my review, so please point out any possible redundancies. I have a number of requested changes. One last change request:

B. We need tests for each of the requirements in the issue:

  1. xlsx file (AccuCor format, required): test that supplying the wrong file type or no file fails
  2. date (required): test that the load fails when no date is supplied
  3. protocol, either an integer (protocol.id) or a name: test that supplying a protocol name succeeds, that supplying a protocol ID succeeds, and that not supplying a protocol fails
  4. researcher name: test for failure when no name is supplied
  5. sample names are unique within the file: supply a file with a duplicate sample name and test for failure
  6. sample names must already be in the database: test for failure with a sample name not in the DB
  7. compound labels must already be in the database: test for failure with a compound not in the DB
  • Lance created an issue for these

hepcat72 mentioned this pull request May 4, 2021
added TracerLabeledClass.tracer_labeled_elements_list
clarified error messages
resolved some model/migrations conflicts
jcmatese requested review from lparsons, hepcat72 and fkang-pu May 4, 2021 23:28
hepcat72 (Collaborator) left a comment

There were many issues marked as resolved for which no resolution or explanation was offered. And where a comment was added indicating a change, there is no change evident in the code. I do not understand why this is the case, but I can only assume that the changes have not been pushed.

Let me re-affirm the order of operations I thought we were all following:

  1. Submit a PR & request an initial review
  2. Reviewers add issues
  3. Author accepts/rejects each issue
  4. Author implements changes to the accepted issues
  5. Author pushes that work to the PR branch
  6. Author requests subsequent review
  7. Reviewer checks all issues, evaluates the changes made or the reason for rejecting each issue, and either agrees to the proposed resolution (and resolves it) or starts a dialog on the issue. New issues may be created for new code, and the review is submitted.
  8. After all reviewers have finished their reviews, if changes are requested or there are further comments, go back to step 3. Otherwise, merge.

I was also expecting that Lance's changes would be merged with this branch and that all existing changes on this branch had been pushed.

Is there a technical issue here? Why do I not see the changes that were commented to have been done?

hepcat72 (Collaborator) left a comment

There is 1 new non-blocking issue, 1 new blocking (but trivial to resolve) issue, 1 old non-blocking issue whose intended resolution I do not know, and 1 old blocking issue. I can't link the new issues in this comment (so I will qualify them), but the old issues are:

However, the "resolved" statuses of all my issues are current. Whether an issue is or is not "resolved", that status is accurate according to my latest round of review.

jcmatese added 3 commits May 5, 2021 16:34
removed TracerLabeledClass.tracer_labeled_element_regex_pattern
added AccuCorDataLoader.corrected_file_tracer_labeled_column_regex
changed regex to deal with >1 character symbols
Peak group name lookups are now case-sensitive
Edited all example files to update 'Glucose' to 'glucose'
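The regex change for multi-character element symbols noted in the commits above might look something like this; the column name format and pattern are hypothetical here (the real pattern lives in AccuCorDataLoader and may differ):

```python
import re

# Hypothetical sketch: capture a one- or two-character element symbol
# from a labeled-column name, rather than assuming a single character.
labeled_column = re.compile(r"^([A-Z][a-z]?)_Label$")

assert labeled_column.match("C_Label").group(1) == "C"
assert labeled_column.match("Se_Label").group(1) == "Se"  # two-character symbol
assert labeled_column.match("intensity") is None  # non-label columns are skipped
```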
jcmatese (Collaborator, Author) commented May 5, 2021

@hepcat72 I changed the regex, but will not address the other two in this go around.

@lparsons believes the formula validation might be best served during model save() somehow (perhaps using chempy or cross-check with compounds). We might also tweak the incoming data with pandas read_excel dtype or converters arguments, which might set default behavior for null cells/strings. The jury is still out on the best solution, and Lance proposes #61
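A minimal sketch of the formula cross-check idea, with a hypothetical helper name; the eventual implementation (in model save() or the loader) may look quite different:

```python
# Hypothetical helper: the formula read from the AccuCor file must match
# the formula stored for the linked compound in the compound table.
def check_peak_group_formula(file_formula, compound_formula):
    if file_formula != compound_formula:
        raise ValueError(
            f"formula mismatch: file has {file_formula!r}, "
            f"compound table has {compound_formula!r}"
        )

check_peak_group_formula("C6H12O6", "C6H12O6")  # matching formulas pass silently
try:
    check_peak_group_formula("C6H12O6", "C6H8O7")  # mismatch raises
    raised = False
except ValueError:
    raised = True
assert raised
```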

related: I did remove the peak_group.name.lower() accommodation in commit 4314796 and updated all the example files to the best of my ability.

Regarding reporting all issues in one pass instead of two: I suggest we get this into usage and see how often it is a problem. I also suspect this will not be the last time this code is modified, and perhaps an entirely different interface might be put on it (multiple tsv files, web-wrapped, etc.)

hepcat72 (Collaborator) commented May 5, 2021

Well, re-request my review when you're ready for me to take another look to re-evaluate. Those two outstanding issues (if I know which ones you're referring to) should be OK to pass on, given that Lance created an issue for the blocking one.

hepcat72 (Collaborator) commented May 5, 2021

Regarding reporting all issues in one pass instead of two: I suggest we get this into usage and see how often it is a problem. I also suspect this will not be the last time this code is modified, and perhaps an entirely different interface might be put on it (multiple tsv files, web-wrapped, etc.)

Could you explain what you're referring to here @jcmatese?

jcmatese (Collaborator, Author) commented May 5, 2021

Regarding reporting all issues in one pass instead of two: I suggest we get this into usage and see how often it is a problem. I also suspect this will not be the last time this code is modified, and perhaps an entirely different interface might be put on it (multiple tsv files, web-wrapped, etc.)

Could you explain what you're referring to here @jcmatese?

#55 (comment)

jcmatese requested a review from hepcat72 May 5, 2021 22:59
hepcat72 (Collaborator) left a comment

All issues resolved. 🎉

jcmatese merged commit 3a16e43 into main May 7, 2021
lparsons deleted the accucor-data-load branch June 17, 2021 14:27
Successfully merging this pull request may close these issues.

MSRUN Data loading