
Load AccuCor data into MsRun, PeakGroup, PeakData #55


Merged
22 commits merged into main on May 7, 2021

Conversation

jcmatese (Collaborator) commented Apr 28, 2021

Summary Change Description

add researcher to MsRun model
tweak labeled_element_count validator in PeakData model
add command to load_accucor_msruns
add AccuCorDataLoader to utils
add openpyxl to requirement for Excel reading via pandas

Affected Issue Numbers

Code Review Notes

A lot of this is new to me, so everything deserves some close attention.
I attempted to use a transaction so the load is all-or-nothing. It has been tested some, but not exhaustively.
There is some hard-coding of data file expectations (column numbers) that is probably not future- or past-proof (it is tuned to the example file).
I did not implement many post-load tests, but I did try to perform some basic validations prior to load.
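The load-all-or-nothing behavior can be sketched in miniature. This is not the loader's actual code (which would use Django's transaction.atomic); it only shows the same rollback semantics, using stdlib sqlite3:

```python
import sqlite3

# Sketch: a mid-load validation failure rolls back everything inserted so far.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msrun (id INTEGER PRIMARY KEY, researcher TEXT)")
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO msrun (researcher) VALUES ('mneistat')")
        raise ValueError("simulated validation failure mid-load")
except ValueError:
    pass

# The partial insert was rolled back: the table is still empty.
count = conn.execute("SELECT COUNT(*) FROM msrun").fetchone()[0]
assert count == 0
```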

EDIT: We believe the known issue documented below may have been fixed by @lparsons' later model edit.

Known issue: It appears that the method for inserting uniquely into MsRun could be tightened up, probably because of time formatting.
example duplicate load:

{'_state': <django.db.models.base.ModelState object at 0x7fe5d820d760>,
 'date': datetime.datetime(2021, 4, 28, 17, 1, 12, 215368, tzinfo=<UTC>),
 'id': 560,
 'protocol_id': 1,
 'researcher': 'mneistat',
 'sample_id': 87}
{'_state': <django.db.models.base.ModelState object at 0x7fe5d820d370>,
 'date': datetime.datetime(2021, 4, 28, 17, 39, 57, 605820, tzinfo=<UTC>),
 'id': 617,
 'protocol_id': 1,
 'researcher': 'mneistat',
 'sample_id': 87}

resulted from
python manage.py load_accucor_msruns --accucor_filename "DataRepo/example_data/obob_maven_6eaas_inf.xlsx" --protocol 1 --date "2021-04-23" --researcher "mneistat"

I must not be setting/formatting the date correctly from the args.

Checklist

add researcher to MsRun model
tweak labeled_element_count validator in PeakData model
add command to load_accucor_msruns
add AccuCorDataLoader to utils
add openpyxl to requirement for Excel reading via pandas
jcmatese (Collaborator, Author) commented Apr 28, 2021

I suspect the date issue is due to the auto_now_add behavior in the model

date = models.DateTimeField(auto_now=False, auto_now_add=True, editable=True)

where I would prefer it to be "date the run was performed", not "date the record was entered", but that raised other issues of provenance and database auditing that we have not discussed in detail.
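The duplicate rows above illustrate the problem: auto_now_add stamps each record with its creation time, so two loads of the same run never collide on this field. A minimal demonstration using the timestamps from the duplicate load:

```python
from datetime import datetime, timezone

# The two duplicate MsRun rows: auto_now_add stamped each with its creation
# time, so the values differ even though the run itself is the same.
first_load = datetime(2021, 4, 28, 17, 1, 12, 215368, tzinfo=timezone.utc)
second_load = datetime(2021, 4, 28, 17, 39, 57, 605820, tzinfo=timezone.utc)

assert first_load != second_load  # a uniqueness check on this field never fires
assert first_load.date() == second_load.date()  # yet the run date is identical
```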

I give up on the linting issues, because Django migrations and Flake8 will always run afoul of each other, when it comes to line length.

lparsons (Contributor)

@jcmatese Looks like there is a missing migration (forgot to check it in perhaps?)

I give up on the linting issues, because Django migrations and Flake8 will always run afoul of each other, when it comes to line length.

This is exactly why Rob (and I) had initially suggested we not lint the generated code. If you now agree that linting the migrations is more trouble than it's worth, I'd happily agree. Feel free to submit a PR or comment your approval here and I'll do it.

jcmatese (Collaborator, Author)

@jcmatese Looks like there is a missing migration (forgot to check it in perhaps?)

I give up on the linting issues, because Django migrations and Flake8 will always run afoul of each other, when it comes to line length.

This is exactly why Rob (and I) had initially suggested we not lint the generated code. If you now agree that linting the migrations is more trouble than it's worth, I'd happily agree. Feel free to submit a PR or comment your approval here and I'll do it.

Yes, that is fine by me. I am tired of fighting with it. Skipping the migrations gets a green light from me.

jcmatese (Collaborator, Author) commented Apr 28, 2021

The problem is that pylint thinks this is too long

help_text="the M+ value (i.e. Label) for this observation. '1' means one atom is labeled. '3' means 3 atoms are labeled",

But if I edit it to dodge that bullet
help_text="the M+ value (i.e. Label) for this observation. "
"'1' means one atom is labeled. '3' means 3 atoms are labeled",

Then makemigrations complains "that is not what I would have written", presumably.
https://github.com/Princeton-LSI-ResearchComputing/tracebase/runs/2460869637

It is probably all whitespace differences, and a waste of our collective time.
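For reference, Python's implicit concatenation of adjacent string literals makes the wrapped form identical to the original, provided the space at the split point is preserved:

```python
one_line = "the M+ value (i.e. Label) for this observation. '1' means one atom is labeled. '3' means 3 atoms are labeled"

# Adjacent string literals are concatenated by the parser; the trailing
# space before the split must be kept for the two forms to match.
wrapped = (
    "the M+ value (i.e. Label) for this observation. "
    "'1' means one atom is labeled. '3' means 3 atoms are labeled"
)
assert one_line == wrapped
```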

lparsons (Contributor)

The problem is that pylint thinks this is too long

help_text="the M+ value (i.e. Label) for this observation. '1' means one atom is labeled. '3' means 3 atoms are labeled",

But if I edit it to dodge that bullet

help_text="the M+ value (i.e. Label) for this observation. "
"'1' means one atom is labeled. '3' means 3 atoms are labeled",

Then makemigrations complains "that is not what I would have written", presumably.
https://github.com/Princeton-LSI-ResearchComputing/tracebase/runs/2460869637

You removed a space when you made that change.

Running the load script for the example data via GitHub Actions helps to ensure that piece of code is still functioning as intended.
Having the time as a component of the field made the implementation of the unique constraint ineffective. Also, setting the date to now on creation is more suited to a timestamp field than to a field that is tracking when the mass spec was run.
Some Excel files can be formatted such that pandas read_excel puts rows in the data frame that are empty (all NaN).
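The all-NaN-row behavior noted in the commit message above can be handled with dropna; a small sketch (the actual loader code may differ):

```python
import numpy as np
import pandas as pd

# Rows that read_excel produces as entirely NaN can be dropped with how="all";
# rows with at least one real value are kept.
df = pd.DataFrame(
    {"compound": ["glucose", np.nan, np.nan], "intensity": [1.0, np.nan, 2.0]}
)
cleaned = df.dropna(axis=0, how="all")
assert len(cleaned) == 2  # only the all-NaN middle row was removed
```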
jcmatese mentioned this pull request Apr 30, 2021
hepcat72 (Collaborator) left a comment

I want to resume my review, but since there were pushes, I don't know what would happen if I refresh this page, so I'm submitting a partial review so as not to lose my work.

hepcat72 (Collaborator)

So 4 of my issues became "outdated" due to the pushes. The code the issues are about still exists and thus the issues are still relevant. They are just disconnected. I'm not sure what the best way is to deal with those disconnected issues. Probably we should establish a policy to not push until all reviewers have responded, but on the grand scale, it seems an unnecessary delay. Ideally, there would be an easy way to "fix" these "outdated" issues by re-attaching them to the new line numbers. Unfortunately, there's no mechanism to do that. (Is that correct?)

The only viable workaround is somewhat labor-intensive: Copy the issue and comments, locate the new location of the code, create a new issue, and paste in the issue content.

Do you guys have an opinion?

hepcat72 self-requested a review April 30, 2021 17:57
lparsons (Contributor)

So 4 of my issues became "outdated" due to the pushes.

I seem to see the comments and the code in the "Conversation" tab. I don't think it will be an issue to address them as is, but I'll let @jcmatese have a go at things first. Once this round of comments is addressed we can reassess, but I think this is pretty close.

fkang-pu (Collaborator) left a comment

The transaction.atomic seems to be working. I checked out this pull request yesterday morning and was able to load data from obob_maven_6eaas_inf.xlsx (after fixing the row for "citrate" in the compound table), but got an error loading "obob_maven_c160_inf.xlsx" due to the missing sample "bkbk1123".

One concern I have is that we only store the AccuCor file name(s) in the log for msrun/peakdata loading. Querying the database after loading, we wouldn't know the source file(s) for a specific peakgroup. I think it's important to be able to trace back to the source file, which would be helpful for data verification and search/grouping. My proposal is adding dataset and archive_file tables, or at least dataset for now (peakgroup has a foreign key to dataset; dataset has an M:M relationship to archive_file). Not sure if it makes sense to you. I raise the question without approving this pull request, as it could affect the PeakGroup model structure and loading script.

P.S. We could add code to check the formula in addition to verifying/linking the peakgroup to compound(s), to make sure the formula for a peakgroup matches that in the compound table.
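A rough sketch of the proposed provenance schema, written here as plain SQL with illustrative (not final) table names, column names, and sample values:

```python
import sqlite3

# Hypothetical sketch of the proposed tables: peak_group -> dataset (FK),
# dataset <-> archive_file (M:M). Names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE archive_file (id INTEGER PRIMARY KEY, filename TEXT NOT NULL);
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE dataset_archive_file (
        dataset_id INTEGER REFERENCES dataset(id),
        archive_file_id INTEGER REFERENCES archive_file(id)
    );
    CREATE TABLE peak_group (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        dataset_id INTEGER REFERENCES dataset(id)
    );
    INSERT INTO archive_file VALUES (1, 'obob_maven_6eaas_inf.xlsx');
    INSERT INTO dataset VALUES (1, 'example dataset');
    INSERT INTO dataset_archive_file VALUES (1, 1);
    INSERT INTO peak_group VALUES (1, 'glucose', 1);
    """
)
# With these links in place, a peak group can be traced back to its source file.
row = conn.execute(
    """
    SELECT af.filename FROM peak_group pg
    JOIN dataset_archive_file daf ON daf.dataset_id = pg.dataset_id
    JOIN archive_file af ON af.id = daf.archive_file_id
    WHERE pg.name = 'glucose'
    """
).fetchone()
assert row[0] == "obob_maven_6eaas_inf.xlsx"
```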

hepcat72 (Collaborator) commented May 4, 2021

Let me know when this is ready for a second-round review.

hepcat72 (Collaborator) commented May 4, 2021

Finishing up my initial review. Note, I haven't yet read any comments on the first half of my review, so please point out any possible redundancies. I have a number of requested changes. One last change request:

B. We need tests for each of the requirements in the issue:

  1. xlsx file (AccuCor format, required): test that supplying the wrong file type or no file fails
  2. date (required): test that the load fails when no date is supplied
  3. protocol, either an integer (protocol.id) or a name: test that supplying a protocol name succeeds, that supplying a protocol ID succeeds, and that not supplying a protocol fails
  4. researcher name: test for failure when no name is supplied
  5. sample names are unique within the file: supply a file with a duplicate sample name and test for failure
  6. sample names must already be in the database: test for failure with a sample name not in the DB
  7. compound labels must already be in the database: test for failure with a compound not in the DB
  • Lance created an issue for these

hepcat72 mentioned this pull request May 4, 2021
added TracerLabeledClass.tracer_labeled_elements_list
clarified error messages
resolved some model/migrations conflicts
jcmatese requested review from lparsons, hepcat72 and fkang-pu May 4, 2021 23:28
hepcat72 (Collaborator) left a comment

There were many issues marked as resolved for which no resolution or explanation was offered. And where a comment was added indicating a change, there is no change evident in the code. I do not understand why this is the case, but I can only assume that the changes have not been pushed.

Let me re-affirm the order of operations I thought we were all following:

  1. Submit a PR & request an initial review
  2. Reviewers add issues
  3. Author accepts/rejects each issue
  4. Author implements changes to the accepted issues
  5. Author pushes that work to the PR branch
  6. Author requests subsequent review
  7. Reviewer checks all issues, evaluates the changes made or the reason for rejecting each issue, and either agrees to the proposed resolution (and resolves it) or starts a dialog on the issue. New issues may be created for new code, and the review is submitted.
  8. After all reviewers have finished their reviews, if changes are requested or there are further comments, go back to step 3. Otherwise, merge.

I was also expecting that Lance's changes would be merged with this branch and that all existing changes on this branch had been pushed.

Is there a technical issue here? Why do I not see the changes that were commented to have been done?

hepcat72 (Collaborator) left a comment

There is 1 new non-blocking issue, 1 new blocking (but trivial to resolve) issue, 1 old non-blocking issue whose intended resolution I do not know, and 1 old blocking issue. I can't link the new issues in this comment (so I will qualify them), but the old issues are:

However, the "resolved" statuses of all my issues are current. Whether an issue is or is not "resolved", that status is accurate according to my latest round of review.

jcmatese added 3 commits May 5, 2021 16:34
removed TracerLabeledClass.tracer_labeled_element_regex_pattern
added AccuCorDataLoader.corrected_file_tracer_labeled_column_regex
changed regex to deal with >1 character symbols
Peak group name lookups are now case-sensitive
Edited all example files to update 'Glucose' to 'glucose'
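The regex change for multi-character element symbols noted in the commits above might look something like this; the column name format and pattern are hypothetical here (the real pattern lives in AccuCorDataLoader and may differ):

```python
import re

# Hypothetical sketch: capture a one- or two-character element symbol
# from a labeled-column name, rather than assuming a single character.
labeled_column = re.compile(r"^([A-Z][a-z]?)_Label$")

assert labeled_column.match("C_Label").group(1) == "C"
assert labeled_column.match("Se_Label").group(1) == "Se"  # two-character symbol
assert labeled_column.match("intensity") is None  # non-label columns are skipped
```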
jcmatese (Collaborator, Author) commented May 5, 2021

@hepcat72 I changed the regex, but will not address the other two in this go around.

@lparsons believes the formula validation might be best served during model save() somehow (perhaps using chempy or cross-check with compounds). We might also tweak the incoming data with pandas read_excel dtype or converters arguments, which might set default behavior for null cells/strings. The jury is still out on the best solution, and Lance proposes #61
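A minimal sketch of the formula cross-check idea, with a hypothetical helper name; the eventual implementation (in model save() or the loader) may look quite different:

```python
# Hypothetical helper: the formula read from the AccuCor file must match
# the formula stored for the linked compound in the compound table.
def check_peak_group_formula(file_formula, compound_formula):
    if file_formula != compound_formula:
        raise ValueError(
            f"formula mismatch: file has {file_formula!r}, "
            f"compound table has {compound_formula!r}"
        )

check_peak_group_formula("C6H12O6", "C6H12O6")  # matching formulas pass silently
try:
    check_peak_group_formula("C6H12O6", "C6H8O7")  # mismatch raises
    raised = False
except ValueError:
    raised = True
assert raised
```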

related: I did remove the peak_group.name.lower() accommodation in commit 4314796 and updated all the example files to the best of my ability.

Regarding reporting all issues in one pass instead of two: I suggest we get this into usage and see how often it is a problem. I also suspect this will not be the last time this code is modified, and perhaps an entirely different interface might be put on it (multiple tsv files, web-wrapped, etc.)

hepcat72 (Collaborator) commented May 5, 2021

Well, re-request my review when you're ready for me to take another look to re-evaluate. Those two outstanding issues (if I know which ones you're referring to) should be OK to pass on, given that Lance created an issue for the blocking one.

hepcat72 (Collaborator) commented May 5, 2021

Regarding reporting all issues in one pass instead of two: I suggest we get this into usage and see how often it is a problem. I also suspect this will not be the last time this code is modified, and perhaps an entirely different interface might be put on it (multiple tsv files, web-wrapped, etc.)

Could you explain what you're referring to here @jcmatese?

jcmatese (Collaborator, Author) commented May 5, 2021

Regarding reporting all issues in one pass instead of two: I suggest we get this into usage and see how often it is a problem. I also suspect this will not be the last time this code is modified, and perhaps an entirely different interface might be put on it (multiple tsv files, web-wrapped, etc.)

Could you explain what you're referring to here @jcmatese?

#55 (comment)

jcmatese requested a review from hepcat72 May 5, 2021 22:59
hepcat72 (Collaborator) left a comment

All issues resolved. 🎉

jcmatese merged commit 3a16e43 into main May 7, 2021
lparsons deleted the accucor-data-load branch June 17, 2021 14:27
Successfully merging this pull request may close these issues.

MSRUN Data loading