-
Notifications
You must be signed in to change notification settings - Fork 1
Load AccucCor data into MsRun, PeakGroup, PeakData #55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
add researcher to MsRun model tweak labeled_element_count validator in PeakData model add command to load_accucor_msruns add AccuCorDataLoader to utils add openpyxl to requirement for Excel reading via pandas
I suspect the the Line 154 in 2f18245
where I would prefer it to be "date the run was performed", not "date the record was entered", but that raised other issues of provenance and database auditing that we have not discussed in detail. I give up on the linting issues, because Django migrations and Flake8 will always run afoul of each other, when it comes to line length. |
@jcmatese Looks like there is a missing migration (forgot to check it in perhaps?)
This is exactly why Rob (and I) had initially suggest we not lint the generated code. If you now agree that linting the migrations is more trouble that it's worth, I'd happily agree. Feel free to submit a PR or comment your approval here and I'll do it. |
Yes, that is fine by me. I am tired of fighting with it. Skip migrations green light from me. |
The problem is that pylint thinks this it too long
But if I edit it to dodge that bullet tracebase/DataRepo/migrations/0007_peakdata_validator.py Lines 19 to 20 in 638cc46
The make migrations complains "that is not what I would have written", presumably. https://github.com/Princeton-LSI-ResearchComputing/tracebase/runs/2460869637 It is probably all whitespace differences, and a waste of our collective time. |
You removed a space when you made that change. |
Running the load script for the example data via GitHub actions helps to ensure that piece of code is still functioning as intended
Having the time as a component of the field made the implementation of the unique constraint ineffective. Also, setting the date to now on creation is more suited a timestamp field than a field that is tracking when the Mass Spec was run.
Some Excel files can be formatted such that pandas read_excel puts rows in the data frame that are empty (all nan).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to resume my review, but since there were pushes, I don't know what would happen if I refresh this page, so I'm submitted a partial review to not lose my work.
So 4 of my issues became "outdated" due to the pushes. The code the issues are about still exists and thus the issues are still relevant. They are just disconnected. I'm not sure what the best way is to deal with those disconnected issues. Probably we should establish a policy to not push until all reviewers have responded, but on the grand scale, it seems an unnecessary delay. Ideally, there would be an easy way to "fix" these "outdated" issues by re-attaching them to the new line numbers. Unfortunately, there's no mechanism to do that. (Is that correct?) The only viable workaround is somewhat labor-intensive: Copy the issue and comments, locate the new location of the code, create a new issue, and paste in the issue content. Do you guys have an opinion? |
I seem to see the comments and the code in the "Conversation" tab. I don't think it will be an issue to address them as is, but I'll let @jcmatese have a go at things first. Once this round of comments is addressed we can reassess, but I think this is pretty close. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The transaction.atomic seems working. I checked out this pull request yesterday morning, and was able to load data from obob_maven_6eaas_inf.xlsx (after fixing row for "citrate" in compound table), but got error for loading "obob_maven_c160_inf.xlsx" due to missing sample "bkbk1123".
One concern I have is we only store accucor file name(s) in the log for msrun/peakdata loading. Querying database after loading, we wouldn't know its source file(s) for a specific peakgroup. I think it's important to be able to trace back to the source file, which would be helpful for data verification and search/grouping. My proposal is adding dataset and archived_file tables, or at least dataset for now (peakgroup has a foreign key to dataset, dataset has M:M relationship to archive_file). Not sure if it makes sense to you. I raise the question without approving this pull request, as it could affect Peakgroup model structure and loading script.
p.s. could add code to check formula in addition to verify/link peakgroup to compound(s). make sure the formula for a peakgroup matches that in compound table.
Let me know when this is ready for a second-round review. |
|
added TracerLabeledClass.tracer_labeled_elements_list clarified error messages resolved some model/migrations conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were many issues marked as resolved for which no resolution or explanation was offered. And where a comment was added indicating a change, there is no change evident in the code. I do not understand why this is the case, but I can only assume that the changes have not been pushed.
Let me re-affirm the order of operations I thought we were all following:
- Submit a PR & request an initial review
- Reviewers add issues
- Author accepts/rejects each issue
- Author implements changes to the accepted issues
- Author pushes that work to the PR branch
- Author requests subsequent review
- Reviewer checks all issues, evaluates the changes made or the reason for rejecting each issue and either agrees to the proposed resolution (and resolves) or starts a dialog on the issue. New issues may be created for new code & the review is submitted.
- After all reviewers have finished their reviews, if changes are requested or there or further comments are requested, go back to step 3. Otherwise, merge.
I was also expecting that Lance's changes would be merged with this branch and that all existing changes on this branch had been pushed.
Is there a technical issue here? Why do I not see the changes that were commented to have been done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is 1 new non-blocking issue, 1 new blocking (but trivial to resolve) issue, 1 old non-blocking issue that I do not know what the intended resolution to was, and 1 old blocking issue. I can't link the new issues in this comment (so I will qualify them), but the old issues are:
- Load AccucCor data into MsRun, PeakGroup, PeakData #55 (comment)
- Load AccucCor data into MsRun, PeakGroup, PeakData #55 (comment)
- NB. tracer_labeled_element_regex_pattern method name
- B. Document the assumption about element symbols being a single character (which could possibly not be true in the future).
However the "resolved" statuses of all my issues are current. If it is or is not "resolved", that status is accurate according to my latest round review.
removed TracerLabeledClass.tracer_labeled_element_regex_pattern added AccuCorDataLoader.corrected_file_tracer_labeled_column_regex changed regex to deal with >1 character symbols
Peak group name lookup are no case-sensitive Edited all example files to update 'Glucose' to 'glucose'
@hepcat72 I changed the regex, but will not address the other two in this go around. @lparsons believes the formula validation might be best served during model save() somehow (perhaps using chempy or cross-check with compounds). We might also tweak the incoming data with pandas related: I did remove the regarding reporting all issue in 1-pass, instead of two, I suggest we get this in usage and see how often it is a problem. I also suspect this will not be the first time this code is modified, and perhaps an entirely different interface might be put on it (multiple tsv files, web-wrapped, etc.) |
Well, re-request my review when you're ready for me to take another look to re-evaluate. Those 2 outstanding issues (if I know which ones you're referring to) should be ok to pass on given Lance created an issue for the blocking one. |
Could you explain what you're referring to here @jcmatese? |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All issues resolved. 🎉
Summary Change Description
add researcher to MsRun model
tweak labeled_element_count validator in PeakData model
add command to load_accucor_msruns
add AccuCorDataLoader to utils
add openpyxl to requirement for Excel reading via pandas
Affected Issue Numbers
Code Review Notes
A lot of the is new to me, so everything deserves some some close attention.
Attempt to use
transaction
and to accept load-all or load-nothing. Been tested some, but not exhaustive.Some hard-cording of data files expectations (column numbers), that probably is not future-or-past-proof (tuned to the example file).
I did not implement too many post load tests, but I did try and perform some basic validations prior to load.
EDIT: We believe the known issue documented below may have been fixed with @lparsons later model edit
Known issue: It appears that the method for inserting uniquely into MsRun could be tightened up, probably because of time formatting.
example duplicate load:
resulted from
python manage.py load_accucor_msruns --accucor_filename "DataRepo/example_data/obob_maven_6eaas_inf.xlsx" --protocol 1 --date "2021-04-23" --researcher "mneistat"
I must not be setting/formatting the date correct from args.
Checklist