Skip to content

Add geomap utilities, hosp integration #137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Aug 18, 2020
Merged

Add geomap utilities, hosp integration #137

merged 26 commits into from
Aug 18, 2020

Conversation

jsharpna
Copy link
Contributor

@jsharpna jsharpna commented Jul 8, 2020

Summary of changes

  1. Added geomap utility in _delphi_utils_python/delphi_utils/geomap.py (see tests/test_geomap.py)
  2. Added geo data in _delphi_utils_python/delphi_utils/data
  3. Added data processing scripts going from raw data -> cross tables in _delphi_utils_python/data_proc
  4. Integrated geomap util into emr_hosp and tested by running and comparing receiving

@krivard krivard requested review from amartyabasu and eujing July 9, 2020 18:37
Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost all of my beefs are with the documentation. Aside from a couple of optional nits (& a spurious commented-out line that should definitely be fixed) the only code thing to double-check is whether megacounties have a side effect of generating averages -- I didn't think they should, but the docs mention a window size for averages and that makes me nervous.


Authors: Jingjing Tang, James Sharpnack

The data_proc/geomap directory contains original source data, processing scripts, and notes for processing from original source to crosswalk tables in the data directory for the delphi_utils package.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The data_proc/geomap directory contains original source data, processing scripts, and notes for processing from original source to crosswalk tables in the data directory for the delphi_utils package.
The `data_proc/geomap` directory contains original source data, processing scripts, and notes for generating the crosswalk tables in the data directory of the delphi_utils package.

Comment on lines +9 to +11
Requires the following source files below.

Run the following to write the cross files in the package data dir...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Requires the following source files below.
Run the following to write the cross files in the package data dir...
First, acquire the source files listed below.
Then run the following to write the crosswalk files in the package data dir...


Some of the source files were constructed by hand, most notably 02_20_uszips.csv.

The 02_20_uszips.csv file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from here https://simplemaps.com/data/us-zips. This file matches best to the most recent (2020) situation in terms of the population. But there still exist some matching problems. I manually checked and corrected those lines (~20) with zip-codes.com (https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the file in 2017 version downloaded from https://atlasdata.dartmouth.edu/static/supp_research_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much of this construction history do we need to retain in this repository?

Could some of this be replaced with shorter explanatory text (e.g. "We tried incorporating data from the cbsatocountycrosswalk.csv file at https://data.nber.org/data/cbsa-fips-county-crosswalk.html but it was worse" instead of the 4/15 and 4/19 entries)? The population files referenced in the 6/15 log are not included here; can we remove that log entry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dmitry will be cleaning this README up with his PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

Introduced the March 2020 MSA file, source is [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data).

07/08/2020:
We are reserving 00001-00099 for states codes of the form 100XX where XX is the fips code for the state. In the case that the CBSA codes change then it should be verified that these are not used. The current smallest CBSA is 10100.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo somewhere here; either we should reserve _1_0001-_1_0099, or the codes should be of the form _0_00XX. I vote for the former, since the existing API validation code enforces MSAs to be in the range 10000-99999.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was inconsistent here, and the code actually uses 000XX. Will more everything to 100XX convention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

Comment on lines +7 to +9
import pandas as pd
from os import path
import json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: alphabetize these for (our perennially losing battle for) style consistency with the rest of Delphi

+ county -> state : unweighted
+ county -> msa : unweighted
+ county -> megacounty
- zip -> * not county
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(what does "* not county" mean?)

date_col='date',
mega_col='megafips',
count_cols=None):
"""convert and aggregate from zip to fips (county)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""convert and aggregate from zip to fips (county)
"""convert and aggregate from fips to fips, collapsing small-count fips into megacounties

Args:
data: pd.DataFrame input data
thr_count: numeric, if the sum of counts exceed this, then fips is converted to mega
thr_win_len: int, the number of Days to use as an average
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait does this method also convert counts to average counts? It shouldn't do that, should it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is old documentation that was carried over, I think if means that the average is what determines censoring

@@ -64,7 +64,7 @@ class Constants:
# number of counties in usa, including megacounties
NUM_COUNTIES = 3141 + 52
NUM_HRRS = 308
NUM_MSAS = 392
NUM_MSAS = 392 + 52 # MSA + States
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NUM_MSAS = 392 + 52 # MSA + States
NUM_MSAS = 392 + 52 # MSA + one mega-MSA per state

data_frame = gmpr.county_to_state(data,state_id_col=geo)
elif geo == "msa":
data_frame = gmpr.county_to_msa(data,msa_col=geo)
elif geo == "hrr":
data_frame = data # data is already adjusted in aggregation step above
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(which aggregation step is that?)

@krivard
Copy link
Contributor

krivard commented Aug 4, 2020

Currently tracking differences to results generated by the old codebase. Mostly this seems to be due to legit changes in how we're mapping FIPS to metro areas. We'll want to make a note of these changes in the API changelog when we merge this fix.

@krivard
Copy link
Contributor

krivard commented Aug 5, 2020

We should investigate the differences between deploy-jhu and main and decide whether to merge to main.

@dshemetov dshemetov mentioned this pull request Aug 12, 2020
15 tasks
@krivard krivard merged commit 73308d9 into main Aug 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants