-
Notifications
You must be signed in to change notification settings - Fork 16
Add geomap utilities, hosp integration #137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost all of my beefs are with the documentation. Aside from a couple of optional nits (& a spurious commented-out line that should definitely be fixed) the only code thing to double-check is whether megacounties have a side effect of generating averages -- I didn't think they should, but the docs mention a window size for averages and that makes me nervous.
|
||
Authors: Jingjing Tang, James Sharpnack | ||
|
||
The data_proc/geomap directory contains original source data, processing scripts, and notes for processing from original source to crosswalk tables in the data directory for the delphi_utils package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data_proc/geomap directory contains original source data, processing scripts, and notes for processing from original source to crosswalk tables in the data directory for the delphi_utils package. | |
The `data_proc/geomap` directory contains original source data, processing scripts, and notes for generating the crosswalk tables in the data directory of the delphi_utils package. |
Requires the following source files below. | ||
|
||
Run the following to write the cross files in the package data dir... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requires the following source files below. | |
Run the following to write the cross files in the package data dir... | |
First, acquire the source files listed below. | |
Then run the following to write the crosswalk files in the package data dir... |
|
||
Some of the source files were constructed by hand, most notably 02_20_uszips.csv. | ||
|
||
The 02_20_uszips.csv file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from here https://simplemaps.com/data/us-zips. This file matches best to the most recent (2020) situation in terms of the population. But there still exist some matching problems. I manually checked and corrected those lines (~20) with zip-codes.com (https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the file in 2017 version downloaded from https://atlasdata.dartmouth.edu/static/supp_research_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much of this construction history do we need to retain in this repository?
Could some of this be replaced with shorter explanatory text (e.g. "We tried incorporating data from the cbsatocountycrosswalk.csv file at https://data.nber.org/data/cbsa-fips-county-crosswalk.html but it was worse" instead of the 4/15 and 4/19 entries)? The population files referenced in the 6/15 log are not included here; can we remove that log entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dmitry will be cleaning this README up with his PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted.
Introduced the March 2020 MSA file, source is [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data). | ||
|
||
07/08/2020: | ||
We are reserving 00001-00099 for states codes of the form 100XX where XX is the fips code for the state. In the case that the CBSA codes change then it should be verified that these are not used. The current smallest CBSA is 10100. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a typo somewhere here; either we should reserve _1_0001-_1_0099, or the codes should be of the form _0_00XX. I vote for the former, since the existing API validation code enforces MSAs to be in the range 10000-99999.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was inconsistent here, and the code actually uses 000XX. Will more everything to 100XX convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted.
import pandas as pd | ||
from os import path | ||
import json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: alphabetize these for (our perennially losing battle for) style consistency with the rest of Delphi
+ county -> state : unweighted | ||
+ county -> msa : unweighted | ||
+ county -> megacounty | ||
- zip -> * not county |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(what does "* not county" mean?)
date_col='date', | ||
mega_col='megafips', | ||
count_cols=None): | ||
"""convert and aggregate from zip to fips (county) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""convert and aggregate from zip to fips (county) | |
"""convert and aggregate from fips to fips, collapsing small-count fips into megacounties |
Args: | ||
data: pd.DataFrame input data | ||
thr_count: numeric, if the sum of counts exceed this, then fips is converted to mega | ||
thr_win_len: int, the number of Days to use as an average |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait does this method also convert counts to average counts? It shouldn't do that, should it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is old documentation that was carried over, I think if means that the average is what determines censoring
@@ -64,7 +64,7 @@ class Constants: | |||
# number of counties in usa, including megacounties | |||
NUM_COUNTIES = 3141 + 52 | |||
NUM_HRRS = 308 | |||
NUM_MSAS = 392 | |||
NUM_MSAS = 392 + 52 # MSA + States |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NUM_MSAS = 392 + 52 # MSA + States | |
NUM_MSAS = 392 + 52 # MSA + one mega-MSA per state |
data_frame = gmpr.county_to_state(data,state_id_col=geo) | ||
elif geo == "msa": | ||
data_frame = gmpr.county_to_msa(data,msa_col=geo) | ||
elif geo == "hrr": | ||
data_frame = data # data is already adjusted in aggregation step above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(which aggregation step is that?)
Currently tracking differences to results generated by the old codebase. Mostly this seems to be due to legit changes in how we're mapping FIPS to metro areas. We'll want to make a note of these changes in the API changelog when we merge this fix. |
We should investigate the differences between |
Summary of changes