Some FIPS codes are not 5-digit strings #21

Open
NickCrews opened this issue Apr 25, 2024 · 1 comment
Comments

@NickCrews

FIPS codes are supposed to be 5-digit strings, with leading 0s if relevant. A few errors I've found:

  • MA, ME, NH, RI, CT, VT, WI, MI all have 10-digit jurisdiction FIPS, e.g. one value is "0900305910". Hartford County, CT has a FIPS code of "09003", so that looks like the prefix, but I don't know where the suffix comes from. There are a lot of instances of this, so perhaps it isn't an error, in which case sorry for the noise!
  • CA, AL, and AR all contain 4-digit county_fips and jurisdiction_fips. They were probably read in without a specific dtype, and e.g. pandas interpreted them as ints
  • RI contains county_fips and jurisdiction_fips of "NA" and "NAN"
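
To illustrate the suspected cause of the 4-digit values, here is a minimal sketch of how pandas type inference drops leading zeros. The inline CSV and its rows are made up for illustration and are not taken from the exported files; only the column name `county_fips` comes from the data described above.

```python
import io

import pandas as pd

# Hypothetical sample rows, not taken from the real export.
CSV_TEXT = "state,county_fips\nCA,06001\nAL,01001\nWI,55025\n"

# Default read: pandas infers an integer column, so "06001" becomes 6001.
inferred = pd.read_csv(io.StringIO(CSV_TEXT))

# Explicit string dtype: the leading zeros survive.
as_text = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"county_fips": str})

inferred_values = inferred["county_fips"].tolist()
text_values = as_text["county_fips"].tolist()

print(inferred_values)  # [6001, 1001, 55025]
print(text_values)      # ['06001', '01001', '55025']
```

If this is what happened, writing the inferred column back out to CSV would produce the 4-digit codes seen for states whose FIPS prefix starts with 0.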

I am still very happy to write some testing/QA scripts for your exported .csvs that might catch some of these common errors, please let me know if that would be useful.
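
As a rough idea of what such a check could look like, here is a small sketch. The function name `classify_fips`, the list of missing-value tokens, and the choice of allowed lengths are all my assumptions, not anything from the project's code.

```python
import re

# Assumed spellings of "missing" that might leak into a string column.
MISSING_TOKENS = {"", "NA", "NAN", "NONE", "NULL"}


def classify_fips(value, allowed_lengths=(5,)):
    """Return "ok", "missing", "non_digit", or "bad_length" for one value."""
    text = "" if value is None else str(value).strip()
    if text.upper() in MISSING_TOKENS:
        return "missing"
    if not re.fullmatch(r"[0-9]+", text):
        return "non_digit"
    if len(text) not in allowed_lengths:
        return "bad_length"
    return "ok"


# County codes: exactly 5 digits.
print(classify_fips("09003"))  # ok
print(classify_fips("6001"))   # bad_length
print(classify_fips("NAN"))    # missing

# Jurisdiction codes: assuming 5 or 10 digits are both acceptable.
print(classify_fips("0900305910", allowed_lengths=(5, 10)))  # ok
```

Applied over the county_fips and jurisdiction_fips columns, the non-"ok" labels could be tallied per state to spot problems like the ones listed above.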

Thank you!

@sbaltzmit
Contributor

You're right about the 4-digit FIPS codes, thanks for that! And clearly RI got caught with a data type error. I'll add those latter 4 states to the list to check. The 10-digit FIPS codes are the official municipality- or township-level FIPS codes, which are the appropriate designator of jurisdiction in states that do not administer elections at the county level. I'll look into the Hartford County situation: if that's a county FIPS code it's probably fine, but if it's a jurisdiction FIPS code it should probably have a suffix.

On the topic of QA, related to conversations we've had about this in two other Issues: we have scripts that automatically apply padding to coerce the FIPS code into a 5-digit zero-padded string, then check for every issue you've raised and raise a flag if there is a problem. But QA on a dataset like this is a very involved process with a lot of flags. As I know you well understand, since you've mentioned it in the past, consistently catching every subtle data type issue in a nearly 15 million row dataset with regular updates is a matter of more than just having a QA script. I really appreciate when you raise data issues that we can address. But this is fair warning that I won't engage with further Issues that continue to imply that we don't do QA; we spend a very long time doing very extensive QA.
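
For readers following along, the padding step described above might look roughly like the sketch below. This is not the project's actual script; `pad_fips` is a hypothetical helper, and the handling of floats and non-numeric values is an assumption.

```python
def pad_fips(value, width=5):
    """Left-pad an all-digit code with zeros; leave anything else unchanged."""
    # Whole-number floats (e.g. 6001.0 from a column that contained NaN)
    # are converted to int first so they do not render as "6001.0".
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    text = str(value).strip()
    # Non-numeric values such as "NA" are returned as-is so that a later
    # check can flag them instead of silently padding them.
    if not (text.isascii() and text.isdigit()):
        return text
    # zfill never truncates, so 10-digit jurisdiction codes pass through.
    return text.zfill(width)


print(pad_fips(6001))          # 06001
print(pad_fips("1001"))        # 01001
print(pad_fips("0900305910"))  # 0900305910
print(pad_fips("NA"))          # NA
```

Padding alone cannot tell a truncated code from a genuinely wrong one, which is why it is paired with a separate validation pass.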
