Figure out primary keys and update/alter #4

Open
simonw opened this issue Feb 16, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@simonw
Owner

simonw commented Feb 16, 2020

geojson-to-sqlite supports both --pk= and --alter options.

I'm not sure if --pk makes sense for Shapefiles - they appear to have a unique ID property that I could use, but I'm not certain it is guaranteed to be present. It's been present in the files I've looked at so far.

Adding tests that exercise --alter plus upserting data into existing tables would be worthwhile.

@simonw simonw added the enhancement New feature or request label Feb 16, 2020
@simonw
Owner Author

simonw commented Feb 17, 2020

I'm pretty confident that the "id" property Fiona loads from the shapefile (which starts at 0 and increments from there) can be considered unique. It appears to be the value that is used to attach shapes in the .shp file to property records in the .dbf DBase file.

I used the dbfread Python module to read the DBase file directly to see if it had a concept of an ID, and it doesn't appear to - suggesting that the index into the file is the only way to associate it with a shape:

In [1]: from dbfread import DBF

In [2]: db = DBF("shape/scc_parkland_boundaries.dbf")

In [3]: next(iter(db))
Out[3]:
OrderedDict([('PARK_NAME', 'Sanborn'),
             ('ADDRESS', '16055 Sanborn Rd.'),
             ('CITY_ZIP', 'Saratoga, CA 95070'),
             ('PRD_OPER', 'yes'),
             ('MNT_REG', 'Region 1'),
             ('RNGR_REG', 'Region 1'),
             ('SHAPE_Leng', 128308.882691),
             ('SHAPE_Area', 149564781.558),
             ('RespCostCn', '5809'),
             ('STATUS', 'open'),
             ('RuleID', 3),
             ('SHF_BEATS', 'P1'),
             ('PARK_ID', 'Sanborn'),
             ('ACRES', 3432.2485994),
             ('PARK_SUFFI', 'County Park')])
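If position in the file really is the only link between a shape and its attribute record, the id can be derived simply by counting. The sketch below illustrates that idea with plain Python lists standing in for the two files. `attach_ids` is a hypothetical helper, and the geometry strings and every park name other than Sanborn are made-up placeholders. It does not read real .shp or .dbf data.

```python
def attach_ids(shapes, records):
    """Pair each shape with the attribute record at the same position.

    The position (starting at 0) becomes the id, mirroring the
    assumption that the index into the file is the only join key.
    """
    if len(shapes) != len(records):
        raise ValueError("shape and record counts differ")
    return [
        {"id": index, "geometry": shape, **record}
        for index, (shape, record) in enumerate(zip(shapes, records))
    ]


# Placeholder data: only "Sanborn" comes from the output above.
shapes = ["<geometry 0>", "<geometry 1>", "<geometry 2>"]
records = [
    {"PARK_NAME": "Sanborn"},
    {"PARK_NAME": "Placeholder Park A"},
    {"PARK_NAME": "Placeholder Park B"},
]

features = attach_ids(shapes, records)
```

Raising on a length mismatch is a design choice for the sketch: if the two files disagree on the number of entries, a positional id would silently attach attributes to the wrong shape.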

Finally, the ESRI shapefile specification at https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf says the following:

[Image: screenshot of page 29 of 34 of the shapefile specification PDF]

@simonw
Owner Author

simonw commented Feb 17, 2020

So... when importing a single shapefile into a single table it's safe for me to consider the id unique for the purpose of defining a primary key.

But how about loading multiple shapefiles into the same table? The tool supports this at the moment:

$ shapefile-to-sqlite features.db *.shp --table=features

But this won't work because of duplicate primary keys: each shapefile's id starts at 0, so ids from different files collide.

This could be solved with a mode where the primary key becomes a string with a unique prefix for each incoming shapefile. Maybe it does that automatically if you attempt to load multiple shapefiles into the same table? Or maybe there's an option for that - which would be a bit more obvious, but people would have to remember to use it.

$ shapefile-to-sqlite features.db *.shp --table=features --prefix-pk
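A rough sketch of what the prefixing could look like, using only the standard library. `prefix_ids` is a hypothetical helper, the file names and rows are made up, and using the file's stem as the prefix is an assumption of this sketch rather than a decided design.

```python
import sqlite3
from pathlib import Path


def prefix_ids(path, rows):
    """Return copies of rows whose id is prefixed with the file's stem."""
    prefix = Path(path).stem
    return [{**row, "id": f"{prefix}-{row['id']}"} for row in rows]


# Two made-up shapefiles whose per-file ids both start at 0.
files = {
    "parks.shp": [{"id": 0, "name": "a"}, {"id": 1, "name": "b"}],
    "trails.shp": [{"id": 0, "name": "c"}, {"id": 1, "name": "d"}],
}

conn = sqlite3.connect(":memory:")
# The primary key has to be TEXT once it carries a prefix.
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, name TEXT)")
for path, rows in files.items():
    conn.executemany(
        "INSERT INTO features (id, name) VALUES (:id, :name)",
        prefix_ids(path, rows),
    )

ids = [row[0] for row in conn.execute("SELECT id FROM features ORDER BY id")]
```

One caveat with the stem: two files with the same name in different directories would still collide, so a prefix based on the full path would be safer at the cost of longer keys.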

simonw added a commit that referenced this issue Feb 17, 2020
@simonw
Owner Author

simonw commented Feb 17, 2020

Now using .insert_all(..., replace=True) as of 22a5611
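For anyone unfamiliar with replace semantics, here is a small illustration using the standard library's sqlite3 module instead of sqlite-utils. It assumes that replace=True ends up issuing INSERT OR REPLACE statements, which is my understanding but is not verified here. The table, the `replace_all` helper and the rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, name TEXT)")


def replace_all(conn, rows):
    """Insert rows, overwriting any existing row with the same primary key."""
    conn.executemany(
        "INSERT OR REPLACE INTO features (id, name) VALUES (:id, :name)",
        rows,
    )


# First import creates ids 0 and 1.
replace_all(conn, [{"id": 0, "name": "first import"}, {"id": 1, "name": "first import"}])
# Second import reuses id 0, so that row is overwritten rather than rejected.
replace_all(conn, [{"id": 0, "name": "second import"}])

rows = conn.execute("SELECT id, name FROM features ORDER BY id").fetchall()
```

This makes re-importing the same shapefile safe, but it also means that a second shapefile loaded into the same table would quietly overwrite rows with matching ids instead of raising an error.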

@simonw
Owner Author

simonw commented Jan 8, 2021

I just found a shapefile where the OBJECTID would make a better primary key:

OrderedDict([('OBJECTID', 1),
             ('CSAFP', '122'),
             ('CBSAFP', '12020'),
             ('GEOID', '12020'),
             ('NAME', 'Athens-Clarke County, GA'),
             ('NAMELSAD', 'Athens-Clarke County, GA Metro Area'),
             ('LSAD', 'M1'),
             ('MEMI', '1'),
             ('MTFCC', 'G3110'),
             ('ALAND', 2654601832),
             ('AWATER', 26140309),
             ('INTPTLAT', '+33.9439840'),
             ('INTPTLON', '-083.2138965'),
             ('Shape_Leng', 363485.4157899709),
             ('Shape_Area', 3905629939.0061007)])

https://data-usdot.opendata.arcgis.com/datasets/b0d0e777e2ad4b53803dbc0527c73d88_0
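One way this could be handled is to prefer a column such as OBJECTID when it is present and unique, and fall back to the positional id otherwise. The sketch below is only an illustration: `choose_pk`, its `candidates` default and the second record are all hypothetical and not part of the tool.

```python
def choose_pk(records, candidates=("OBJECTID",), fallback="id"):
    """Return the first candidate column that is present and unique in every record."""
    for column in candidates:
        values = [record.get(column) for record in records]
        # Require at least one record, no missing values, and no duplicates.
        if values and None not in values and len(set(values)) == len(values):
            return column
    return fallback


# The first record is abbreviated from the output above; the second is made up.
records = [
    {"OBJECTID": 1, "NAME": "Athens-Clarke County, GA"},
    {"OBJECTID": 2, "NAME": "Placeholder second row"},
]

pk = choose_pk(records)
```

Checking uniqueness across all records means reading the whole file before creating the table, which may not be acceptable for large shapefiles. An explicit --pk option would avoid that scan.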

@saulshanabrook

saulshanabrook commented May 4, 2022

Hey @simonw, I am taking a look at this issue. I have resolved it locally using the --prefix-pk idea you mentioned, optionally prefixing the PK with the file path for each file you pass in.

Should I submit a PR for that?

@simonw
Owner Author

simonw commented May 4, 2022

Yes please! That diff looks good - just needs a test and a sentence of documentation.
