Add a "delete_missing" option to CKAN harvester #542
Comments
I wrote the proposed code in PR #548. Regarding my question about step 1.2, I chose to fetch from the harvest_object table, to keep the same logic from [...]

I also changed one line in the base harvester to force a package update whenever the package already exists but is in the deleted state. This is necessary to address a situation in which the remote CKAN instance has technical problems (with Solr) that cause [...] It seems a very unlikely situation, but it has already happened in the Brazilian government data portal.

This has the drawback that it would also take out of the trash a dataset that was harvested and then manually deleted. Still, I think this is the right behaviour, since if we purge a harvested dataset, it will be reimported in the next harvest run.
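To illustrate the forced update described above, here is a minimal sketch of such a decision written as a standalone function. It is not the actual code from the base harvester or from PR #548: the function name `needs_update`, the `force_all` parameter, and the dictionary keys are assumptions chosen for illustration only.

```python
def needs_update(remote_pkg, local_pkg, force_all=False):
    """Decide whether an existing local package should be updated.

    Hypothetical helper: both arguments are plain dicts standing in for
    package dictionaries; this is not the ckanext-harvest implementation.
    """
    if force_all:
        return True
    # Proposed behaviour: a locally deleted package is always updated,
    # which brings it back if the remote portal lists it again.
    if local_pkg.get("state") == "deleted":
        return True
    remote_modified = remote_pkg.get("metadata_modified")
    if remote_modified is None:
        # Without a modification timestamp we cannot compare, so update.
        return True
    # ISO 8601 timestamps in the same format compare correctly as strings.
    return remote_modified > local_pkg.get("metadata_modified", "")
```

With this rule, a dataset that was manually deleted locally is restored on the next run as long as the remote portal still publishes it, which is the trade-off mentioned above.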
In the Brazilian government we have a very decentralized structure in which several entities have their own CKAN instances. We collect all data from these entities through the harvest extension.
We run into quite a lot of trouble when a dataset is deleted in one of those harvested CKAN portals, because the CKAN harvester does not delete it in our CKAN, so our portal keeps showing many datasets with broken links or out-of-date information.
We propose adding an option to the CKAN harvester called `delete_missing` (boolean type), which will check for datasets that no longer exist in the harvested CKAN portal and delete them.

A nearly identical request was reported in issue #396 about 2 years ago. The author of that issue said he wrote some custom code to solve it but never shared the code, so I am opening this new issue with the aim of submitting a pull request.
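As a rough sketch of how the proposed option could be read from a harvest source configuration, the snippet below assumes the configuration is a JSON string and that `delete_missing` defaults to false when absent. The function name `read_delete_missing` and the default are assumptions for illustration, not part of the extension.

```python
import json


def read_delete_missing(config_str):
    """Return the delete_missing flag from a JSON config string.

    Hypothetical helper: treats an empty config or a missing key as False
    and rejects non-boolean values.
    """
    if not config_str:
        return False
    config = json.loads(config_str)
    value = config.get("delete_missing", False)
    if not isinstance(value, bool):
        raise ValueError("delete_missing must be boolean")
    return value


# Example: a source configuration that enables the option.
enabled = read_delete_missing('{"delete_missing": true}')
```

Rejecting values such as the string "true" early keeps a misconfigured source from silently deleting, or silently keeping, datasets.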
My idea is to copy the same logic from the DCAT JSON harvester from ckanext-dcat:

`gather_stage` function:
1.2. List all dataset UIDs that were imported through the current harvest source (by querying the harvest_object table).
1.3. List all remote CKAN datasets, then check for local UIDs that are missing in the remote CKAN list.
1.4. Create harvest objects with delete state for all of those missing datasets.

`import_stage` function:
2.1. Effectively delete (but not purge) all those missing datasets.
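The gather-stage steps above amount to a set difference between the identifiers known locally and those listed by the remote portal. The following is a minimal sketch using plain Python data rather than the real harvest_object table or the CKAN API; `HarvestObjectStub`, `find_missing_guids`, and `build_delete_objects` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HarvestObjectStub:
    """Stand-in for a harvest object; not the ckanext-harvest model."""
    guid: str
    action: str  # "delete" marks the dataset for removal at import time


def find_missing_guids(local_guids, remote_guids):
    """Return local identifiers that no longer appear in the remote list."""
    return sorted(set(local_guids) - set(remote_guids))


def build_delete_objects(local_guids, remote_guids):
    """Create one delete-marked stub per dataset missing from the remote."""
    return [
        HarvestObjectStub(guid=guid, action="delete")
        for guid in find_missing_guids(local_guids, remote_guids)
    ]


# Example: "a" and "c" were harvested earlier but the remote only lists "b".
to_delete = build_delete_objects(["a", "b", "c"], ["b"])
```

In the import stage, each delete-marked object would then trigger a soft delete of the corresponding local dataset rather than a purge.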
About step 1.2, I don't know whether it would be better to look into the harvest_object table or to look for datasets whose extra field `harvest_source_id` matches the harvest source of the job. It seems that the extension normally uses the harvest_object table, but that won't work if we use the clear_history command on the source.

I would appreciate any feedback on this implementation idea, since this is my first contribution to the project.