Automation and re-reconciliation with OpenRefine #20

malajvan · 2023-09-29T16:35:51Z

malajvan
Sep 29, 2023
Maintainer

In the future, we will need to periodically rerun the same data pipeline so that each database has the latest version. Assuming the schema of the provided data dump remains unchanged, this is relatively straightforward: we just rerun the Python scripts. However, the reconciliation step interrupts the workflow and becomes a bottleneck. Since the data overlaps significantly between versions, it would also be great to find ways to reduce the time spent on re-reconciliation.

1. Reapplying Actions in OpenRefine

One useful way to streamline the reconciliation process is to reapply actions in OpenRefine. This lets you export the action history of your current project and apply it to the updated database version. Importantly, it also retains the reconciliation choices made earlier.

Steps:

After performing the initial reconciliation in the OpenRefine UI, go to the Undo/Redo section and click Extract.
Save the JSON file containing the action sequences and reconciliation choices.
When you have a new database version, create a new project with the new dataset. Go to the Undo/Redo section and choose Apply.
Paste the contents of the JSON history file and click Perform operations.
Done!
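
Before pasting a saved history into a new project, it can help to review what it contains. Here is a minimal sketch, assuming the extracted file is a JSON array of operation objects that each carry an `op` identifier and a human-readable `description`. The names `summarize_history` and `load_history` are illustrative helpers, not part of OpenRefine.

```python
import json


def summarize_history(operations):
    """Return one 'N. op: description' line per operation, in order."""
    lines = []
    for number, op in enumerate(operations, start=1):
        # "op" and "description" are assumed field names; fall back if absent.
        kind = op.get("op", "unknown")
        description = op.get("description", "(no description)")
        lines.append(f"{number}. {kind}: {description}")
    return lines


def load_history(path):
    """Load an extracted operation history saved as a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `print("\n".join(summarize_history(load_history("history.json"))))` would list the saved steps in the order they will be applied, where `history.json` is a placeholder for wherever the file was saved.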

I've done a few tests and noticed the following when the schemas of the old and new database versions differ:

If the new version is missing a column that an action is performed on, OpenRefine freezes and does not proceed past that action.
- For example: if the list of actions includes creating a new column from a column mode_name and reconciling that column, OpenRefine runs up to that action, then freezes and never finishes.
If the new version is missing a column that no action touches, or has an extra column, the actions perform as expected.
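
Since a missing column makes OpenRefine hang rather than report an error, one option is to check the history against the new dataset's header before applying it. The sketch below assumes that operations reference existing columns through keys such as `columnName`, `baseColumnName`, and `oldColumnName`, and name created columns through `newColumnName`. These key lists are assumptions based on typical extracted histories and may need extending for other operation types. `missing_columns` and `read_header` are hypothetical helper names.

```python
import csv

# Keys assumed to name a column that must already exist when the operation runs.
REQUIRED_KEYS = ("columnName", "baseColumnName", "oldColumnName")
# Keys assumed to name a column that the operation itself creates.
CREATED_KEYS = ("newColumnName",)


def missing_columns(operations, dataset_columns):
    """Return (index, op, column) for each operation that needs an absent column."""
    available = set(dataset_columns)
    problems = []
    for index, op in enumerate(operations):
        for key in REQUIRED_KEYS:
            name = op.get(key)
            if name is not None and name not in available:
                problems.append((index, op.get("op", "unknown"), name))
        # Columns created by this step are available to later steps.
        for key in CREATED_KEYS:
            name = op.get(key)
            if name is not None:
                available.add(name)
    return problems


def read_header(csv_path):
    """Read the column names from the first row of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))
```

An empty result suggests the history is safe to apply. A non-empty one points at the first action OpenRefine would likely get stuck on, such as the mode_name case above.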

2. Further Automation Possibilities

(This is not really necessary for now, more like food for thought.)

Now that we've simplified the process of re-reconciliation, the next question is whether we can fully automate this procedure, for example with a script. There's a tool, openrefine-batch, that takes as input a JSON file containing the action history and applies it to the new database version without requiring users to open the OpenRefine user interface.

I'll test how this script handles Wikidata reconciliation like ours and exporting configuration options.
Another thing to note is how to reconcile new data coming in. It appears this software may not allow the addition of new values during the reconciliation process. In other words, if there are new data points that were not reconciled in the JSON history, the tool may not reconcile them automatically.

In the future, we could write our own scripts to handle this problem, for example by having the script prompt for user input when it encounters unreconciled values, or by adding an option to open the OpenRefine UI to complete the reconciliation manually if necessary.
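
As a starting point for such a script, the sketch below lists values in the new dataset that have no recorded reconciliation choice in the history, so they can be routed to a prompt or to the OpenRefine UI. It assumes that manual choices are stored as `core/recon-judge-similar-cells` operations with `columnName` and `similarValue` fields. That structure is an assumption to verify against our own extracted JSON, and both function names are hypothetical.

```python
def judged_values(operations, column):
    """Collect cell values that already have a recorded judgment for a column."""
    judged = set()
    for op in operations:
        # The operation id and field names here are assumptions, not confirmed API.
        if op.get("op") != "core/recon-judge-similar-cells":
            continue
        if op.get("columnName") != column:
            continue
        value = op.get("similarValue")
        if value is not None:
            judged.add(value)
    return judged


def values_needing_review(rows, column, operations):
    """Return distinct non-empty values in `column` with no recorded judgment."""
    known = judged_values(operations, column)
    seen = set()
    pending = []
    for row in rows:
        value = (row.get(column) or "").strip()
        if value and value not in known and value not in seen:
            seen.add(value)
            pending.append(value)  # keep first-seen order for review
    return pending
```

Here `rows` can be any iterable of dictionaries, for instance the output of `csv.DictReader` over the new data dump. Values that OpenRefine matched automatically would not appear in the history, so this list is an upper bound on what needs manual attention.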
