In the future, we will need to periodically rerun the same data pipeline so that each database has the latest version. Assuming the schema of the provided data dump remains unchanged, most of this can be done simply by rerunning the Python scripts. However, the reconciliation step interrupts the workflow and becomes a bottleneck. Additionally, since the data overlaps significantly between versions, it would be great to find ways to reduce the time spent on re-reconciliation.
1. Reapplying Actions in OpenRefine
One useful approach to streamline the reconciliation process is by reapplying actions in OpenRefine. This method allows you to export the action history of your current project and apply it to the updated database version. Importantly, it also retains the reconciliation choices made earlier.
Steps:
After performing the initial reconciliation in the OpenRefine UI, go to the Undo/Redo section and click Extract.
Save the JSON file containing the action sequences and reconciliation choices.
When you have a new database version, create a new project with the new dataset. Go to the Undo/Redo section and choose Apply.
Paste the contents of the JSON history file and click Perform operations.
Done!
I've done a few tests and noticed the following when the schema differs between the old and new database versions:
If the new version is missing a column that an action is performed on, OpenRefine freezes and does not proceed past that action.
For example: if the list of actions includes creating a new column from the column mode_name and reconciling that column, OpenRefine runs the earlier actions, then freezes at that one and never finishes.
If the new version is missing a column that no action touches, or has an extra column, the actions perform as expected.
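Since a missing column makes OpenRefine freeze, it may be worth checking the new dump against the extracted history before applying it. Below is a minimal sketch of such a check. It assumes that the extracted JSON is a list of operation objects that name their input column under keys such as columnName, baseColumnName or oldColumnName, and any column they create under newColumnName. That key list is an assumption and may need extending for other operation types. The function names and the sample data are made up for illustration.

```python
import csv
import io


def columns_required_by(operations):
    """Collect the columns an operation history expects to find in the dataset."""
    required = set()
    created = set()  # columns produced by earlier operations in the same history
    for op in operations:
        # Assumed keys that point at an already existing column.
        for key in ("columnName", "baseColumnName", "oldColumnName"):
            name = op.get(key)
            if name and name not in created:
                required.add(name)
        # Assumed key for a column that the operation itself creates.
        new_name = op.get("newColumnName")
        if new_name:
            created.add(new_name)
    return required


def read_header(csv_text):
    """Return the column names from the first row of a CSV string."""
    return next(csv.reader(io.StringIO(csv_text)))


def missing_columns(operations, header):
    """Return, sorted, the required columns that the new dataset does not have."""
    return sorted(columns_required_by(operations) - set(header))


# Hypothetical history: derive a column from mode_name, then reconcile it.
history = [
    {"op": "core/column-addition", "baseColumnName": "mode_name",
     "newColumnName": "mode_name_recon"},
    {"op": "core/recon", "columnName": "mode_name_recon"},
]

# Hypothetical new dump that no longer has the mode_name column.
new_dump = "id,title\n1,Example\n"

print(missing_columns(history, read_header(new_dump)))
```

If the printed list is not empty, the history refers to columns that the new version lacks, so applying it would likely hit the freeze described above.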
2. Further Automation Possibilities
(This is not strictly necessary for now, more like food for thought.)
Now that we've simplified the process of re-reconciliation, the next question is whether we can fully automate this procedure, for example with a script. There's a tool, openrefine-batch, that takes as input a JSON file containing the action history and applies it to the new database version without requiring users to open the OpenRefine user interface.
I'll test how this script handles Wikidata reconciliation like ours, and its export configuration options.
Another thing to note is how to reconcile new incoming data. It appears that this tool may not allow the addition of new values during the reconciliation process. In other words, if there are new data points that were not reconciled in the JSON history, the tool may not reconcile them automatically.
We could, in the future, write our own scripts to handle this problem. The script could prompt for user input when it encounters unreconciled values, or offer an option to open the OpenRefine UI to manually complete the reconciliation process if necessary.
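As a rough illustration of what such a script might start from, here is a minimal sketch that separates values already reconciled in a previous run from new ones that still need attention. It assumes the earlier choices are available as a simple mapping from cell value to Wikidata ID, which is not something OpenRefine or openrefine-batch exports in this form. The function name, the values and the IDs below are placeholders.

```python
def split_by_known_matches(values, known_matches):
    """Split values into those matched in an earlier run and those still unreconciled."""
    matched = {}
    unmatched = []
    for value in values:
        if value in known_matches:
            matched[value] = known_matches[value]
        elif value not in unmatched:
            # Keep first-seen order and list each new value only once.
            unmatched.append(value)
    return matched, unmatched


# Placeholder mapping standing in for choices made in an earlier reconciliation.
previous_choices = {"value A": "Q_PLACEHOLDER_1", "value B": "Q_PLACEHOLDER_2"}

# Placeholder values from a new database version.
new_values = ["value A", "value C", "value B", "value C", "value D"]

matched, unmatched = split_by_known_matches(new_values, previous_choices)
print(len(matched), "values reuse earlier choices")
print("needs manual reconciliation:", unmatched)
```

The unmatched list is the part that would be handed to a user prompt or to the OpenRefine UI, while everything in matched can reuse the earlier choices.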