Figure out what is the best way to update the CSV without the need to reconcile the entire CSV? #144

Open
candlecao opened this issue Aug 14, 2024 · 4 comments
Labels: Priority: low

Comments

@candlecao
Contributor

  • without abandoning the original reconciled CSV.
  • try only updating the document manually, without modifying code.
@candlecao
Contributor Author

_Some draft by Ich:
LinkedData CSV update

Case 1: New data in the database
Case 2: New data in Wikidata, only for static databases
...
4. Already manually reconciled CSV_X (has venue_wiki)
5. New dump of the database: CSV_Y (no venue_wiki but more rows)

How do we update CSV_X?

Replace (with venue_wiki columns) all rows that are not the same
Add any new rows

Use OpenRefine?

How do we check whether new rows can now be reconciled with Wikidata (because there are new entries in Wikidata)?
In OpenRefine, we only try to reconcile the entries that have a blank venue_wiki column._
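
The update described in the draft above could be sketched in pandas roughly as follows. This is only an illustration under assumptions: the `id` and `venue` columns and the sample values are hypothetical, and the rule "a changed row loses its `venue_wiki` value and must be reconciled again" is one possible reading of "replace all rows that are not the same".

```python
import io
import pandas as pd

# Hypothetical sample data; only the venue_wiki column name comes from the draft.
csv_x = io.StringIO("id,venue,venue_wiki\n1,Hall A,Q111\n2,Hall B,\n")
csv_y = io.StringIO("id,venue\n1,Hall A\n2,Hall B renamed\n3,Hall C\n")

old = pd.read_csv(csv_x, dtype=str, keep_default_na=False)  # CSV_X, reconciled
new = pd.read_csv(csv_y, dtype=str, keep_default_na=False)  # CSV_Y, fresh dump


def update_reconciled(old, new, key="id", wiki_col="venue_wiki"):
    """Carry venue_wiki over from old to new for rows whose raw data is unchanged."""
    data_cols = [c for c in new.columns if c != key]
    merged = new.merge(old, on=key, how="left", suffixes=("", "_old"))
    # A row keeps its reconciliation only if every raw column is unchanged.
    unchanged = pd.Series(True, index=merged.index)
    for c in data_cols:
        unchanged &= merged[c] == merged[c + "_old"]
    # Changed rows and brand-new rows get a blank venue_wiki.
    merged[wiki_col] = merged[wiki_col].where(unchanged, "")
    return merged[list(new.columns) + [wiki_col]]


updated = update_reconciled(old, new)
# Only these rows would need to go through OpenRefine again.
to_reconcile = updated[updated["venue_wiki"] == ""]
```

Rows that were blank before are also selected, so entries that have since been added to Wikidata get another chance to match.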

@Yueqiao12Zhang
Contributor

Problem: since almost all important columns (such as IDs) are transformed into URLs during reconciliation, the old and new CSVs cannot be compared directly. We would have to handle each transformed column individually, which makes the update hard to automate.

@fujinaga
Member

I don't understand. Can you explain further with examples?

@Yueqiao12Zhang
Contributor

In our previous discussion, the approach was to compare and merge the newly updated raw CSV with the old reconciled CSV, then reconcile only the updated part of the merged CSV and leave the old data untouched. However, since my reconciliation process changes both the column names and the values of the raw CSV, pandas.concat() treats the same row from the raw CSV and from the reconciled CSV as two different rows.

For example, in reconciled CSV:

recording_id,artist,recording,recording_wiki,track,number,tune,tune_id
https://thesession.org/recordings/3720,1651,Cast A Bell,,1,1,Kettledrum,https://thesession.org/tunes/14408

In the raw CSV:

id,artist,recording,track,number,tune,tune_id
3720,1651,"Cast A Bell",1,1,Kettledrum,14408

and if we update the raw CSV:

id,artist,recording,track,number,tune,tune_id
3720,1651,"Cast A Bell",1,1,Kettledrum,14408
3720,1651,"Cast A Bell",2,1,"Maiden Lane",13727

then when we compare and merge, we get:

recording_id,artist,recording,recording_wiki,track,number,tune,tune_id,id
https://thesession.org/recordings/3720,1651,Cast A Bell,,1,1,Kettledrum,https://thesession.org/tunes/14408,
,1651,Cast A Bell,,1,1,Kettledrum,14408,3720
,1651,Cast A Bell,,2,1,Maiden Lane,13727,3720
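
One way around this might be to replay the reconciliation's deterministic transformations on the new raw dump before comparing. The sketch below assumes that reconciliation only renames `id` to `recording_id` and prefixes the IDs with the base URLs seen in the example, and that the columns in `KEY` together identify a row. Both are assumptions, not confirmed behaviour of the actual reconciliation script.

```python
import io
import pandas as pd

reconciled = pd.read_csv(io.StringIO(
    "recording_id,artist,recording,recording_wiki,track,number,tune,tune_id\n"
    "https://thesession.org/recordings/3720,1651,Cast A Bell,,1,1,Kettledrum,"
    "https://thesession.org/tunes/14408\n"
), dtype=str, keep_default_na=False)

raw = pd.read_csv(io.StringIO(
    "id,artist,recording,track,number,tune,tune_id\n"
    '3720,1651,"Cast A Bell",1,1,Kettledrum,14408\n'
    '3720,1651,"Cast A Bell",2,1,"Maiden Lane",13727\n'
), dtype=str, keep_default_na=False)

# Assumed transformations, inferred from the example rows above.
RECORDING_BASE = "https://thesession.org/recordings/"
TUNE_BASE = "https://thesession.org/tunes/"
# Assumed composite key identifying one row.
KEY = ["recording_id", "track", "number", "tune_id"]


def to_reconciled_shape(raw):
    """Rename and prefix raw columns so they are comparable with the reconciled CSV."""
    out = raw.rename(columns={"id": "recording_id"})
    out["recording_id"] = RECORDING_BASE + out["recording_id"]
    out["tune_id"] = TUNE_BASE + out["tune_id"]
    return out


candidate = to_reconciled_shape(raw)
# indicator=True adds a _merge column marking rows found only in the new dump.
diff = candidate.merge(reconciled[KEY], on=KEY, how="left", indicator=True)
new_rows = diff[diff["_merge"] == "left_only"].drop(columns="_merge")

# Old reconciled rows stay untouched; new rows arrive with a blank recording_wiki.
merged = pd.concat([reconciled, new_rows], ignore_index=True).fillna("")
```

With the data above, `merged` keeps the existing Kettledrum row once and appends only the Maiden Lane row, which is then the only one left to reconcile.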
