The entire process from getting the raw JSON files from MusicBrainz to importing into Virtuoso #52

Yueqiao12Zhang · 2024-06-04T13:11:43Z

Yueqiao12Zhang
Jun 4, 2024
Maintainer

          In another issue, @ahankinson mentioned my procedures for reconciling CSVs in OpenRefine. I uploaded detailed history JSON files that are exported directly from OpenRefine. They contain all the steps I went over to reconcile.

I will also describe my steps:
For all columns with "id", I replaced them with the [https://musicbrainz.org/{entity_type}/{id}] reference. They are direct links to MusicBrainz web pages.
For other columns with "names", "title", "genre names", etc, I reconciled them with the Wikidata reconciliation service. If there is no perfect match, I go to the original page in MusicBrainz and check if there is a Wikidata link since some of them have a different name or cannot be found by the reconciliation service, and I will not reconcile that cell if there's none.
After all the reconciliation procedures, I will add a column beside each reconciled column with all the reconciled Wikidata URLs.
This is ready to export, after we export, we have to go write the mapper file for the RDF conversion. Copy the header to the relations_mapping_{database name}.json, change it into a JSON format with a single dictionary, each column in the header as a key in the dictionary, and fill their values using Wikidata Property links, Wikidata Instance links, Schema.org links, or MusicBrainz documentation links (preferred from best to worst respectively).
Then we run the csv2rdf_single_subject.p, we will get an out_rdf.ttl, and this is ready to be imported into Virtuoso. We go to Conductor > Linked Data > Quad Store Upload, select the out_rdf.ttl file, give it a name, check the "create graph explicitly" and upload. We can check if the file is successfully uploaded in the Linked Data > Graphs > Graphs section. If it is there, then we can go to the Linked Data > SPARQL, enter the name we gave to the graph in the Default Graph IRI, and perform SPARQL queries.

Originally posted by @Yueqiao12Zhang in #48 (comment)

ahankinson · 2024-06-04T14:04:41Z

ahankinson
Jun 4, 2024
Maintainer

Converted to a discussion because there is no actual issue to be solved here.

0 replies

ahankinson · 2024-06-04T14:17:07Z

ahankinson
Jun 4, 2024
Maintainer

So if you do (on your command-line, assuming you have curl installed):

$ curl -vvvv -L -H "Accept: text/turtle" "http://www.wikidata.org/entity/Q202996"

You get the text/turtle representation of the Radiohead album, OK Computer. You have to dig through it a bit, but eventually you will see this statement:

s:q202996-8A7E76C9-054D-4A31-B489-D236E386E889 a wikibase:Statement,
		wikibase:BestRank ;
	wikibase:rank wikibase:NormalRank ;
	ps:P436 "b1392450-e666-3926-a536-22c65f834433" ;
	psn:P436 <http://musicbrainz.org/release-group/b1392450-e666-3926-a536-22c65f834433> ;
	pq:P1810 "OK Computer" ;
	pq:P4070 wd:Q33328630 ;
	prov:wasDerivedFrom ref:d6c4d8b9b0a646a2b5c7b998951a3a35e82fdedb .

Which is to say that the MusicBrainz ID for this album is already reconciled in Wikidata.

Similarly, if you go from the other direction:

$ curl "https://musicbrainz.org/ws/2/release-group/b1392450-e666-3926-a536-22c65f834433?inc=url-rels"

There is a <relation@type="wikidata"> block that gives the Wikidata ID for that release in MusicBrainz.

<relation type="wikidata" type-id="b988d08c-5d86-4a57-9557-c83b399e3580">
    <target id="d27dd960-9b18-44a0-9bb4-1fa53f58e33e">https://www.wikidata.org/wiki/Q202996</target>
    <direction>forward</direction>
</relation>

It's similar in other MusicBrainz resources as well. You just have to visit the page in MusicBrainz and look on the right-hand side, and see if there is a wikidata link.

So I'm still not entirely sure where OpenRefine fits in here. The two data sources are already reconciled. Are you trying to load MusicBrainz into Virtuoso via RDF? Or something else?

10 replies

Yueqiao12Zhang Jun 5, 2024
Maintainer Author

But there are many other attributes in the same CSV that are not referenced by Wikidata. For example, the genre of an artist.

dchiller Jun 5, 2024
Maintainer

For example, the genre of an artist.

Yes, these are also often reconciled, see relationships here:

dchiller Jun 5, 2024
Maintainer

Are you saying that I do not have to use OpenRefine in the MusicBrainz database conversion?

I think at least we need to make sure that we are extracting whatever has already been reconciled. There is no point in re-reconciling.

Yueqiao12Zhang Jun 5, 2024
Maintainer Author

Ok, I will add this into the code.

Yueqiao12Zhang Jun 5, 2024
Maintainer Author

What I meant is that on a single CSV generated from a single JSON, we only get the Wikidata link for the subject, but none of the other attributes. In this case, if we only convert one single JSON into CSV, we only get the Wikidata for the subject, and we still have to reconcile the others using OpenRefine. For example, we have
artist_id, id_wiki, genre_id, genre_name, genre_wiki
as the header of the CSV, we can extract the information of artist_id, id_wiki, genre_id, genre_name from the JSON, but not the genre_wiki, which we have to go through OpenRefine.

fujinaga · 2024-06-05T23:58:05Z

fujinaga
Jun 5, 2024
Maintainer

Forbidden Pigs does has a VIAF ID: http://viaf.org/viaf/144390992 Can find out if there’s any documentation on MusicBrainz site, when they would use Wikidata ID or VIAF ID? If you can’t find any information, let me know. I’ll ask Alastair Porter, who works at MusicBrainz and is a LinkedMusic team member. On Jun 5, 2024, at 10:13 PM, Yueqiao Zhang ***@***.***> wrote: In the data dump, not every band has a wikidata link. For example: https://musicbrainz.org/artist/c7ac345b-057f-4553-9eaf-76c6ce824f96 — Reply to this email directly, view it on GitHub<#52 (reply in thread)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAH342MUP5JKDPKFBCSQ3XTZF4FHVAVCNFSM6AAAAABIYT6YJOVHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM4TMNZYGEYDA>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

5 replies

ahankinson Jun 6, 2024
Maintainer

Here's what I could find:

https://musicbrainz.org/doc/Wikidata

" A few bots scour MusicBrainz and add Wikidata links to entities, but it can also be done manually."

I think it then enters the potential edits in for a vote

This seems to be the source code for one of the bots that adds Wikidata / VIAF / etc. links

https://github.com/loujine/musicbrainz-scripts/blob/master/mb-edit-create_from_wikidata.user.js

Which I found here, in a list of all the bots:

https://wiki.musicbrainz.org/Guides/Userscripts

Yueqiao12Zhang Jun 6, 2024
Maintainer Author

Right now my code can extract the Wikidata information if it exists in the data dumps, but I'm not sure if it still needs to be reconciled once more. In the data dumps, only the subject has the Wikidata resources, but not other objects/attributes of the subject. Should other attributes be reconciled?

ahankinson Jun 7, 2024
Maintainer

You're going to have to be clearer. What objects / attributes are you talking about, specifically?

Yueqiao12Zhang Jun 7, 2024
Maintainer Author

For example, what we will have is a CSV with artists as a subject, a Wikidata link associated with it, and its genre:
artist_id, ... , relations_wiki, ... , genre, ...
There will only be one single Wikidata link referencing to the artist, but there is no Wikidata link for his genre unless we track its id to another database. In this case I think we should reconcile the 'genre' column in OpenRefine.

Yueqiao12Zhang Jun 7, 2024
Maintainer Author

The 'recordings' database is not reconciled with Wikidata, but I also think that I should reconcile them with their original song names.

fujinaga · 2024-06-08T09:04:45Z

fujinaga
Jun 8, 2024
Maintainer

@Yueqiao12Zhang Let's step back and start from the beginning:
First, find a place to document the process. You can update this https://github.com/DDMAL/linkedmusic-datalake/blob/main/musicbrainz/musicbrainz_pipeline.md if you'd like or start your own new documentation.
This documentation should include the step-by-step explanation of:

Where you download the MusicBrainz datasets. Describe briefly what these datasets are and what each JSON file contains, e.g., you said above that the Artist file contains: artist_id, ... , relations_wiki, ... , genre, .... Provide the full list.
Store the datasets in the current repo: https://github.com/DDMAL/linkedmusic-datalake/tree/main/musicbrainz/jsonld
If you are only using portion of a file, for experiment purposes (say, the first 20 records), document how you are doing that and possibly a script that accomplishes that. I'm presuming this one script can be used for different JSON files. Store these file as a subdirectory of https://github.com/DDMAL/linkedmusic-datalake/tree/main/musicbrainz/jsonld
At this point, given the subset of a JSON file (e.g., Artist file), I would like you to write a script to convert the JSON file into a CSV file. This should be as generic and simple as possible, so that we can use this script to convert other JSON files we may get from other music databases. Use ChatGPT and Co-Pilot for this purpose (or for any tasks for that matter). ChatGPT tells me to use pandas and json libraries then define json_to_csv Function. Try to use minimum amount of normalization so that the JSON_to_CSV script can be used generically. If specialized normalization is required, this should be clearly documented as to why this is necessary and to which type of JSON files this is necessary (MusicBrainz specific or otherwise).
Now, run your JSON_to_CSV script for the subsets of MusicBrainz JSON files to produce CSV files and store them. At this point, there should be no reconciliations, i.e., no URIs unless they already exist in the JSON files.

Let us know when you have created the CSV files so that we can see them.
Make sure to document every step above.
If all looks good, the next step would be to manually use OpenRefine to reconcile the CSV file.

One of the main goals of the LinkedMusic project is to create a workflow to integrate as many different music databases as cleanly and simply as possible without relying too much on specialized or custom software so that the workflow would be easy to maintain over several years.

10 replies

Yueqiao12Zhang Jun 12, 2024
Maintainer Author

Ok. I will try.

Yueqiao12Zhang Jun 12, 2024
Maintainer Author

@Yueqiao12Zhang Can you go through the five steps using a new set of CantusDB? I think the original dump had 50 records, can you go through the workflow with 100 records, or double the original number of records? Please fully document the process so that someone can replicate the process.

I have worked on a dump with over a thousand records from previous codes: cantusdb/cantusdb_01102024.csv, and that worked fine. I reconciled it using OpenRefine, then I converted it using my script. Would you like for me to work on the large SQL dumps in Cantus DB?

fujinaga Jun 13, 2024
Maintainer

I would like you to try the entire process using a new set from CantusDB, you can use, say, 2,000 records. Please create a new issue for this and document carefully the process. Once you've completed that I would like you try other databases, e.g., SIMSSA DB and The Session. We are trying to figure how much we can generalize without writing separate codes for different databases. The less code means easier maintenance and easier to scale up.

Yueqiao12Zhang Jun 13, 2024
Maintainer Author

Ok. Dylan told me that the CSV dumps for CantusDB will soon be available, and only the SQL dumps are available right now. If we can get the CSV dumps, then no additional code will be needed. But if we use the SQL dumps, we still need to convert SQL to CSV. Should I use the SQL dumps directly?

fujinaga Jun 14, 2024
Maintainer

No. Wait until CSV is available. Also, use 2,000 records not 7,000. Move onto SIMSSA DB and The Session.

ahankinson · 2024-06-09T19:18:35Z

ahankinson
Jun 9, 2024
Maintainer

Re: the three-column CSV: we could do that if we need to, but I'd like to try the very wide CSV with repeated rows (when there are multiple entires in a row, such as an artist belonging to several genres).

I like the CSV as an intermediate step between RDB and RDF, because we can use OpenRefine to manually see what needs reconciling. Eventually, we may be able to skip the CSV step.

In my experience, OpenRefine works best when you only have enough data from the dataset you're trying to reconcile to make the link. OpenRefine will try to reconcile every row in your CSV. So if you have a set of repeated rows for, say, multiple genres, and then another set of the same repeated rows where each genre is repeated by a different track in each row, then it will try to reconcile every one of the repeated rows. You can choose to reconcile every matching value the same way, but if you're repeating each row for every combination of lists, then there is a lot of duplication in your CSV. You may end up with 20-30 rows (or more) for each genre entry, for each MusicBrainz entity.

So if you have a very simple example like this:

{
"id": "musicbrainz/release/12345"
"name": "A great Release",
"genres": [
    {"name": "pop", "id": "musicbrainz/genre/54321" }, 
    {"name": "rock", "id": "musicbrainz/genre/987654"}
],
"external_links": [
    {"name": "wikidata", "url": "wikidata/Q777777"}, 
    {"name": "discogs", "url": "discogs/33333"}
],
"tracks": [
    {"name": "intro", "id": "musicbrainz/track/666666"}, 
    {"name": "track 2", "id": "musicbrainz/track/55555"},
]}

You could expect that in "Wide" CSV format this would serialize as (ignoring the "url/id" in each entry in the list for now):

id	name	genre	external	track
musicbrainz/release/12345	A great release	pop	wikidata	intro
musicbrainz/release/12345	A great release	rock	wikidata	intro
musicbrainz/release/12345	A great release	pop	discogs	intro
musicbrainz/release/12345	A great release	rock	discogs	intro
musicbrainz/release/12345	A great release	pop	wikidata	track2
musicbrainz/release/12345	A great release	rock	wikidata	track2
musicbrainz/release/12345	A great release	pop	discogs	track2
musicbrainz/release/12345	A great release	rock	discogs	track2

That's a lot of duplicated entries in the "genre" field when all you really need in OpenRefine is this:

genre_id	genre	wikidata_id
musicbrainz/genre/54321	pop	?
musicbrainz/genre/987654	rock	?

When I reconciled DIAMM and RISM sources, all I used from DIAMM was the DIAMM id, library siglum and the shelfmark. I created a column that combined the siglum and shelfmark, and used that as the "query" in the OpenRefine service in RISM. When there is a match, you end up with a link between the DIAMM ID (in the sheet) and the RISM ID (from the reconciliation.) You then store the link.

So It's better to reconcile each genre separately, which gives you a MusicBrainz Genre ID, and the Wikidata Q-id as the reconciled values. You can then either choose to keep those and make the link in your data dump when you process, say, the recordings, OR you could load that directly into your graph database, e.g., add a bunch of statements that essentially say:

<MusicBrainz-URI> <sameAs> <Wikidata-URI>

Then when you load your MusicBrainz data into your graph database with all its MusicBrainz URIs (including the Genre ones), it will be pre-populated with the relationships and the Wikidata URIs already, and you will already be able to query those relationships.

1 reply

Yueqiao12Zhang Jun 12, 2024
Maintainer Author

You could expect that in "Wide" CSV format this would serialize as (ignoring the "url/id" in each entry in the list for now):

id name genre external track
musicbrainz/release/12345 A great release pop wikidata intro
musicbrainz/release/12345 A great release rock wikidata intro
musicbrainz/release/12345 A great release pop discogs intro
musicbrainz/release/12345 A great release rock discogs intro
musicbrainz/release/12345 A great release pop wikidata track2
musicbrainz/release/12345 A great release rock wikidata track2
musicbrainz/release/12345 A great release pop discogs track2
musicbrainz/release/12345 A great release rock discogs track2

The convert to CSV script eliminates duplicates and only adds necessary information in additional lines. It goes like:

 id	name	genre	external	track
 musicbrainz/release/12345	A great release	pop	wikidata	intro
 musicbrainz/release/12345			rock	discogs		track2

And when we convert it into RDF and import it into the graph database, there will be 7 triples in total from this CSV.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The entire process from getting the raw JSON files from MusicBrainz to importing into Virtuoso #52

{{title}}

Replies: 5 comments 26 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

The entire process from getting the raw JSON files from MusicBrainz to importing into Virtuoso #52

Yueqiao12Zhang Jun 4, 2024 Maintainer

Replies: 5 comments · 26 replies

ahankinson Jun 4, 2024 Maintainer

ahankinson Jun 4, 2024 Maintainer

Yueqiao12Zhang Jun 5, 2024 Maintainer Author

dchiller Jun 5, 2024 Maintainer

dchiller Jun 5, 2024 Maintainer

Yueqiao12Zhang Jun 5, 2024 Maintainer Author

Yueqiao12Zhang Jun 5, 2024 Maintainer Author

fujinaga Jun 5, 2024 Maintainer

ahankinson Jun 6, 2024 Maintainer

Yueqiao12Zhang Jun 6, 2024 Maintainer Author

ahankinson Jun 7, 2024 Maintainer

Yueqiao12Zhang Jun 7, 2024 Maintainer Author

Yueqiao12Zhang Jun 7, 2024 Maintainer Author

fujinaga Jun 8, 2024 Maintainer

Yueqiao12Zhang Jun 12, 2024 Maintainer Author

Yueqiao12Zhang Jun 12, 2024 Maintainer Author

fujinaga Jun 13, 2024 Maintainer

Yueqiao12Zhang Jun 13, 2024 Maintainer Author

fujinaga Jun 14, 2024 Maintainer

ahankinson Jun 9, 2024 Maintainer

Yueqiao12Zhang Jun 12, 2024 Maintainer Author

Yueqiao12Zhang
Jun 4, 2024
Maintainer

Replies: 5 comments 26 replies

ahankinson
Jun 4, 2024
Maintainer

ahankinson
Jun 4, 2024
Maintainer

Yueqiao12Zhang Jun 5, 2024
Maintainer Author

dchiller Jun 5, 2024
Maintainer

dchiller Jun 5, 2024
Maintainer

Yueqiao12Zhang Jun 5, 2024
Maintainer Author

Yueqiao12Zhang Jun 5, 2024
Maintainer Author

fujinaga
Jun 5, 2024
Maintainer

ahankinson Jun 6, 2024
Maintainer

Yueqiao12Zhang Jun 6, 2024
Maintainer Author

ahankinson Jun 7, 2024
Maintainer

Yueqiao12Zhang Jun 7, 2024
Maintainer Author

Yueqiao12Zhang Jun 7, 2024
Maintainer Author

fujinaga
Jun 8, 2024
Maintainer

Yueqiao12Zhang Jun 12, 2024
Maintainer Author

Yueqiao12Zhang Jun 12, 2024
Maintainer Author

fujinaga Jun 13, 2024
Maintainer

Yueqiao12Zhang Jun 13, 2024
Maintainer Author

fujinaga Jun 14, 2024
Maintainer

ahankinson
Jun 9, 2024
Maintainer

Yueqiao12Zhang Jun 12, 2024
Maintainer Author