The entire process from getting the raw JSON files from MusicBrainz to importing into Virtuoso #52
-
Converted to a discussion because there is no actual issue to be solved here.
-
So if you do (on your command-line, assuming you have curl installed):
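For example, asking the Wikidata SPARQL endpoint which MusicBrainz release-group ID (property P436) an album entity carries; Q11649 ("Nevermind") stands in here for the album in question:

curl -s -G 'https://query.wikidata.org/sparql' \
  --data-urlencode 'query=SELECT ?mbid WHERE { wd:Q11649 wdt:P436 ?mbid }' \
  --data-urlencode 'format=json'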
You get the MusicBrainz ID back in the response.
Which is to say that the MusicBrainz ID for this album is already reconciled in Wikidata. Similarly, if you go from the other direction:
There is a Wikidata link listed among the external links on the right-hand side of the page.
It's similar for other MusicBrainz resources as well: you just have to visit the page in MusicBrainz, look on the right-hand side, and see if there is a Wikidata link. So I'm still not entirely sure where OpenRefine fits in here. The two data sources are already reconciled. Are you trying to load MusicBrainz into Virtuoso via RDF? Or something else?
-
Forbidden Pigs does have a VIAF ID: http://viaf.org/viaf/144390992
Can you find out if there's any documentation on the MusicBrainz site about when they would use a Wikidata ID or a VIAF ID?
If you can’t find any information, let me know. I’ll ask Alastair Porter, who works at MusicBrainz and is a LinkedMusic team member.
On Jun 5, 2024, at 10:13 PM, Yueqiao Zhang ***@***.***> wrote:
In the data dump, not every band has a wikidata link. For example: https://musicbrainz.org/artist/c7ac345b-057f-4553-9eaf-76c6ce824f96
-
@Yueqiao12Zhang Let's step back and start from the beginning:
Let us know when you have created the CSV files so that we can see them. One of the main goals of the LinkedMusic project is to create a workflow to integrate as many different music databases as cleanly and simply as possible, without relying too much on specialized or custom software, so that the workflow will be easy to maintain over several years.
-
In my experience, OpenRefine works best when you include only enough data from the dataset you're trying to reconcile to make the link. OpenRefine will try to reconcile every row in your CSV. So if you have a set of repeated rows for, say, multiple genres, and each of those rows is then repeated again for every track, it will try to reconcile every one of the repeated rows. You can choose to reconcile every matching value the same way, but if you're repeating each row for every combination of lists, there is a lot of duplication in your CSV. You may end up with 20-30 rows (or more) for each genre entry, for each MusicBrainz entity. So if you have a very simple example like this:

{
  "id": "musicbrainz/release/12345",
  "name": "A great Release",
  "genres": [
    {"name": "pop", "id": "musicbrainz/genre/54321"},
    {"name": "rock", "id": "musicbrainz/genre/987654"}
  ],
  "external_links": [
    {"name": "wikidata", "url": "wikidata/Q777777"},
    {"name": "discogs", "url": "discogs/33333"}
  ],
  "tracks": [
    {"name": "intro", "id": "musicbrainz/track/666666"},
    {"name": "track 2", "id": "musicbrainz/track/55555"}
  ]
}

you could expect that in "Wide" CSV format this would serialize as (ignoring the "url/id" in each entry in the list for now):
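Something along these lines (a sketch; the column names are illustrative):

id,name,genre,track
musicbrainz/release/12345,A great Release,pop,intro
musicbrainz/release/12345,A great Release,pop,track 2
musicbrainz/release/12345,A great Release,rock,intro
musicbrainz/release/12345,A great Release,rock,track 2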
That's a lot of duplicated entries in the "genre" field when all you really need in OpenRefine is this:
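For instance, just the genre columns on their own (again a sketch, with assumed column names):

genre_id,genre_name
musicbrainz/genre/54321,pop
musicbrainz/genre/987654,rock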
When I reconciled DIAMM and RISM sources, all I used from DIAMM was the DIAMM ID, the library siglum, and the shelfmark. I created a column that combined the siglum and shelfmark, and used that as the "query" in the OpenRefine service in RISM. When there is a match, you end up with a link between the DIAMM ID (in the sheet) and the RISM ID (from the reconciliation). You then store the link.
So it's better to reconcile each genre separately, which gives you the MusicBrainz genre ID and the Wikidata Q-id as the reconciled values. You can then either keep those and make the link in your data dump when you process, say, the recordings, OR load them directly into your graph database, e.g., add a bunch of statements that essentially say:
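Something like the following, sketched in Turtle (owl:sameAs is one reasonable choice of predicate here, and the Wikidata Q-ids are illustrative):

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Each MusicBrainz genre URI is declared equivalent to its reconciled Wikidata entity.
<https://musicbrainz.org/genre/54321>  owl:sameAs <http://www.wikidata.org/entity/Q37073> .  # pop
<https://musicbrainz.org/genre/987654> owl:sameAs <http://www.wikidata.org/entity/Q11399> .  # rock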
Then when you load your MusicBrainz data into your graph database with all its MusicBrainz URIs (including the genre ones), it will be pre-populated with those relationships and the Wikidata URIs, and you will already be able to query them.
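For instance, a query along these lines would then resolve MusicBrainz genres to their Wikidata equivalents (the hasGenre predicate is hypothetical; substitute whatever predicate your RDF conversion actually uses):

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?release ?wikidataGenre
WHERE {
  ?release <http://example.org/hasGenre> ?genre .
  ?genre owl:sameAs ?wikidataGenre .
}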
-
I will also describe my steps:
For all columns with "id", I replaced the values with https://musicbrainz.org/{entity_type}/{id} references, which are direct links to MusicBrainz web pages.
For other columns with "name", "title", "genre name", etc., I reconciled the values against the Wikidata reconciliation service. If there is no perfect match, I go to the original page in MusicBrainz and check whether it has a Wikidata link, since some entities have a different name or cannot be found by the reconciliation service; if there is none, I do not reconcile that cell.
After all the reconciliation procedures, I add a column beside each reconciled column containing the reconciled Wikidata URLs.
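In OpenRefine this can be done with "Edit column" > "Add column based on this column..." and a GREL expression along these lines (a sketch; it leaves unmatched cells empty):

if(cell.recon.match != null, "https://www.wikidata.org/entity/" + cell.recon.match.id, "")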
This is ready to export. After we export, we write the mapper file for the RDF conversion: copy the header into relations_mapping_{database name}.json, change it into JSON format with a single dictionary, with each column in the header as a key, and fill in the values using Wikidata property links, Wikidata instance links, Schema.org links, or MusicBrainz documentation links (in descending order of preference).
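A minimal sketch of what such a mapper file might contain (the column names and property choices are illustrative, not the project's actual mapping):

{
  "name": "https://www.wikidata.org/wiki/Property:P2561",
  "genre": "https://www.wikidata.org/wiki/Property:P136",
  "genre_wikidata": "https://www.wikidata.org/wiki/Property:P136"
}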
Then we run csv2rdf_single_subject.py and get an out_rdf.ttl, which is ready to be imported into Virtuoso. We go to Conductor > Linked Data > Quad Store Upload, select the out_rdf.ttl file, give the graph a name, check "Create graph explicitly", and upload. We can check whether the file was uploaded successfully under Linked Data > Graphs > Graphs. If it is there, we can go to Linked Data > SPARQL, enter the name we gave the graph as the Default Graph IRI, and perform SPARQL queries.
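For example, a quick sanity check (assuming the graph was named http://example.org/musicbrainz at upload time):

SELECT ?s ?p ?o
FROM <http://example.org/musicbrainz>
WHERE { ?s ?p ?o }
LIMIT 10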
Originally posted by @Yueqiao12Zhang in #48 (comment)