
Datetime place format recognition #159

Open · wants to merge 42 commits into main

Conversation

@Yueqiao12Zhang (Contributor) commented Aug 20, 2024

Recognize the datetime and place format in CSV, correcting a reconciliation mistake in sessions-csv.csv

@dchiller (Contributor) left a comment:

A few general comments:

  1. There are some commits in the history that I don't understand. What does test: update based on change mean? It is very general.... Also, "delete tunes.csv". I don't see any change to "tunes.csv".
  2. If you are going to infer these types of data types element by element, then I think it needs to be documented somewhere that when reconciling, you need to put geographic data in this format and date-time data in the other format.
  3. What do you think about the fact that we are essentially inferring data type on an element by element basis rather than a column by column basis? Are there situations where two elements in the same column will have different data types? Probably out of scope for this PR, but the more different types of inferences that you are doing based on some pretty contingent things (for example, could there be a non-geo data element that just happens to begin with "Point("?), the more I feel like this should be done on a column by column basis.

csv2rdf/csv2rdf_single_subject.py
@@ -84,6 +86,10 @@ def convert_csv_to_turtle(filenames: List[str]) -> Graph:
            obj = Literal(element, datatype=XSD.boolean)
        elif element.isnumeric():
            obj = Literal(element, datatype=XSD.integer)
        elif element.startswith("Point("):
            obj = Literal(element[5:], datatype=WGS.Point)
Contributor:
Isn't there a closing parenthesis?

Contributor:
Also, is element[5:] further parsed into lat/long?

Contributor Author:
In my reconciliation process and documentation, OpenRefine converts the longitude and latitude into the Point(_, _) format. This is the Point format in WGS.

Contributor Author:
Also, is element[5:] further parsed into lat/long?

No, they will stay in the format (_, _)^^WGS.Point

Contributor Author:
Isn't there a closing parenthesis?

It includes both parentheses.

@dchiller (Contributor), Aug 23, 2024:
I'm not sure I follow. If element = "Point(123.123456, 987.987654)" won't element[5:] = "123.123456, 987.987654)"? Note the ")" at the end.

@Yueqiao12Zhang (Contributor Author), Aug 23, 2024:
It's (123.123456, 987.987654). The t in "point" is at index 4, and [5:] starts at index 5, which is the (.
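The indexing can be checked directly in Python (using the hypothetical coordinate values from the comment above):

```python
# The characters of "Point" occupy indices 0-4, so index 5 is the "(".
element = "Point(123.123456, 987.987654)"
coords = element[5:]  # slice begins at the opening parenthesis
print(coords)  # (123.123456, 987.987654)
```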

Contributor:
I see...I miscounted the index.

So the format for data type WGS.Point includes the parentheses and the comma?

Contributor Author:
Junjun tested it and decided to switch to a new namespace; it's changed now:
"Point(xxx.xxxx yyy.yyyy)"^^geo:wktLiteral
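A minimal stdlib-only sketch of what that serialization produces (the namespace URI is the standard GeoSPARQL one; in the actual script this would presumably be an rdflib Literal built with a geo namespace, and the helper name here is hypothetical):

```python
# Standard GeoSPARQL datatype IRI for WKT literals.
GEO_WKT = "http://www.opengis.net/ont/geosparql#wktLiteral"

def to_turtle_literal(element: str) -> str:
    # The whole "Point(lon lat)" string, parentheses included,
    # becomes the lexical value of a typed Turtle literal.
    return f'"{element}"^^<{GEO_WKT}>'

print(to_turtle_literal("Point(123.4567 89.0123)"))
# "Point(123.4567 89.0123)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>
```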

@Yueqiao12Zhang (Contributor Author):
A few general comments:

  1. There are some commits in the history that I don't understand. What does test: update based on change mean? It is very general.... Also, "delete tunes.csv". I don't see any change to "tunes.csv".

My thinking is that when you see two consecutive commits, the test output was changed based on the previous commit.

  1. If you are going to infer these types of data types element by element, then I think it needs to be documented somewhere that when reconciling, you need to put geographic data in this format and date-time data in the other format.

They are documented in the reconciliation process.

  1. What do you think about the fact that we are essentially inferring data type on an element by element basis rather than a column by column basis? Are there situations where two elements in the same column will have different data types? Probably out of scope for this PR, but the more different types of inferences that you are doing based on some pretty contingent things (for example, could there be a non-geo data element that just happens to begin with "Point("?), the more I feel like this should be done on a column by column basis.

We format the coordinates in the reconciliation process, so I think it's standardized.

@dchiller (Contributor) commented Aug 23, 2024:

My thinking is that when you see two consecutive commits, the test output was changed based on the previous commit.

The best practice would be to reference the commit: test: update out_rdf.ttl based on cf33751

They are documented in the reconciliation process.

Where? Since this script applies to all DBs, this kind of thing should be in the documentation for this script.

We format the coordinates in the reconciliation process, so I think it's standardized.

I don't think this addresses my question. Is there going to be a situation where the output of the reconciliation process (in other words, the reconciled csvs) has a single column with multiple data types?

@Yueqiao12Zhang (Contributor Author):
They are documented in the reconciliation process.

Where? Since this script applies to all DBs, this kind of thing should be in the documentation for this script.

Right now the only place where we have to format datetime and coordinate is in The Sessions DB, so I assume this is a 1/4 chance.

We format the coordinates in the reconciliation process, so I think it's standardized.

I don't think this addresses my question. Is there going to be a situation where the output of the reconciliation process (in other words, the reconciled csvs) has a single column with multiple data types?

One column will always be a single type. I can change the .startswith to another kind of pattern matching. I think that processing by column would be extra work for the operator, since we would have to indicate each column's type manually, so I assume that pattern matching would be simpler.
One idea for processing by column is to apply pattern matching to the entire column to ensure the column is all one type, but that would take too many resources to check.

@dchiller (Contributor):
One column will always be a single type. I can change the .startswith to another kind of pattern matching. I think that processing by column would be extra work for the operator, since we would have to indicate each column's type manually, so I assume that pattern matching would be simpler.
One idea for processing by column is to apply pattern matching to the entire column to ensure the column is all one type, but that would take too many resources to check.

Ok, yes... let's not worry about it now. But wanted to know. Thanks!

Right now the only place where we have to format datetime and coordinate is in The Sessions DB, so I assume this is a 1/4 chance.

I'm not sure what you mean by 1/4 chance. There will be other databases, though, so I would put this in the README for this script.

@Yueqiao12Zhang (Contributor Author):
I'm not sure what you mean by 1/4 chance. There will be other databases, though, so I would put this in the README for this script.

I work mainly on 4 different databases, and only The Session needs reformatting. That's where the 1/4 comes from.

@Yueqiao12Zhang (Contributor Author):
One column will always be a single type. I can change the .startswith to another kind of pattern matching. I think that processing by column would be extra work for the operator, since we would have to indicate each column's type manually, so I assume that pattern matching would be simpler.
One idea for processing by column is to apply pattern matching to the entire column to ensure the column is all one type, but that would take too many resources to check.

Ok, yes... let's not worry about it now. But wanted to know. Thanks!

Right now the only place where we have to format datetime and coordinate is in The Sessions DB, so I assume this is a 1/4 chance.

I'm not sure what you mean by 1/4 chance. There will be other databases, though, so I would put this in the README for this script.

My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

@fujinaga (Member):
My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

Can you give us some examples where "recordings and authors have integers as their names"?

@dchiller (Contributor) commented Aug 26, 2024:

My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

No mistake...just something to think about. Since the csv2rdf/csv2rdf_single_subject.py script is designed for use across all the databases, then it should be documented somewhere for all the databases that we are assuming a canonical datetime format. That way, if we add another database with a different date time format, we know immediately that one of the transformations we need to apply is to the datetime.

I think recordings and authors having integers in their names is a big reason why we want to do this inference of type at the column level, rather than the individual cell level. You'd have to have a recording with a pretty specific title to be caught up in this, of course, but since we already know that the whole column will have the same data type, it seems cleaner just to go ahead and use that knowledge rather than one day have a weird and hard-to-find bug because someone decided to name their song "Point(1,2)" or something.
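A rough sketch of what column-level inference could look like (function names and sample data here are hypothetical, not from the script): sample every value in a column first, then assign one datatype to the whole column, so a single odd-looking cell cannot flip its own type.

```python
import csv
import io

def infer_column_type(values):
    # A column only gets a non-string type if every cell agrees.
    if all(v.isnumeric() for v in values):
        return "xsd:integer"
    if all(v.startswith("Point(") for v in values):
        return "geo:wktLiteral"
    return "xsd:string"

data = 'title,artist\n"Point(1,2)",1651\nMorning Dew,2048\n'
rows = list(csv.DictReader(io.StringIO(data)))
columns = {name: [row[name] for row in rows] for name in rows[0]}
types = {name: infer_column_type(vals) for name, vals in columns.items()}
print(types)
# {'title': 'xsd:string', 'artist': 'xsd:integer'}
```

The song titled "Point(1,2)" stays a string because the rest of the title column vetoes the WKT interpretation; cell-level inference would have mistyped that one cell.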

@Yueqiao12Zhang (Contributor Author):
My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

Can you give us some examples where "recordings and authors have integers as their names"?

This is a recording whose artist is "1651", which is recognized as an integer in the RDF since it does not have quotes.

<https://thesession.org/recordings/3720> a <http://www.wikidata.org/entity/Q49017950> ;
    ns3:Artist <https://thesession.org/recordings/artists/2112>,
        1651 ;
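A small sketch of why this happens (a hypothetical simplification of the script's element-level branch, not the actual code):

```python
def serialize(element: str) -> str:
    # Element-level inference: any all-digit cell is emitted as a bare
    # integer, so the artist name "1651" loses its quotes in the Turtle output.
    if element.isnumeric():
        return element            # e.g. 1651 (typed as an integer)
    return f'"{element}"'         # ordinary names stay quoted strings

print(serialize("1651"))          # 1651 — the name is mistaken for a number

def serialize_with_column(element: str, column_type: str) -> str:
    # Column-aware variant: if the Artist column is known to hold strings,
    # numeric-looking names keep their quotes.
    if column_type == "integer" and element.isnumeric():
        return element
    return f'"{element}"'

print(serialize_with_column("1651", "string"))  # "1651"
```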

@Yueqiao12Zhang (Contributor Author):
My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

No mistake...just something to think about. Since the csv2rdf/csv2rdf_single_subject.py script is designed for use across all the databases, then it should be documented somewhere for all the databases that we are assuming a canonical datetime format. That way, if we add another database with a different date time format, we know immediately that one of the transformations we need to apply is to the datetime.

I think recordings and authors having integers in their names is a big reason why we want to do this inference of type at the column level, rather than the individual cell level. You'd have to have a recording with a pretty specific title to be caught up in this, of course, but since we already know that the whole column will have the same data type, it seems cleaner just to go ahead and use that knowledge rather than one day have a weird and hard-to-find bug because someone decided to name their song "Point(1,2)" or something.

Ok. Do you think there should be some user-defined variables that specify the type of each column? Or should it be automatically recognized?

@dchiller (Contributor):
@candlecao Why are you committing to this branch? If the changes you are making need to be committed to the repo, please make another branch and pull request.

@candlecao (Contributor):
@candlecao Why are you committing to this branch? If the changes you are making need to be committed to the repo, please make another branch and pull request.

Ok, got it. I will be careful next time. I can abandon my commits on this branch this time. Thanks.
