
Datetime place format recognition #159

Open · wants to merge 42 commits into main

Conversation

@Yueqiao12Zhang (Contributor) commented Aug 20, 2024

Recognize the datetime and place format in CSV, correcting a reconciliation mistake in sessions-csv.csv

@dchiller (Contributor) left a comment:

A few general comments:

  1. There are some commits in the history that I don't understand. What does test: update based on change mean? It is very general.... Also, "delete tunes.csv". I don't see any change to "tunes.csv".
  2. If you are going to infer these types of data types element by element, then I think it needs to be documented somewhere that when reconciling, you need to put geographic data in this format and date-time data in the other format.
  3. What do you think about the fact that we are essentially inferring data type on an element by element basis rather than a column by column basis? Are there situations where two elements in the same column will have different data types? Probably out of scope for this PR, but the more different types of inferences that you are doing based on some pretty contingent things (for example, could there be a non-geo data element that just happens to begin with "Point("?), the more I feel like this should be done on a column by column basis.

csv2rdf/csv2rdf_single_subject.py
@@ -84,6 +86,10 @@ def convert_csv_to_turtle(filenames: List[str]) -> Graph:
            obj = Literal(element, datatype=XSD.boolean)
        elif element.isnumeric():
            obj = Literal(element, datatype=XSD.integer)
        elif element.startswith("Point("):
            obj = Literal(element[5:], datatype=WGS.Point)
Contributor:
Isn't there a closing parenthesis?

Contributor:
Also, is element[5:] further parsed into lat/long?

Contributor Author:
In my reconciliation process and documentation, OpenRefine converts the longitude and latitude into the Point(_, _) format. This is the Point format in WGS.

Contributor Author:
Also, is element[5:] further parsed into lat/long?

No, they will stay in the format (_, _)^^WGS.Point

Contributor Author:
Isn't there a closing parenthesis?

It includes both parentheses.

@dchiller (Contributor), Aug 23, 2024:
I'm not sure I follow. If element = "Point(123.123456, 987.987654)" won't element[5:] = "123.123456, 987.987654)"? Note the ")" at the end.

@Yueqiao12Zhang (Contributor Author), Aug 23, 2024:
It's (123.123456, 987.987654). The t in "point" is at index 4, and [5:] starts at index 5, which is the (.
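The indexing can be checked directly in Python (using the hypothetical coordinate values from the comment above):

```python
# The characters of "Point" occupy indices 0-4, so index 5 is the "(".
element = "Point(123.123456, 987.987654)"
coords = element[5:]  # slice begins at the opening parenthesis
print(coords)  # (123.123456, 987.987654)
```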

Contributor:
I see...I miscounted the index.

So the format for data type WGS.Point includes the parentheses and the comma?

Contributor Author:
Junjun tested it and decided to switch to a new namespace; it's changed now:
"Point(xxx.xxxx yyy.yyyy)"^^geo:wktLiteral
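A minimal stdlib-only sketch of what that serialization produces (the namespace URI is the standard GeoSPARQL one; in the actual script this would presumably be an rdflib Literal built with a geo namespace, and the helper name here is hypothetical):

```python
# Standard GeoSPARQL datatype IRI for WKT literals.
GEO_WKT = "http://www.opengis.net/ont/geosparql#wktLiteral"

def to_turtle_literal(element: str) -> str:
    # The whole "Point(lon lat)" string, parentheses included,
    # becomes the lexical value of a typed Turtle literal.
    return f'"{element}"^^<{GEO_WKT}>'

print(to_turtle_literal("Point(123.4567 89.0123)"))
# "Point(123.4567 89.0123)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>
```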

@Yueqiao12Zhang (Contributor Author):
A few general comments:

  1. There are some commits in the history that I don't understand. What does test: update based on change mean? It is very general.... Also, "delete tunes.csv". I don't see any change to "tunes.csv".

My thinking is that when you see two consecutive commits, the test output was changed based on the previous commit.

  1. If you are going to infer these types of data types element by element, then I think it needs to be documented somewhere that when reconciling, you need to put geographic data in this format and date-time data in the other format.

They are documented in the reconciliation process.

  1. What do you think about the fact that we are essentially inferring data type on an element by element basis rather than a column by column basis? Are there situations where two elements in the same column will have different data types? Probably out of scope for this PR, but the more different types of inferences that you are doing based on some pretty contingent things (for example, could there be a non-geo data element that just happens to begin with "Point("?), the more I feel like this should be done on a column by column basis.

We format the coordinates in the reconciliation process, so I think it's standardized.

@dchiller (Contributor) commented Aug 23, 2024:

My thinking is that when you see two consecutive commits, the test output was changed based on the previous commit.

The best practice would be to reference the commit: test: update out_rdf.ttl based on cf33751

They are documented in the reconciliation process.

Where? Since this script applies to all DBs, this kind of thing should be in the documentation for this script.

We format the coordinates in the reconciliation process, so I think it's standardized.

I don't think this addresses my question. Is there going to be a situation where the output of the reconciliation process (in other words, the reconciled csvs) has a single column with multiple data types?

@Yueqiao12Zhang (Contributor Author):
They are documented in the reconciliation process.

Where? Since this script applies to all DBs, this kind of thing should be in the documentation for this script.

Right now the only place where we have to format datetime and coordinate is in The Sessions DB, so I assume this is a 1/4 chance.

We format the coordinates in the reconciliation process, so I think it's standardized.

I don't think this addresses my question. Is there going to be a situation where the output of the reconciliation process (in other words, the reconciled csvs) has a single column with multiple data types?

One column will always be a single type. I can change the .startswith to another kind of pattern matching. I think that processing by column would be extra work for the operator, since we would have to indicate each column's type manually, so I assume that pattern matching would be simpler.
One idea for processing by column is to apply pattern matching to the entire column to ensure the column is all one type, but that would take too many resources to check.

@dchiller (Contributor):
One column will always be a single type. I can change the .startswith to another kind of pattern matching. I think that processing by column would be extra work for the operator, since we would have to indicate each column's type manually, so I assume that pattern matching would be simpler.
One idea for processing by column is to apply pattern matching to the entire column to ensure the column is all one type, but that would take too many resources to check.

Ok, yes... let's not worry about it now. But wanted to know. Thanks!

Right now the only place where we have to format datetime and coordinate is in The Sessions DB, so I assume this is a 1/4 chance.

I'm not sure what you mean by 1/4 chance. There will be other databases, though, so I would put this in the README for this script.

@Yueqiao12Zhang (Contributor Author):
I'm not sure what you mean by 1/4 chance. There will be other databases, though, so I would put this in the README for this script.

I work mainly on 4 different databases, and only The Session needs reformatting. That's where the 1/4 comes from.

@Yueqiao12Zhang (Contributor Author):
One column will always be a single type. I can change the .startswith to another kind of pattern matching. I think that processing by column would be extra work for the operator, since we would have to indicate each column's type manually, so I assume that pattern matching would be simpler.
One idea for processing by column is to apply pattern matching to the entire column to ensure the column is all one type, but that would take too many resources to check.

Ok, yes... let's not worry about it now. But wanted to know. Thanks!

Right now the only place where we have to format datetime and coordinate is in The Sessions DB, so I assume this is a 1/4 chance.

I'm not sure what you mean by 1/4 chance. There will be other databases, though, so I would put this in the README for this script.

My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

@fujinaga (Member):
My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

Can you give us some examples where "recordings and authors have integers as their names"?

@dchiller (Contributor) commented Aug 26, 2024:

My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

No mistake...just something to think about. Since the csv2rdf/csv2rdf_single_subject.py script is designed for use across all the databases, then it should be documented somewhere for all the databases that we are assuming a canonical datetime format. That way, if we add another database with a different date time format, we know immediately that one of the transformations we need to apply is to the datetime.

I think recordings and authors having integers in their names is a big reason why we want to do this inference of type at the column level, rather than the individual cell level. You'd have to have a recording with a pretty specific title to be caught up in this, of course, but since we already know that the whole column will have the same data type, it seems cleaner just to go ahead and use that knowledge rather than one day have a weird and hard-to-find bug because someone decided to name their song "Point(1,2)" or something.
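A rough sketch of what column-level inference could look like (function names and sample data here are hypothetical, not from the script): sample every value in a column first, then assign one datatype to the whole column, so a single odd-looking cell cannot flip its own type.

```python
import csv
import io

def infer_column_type(values):
    # A column only gets a non-string type if every cell agrees.
    if all(v.isnumeric() for v in values):
        return "xsd:integer"
    if all(v.startswith("Point(") for v in values):
        return "geo:wktLiteral"
    return "xsd:string"

data = 'title,artist\n"Point(1,2)",1651\nMorning Dew,2048\n'
rows = list(csv.DictReader(io.StringIO(data)))
columns = {name: [row[name] for row in rows] for name in rows[0]}
types = {name: infer_column_type(vals) for name, vals in columns.items()}
print(types)
# {'title': 'xsd:string', 'artist': 'xsd:integer'}
```

The song titled "Point(1,2)" stays a string because the rest of the title column vetoes the WKT interpretation; cell-level inference would have mistyped that one cell.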

@Yueqiao12Zhang (Contributor Author):
My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

Can you give us some examples where "recordings and authors have integers as their names"?

This is a recording whose artist is "1651", which is recognized as an integer in the RDF since it does not have quotes.

<https://thesession.org/recordings/3720> a <http://www.wikidata.org/entity/Q49017950> ;
    ns3:Artist <https://thesession.org/recordings/artists/2112>,
        1651 ;
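A small sketch of why this happens (a hypothetical simplification of the script's element-level branch, not the actual code):

```python
def serialize(element: str) -> str:
    # Element-level inference: any all-digit cell is emitted as a bare
    # integer, so the artist name "1651" loses its quotes in the Turtle output.
    if element.isnumeric():
        return element            # e.g. 1651 (typed as an integer)
    return f'"{element}"'         # ordinary names stay quoted strings

print(serialize("1651"))          # 1651 — the name is mistaken for a number

def serialize_with_column(element: str, column_type: str) -> str:
    # Column-aware variant: if the Artist column is known to hold strings,
    # numeric-looking names keep their quotes.
    if column_type == "integer" and element.isnumeric():
        return element
    return f'"{element}"'

print(serialize_with_column("1651", "string"))  # "1651"
```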

@Yueqiao12Zhang (Contributor Author):
My mistake, but many recordings and authors have integers as their names. I would have to consider these instances

No mistake...just something to think about. Since the csv2rdf/csv2rdf_single_subject.py script is designed for use across all the databases, then it should be documented somewhere for all the databases that we are assuming a canonical datetime format. That way, if we add another database with a different date time format, we know immediately that one of the transformations we need to apply is to the datetime.

I think recordings and authors having integers in their names is a big reason why we want to do this inference of type at the column level, rather than the individual cell level. You'd have to have a recording with a pretty specific title to be caught up in this, of course, but since we already know that the whole column will have the same data type, it seems cleaner just to go ahead and use that knowledge rather than one day have a weird and hard-to-find bug because someone decided to name their song "Point(1,2)" or something.

Ok. Do you think there should be some user-defined variables that specify the type of each column? Or should it be automatically recognized?

@dchiller (Contributor):
@candlecao Why are you committing to this branch? If the changes you are making need to be committed to the repo, please make another branch and pull request.

@candlecao (Contributor):
@candlecao Why are you committing to this branch? If the changes you are making need to be committed to the repo, please make another branch and pull request.

Ok, got it. I will be careful next time. I can abandon my commits on this branch this time. Thanks.
