-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datetime place format recognition #159
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few general comments:
- There are some commits in the history that I don't understand. What does
test: update based on change
mean? It is very general.... Also, "delete tunes.csv". I don't see any change to "tunes.csv". - If you are going to infer these types of data types element by element, then I think it needs to be documented somewhere that when reconciling, you need to put geographic data in this format and date-time data in the other format.
- What do you think about the fact that we are essentially inferring data type on an element by element basis rather than a column by column basis? Are there situations where two elements in the same column will have different data types? Probably out of scope for this PR, but the more different types of inferences that you are doing based on some pretty contingent things (for example, could there be a non-geo data element that just happens to begin with "Point("?), the more I feel like this should be done on a column by column basis.
csv2rdf/csv2rdf_single_subject.py
Outdated
@@ -84,6 +86,10 @@ def convert_csv_to_turtle(filenames: List[str]) -> Graph: | |||
obj = Literal(element, datatype=XSD.boolean) | |||
elif element.isnumeric(): | |||
obj = Literal(element, datatype=XSD.integer) | |||
elif element.startswith("Point("): | |||
obj = Literal(element[5:], datatype=WGS.Point) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a closing parenthesis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is element[5:]
further parsed into lat/log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my reconciliation process and documents, it lets OpenRefine to modify the longitude and latitude to the Point(_, _) format. This is a Point format in WGS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is
element[5:]
further parsed into lat/log?
No, they will be in format (_, _) ^^ WGS.point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a closing parenthesis?
It includes both parenthesis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow. If element = "Point(123.123456, 987.987654)"
won't element[5:] = "123.123456, 987.987654)"
? Note the ")" at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's (123.123456, 987.987654)
. The t
in "point" is at index 4, and [5:] starts at index 5, which is the (
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see...I miscounted the index.
So the format for data type WGS.Point includes the parentheses and the comma?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Junjun tested and decided to change to a new namespace, it's changed now:
"Point(xxx.xxxx yyy.yyyy)"^^geo.wktLiteral
Co-authored-by: Dylan Hillerbrand <[email protected]>
Co-authored-by: Dylan Hillerbrand <[email protected]>
I'm thinking if we see two continuous commits, the test output is changed based on the previous commit.
They are documented in the reconciliation process.
We format the coordinates in the reconciliation process, I think it's standardized. |
The best practice would be to reference the commit:
Where? Since this script applies to all DB's, then this kind of thing should be in documentation for this script.
I don't think this addresses my question. Is there going to be a situation where the output of the reconciliation process (in other words, the reconciled csvs) has a single column with multiple data types? |
Right now the only place where we have to format datetime and coordinate is in The Sessions DB, so I assume this is a 1/4 chance.
One column will always be one single type. I can change the .startwith to another pattern matching. I think that processing by column would be extra work for the operator, where we have to indicate the columns with their types manually? Therefore I assume that pattern matching would be simpler. |
Ok, yes... let's not worry about it now. But wanted to know. Thanks!
I'm not sure what you mean by 1/4 chance. There will be other databases, though, so I would put this in the README for this script. |
I work mainly on 4 different databases, and only The Session needs reformat. That's where 1/4 comes from. |
My mistake, but many recordings and authors have integers as their names. I would have to consider these instances |
Can you give us some examples where "recordings and authors have integers as their names"? |
No mistake...just something to think about. Since the I think recordings and authors having integers in their name is a big reason why we want to do this inference of type at the column level, rather than the individual cell level. You'd have to have a recording with a pretty specific title to be caught up in this, of course, but since we already know that the whole column will have the same data type, it seems cleaner just to go ahead and use that knowledge rather than one day have a weird and hard-to-find bug because someone decided to name their some "Point(1,2)" or something. |
This is a recording, it's artist is "1651", which is recognized as an integer in the RDF since it does not have quotes.
|
Ok. Do you think that there should be some user-defined variables that specifies the type of each column? Or should it be automatically recognized? |
Replace ns2:Artist by wdt:P175 and delete the assertion for name space ns2.
This is for supplementing data.
@candlecao Why are you committing to this branch? If the changes you are making need to be committed to the repo, please make another branch and pull request. |
…m/DDMAL/linkedmusic-datalake into datetime-place-format-recognition
Ok, got it. I will be careful next time. I can give up my committing on the branch this time. Thanks. |
Recognize the datetime and place format in CSV, correcting a reconciliation mistake in sessions-csv.csv