Datetime place format recognition #159

Open
wants to merge 42 commits into
base: main
42 commits
cf33751
refactor: reads geographic and datetime
Yueqiao12Zhang Aug 20, 2024
446a43d
test: update based on change
Yueqiao12Zhang Aug 20, 2024
f918248
Delete tunes.csv
Yueqiao12Zhang Aug 20, 2024
7a676f4
Merge branch 'main' into datetime-place-format-recognition
Yueqiao12Zhang Aug 22, 2024
76a8e37
Merge branch 'main' into datetime-place-format-recognition
Yueqiao12Zhang Aug 23, 2024
212d343
refactor: complie pattern matching csv2rdf/csv2rdf_single_subject.py
Yueqiao12Zhang Aug 23, 2024
1991475
refactor: compile pattern csv2rdf/csv2rdf_single_subject.py
Yueqiao12Zhang Aug 23, 2024
88eb0ac
fix: reconciliation add missing countries
Yueqiao12Zhang Aug 30, 2024
fca03ee
Merge branch 'main' into datetime-place-format-recognition
Yueqiao12Zhang Aug 30, 2024
7aaaa8b
merge: fix conflict
Yueqiao12Zhang Aug 30, 2024
8fd007e
gitignore: ignore output since it's too large
Yueqiao12Zhang Aug 30, 2024
86a6aaa
refactor: change output coordinate type
Yueqiao12Zhang Aug 30, 2024
bf0fa91
style: correct syntax
Yueqiao12Zhang Aug 30, 2024
47f2af4
test: correct coordinate, todo: correct datetime
Yueqiao12Zhang Aug 30, 2024
ce6a873
test: update datetime format and empty coordinate
Yueqiao12Zhang Sep 6, 2024
094a3b0
test: change the openrefine history json based on new format change
Yueqiao12Zhang Sep 6, 2024
7f584c0
doc: manual update the procedure for the session reconciling
Yueqiao12Zhang Sep 6, 2024
c9969cb
doc: fix text transform code
Yueqiao12Zhang Sep 6, 2024
ce79903
test: change the openrefine history for updated text transform code
Yueqiao12Zhang Sep 6, 2024
bd34b49
test: events.csv remove empty point coordinate
Yueqiao12Zhang Sep 6, 2024
7fcf81f
test: move to csv2rdf folder for testing
Yueqiao12Zhang Sep 6, 2024
f9f226c
feat: correctly checks for digits, and ignore digits in artist name
Yueqiao12Zhang Sep 6, 2024
e860302
test: add lang tag
Yueqiao12Zhang Sep 6, 2024
76c5ded
fix: correct find_artist filename
Yueqiao12Zhang Sep 6, 2024
9847d33
feat: use python datetime to add day of the week to the rdf
Yueqiao12Zhang Sep 6, 2024
be7b076
test: add day of the week
Yueqiao12Zhang Sep 6, 2024
a937458
Revert "test: add day of the week"
Yueqiao12Zhang Sep 6, 2024
e1a1ff6
feat: use datetime obj to recognize and reformat the datetime string
Yueqiao12Zhang Sep 6, 2024
f40aee7
test: remove the history that adds the "T" in datetime
Yueqiao12Zhang Sep 6, 2024
ce97fa5
test: "T" in datetime removed
Yueqiao12Zhang Sep 6, 2024
0e1da14
refactor: create a set containing string type columns
Yueqiao12Zhang Sep 6, 2024
19166c1
test: full output
Yueqiao12Zhang Sep 6, 2024
c3b5650
refactor: remove lang tag
candlecao Sep 9, 2024
1a791bc
Update csv2rdf_single_subject.py
candlecao Sep 9, 2024
524638a
Update out_rdf.ttl
candlecao Sep 9, 2024
28f3698
Modify the property value for "artist" in mapping.json
candlecao Sep 11, 2024
48d8d29
Update out_rdf.ttl
candlecao Sep 11, 2024
0cd3e4b
Update mapping.json
candlecao Sep 12, 2024
9012a48
Update recordings-csv.csv
candlecao Sep 12, 2024
24b275f
Create recordings-csv_onlyForArtist_wiki.csv
candlecao Sep 12, 2024
02fc059
Update mapping.json
candlecao Sep 12, 2024
486d3f7
Merge branch 'datetime-place-format-recognition' of https://github.co…
candlecao Sep 12, 2024
3 changes: 2 additions & 1 deletion .gitignore
@@ -8,4 +8,5 @@ cantusdb/data/raw/*.csv
 Cantusdb/data/deprecated/*
 acousticbrainz/data/*
 !acousticbrainz/data/reconciled/*
-musicbrainz/data/raw/*
+musicbrainz/data/raw/*
+csv2rdf/thesession/out_rdf.ttl
38 changes: 36 additions & 2 deletions csv2rdf/csv2rdf_single_subject.py
@@ -16,20 +16,41 @@
import sys
import json
import os
+import re
+from datetime import datetime
import validators
from rdflib import Graph, URIRef, Literal, Namespace
-from rdflib.namespace import RDF, XSD
+from rdflib.namespace import RDF, XSD, GEO

# The "type" attribute of each CSV file must be entered in the mapper file in the
# same order as the input in commandline.

DIRNAME = os.path.dirname(__file__)
mapping_filename = os.path.join(DIRNAME, sys.argv[1])
dest_filename = os.path.join(os.path.dirname(mapping_filename), "out_rdf.ttl")
+DT_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")
+STRING_NUM_COLUMN_SETS = {
+    URIRef("https://musicbrainz.org/doc/Recording#Artist")
+}

WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")


+def check_for_num(s: str, t) -> bool:
+    """
+    (str, str) -> bool
+    checks if the string s is a valid integer given the column title.
+    """
+    if "e" in s or "^" in s:
+        return False
+
+    if t in STRING_NUM_COLUMN_SETS:
+        return False
+
+    return s.isdigit()


def convert_csv_to_turtle(filenames: List[str]) -> Graph:
    """
    (List[str]) -> Graph
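
A note on the new helper above: the following is a minimal standalone sketch of why a column-aware digit check differs from a bare `isnumeric()` call. It is not the PR's code. `looks_like_integer` and `STRING_COLUMNS` are hypothetical names, the second column URI is purely illustrative, and plain strings stand in for the `URIRef` predicates so that the snippet needs only the standard library.

```python
# Hypothetical stand-in for STRING_NUM_COLUMN_SETS; plain strings replace URIRef.
STRING_COLUMNS = {"https://musicbrainz.org/doc/Recording#Artist"}


def looks_like_integer(value: str, column: str) -> bool:
    """Return True when value should be typed as an integer literal."""
    if column in STRING_COLUMNS:
        # Columns such as artist names stay strings even when they are all digits.
        return False
    # str.isdigit() rejects vulgar fractions such as "½", which isnumeric() accepts.
    return value.isdigit()


artist = "https://musicbrainz.org/doc/Recording#Artist"
other = "https://example.org/track-number"  # illustrative column only

print(looks_like_integer("311", artist))  # all-digit artist name stays a string
print(looks_like_integer("311", other))   # same text in another column is an integer
print("½".isnumeric(), "½".isdigit())     # the two str methods disagree here
```

The column check comes first so that an all-digit artist name is never typed as `xsd:integer`, which appears to be the intent of the commit "correctly checks for digits, and ignore digits in artist name".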
@@ -86,8 +107,21 @@ def convert_csv_to_turtle(filenames: List[str]) -> Graph:
                     else:
                         if element == "True" or element == "False":
                             obj = Literal(element, datatype=XSD.boolean)
-                        elif element.isnumeric():
+                        elif check_for_num(element, predicates[i]):
                             obj = Literal(element, datatype=XSD.integer)
+                        elif element.startswith("Point("):
+                            obj = Literal(element.upper(), datatype=GEO.wktLiteral)
+                        elif DT_PATTERN.match(element):
+                            datetime_obj = datetime.strptime(
+                                element, "%Y-%m-%d %H:%M:%S"
+                            )
+
+                            day_of_week = datetime_obj.strftime("%A")
+                            day_of_week_obj = Literal(day_of_week)
+                            g.add((key_attribute, URIRef("http://www.wikidata.org/prop/direct/P2894"), day_of_week_obj))
+
+                            day_str = datetime_obj.strftime("%Y-%m-%dT%H:%M:%S")
+                            obj = Literal(day_str, datatype=XSD.dateTime)
                         else:
                             obj = Literal(element)
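
The datetime branch in this hunk can be tried in isolation. The sketch below uses only the standard library and reuses the pattern and format strings from the diff. `normalize_datetime` is a hypothetical helper name, and the `ValueError` guard is an addition that is not in the PR: a string such as `2024-13-40 00:00:00` matches `DT_PATTERN`, yet `datetime.strptime` raises on it.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Same pattern as in the diff: "YYYY-MM-DD HH:MM:SS" separated by a space.
DT_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")


def normalize_datetime(cell: str) -> Optional[Tuple[str, str]]:
    """Return (ISO 8601 string, weekday name), or None (hypothetical helper)."""
    if not DT_PATTERN.match(cell):
        return None
    try:
        parsed = datetime.strptime(cell, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Assumption, not in the PR: skip values with the right shape but an
        # impossible date, such as month 13.
        return None
    # "T" separator for the xsd:dateTime lexical form; %A gives the weekday
    # name in the current locale (English by default).
    return parsed.strftime("%Y-%m-%dT%H:%M:%S"), parsed.strftime("%A")


print(normalize_datetime("2024-09-06 12:30:00"))
print(normalize_datetime("2024-13-40 00:00:00"))
print(normalize_datetime("2024-09-06"))
```

Returning the weekday alongside the reformatted string mirrors how the diff emits two triples for a single datetime cell: the `xsd:dateTime` literal and a plain day-of-week literal.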
