Is there a TSV format? #30
Comments
Hi, thanks for your interest. Currently, we only provide DB files in the SQLite format. If you want to convert them into CSV or TSV files, please try this file, aser_table2tsv.py. Note that this file is not compatible with ASER 2.0, so you need to change the column indices with EVENTUALITY_COLUMNS and RELATION_COLUMNS.
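For anyone adapting that script, a minimal sketch of reading the current column layout straight from the database instead of hard-coding the indices (the Eventualities and Relations table names are assumptions on my part):

```python
# A sketch, not the maintainers' fix: recover the column layout directly
# from the sqlite file instead of hard-coding EVENTUALITY_COLUMNS and
# RELATION_COLUMNS. Table names are assumptions.
import sqlite3

connection = sqlite3.connect("KG.db")
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
eventuality_columns = [
    row[1] for row in connection.execute("PRAGMA table_info(Eventualities);")
]
relation_columns = [
    row[1] for row in connection.execute("PRAGMA table_info(Relations);")
]
print(eventuality_columns)
print(relation_columns)
connection.close()
```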
How much space do the SQLite files occupy? Where can I download them? I'd be happy to try the procedure and, if there is any need, help update the script. I'm interested in this KG because I am the principal author of a scalable graph machine learning library with a Rust backend, and we are looking for large real-world graphs (billions of nodes and edges). The biggest we are using is ClueWeb09 (followed by WikiData), but it is of little interest beyond simple benchmarks.
You can download the files from our webpage. The merged core KG is around 30GB on disk, but if you are only interested in the topology and graph structure, I think the converted CSV/TSV should occupy around 1~2 GB. However, this KG is not a dense graph, and the nodes are in fact text. I personally think it is not very suitable for graph benchmarking.
Thank you for your answer! Indeed, we already have enough large graphs for plain benchmarking, and we'd like something more real-world, such as this. It is great that there is textual information for the nodes, as we use BERT models to create node features for models such as GCNs and the like. I'll download the graph and start processing it into something we can use; if you are interested, I'll keep you posted on what we obtain. Since you and your team are of course the experts on this graph, if we get something interesting out of it I'd be happy to talk about it and potentially collaborate on experimental designs.
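As an illustration of that feature pipeline (a sketch, not our library's actual code; the model name and toy node texts are placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Toy node texts standing in for the graph's textual nodes.
node_texts = ["i go to the store", "the store is closed"]
inputs = tokenizer(node_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool the token embeddings into one feature vector per node.
mask = inputs["attention_mask"].unsqueeze(-1)
node_features = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(node_features.shape)  # torch.Size([2, 768])
```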
I wrote a small Python script to convert the current database version into three TSVs: a node list, an edge type list, and an edge list. In plain TSV format (without further compression), the resulting files weigh 583MB, 4KB, and 354MB respectively. The script first downloads the data from here and then processes it into the documents mentioned above. First of all, the download part (if you need to install downloaders, just run pip install downloaders):

```python
from downloaders import BaseDownloader
downloader = BaseDownloader(auto_extract=False,)
downloader.download([
"https://data.dgl.ai/dataset/ASER/core/KG.tar.0",
"https://data.dgl.ai/dataset/ASER/core/KG.tar.1",
"https://data.dgl.ai/dataset/ASER/core/KG.tar.2",
"https://data.dgl.ai/dataset/ASER/core/KG.tar.3",
"https://data.dgl.ai/dataset/ASER/core/KG.tar.4",
"https://data.dgl.ai/dataset/ASER/core/KG.tar.5",
"https://data.dgl.ai/dataset/ASER/core/KG.tar.6",
"https://data.dgl.ai/dataset/ASER/core/KG.tar.7",
])
```

Then, create the database file from the downloaded archive parts.
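A sketch of this step, assuming the eight KG.tar.* files are consecutive parts of a single split tar archive containing KG.db:

```python
import glob
import shutil
import tarfile

# Assumption: the KG.tar.* parts form one split tar archive.
# Concatenate them in order into a single tar file...
with open("downloads/KG.tar", "wb") as merged:
    for part in sorted(glob.glob("downloads/KG.tar.*")):
        with open(part, "rb") as chunk:
            shutil.copyfileobj(chunk, merged)

# ...and extract its contents (including KG.db) into the downloads folder.
with tarfile.open("downloads/KG.tar") as archive:
    archive.extractall("downloads")
```

Finally, execute the actual processing step:

```python
import sqlite3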
import csv
import os
from tqdm.auto import tqdm
import pandas as pd
def convert_sqlite_to_tsv(
node_list_tsv_path: str,
edge_type_list_tsv_path: str,
edge_list_tsv_path: str,
db_path: str,
threshold: float = 0.5
):
"""Writes out the provided database to the provided TSV file.
Parameters
--------------------------
node_list_tsv_path: str
Path where to write out the node list
edge_type_list_tsv_path: str
Path where to write out the edge type list
edge_list_tsv_path: str
Path where to write out the edge list
db_path: str
Path from where to load the database
threshold: float = 0.5
Minimum weight from which to load the edge
"""
    # We create the directories of the provided paths, if needed.
    for path in (
        node_list_tsv_path,
        edge_type_list_tsv_path,
        edge_list_tsv_path
    ):
        # The `or "."` guards against paths without a directory component.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # We create a dictionary mapping each node to a numeric ID, so that
    # we write out the numeric ID directly rather than the long hash,
    # which would weigh significantly more than just the numeric ID.
    nodes_dictionary = {}
# We will do a similar thing for the edge types below
# once we load them into the header.
with open(edge_list_tsv_path, 'w', newline='') as f:
connection = sqlite3.connect(db_path)
header = list(pd.read_sql_query(
'SELECT * FROM RELATIONS LIMIT 1;',
connection
).columns)
        # We create the dictionary of the edge types,
        # which we will use to map each string edge type
        # to a more succinct numeric edge type ID.
edge_types_dictionary = {
edge_type: edge_type_id
for edge_type_id, edge_type in enumerate(header[3:-1])
}
# We exclude the hash ID of the edge and the last
# column, which is the co-occurrence weight.
header = header[1:-1]
rows_number = next(connection.execute('SELECT COUNT(*) FROM RELATIONS;'))[0]
query = connection.execute(
'SELECT {} FROM RELATIONS;'.format(", ".join(header))
)
        # We rename the first two columns to more
        # intuitive names.
header[0] = "subject"
header[1] = "object"
# We create the TSV writer
tsv_writer = csv.writer(f, delimiter='\t')
# And write the header of the TSV writer
# Specifically, the 4 columns are:
tsv_writer.writerow([header[0], "category", header[1], "weight"])
        # We start to iterate over the database rows.
for row in tqdm(
(dict(zip(header, row)) for row in query),
desc="Writing out ASER edge list",
total=rows_number,
leave=False,
dynamic_ncols=True
):
for edge_type, edge_type_id in edge_types_dictionary.items():
# We retrieve the edge weight of the current edge type
edge_weight = row[edge_type]
# If the weight is higher than the provided threshold
                if edge_weight > threshold:
                    # We retrieve the numeric node IDs corresponding
                    # to the subject and object of this line.
subject_node_id = nodes_dictionary.setdefault(row["subject"], len(nodes_dictionary))
object_node_id = nodes_dictionary.setdefault(row["object"], len(nodes_dictionary))
# Then we finally write out the line.
tsv_writer.writerow([
subject_node_id,
edge_type_id,
object_node_id,
edge_weight,
])
connection.close()
# Now we write out the node list
with open(node_list_tsv_path, 'w', newline='') as f:
# We create the TSV writer for the node list
tsv_writer = csv.writer(f, delimiter='\t')
# And write the header of the node file TSV writer
# Specifically, the 2 columns are:
tsv_writer.writerow(["node_name", "node_id"])
for node_name, node_id in tqdm(
nodes_dictionary.items(),
desc="Writing out ASER node list",
total=len(nodes_dictionary),
leave=False,
dynamic_ncols=True
):
tsv_writer.writerow([node_name, node_id])
# Now we write out the edge type list
with open(edge_type_list_tsv_path, 'w', newline='') as f:
# We create the TSV writer for the edge type list
tsv_writer = csv.writer(f, delimiter='\t')
        # And write the header of the edge type list file TSV writer
# Specifically, the 2 columns are:
tsv_writer.writerow(["edge_type_name", "edge_type_id"])
for edge_type_name, edge_type_id in tqdm(
edge_types_dictionary.items(),
desc="Writing out ASER edge type list",
total=len(edge_types_dictionary),
leave=False,
dynamic_ncols=True
):
            tsv_writer.writerow([edge_type_name, edge_type_id])
```

You can run the function above like this; the processing should complete in under 10 minutes on a normal machine:

```python
convert_sqlite_to_tsv(
node_list_tsv_path = "ASER/node_list.tsv",
edge_type_list_tsv_path = "ASER/edge_type_list.tsv",
edge_list_tsv_path = "ASER/edge_list.tsv",
db_path = "downloads/KG.db"
)
```

The first three lines of the resulting files look like these:
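They can be reproduced from the generated files with a short pandas snippet (my own addition, reusing the paths from the call above):

```python
# Preview the first three rows of each generated TSV.
import pandas as pd

for path in (
    "ASER/node_list.tsv",
    "ASER/edge_type_list.tsv",
    "ASER/edge_list.tsv",
):
    print(path)
    print(pd.read_csv(path, sep="\t", nrows=3))
```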
Some further questions:

- I was under the impression that there was textual data for the nodes: where can I retrieve it? Is it in some other table in the database?
- Are there any other features for edges, nodes, or edge types? For instance, do the nodes have categories? If so, where can I find them?

Again, thank you!
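In the meantime, this is how I plan to poke around the database for answers (a sketch; I only know the RELATIONS table for sure, every other name would come from introspection):

```python
# List every table in the sqlite file and peek at one row of each,
# to find where the textual node data and any extra features live.
import sqlite3

connection = sqlite3.connect("downloads/KG.db")
tables = [
    row[0]
    for row in connection.execute("SELECT name FROM sqlite_master WHERE type='table';")
]
for table in tables:
    print(table, connection.execute(f"SELECT * FROM {table} LIMIT 1;").fetchone())
connection.close()
```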
I've already seen that there is a
Hi, @LucaCappelletti94! I'm also looking for the TSV file (with the textual data for nodes and edges). Could you please provide the processed TSV file? I would be extremely grateful.
Hello and thank you for making this resource!
Is there a CSV or TSV version of the node list and edge list of this knowledge graph? How does one get it in any format?
Thank you!