Restrict DOI ingestion to specific file formats and structures #15

FrancisTembo · 2024-11-11T19:54:00Z

The current implementation of the DOI ingestion function reads the input file line by line, regardless of file format or structure, so any file type is accepted. Current code:

with open(args.list_of_dois, "r") as csv_file:
    for line in csv_file:
        list_of_dois.append(line.strip())

Problem: This code does not validate the file type, so it will try to process any input file (e.g., .txt, .csv, .json, .yaml). It works for line-based formats, but a file with a different format or structure is still read as if every line were a DOI entry, which could lead to issues downstream.

Also, if an invalid .csv file is passed, the pipeline gives no failure feedback: it still reports a Success message.
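A minimal sketch of what stricter ingestion could look like, assuming the accepted formats are limited to .csv and .txt and that an unusable file should raise an error instead of ending in a Success message. The function name `read_doi_lines` and the `ALLOWED_SUFFIXES` set are hypothetical and not part of the current code:

```python
from pathlib import Path

# Hypothetical allow-list; the formats the project wants to accept may differ.
ALLOWED_SUFFIXES = {".csv", ".txt"}


def read_doi_lines(path):
    """Return the non-empty lines of a one-column file, or raise ValueError."""
    path = Path(path)

    # Reject file types outside the allow-list before reading anything.
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")

    entries = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip()
            if entry:  # skip blank lines
                entries.append(entry)

    # Fail loudly so the caller cannot report success on an empty input.
    if not entries:
        raise ValueError(f"No entries found in {path}")
    return entries
```

The caller would replace the `with open(...)` loop with `list_of_dois = read_doi_lines(args.list_of_dois)` and let the `ValueError` propagate (or convert it to a non-zero exit code), so a bad input no longer ends in a Success message.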


willu47 · 2024-11-12T12:57:15Z

A headerless, one-column csv/text file is what is expected at the moment. DOIs can be in any format. Legal DOIs are extracted from each row using a regular expression.
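For illustration, row-wise extraction along these lines might look like the sketch below. The pattern is an assumption (a commonly used form for modern DOIs) and is not necessarily the regular expression the pipeline uses; `extract_dois` is a hypothetical helper name:

```python
import re

# Assumption: a commonly used pattern for modern DOIs; the project's regex may differ.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def extract_dois(rows):
    """Return the first DOI found in each row, skipping rows without one."""
    found = []
    for row in rows:
        match = DOI_PATTERN.search(row)
        if match:
            found.append(match.group(0))
    return found


rows = [
    "https://doi.org/10.1000/xyz123",  # URL form
    "doi:10.5281/zenodo.1234567",      # prefixed form
    "not a doi",                       # no DOI, silently skipped
]
print(extract_dois(rows))
```

Rows without a match are skipped silently in this sketch, which is the kind of behaviour that makes a malformed input look like a success; counting the skipped rows and reporting them would give the feedback the issue asks for.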

FrancisTembo added the bug label Nov 11, 2024

FrancisTembo self-assigned this Nov 11, 2024
