Restrict DOI ingestion to specific file formats and structures #15

Open
FrancisTembo opened this issue Nov 11, 2024 · 1 comment

@FrancisTembo
Contributor

The current implementation of the DOI ingestion function processes the input file line by line, regardless of file format or structure. This approach allows any file type to be processed. Current code:

with open(args.list_of_dois, "r") as csv_file:
    for line in csv_file:
        list_of_dois.append(line.strip())

Problem: This code does not validate the file type, so it will try to process any input file (e.g., .txt, .csv, .json, .yaml). It works for line-based formats, but a file with a different format or structure (for example nested JSON or YAML) is still read line by line, so each raw line is appended to the DOI list as if it were a DOI.

Also, if an invalid .csv file is passed, the pipeline has no failure feedback mechanism: it still reports a Success message.
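One possible direction, sketched here only as an illustration: check the file extension before reading and raise an error when the file yields no entries, so the caller cannot report Success on unusable input. The function name `read_doi_lines` and the `ALLOWED_SUFFIXES` set are hypothetical and not part of the current code base; which formats to allow is an assumption.

```python
from pathlib import Path

# Assumption: only line-based .csv and .txt inputs should be accepted.
ALLOWED_SUFFIXES = {".csv", ".txt"}


def read_doi_lines(path):
    """Return the non-empty, stripped lines of a line-based DOI list file.

    Raises ValueError for an unsupported extension or an empty file,
    so the caller can report a failure instead of a Success message.
    """
    path = Path(path)
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(
            f"Unsupported file type {path.suffix!r}; "
            f"expected one of {sorted(ALLOWED_SUFFIXES)}"
        )
    with open(path, "r", encoding="utf-8") as doi_file:
        # Skip blank lines so they are not treated as DOIs.
        lines = [line.strip() for line in doi_file if line.strip()]
    if not lines:
        raise ValueError(f"No entries found in {path}")
    return lines
```

In the existing script this could stand in for the loop over `args.list_of_dois`, with the `ValueError` caught at the top level and turned into a non-zero exit status.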

@FrancisTembo FrancisTembo added the bug Something isn't working label Nov 11, 2024
@FrancisTembo FrancisTembo self-assigned this Nov 11, 2024
@willu47
Contributor

willu47 commented Nov 12, 2024

At the moment, a headerless, one-column CSV/text file is expected. The DOIs in it can be in any format: legal DOIs are extracted from each row using a regular expression.
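As a rough sketch of the per-row extraction described above: the pattern below is a commonly used approximation for modern DOIs and is an assumption, not necessarily the regular expression this project uses. The names `DOI_PATTERN` and `extract_dois` are hypothetical.

```python
import re

# Assumption: a widely used approximation of the DOI syntax,
# not necessarily the pattern used by this pipeline.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(rows):
    """Split rows into (extracted DOIs, rows with no recognisable DOI)."""
    dois, rejected = [], []
    for row in rows:
        match = DOI_PATTERN.search(row)
        if match:
            # Keep only the DOI itself, dropping prefixes such as
            # "https://doi.org/" or "doi:".
            dois.append(match.group(0))
        else:
            rejected.append(row)
    return dois, rejected
```

Returning the rejected rows alongside the DOIs would let the pipeline report a failure, or at least a warning, when a file produces no valid DOIs instead of always printing Success.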


2 participants