From 27e23697c3d68e560b8085da511eedb5408f5c50 Mon Sep 17 00:00:00 2001
From: Ellis Brown
Date: Sun, 30 Jun 2024 14:42:18 -0700
Subject: [PATCH] some more cleanup of the example workflow

---
 dataengine/.gitignore                         |  1 -
 dataengine/README.md                          | 29 +++++++++++++++----
 dataengine/data/.gitignore                    |  6 ++++
 .../example_input_fields_subfields.txt}       |  0
 4 files changed, 29 insertions(+), 7 deletions(-)
 delete mode 100644 dataengine/.gitignore
 create mode 100644 dataengine/data/.gitignore
 rename dataengine/{input_fields_subfields.txt => data/example_input_fields_subfields.txt} (100%)

diff --git a/dataengine/.gitignore b/dataengine/.gitignore
deleted file mode 100644
index adbb97d..0000000
--- a/dataengine/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-data/
\ No newline at end of file
diff --git a/dataengine/README.md b/dataengine/README.md
index 5d07b9d..830b559 100644
--- a/dataengine/README.md
+++ b/dataengine/README.md
@@ -1,31 +1,48 @@
 # Overall Workflow
 
-The workflow consists of a series of Python scripts that should be executed in the following order in the appropriate environment (see requirements.txt):
+## Environment Setup
+Please install the necessary packages using the [`requirements.txt`](requirements.txt) file.
 
-```bash
-#!/bin/bash
-# set your environment / keys
+## Prepare input
+See [Input Specification](#input-specification) for details on how to prepare the input file, and [data/example_input_fields_subfields.txt](data/example_input_fields_subfields.txt) for an example. The example below expects the input file to be named `input_fields_subfields.txt` and placed in the `data` directory, but this can be changed via the environment variables.
+
+
+## Example Workflow
+
+The workflow consists of a series of Python scripts that should be executed in the following order:
+
+
+### 1. set your environment / keys
+```bash
 OPENAI_API_KEY="your_openai_key"
 GOOGLE_API_KEY="your_google_api_key"
 GOOGLE_SE_ID="your_google_search_engine_id"
 USER_AGENT="your_user_agent"
 
 # https://foundation.wikimedia.org/wiki/Policy:User-Agent_policy
 WIKIPEDIA_USER_AGENT="/ ()"
+```
 
-# set args for the scripts
+### 2. set args for the scripts
+```bash
+# input
 DATA_DIR="./data/"
 IN_FILE="${DATA_DIR}/input_fields_subfields.txt"
-TOPICS_DIR="${DATA_DIR}/topics/"
 
+# intermediate output
+TOPICS_DIR="${DATA_DIR}/topics/"
 WIKI_DIR="${DATA_DIR}/wikidata/"
 WIKI_LINKS_DIR="${WIKI_DIR}/wikilinks/"
 WIKI_DATA_DIR="${WIKI_DIR}/data/"
 
+# final output
 IMAGE_DIR="${DATA_DIR}/images/"
 QA_DIR="${DATA_DIR}/qadata/"
 VQA_DIR="${DATA_DIR}/vqa/"
+```
 
+### 3. run the scripts
+```bash
 python generate_topics.py --data_file_path $IN_FILE --output_dir $TOPICS_DIR
 python process_json_files.py --topics_dir $TOPICS_DIR
 python clean_and_rename_files.py --topics_dir $TOPICS_DIR
diff --git a/dataengine/data/.gitignore b/dataengine/data/.gitignore
new file mode 100644
index 0000000..2d64b87
--- /dev/null
+++ b/dataengine/data/.gitignore
@@ -0,0 +1,6 @@
+# ignore everything
+*
+
+# except the following files
+!.gitignore
+!example_input_fields_subfields.txt
diff --git a/dataengine/input_fields_subfields.txt b/dataengine/data/example_input_fields_subfields.txt
similarity index 100%
rename from dataengine/input_fields_subfields.txt
rename to dataengine/data/example_input_fields_subfields.txt