This project extracts, processes, and transforms Midjourney prompts obtained from https://prompthero.com/midjourney-prompts. The goal is a dataset for training language models to write image generation prompts.
The project consists of four Python scripts:
- link_extracter.py
  - Description: Scrapes prompt links from the Midjourney prompts website using Selenium and BeautifulSoup.
  - Dependencies: csv, BeautifulSoup, selenium
- text_extracter.py
  - Description: Fetches text content from the extracted prompt links, with rate limiting and retries.
  - Dependencies: csv, requests, BeautifulSoup, time, HTTPAdapter, Retry
- addsubject.py
  - Description: Identifies main subjects in the prompts using spaCy and NLTK, then replaces placeholders with these subjects.
  - Dependencies: csv, spacy, nltk
- convert2json.py
  - Description: Converts the processed CSV data into a JSON format suitable for training a language model.
  - Dependencies: csv, json
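The rate-limiting and retry behaviour described for text_extracter.py can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `make_session` and `fetch_all`, the retry count, the status codes, and the delay values are all assumptions chosen for the example.

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5, backoff=1.0):
    """Build a requests session that retries transient failures with backoff."""
    retry = Retry(
        total=total_retries,          # assumed value, not taken from the project
        backoff_factor=backoff,       # growing pause between successive retries
        status_forcelist=[429, 500, 502, 503, 504],  # assumed retryable statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch_all(urls, session, delay_seconds=1.0):
    """Fetch each URL in turn and yield (url, html), pausing between requests."""
    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        yield url, response.text
        time.sleep(delay_seconds)  # simple fixed delay as a crude rate limit
```

Mounting the adapter on the session keeps the retry policy in one place, so every `session.get` call inherits it without extra error-handling code in the loop.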
- Clone the Repository:
    git clone [repository_url]
    cd Midjourney-Prompts-Project
- Install Dependencies:
    pip install -r requirements.txt
- Run the Scripts:
    python link_extracter.py
    python text_extracter.py
    python addsubject.py
    python convert2json.py
- Generated Files:
  - prompt_links.csv: Contains the extracted prompt links.
  - partial_prompt_texts.csv: Contains text extracted from the prompt links.
  - prompts_with_subject.csv: Contains prompts with identified subjects.
  - prompts_with_subject.jsonl: JSON Lines file suitable for language model training.
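As a rough picture of the final CSV-to-JSONL step, the sketch below writes each CSV row as one JSON object per line. It assumes the CSV has a header row and simply reuses the header names as JSON keys; the function name `csv_to_jsonl` is hypothetical, and the real convert2json.py may reshape or rename fields.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path):
    """Write each CSV row as one JSON object per line; return the row count."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        # DictReader uses the header row for keys (assumes a header exists).
        for row in csv.DictReader(src):
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For instance, `csv_to_jsonl("prompts_with_subject.csv", "prompts_with_subject.jsonl")` would convert the third generated file into the fourth under these assumptions.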
Explore the dataset on Hugging Face: Midjourney Art Prompts
The generated JSONL file is formatted for fine-tuning language models, for example with Hugging Face's Transformers library. The CSV files can be used for training or fine-tuning other large language models (LLMs).
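Before fine-tuning, it can help to read the JSONL file back and inspect a few records. The helper below is a minimal standard-library sketch; the name `load_jsonl` is hypothetical, and the fields in each record depend on what convert2json.py actually writes.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Calling `load_jsonl("prompts_with_subject.jsonl")` would return one dictionary per prompt, which can then be passed to whichever training pipeline is in use.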
- Thanks to the prompthero.com website for providing the Midjourney prompts.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.