NOTE: This repo has now been rewritten into a general-purpose distributed compute job manager; see below:
- DistCompute Client: TheoCoombes/distcompute-client
- DistCompute Tracker Server: TheoCoombes/distcompute-tracker
A server powering Crawling@Home's effort to filter CommonCrawl with CLIP, building a large-scale image-text dataset.
- Client Repo: TheoCoombes/crawlingathome
- Worker Repo: ARKSeal/crawlingathome-worker
- Live Server: http://crawlingathome.duckdns.org/
- Install requirements
git clone https://github.com/TheoCoombes/crawlingathome-server
cd crawlingathome-server
pip install -r requirements.txt
- Set up Redis
  - Redis Guide
  - Configure your Redis connection URL in config.py.
- Set up SQL database
  - PostgreSQL Guide - follow steps 1-4, naming your database crawlingathome.
  - Install the required Python library for the database you are using. (See the guide linked above.)
  - Configure your SQL connection URL in config.py.
  - In the crawlingathome-server folder, create a new folder named 'jobs', and download this file there.
  - Also create two files there, named closed.json and open_gpu.json, with the text [] stored in both.
  - Also create an extra file there named leaderboard.json, with the text {} stored.
  - Finally, create another file there named shard_info.json with the text {"directory": "https://commoncrawl.s3.amazonaws.com/", "format": ".gz", "total_shards": 8569338} stored.
  - You can then run update_db.py to set up the jobs database. (This may take a while.)
- Install ASGI server
  - From v3.0.0, you are required to start the server using a console command directly from the server backend.
  - You can either use gunicorn or uvicorn. Currently, the main production server uses uvicorn with 12 worker processes.
  - e.g. uvicorn main:app --host 0.0.0.0 --port 80 --workers 12
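The seed files described in the SQL database step can also be created with a short script rather than by hand. The following is a minimal sketch, not part of the repository: the helper name `create_job_files` is hypothetical, and the file names and contents are copied from the instructions above. It does not download the linked file, which still has to be placed in the jobs folder manually.

```python
import json
from pathlib import Path

# File names and contents as listed in the setup steps.
JOB_FILES = {
    "closed.json": [],
    "open_gpu.json": [],
    "leaderboard.json": {},
    "shard_info.json": {
        "directory": "https://commoncrawl.s3.amazonaws.com/",
        "format": ".gz",
        "total_shards": 8569338,
    },
}


def create_job_files(server_dir="."):
    """Create the 'jobs' folder (hypothetical helper) and write missing seed files."""
    jobs_dir = Path(server_dir) / "jobs"
    jobs_dir.mkdir(parents=True, exist_ok=True)
    for name, content in JOB_FILES.items():
        path = jobs_dir / name
        if not path.exists():  # leave existing job state untouched
            path.write_text(json.dumps(content))
    return jobs_dir
```

Calling `create_job_files()` from inside the crawlingathome-server folder should leave the folder ready for update_db.py, once the downloaded file is in place as well.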
As stated in step 4 of installation, you need to run the server using a console command directly from the ASGI server platform:
uvicorn main:app --host 0.0.0.0 --port 80 --workers 12
- Runs the server through Uvicorn, using 12 processes.
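The installation steps mention gunicorn as an alternative but only show the uvicorn command. The sketch below builds the argument list for either server so the two can be compared side by side. The function name `build_server_command` is hypothetical, and the gunicorn variant is an assumption: it relies on gunicorn's `--worker-class` option together with the `uvicorn.workers.UvicornWorker` class, which is one common way to serve an ASGI app through gunicorn and is not confirmed by this README.

```python
def build_server_command(server="uvicorn", host="0.0.0.0", port=80, workers=12):
    """Return the argv list for starting main:app (hypothetical helper)."""
    if server == "uvicorn":
        # Mirrors the command shown above.
        return [
            "uvicorn", "main:app",
            "--host", host,
            "--port", str(port),
            "--workers", str(workers),
        ]
    if server == "gunicorn":
        # Assumption: gunicorn delegates ASGI handling to uvicorn's worker class.
        return [
            "gunicorn", "main:app",
            "--bind", f"{host}:{port}",
            "--workers", str(workers),
            "--worker-class", "uvicorn.workers.UvicornWorker",
        ]
    raise ValueError(f"unsupported server: {server!r}")
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the crawlingathome-server folder, or joined with spaces and typed into a console directly.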