NOTE: This repo has now been rewritten into a general purpose distributed compute job manager, see below:
- DistCompute Client: TheoCoombes/distcompute-client
- DistCompute Tracker Server: TheoCoombes/distcompute-tracker
A client library for Crawling@Home's effort to filter CommonCrawl with CLIP, building a large scale image-text dataset.
- Server Repo: TheoCoombes/crawlingathome-server
- Worker Repo: ARKSeal/crawlingathome-worker
- Live Dashboard: http://crawlingathome.duckdns.org/
Prerequisites:
- Python >= 3.7

As this module will only be used for creating the dataset (short-term), it has not been added to pip. However, installing from source is fairly simple:
```
git clone https://github.com/TheoCoombes/crawlingathome
pip install -r crawlingathome/requirements.txt
```
Now, from the current directory, you can import the module:
```python
import crawlingathome as cah
```
`crawlingathome.init(url="http://crawlingathome.duckdns.org/", nickname=None, type="HYBRID") -> Client`
Creates and returns a new client instance.
- `url`: the Crawling@Home server URL
- `nickname`: the user's nickname (for the leaderboard)
- `type`: the type of worker, from `"HYBRID"`, `"CPU"` & `"GPU"`
    - You can also use the classes instead of a string, e.g. `crawlingathome.core.CPUClient` instead of `"CPU"` (see the sketch below)
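For illustration, a minimal sketch of both forms (`nickname` defaults to `None` and is omitted here):

```python
import crawlingathome as cah

# Using the string shorthand
client = cah.init(url="http://crawlingathome.duckdns.org/", type="CPU")

# Equivalent, passing the class directly instead of a string
client = cah.init(url="http://crawlingathome.duckdns.org/", type=cah.core.CPUClient)
```
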
`crawlingathome.dump(client) -> dict`
Dumps a client into a dictionary, so that it can be loaded externally. (see below)

`crawlingathome.load(**kwargs) -> Client`
Loads an existing client using dumped data passed as kwargs, returning a client instance. (see above)

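For illustration, a minimal sketch of the dump/load round trip (the server URL is a placeholder):

```python
import crawlingathome as cah

client = cah.init(url="https://example.com", type="HYBRID")

# Serialise the client's state into a plain dictionary
data = cah.dump(client)

# ... store `data`, or hand it to another process ...

# Recreate the client from the dumped data
client = cah.load(**data)
```
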
Example usage:

```python
import crawlingathome as cah

client = cah.init(
    url="https://example.com",
    nickname="TheoCoombes",
    type="HYBRID"
)

while client.jobCount() > 0 and client.isAlive():
    client.newJob()
    client.downloadShard()

    # Saved shard at ./shard.wat

    while processing_shard:  # placeholder for your own processing loop
        # ... process data ...

        client.log("Completed x / y images")  # updates the client's progress on the server

    client.completeJob(num_pairs_found)  # num_pairs_found: the total pairs you scraped

client.bye()
```

`HybridClient.jobCount() -> int`
Finds the number of available Hybrid/CPU jobs on the server, returning an integer.

`HybridClient.newJob()`
Sends a request to the server, asking for a new job.

`HybridClient.downloadShard()`
Downloads the current job's shard to the current directory (`./shard.wat`).

`HybridClient.completeJob(total_scraped)`
Marks the current job as done on the server, submitting the total number of alt-text pairs scraped. (`_markjobasdone()` will be removed in future clients; use this instead.)
- `total_scraped` (required): the number of alt-text pairs scraped for the current job

`HybridClient.log(progress)`
Logs the string `progress` to the server.
- `progress` (required): a string detailing the progress, e.g. `"12 / 100 (12%)"`

`HybridClient.isAlive() -> bool`
Returns `True` if this client is still connected to the server, otherwise returns `False`.

`HybridClient.dump() -> dict`
Client-side wrapper for `crawlingathome.dump(client)`.

`HybridClient.bye()`
Removes the node instance from the server, ending all current jobs.

`HybridClient.shard`
The URL to the current shard.

`HybridClient.start_id`
The starting ID. Type: `np.int64`.

`HybridClient.end_id`
The ending ID, 1 million more than the starting ID. Type: `np.int64`.

`HybridClient.shard_piece`
The 'shard' of the chunk, either 0 (first 50%) or 1 (last 50%).
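As an illustration, a job's assignment could be inspected like this (a sketch, assuming the attribute names above):

```python
client.newJob()

# Inspect the shard assignment for the new job
print(client.shard)                    # URL of the current shard
print(client.start_id, client.end_id)  # ID range; end is start + 1 million
print(client.shard_piece)              # 0 = first 50% of the chunk, 1 = last 50%
```
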
The CPU client is programmatically similar to `HybridClient`, with only a differing upload function:
`CPUClient.completeJob(download_url)`
Marks the current job as done on the server and submits the download URL from which GPU workers can pull the generated .tar file.
- `download_url` (required): the URL to download the shards
    - As this is a string, it could theoretically be anything: for example, an IP address to pull directly from the worker, a Google Drive link, etc. (see the sketch below)
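For example, a CPU worker's hand-off might look like this (a sketch; `upload_tar` is a hypothetical helper that uploads the archive and returns a reachable URL):

```python
import crawlingathome as cah

client = cah.init(url="https://example.com", type="CPU")

client.newJob()
client.downloadShard()

# ... filter the shard and pack the results into result.tar ...

# upload_tar is hypothetical: upload the archive anywhere GPU workers
# can reach (a drive link, the worker's own IP, etc.) and return the URL
download_url = upload_tar("result.tar")

# Submit the URL so GPU workers can pull the generated .tar
client.completeJob(download_url)
```
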
Similarly to the CPU client, the GPU client is programmatically similar to `HybridClient`, but with a differing `downloadShard()` function, a changed `shard` variable and a new `invalidURL()` method:
`GPUClient.downloadShard(path)`
Extracts the .tar file received from CPU workers into the path `path`, creating the directory if necessary.

`GPUClient.invalidURL()`
Flags a GPU job's URL as invalid to the server, moving the job back into open jobs.

`GPUClient.shard`
Instead of being a CommonCrawl URL as before, this is the string the CPU client uploaded via `CPUClient.completeJob(...)`.
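For instance, a GPU worker might guard the hand-off like this (a sketch; catching a broad `Exception` here is an assumption, not documented behaviour):

```python
try:
    # client.shard now holds the string the CPU worker uploaded
    client.downloadShard("./job")
except Exception:
    # The CPU-supplied URL was unreachable or invalid: flag it so the
    # job is moved back into the open job pool
    client.invalidURL()
```
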
GPUClient jobs are dynamically created, meaning the GPU client relies on CPU clients to generate jobs for it. Because of this, there may be periods when your worker(s) have no jobs to fulfil. You can prepare for this by making use of the `GPUClient.jobCount()` function, as well as wrapping the `newJob()` call in a try/except, as shown in the sketch below.
`GPUClient.newJob()` raises a `crawlingathome.errors.ZeroJobError` when there are no jobs to fulfil.
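Putting this together, a minimal polling loop could look like the sketch below (the 30-second sleep is an arbitrary choice):

```python
import time
import crawlingathome as cah
from crawlingathome.errors import ZeroJobError

client = cah.init(url="https://example.com", type="GPU")

while client.isAlive():
    if client.jobCount() == 0:
        # No CPU-generated jobs available yet; wait before polling again
        time.sleep(30)
        continue

    try:
        client.newJob()
    except ZeroJobError:
        # Another worker claimed the last job between the count and the request
        continue

    client.downloadShard("./job")

    # ... run inference on the extracted data and complete the job as usual ...

client.bye()
```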