The assignment focuses on creating an efficient system that processes job listings and infers the seniority of each role on a custom scale. The main challenge is setting up a caching layer to reduce expensive model computation.
8M postings per day -> ~5,600 per minute -> ~93 per second, so we are well under the processing limit of 1,000 per second. Even so, in the interest of efficiency and cost, implementing a caching service is important.
Expected gRPC lookups: 20M unique (company, title) combinations / 1.8B total jobs scraped ≈ 1.11% of listings. With a caching mechanism, only these unique pairs ever reach the model, reducing gRPC calls by roughly 98.9% and significantly speeding up processing.
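The snippet below simply reproduces this back-of-the-envelope arithmetic, using the figures quoted above:

```python
daily_postings = 8_000_000
per_minute = daily_postings / (24 * 60)        # ~5,556 postings per minute
per_second = per_minute / 60                   # ~93 postings per second

total_scraped = 1_800_000_000                  # 1.8B historical postings scraped
unique_pairs = 20_000_000                      # 20M unique (company, title) combinations
grpc_fraction = unique_pairs / total_scraped   # ~1.11% of listings need a model call
cache_savings = 1 - grpc_fraction              # ~98.9% of gRPC calls avoided by caching

print(f"{per_second:.0f}/s, {grpc_fraction:.2%} model calls, {cache_savings:.2%} saved")
```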
Considering the performance requirements, a Redis database would provide the best performance, with sub-millisecond lookup times.
Considering ~20M unique pairs with an estimated average key length of 50 characters and seniority levels stored as integers, each record takes about 100 bytes of storage including overhead. That totals a ~2GB database, giving us plenty of room to scale in the future.
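A quick sizing check under the same assumptions (50-character keys, integer values, ~100 bytes per record with overhead):

```python
unique_pairs = 20_000_000
bytes_per_record = 100    # ~50-char key + integer value + Redis overhead (assumed)
total_bytes = unique_pairs * bytes_per_record

print(f"Estimated cache size: {total_bytes / 1024**3:.1f} GiB")  # ~1.9 GiB
```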
The Redis database can be backed up to S3 daily so that data can be restored in case of failure. This is important because Redis is an in-memory database and is susceptible to data loss.
If cost is a major concern, we have the option to self-host a Redis instance or use a NoSQL database like DynamoDB, which would be slightly slower but cheaper to operate.
- Read the JSONL files containing the raw job postings from `s3://rl-data/job-postings-raw/`, triggered either by an S3-invoked Lambda or by an SQS queue. The handler executes the entire pipeline of steps below (illustrative sketches follow this list). Sample code implementing the invoked Lambda job can be found in the `lambda_handler` function in sample.py.
- Read each JSONL file to extract the company and post title. Sample code can be found in the `read_jsonl_file_from_s3` function in sample.py.
- De-duplicate the listings so that each unique (company, title) pair is processed only once, minimizing redundant cache lookups and gRPC calls. Sample code can be found in the `deduplicate_job_postings` function in sample.py.
- Check the Redis cache for each unique (company, title) pair, so the model is only invoked for pairs that are not already cached. Sample code can be found in the `check_cache` function in sample.py.
- For each (company, title) pair not present in Redis, issue a batch gRPC call to the Seniority model. A sample gRPC proto, code, and Dockerfile to deploy the gRPC server can be found in the seniority_grpc folder.
- Add the newly scored (company, title) pairs to the Redis cache for future lookups. Sample code can be found in the `update_cache` function in sample.py.
- Combine the results from cache hits and gRPC calls and augment the original job listings by adding a `seniority` key to each line. Sample code can be found in the `augment_job_postings` function in sample.py.
- Finally, write the results with the seniority info to `s3://rl-data/job-postings-mod/`. We have implemented an efficient caching mechanism to improve execution efficiency and reduce costs.
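For reference, here is a minimal sketch of how the Lambda entry point, the S3 JSONL reader, and the de-duplication step could fit together. The function names mirror those referenced above, but the bodies are illustrative assumptions (boto3 for S3 access, an S3 event payload), not the exact contents of sample.py.

```python
import json

import boto3

s3 = boto3.client("s3")


def read_jsonl_file_from_s3(bucket: str, key: str) -> list[dict]:
    """Download a JSONL object and parse each non-empty line into a dict."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def deduplicate_job_postings(postings: list[dict]) -> list[tuple[str, str]]:
    """Return the unique (company, title) pairs across all postings."""
    return list({(p["company"], p["title"]) for p in postings})


def lambda_handler(event, context):
    """Entry point invoked per raw JSONL object (S3 event shape assumed here)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        postings = read_jsonl_file_from_s3(bucket, key)
        pairs = deduplicate_job_postings(postings)

        # Remaining steps (cache check, gRPC scoring, cache update,
        # augmentation, write-back) are sketched in the blocks below.
        return {"postings": len(postings), "unique_pairs": len(pairs)}
```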
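A similar sketch for the caching and scoring steps, assuming a redis-py client and a generated gRPC stub. The proto-derived names (`SeniorityBatchRequest`, `SeniorityModelStub`, `InferBatch`) are placeholders; the actual definitions live in the seniority_grpc folder.

```python
import grpc
import redis

# Hypothetical generated modules; real names come from the proto in seniority_grpc.
from seniority_pb2 import SeniorityBatchRequest, SeniorityRequest
from seniority_pb2_grpc import SeniorityModelStub

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def cache_key(company: str, title: str) -> str:
    """Build the Redis key for a (company, title) pair."""
    return f"{company}|{title}"


def check_cache(pairs):
    """Look up all pairs with a single MGET; return (hits, misses)."""
    values = r.mget([cache_key(c, t) for c, t in pairs])
    hits, misses = {}, []
    for pair, value in zip(pairs, values):
        if value is None:
            misses.append(pair)
        else:
            hits[pair] = int(value)
    return hits, misses


def score_missing_pairs(target: str, pairs):
    """Send the cache misses to the Seniority model in one batched gRPC call."""
    with grpc.insecure_channel(target) as channel:
        stub = SeniorityModelStub(channel)
        request = SeniorityBatchRequest(
            items=[SeniorityRequest(company=c, title=t) for c, t in pairs]
        )
        response = stub.InferBatch(request)
    return {pair: item.seniority for pair, item in zip(pairs, response.items)}


def update_cache(scored):
    """Write newly scored pairs back to Redis in a single pipeline."""
    pipe = r.pipeline()
    for (company, title), seniority in scored.items():
        pipe.set(cache_key(company, title), int(seniority))
    pipe.execute()
```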
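Finally, a sketch of the augmentation and write-back steps. `augment_job_postings` mirrors the name referenced above, while `write_augmented_postings` is a hypothetical helper; the bucket and prefix come from the output path mentioned earlier.

```python
import json

import boto3

s3 = boto3.client("s3")


def augment_job_postings(postings, seniority_by_pair):
    """Attach the inferred seniority level to each original posting."""
    for posting in postings:
        pair = (posting["company"], posting["title"])
        posting["seniority"] = seniority_by_pair[pair]
    return postings


def write_augmented_postings(postings, source_key):
    """Write the augmented postings as JSONL to the output prefix."""
    body = "\n".join(json.dumps(p) for p in postings)
    s3.put_object(
        Bucket="rl-data",
        Key=f"job-postings-mod/{source_key.rsplit('/', 1)[-1]}",
        Body=body.encode("utf-8"),
    )
```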
- Set up SQS polling every minute to line up with newly arriving job postings.
- If the Seniority model is updated, the Redis database can be cleared or repopulated through an ad hoc run.
- Move old raw JSONL files to cold storage after the required retention period to reduce storage costs.
- Hash (company, title) pairs to reduce key size and storage costs (a minimal sketch follows below).
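If key hashing is adopted, a short stable digest could replace the raw strings as the Redis key. This is one possible scheme, not a committed design; at ~20M keys, a 64-bit digest keeps collisions negligible.

```python
import hashlib


def hashed_cache_key(company: str, title: str) -> str:
    """Derive a compact, stable Redis key from a (company, title) pair."""
    raw = f"{company.strip().lower()}|{title.strip().lower()}"
    # 16 hex chars (~64 bits) instead of ~50 raw characters per key.
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]
```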