It is a fun way to assess your data skills. It is also a good representative sample of the work we do at Rearc.
- Data management / data engineering concepts.
- Programming language (Python, Java, Scala, etc.).
- AWS knowledge (Lambda, SQS, CloudWatch Logs).
- Infrastructure-as-code (Terraform, CloudFormation, etc.).
This quest consists of 4 different parts. Putting all 4 parts together we will have a Data Pipeline architecture.
- Part 1 and Part 2 will showcase your skills with data management, AWS concepts, and your overall data engineering skillset. The goal is to source data from different places and store it in-house.
- Part 3 will showcase your data analytics skills. The goal is to find some interesting insights with data.
- Lastly, Part 4 will put all the pieces together. The goal here is to showcase your experience with automation and AWS services.
- Republish this open dataset in Amazon S3 and share with us a link.
- You may run into 403 Forbidden errors as you test accessing this data. There is a way to comply with the BLS data access policies and regain access to fetch this data programmatically. We have included some hints on how to do this in the Q/A section at the bottom of this README.
- Script this process so the files in the S3 bucket are kept in sync with the source when data on the website is updated, added, or deleted.
- Don't rely on hard-coded names. The script should be able to handle added or removed files.
- Ensure the script doesn't upload the same file more than once.
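
The sync requirements above can be sketched as a pure planning step, kept separate from any network or S3 calls. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes you have already built two mappings from file name to a fingerprint (for example a content hash or the source's last-modified timestamp), one for the source listing and one for the bucket. `SyncPlan`, `plan_sync`, and the example file names other than `pr.data.0.Current` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    upload: List[str]  # new or changed files to copy into the bucket
    delete: List[str]  # files in the bucket that no longer exist at the source
    skip: List[str]    # files already in the bucket with the same fingerprint


def plan_sync(source: Dict[str, str], bucket: Dict[str, str]) -> SyncPlan:
    """Compare name -> fingerprint mappings and decide what to do per file."""
    upload = sorted(name for name, fp in source.items() if bucket.get(name) != fp)
    delete = sorted(name for name in bucket if name not in source)
    skip = sorted(name for name, fp in source.items() if bucket.get(name) == fp)
    return SyncPlan(upload, delete, skip)


if __name__ == "__main__":
    # Hypothetical listings: one changed file, one unchanged, one new, one removed.
    source = {"pr.data.0.Current": "v2", "unchanged.txt": "v1", "new.txt": "v1"}
    bucket = {"pr.data.0.Current": "v1", "unchanged.txt": "v1", "removed.txt": "v1"}
    print(plan_sync(source, bucket))
```

Because file names come from the listings rather than from constants, added and removed files are handled without code changes, and the `skip` list is what keeps an unchanged file from being uploaded twice.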
- Create a script that will fetch data from this API. You can read the documentation here
- Save the result of this API call as a JSON file in S3.
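
One way to structure the two steps above is shown below. It is a sketch rather than a reference solution: the API URL, bucket name, and object key are left as parameters because they are not fixed here, and `fetch_json` and `save_json_to_s3` are hypothetical helper names. The `s3_client` argument is assumed to be a boto3 S3 client (or anything exposing a compatible `put_object` method).

```python
import json
import urllib.request
from typing import Any


def fetch_json(url: str, user_agent: str) -> Any:
    """GET a URL and decode the JSON response body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def save_json_to_s3(payload: Any, s3_client: Any, bucket: str, key: str) -> int:
    """Serialize the payload and write it as one JSON object; return bytes written."""
    body = json.dumps(payload).encode("utf-8")
    # put_object overwrites any existing object stored under the same key.
    s3_client.put_object(
        Bucket=bucket, Key=key, Body=body, ContentType="application/json"
    )
    return len(body)
```

With boto3 installed, the client would typically come from `boto3.client("s3")`. Passing the client in as an argument keeps the function easy to exercise with a stub during local testing.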
- Load both the CSV file from Part 1 (`pr.data.0.Current`) and the JSON file from Part 2 as dataframes (Spark, PySpark, Pandas, Koalas, etc.).
- Using the dataframe from the population data API (Part 2), generate the mean and the standard deviation of the annual US population across the years [2013, 2018] inclusive.
- Using the dataframe from the time-series (Part 1), for every series_id, find the best year: the year with the max/largest sum of "value" for all quarters in that year. Generate a report with each series id, the best year for that series, and the summed value for that year. For example, if the table had the following values:

  | series_id   | year | period | value |
  |-------------|------|--------|-------|
  | PRS30006011 | 1995 | Q01    | 1     |
  | PRS30006011 | 1995 | Q02    | 2     |
  | PRS30006011 | 1996 | Q01    | 3     |
  | PRS30006011 | 1996 | Q02    | 4     |
  | PRS30006012 | 2000 | Q01    | 0     |
  | PRS30006012 | 2000 | Q02    | 8     |
  | PRS30006012 | 2001 | Q01    | 2     |
  | PRS30006012 | 2001 | Q02    | 3     |

  the report would generate the following table:

  | series_id   | year | value |
  |-------------|------|-------|
  | PRS30006011 | 1996 | 7     |
  | PRS30006012 | 2000 | 8     |

- Using both dataframes from Part 1 and Part 2, generate a report that will provide the `value` for `series_id = PRS30006032` and `period = Q01` and the `population` for that given year (if available in the population dataset). The below table shows an example of one row that might appear in the resulting table:

  | series_id   | year | period | value | Population |
  |-------------|------|--------|-------|------------|
  | PRS30006032 | 2018 | Q01    | 1.9   | 327167439  |

  Hints: when working with public datasets you sometimes might have to perform some data cleaning first. For example, you might find it useful to trim whitespace before doing any filtering or joins.
- Submit your analysis, your queries, and the outcome of the reports as a .ipynb file.
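
The three reports above come down to a filter plus aggregate, a group-by with a per-group maximum, and a left join. The sketch below spells out that logic with the standard library only, so the intent is easy to check before translating it to your dataframe library of choice. It is not a reference solution: the function names and the tuple row layout are assumptions, ties between years are resolved in favor of the first year encountered, and `statistics.stdev` computes the sample standard deviation (swap in `statistics.pstdev` if you want the population form).

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, int, str, float]  # (series_id, year, period, value)


def population_stats(by_year: Dict[int, int], start: int, end: int) -> Tuple[float, float]:
    """Mean and sample standard deviation for years in [start, end] inclusive."""
    values = [pop for year, pop in by_year.items() if start <= year <= end]
    return statistics.mean(values), statistics.stdev(values)


def best_year_per_series(rows: Iterable[Row]) -> List[Tuple[str, int, float]]:
    """For each series_id, return the year with the largest summed value."""
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    for series_id, year, _period, value in rows:
        # Trim whitespace so padded ids fall into the same group.
        totals[(series_id.strip(), year)] += value
    best: Dict[str, Tuple[int, float]] = {}
    for (series_id, year), total in totals.items():
        if series_id not in best or total > best[series_id][1]:
            best[series_id] = (year, total)
    return [(sid, year, total) for sid, (year, total) in sorted(best.items())]


def join_with_population(
    rows: Iterable[Row], by_year: Dict[int, int], series_id: str, period: str
) -> List[Tuple[str, int, str, float, Optional[int]]]:
    """Filter to one series and period, then attach population where available."""
    report = []
    for sid, year, per, value in rows:
        if sid.strip() == series_id and per.strip() == period:
            # by_year.get returns None for years missing from the population data.
            report.append((series_id, year, period, value, by_year.get(year)))
    return report


if __name__ == "__main__":
    # Rows copied from the example table in the best-year task.
    example = [
        ("PRS30006011", 1995, "Q01", 1), ("PRS30006011", 1995, "Q02", 2),
        ("PRS30006011", 1996, "Q01", 3), ("PRS30006011", 1996, "Q02", 4),
        ("PRS30006012", 2000, "Q01", 0), ("PRS30006012", 2000, "Q02", 8),
        ("PRS30006012", 2001, "Q01", 2), ("PRS30006012", 2001, "Q02", 3),
    ]
    for line in best_year_per_series(example):
        print(line)
```

Run against the example rows, the best years are 1996 (sum 7) and 2000 (sum 8), matching the expected report table.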
- Using AWS CloudFormation, AWS CDK or Terraform, create a data pipeline that will automate the steps above.
- The deployment should include a Lambda function that executes Part 1 and Part 2 (you can combine both in one Lambda function). The Lambda function will be scheduled to run daily.
- The deployment should include an SQS queue that will be populated every time the JSON file is written to S3. (Hint: S3 - Notifications)
- For every message on the queue - execute a Lambda function that outputs the reports from Part 3 (just logging the results of the queries would be enough. No .ipynb is required).
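
For the queue-driven Lambda described above, the handler receives SQS messages whose bodies are S3 event notifications serialized as JSON. The sketch below shows one way to unpack them. It assumes S3 publishes directly to the SQS queue (no SNS topic in between), and `extract_s3_objects` and `handler` are hypothetical names; the report logic itself is left as a placeholder.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from S3 notifications delivered through SQS."""
    objects = []
    for message in sqs_event.get("Records", []):
        notification = json.loads(message["body"])
        # S3 sends an s3:TestEvent message with no "Records" key; skip it.
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys are URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point: log one line per newly written object."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        # Placeholder: run the Part 3 queries here and log their results.
        print(f"New object s3://{bucket}/{key}; running reports")
    return {"processed": len(objects)}
```

Anything printed by the handler ends up in CloudWatch Logs, which is enough to satisfy the "just logging the results" requirement.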
You can do as many as you like. We suspect though that once you start you won't be able to stop. It's addictive.
- Link to data in S3 and source code (Part 1)
- Source code (Part 2)
- Source code in .ipynb file format and results (Part 3)
- Source code of the data pipeline infrastructure (Part 4)
- Any README or documentation you feel would help us navigate your quest.
We have many more for you to solve as a member of the Rearc team!
Do. Or do not. There is no fail.
No.
Hint 1
The BLS data access policies can be found here: https://www.bls.gov/bls/pss.htm

Hint 2
The policy page says:

> BLS also reserves the right to block robots that do not contain information that can be used to contact the owner. Blocking may occur in real time.

How could you add information to your programmatic access requests to let BLS contact you?

Hint 3
Adding a `User-Agent` header to your request with contact information will comply with the BLS data policies and allow you to keep accessing their data programmatically.
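
As a minimal illustration of Hint 3, the sketch below builds a request that carries contact details in its `User-Agent` header using only the standard library. The helper name `build_request`, the example URL, and the exact header format are assumptions; the policy only requires that the header contain information that can be used to contact you.

```python
import urllib.request


def build_request(url: str, contact: str) -> urllib.request.Request:
    """Build a GET request whose User-Agent tells the site how to reach you."""
    # Hypothetical format: any string that includes working contact details.
    user_agent = f"data-quest-sync/0.1 (contact: {contact})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request("https://example.com/data.txt", "you@example.com")
    print(request.header_items())
```

Passing the resulting object to `urllib.request.urlopen` sends the header along with the request.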