AWS Datapipeline Architecture
- The data pipelining and analysis infrastructure for brain data (ECoG/μECoG) uses Amazon Web Services (AWS) tools for storage, processing, and computation.
- All the time series brain data, which are split into one-hour chunks (`.chn` files), are stored in AWS Simple Storage Service (S3). The S3 bucket containing the chunk files acts as a trigger for an AWS Lambda function that extracts the header information from the files and stores it in the backend database, an RDS instance. MySQL is used as the DB engine for RDS.
- The header information stored in RDS contains `objectKey` (file name), `start time`, `stop time`, `patient ID`, and `channel number` for each chunk file.
- S3 also contains a bucket with behavioral data from cognitive experiments. This bucket is a trigger for another AWS Lambda function, which extracts the header information and stores it in a DynamoDB table. The behavioral data header information in the DynamoDB table contains `experiment` (task name), `start time`, `stop time`, `subject` (subject ID), `day` (day ID), and `file name` for the behavioral data.
- Users access the files through command-line utilities and a localhost web application that allow querying (by time, subject ID, experiment name, etc.) against RDS and the DynamoDB table. Matching data can either be downloaded for analysis on local machines or routed to other AWS computing resources such as Apache Spark (running on AWS EC2) for large-scale batch or streaming analysis across multi-machine clusters.
- To optimize the use of the RDS instance, users run a Python script that takes a snapshot of the currently running DB instance and then deletes that instance. Whenever a user wants to access the information in RDS, they restore the DB instance from the snapshot (see the sketch after this list).
- The following image shows the general data flow path along the different components of the data pipeline system.
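The snapshot cycle mentioned above maps onto a handful of boto3 RDS calls. The following is a minimal sketch for orientation, not the actual `run.py`; the instance and snapshot identifiers are placeholders.

```python
# Hypothetical sketch of the snapshot/delete/restore cycle described above.
# Identifiers are placeholders; the real names live in the pipeline's config files.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def snapshot_and_delete(instance_id, snapshot_id):
    """Snapshot the running DB instance, then delete it to save cost."""
    rds.create_db_snapshot(DBSnapshotIdentifier=snapshot_id,
                           DBInstanceIdentifier=instance_id)
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)
    rds.delete_db_instance(DBInstanceIdentifier=instance_id,
                           SkipFinalSnapshot=True)

def restore_from_snapshot(instance_id, snapshot_id):
    """Recreate the DB instance from the most recent snapshot."""
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snapshot_id)
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)
```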
- Log in through the web browser at aws.amazon.com using the IAM user name and password for the assigned AWS account number.
- Alternatively, access AWS using the AWS CLI from the terminal; this requires the AWS access key ID and secret access key to be stored in your bash profile.
- All the code related to the data pipeline system is in the DataPipeline-upload and DataPipeline-download GitHub repositories.
- Download or clone both repositories.
- Before uploading the data, make sure the EC2 instance related to RDS is running and the RDS snapshot is restored. To do so, `cd` into the `DataPipeline-download` folder.
- Copy the `properties.py` file from the `Code/aws-properties-files/download` folder and paste it into the `DataPipeline-download` folder on your local machine. The file contains configurations specific to the EC2 instance and the RDS instance.
- Now, run the following command in the terminal: `python3 run.py`
- This will start the EC2 instance linked with the RDS instance and restore the snapshot.
- While restoring the snapshot, the program will prompt you to enter the date for which you would like to restore the snapshot. Enter the date in `YYYY-MM-DD` format.
- The program will print `The snapshot is completely restored. Now, trying to modify it for easy access.` to indicate that the snapshot has been restored.
- Then, `cd` into the `DataPipeline-upload` folder and run the following script: `python3 edfSplitUpdated.py [edf file location] --s3 s3://edf-chunks-patientID --subject [patient ID] --day [day of the surgery]`. For example, if the surgery happened on January 14th, then day = 14.
- This will upload the chunk files to the respective S3 bucket, while the header information will be stored in the RDS MySQL database.
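For orientation, the upload step ultimately writes one-hour `.chn` chunk files into the patient's `edf-chunks-patientID` bucket, and the `.chn` suffix is what fires the Lambda trigger that populates RDS. The sketch below is not the actual `edfSplitUpdated.py`; it only shows the final upload with boto3, and the chunk key naming is an assumption.

```python
# Minimal sketch of the upload step (not the actual edfSplitUpdated.py).
# The chunk file naming convention here is an assumption for illustration.
import glob
import boto3

s3 = boto3.client("s3")

def upload_chunks(chunk_dir, subject, day):
    """Upload pre-split one-hour .chn chunks to the per-patient bucket."""
    bucket = f"edf-chunks-{subject}"
    for path in sorted(glob.glob(f"{chunk_dir}/*.chn")):
        key = f"{subject}/day{day}/{path.split('/')[-1]}"
        # The .chn suffix matches the s3_rds_data_transfer trigger filter,
        # so each upload fires the Lambda that populates RDS.
        s3.upload_file(path, bucket, key)

# Example: upload_chunks("chunks/", subject="patient01", day=14)
```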
- Log into the AWS website.
- Upload files to the `behavioral-data-all` bucket in S3. This will act as a trigger for a Lambda function, which will then update the DynamoDB table with the relevant information.
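The upload can also be done programmatically instead of through the console; a minimal sketch, where the local file name and object key are placeholders:

```python
# Hedged sketch: uploading a behavioral data file from a script instead of
# the AWS console. File and key names are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file("task_results.json", "behavioral-data-all", "task_results.json")
# The S3 ObjectCreated event triggers s3_dynamo_behavioral, which writes
# the header information into the DynamoDB table.
```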
- EC2 machine for the RDS instance: this instance needs to be running while the RDS snapshot is restored.
  - `t2.micro`
  - `us-east-1` availability zone
  - A very basic EC2 compute resource, as this task doesn't require much compute power.
- EC2 machine for Lambda configuration: this instance is used to create a zip file containing the Python scripts that extract header information from chunk files in the S3 buckets, and to install the dependencies those scripts need. The zip file is then uploaded to the Lambda serverless architecture.
  - `t2.micro`
  - `us-east-1` availability zone
  - A very basic EC2 compute resource, as this task doesn't require much compute power.
- Data for each patient is stored in a separate S3 bucket. Buckets are named `edf-chunks-patientID`. To add data for a new patient, first create an S3 bucket following the same naming convention and link that bucket as a trigger to the `s3_rds_data_transfer` Lambda function. While adding the bucket as a trigger, select `Object Created (All)` as the event type and use `.chn` as the suffix.
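If you prefer to script this instead of clicking through the console, the bucket creation and trigger wiring look roughly like the following. The Lambda ARN and account ID are placeholders, and the `add_permission` call grants S3 the invoke permission that the console normally adds for you.

```python
# Hedged sketch of creating a new per-patient bucket and wiring it to the
# s3_rds_data_transfer Lambda. The ARN and account ID below are placeholders.
import boto3

patient_id = "patient02"                      # hypothetical patient ID
bucket = f"edf-chunks-{patient_id}"
lambda_arn = "arn:aws:lambda:us-east-1:123456789012:function:s3_rds_data_transfer"

s3 = boto3.client("s3", region_name="us-east-1")
lam = boto3.client("lambda", region_name="us-east-1")

s3.create_bucket(Bucket=bucket)

# Allow S3 to invoke the Lambda (the console does this implicitly).
lam.add_permission(
    FunctionName="s3_rds_data_transfer",
    StatementId=f"s3-invoke-{bucket}",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
)

# Equivalent of "Object Created (All)" with suffix ".chn" in the console.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": lambda_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "suffix", "Value": ".chn"}
            ]}},
        }]
    },
)
```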
- RDS instance:
  - DB engine: MySQL
  - DB storage: 5 GB
  - Storage type: Magnetic
  - DB engine version: 5.6.27
  - Port: 3306
  - `us-east-1` availability zone
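These settings correspond roughly to the boto3 call below. This is a hedged reconstruction for reference rather than the command originally used; the instance identifier, instance class, availability zone, and credentials are placeholders.

```python
# Hedged sketch of provisioning an RDS instance with the settings listed above.
# Identifier, credentials, instance class, and AZ are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="brain-data-db",     # placeholder name
    DBInstanceClass="db.t2.micro",            # assumed instance class
    Engine="mysql",
    EngineVersion="5.6.27",
    AllocatedStorage=5,                       # GB
    StorageType="standard",                   # "standard" = magnetic storage
    Port=3306,
    AvailabilityZone="us-east-1a",            # assumed AZ within us-east-1
    MasterUsername="admin",                   # placeholder
    MasterUserPassword="change-me",           # placeholder
)
```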
- To test whether the database is populated with data, you can connect MySQL Workbench to the database.
- To do so, use the RDS instance endpoint as the Hostname when setting up a new connection.
- Use the username and password assigned while creating the RDS instance as the username and password for the new connection.
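The same check can also be scripted. Below is a minimal sketch using `pymysql`; the endpoint, credentials, database name, and table name are placeholders, since the actual schema is defined by the upload code.

```python
# Hedged sketch: verify that chunk headers are landing in RDS.
# Endpoint, credentials, database, and table names are placeholders.
import pymysql

conn = pymysql.connect(
    host="<rds-instance-endpoint>.us-east-1.rds.amazonaws.com",
    user="admin",
    password="change-me",
    database="brain_data",      # assumed database name
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM chunk_headers")   # assumed table name
    print("rows:", cur.fetchone()[0])
conn.close()
```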
- DynamoDB:
  - Provisioned read capacity units: 20
  - Provisioned write capacity units: 20
  - All the information about the behavioral data is stored in one table, which has 6 columns (`experiment`, `start time`, `stop time`, `subject`, `day`, and `file name`).
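For reference, querying the behavioral table from Python looks roughly like the sketch below; the table name and attribute spellings are assumptions based on the fields listed in the overview.

```python
# Hedged sketch: list behavioral files for one subject and experiment.
# Table name, attribute names, and example values are assumptions.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb", region_name="us-east-1").Table("behavioral_data")

resp = table.scan(
    FilterExpression=Attr("subject").eq("patient01") & Attr("experiment").eq("picture_naming")
)
for item in resp["Items"]:
    print(item["file name"], item["start time"], item["stop time"])
```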
- There are 2 functions that run on AWS Lambda:
  - `s3_rds_data_transfer`
  - `s3_dynamo_behavioral`
- There are 3 ways to create a Lambda function: edit code inline, upload a .zip file, or upload a file from Amazon S3. In our case, the `s3_dynamo_behavioral` function was written inline (a hedged sketch of what such a handler might look like appears after the list below). The `s3_rds_data_transfer` function required more dependencies and had one Python file calling a function in another Python file, so a zip file was created with the 2 Python files and all the dependencies. The way to create such a zip file is explained in this link.
- To connect a new S3 bucket with the `s3_rds_data_transfer` function, follow the steps below:
  - Select `S3` as the trigger type.
  - From the list of buckets, select the bucket you want to add as a trigger.
  - Select `Object Created (All)` as the event type.
  - Use `.chn` as the suffix.
  - Do not forget to check the box for Enable Trigger and submit the form.
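To make the inline-versus-zip distinction concrete, here is a rough skeleton of an S3-triggered handler in the style of `s3_dynamo_behavioral`. This is not the actual function: the table name, attribute names, and the assumption that the header fields can be read from the uploaded JSON file are all illustrative.

```python
# Hedged skeleton of an S3-triggered Lambda that writes behavioral header
# information to DynamoDB. Not the actual s3_dynamo_behavioral code; table
# and attribute names are assumptions.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("behavioral_data")  # assumed table name

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Assume the behavioral file is JSON and carries its own header fields.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        header = json.loads(body)

        table.put_item(Item={
            "experiment": header["experiment"],
            "start time": header["start_time"],
            "stop time": header["stop_time"],
            "subject": header["subject"],
            "day": header["day"],
            "file name": key,
        })
```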
- The basic way to download the data is directly through the AWS website.
- However, there are Python scripts in the DataPipeline-download repository, which you can run locally to filter the files and download them based on start and stop time as well as patient ID and task name.
- Make sure the EC2 instance for the RDS instance is running and the RDS snapshot is restored.
- Go into the `web` folder of the DataPipeline-download folder.
- Copy the `rds_config.py` and `dynamo_config.py` files from the `Code/aws-properties-files/download` folder stored on the back-end server and paste them into the `web` folder. These files contain configurations specific to our RDS and DynamoDB system.
- Run `python3 hello.py` in the terminal.
- Open your web browser and go to `http://localhost:5000/`.
- It should open a page with options to download the files based on `patient ID`, `start time`, and `stop time`, as well as `task name`, although the script to download files based on `task name` is not fully functional.
- Select the option to use and proceed to download the files.
- The files will be downloaded into the `web` folder of the DataPipeline-download directory on your local machine. The script will create an `s3_download` folder where the chunk files will be downloaded, and a `behavioral_data` folder inside the `s3_download` folder where the JSON files will be downloaded.
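Under the hood, the filter-and-download step amounts to querying the RDS header table for matching chunks and then pulling those objects from S3. The sketch below is a rough approximation with assumed table, column, and connection names; the real ones live in the DataPipeline-download scripts.

```python
# Hedged sketch of the download logic: query chunk headers in RDS by patient
# and time range, then fetch the matching .chn files from S3 into s3_download/.
# Table, column, and connection details are assumptions.
import os

import boto3
import pymysql

def download_chunks(patient_id, start, stop, out_dir="s3_download"):
    os.makedirs(out_dir, exist_ok=True)
    conn = pymysql.connect(
        host="<rds-instance-endpoint>.us-east-1.rds.amazonaws.com",
        user="admin", password="change-me", database="brain_data",
    )
    with conn.cursor() as cur:
        cur.execute(
            "SELECT objectKey FROM chunk_headers "
            "WHERE patientID = %s AND startTime >= %s AND stopTime <= %s",
            (patient_id, start, stop),
        )
        keys = [row[0] for row in cur.fetchall()]
    conn.close()

    s3 = boto3.client("s3")
    bucket = f"edf-chunks-{patient_id}"
    for key in keys:
        s3.download_file(bucket, key, os.path.join(out_dir, os.path.basename(key)))

# Example: download_chunks("patient01", "2017-01-14 09:00:00", "2017-01-14 12:00:00")
```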