
AWS Datapipeline Architecture


Documentation of AWS Data-pipeline system

Description of the system

  • The data pipelining and analysis infrastructure for Brain data (ECoG/μECoG) uses Amazon Web Services (AWS) tools for storage, processing and computation.

  • All the time series brain data, which are split into one-hour chunks (.chn files), are stored in AWS Simple Storage Service (S3). The S3 bucket containing the chunk files acts as a trigger for an AWS Lambda function that extracts the header information from the files and stores it in the backend database, an RDS instance running MySQL as the DB engine (a minimal sketch of such a handler is included after this list).

  • The header information stored in RDS contains the objectKey (file name), start time, stop time, patient ID, and channel number for each chunk file.

  • S3 also contains a bucket with behavioral data from cognitive experiments. The bucket containing behavioral data is a trigger for another AWS Lambda function, which extracts the header information and stores it in a DynamoDB table. The behavioral data header information in the DynamoDB table contains experiment (task name), start time, stop time, subject (subject ID), day (day ID), and file name for the behavioral data.

  • Users access the files through command-line utilities and a localhost web application that allow querying RDS and the DynamoDB table (by time, subject ID, experiment name, etc.). Matching data can then either be downloaded for analysis on local machines or routed to other AWS computing resources, such as Apache Spark running on AWS EC2, for large-scale batch or streaming analysis on multi-machine clusters.

  • To optimize the use of the RDS instance, users run a Python script that takes a snapshot of the most recent DB instance and deletes the currently running instance. Whenever a user wants to access the information on RDS, they restore the DB instance from the snapshot.

  • The following image shows the general data flow path along the different components of the data pipeline system (figure: aws-dataflow).
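
As a concrete illustration of the chunk-header step described above, the following is a minimal sketch of what an S3-triggered handler in the style of s3_rds_data_transfer could look like. The header parser, the rds_config module, and the chunk_headers table with its column names are assumptions for illustration, not the repository's actual code.

    import boto3
    import pymysql

    import rds_config  # assumed module holding the RDS host, user, password, and DB name

    s3 = boto3.client('s3')


    def parse_chn_header(raw_bytes):
        """Placeholder parser; the real .chn header layout lives in the repository."""
        fields = raw_bytes.decode('utf-8', errors='ignore').split(',')
        return {'start': fields[0], 'stop': fields[1],
                'patient_id': fields[2], 'channel': fields[3]}


    def lambda_handler(event, context):
        # Each record describes one .chn object that landed in the bucket
        for record in event['Records']:
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']

            # Read only the first bytes of the object, enough for the header
            obj = s3.get_object(Bucket=bucket, Key=key, Range='bytes=0-1023')
            header = parse_chn_header(obj['Body'].read())

            # Store the header fields in the RDS MySQL backend
            conn = pymysql.connect(host=rds_config.host, user=rds_config.user,
                                   password=rds_config.password,
                                   database=rds_config.db_name)
            try:
                with conn.cursor() as cur:
                    cur.execute(
                        "INSERT INTO chunk_headers "
                        "(object_key, start_time, stop_time, patient_id, channel) "
                        "VALUES (%s, %s, %s, %s, %s)",
                        (key, header['start'], header['stop'],
                         header['patient_id'], header['channel']))
                conn.commit()
            finally:
                conn.close()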

How to access the system?

  • Log in through the web browser at aws.amazon.com using the IAM user name and password for the assigned AWS account number.
  • Access it using the AWS CLI through the terminal; this requires the AWS access key ID and secret access key to be stored in your bash profile (a short check is sketched below).
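
Once the access key ID and secret access key are exported in the bash profile, boto3 picks them up automatically. The short sketch below just confirms the credentials work; the edf-chunks- prefix comes from the bucket naming convention described later.

    import boto3

    # Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    # in the environment (i.e. the bash profile) or from ~/.aws/credentials.
    session = boto3.session.Session(region_name='us-east-1')
    s3 = session.client('s3')

    # List the patient buckets visible to this IAM user
    for bucket in s3.list_buckets()['Buckets']:
        if bucket['Name'].startswith('edf-chunks-'):
            print(bucket['Name'])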

How to transfer ECoG data?

  • All the code related to the data pipeline system is in the DataPipeline-upload and DataPipeline-download GitHub repositories.
  • Download or clone both repositories.
  • Before uploading the data, make sure the EC2 instance related to RDS is running and the RDS snapshot is restored. To do so, cd into the DataPipeline-download folder.
  • Copy the properties.py file from the Code/aws-properties-files/download folder and paste it into the DataPipeline-download folder on your local machine. The file contains configurations specific to the EC2 and RDS instances.
  • Now, run the following command on the terminal:
    • python3 run.py
    • This will start the EC2 instance linked with the RDS instance and restore the snapshot (a boto3 sketch of this restore step is included after this list).
    • While restoring the snapshot, the program will prompt you to enter the date for which you would like to restore the snapshot. Enter the date in YYYY-MM-DD format.
    • The program will print "The snapshot is completely restored. Now, trying to modify it for easy access." to indicate that the snapshot has been restored.
  • Then, cd into DataPipeline-upload folder and run the following script:
    • python3 edfSplitUpdated.py [edf file location] --s3 s3://edf-chunks-patientID --subject [patient ID] --day [day of the surgery]. For example, if the surgery happened on January 14th, then day = 14.
    • This will upload the chunk files to the corresponding S3 bucket, while the header information is stored in the RDS MySQL database.
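
For reference, the snapshot-restore step performed by run.py could be approximated with boto3 roughly as sketched below. The identifiers (ecog-rds, the EC2 instance ID) and the snapshot naming are placeholders, not the project's real values.

    import boto3

    rds = boto3.client('rds', region_name='us-east-1')
    ec2 = boto3.client('ec2', region_name='us-east-1')


    def restore_for_date(date_str, ec2_instance_id='i-0123456789abcdef0'):
        """Start the helper EC2 instance and restore the RDS snapshot for a date (YYYY-MM-DD)."""
        # Start the t2.micro instance tied to the RDS workflow
        ec2.start_instances(InstanceIds=[ec2_instance_id])

        # Find a snapshot whose identifier ends with the requested date
        snapshots = rds.describe_db_snapshots()['DBSnapshots']
        match = next(s for s in snapshots
                     if s['DBSnapshotIdentifier'].endswith(date_str))

        # Restore a new DB instance from that snapshot
        rds.restore_db_instance_from_db_snapshot(
            DBInstanceIdentifier='ecog-rds',
            DBSnapshotIdentifier=match['DBSnapshotIdentifier'],
            DBInstanceClass='db.t2.micro')

        # Block until the restored instance is reachable
        rds.get_waiter('db_instance_available').wait(DBInstanceIdentifier='ecog-rds')
        print('The snapshot is completely restored.')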

How to transfer behavioral (JSON) data?

  • Log into the AWS website.
  • Upload the files to the behavioral-data-all bucket in S3. This acts as a trigger for the Lambda function, which then updates the DynamoDB table with the relevant information (a sketch of such a handler follows).
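
A handler in the spirit of s3_dynamo_behavioral could look like the sketch below; the table name (behavioral_headers) and the JSON field names are assumptions based on the attributes listed in the description section.

    import json

    import boto3

    s3 = boto3.client('s3')
    table = boto3.resource('dynamodb').Table('behavioral_headers')  # assumed table name


    def lambda_handler(event, context):
        for record in event['Records']:
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']

            # The behavioral files are JSON, so the header fields can be read directly
            data = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())

            # One item per file, mirroring the six attributes stored for behavioral data
            table.put_item(Item={
                'file_name': key,
                'experiment': data.get('experiment'),
                'start_time': data.get('start_time'),
                'stop_time': data.get('stop_time'),
                'subject': data.get('subject'),
                'day': data.get('day'),
            })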

EC2 instances:

  • EC2 machine for the RDS instance: this instance needs to be running while the RDS snapshot is restored
    • t2.micro
    • us-east-1 availability zone
    • a very basic EC2 compute resource, as the task does not require much compute power
  • EC2 machine for Lambda configuration: this instance is used to create the zip file containing the Python scripts that extract header information from the chunk files in the S3 buckets, along with the dependencies those scripts need. The zip file is then uploaded to the Lambda serverless platform.
    • t2.micro
    • us-east-1 availability zone
    • a very basic EC2 compute resource, as the task does not require much compute power

File Storage - S3

  • Data for each patient is stored in a separate S3 bucket. Buckets are named edf-chunks-patientID. To add data for a new patient, first create an S3 bucket following the same naming convention and link that bucket as a trigger to the s3_rds_data_transfer Lambda function. While adding the bucket as a trigger, select Object Created (All) as the event type and use .chn as the suffix (screenshot: create S3 bucket). A scripted sketch of the bucket creation follows.
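
Bucket creation can also be scripted. The sketch below uses a made-up patient ID and only covers the naming convention; wiring the bucket to the Lambda trigger is covered in the Lambda section below.

    import boto3

    s3 = boto3.client('s3', region_name='us-east-1')

    patient_id = 'p001'  # placeholder patient ID
    bucket_name = 'edf-chunks-{}'.format(patient_id)

    # In us-east-1 no CreateBucketConfiguration is required
    s3.create_bucket(Bucket=bucket_name)
    print('Created', bucket_name)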

Database

  • RDS instance:

    • DB-engine: MySQL
    • DB-Storage: 5GB
    • Storage type: Magnetic
    • DB Engine version: 5.6.27
    • Port: 3306
    • us-east-1 availability zone
  • To check whether the database is populated with data, you can connect MySQL Workbench to the database (a scripted check is also sketched after this list).

  • To do so, use the RDS instance endpoint as the Hostname when setting up a new connection.

  • Use the username and password assigned while creating RDS instance as the username and password for the new connection.

  • DynamoDB:

    • Provisioned read capacity units: 20
    • Provisioned write capacity units: 20
    • All the information about the behavioral data is stored in a single table with six columns (experiment, start time, stop time, subject, day, and file name)
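
As an alternative to MySQL Workbench, the populated database can be checked with a short script like the one below. The endpoint, credentials, schema, and table name (chunk_headers) are placeholders; use the values chosen when the RDS instance was created.

    import pymysql

    # Placeholder connection details: use the RDS endpoint as the host and the
    # username/password chosen when the instance was created.
    conn = pymysql.connect(
        host='your-instance.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com',
        user='admin',
        password='secret',
        database='ecog',   # assumed schema name
        port=3306)

    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM chunk_headers")  # assumed table name
        print("header rows:", cur.fetchone()[0])
    conn.close()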

Lambda - Serverless Compute

  • There are 2 functions that run on AWS Lambda:

    • s3_rds_data_transfer
    • s3_dynamo_behavioral
  • There are three ways to create a Lambda function: edit the code inline, upload a .zip file, or upload a file from Amazon S3. In our case, the s3_dynamo_behavioral function was written inline. The s3_rds_data_transfer function required more dependencies and has one Python file calling a function in another Python file, so a zip file was created with the two Python files and all their dependencies. The way to create the zip file is explained in this link.

  • To connect new S3 bucket with s3_rds_data_transfer function, follow the steps below:

    • Select S3 as the trigger type
    • From the list of buckets, select the bucket you want to add as a trigger
    • Select Object Created (All) as event type
    • Use .chn as the suffix
    • Do not forget to check the Enable trigger box and submit the form (screenshot: adding S3 bucket as Lambda trigger). The same steps can be scripted, as sketched below.
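
The trigger can also be wired up with boto3 instead of the console. The sketch below assumes the placeholder bucket already exists and that the IAM user is allowed to call the Lambda API; S3 needs an explicit permission before it can invoke the function.

    import boto3

    s3 = boto3.client('s3')
    lam = boto3.client('lambda')

    bucket = 'edf-chunks-p001'  # placeholder patient bucket
    function_arn = lam.get_function(
        FunctionName='s3_rds_data_transfer')['Configuration']['FunctionArn']

    # Allow this bucket to invoke the function
    lam.add_permission(
        FunctionName='s3_rds_data_transfer',
        StatementId='{}-invoke'.format(bucket),
        Action='lambda:InvokeFunction',
        Principal='s3.amazonaws.com',
        SourceArn='arn:aws:s3:::{}'.format(bucket))

    # Equivalent of "Object Created (All)" with the .chn suffix filter
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration={
            'LambdaFunctionConfigurations': [{
                'LambdaFunctionArn': function_arn,
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {'Key': {'FilterRules': [
                    {'Name': 'suffix', 'Value': '.chn'}]}},
            }]
        })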

How to download ECoG/behavioral data?

  • The basic way to download the data is directly through the AWS website.
  • However, the DataPipeline-download repository contains Python scripts that you can run locally to filter the files and download them based on start and stop time, as well as patient ID and task name.
  • Make sure the EC2 instance for the RDS instance is running and that the RDS snapshot is restored.
  • cd into the web folder of the DataPipeline-download folder.
  • Copy the rds_config.py and dynamo_config.py files from the Code/aws-properties-files/download folder stored on the back-end server and paste them into the web folder. These files contain configurations specific to our RDS and DynamoDB setup.
  • Run python3 hello.py on terminal
  • Open your web browser and enter http://localhost:5000/
    • It should open a page with options to download files by patient ID, start time and stop time, as well as task name, although the script to download files based on task name is not fully functional.
    • Select the option to use and proceed to download the files.
    • The files will be downloaded into the web folder of the DataPipeline-download directory on your local machine. The script will create an s3_download folder where the chunk files are downloaded, and a behavioral_data folder inside s3_download where the JSON files are downloaded (screenshot: localhost web app). A sketch of this query-and-download flow follows.
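
For reference, the kind of query-and-download loop the DataPipeline-download scripts perform might look roughly like the sketch below; the rds_config module, the chunk_headers table, and its column names are assumptions for illustration.

    import os

    import boto3
    import pymysql

    import rds_config  # assumed config module with the RDS endpoint and credentials

    s3 = boto3.client('s3')


    def download_chunks(patient_id, start, stop, dest='s3_download'):
        """Download every chunk for a patient whose time window overlaps [start, stop]."""
        os.makedirs(dest, exist_ok=True)

        # Query the header table in RDS for matching chunk files
        conn = pymysql.connect(host=rds_config.host, user=rds_config.user,
                               password=rds_config.password,
                               database=rds_config.db_name)
        with conn.cursor() as cur:
            cur.execute(
                "SELECT object_key FROM chunk_headers "
                "WHERE patient_id = %s AND start_time < %s AND stop_time > %s",
                (patient_id, stop, start))
            keys = [row[0] for row in cur.fetchall()]
        conn.close()

        # Pull the matching objects from the patient's bucket
        bucket = 'edf-chunks-{}'.format(patient_id)
        for key in keys:
            s3.download_file(bucket, key, os.path.join(dest, key))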