
This is an overview of the AWS SDK for pandas (awswrangler) which is an open-source python library that makes it easier to work with data from AWS services.


masood2iq/AWS-SDK-for-Pandas-awswrangler-or-datawrangler-Overview


AWS SDK for Pandas (awswrangler or datawrangler) Overview

Description

This is an overview of the AWS SDK for pandas (awswrangler), an open-source Python library that makes it easier to work with data from AWS services.

Overview

AWS SDK for pandas, previously named AWS Data Wrangler and referred to below as datawrangler, is an open-source Python library built on top of pandas, Apache Arrow, and Boto3. It offers abstracted functions for common ETL tasks such as loading and unloading data from data lakes, data warehouses, and databases. It integrates with AWS services such as Amazon S3, AWS Glue, Amazon Athena, Amazon DynamoDB, Amazon CloudWatch, Amazon Redshift, Amazon Timestream, and Amazon EMR. For working with data, datawrangler supports reading and writing Excel, JSON, CSV, and Parquet files on S3, interacting with data and metadata through AWS Glue, and running SQL queries on Amazon Athena.
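As a rough illustration of the S3 and Athena features mentioned above, the sketch below wraps three awswrangler calls (wr.s3.to_parquet, wr.s3.read_parquet, and wr.athena.read_sql_query) in two small functions. The function names, the object path, and the database name are placeholders made up for this example. Calling them requires awswrangler to be installed and valid AWS credentials.

```python
def parquet_round_trip(df, path):
    """Write a DataFrame to a single Parquet object on S3 and read it back.

    `path` is a hypothetical object path such as
    "s3://your_bucket_name/directory_name/data.parquet".
    """
    # Imported lazily so this module can be loaded without awswrangler installed.
    import awswrangler as wr

    wr.s3.to_parquet(df=df, path=path)
    return wr.s3.read_parquet(path=path)


def query_athena(sql, database):
    """Run a SQL query on Amazon Athena and return the result as a DataFrame.

    `database` is the name of a Glue Data Catalog database (a placeholder here).
    """
    import awswrangler as wr

    return wr.athena.read_sql_query(sql=sql, database=database)
```

For example, query_athena("SELECT 1", "your_database") would run the query against a database that already exists in your Glue Data Catalog.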

Note: Before working with datawrangler, install the AWS CLI on your Linux machine and configure it with your AWS credentials.

Before installing datawrangler, install Python 3 on the Linux machine. Run the following commands as root, or prefix them with sudo:

$ apt update
$ apt install -y python3

After installing Python 3, install pip and the venv module with the commands

$ apt install python3-pip
$ pip3 install --upgrade pip
$ apt install -y python3-venv

A Python 3 virtual environment can be created in two ways.


WAY - 1


Create a virtual environment with the venv module:

$ python3 -m venv my_env_project

The above command creates a directory named my_env_project in the current directory. It contains pip, the Python interpreter, scripts, and libraries, which you can list with

$ ls my_env_project/
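The same environment can also be created from Python through the standard-library venv module, which is what python3 -m venv runs. This is a minimal sketch; the helper name create_env is made up for this example and is not part of the original steps.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment at env_dir and return its top-level entries."""
    # with_pip=True asks venv to install pip into the new environment as well.
    venv.create(env_dir, with_pip=with_pip)
    return sorted(entry.name for entry in Path(env_dir).iterdir())
```

Calling create_env("my_env_project") returns the names found in the new directory. On Linux these typically include bin, lib, and pyvenv.cfg.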

Activate the virtual environment with the command

$ source my_env_project/bin/activate

The command prompt changes to show the active environment, for example

(my_env_project) ubuntu@DESKTOP-I4BBP24:~$
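Besides looking at the prompt, you can confirm from Python itself that the interpreter belongs to a virtual environment. The check below is a small sketch with a made-up function name. It relies on sys.prefix and sys.base_prefix, which differ inside environments created by venv and by current virtualenv releases.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside an environment, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(sys.executable)            # path of the interpreter that is running
print(in_virtual_environment())  # True when an environment is active
```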

Now, install the awswrangler package into the virtual environment:

(my_env_project)$ pip install awswrangler
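To confirm that the package landed in the active environment, you can query the installed version with the standard-library importlib.metadata module (Python 3.8 or later). This is a sketch; installed_version is a helper name chosen for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when awswrangler is installed, otherwise None.
print(installed_version("awswrangler"))
```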

If you have not configured the AWS CLI yet, configure it now:

(my_env_project)$ aws configure
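By default, aws configure stores the access keys in ~/.aws/credentials. The sketch below checks whether that file, or the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, are present. It only looks for these two sources and does not validate the keys. Other mechanisms such as SSO or instance roles are not covered, and the function name is made up for this example.

```python
import os
from pathlib import Path


def credential_sources(env=None, home=None):
    """List the places where AWS credentials appear to be configured.

    Only checks for presence; it does not verify that the credentials work.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # Default location written by `aws configure`.
    if (home / ".aws" / "credentials").is_file():
        sources.append("shared credentials file")
    return sources


print(credential_sources() or "no credentials found")
```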

Run the python command inside the virtual environment to open the interpreter

(my_env_project)$ python

Any package you install inside the virtual environment can be imported in your project. Now let's test awswrangler with an S3 bucket.

(my_env_project) ubuntu@DESKTOP-I4BBP24:~/my_env_project$ python
>>> import awswrangler as wr
>>> s3_bucket_name = 'your_bucket_name'
>>> s3_bucket_file_path = 'directory_name/'
>>> s3_bucket_path = f"s3://{s3_bucket_name}/{s3_bucket_file_path}"
>>> df = wr.s3.read_csv(path=s3_bucket_path, path_suffix=['.csv'])
>>> print(df)

To exit from the interpreter, type

>>> quit()

We can also create a Python script and run it from inside the virtual environment:

(my_env_project) ubuntu@ubuntu:~$ vim script.py

Copy and paste the following code into the script file

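import awswrangler as wr

# Replace the placeholders with your bucket name and key prefix.
s3_bucket_name = 'your_bucket_name'
s3_bucket_file_path = 'directory_name/'
s3_bucket_path = f"s3://{s3_bucket_name}/{s3_bucket_file_path}"

# Read every object under the prefix whose key ends in .csv into one DataFrame.
df = wr.s3.read_csv(path=s3_bucket_path, path_suffix=['.csv'])
print(df)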

To execute the script, run command

(my_env_project) ubuntu@ubuntu:~$ python script.py
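If the script grows, the path construction and the read can be factored into small functions. The sketch below is one possible layout, with function names made up for this example. It assumes that awswrangler raises wr.exceptions.NoFilesFound when nothing under the path matches; check this against the version you have installed.

```python
def build_s3_path(bucket, prefix=""):
    """Build an s3:// URI for a bucket and an optional key prefix ("directory")."""
    prefix = prefix.strip("/")
    return f"s3://{bucket}/{prefix}/" if prefix else f"s3://{bucket}/"


def read_csv_prefix(bucket, prefix=""):
    """Read every .csv object under the prefix into one DataFrame, or return None."""
    # Imported lazily: needs awswrangler and AWS credentials only when called.
    import awswrangler as wr

    path = build_s3_path(bucket, prefix)
    try:
        return wr.s3.read_csv(path=path, path_suffix=[".csv"])
    except wr.exceptions.NoFilesFound:
        # Assumed to be raised when no matching objects exist under the path.
        print(f"No .csv objects found under {path}")
        return None
```

With these helpers, read_csv_prefix("your_bucket_name", "directory_name/") reads the same objects as the script above.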

To leave the virtual environment, run the deactivate command (exit or Ctrl+d would close the whole shell session instead)

(my_env_project) ubuntu@ubuntu:~$ deactivate

The above command does not remove the my_env_project directory. To delete the virtual environment, remove that directory with the rm -r command.
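Because deleting a directory tree is irreversible, it can help to check that the target really is a virtual environment first. The sketch below does this from Python by looking for the pyvenv.cfg file that venv writes at the root of each environment. The function name is made up for this example.

```python
import shutil
from pathlib import Path


def remove_virtual_environment(env_dir):
    """Delete env_dir, but only if it looks like a virtual environment."""
    env_path = Path(env_dir)
    # venv places a pyvenv.cfg file at the root of every environment it creates.
    if not (env_path / "pyvenv.cfg").is_file():
        raise ValueError(f"{env_path} does not look like a virtual environment")
    shutil.rmtree(env_path)
```

Run it after deactivate, for example remove_virtual_environment("my_env_project").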


WAY - 2


Create a directory and change into it; the virtual environment will be created there

$ mkdir jupyter_notebook
$ ls jupyter_notebook

$ cd jupyter_notebook

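Now, create a Python virtual environment named jupyter_notebook. This uses the virtualenv tool, which must be installed first (for example with pip3 install virtualenv)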

$ virtualenv jupyter_notebook

To activate the virtual environment

$ source jupyter_notebook/bin/activate

Install Jupyter inside the virtual environment

(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ pip3 install jupyter

Create a kernel that Jupyter Notebook can use to run Python code inside the virtual environment.

(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ ipython kernel install --user --name=python-env
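The kernel registration is stored as a kernel.json file whose first argv entry is the interpreter Jupyter will launch. The sketch below reads that entry so you can confirm the kernel points at the virtual environment. The function name is made up for this example, and the default directory is an assumption: it is the usual per-user location on Linux and may differ if JUPYTER_DATA_DIR is set.

```python
import json
from pathlib import Path


def kernel_interpreter(name, kernels_dir=None):
    """Return the interpreter path recorded in a kernel's kernel.json."""
    if kernels_dir is None:
        # Assumed default per-user kernel directory on Linux.
        kernels_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    spec = json.loads((Path(kernels_dir) / name / "kernel.json").read_text())
    # The first argv entry is the executable Jupyter starts for this kernel.
    return spec["argv"][0]
```

If the kernel was registered from the activated environment, kernel_interpreter("python-env") should return a path inside jupyter_notebook/bin.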

You can launch the Jupyter web interface from the terminal (the --allow-root flag is only needed when running as root) as

(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ jupyter notebook --allow-root

The command prints a link; open it in your browser, click the New drop-down menu on the right side, and select the python-env kernel.

Install awswrangler into the virtual environment behind the python-env kernel with the command

pip install awswrangler

Run the following code to test awswrangler with your S3 bucket and read the data from the .csv files.

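import awswrangler as wr

# Replace the placeholders with your bucket name and key prefix.
s3_bucket_name = 'your_bucket_name'
s3_bucket_file_path = 'directory_name/'
s3_bucket_path = f"s3://{s3_bucket_name}/{s3_bucket_file_path}"

# Read every object under the prefix whose key ends in .csv into one DataFrame.
df = wr.s3.read_csv(path=s3_bucket_path, path_suffix=['.csv'])
print(df)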

After you are done with the project, shut down Jupyter from the browser. If you no longer need the kernel, you can uninstall it with the command

(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ jupyter-kernelspec uninstall python-env

To exit from the virtual environment

(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ deactivate

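To delete the virtual environment, remove its directory (virtualenv --clear only empties and recreates an environment, it does not delete it)

rm -r /home/ubuntu/jupyter_notebook/jupyter_notebook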
