disaster-response-NLP

A machine learning pipeline for classifying messages from natural disasters and allocating them to the correct services.

requirements.txt - Contains the packages required to set up the development environment.
process_data.py - A Python script which imports, merges and cleans the dataset and stores it in a SQLite database.
train_classifier.py - A Python script which creates a machine learning pipeline and trains a random forest classifier.
custom_transformer.py - A Python script which contains transformer classes for text data.
models/model.joblib - The trained model.
app/run.py - The Flask app that runs in the web browser.
app/create_figures.py - Creates the Plotly figures ready for the app.
categories.csv - The service categories (class labels) assigned to each message.
messages.csv - The messages to which the categories relate.
disaster.db - The SQLite database produced by process_data.py.

Project summary

1. Business understanding

The goal is to complete the Data Science project as part of the Udacity nanodegree.

We are tasked with training a machine learning model to predict the service categories that a text-based message relates to.

2. Data understanding

The data have been preconfigured by Figure Eight and consist of two files. The first, messages.csv, contains real messages that were sent during natural disasters. The second, categories.csv, contains the service categories that those messages relate to. The categories data are multi-class.

3. Data preparation

The two data files are treated as follows:

Data are imported.
Data are merged.
Data are exported into an SQLite database called disaster_db.
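
The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the contents of process_data.py: the `id` merge key, the `water` and `transport` category columns, the `messages` table name and the `merge_and_store` helper are all assumptions, and two tiny in-memory frames stand in for messages.csv and categories.csv.

```python
import sqlite3

import pandas as pd

# Tiny stand-ins for messages.csv and categories.csv.
# The column names used here are assumptions for illustration only.
messages = pd.DataFrame(
    {"id": [1, 2, 3], "message": ["We need water", "Road is blocked", "Thank you"]}
)
categories = pd.DataFrame(
    {"id": [1, 2, 3], "water": [1, 0, 0], "transport": [0, 1, 0]}
)


def merge_and_store(messages, categories, connection, table_name="messages"):
    """Merge the two frames on the shared id and write the result to SQLite."""
    merged = messages.merge(categories, on="id", how="inner")
    # Dropping exact duplicates is one plausible cleaning step (assumed).
    merged = merged.drop_duplicates()
    merged.to_sql(table_name, connection, index=False, if_exists="replace")
    return merged


# An in-memory database keeps the sketch self-contained;
# passing a file path instead would persist the data to disk.
connection = sqlite3.connect(":memory:")
merged = merge_and_store(messages, categories, connection)

# Read the table back to confirm the round trip.
stored = pd.read_sql("SELECT * FROM messages", connection)
print(stored)
```

With real data, the two frames would typically come from `pd.read_csv` on the two input files.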

4. Modelling

A random forest classifier (with a multiclassification wrapper) was chosen to classify the text data. The data were split into training and testing sets and then passed into a pipeline. The pipeline contains a feature union that combines several transformers, including TF-IDF and sentiment transformers. The model is trained using GridSearchCV. To account for the class imbalance, I set the classifier's class weight to 'balanced'.
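
A pipeline of this shape could look something like the sketch below. It is not the code in train_classifier.py: `MultiOutputClassifier` is assumed to be the wrapper, `WordCountTransformer` is a hypothetical stand-in for the project's custom transformers (the sentiment transformer is not reproduced here), and the toy messages, labels and parameter grid are invented for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class WordCountTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: one feature, the number of words."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])


# Invented toy data: even rows are water requests, odd rows are road reports.
water = ["we need drinking water", "no clean water here", "please send water"]
road = ["the road is blocked", "bridge and road destroyed", "road closed by debris"]
X = [text for pair in zip(water * 2, road * 2) for text in pair]
Y = np.array([[1, 0], [0, 1]] * 6)  # columns: water, transport (assumed names)

# shuffle=False keeps the toy split deterministic and balanced.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=4, shuffle=False
)

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("word_count", WordCountTransformer()),
    ])),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(class_weight="balanced", random_state=0)
    )),
])

# A deliberately small grid; real searches would cover more parameters.
param_grid = {"clf__estimator__n_estimators": [10, 20]}
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(X_train, Y_train)

predictions = search.predict(X_test)
print(search.best_params_)
```

The `clf__estimator__` prefix is how GridSearchCV reaches the random forest nested inside the wrapper and the pipeline.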

I would have liked to synthesise cases for the under-represented classes, but I found the research in this area a little sparse, with some sources suggesting that SMOTE oversampling is inappropriate for text data because of its high dimensionality.

5. Evaluation

The F1 score was chosen to evaluate the model as it accounts for both precision and recall. Due to the large class imbalance, precision will be high, so using it alone could give a false impression of success. We want to reward classifying the correct response for both 1 and 0. For this reason we used the 'macro' F1 score, which averages the per-class F1 scores with equal weight, so the smaller class counts as much as the larger one.

The resulting average F1 score was 0.56.
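
To illustrate why macro averaging is informative under class imbalance, here is a small worked example in plain Python. The labels are invented and the helper names `f1_for_class` and `macro_f1` are hypothetical; the point is only that a model which always predicts the majority class scores well on accuracy but poorly on macro F1.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: F1 is taken to be 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Invented, imbalanced labels: eight negatives and two positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts the majority class.
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.80
print(f"macro F1 = {score:.2f}")     # macro F1 = 0.44
```

Class 0 gets an F1 of about 0.89 and class 1 gets 0, so the macro average is about 0.44 even though 80% of the predictions are correct.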

6. Deployment

N/A

Acknowledgements

Some functions relating to the python scripts were taken from the Udacity training materials.

The following articles also helped me with some decision making.

https://towardsdatascience.com/random-forest-hyperparameters-and-how-to-fine-tune-them-17aee785ee0d
https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
