A bunch of edits in README.md #5

Open · wants to merge 1 commit into master
43 changes: 24 additions & 19 deletions README.md
@@ -7,13 +7,13 @@ The aim of this real-world scenario is to highlight how to use Azure Machine Lea

1. How to train a neural word embeddings model on a text corpus of about 18 million PubMed abstracts using the [Spark Word2Vec implementation](https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec); a minimal sketch is shown after this list.
2. How to build a deep Long Short-Term Memory (LSTM) recurrent neural network model for entity extraction on a GPU-enabled Azure Data Science Virtual Machine (GPU DSVM) on Azure.
2. Demonstrate that domain-specific word embeddings model can outperform generic word embeddings models in the entity recognition task.
2. Demonstrate that domain-specific word embeddings models can outperform generic word embeddings models in the entity recognition task.
3. Demonstrate how to train and operationalize deep learning models using Azure Machine Learning Workbench.

4. Demonstrate the following capabilities within Azure Machine Learning Workbench:

* Instantiation of [Team Data Science Process (TDSP) structure and templates](how-to-use-tdsp-in-azure-ml.md).
* Automated management of your project dependencies including the download and the installation
* Automated management of your project dependencies, including their download and installation.
* Execution of code in Jupyter notebooks as well as Python scripts.
* Run history tracking for Python files.
* Execution of jobs on remote Spark compute context using HDInsight Spark 2.1 clusters.
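
As an illustration of the Spark Word2Vec step referenced in item 1 above, here is a minimal PySpark sketch. It assumes the tokenized abstracts are already available as a DataFrame with an `array<string>` column named `words`; the storage paths, column names, and vector size are illustrative placeholders rather than the exact values used in this scenario.

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("pubmed-word2vec").getOrCreate()

# Tokenized abstracts with an array<string> column "words"; the path is a placeholder.
abstracts = spark.read.parquet("wasb:///pubmed/tokenized_abstracts.parquet")

word2vec = Word2Vec(
    vectorSize=50,       # embedding dimension (illustrative)
    minCount=5,          # ignore very rare tokens
    numPartitions=10,
    inputCol="words",
    outputCol="embedding",
)
model = word2vec.fit(abstracts)

# Persist the learned word vectors so they can later be used as features
# by the neural entity extractor.
model.getVectors().write.parquet("wasb:///pubmed/word_vectors.parquet")
```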
@@ -23,7 +23,7 @@ The aim of this real-world scenario is to highlight how to use Azure Machine Lea
## Use Case Overview
Biomedical named entity recognition is a critical step for complex biomedical NLP tasks such as:
* Extraction of diseases and symptoms from electronic medical or health records.
* Drug discovery
* Drug discovery.
* Understanding the interactions between different entity types such as drug-drug interaction, drug-disease relationship and gene-protein relationship.

Our use case scenario focuses on how a large corpus of unstructured data, such as MEDLINE PubMed abstracts, can be analyzed to train a word embedding model. The output embeddings are then treated as automatically generated features for training a neural entity extractor.
@@ -37,7 +37,7 @@ The following figure shows the architecture that was used to process data and tr
## Data Description

### 1. Word2Vec model training data
We first downloaded the raw MEDLINE abstract data from [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html). The data is publically available in the form of XML files on their [FTP server](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline). There are 892 XML files available on the server and each of the XML files has the information of 30,000 articles. More details about the data collection step are provided in the Project Structure section. The fields present in each file are
We first downloaded the raw MEDLINE abstract data from [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html). The data is publicly available in the form of XML files on their [FTP server](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline). There are 892 XML files available on the server, and each XML file contains information on 30,000 articles. More details about the data collection step are provided in the [Data Acquisition and Understanding](./code/01_data_acquisition_and_understanding/ReadMe.md) section. The fields present in each file are:

abstract
affiliation
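
As an illustration of how the abstract text can be pulled out of one of these XML files, here is a minimal parsing sketch using lxml (which is among the project dependencies). The element names follow the public MEDLINE/PubMed XML format; the file name is a placeholder, and this is not the exact parsing code used in the scenario.

```
import gzip
from lxml import etree

# File name is a placeholder for one of the MEDLINE baseline files.
with gzip.open("medline_baseline_0001.xml.gz", "rb") as f:
    tree = etree.parse(f)

for citation in tree.iterfind(".//MedlineCitation"):
    pmid = citation.findtext("PMID")
    title = citation.findtext(".//ArticleTitle")
    # An abstract may be split across several AbstractText elements.
    abstract = " ".join(citation.xpath(".//Abstract/AbstractText/text()"))
    print(pmid, title, abstract[:80])
```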
@@ -59,11 +59,15 @@ We first downloaded the raw MEDLINE abstract data from [MEDLINE](https://www.nlm

### 2. LSTM model training data

The neural entity extraction model has been trained and evaluated on publiclly available datasets. To obtain a detailed description about these datasets, you could refer to the following sources:
The neural entity extraction model has been trained and evaluated on the following publicly available datasets:
* [Bio-Entity Recognition Task at BioNLP/NLPBA 2004](http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html)
* [BioCreative V CDR task corpus](http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/)
* [SemEval 2013 - Task 9.1 (Drug Recognition)](https://www.cs.york.ac.uk/semeval-2013/task9/)

These datasets collectively have annotations for the following entity types:
* Drug
* Disease
* ...
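
To make the modeling step concrete, here is a minimal Keras sketch of the kind of bidirectional LSTM entity extractor trained in this scenario: the pretrained word embeddings are loaded into a frozen embedding layer and the network predicts one BIO-style tag per token. The vocabulary size, sequence length, layer sizes, and tag count below are placeholders, not the exact architecture or hyperparameters used in this project.

```
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

# Placeholder dimensions; real values depend on the corpus and the tag scheme.
vocab_size, embedding_dim, max_seq_len, num_tags = 50000, 50, 100, 8

# Pretrained Word2Vec vectors, one row per vocabulary entry
# (e.g. loaded from the output of the Spark Word2Vec step).
embedding_matrix = np.random.rand(vocab_size, embedding_dim).astype("float32")

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim,
                    weights=[embedding_matrix],
                    input_length=max_seq_len,
                    trainable=False))            # keep the domain embeddings frozen
model.add(Bidirectional(LSTM(200, return_sequences=True)))
model.add(TimeDistributed(Dense(num_tags, activation="softmax")))

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# Training would use padded token-id sequences X of shape (samples, max_seq_len)
# and one-hot tags y of shape (samples, max_seq_len, num_tags):
# model.fit(X, y, batch_size=128, epochs=10)
```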

## Prerequisites

@@ -73,7 +77,7 @@ The neural entity extraction model has been trained and evaluated on publiclly a
* macOS Sierra

### Azure services
* To run this scenario with Spark cluster, provision [Azure HDInsight Spark cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql) (Spark 2.1 on Linux (HDI 3.6)) for scale-out computation. To process the full amount of MEDLINE abstracts discussed below, We recommend having a cluster with:
* To run this scenario with a Spark cluster, provision an [Azure HDInsight Spark cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql) (Spark 2.1 on Linux (HDI 3.6)) for scale-out computation. To process the full amount of MEDLINE abstracts discussed below, we recommend having a cluster with:
* a head node of type [D13_V2](https://azure.microsoft.com/en-us/pricing/details/hdinsight/)
* at least four worker nodes of type [D12_V2](https://azure.microsoft.com/en-us/pricing/details/hdinsight/).

@@ -89,7 +93,7 @@ All the required dependencies are defined in the aml_config/conda_dependencies.y
automatically provisioned for runs against Docker, VM, and HDI cluster targets. For details about the Conda environment file format, see [the Conda documentation](https://conda.io/docs/using/envs.html#create-environment-file-by-hand).

* [TensorFlow with GPU support](https://www.tensorflow.org/install/)
* [CNTK 2.0](https://docs.microsoft.com/en-us/cognitive-toolkit/using-cntk-with-keras)
* [CNTK 2.0](https://docs.microsoft.com/en-us/cognitive-toolkit/using-cntk-with-keras) [[Well, CNTK is not listed in the yml file!]]
* [Keras](https://keras.io/#installation)
* NLTK
* Fastparquet
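
Once the environment is provisioned, a quick sanity check (not part of the original scenario) can confirm that the GPU build of TensorFlow is active and show which backend Keras is actually using, which is also relevant to the note above about CNTK not appearing in the yml file:

```
import tensorflow as tf
from keras import backend as K

# Prints a device string such as "/device:GPU:0" when a GPU-enabled
# TensorFlow build is installed and a GPU is visible; empty string otherwise.
print(tf.test.gpu_device_name())

# Prints the backend Keras is configured to use (e.g. "tensorflow").
print(K.backend())
```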
@@ -103,14 +107,15 @@ automatically provisioned for runs against docker, VM, and HDI cluster targets.
* [How to use GPU](how-to-use-gpu.md)

## Scenario Structure
For the scenario, we use the TDSP project structure and documentation templates (Figure 1), which follows the [TDSP lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md). Project is created based on instructions provided [here](https://github.com/amlsamples/tdsp/blob/master/docs/how-to-use-tdsp-in-azure-ml.md).
For the scenario, we use the TDSP project structure and documentation templates (Figure 1), which follows the [TDSP lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md). The project is created based on instructions provided [here](https://github.com/amlsamples/tdsp/blob/master/docs/how-to-use-tdsp-in-azure-ml.md).


![Fill in project information](./docs/images/instantiation-3.png)
Figure 1. TDSP Template in AML Workbench.

### Configuration of execution environments

This project includes steps that run on two compute/execution environments: in Spark cluster and GPU-supported DS VM. We start with the description of the dependencies required both environments.
This project includes steps that run on two compute/execution environments: a Spark cluster and a GPU-enabled DSVM. We start with a description of the dependencies required for these environments.

To install these packages in the Docker image and on the nodes of the Spark cluster, we modify the conda_dependencies.yml file:

@@ -119,7 +124,7 @@ To install these packages in Docker image and in the nodes of Spark cluster, we
- python=3.5.2
# ipykernel is required to use the remote/docker kernels in Jupyter Notebook.
- ipykernel=4.6.1
- tensorflow-gpu==1.2.0
- tensorflow-gpu
- nltk
- requests
- lxml
@@ -134,9 +139,9 @@ To install these packages in Docker image and in the nodes of Spark cluster, we
- keras
- azure-storage

The modified conda\_dependencies.yml file is stored in aml_config directory of this project.
The modified conda\_dependencies.yml file is stored in the aml_config directory of this project.

In the next steps, we connect execution environment to Azure account. Open command line window (CLI) by clicking File menu in the top left corner of AML Workbench and choosing "Open Command Prompt." Then run in CLI
In the next steps, we connect the execution environments to an Azure account. Open a Command Line Interface (CLI) window by clicking the File menu in the top left corner of AML Workbench and choosing "Open Command Prompt." Then run in the CLI

az login

@@ -148,37 +153,37 @@ Go to this web page, enter the code and sign into your Azure account. After this

az account list -o table

and find the subscription ID of Azure subscription that has your AML Workbench Workspace account. Finally, run in CLI
and find the subscription ID of the Azure subscription that has your AML Workbench Workspace account. Finally, run in the CLI

az account set -s <subscription ID>

to complete the connection to your Azure subscription.

In the next two sections we show how to complete configuration of remote docker and Spark cluster.
In the next two sections we show how to complete the configuration of the remote Docker container and the Spark cluster environments.

#### Configuration of remote Docker container

To set up a remote Docker container, run the following command in the CLI:
```
az ml computetarget attach --name my-dsvm-env --address <IP address> --username <username> --password <password> --type remotedocker
```
with IP address, user name and password in DSVM. IP address of DSVM can be found in Overview section of your DSVM page in Azure portal:
with the IP address, user name, and password for the DSVM. The IP address of the DSVM can be found in the Overview section of your DSVM page in the Azure portal:

![VM IP](./docs/images/vm_ip.png)

This command creates two files my-dsvm-env.compute and my-dsvm-env.runconfig under aml_config folder.
This command creates two files under the aml_config folder: my-dsvm-env.compute and my-dsvm-env.runconfig.

#### Configuration of Spark cluster

To set up the Spark environment, run the following command in the CLI:
```
az ml computetarget attach --name my-spark-env --address <cluster name>-ssh.azurehdinsight.net --username <username> --password <password> --type cluster
```
with the name of the cluster, cluster's SSH user name and password. The default value of SSH user name is `sshuser`, unless you changed it during provisioning of the cluster. The name of the cluster can be found in Properties section of your cluster page in Azure portal:
with the name of the cluster and the cluster's SSH user name and password. The default SSH user name is `sshuser`, unless you changed it during provisioning of the cluster. The name of the cluster can be found in the Properties section of your cluster page in the Azure portal:

![Cluster name](./docs/images/cluster_name.png)

This command creates two files my-spark-env.compute and my-spark-env.runconfig under aml_config folder.
This command creates two files under the aml_config folder: my-spark-env.compute and my-spark-env.runconfig.
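
With both compute targets attached, experiments can be submitted against them from the same CLI. The command below is a hedged example: the script name is a placeholder and the exact syntax depends on your version of AML Workbench.

```
az ml experiment submit -c my-spark-env myscript.py
```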

The step-by-step data science workflow is as follows:
### 1. [Data Acquisition and Understanding](./code/01_data_acquisition_and_understanding/ReadMe.md)
@@ -191,7 +196,7 @@ The step-by-step data science workflow is as follows:

## Conclusion

This use case scenario demonstrate how to train a word embedding model using Word2Vec algorithm on Spark and then use the extracted embeddings as features to train a deep neural network for entity extraction. We have applied the training pipeline on the biomedical domain. However, the pipeline is generic enough to be applied to detect custom entity types of any other domain. You just need enough data and you can easily adapt the workflow presented here for a different domain.
This use case scenario demonstrates how to train a word embedding model using the Word2Vec algorithm on Spark and then use the extracted embeddings as features to train a deep neural network for entity extraction. We have applied the training pipeline to the biomedical domain. However, the pipeline is generic enough to be applied to detect custom entity types of any other domain. Given enough data, you can easily adapt the workflow presented here to a different domain.

## References
