Merge pull request #33 from jalammar/batch-input-activations

Batch input activations
jalammar · Feb 25, 2021 · 8639e3a
2 parents e4ad283 + 86cdb9e
commit 8639e3a
Showing 53 changed files with 965 additions and 306 deletions.
diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
@@ -1,6 +1,6 @@
 name: Python package
 
-on: [push]
+on: [push, pull_request]
 
 jobs:
   build:

diff --git a/.gitignore b/.gitignore
@@ -55,7 +55,7 @@ output/*.html
 output/*/index.html
 
 # Sphinx
-docs/_build
+docs-old/_build
 
 .DS_Store
 *~

diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -5,6 +5,11 @@
 # Required
 version: 2
 
-# Build documentation in the docs/ directory with Sphinx
-sphinx:
-  configuration: docs/conf.py
+# Build documentation
+mkdocs:
+  configuration: mkdocs.yml
+
+python:
+   version: 3.8
+   install:
+   - requirements: docs/requirements.txt
diff --git a/docs/Makefile → docs-old/Makefile b/docs/Makefile → docs-old/Makefile
diff --git a/docs/_static/activation-factors.PNG → docs-old/_static/activation-factors.PNG b/docs/_static/activation-factors.PNG → docs-old/_static/activation-factors.PNG
diff --git a/docs/_static/basic_screenshot.png → docs-old/_static/basic_screenshot.png b/docs/_static/basic_screenshot.png → docs-old/_static/basic_screenshot.png
diff --git a/docs/_static/custom.css → docs-old/_static/custom.css b/docs/_static/custom.css → docs-old/_static/custom.css
diff --git a/docs/_static/ecco-logo.png → docs-old/_static/ecco-logo.png b/docs/_static/ecco-logo.png → docs-old/_static/ecco-logo.png
diff --git a/docs/_static/input-output.PNG → docs-old/_static/input-output.PNG b/docs/_static/input-output.PNG → docs-old/_static/input-output.PNG
diff --git a/docs/_static/input-saliency.PNG → docs-old/_static/input-saliency.PNG b/docs/_static/input-saliency.PNG → docs-old/_static/input-saliency.PNG
diff --git a/docs/_static/layer_predictions.PNG → docs-old/_static/layer_predictions.PNG b/docs/_static/layer_predictions.PNG → docs-old/_static/layer_predictions.PNG
diff --git a/docs/_static/rankings.PNG → docs-old/_static/rankings.PNG b/docs/_static/rankings.PNG → docs-old/_static/rankings.PNG
diff --git a/docs/_static/rankings_watch.PNG → docs-old/_static/rankings_watch.PNG b/docs/_static/rankings_watch.PNG → docs-old/_static/rankings_watch.PNG
diff --git a/docs/_templates/layout.html → docs-old/_templates/layout.html b/docs/_templates/layout.html → docs-old/_templates/layout.html
diff --git a/docs/conf.py → docs-old/conf.py b/docs/conf.py → docs-old/conf.py
diff --git a/docs/getting_started.rst → docs-old/getting_started.rst b/docs/getting_started.rst → docs-old/getting_started.rst
diff --git a/docs/index.rst → docs-old/index.rst b/docs/index.rst → docs-old/index.rst
diff --git a/docs/make.bat → docs-old/make.bat b/docs/make.bat → docs-old/make.bat
diff --git a/docs/api/ecco.md b/docs/api/ecco.md
@@ -0,0 +1,3 @@
+
+**ecco.from_pretrained()**
+::: ecco.from_pretrained
diff --git a/docs/api/language-model.md b/docs/api/language-model.md
@@ -0,0 +1,5 @@
+
+
+
+::: ecco.lm.LM
+    handler: python
diff --git a/docs/api/nmf.md b/docs/api/nmf.md
@@ -0,0 +1,8 @@
+One of the ways in which Ecco tries to make Transformer language models more transparent is by making it easier to [examine the neuron activations](https://jalammar.github.io/explaining-transformers/) in the feed-forward neural network sublayer of [Transformer blocks](https://jalammar.github.io/illustrated-transformer/). 
+Large language models can have up to billions of neurons. Direct examination of these neurons is not always insightful because their firing is sparse, there's a lot of redundancy, and their number makes it hard to extract a signal.
+
+[Matrix decomposition](https://scikit-learn.org/stable/modules/decomposition.html) methods can give us a glimpse into the underlying patterns in neuron firing. From these methods, Ecco currently provides easy access to Non-negative Matrix Factorization (NMF).
+
+## NMF
+
+::: ecco.output.NMF
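
To make the idea in `docs/api/nmf.md` concrete, here is a minimal sketch of non-negative matrix factorization on a toy activation matrix. It assumes one row per neuron and one column per token position (a layout chosen for illustration) and uses Lee and Seung's multiplicative updates in plain Python. It is not Ecco's implementation; `matmul`, `transpose`, `frobenius_error`, and `nmf` are hypothetical helpers.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (rows x cols) into W (rows x k) and H (k x cols)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random initialization.
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(n_components)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_components)]
             for i in range(n_rows)]
    return w, h


# Toy data: two groups of neurons that fire on different token positions.
activations = [
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.7, 1.0],
]
w, h = nmf(activations, n_components=2)
```

Each row of `h` is one factor: a non-negative activation pattern over token positions that summarizes a group of neurons, which is the kind of pattern the NMF view is meant to surface.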
diff --git a/docs/api/output.md b/docs/api/output.md
@@ -0,0 +1,3 @@
+
+
+::: ecco.output.OutputSeq
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -0,0 +1,7 @@
+Ecco is made up of two components:
+
+- [Ecco](https://github.com/jalammar/ecco), a Python component that wraps language models and collects relevant data.
+- [EccoJS](https://github.com/jalammar/eccojs), a JavaScript component used to create interactive explorables from the outputs of Ecco.
+
+All the machine learning happens in Ecco. The results can be plotted in Python or turned into interactive explorables using EccoJS.
+
diff --git a/docs/img/eccorca.png b/docs/img/eccorca.png
diff --git a/docs/img/eccorca_pink.png b/docs/img/eccorca_pink.png
diff --git a/docs/img/eccorca_purple.png b/docs/img/eccorca_purple.png
diff --git a/docs/img/eccorca_white.png b/docs/img/eccorca_white.png
diff --git a/docs/img/layer_predictions_ex_london.png b/docs/img/layer_predictions_ex_london.png
diff --git a/docs/img/nmf_ex_1.png b/docs/img/nmf_ex_1.png
diff --git a/docs/img/nmf_ex_1_widethumb.png b/docs/img/nmf_ex_1_widethumb.png
diff --git a/docs/img/ranking_watch_ex_is_are_1.png b/docs/img/ranking_watch_ex_is_are_1.png
diff --git a/docs/img/rankings_ex_eu_1.png b/docs/img/rankings_ex_eu_1.png
diff --git a/docs/img/rankings_ex_eu_1_widethumb.png b/docs/img/rankings_ex_eu_1_widethumb.png
diff --git a/docs/img/rankings_watch_ex_is_are_widethumb.png b/docs/img/rankings_watch_ex_is_are_widethumb.png
diff --git a/docs/img/saliency_ex_1.png b/docs/img/saliency_ex_1.png
diff --git a/docs/img/saliency_ex_1_thumbwide.png b/docs/img/saliency_ex_1_thumbwide.png
diff --git a/docs/img/saliency_ex_2.png b/docs/img/saliency_ex_2.png
diff --git a/docs/img/saliency_ex_2_thumbwide.png b/docs/img/saliency_ex_2_thumbwide.png
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,54 @@
+# Welcome to Ecco
+Ecco is a python library for explaining Natural Language Processing models using interactive visualizations.
+
+Language models are some of the most fascinating technologies. They are programs that can speak and <i>understand</i> language better than any technology we've had before. For the general audience, Ecco provides an easy way to start interacting with language models. For people closer to NLP, Ecco provides methods to visualize and interact with underlying mechanics of the language models.
+
+Ecco runs inside Jupyter notebooks. It is built on top of [pytorch](https://pytorch.org/) and [transformers](https://github.com/huggingface/transformers).
+
+Ecco is not concerned with training or fine-tuning models, only with exploring and understanding existing pre-trained models.
+
+## Tutorials
+- Video: [Take A Look Inside Language Models With Ecco](https://www.youtube.com/watch?v=rHrItfNeuh0). \[<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Language_Models_and_Ecco_PyData_Khobar.ipynb">Colab Notebook</a>]
+
+
+## How-to Guides
+- [Interfaces for Explaining Transformer Language Models](https://jalammar.github.io/explaining-transformers/)
+- [Finding the Words to Say: Hidden State Visualizations for Language Models](https://jalammar.github.io/hidden-states/)
+
+
+## API Reference
+The [API reference](api/ecco) and the [architecture](architecture) page explain Ecco's components and how they work together.
+
+## Gallery
+
+<div class="container gallery" markdown="1">
+
+<p><strong>Predicted Tokens:</strong> View the model's prediction for the next token (with probability scores). See how the predictions evolved through the model's layers. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Output_Token_Scores.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Output_Token_Scores.ipynb">Colab</a>]</p>
+<img src="img/layer_predictions_ex_london.png" />
+<hr />
+<p><strong>Rankings across layers:</strong> After the model picks an output token, look back at how each layer ranked that token. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Evolution_of_Selected_Token.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Evolution_of_Selected_Token.ipynb">Colab</a>]</p>
+<img src="img/rankings_ex_eu_1_widethumb.png" />
+<hr />
+<p><strong>Layer Predictions:</strong> Compare the rankings of multiple tokens as candidates for a certain position in the sequence. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Comparing_Token_Rankings.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Comparing_Token_Rankings.ipynb">Colab</a>]</p>
+<img src="img/rankings_watch_ex_is_are_widethumb.png" />
+<hr />
+<p><strong>Input Saliency:</strong> How much did each input token contribute to producing the output token?   [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Colab</a>]
+</p>
+<img src="img/saliency_ex_1_thumbwide.png" />
+
+<hr />
+<p><strong>Detailed Saliency:</strong> See more precise input saliency values using the detailed view. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Colab</a>]
+</p>
+<img src="img/saliency_ex_2_thumbwide.png" />
+
+<hr />
+<p><strong>Neuron Activation Analysis:</strong> Examine underlying patterns in neuron activations using non-negative matrix factorization. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Neuron_Factors.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Neuron_Factors.ipynb">Colab</a>]</p>
+<img src="img/nmf_ex_1_widethumb.png" />
+
+</div>
+
+## Getting Help
+Having trouble?
+
+- The [Discussion](https://github.com/jalammar/ecco/discussions) board might have some relevant information. If not, you can post your questions there.
+- Report bugs at Ecco's [issue tracker](https://github.com/jalammar/ecco/issues)
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -0,0 +1,12 @@
+mkdocs==1.1.2
+mkdocs-material==6.2.8
+mkdocstrings==0.14.0
+-f https://download.pytorch.org/whl/torch_stable.html
+torch~=1.6.0
+transformers~=4.2.2
+matplotlib~=3.3.1
+numpy~=1.19.1
+ipython~=7.16.1
+scikit-learn~=0.23.2
+seaborn~=0.11.0
+PyYAML==5.4.1
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
@@ -0,0 +1,7 @@
+
+.gallery{
+    font-size:80%;
+}
+.gallery img{
+    max-width: 400px;
+}
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -0,0 +1,35 @@
+site_name: Ecco
+
+theme:
+  name: "material"
+  logo: img/eccorca_white.png
+  favicon: img/eccorca_purple.png
+  palette:
+    primary: purple
+    accent: pink
+extra_css:
+  - stylesheets/extra.css
+
+plugins:
+- mkdocstrings:
+    watch:
+      - src/ecco
+    handlers:
+      python:
+        setup_commands:
+#          - import ecco
+          - import sys
+          - sys.path.append("src")
+
+nav:
+  - Home: index.md
+  - Architecture: architecture.md
+  - API:
+      - Ecco: api/ecco.md
+      - Language Model: api/language-model.md
+      - Output: api/output.md
+      - NMF: api/nmf.md
+
+markdown_extensions:
+  - pymdownx.highlight
+  - pymdownx.superfences
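
The two `setup_commands` lines in `mkdocs.yml` extend the import path so that mkdocstrings can import `ecco` from the `src/` layout without installing it. A minimal sketch of the same mechanism, using a throwaway package named `demo_pkg` (a hypothetical stand-in for `ecco`) in a temporary directory:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway "src" directory containing a tiny package.
# `demo_pkg` is hypothetical and only stands in for `ecco`.
root = tempfile.mkdtemp()
src = os.path.join(root, "src")
os.makedirs(os.path.join(src, "demo_pkg"))
with open(os.path.join(src, "demo_pkg", "__init__.py"), "w") as f:
    f.write("__version__ = '0.0.14'\n")

# Same idea as the setup_commands entries: extend the import path,
# after which the package can be imported by name.
sys.path.append(src)
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")
print(demo_pkg.__version__)
```

In the real config the path is the relative string `"src"`, so it presumably resolves against the directory the docs build runs from.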
diff --git a/readme.md b/readme.md
@@ -0,0 +1,64 @@
+![Ecco Logo](https://ar.pegg.io/img/ecco-logo-w-800.png)
+
+[![PyPI Package latest release](https://img.shields.io/pypi/v/ecco.svg)](https://pypi.org/project/ecco)
+[![Supported versions](https://img.shields.io/pypi/pyversions/ecco.svg)](https://pypi.org/project/ecco)
+
+
+Ecco is a python library for exploring and explaining Natural Language Processing models using interactive visualizations.
+
+Ecco provides multiple interfaces to aid the explanation and intuition of [Transformer](https://jalammar.github.io/illustrated-transformer/)-based language models. Read: [Interfaces for Explaining Transformer Language Models](https://jalammar.github.io/explaining-transformers/).
+
+Ecco runs inside Jupyter notebooks. It is built on top of [pytorch](https://pytorch.org/) and [transformers](https://github.com/huggingface/transformers).
+
+
+Ecco is not concerned with training or fine-tuning models, only with exploring and understanding existing pre-trained models. The library is currently an alpha release of a research project and is not production ready. You're welcome to contribute to make it better!
+
+
+Documentation: [ecco.readthedocs.io](https://ecco.readthedocs.io/)
+
+
+## Tutorials
+- Video: [Take A Look Inside Language Models With Ecco](https://www.youtube.com/watch?v=rHrItfNeuh0). \[<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Language_Models_and_Ecco_PyData_Khobar.ipynb">Colab Notebook</a>]
+
+
+## How-to Guides
+- [Interfaces for Explaining Transformer Language Models](https://jalammar.github.io/explaining-transformers/)
+- [Finding the Words to Say: Hidden State Visualizations for Language Models](https://jalammar.github.io/hidden-states/)
+
+
+## API Reference
+The [API reference](api/ecco) and the [architecture](architecture) page explain Ecco's components and how they work together.
+
+## Gallery & Examples
+
+<div class="container gallery" markdown="1">
+
+<p><strong>Predicted Tokens:</strong> View the model's prediction for the next token (with probability scores). See how the predictions evolved through the model's layers. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Output_Token_Scores.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Output_Token_Scores.ipynb">Colab</a>]</p>
+<img src="docs/img/layer_predictions_ex_london.png" width="400" />
+<hr />
+<p><strong>Rankings across layers:</strong> After the model picks an output token, look back at how each layer ranked that token. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Evolution_of_Selected_Token.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Evolution_of_Selected_Token.ipynb">Colab</a>]</p>
+<img src="docs/img/rankings_ex_eu_1_widethumb.png" width="400"/>
+<hr />
+<p><strong>Layer Predictions:</strong> Compare the rankings of multiple tokens as candidates for a certain position in the sequence. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Comparing_Token_Rankings.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Comparing_Token_Rankings.ipynb">Colab</a>]</p>
+<img src="docs/img/rankings_watch_ex_is_are_widethumb.png" width="400" />
+<hr />
+<p><strong>Input Saliency:</strong> How much did each input token contribute to producing the output token?   [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Colab</a>]
+</p>
+<img src="docs/img/saliency_ex_1_thumbwide.png" width="400"/>
+
+<hr />
+<p><strong>Detailed Saliency:</strong> See more precise input saliency values using the detailed view. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Input_Saliency.ipynb">Colab</a>]
+</p>
+<img src="docs/img/saliency_ex_2_thumbwide.png" width="400"/>
+
+<hr />
+<p><strong>Neuron Activation Analysis:</strong> Examine underlying patterns in neuron activations using non-negative matrix factorization. [<a href="https://github.com/jalammar/ecco/blob/main/notebooks/Ecco_Neuron_Factors.ipynb">Notebook</a>] [<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_Neuron_Factors.ipynb">Colab</a>]</p>
+<img src="docs/img/nmf_ex_1_widethumb.png" width="400"/>
+
+</div>
+
+## Getting Help
+Having trouble?
+
+- The [Discussion](https://github.com/jalammar/ecco/discussions) board might have some relevant information. If not, you can post your questions there.
+- Report bugs at Ecco's [issue tracker](https://github.com/jalammar/ecco/issues)
diff --git a/readthedocs.yml b/readthedocs.yml
diff --git a/requirements.txt b/requirements.txt
@@ -9,4 +9,5 @@ pytest~=6.1.2
 setuptools~=49.6.0
 torch~=1.6.0
 torchvision~=0.7.0
+PyYAML==5.4.1
 
diff --git a/setup.py b/setup.py
@@ -25,7 +25,7 @@ def read(*names, **kwargs):
 
 setup(
     name='ecco',
-    version='0.0.13',
+    version='0.0.14',
     license='BSD-3-Clause',
     description='Visualization tools for NLP machine learning models.',
     long_description='%s\n%s' % (
@@ -66,7 +66,8 @@ def read(*names, **kwargs):
     install_requires=[
         "transformers ~= 4.2",
         "seaborn ~= 0.11",
-        "scikit-learn~=0.23"
+        "scikit-learn~=0.23",
+        "PyYAML~=5.4"
     ],
     extras_require={
         "dev": [

diff --git a/src/ecco/__init__.py b/src/ecco/__init__.py
@@ -1,24 +1,69 @@
-__version__ = '0.0.13'
-from ecco.lm import LM, MockGPT, MockGPTTokenizer
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-def from_pretrained(hf_model_id,
-                    activations=False,
-                    attention=False,
-                    hidden_states=True,
-                    activations_layer_nums=None,
+"""
+This is the main entry point to Ecco. `from_pretrained()` initializes an [LM][ecco.lm.LM]
+object, which we then use as a language model like GPT2 (or a masked language model like BERT).
+
+Usage:
+
+```
+    import ecco
+
+    lm = ecco.from_pretrained('distilgpt2')
+```
+"""
+
+
+__version__ = '0.0.14'
+from ecco.lm import LM
+from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel
+from typing import Optional, List
+
+
+def from_pretrained(hf_model_id: str,
+                    activations: Optional[bool] = False,
+                    attention: Optional[bool] = False,
+                    hidden_states: Optional[bool] = True,
+                    activations_layer_nums: Optional[List[int]] = None,
+                    verbose: Optional[bool] = True,
+                    gpu: Optional[bool] = True
                     ):
-    if hf_model_id == "mockGPT":
-        tokenizer = MockGPTTokenizer()
-        model = MockGPT()
+    """
+Constructs an [LM][ecco.lm.LM] object based on a string identifier from HuggingFace Transformers. This is the main entry point to Ecco.
+
+Usage:
+
+```python
+import ecco
+lm = ecco.from_pretrained('gpt2')
+```
+
+Args:
+    hf_model_id: name of the model identifying it in the HuggingFace model hub. e.g. 'distilgpt2', 'bert-base-uncased'.
+    activations: If True, collect activations when this model runs inference. Option saved in LM.
+    attention: If True, collect attention. Option passed to the model.
+    hidden_states: If True, collect hidden states. Needed for layer_predictions and rankings().
+    activations_layer_nums: If we are collecting activations, we can specify which layers to track. This is None by
+        default and all layers are collected if 'activations' is set to True.
+    verbose: If True, model.generate() displays output tokens in HTML as they're generated.
+    gpu: Set to False to force using the CPU even if a GPU exists.
+"""
+    # TODO: Should specify task/head in a cleaner way. Allow masked LM. T5 generation.
+    # Likely use model-config. Have a default. Allow user to specify head?
+    if 'gpt2' not in hf_model_id:
+        tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
+        model = AutoModel.from_pretrained(hf_model_id,
+                                          output_hidden_states=hidden_states,
+                                          output_attentions=attention)
     else:
         tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
         model = AutoModelForCausalLM.from_pretrained(hf_model_id,
                                                      output_hidden_states=hidden_states,
                                                      output_attentions=attention)
 
     lm_kwargs = {
+        'model_name': hf_model_id,
         'collect_activations_flag': activations,
-        'collect_activations_layer_nums': activations_layer_nums}
+        'collect_activations_layer_nums': activations_layer_nums,
+        'verbose': verbose,
+        'gpu': gpu}
     lm = LM(model, tokenizer, **lm_kwargs)
     return lm
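
The branching in the new `from_pretrained()` can be summarized in a small sketch that runs without downloading a model. `plan_from_pretrained` is a hypothetical helper, not part of Ecco. It only reports which `transformers` Auto class the diff would select and the keyword arguments that would be passed on to `LM`.

```python
from typing import List, Optional, Tuple


def plan_from_pretrained(hf_model_id: str,
                         activations: bool = False,
                         activations_layer_nums: Optional[List[int]] = None,
                         verbose: bool = True,
                         gpu: bool = True) -> Tuple[str, dict]:
    """Return the Auto* class name the diff would pick and the kwargs for LM."""
    # Identifiers containing 'gpt2' get a causal-LM head;
    # every other identifier falls back to the bare AutoModel.
    auto_class = "AutoModelForCausalLM" if "gpt2" in hf_model_id else "AutoModel"
    lm_kwargs = {
        "model_name": hf_model_id,
        "collect_activations_flag": activations,
        "collect_activations_layer_nums": activations_layer_nums,
        "verbose": verbose,
        "gpu": gpu,
    }
    return auto_class, lm_kwargs


gpt_class, gpt_kwargs = plan_from_pretrained("distilgpt2", activations=True, gpu=False)
bert_class, bert_kwargs = plan_from_pretrained("bert-base-uncased")
```

Because the check is a substring match on the identifier, any model whose name lacks `gpt2` is loaded without a language-modeling head, which is what the TODO in the diff flags for a cleaner solution.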
diff --git a/src/ecco/__main__.py b/src/ecco/__main__.py
@@ -1,5 +1,5 @@
 """
-Entrypoint module, in case you use `python -mecco`.
+Entrypoint module, in case you use `python -m ecco`.
 
 
 Why does this file exist, and why __main__? For more info, read: