|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Gensim\n", |
| 8 | + "\n", |
| 9 | + "> Gensim is designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible." |
| 10 | + ] |
| 11 | + }, |
| 12 | + { |
| 13 | + "cell_type": "markdown", |
| 14 | + "metadata": {}, |
| 15 | + "source": [ |
| 16 | + "> Gensim is designed to process raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections discover semantic structure of documents by examining statistical co-occurrence patterns of the words within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.\n", |
| 17 | + "\n", |
| 18 | + "> Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents." |
| 19 | + ] |
| 20 | + }, |
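| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "As an aside (not part of the original walkthrough), here is a minimal, self-contained sketch of the workflow quoted above: build a bag-of-words corpus, fit a small Latent Semantic Indexing model, and query a new document for topical similarity. The tiny `example_docs` list and the choice of `num_topics=2` are purely illustrative assumptions; the individual preprocessing steps are developed properly in the sections below." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "# A rough sketch of \"semantic representation + similarity query\" on a toy corpus\n", |
| | +    "from gensim import corpora, models, similarities\n", |
| | +    "\n", |
| | +    "example_docs = [[\"human\", \"interface\", \"computer\"],\n", |
| | +    "                [\"survey\", \"user\", \"computer\", \"system\"],\n", |
| | +    "                [\"graph\", \"minors\", \"trees\"]]\n", |
| | +    "\n", |
| | +    "example_dictionary = corpora.Dictionary(example_docs)  # word <-> integer id mapping\n", |
| | +    "example_corpus = [example_dictionary.doc2bow(doc) for doc in example_docs]\n", |
| | +    "\n", |
| | +    "# Project the bag-of-words vectors into a 2-topic LSI space (num_topics=2 is an arbitrary toy choice)\n", |
| | +    "lsi = models.LsiModel(example_corpus, id2word=example_dictionary, num_topics=2)\n", |
| | +    "\n", |
| | +    "# Index the corpus in LSI space and query an unseen document for topical similarity\n", |
| | +    "index = similarities.MatrixSimilarity(lsi[example_corpus])\n", |
| | +    "query_bow = example_dictionary.doc2bow(\"human computer interaction\".lower().split())\n", |
| | +    "print(list(enumerate(index[lsi[query_bow]])))  # cosine similarity of the query to each document" |
| | +   ] |
| | +  }, |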
| 21 | + { |
| 22 | + "cell_type": "markdown", |
| 23 | + "metadata": {}, |
| 24 | + "source": [ |
| 25 | + "## From Strings to Vectors" |
| 26 | + ] |
| 27 | + }, |
| 28 | + { |
| 29 | + "cell_type": "markdown", |
| 30 | + "metadata": {}, |
| 31 | + "source": [ |
| 32 | + "Let’s start from documents represented as strings:" |
| 33 | + ] |
| 34 | + }, |
| 35 | + { |
| 36 | + "cell_type": "code", |
| 37 | + "execution_count": 1, |
| 38 | + "metadata": {}, |
| 39 | + "outputs": [], |
| 40 | + "source": [ |
| 41 | + "from gensim import corpora\n", |
| 42 | + "\n", |
| 43 | + "# This is a tiny corpus of nine documents, each consisting of only a single sentence.\n", |
| 44 | + "documents = [\"Human machine interface for lab abc computer applications\",\n", |
| 45 | + " \"A survey of user opinion of computer system response time\",\n", |
| 46 | + " \"The EPS user interface management system\",\n", |
| 47 | + " \"System and human system engineering testing of EPS\", \n", |
| 48 | + " \"Relation of user perceived response time to error measurement\",\n", |
| 49 | + " \"The generation of random binary unordered trees\",\n", |
| 50 | + " \"The intersection graph of paths in trees\",\n", |
| 51 | + " \"Graph minors IV Widths of trees and well quasi ordering\",\n", |
| 52 | + " \"Graph minors A survey\"]" |
| 53 | + ] |
| 54 | + }, |
| 55 | + { |
| 56 | + "cell_type": "markdown", |
| 57 | + "metadata": {}, |
| 58 | + "source": [ |
| 59 | + "First, let’s tokenize the documents, remove common words (using a toy stoplist) as well as words that only appear once in the corpus:" |
| 60 | + ] |
| 61 | + }, |
| 62 | + { |
| 63 | + "cell_type": "code", |
| 64 | + "execution_count": 3, |
| 65 | + "metadata": {}, |
| 66 | + "outputs": [ |
| 67 | + { |
| 68 | + "name": "stdout", |
| 69 | + "output_type": "stream", |
| 70 | + "text": [ |
| 71 | + "[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]\n" |
| 72 | + ] |
| 73 | + } |
| 74 | + ], |
| 75 | + "source": [ |
| 76 | + "# Remove common words and tokenize\n", |
| 77 | + "stoplist = set('for a of the and to in'.split())\n", |
| 78 | + "texts = [[word for word in document.lower().split() if word not in stoplist]\n", |
| 79 | + " for document in documents]\n", |
| 80 | + "\n", |
| 81 | + "print(texts)" |
| 82 | + ] |
| 83 | + }, |
| 84 | + { |
| 85 | + "cell_type": "code", |
| 86 | + "execution_count": 4, |
| 87 | + "metadata": {}, |
| 88 | + "outputs": [ |
| 89 | + { |
| 90 | + "name": "stdout", |
| 91 | + "output_type": "stream", |
| 92 | + "text": [ |
| 93 | + "[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]\n" |
| 94 | + ] |
| 95 | + } |
| 96 | + ], |
| 97 | + "source": [ |
| 98 | + "# Remove words that appear only once\n", |
| 99 | + "from collections import defaultdict\n", |
| 100 | + "frequency = defaultdict(int)\n", |
| 101 | + "for text in texts:\n", |
| 102 | + " for token in text:\n", |
| 103 | + " frequency[token] += 1\n", |
| 104 | + " \n", |
| 105 | + "texts = [[token for token in text if frequency[token] > 1] for text in texts]\n", |
| 106 | + "\n", |
| 107 | + "print(texts)" |
| 108 | + ] |
| 109 | + }, |
| 110 | + { |
| 111 | + "cell_type": "code", |
| 112 | + "execution_count": 5, |
| 113 | + "metadata": {}, |
| 114 | + "outputs": [ |
| 115 | + { |
| 116 | + "name": "stdout", |
| 117 | + "output_type": "stream", |
| 118 | + "text": [ |
| 119 | + "[['human', 'interface', 'computer'],\n", |
| 120 | + " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n", |
| 121 | + " ['eps', 'user', 'interface', 'system'],\n", |
| 122 | + " ['system', 'human', 'system', 'eps'],\n", |
| 123 | + " ['user', 'response', 'time'],\n", |
| 124 | + " ['trees'],\n", |
| 125 | + " ['graph', 'trees'],\n", |
| 126 | + " ['graph', 'minors', 'trees'],\n", |
| 127 | + " ['graph', 'minors', 'survey']]\n" |
| 128 | + ] |
| 129 | + } |
| 130 | + ], |
| 131 | + "source": [ |
| 132 | + "#pretty-printer\n", |
| 133 | + "from pprint import pprint\n", |
| 134 | + "pprint(texts)" |
| 135 | + ] |
| 136 | + }, |
| 137 | + { |
| 138 | + "cell_type": "markdown", |
| 139 | + "metadata": {}, |
| 140 | + "source": [ |
| 141 | + "To convert documents to vectors, we’ll use a document representation called **bag-of-words**. In this representation, each document is represented by one vector where a vector element i represents the number of times the ith word appears in the document. \n", |
| 142 | + "\n", |
| 143 | + "It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary:" |
| 144 | + ] |
| 145 | + }, |
| 146 | + { |
| 147 | + "cell_type": "code", |
| 148 | + "execution_count": 9, |
| 149 | + "metadata": {}, |
| 150 | + "outputs": [ |
| 151 | + { |
| 152 | + "name": "stdout", |
| 153 | + "output_type": "stream", |
| 154 | + "text": [ |
| 155 | + "Dictionary(12 unique tokens: ['human', 'interface', 'computer', 'survey', 'user']...)\n" |
| 156 | + ] |
| 157 | + } |
| 158 | + ], |
| 159 | + "source": [ |
| 160 | + "dictionary = corpora.Dictionary(texts)\n", |
| 161 | + "# we assign a unique integer ID to all words appearing in the processed corpus\n", |
| 162 | + "# this sweeps across the texts, collecting word counts and relevant statistics.\n", |
| 163 | + "# In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (ie., by a 12-D vector).\n", |
| 164 | + "\n", |
| 165 | + "\n", |
| 166 | + "print(dictionary)" |
| 167 | + ] |
| 168 | + }, |
| 169 | + { |
| 170 | + "cell_type": "code", |
| 171 | + "execution_count": 10, |
| 172 | + "metadata": {}, |
| 173 | + "outputs": [ |
| 174 | + { |
| 175 | + "name": "stdout", |
| 176 | + "output_type": "stream", |
| 177 | + "text": [ |
| 178 | + "{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}\n" |
| 179 | + ] |
| 180 | + } |
| 181 | + ], |
| 182 | + "source": [ |
| 183 | + "# To see the mapping between the words and their ids\n", |
| 184 | + "print(dictionary.token2id)" |
| 185 | + ] |
| 186 | + }, |
| 187 | + { |
| 188 | + "cell_type": "code", |
| 189 | + "execution_count": 16, |
| 190 | + "metadata": {}, |
| 191 | + "outputs": [ |
| 192 | + { |
| 193 | + "name": "stdout", |
| 194 | + "output_type": "stream", |
| 195 | + "text": [ |
| 196 | + "[(0, 1), (2, 1)]\n" |
| 197 | + ] |
| 198 | + } |
| 199 | + ], |
| 200 | + "source": [ |
| 201 | + "# To convert tokenized documents to vectors\n", |
| 202 | + "new_doc = \"Human computer interaction\"\n", |
| 203 | + "new_vec = dictionary.doc2bow(new_doc.lower().split())\n", |
| 204 | + "\n", |
| 205 | + "print(new_vec)" |
| 206 | + ] |
| 207 | + }, |
| 208 | + { |
| 209 | + "cell_type": "markdown", |
| 210 | + "metadata": {}, |
| 211 | + "source": [ |
| 212 | + "The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a bag-of-words--a sparse vector, in the form of [(word_id, word_count), ...].\n", |
| 213 | + "\n", |
| 214 | + "As the token_id is 0 for \"human\" and 2 for \"computer\", the new document “Human computer interaction” will be transformed to [(0, 1), (2, 1)]. The words \"computer\" and \"human\" exist in the dictionary and appear once. Thus, they become (0, 1), (2, 1) respectively in the sparse vector. The word \"interaction\" doesn't exist in the dictionary and, thus, will not show up in the sparse vector. The other ten dictionary words, that appear (implicitly) zero times, will not show up in the sparse vector and, there will never be a element in the sparse vector like (3, 0)." |
| 215 | + ] |
| 216 | + }, |
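| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "As a small aside (not part of the original tutorial), the \"12-D vector\" and implicit-zeros remarks can be made concrete with gensim's `matutils.sparse2full` helper, which expands a sparse bag-of-words vector into an explicit dense vector of length `len(dictionary)`, with zeros for every word that does not occur." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "# Illustrative aside: expand the sparse vector for \"Human computer interaction\" into a dense 12-D vector\n", |
| | +    "from gensim import matutils\n", |
| | +    "\n", |
| | +    "dense_vec = matutils.sparse2full(new_vec, len(dictionary))\n", |
| | +    "print(dense_vec)  # twelve numbers; only positions 0 (\"human\") and 2 (\"computer\") are non-zero" |
| | +   ] |
| | +  }, |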
| 217 | + { |
| 218 | + "cell_type": "code", |
| 219 | + "execution_count": 17, |
| 220 | + "metadata": {}, |
| 221 | + "outputs": [ |
| 222 | + { |
| 223 | + "name": "stdout", |
| 224 | + "output_type": "stream", |
| 225 | + "text": [ |
| 226 | + "[(0, 1), (1, 1), (2, 1)]\n", |
| 227 | + "[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]\n", |
| 228 | + "[(1, 1), (4, 1), (5, 1), (8, 1)]\n", |
| 229 | + "[(0, 1), (5, 2), (8, 1)]\n", |
| 230 | + "[(4, 1), (6, 1), (7, 1)]\n", |
| 231 | + "[(9, 1)]\n", |
| 232 | + "[(9, 1), (10, 1)]\n", |
| 233 | + "[(9, 1), (10, 1), (11, 1)]\n", |
| 234 | + "[(3, 1), (10, 1), (11, 1)]\n" |
| 235 | + ] |
| 236 | + } |
| 237 | + ], |
| 238 | + "source": [ |
| 239 | + "corpus = [dictionary.doc2bow(text) for text in texts]\n", |
| 240 | + "for c in corpus:\n", |
| 241 | + " print(c)" |
| 242 | + ] |
| 243 | + } |
| 244 | + ], |
| 245 | + "metadata": { |
| 246 | + "kernelspec": { |
| 247 | + "display_name": "Python 3", |
| 248 | + "language": "python", |
| 249 | + "name": "python3" |
| 250 | + }, |
| 251 | + "language_info": { |
| 252 | + "codemirror_mode": { |
| 253 | + "name": "ipython", |
| 254 | + "version": 3 |
| 255 | + }, |
| 256 | + "file_extension": ".py", |
| 257 | + "mimetype": "text/x-python", |
| 258 | + "name": "python", |
| 259 | + "nbconvert_exporter": "python", |
| 260 | + "pygments_lexer": "ipython3", |
| 261 | + "version": "3.6.3" |
| 262 | + } |
| 263 | + }, |
| 264 | + "nbformat": 4, |
| 265 | + "nbformat_minor": 2 |
| 266 | +} |