%===================================== CHAP 4 =================================
\chapter{Implementation}\label{cap_4}
This chapter explains in detail how the final system was implemented. Based on the concept described in section \ref{cd_pipeline}, a pipeline transforms Wikipedia's source data into structured example objects. The system's architecture and the format of the source data are displayed in figures, which illustrate how the system creates examples from the source data. The last part of the chapter describes the tools that were of great significance to our project.
\section{Pipeline}
\subsection{Overview}
\textit{Wikidump Parser} is the first part of the pipeline. It transforms the XML dump of the whole of Wikipedia into a database containing the relevant data from articles. The Wikidump Parser is written in JavaScript and executed by a Node.js process. It first parses the XML, then parses the wiki markup, and lastly inserts the relevant articles into the SQL database. These separate processes are managed by a master process executed from the main file \texttt{index.js}.
The second part of the pipeline is called \textit{Example Indexer}. Example Indexer fetches data from the article, section and category tables in the SQL database and combines the data into examples. To fill the index with examples related to specific domains, a whitelist is used to filter examples by their categories before an index is built in Elasticsearch from the remaining examples.
\textit{Example Search} is the last part of the pipeline. When the pipeline has finished processing the data into examples, Example Search can fetch the examples from the index and display them to the user. Example Search is a web-based search interface that queries the index over HTTP and displays the results.
\begin{figure}[H]
\caption{Architectural overview of the pipeline}
\includegraphics[width=\textwidth]{ArchitectureOverview}
\label{fig:pipeline_arch}
\end{figure}
Figure \ref{fig:pipeline_arch} displays these three parts of the whole system, where arrows indicate the data flow. The different parts of the system do not communicate with each other directly, but by inserting and fetching data from permanent storage.
\subsection{Wikidump Parser}
\subsubsection{Architecture}
Wikidump Parser has a very simple architecture. As shown in figure \ref{fig:wikidump_parser}, two main processes are responsible for inserting relevant articles and sections from the XML dump into the SQL database. XMLParser uses a stream to parse the XML and uses an event pattern to notify every time an article has been parsed. Markup Parser then parses the article received from each event. If the article contains an example, Wikidump Parser inserts it into the SQL database.
\begin{figure}[h]
\caption{Architectural overview of Wikidump Parser}
\includegraphics[width=\textwidth]{WikidumpParser}
\label{fig:wikidump_parser}
\end{figure}
\subsubsection{XML parser}
The data dump that is fed into the beginning of the pipeline is in XML format. Figure \ref{fig:xml} shows how the XML is structured. The file is composed of page elements at the top level, marked with the page tag. Each \textit{page} element represents an article in Wikipedia. The page has several sub-elements, but the most interesting element is the one with the revision tag. Wikipedia saves several revisions of a page, but in the XML dump used, only the newest revision is included. It is inside this element that we find the article's content in the form of wiki markup. Line number 18 in figure \ref{fig:xml} contains a comment marking where the whole article's markup is found.
\begin{figure}[h]
\caption{The XML structure used in the XML dump of Wikipedia's database}
\includegraphics[width=\textwidth]{XMLPage}
\label{fig:xml}
\end{figure}
In our project the JavaScript class \texttt{XMLParser.js} handles the parsing of XML. It detects where an article in the XML document starts and where it ends by looking at the element tags, then proceeds to send the content to \texttt{index.js}. The document is read as a stream, which gives the advantage of only keeping a small part of the document in memory. The library \textit{sax}\footnote{\url{https://www.npmjs.com/package/sax}} is used to read the document as a stream. Sax then emits events indicating where in the document it is currently reading, and which elements it is entering or exiting. Our project listens to the events for an open tag, a close tag and text content. By keeping track of which element Sax is currently inside, we extract the content, title, id and timestamp from the article element. When we detect the end of an article element, the extracted data is passed to \texttt{index.js}.
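The listing below is a minimal sketch of this streaming approach using the event names of the \textit{sax} stream interface; the variable names, the input file name and the exact tags tracked are illustrative and simplified compared to the real \texttt{XMLParser.js}.
\begin{Verbatim}[fontsize=\small]
var fs = require('fs');
var sax = require('sax');

var parser = sax.createStream(true);   // strict mode
var inPage = false, currentTag = '', article = {};

parser.on('opentag', function (node) {
  currentTag = node.name;
  if (node.name === 'page') { inPage = true; article = {}; }
});

parser.on('text', function (text) {
  if (!inPage) { return; }
  if (currentTag === 'title') {
    article.title = (article.title || '') + text;
  }
  if (currentTag === 'text') {
    article.content = (article.content || '') + text;
  }
});

parser.on('closetag', function (name) {
  currentTag = '';
  if (name === 'page') {
    inPage = false;
    // hand the extracted article over to index.js here
  }
});

fs.createReadStream('enwiki-pages-articles.xml').pipe(parser);
\end{Verbatim}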
\subsubsection{Markup parser}
When extracting data from Wikipedia articles, the article's markup is parsed. The markup is called \textit{wiki markup}; figure \ref{fig:wiki_markup} demonstrates what the markup might look like for an article. When looking at the XML in figure \ref{fig:xml}, the entire wiki markup for an article is found at \texttt{line 18}.
\begin{figure}[h]
\caption{A short outline from an article's wiki markup, with important aspects highlighted and marked. }
\includegraphics[width=\textwidth]{wiki_markup_marked}
\label{fig:wiki_markup}
\end{figure}
Figure \ref{fig:wiki_markup} points out some important aspects of the markup language. For a complete guide, Wikipedia has its own help page\footnote{Help:wiki markup - \url{https://en.wikipedia.org/wiki/Help:Wiki_markup}}. Mark \texttt{A} shows how to refer to an image or other graphic. The first piece of information states where to find it in Wikipedia's database, while the rest is for the visual presentation of the image. Mark \texttt{B} is the most important markup when dividing the article into sections. The equal signs at each side indicate a header for a section. The number of equal signs ranges from one to six: one is the article title, two is a section, three is a subsection and so on. By using this hierarchy, we can both separate sections from each other and find each section's parent section.
The markup at Mark \texttt{C} is used for special formatting between the tags. In the case of figure \ref{fig:wiki_markup}, mathematical expressions are shown. Code is another format that is often included in examples; it is tagged in the following way: \texttt{<source lang="java">}, where the value of \texttt{lang} is the programming language. Mark \texttt{D} shows how one Wikipedia article refers to another. When parsing the sections, all these references are stored. These references are later used for finding relations and relevance between two articles. The words to the right of the pipe are the reference name, while the ones on the left are the names of the articles that are linked to.
Each article sent to \texttt{index.js} from \texttt{XMLParser.js} is assessed on whether it contains examples or not. Articles not containing any examples are discarded completely from the final collection. In order to assess the articles, the markup is parsed into objects. This is the task of \texttt{MarkupParser.js}. The parser uses several steps to extract the desired information from the markup. The first step parses the entire article's markup. The result of this step is an object for each section, including the introduction. The sections contain a header, the level they belong to in the section hierarchy and the content of the section.
The next step iterates through the list of sections to determine whether any of them are examples. Sections that are not examples are discarded. If no examples are found, the whole article is discarded at this point. References to other articles and categories are then extracted from the markup. The data is finally inserted into the database pictured in figure \ref{fig:db_diagram}.
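As an illustration of the first two steps, the sketch below splits an article's markup into section objects by header level and then keeps only the sections that look like examples. It is a simplified sketch: the real \texttt{MarkupParser.js} also handles the introduction, categories and references, and the assumption that an example section can be recognised by the word ``example'' in its header is made here purely for illustration.
\begin{Verbatim}[fontsize=\small]
// Split wiki markup into sections based on the header markup (= Header =).
function splitSections(markup) {
  var headerRe = /^(={1,6})\s*(.+?)\s*\1\s*$/gm;
  var sections = [], last = null, match;

  while ((match = headerRe.exec(markup)) !== null) {
    if (last) { last.content = markup.slice(last.end, match.index).trim(); }
    last = { header: match[2], level: match[1].length,
             end: headerRe.lastIndex };
    sections.push(last);
  }
  if (last) { last.content = markup.slice(last.end).trim(); }
  return sections;
}

// Keep only sections whose header suggests that they contain an example.
function exampleSections(sections) {
  return sections.filter(function (section) {
    return /example/i.test(section.header);
  });
}
\end{Verbatim}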
\subsection{SQL Database}
Section \ref{custom-pipeline} explains that the source data is fed into the pipeline as a snapshot of Wikipedia at a particular time. The source data is an XML document of approximately 50 GB. Most of the data has no link or relation to examples, making it uninteresting for this project. Therefore the Markup Parser excludes most of it before the remainder is organized in the SQL database.
\begin{figure}[h]
\caption{A simple overview of the database tables and their attributes}
\includegraphics[width=\textwidth]{db_diagram}
\label{fig:db_diagram}
\end{figure}
Figure \ref{fig:db_diagram} displays the tables of the database and the relations between them. If an article with a relevant section is discovered, the article is inserted into the \textit{pages} table in the database, with only metadata and the article's introduction. The relevant sections are then extracted from the article and stored in the database with a relation to the article. The categories of the article are also kept, because they will be helpful when searching among examples in the finished index. At the end of the process of inserting one example into the database, the references are stored with a connection to the section they are used in.
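As a rough illustration of this insertion step, the sketch below stores a parsed section with a relation to its page. It assumes a MySQL database accessed through the \textit{mysql} npm client; the table and column names, the database engine and the client are assumptions made for the sketch and may differ from the project's actual setup.
\begin{Verbatim}[fontsize=\small]
var mysql = require('mysql');

var connection = mysql.createConnection({
  host: 'localhost',
  user: 'wiki',
  password: 'secret',
  database: 'wikidump'
});

// Store a parsed section with a foreign key to the page it belongs to.
// Table and column names are illustrative, not the project's actual schema.
function insertSection(pageId, section, callback) {
  connection.query(
    'INSERT INTO sections (page_id, header, level, content) VALUES (?, ?, ?, ?)',
    [pageId, section.header, section.level, section.content],
    callback
  );
}
\end{Verbatim}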
The database acts as a temporary buffer for the data going into the index. Having a buffer separates the parsing from the building of the index. Changes made to the parsing will therefore not affect the rest of the pipeline, so the database acts as an interface between the parser and the indexer. The data is stored on disk, so the two programs can also be executed independently of each other in time.
This is preferred since the parsing is a very time-consuming process, taking about nine hours to complete a full parsing run. Having the data structured in SQL with its metadata helps show how successful the parsing was. Some of the metadata is also irrelevant for the index, so keeping it only in the database makes the end result less complex.
\subsection{Example Indexer and Whitelist} \label{imp_indexer}
A Java program queries the database for all sections. The relations and data from the other tables are incorporated into the section. Now the section is an object with all the data needed to independently represent an example.
Although all the sections extracted from Wikipedia articles are examples, not all are relevant for a specific search purpose. Therefore the examples can be filtered through a \textit{whitelist} before they are inserted into the index. By filtering the examples before adding them to the index, we can ensure that only examples from interesting domains are added. This can then improve the results returned from the index after a search. Appendix \ref{app_whitelist} contains an example of a whitelist used to keep desired examples. A total of four whitelists have been created, to be used in different combinations. The whitelists can be found in the Example Indexer resource folder, at \texttt{src/main/resources}. The whitelists were created by extracting the complete list of all categories from the SQL database. Next, the categories were sorted by how many articles were related to them. Finally, the list was manually altered into three different versions. A fourth version was created based on Wikipedia's category hierarchy. If one of the categories linked to a specific example also exists in the applied whitelist, that example will be included in the index's corpus.
The purpose of using whitelists was to reduce the number of examples in the index. The whitelist can then reflect a chosen domain by including only examples related to predetermined domains, consequently giving the corpus a specific scope. A more specific scope for the corpus should improve the quality of the results returned.
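The inclusion rule itself is a simple set-membership test: an example is kept if at least one of its categories appears in the applied whitelist. The sketch below illustrates the rule in JavaScript for consistency with the other listings, even though the actual Example Indexer is written in Java; the category names shown are hypothetical.
\begin{Verbatim}[fontsize=\small]
// Hypothetical whitelist entries and examples, for illustration only.
var whitelist = ['Game theory', 'Dynamic programming'];
var examples = [
  { title: 'Prisoners Dilemma',
    categories: ['Game theory', 'Thought experiments'] },
  { title: 'Some other page',
    categories: ['History of France'] }
];

// An example is kept when at least one of its categories is whitelisted.
function passesWhitelist(example) {
  return example.categories.some(function (category) {
    return whitelist.indexOf(category) !== -1;
  });
}

var kept = examples.filter(passesWhitelist);   // only 'Prisoners Dilemma' is kept
\end{Verbatim}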
\begin{figure}[h]
\caption{A Venn diagram showing how whitelists act as subsets of all categories}
\includegraphics[width=\textwidth]{whiteListVenn}
\label{fig:whiteListVenn}
\end{figure}
The four different whitelists were given the names \textit{Edu}, \textit{Top200Edu}, \textit{MathTech} and \textit{MathTechWiki}. The names reflect the set of categories included in each whitelist. \textit{Edu} and \textit{Top200Edu} contain the categories which are most popular and also educational. \textit{Edu} contains all educational topics, while \textit{Top200Edu} is restricted to the top 200 categories. \textit{MathTech} narrows the selection further by only including categories related to mathematics and technology. The last list, \textit{MathTechWiki}, is based on the main categories from Wikipedia's main category overview\footnote{\url{https://en.wikipedia.org/wiki/Portal:Contents/Categories} (Last visited 7 March 2016)}. Mathematics, Logic, Mathematical Sciences, Computing and all of their immediate sub-categories are included. Figure \ref{fig:whiteListVenn} demonstrates how one whitelist is a subset of another category list.
\begin{table}[h!]
\centering
\begin{tabular} {|| p{10em} | r | r||}
\hline
Whitelist & Categories & Examples \\ [0.5ex]
\hline
No whitelist & 17 170 & 28 110 \\
Edu & 17 132 & 22 977 \\
Top200Edu & 200 & 6 364 \\
MathTech & 157 & 5 037 \\
MathTechWiki & 64 & 1 003 \\
\hline
\end{tabular}
\caption{Statistics for examples and categories included when applying different whitelists for filtering of the example collection.}
\label{table:whitelist_stats}
\end{table}
Table \ref{table:whitelist_stats} displays the statistics after each whitelist has been used on its own to filter the examples. The table shows the total number of categories included in each list and how many examples are included in the final corpus.
When building our index, \textit{Top200Edu} and \textit{MathTechWiki} are used, which keeps the examples related to the most popular and important categories. Examples related only to less popular or irrelevant categories are consequently discarded.
Before the examples are inserted into the index, the index is created with the Elasticsearch Java API\footnote{\url{https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/index.html}}. Our project uses the default settings, so only the cluster name needs to be defined beforehand; the rest happens automatically in Elasticsearch. Lastly a mapping for the example object is passed to Elasticsearch. Figure \ref{fig:es_mapping} shows the mapping passed. It defines all the fields with their types. The mapping sets \textit{categories} to \texttt{not\_analyzed}, so it will match the exact category names.
\begin{figure}[H]
\caption{Mapping used to define an example in Elasticsearch}
\centering
\includegraphics[scale=0.5]{es_mapping}
\label{fig:es_mapping}
\end{figure}
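For illustration, an equivalent mapping could also be registered directly over the REST API, as in the sketch below. The \textit{categories} setting matches the mapping described above, while the other field names and types are assumptions about the example object, and the syntax assumes an Elasticsearch 2.x style mapping.
\begin{Verbatim}[fontsize=\small]
curl -XPUT 'http://localhost:9200/wikipedia/_mapping/example' -d '{
  "example" : {
    "properties" : {
      "title"      : { "type" : "string" },
      "content"    : { "type" : "string" },
      "categories" : { "type" : "string", "index" : "not_analyzed" }
    }
  }
}'
\end{Verbatim}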
The examples need to be converted into JSON format before they can be put into the index. A simple method maps all fields over to their equivalent attributes. Attributes in JSON are key-value pairs, with the field's name as key and the field's value as value. The Java API sends each example to Elasticsearch as a document in JSON format, which Elasticsearch then indexes.
\subsection{Index}
As already mentioned, Elasticsearch is the chosen tool for indexing the examples. Through its simple RESTful API, examples can be stored as JSON documents, which Elasticsearch then indexes automatically by detecting the example's data structure and types. Hence, we can focus more on the queries used to retrieve examples, instead of on building the index. Elasticsearch manages the whole index and exposes a simple web-based interface for communication.
During this project, Elasticsearch was run on the same computer as both Example Indexer and Example Search. The communication between these programs therefore happens with HTTP requests over localhost. Localhost is used when communicating between different processes on the same computer. Since localhost resolves to the loopback IP address 127.0.0.1, only the right port is needed. Consequently, Elasticsearch, Example Search and Example Indexer, which are all web applications, communicate with each other without an internet connection. During the project, having them all on the same computer was the easiest option. For possible future use, with the index hosted on a separate server, only the IP address would have to be changed.
Although Elasticsearch handles most of the configuration for us, we have chosen the cluster and node setup. All the data is gathered in one cluster, running on one node, named \texttt{wiki\_cluster}. There is also only one index, \texttt{wikipedia}. To make the index as simple as possible, only one document type exists to represent the examples, which naturally is named \texttt{example}. A query for a specific example will then look like this:\\
\texttt{GET http://localhost:9200/wikipedia/example/1}\\
The request above specifies 1 at the end of the line, which makes Elasticsearch return the example with id 1.
The web server mostly uses queries for its searches. Here is a possible example of how to perform a query:
\begin{Verbatim}[fontsize=\small]
curl -XGET 'http://localhost:9200/wikipedia/example/_search' -d '{
  "query" : {
    "term" : { "title" : "Prisoners Dilemma" }
  }
}'
\end{Verbatim}
The above search only makes use of the example's title field, while the different queries performed by the search interface utilize most of the fields of an example.
\section{Example Search}
\begin{figure}[H]
\caption{A sequence diagram showing the interactions between the actors when searching for examples}
\includegraphics[width=\textwidth]{WebQuery}
\label{fig:web_query}
\end{figure}
Example Search is responsible for managing queries passed from the user. It serves a web page that allows the user to enter queries for examples. Express\footnote{\url{http://expressjs.com/}} is used as a framework to effortlessly set up a web application with an HTTP API.
For the web page, \textit{Jade}\footnote{\url{http://jade-lang.com/}} is used as a template engine to build the HTML structure. Using a template engine makes it easy to reuse markup, and control structures and programming statements can be used directly in the template file. Jade offers its own syntax that is faster to write than the standard HTML syntax. For simplicity, the web page is mainly styled by Wikipedia's style sheets, with a few adjustments. When the user wants to query for examples, the web page offers a search box. The search box sends an HTTP request with the query to the web server.
For the web page to return examples after a search, a series of requests is made to different actors; figure \ref{fig:web_query} gives a quick overview of the actors and their interactions. The interaction starts when the server receives the HTTP request, and Express routes the request to the right function. The routing functions can be found in \texttt{routes/index.js}.
When the request is received, a function in \texttt{elasticsearch/api.js} is called. \texttt{api.js} queries Elasticsearch directly by using the official Elasticsearch JavaScript library\footnote{\url{https://www.npmjs.com/package/elasticsearch}}. Elasticsearch divides the keywords into terms, then it matches the terms with the \textit{inverted index} created from all the examples. Finally Elasticsearch uses TF-IDF to measure the relevance of the examples. The response, ordered by relevance, is sent back to the routing function asynchronously using a callback.
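The sketch below shows how such a query might be issued with the Elasticsearch JavaScript client; the exact query body used by \texttt{api.js}, and the assumption that the keywords are matched against the title field, are illustrative simplifications.
\begin{Verbatim}[fontsize=\small]
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// Search the example index and pass the hits, ordered by relevance,
// back to the caller through a callback.
function searchExamples(keywords, callback) {
  client.search({
    index: 'wikipedia',
    type: 'example',
    body: { query: { match: { title: keywords } } }
  }, function (error, response) {
    if (error) { return callback(error); }
    callback(null, response.hits.hits);
  });
}
\end{Verbatim}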
To display the page in the same way that Wikipedia does, the article's link is used to scrape the page from en.wikipedia.org. By scraping the page, we get access to the complete HTML structure of the actual page. From the HTML structure the example section is extracted. A list of all the relevant examples with their corresponding HTML is sent back to the web page in JSON format.
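A minimal sketch of this scraping step is shown below, using the \textit{request} and \textit{cheerio} libraries that the project includes; the way the example section is located in the HTML, by the id of its header anchor, is an assumption made for the sketch.
\begin{Verbatim}[fontsize=\small]
var request = require('request');
var cheerio = require('cheerio');

// Fetch the rendered article and extract the HTML of one section.
function scrapeSection(articleUrl, sectionId, callback) {
  request(articleUrl, function (error, response, html) {
    if (error) { return callback(error); }
    var $ = cheerio.load(html);
    var header = $('#' + sectionId).parent();      // the heading element
    var parts = [];
    header.nextUntil('h2, h3').each(function () {  // until the next heading
      parts.push($.html(this));
    });
    callback(null, parts.join(''));
  });
}
\end{Verbatim}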
The web page uses AJAX\footnote{AJAX - Asynchronous JavaScript and XML} to handle the response from the server and display the results in a table. In the table a link is created for each example. This link leads to a view where only that example is displayed. In addition, related examples are shown. The related examples are found by sending a new request to the web server. The web server forms a new query to Elasticsearch. By using the selected example and results from the previous search, the web server tries to return the most related examples. First an algorithm detects the most popular category based on the first search. Then the most popular category is used to retrieve all examples that are related to it. Finally, a formula is used to measure how related the examples actually are. The formula used to rank the relatedness of examples based on categories is \(\frac{|M|}{|C|}\), where \(M\) is the set of categories a related example shares with the selected example and \(C\) is the set of all categories of the selected example.
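In code, the relatedness measure can be sketched as follows; the field name \texttt{categories} is assumed to hold the list of category names for an example.
\begin{Verbatim}[fontsize=\small]
// Relatedness = |M| / |C|, where M is the set of shared categories and
// C is the set of all categories of the selected example.
function relatedness(selected, candidate) {
  var shared = selected.categories.filter(function (category) {
    return candidate.categories.indexOf(category) !== -1;
  });
  return shared.length / selected.categories.length;
}
\end{Verbatim}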
With the selected example and its related examples returned to the web page, the user can browse between examples by using the links created for each example.
\section{Tools}
\subsection{Elasticsearch} \label{elasticsearch}
Elasticsearch is the tool used in this project for indexing the examples. Elasticsearch is built on top of Apache Lucene\footnote{\url{https://lucene.apache.org}}, which is an information retrieval library written in Java. Internally in Elasticsearch, data is stored as structured JSON\footnote{JSON - JavaScript Object Notation} documents. The API\footnote{API - Application Programming Interface} for communicating with Elasticsearch is a RESTful\footnote{REST - Representational State Transfer} API using JSON over HTTP. The API can be used for configuring Elasticsearch, building the index and querying it.
%% Mention in the experience section that, since all of this uses JSON and the web server uses JavaScript via Node, everything is very simple and consistent.
Elasticsearch is built for scalability. Being scalable means handling growth of both the dataset and the interactions on it. Elasticsearch scales by having a cluster of many nodes. If the system needs to scale, new nodes can easily be added, and Elasticsearch will distribute resources to the newly added nodes. Different nodes can exist on different servers. However, this project does not need or take advantage of this scaling, and runs on only one node.
There are two ways of implementing a search in Elasticsearch: \textit{filter} and \textit{query}. A \textit{filter} uses \textit{terms} to decide whether a document should be returned or not. Searching with a \textit{term} is very similar to how one would use SQL.
Searches can for instance consist of text strings, numbers, ranges or dates, and Elasticsearch will return everything that matches. It also allows for boolean operators and nesting of these. Using a \textit{filter} is very quick and should be used if the relevance of the documents is insignificant.
If the relevance score is important, then the second option, \textit{query}, should be chosen. If a \textit{query} is combined with a \textit{term}, Elasticsearch looks for the exact value in its index. A score is then returned based on the document's TF-IDF relevance to the term. If a \textit{match} is used instead, an analysis will be performed, creating a list of terms from the query and then executing low-level queries for each of the terms. The results are combined to produce the final relevance score. These two methods can also be combined or extended with other methods to customize the search further.
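To illustrate the difference, the request below combines a \texttt{match} query, which is analysed and scored, with a \texttt{term} filter, which only decides whether a document is returned; the field values are hypothetical and the syntax assumes an Elasticsearch 2.x style \texttt{bool} query.
\begin{Verbatim}[fontsize=\small]
curl -XGET 'http://localhost:9200/wikipedia/example/_search' -d '{
  "query" : {
    "bool" : {
      "must"   : { "match" : { "content"    : "dynamic programming" } },
      "filter" : { "term"  : { "categories" : "Game theory" } }
    }
  }
}'
\end{Verbatim}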
\subsection{NPM and Node.js}
In our project, both Wikidump Parser and Example Search are written in JavaScript, which is normally executed by browsers. Node.js\cite{node} allows the code to be executed from a terminal instead, which enables JavaScript to be used on servers. Node.js is an asynchronous, event-driven framework. By embracing the event loop in this manner, Node.js avoids thread management and the blocking of threads. Instead, callbacks are pushed to the event loop and Node.js runs until there are no more callbacks to perform. This makes Node.js ideal for simpler and less complex systems, and it was therefore chosen for this project.
Node.js also comes with a package manager called Node Package Manager (NPM). NPM allows any Node.js project to include libraries and other JavaScript projects published to their \textit{Open Source Registry}\footnote{\url{https://www.npmjs.com/npm/open-source}}. This is done with a simple command to the npm executable in the terminal.
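For instance, adding the \textit{sax} library to the project is a single command, where the \texttt{--save} flag records the dependency in \texttt{package.json}:
\begin{Verbatim}[fontsize=\small]
npm install sax --save
\end{Verbatim}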
Using code from open source libraries saves a lot of time during development while still giving full control and overview of the code executed. Hence we decided to include several libraries. Examples of libraries used are \textit{sax}, for reading the XML file as a stream; \textit{wtf\_wikipedia}\footnote{\url{https://www.npmjs.com/package/wtf_wikipedia}}, to parse some of the Wikipedia markup; \textit{request}\footnote{\url{https://www.npmjs.com/package/request}}, to fetch an HTML file from a server with a GET call; and \textit{cheerio}\footnote{\url{https://www.npmjs.com/package/cheerio}}, for iterating through an HTML structure while supporting filtering, reading and editing of the HTML. Using libraries makes the system more modular, which makes it easier to alter during development. This has been highly advantageous for this project, since the requirements have changed continually.
\cleardoublepage