%===================================== CHAP 3 =================================
\chapter{Conceptual design}\label{cap_3}
This chapter explains, on a conceptual level, how the system extracts examples, adds them to a database and makes them searchable for an end user. It starts with a brief outline of the overall approach, before going into detail on the pipeline and on how the examples produced by the pipeline are managed for search purposes. The chapter finishes with an analysis of examples, which gives us a better understanding of them and in turn can improve the performance of the system when it is implemented.
\section{Overall Approach}
The goal of our approach is to create a searchable index for a collection of examples. The first step is to find examples and store them in a database. Wikipedia was chosen as the source for extracting examples: its articles have a reliable and consistent format, and it is the largest collection of openly available information, which gives us a sufficient amount of source data to work with. The source data is acquired from an XML dump that Wikimedia publishes regularly. The dump provides each article's markup together with additional metadata, but apart from the markup itself the articles carry no structure. A considerable amount of processing is therefore required to locate and extract relevant examples that can be inserted into the index. To achieve this we created a software pipeline in which several independent processes work together, using the output of one process as input to the next. We start by feeding the pipeline the raw XML file. The pipeline parses the XML and the markup of each article, and stores the relevant articles in a relational database. A separate process creates examples based on the data in the database. Finally, an index is built from these examples.
\section{Pipeline} \label{cd_pipeline}
\subsection{Creating the pipeline} \label{custom-pipeline}
\begin{figure}[h]
\caption{A conceptual overview of the pipeline used to extract and index examples}
\includegraphics[width=\textwidth]{PipelineConcept}
\end{figure}
Based on the discussion in section \ref{smila}, it was decided not to use the SMILA system and its utilities directly, but instead to look at how SMILA approaches the problem and take inspiration from it. Other methods were therefore considered, which mainly consisted of building the pipeline from scratch. Tailoring the different sub-processes to our project and organizing them in a pipeline was judged to be the best course of action. The first issue to settle was the format of the source data. SMILA fetches information by crawling the web, in our case Wikipedia\footnote{www.wikipedia.org}, but a web crawler is a very general method. Since our project only uses Wikipedia, the extraction method could be more specific. An XML dump taken from a snapshot of Wikipedia's database fits our project well in terms of simplicity and stability. The dump contains the most recent version of every Wikipedia article.
To transform the XML dump into useful data, several sub-processes are utilized. All the sub-processes in the pipeline are loosely coupled, meaning that they are fully independent given the correct input. This independence allows sub-processes to be swapped out or altered without affecting the rest of the system. The SMILA pipeline also delivered a high degree of parallelism, a property the current version of our pipeline built from scratch does not have, because functionality was prioritized over efficiency and performance.
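As a small illustration of this loose coupling, the listing below composes two independent, hypothetical stages with Java's standard \texttt{Function} interface. The stages are not taken from the actual implementation; the point is only that each stage knows nothing about the others beyond the type of its input and output, so any stage can be replaced without touching the rest.
\begin{verbatim}
import java.util.function.Function;

// A toy illustration of loosely coupled pipeline stages; the stages are
// hypothetical and not taken from the actual implementation.
public final class LooseCouplingSketch {

    public static void main(String[] args) {
        // Stage 1: clean up the raw markup of a section.
        Function<String, String> normalizeMarkup =
                markup -> markup.trim().toLowerCase();
        // Stage 2: derive a simple property (word count) from the cleaned text.
        Function<String, Integer> countWords =
                text -> text.split("\\s+").length;

        // The pipeline is just the composition of the stages.
        Function<String, Integer> pipeline = normalizeMarkup.andThen(countWords);
        System.out.println(pipeline.apply("  An example about Pareto efficiency  "));
    }
}
\end{verbatim}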
\subsection{Extracting data from XML-dump}
Parsing the source data and extracting examples from it is the first task the software pipeline performs. The source data used as input to the pipeline is in XML format. At the top level of the XML document there are article elements marked with an \textit{article} tag. Each article element represents a page on Wikipedia and thus contains metadata about the page as well as the content itself. The complete XML document is about 50 gigabytes in size. Because of this, the process reads the document as a stream, buffering it line by line, identifies the article elements and saves the relevant data from each of them to an object in memory. For each article, the process looks at its sections and selects the ones that contain an example. A relational database is created to store information about the examples. While iterating over the articles, the process populates the database with data from the relevant sections and articles. The database stores metadata from the articles, the content of specific sections and the references found in those sections. The relations between articles, sections and references are also stored; they are used in the next step of the pipeline, when the data is combined into example entities.
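The listing below is a minimal sketch of how such a streaming extraction could look in Java, using the standard StAX API. The element names, the heading heuristic and the file name are assumptions made for illustration and do not necessarily match the actual dump format or the implementation.
\begin{verbatim}
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Minimal sketch of streaming extraction from the XML dump using StAX.
// The element names "article", "title" and "text", as well as the example
// heading heuristic, are assumptions for illustration only.
public final class DumpReader {

    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("wikipedia-dump.xml"));

        String title = null;
        StringBuilder text = new StringBuilder();

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("title")) {
                    title = reader.getElementText();
                } else if (name.equals("text")) {
                    text = new StringBuilder(reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("article")) {
                // One article is complete: keep it only if it appears to
                // contain an example section (illustrative heuristic).
                if (text.indexOf("== Example") >= 0) {
                    System.out.println("Found example section in: " + title);
                    // Here the article would be stored in the relational database.
                }
                title = null;
                text = new StringBuilder();
            }
        }
        reader.close();
    }
}
\end{verbatim}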
\subsection{Building an index of examples from SQL}
The first step in building the index is fetching data from the database to form examples. Virtual tables called \textit{views} are used in the database to normalize the data into the format that the index will use.
Elasticsearch was chosen as the tool for building the index; section \ref{elasticsearch} explains why it is a good choice for this project. A simple Java process fetches the views from the database and creates objects representing examples. References to other examples and the categories each example belongs to are also added here. The example model has a one-to-one relationship to the examples represented in the Elasticsearch index, so the objects are converted directly to JSON and inserted into the index. To configure and query the index, Elasticsearch exposes an HTTP API. The API is used before the index is built to specify the behaviour of the different fields, and afterwards to search the complete index.
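The listing below is a minimal sketch of this step, assuming a PostgreSQL database with a view named \texttt{example\_view} and an Elasticsearch index named \texttt{examples}. The JDBC URL, the column names and the index layout are illustrative assumptions, and a real implementation would use a proper JSON library rather than string formatting.
\begin{verbatim}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: read examples from a database view and index them through
// Elasticsearch's HTTP API. The connection string, the view "example_view",
// the index "examples" and the field names are assumptions for illustration.
public final class ExampleIndexer {

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/wiki");
             Statement stmt = db.createStatement();
             ResultSet rows = stmt.executeQuery(
                     "SELECT id, title, introduction, content FROM example_view")) {

            while (rows.next()) {
                String json = String.format(
                        "{\"title\": \"%s\", \"introduction\": \"%s\", \"content\": \"%s\"}",
                        escape(rows.getString("title")),
                        escape(rows.getString("introduction")),
                        escape(rows.getString("content")));

                // PUT one example document into the index, keyed by its id from the view.
                URL url = new URL("http://localhost:9200/examples/example/" + rows.getLong("id"));
                HttpURLConnection http = (HttpURLConnection) url.openConnection();
                http.setRequestMethod("PUT");
                http.setDoOutput(true);
                http.setRequestProperty("Content-Type", "application/json");
                try (OutputStream out = http.getOutputStream()) {
                    out.write(json.getBytes(StandardCharsets.UTF_8));
                }
                System.out.println("Indexed example " + rows.getLong("id")
                        + ": HTTP " + http.getResponseCode());
            }
        }
    }

    // Crude JSON escaping for the sketch; a JSON library should be used instead.
    private static String escape(String value) {
        return value == null ? "" : value.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }
}
\end{verbatim}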
\section{Search examples}
\subsection{General idea}
When the source data consisting of Wikipedia pages has been processed through the pipeline described in section \ref{cd_pipeline}, the final result is an index built up of examples. An index on its own is not very useful without an easy way of interacting with it, however. Therefore a user-friendly, web-based search interface has been created. Figure \ref{fig:ui} displays the interaction between the user and Elasticsearch, with the two methods of querying: by keywords or by a selected example.
\begin{figure}[h]
\caption{A simple overview of the user's interaction with the examples}
\includegraphics[width=\textwidth]{UserInteraction}
\label{fig:ui}
\end{figure}
The main objective of the interface is to assist the user in finding relevant examples and to present these examples in a helpful manner. To present the examples, the interface uses HTML and CSS from the live version of Wikipedia, served from its web server. To find relevant examples, the interface helps the user in two ways. The first is simply to use keywords entered by the user in a search field. The second applies when the user has already selected a relevant example; the search interface then uses this example to find similar examples.
\subsection{Querying by keywords}
The user's first interaction with the interface is through the search field. In this field the user types keywords, which are used in a query sent to Elasticsearch. Elasticsearch then uses TF-IDF scoring to match the search phrase against fields contained in the examples. The fields used for matching the search terms are introduction, content, title and categories. By boosting the importance of some fields, the query can be fine-tuned to deliver a more relevant collection of examples based on the keywords in the search phrase.
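As an illustration, the body of such a keyword query, sent to the index's \texttt{\_search} endpoint over the HTTP API, could look roughly like the sketch below. The search phrase and the boost values are made up for illustration; only the choice of fields follows from the description above.
\begin{verbatim}
{
  "query": {
    "multi_match": {
      "query": "pareto efficiency",
      "fields": ["title^3", "categories^2", "introduction", "content"]
    }
  }
}
\end{verbatim}
Here the caret suffix boosts matches in the title and category fields relative to the introduction and content.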
\subsection{Querying by examples}
The collection of examples returned by the user's first query is presented in a list. The list allows the user to browse through the returned examples and select one. Selecting an example triggers a new query, which is automatically sent to Elasticsearch and executed. The query uses data from the selected example and from the previous search: by comparing the categories of the selected example with the categories found in the search results, the most popular category is identified. This is the category from the selected example's set of categories that appears most often among the best results of the initial search. The most popular category is then used by Elasticsearch to retrieve related examples. The returned examples are rated according to how well they match the selected example's set of categories, and divided into two lists based on their rating. One list contains examples with a perfect match, while the other contains the ten best partially matching examples.
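Such a query could, for instance, be expressed as an Elasticsearch \textit{bool} query along the lines of the sketch below; the category names and the section id are made-up values, and the field names are assumptions for illustration. The \texttt{must} clause restricts the results to the most popular category, the \texttt{should} clauses reward matches on the selected example's remaining categories, and the \texttt{must\_not} clause excludes the selected example itself.
\begin{verbatim}
{
  "query": {
    "bool": {
      "must": { "term": { "categories": "pareto efficiency" } },
      "should": [
        { "term": { "categories": "game theory" } },
        { "term": { "categories": "welfare economics" } }
      ],
      "must_not": { "term": { "section_id": 4711 } }
    }
  }
}
\end{verbatim}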
%The new query is based on the selected example and is a \textit{compound query}, which means it is a query that combines the results of subqueries. One of those subqueries used by this query is a filter. The sub query filters results based on common categories. It also filters on section id, to exclude the same example. The second part of the query is another sub query that performs a match between references and page titles. This match uses an analyzer to make the query act like a \textit{full text search} giving results as a score in opposition to a simple boolean value that would indicated only if there was a match or not. By using an analyzer instead, more information is available for measuring the relevance of the example.By combining these subqueries the interface presents two lists of related examples to the user, one based on references into the current example, and a second one based on references from the current example.
\section{Analysis of examples} \label{examples-section} %% must be rewritten to describe how we actually benefit from this, once it is implemented
%structure/characteristics of an example\\
\subsection{An example}
An example is used as a tool to better understand a topic. It usually accompanies an explanatory text, of which the example is a minor part. Examples are rarely presented on their own, since they often require a certain degree of context. If this context is fused into the example, the example tends to become very complex and too troublesome to use efficiently. Because of this, an example can often be viewed as an appendix to a text about a subject.
People who want to learn about a certain topic, but already know the basic context, may prefer to skip the explanatory text and only look at the examples. This project tries to exploit the fact that examples act as an appendix to a subject. In doing so, we attempt to give people who are trying to expand their knowledge of a topic a better way of doing it. In order to develop a better tool for them, Research Goal 2 focuses on reaching a deeper understanding of examples. We hope that a better understanding of examples will lead to better search results and better recommendations of related examples.
\subsection{Comparing examples} \label{comparing_examples}
Finding common properties of examples is very difficult, and nearly impossible across different domains. Even when comparing two examples within the same domain and topic, differences in approach and explanatory technique make it hard to compare them directly. As a consequence, a qualitative and subjective analysis has been made of examples within the Game Theory domain.
Different key properties have been identified and examined across different topics. Some properties relate to the structure and presentation of an example, while others are based on its contents. By examining topics within Game Theory, a list of properties of an example was compiled. Table \ref{table:1} displays these properties with a brief explanation.
\begin{table}[h!]
\centering
\begin{tabular} {|| p{5em} | p{23em} ||}
\hline
Name & Description \\ [0.5ex]
\hline\hline
Figures & Graphics and drawings referenced in the example. \\
Length & The length of the example, measured in words or characters. \\
Domain & The domain the example belongs to. \\
Source code & Whether source code is used to explain principles. \\
Equations & Whether equations are used to explain the example. \\
Analogies & Whether a real-world example is used to explain a theoretical one. \\
Subsections & Whether the example is divided into smaller parts or sections that offer different angles of explanation. \\
Walkthrough & A step-by-step guide that solves a problem. \\
Iterations & Several iterations that gradually increase the complexity of the explanation. \\
References & How often the example refers to figures or to other examples and articles. \\ [1ex]
\hline
\end{tabular}
\caption{Properties of an example and a brief description of each.}
\label{table:1}
\end{table}
\subsection{Structure of a good example} \label{good_example}
The structure of an example is mostly defined by how it utilizes the properties listed in table \ref{table:1}. Simply checking which example has the most properties is inaccurate and misleading: a good example does not need to have all the properties, and counting them will not identify the best one.
Instead, the combination of values these properties take is significant, especially the domain property, since some domains generally benefit from a different composition of properties than others. This leads to a narrower, domain-specific approach when looking into different examples.
To identify the structure of a good example, a few topics within the Game Theory domain were selected. For each topic, for instance \textit{Pareto efficiency}, two or three examples were compared to each other.
For the topic \textit{Pareto efficiency}, two examples were examined. Pareto efficiency is about reaching a state of resource allocation in which it is not possible to make one party better off without making another party worse off. The first example is named \textit{Examples and exercises on Pareto efficiency}. It builds up its explanation step by step over several iterations, each a bit more complex than the previous one, which enables the example to cover the topic with a certain depth. The second example is named \textit{Robinson Crusoe example}. It is very long and detailed, and in addition to a large amount of text it makes substantial use of equations. At first glance the first example seems better; it certainly makes it easier to grasp the concept of \textit{Pareto efficiency}. The second example focuses more on the mathematical side, and some readers will most likely prefer that. See appendix \ref{app_example} for both examples.
\cleardoublepage