Commit

final edits
Mike Holler committed May 6, 2014
1 parent cb99e34 commit 73e6b1c
Showing 8 changed files with 40 additions and 27 deletions.
24 changes: 17 additions & 7 deletions document.tex
@@ -1,5 +1,6 @@
\documentclass[12pt,letterpaper]{article}
%\usepackage[margin=1.5in]{geometry} % Set page margin
%\usepackage{setspace} % Allows \doublespacing command to be used
\usepackage{etex}
\usepackage[T1]{fontenc}
\usepackage{amsmath}
@@ -25,22 +26,32 @@
\usepackage{pgfplotstable}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows}
\usepackage{draftwatermark}
%\usepackage{draftwatermark}
%\SetWatermarkColor[rgb]{1,0.85,0.85}
\usepackage{tabularx,ragged2e,booktabs,caption}
\usepackage{xspace}

\newcolumntype{C}[1]{>{\Centering}m{#1}}
\renewcommand\tabularxcolumn[1]{C{#1}}
\SetWatermarkColor[rgb]{1,0.85,0.85}
\interfootnotelinepenalty=10000 % Don't split footnotes across pages.
%-------------------------------------------------
% From http://tex.stackexchange.com/a/60212/31317
\usepackage{titlesec}
\usepackage[pdfpagelayout=TwoPageRight]{hyperref}
\usepackage[pdfpagelayout=TwoPageRight,ocgcolorlinks]{hyperref}
\usepackage{fontspec}

\setmainfont[Ligatures=TeX]{Times New Roman}
\setmonofont[Scale=MatchLowercase]{DejaVu Sans Mono}

%\usepackage[ocgcolorlinks]{hyperref}
\usepackage{xcolor}
\hypersetup{
colorlinks,
linkcolor=[HTML]{AD0000},
citecolor=[HTML]{0000AD},
urlcolor=[HTML]{000000}
}

% Signature and date command.
%\newcommand*{\SignatureAndDate}[1]{%
% \par\noindent\makebox[2.5in]{\hrulefill} \hfill\makebox[2.0in]{\hrulefill}%
@@ -140,11 +151,10 @@
\textbf{\Large Acknowledgements}
\end{center}

Placeholder. This will be a surprise :)
%I would like to express my deep gratitude to Dr. Caroline St. Clair, my thesis director, for her constant encouragement and invaluable feedback throughout this research.
%I would also like to thank Dr. Michael De Brauw for his feedback as my second reader, and John Small for his help verifying my references.
I would like to express my deep gratitude to Dr. Caroline St. Clair, my thesis director, for her constant encouragement and invaluable feedback throughout this research.
I would also like to thank Dr. Michael De Brauw for his feedback as my second reader, and John Small for his help verifying my references.

%Finally, I would like to thank my parents and girlfriend for their extensive support, encouragement, and understanding over the past year.
Finally, I would like to thank my parents and girlfriend for their extensive support, encouragement, and understanding over the past year.

\newpage
\thispagestyle{empty}
2 changes: 1 addition & 1 deletion tex/10-abstract.tex
@@ -1,7 +1,7 @@
\newpage
\begin{abstract}
Creating the index for a book is either arduous or expensive for authors.
If an automated process can be introduced which effectively generates an index with approximately the same accuracy as a human, it would reduce the time and money authors need to spend on their work, allowing them to allocate those resources in a more useful way.
If an automated process can be introduced which effectively generates an index with approximately the same accuracy as a human, it would reduce the time and money authors need to spend on their work, allowing them to allocate those resources in more useful ways.
This research looks to establish the efficacy of using a Na{\"i}ve Bayes classifier trained with data extracted from Wikipedia articles to create an index for {\it Biology}, a biology textbook.
Wikipedia article titles are used to label each class of training data.
This approach relies on the assumption that a subset of all desired index entries is contained in the set of all possible Wikipedia article titles.
Expand Down
2 changes: 1 addition & 1 deletion tex/15-introduction.tex
@@ -126,7 +126,7 @@ \subsection{Indexing Methods}
At the end of this process, the selection of words and phrases is edited and then typeset~\cite{mulvany}.
While this old-fashioned method worked in the many years before computers became the powerful productivity tools they are today, computers now help indexers by providing purpose-built software that, in effect, replaces the paper note cards with a more efficient digital representation.

The author of {\it Indexing Books} describes computer-aided indexing by breaking indexing software used into major two types: embedded and dedicated.
The author of {\it Indexing Books} describes computer-aided indexing by breaking indexing software used into two major types: embedded and dedicated.
Embedded indexing software helps a writer or an indexer mark, in the electronic text itself, the words and phrases that he or she wants to be indexed~\cite{mulvany}.
Examples of embedded indexing software are Microsoft Word~\cite{ms-word-indexing} and \LaTeX~\cite{lamport}.
Dedicated indexing software, on the other hand, only provides tools whose purpose is to make creating an index easier, like the ability to digitize entries, sort them, and structure them in different ways~\cite{mulvany}.
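As a minimal illustration of embedded indexing in \LaTeX\ (a generic sketch, not drawn from the thesis sources or from {\it Biology}), the author marks terms inline and the toolchain collects them into an alphabetized index:

% Illustrative embedded-indexing sketch; not part of the thesis sources.
\usepackage{makeidx}   % preamble: provides \printindex
\makeindex             % preamble: ask LaTeX to write the .idx entry file
% ... in the document body ...
Osmosis\index{osmosis} moves water across a membrane, driven by a
concentration gradient\index{concentration gradient}.
% ... at the end of the document ...
\printindex            % after running makeindex, typesets the sorted index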
13 changes: 5 additions & 8 deletions tex/20-background.tex
@@ -1,16 +1,12 @@
\section{Background}
% Background section placeholder (move everything below this to another file eventually).
The process of creating an index is either costly or time-consuming for authors, depending on whether they choose to create the index themselves or pay someone to do it.
The act of creating an index is labor- and organization-intensive, often requiring a text to be read multiple times while keeping track of entries electronically or on index cards.

Since indexing a book is so expensive and time-consuming, it makes sense to see whether computers can generate an index that is nearly as accurate as one produced by a human indexing professional.
Software already exists that replaces the indexer's index cards with a more efficient computerized organization system.
However, there is rising interest in seeing whether computers can generate indexes deterministically, without the help of human beings.
To do this, software engineers and researchers must apply natural language processing techniques in new and interesting ways.

\subsection{Accuracy of Human and Automatic Indexes}

When attempting to automate the indexing process, it is important to understand what standards the software will be judged by.
When attempting to automate the indexing process, it is important to understand the standards by which the software will be evaluated.
In {\it Natural-Language Processing and Automatic Indexing}, an article published in {\it The Indexer} by C. Korycinski and Alan F. Newell, the authors cite relevant metrics for automatic indexing excerpted from Cleverdon's work:

\begin{quote}
@@ -23,8 +19,8 @@ \subsection{Accuracy of Human and Automatic Indexes}
\end{quote}

Of course, item 2 is the most relevant to this research, but the other three reveal how imprecisely human beings perform tasks that replicate another human's work.
In the statistics given above, indexing is the least precise of all of the human classification tasks mentioned above.
This makes the gold accuracy standard for an automatic indexer 30\%.
In the statistics given above, indexing is the least precise of all of the human classification tasks mentioned.
This sets the standard for the accuracy of an automatic indexer at 30\%.
That is, a computer-generated index compared to a human-generated index should display at least the same proportion of similarity as one human-generated index compared to another.
Now that a metric for index quality has been established, the next section discusses methods and strategies that can be used to create an automatic indexer that might be able to match this benchmark.
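One plausible way to make that ``proportion of similarity'' concrete (a formalization introduced here for illustration only; Cleverdon's exact measure is not reproduced) is the overlap between two sets of index entries $A$ and $B$:
\[
  \mathrm{consistency}(A, B) = \frac{|A \cap B|}{|A \cup B|},
\]
so an automatically generated index $C$ would need $\mathrm{consistency}(C, H) \ge 0.30$ against a human-generated index $H$ to match the inter-indexer benchmark.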

@@ -46,13 +42,14 @@ \subsubsection{Machine Learning}
English readers use certain heuristics (shortcuts) to aid in determining the gender of a name, and computers can do the same.
For example, an average person might know that names ending in {\it -a} are typically feminine, while names ending in {\it -o} are typically masculine, so he or she understands that the last letter is a strong determinant of a name's gender, or {\it label}.
In NLP, these determinants are known as {\it features}.
With enough data about names and their genders (a {\it training set}), the probability of a letter determining a particular gender can be calculated, and a person could be {\it trained} using this data and the "last letter" feature to guess name genders with reasonable accuracy.
With enough data about names and their genders (a {\it training set}), the probability of a letter determining a particular gender can be calculated, and a person could be {\it trained} using this data and the ``last letter'' feature to guess name genders with reasonable accuracy.
Of course, the last letter is not the only feature of a name that determines its gender, and by training with additional relevant features the classifier's guesses might be made more accurate.
Indeed, this process of determining the category (or {\it class}) of an item is called {\it classification}.
The above is an example of {\it supervised} machine learning, since training data is involved.
Unsupervised machine learning algorithms are outside the scope of this research, and will not be mentioned further in this paper.
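For reference, the textbook form of the Na{\"i}ve Bayes decision rule underlying this kind of supervised classifier (stated generically here, not as this research's exact implementation) selects the label whose prior, multiplied by the per-feature likelihoods estimated from the training set, is largest:
\[
  \hat{c} = \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c),
\]
where each $f_i$ is a feature such as the ``last letter'' of a name and $c$ ranges over the candidate labels (e.g., masculine or feminine).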

\subsubsection{Document Classification}
\label{sec:doc-class}
% Document classification and automatic indexing
Document classification is a type of NLP that uses machine learning methods like those described above, and it is used in this research to create a computer-generated index.
Document classification involves taking a piece of text (known as a {\it document}) as input and producing a label or {\it class} as output~\cite{jurafsky}.
4 changes: 2 additions & 2 deletions tex/25-data-collection.tex
@@ -191,7 +191,7 @@ \subsection{Test Set}
Since the goal of this research is to create an index for a book, the test set was generated from a textbook with a comprehensive index section.
This textbook is called {\it Biology}~\cite{biology}, and is freely available from OpenStax~\cite{openstax-bio} under the Creative Commons Attribution license.
The book was created by six senior contributors that hold professorial positions at prevalent universities, and approximates an average college textbook.
The book was created by six senior contributors who hold professorial positions at prestigious universities, and approximates an average college textbook.
This textbook is 1477 pages long, containing an index of 3118 unique topics (labels), making for 4678 different index entries (references).
In {\it Biology}, all words referred to by index entries are bolded in the text itself. Below is an example of what this looks like (bold in original):
@@ -260,7 +260,7 @@ \subsubsection{Reducing Index Entry Set}
To discover this subset of index entries, a database of Wikipedia titles must be intersected with the {\tt index}.
The Wikimedia Foundation periodically creates dumps for their many databases and makes them publicly available online~\cite{wiki-dumps}.
One of the many data sets they make available is a list of Wikipedia article titles in the main \url{/wiki/} namespace for the English language version of Wikipedia~\cite{wiki-dump-titles}.
One of the many data sets they make available is a list of Wikipedia article titles in the main {\tt /wiki/} namespace for the English language version of Wikipedia~\cite{wiki-dump-titles}.
At the time of this writing, there are 10,639,771 separate Wikipedia article titles matching this criterion.
This dump will serve as the source that will ultimately be intersected with the {\it Biology} index entries to yield the entries that will be used in analyses.
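Stated in set terms (with symbols introduced here purely for convenience), the entries usable in the analyses are
\[
  E = I \cap W,
\]
where $I$ is the set of index entries in {\it Biology} and $W$ is the set of 10,639,771 English Wikipedia article titles; only entries in $E$ can serve as class labels for a classifier trained on Wikipedia data.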
2 changes: 1 addition & 1 deletion tex/30-analysis.tex
@@ -186,7 +186,7 @@ \subsubsection{Writing the Classifier}

\subsubsection{Running the Classifier}

The classifier was run on twelve commodity HP computers with 64-bit Windows 7, Intel i7 processors, and 4 gigabytes of memory installed.
The classifier was run on twelve commodity HP computers with 64-bit Windows~7, Intel~i7 processors, and 4~gigabytes of memory installed.
A zip file containing {\tt classifier.py}, plain text Wikipedia data, plain text {\it Biology} data, {\tt indexToWiki.json}, and {\tt rankedTitles.txt} was copied twice\footnote{The processor would have allowed up to four instances of the classifier to run simultaneously (one on each core), but the classifier's large memory requirements meant that only two instances of the classifier program could run comfortably on each computer.} onto each of the computers.
Each copy of the classifier was then configured to use a unique combination of feature characteristics defining the contents of the feature set.
Finally, both copies of {\tt classify.py} were run simultaneously on each computer, taking advantage of the processors' multiple cores.
10 changes: 5 additions & 5 deletions tex/35-conclusion.tex
@@ -7,12 +7,12 @@ \section{Conclusion}
Although this research was unable to reach human levels of indexing accuracy, it did achieve 9.49\% accuracy with one of the twenty-four tested feature sets.

The most accurate feature set was created using the 2000 most frequently linked articles' titles from the Wikipedia training data.
Since this study focused on testing a breadth of different feature sets, the nuances of the more successful feature sets were left unexplored for lack of time.
Since this study focused on testing a breadth of different feature sets, the nuances of the more successful feature sets were left unexplored.
More research on optimizing this feature set should be done, perhaps by eliminating stop words, using a different type of supervised classification, or changing the sample size of the feature set (which was locked at 2000 for each of the twenty-four trials).

In addition to the sub-optimal accuracy, there are a few overall flaws with using a \naive Bayes classifier---or any document classifier---for automatic indexing.
Document classifiers like the one used for this research are required to label each document once and only once.
This becomes an issue not only when an input paragraph, or document, should {\it not} have an index label, but also when it should have more than one.
This becomes an issue not only when an input paragraph, or document, should {\it not} have an index label ($r = 0$), but also when it should have more than one ($r > 1$).
After all, not every paragraph of a human-indexed book will have an index entry that points to it.
This forced assignment might be eliminated if the classifier consulted a heuristic threshold: any label whose likelihood exceeds the threshold becomes an index entry, and a paragraph whose labels all fall below it is left unlabeled.
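One way to write down that thresholding idea (the notation and the specific form are an illustration, not the thesis's implementation) is
\[
  \mathrm{label}(d) =
  \begin{cases}
    \operatorname*{arg\,max}_{c} P(c \mid d) & \text{if } \max_{c} P(c \mid d) > \theta,\\
    \text{none} & \text{otherwise},
  \end{cases}
\]
where $\theta$ is a tuned cutoff: a paragraph $d$ whose best class never exceeds $\theta$ simply receives no entry (the $r = 0$ case), and emitting every class with $P(c \mid d) > \theta$ would likewise accommodate $r > 1$.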

@@ -22,8 +22,8 @@ \section{Conclusion}
Once automatic indexers based on document classifiers are made more efficient and accurate, publishers might still have to create automatic indexers for different subjects\footnote{E.g., a textbook company might have automatic indexers for Biology, Classics, and Computer Science books.} in order to limit the amount of memory and time the program needs to run without sacrificing accuracy.
These specialized indexers could share the same feature set generation algorithm, but vary by training on different categories of Wikipedia articles.

Overall, this research confirms that the \naive Bayes classifier seems to be a potential candidate for automatic indexing.
Although this experiment failed to achieve human results, it did get appreciably close enough to warrant additional research and improvement.
Although this experiment failed to achieve human results, it did get close enough to warrant additional research and improvement.
Nancy Mulvany, professional indexer and author of {\it Indexing Books}, confidently writes, ``There is nothing automatic about the index-writing process.
There is no automatic indexing tool available that could produce the index in the back of this book,''~\cite{mulvany}.
Mulvany is still correct in her statement, but with the continuing advancement of technology and the gradual improvement of Natural Language Processing techniques, computers may one day be able to recreate the index in Mulvany's book.
Overall, this research confirms that the \naive Bayes classifier seems to be a potential candidate for automatic indexing.
10 changes: 8 additions & 2 deletions tex/40-bibliography.tex
@@ -1,6 +1,12 @@
\pagebreak
\cleardoublepage
\phantomsection
\addcontentsline{toc}{section}{References}
\begin{thebibliography}{99}

% We *really* want to break URLs across lines here. Since URL coloring is
% pure black, the printer won't print these links in ``color''. It's also
% important that this line is within the ``thebibliography'' environment,
% because we only want this behavior in the bibliography.
\hypersetup{ocgcolorlinks=false}
\bibitem{amazon-products}
Amazon. 2014. Product Advertising API. [Online]. Available from: \url{https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html}. Accessed 2014 Apr 24.
