Commit bf44270
very last minute punctuation changes
Mike Holler committed May 7, 2014
1 parent 73e6b1c commit bf44270
Showing 6 changed files with 20 additions and 20 deletions.
6 changes: 3 additions & 3 deletions tex/15-introduction.tex
@@ -52,7 +52,7 @@ \section{Introduction}
\addlegendentry{projected}
\end{axis}
\end{tikzpicture}
- \caption{Publication statistics for traditional books published in the U.S.\cite{bowker}}.
+ \caption{Publication statistics for traditional books published in the U.S.\cite{bowker}}
\end{center}
\end{figure}

@@ -88,7 +88,7 @@ \subsection{Cost of Indexing}
\hline
$d$ & Average dollars per page indexed \cite{mulvany} & \$5 \\
\hline
- $p$ & Average Number of Pages in Book (see appx.~\ref{appendix:d}) & 380 pgs) \\
+ $p$ & Average Number of Pages in Book (see appx.~\ref{appendix:d}) & 380 pgs \\
\hline
$r$ & Average Number of Pages Indexed per Hour \cite{connolly} & 7 pgs/hr \\
\hline
@@ -104,7 +104,7 @@ \subsection{Cost of Indexing}

$$ 300,000,000 \text{ USD} = 328259 \times 0.47 \times 380 \times 5 $$

- $$ 1000 \text{ years} = 8,000,000 \text{ hours} = \frac{328259 \times 0.47 \times 380}{7}$$
+ $$ 1,000 \text{ years} = 8,000,000 \text{ hours} = \frac{328259 \times 0.47 \times 380}{7}$$
\caption{The Price of Manual Indexing}
\end{figure}
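As a quick sanity check of the figure's arithmetic (a sketch, assuming 328259 is the yearly book count and 0.47 the fraction requiring an index, since the table rows defining those two constants are elided from this hunk; $d$, $p$, and $r$ are as defined above):

$$ 328259 \times 0.47 \times 380 \approx 5.86 \times 10^{7} \text{ pages} $$

$$ 5.86 \times 10^{7} \text{ pages} \times 5 \text{ USD/page} \approx 2.9 \times 10^{8} \text{ USD} \approx 300,000,000 \text{ USD} $$

$$ \frac{5.86 \times 10^{7} \text{ pages}}{7 \text{ pages/hr}} \approx 8.4 \times 10^{6} \text{ hours} \approx \frac{8.4 \times 10^{6}}{24 \times 365} \text{ years} \approx 960 \text{ years} \approx 1,000 \text{ years} $$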

2 changes: 1 addition & 1 deletion tex/20-background.tex
@@ -31,7 +31,7 @@ \subsection{Natural Language Processing}
Natural language processing (NLP) has existed since the beginning of electronic computers themselves (around 1940), but it did not exist in its modern sense until much more recently.
Today, there is a growing focus on teaching computers to understand and learn the desired qualities of a text so that those qualities can be analyzed and extracted from large amounts of textual data~\cite{jurafsky}.

- Internet search engines like Google rely on NLP techniques to generate their search results\footnote{As of March 1, 2014, Google has published 219 whitepapers on NLP topics, primarily on how to use NLP to improve search results~\cite{google-nlp}.}, text editing software uses NLP to detect grammatical errors in sentences~\cite{norvig}, and mobile applications use NLP to extract summaries from long form text~\cite{bit-of-news}.
+ Internet search engines like Google rely on NLP techniques to generate their search results,\footnote{As of March 1, 2014, Google has published 219 whitepapers on NLP topics, primarily on how to use NLP to improve search results~\cite{google-nlp}.} text editing software uses NLP to detect grammatical errors in sentences~\cite{norvig}, and mobile applications use NLP to extract summaries from long form text~\cite{bit-of-news}.

\subsubsection{Machine Learning}

8 changes: 4 additions & 4 deletions tex/25-data-collection.tex
@@ -8,7 +8,7 @@ \section{Data Collection}
This research takes test and training set data from two different, mutually exclusive corpora.
The reason for this deviation from convention is to answer the question of whether Wikipedia data can be used to train a classifier for a more specific set of input data, like paragraphs from a Biology textbook.
The training set is composed of paragraphs from Wikipedia articles, with each paragraph labeled by the title of the article it belongs to.
- The test data contain paragraphs from the OpenStax {\it Biology} textbook~\cite{biology}, with each paragraph labeled by the index entry that refers to it\footnote{If a {\it Biology} paragraph is not referenced in the index, it is not included in the data set. If it is referenced by multiple index entries, the paragraph is included once per index entry.}
+ The test data contain paragraphs from the OpenStax {\it Biology} textbook~\cite{biology}, with each paragraph labeled by the index entry that refers to it.\footnote{If a {\it Biology} paragraph is not referenced in the index, it is not included in the data set. If it is referenced by multiple index entries, the paragraph is included once per index entry.}

Due to the sheer size of Wikipedia and the memory and computational power available for this research, the data set was reduced by selecting only paragraphs from Wikipedia articles whose title matched an index entry in {\it Biology}.
Accordingly, this research assumes that the set of all reasonable index entries for any textbook is the same as the set of all titles in the English Wikipedia.
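As a concrete illustration of how the two labeled data sets line up, here is a minimal sketch (the helper names are hypothetical, not code from this research):

\begin{lstlisting}[language=Python]
# Hypothetical sketch: training pairs are (paragraph, article title),
# restricted to articles whose title matches a Biology index entry.
biology_index_labels = set(load_biology_index())  # hypothetical helper

training_set = [(paragraph, title)
                for title, paragraphs in wikipedia_articles()  # hypothetical
                for paragraph in paragraphs
                if title in biology_index_labels]

# Test pairs are (paragraph, index entry); a paragraph referenced by
# several index entries appears once per entry, per the footnote above.
test_set = [(paragraph, entry)
            for entry, paragraph in biology_index_references()]  # hypothetical
\end{lstlisting}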
@@ -79,7 +79,7 @@ \subsubsection{Downloading the Training Data}

\subsubsection{Converting MediaWiki Markup to HTML}

- To convert the exported articles to HTML, a local MediaWiki server was installed on a personal computer using MediaWiki's online instructions~\cite{mediawiki-installation} with the purpose of creating a local version of Wikipedia that only contains the articles that share titles with index labels from {\bf Biology}.
+ To convert the exported articles to HTML, a local MediaWiki server was installed on a personal computer using MediaWiki's online instructions~\cite{mediawiki-installation} with the purpose of creating a local version of Wikipedia that only contains the articles that share titles with index labels from {\it Biology}.
This mini-Wikipedia can then be crawled using printPageLinks.py to extract and store the HTML versions of the articles.

After installation, all available first-party MediaWiki plugins were enabled, as Wikipedia uses many of them in its own MediaWiki installation.
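A minimal sketch of the kind of request such a crawl makes against the local server follows (the base URL is an assumption about a default MediaWiki installation; printPageLinks.py itself is not reproduced in this diff):

\begin{lstlisting}[language=Python]
# Hedged sketch: fetch the rendered HTML for one article title from the
# local MediaWiki instance. The install path is assumed, and this is not
# the author's actual script.
import urllib.parse
import urllib.request

BASE_URL = "http://localhost/mediawiki/index.php"  # assumed install path

def fetch_article_html(title):
    query = urllib.parse.urlencode({"title": title})
    with urllib.request.urlopen(BASE_URL + "?" + query) as response:
        return response.read().decode("utf-8")
\end{lstlisting}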
@@ -192,7 +192,7 @@ \subsection{Test Set}
Since the goal of this research is to create an index for a book, the test set was generated from a textbook with a comprehensive index section.
This textbook is called {\it Biology}~\cite{biology}, and is freely available from OpenStax~\cite{openstax-bio} under the Creative Commons Attribution license.
The book was created by six senior contributors who hold professorial positions at prestigious universities, and it approximates an average college textbook.
- This textbook is 1477 pages long, containing an index of 3118 unique topics (labels) making for 4678 different index entries (references).
+ This textbook is 1,477 pages long, containing an index of 3,118 unique topics (labels), making for 4,678 different index entries (references).
In {\it Biology}, all words referred to by index entries are bolded in the text itself. Below is an example of what this looks like (bold in original):
@@ -288,7 +288,7 @@ \subsubsection{Reducing Index Entry Set}
WHERE BINARY i.wikiLabel = at.title;
\end{lstlisting}
- The 3118 unique index labels intersected against all 10,639,771 Wikipedia articles yields a total number of 518 overlapping terms.
+ The 3,118 unique index labels, intersected against all 10,639,771 Wikipedia articles, yield a total of 518 overlapping terms.
This number was so low because all Wikipedia titles begin with a capital letter, but not all index labels do, even when the word is not a proper noun or acronym.
Since this selection was restricted to exact case matches only, all index labels with lowercase initial letters were excluded from the intersection.
This does not seem to introduce a bias towards proper nouns, however, as {\it Biology}'s index contains words and phrases exactly as they appear in the text.
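A toy illustration of the case-sensitivity loss described above (the labels are made up; the case-folded intersection is shown only as a hypothetical alternative to the BINARY match):

\begin{lstlisting}[language=Python]
# Exact-case intersection (what BINARY does) drops lowercase labels.
index_labels = {"enzyme", "Golgi apparatus", "DNA"}
wiki_titles = {"Enzyme", "Golgi apparatus", "DNA"}

exact_match = index_labels & wiki_titles  # 2 terms; "enzyme" is lost
case_folded = ({label.lower() for label in index_labels}
               & {title.lower() for title in wiki_titles})  # all 3 terms
\end{lstlisting}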
18 changes: 9 additions & 9 deletions tex/30-analysis.tex
@@ -38,7 +38,7 @@ \subsubsubsection{Contains}
\underline{All} \underline{animals} \underline{are} {\bf \underline{heterotrophic} \underline{beings}}\textsuperscript{Heterotroph}, \underline{meaning} \underline{that} \underline{they} \underline{feed} \underline{directly} \underline{or} \underline{indirectly} \underline{on} \underline{other} \underline{living} \underline{things}. \underline{They} \underline{are} \underline{often} \underline{further} \underline{subdivided} \underline{into} \underline{groups} \underline{such} \underline{as} \underline{\bf carnivores}\textsuperscript{Carnivore}, \underline{\bf herbivores}\textsuperscript{Herbivore}, \underline{\bf omnivores}\textsuperscript{Omnivore}, \underline{and} \underline{\bf parasites}\textsuperscript{Parasitic animals}.
\end{quote}

- Many of the underlined words above, like ``things'', ``other'', ``subdivided'', {\it et al.} are not indicative of the word Enzyme.
+ Many of the underlined words above, like ``things'', ``other'', ``subdivided'', {\it et al.} are not indicative of the word ``Animal''.
One of the drawbacks of the contains feature is how many unhelpful---but nevertheless underlined---words make it into the feature set, since literally every word in the training set is its own feature.
Since features are supposed to communicate information about a text (in this case, features are used to indicate an article's subject), the presence of so many features that do not help indicate the paragraph's subject suggests the {\it contains} primary feature characteristic does not describe an optimal feature set.
Therefore, it makes sense to try some more restrictive feature sets in addition to testing the contains feature.
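For concreteness, here is a minimal sketch of a {\it contains}-style feature extractor, assuming an NLTK-style dictionary-of-features interface (the research's actual implementation is not shown in this diff):

\begin{lstlisting}[language=Python]
# Hedged sketch: every vocabulary word becomes its own boolean feature,
# which is why so many unhelpful words enter the feature set.
def contains_features(paragraph, vocabulary):
    words = set(paragraph.lower().split())
    return {"contains(%s)" % word: (word in words)
            for word in vocabulary}
\end{lstlisting}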
@@ -171,7 +171,7 @@ \subsection{Conducting the Experiment}
Random \\
\end{tabular}
\end{minipage}
- \caption{Comprehensive list of feature characters being used.\label{tab:feature-characteristics}}
+ \caption{Comprehensive list of feature characteristics being used.\label{tab:feature-characteristics}}
\end{table}
\end{center}

@@ -195,23 +195,23 @@ \subsubsection{Running the Classifier}
\subsection{Experimental Results and Discussion}

The experiment's 24 trials and their results are summarized in Tables~\ref{tab:results-grouped} and~\ref{tab:results-sorted} below.
- Table~\ref{tab:results-grouped} presents the experiment's results grouped by feature characteristic, making it possible to easily locate and compare the effectiveness between features with many similar feature characteristics.
- Table~\ref{tab:results-grouped}, table~\ref{tab:results-sorted} organizes the experimental results by accuracy, with the most accurate features on top.
+ The former presents the experiment's results grouped by feature characteristic, making it possible to easily locate and compare the effectiveness between features with many similar feature characteristics.
+ The latter organizes the experimental results by accuracy, with the most accurate features on top.

Table~\ref{tab:results-grouped} shows that, generally, case-insensitive matches are about twice as accurate as case-sensitive matches.
The best case-sensitive feature combines the {\it most frequent} and {\it in first sentence} feature characteristics, while the best case-insensitive feature (and best overall feature) combines {\it linked article titles} and {\it most frequent}.
The table also shows that the {\it contains} feature yields the worst results of all primary features, with a peak of 0.40\% accuracy when it is used in combination with the {\it random} and {\it case-insensitive} feature characteristics.
This is to be expected, since the other three primary features use heuristics to select only individual features that are more likely to have relevance to the entire paragraph than a random word in the text.

- Table~\ref{tab:results-grouped} shows the general success of the various features relative to the next most and next least accurate feature.
+ Table~\ref{tab:results-sorted} shows the general success of the various features relative to the next most and next least accurate feature.
From this table, it is clear that the best feature, at 9.49\% accuracy, is significantly more accurate than any other feature.
The next best method is {\it case-insensitive, in first sentence, most frequent} at 3.03\%.
After the top two, the two case-sensitive versions of these features follow at 1.21\% and 1.82\% respectively.

- Visibly, table~\ref{tab:results-sorted} also reveals a trend in sampling technique that is worthy of discussion.
+ Visibly, Table~\ref{tab:results-sorted} also reveals a trend in sampling technique that is worthy of discussion.
The table lists {\it most frequent} as almost uniformly the most effective sampling technique, with all but two of the features using the {\it most frequent} feature characteristic appearing at the top of the table.
The other two features using {\it most frequent} are at the bottom of the table, but these features both use the relatively unreliable {\it contains} primary feature.
- {\it Most frequent} was probably the worst of the feature using {\it contains} because stop words\footnote{Stop words are words that, while important for grammar, are largely irrelevant to the document. These are words like ``the'', ``and'', and ``have''.} were not removed for this research, meaning the {\it most-frequent contains} feature likely included a large number of stop words within the 2000 feature used.
+ {\it Most frequent} was probably the worst of the features using {\it contains} because stop words\footnote{Stop words are words that, while important for grammar, are largely irrelevant to the document. These are words like ``the'', ``and'', and ``have''.} were not removed for this research, meaning the {\it most-frequent contains} feature likely included a large number of stop words among the 2,000 features used.
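A sketch of the {\it most frequent} sampling step, including the stop-word filtering the discussion suggests trying (the stop word list and whitespace tokenization are simplifying assumptions):

\begin{lstlisting}[language=Python]
# Hedged sketch: select the n most frequent words as features,
# optionally dropping stop words first.
from collections import Counter

STOP_WORDS = {"the", "and", "have", "of", "to", "a", "in", "is"}  # assumed list

def most_frequent_vocabulary(paragraphs, n=2000, drop_stop_words=False):
    words = (word for p in paragraphs for word in p.lower().split())
    if drop_stop_words:
        words = (word for word in words if word not in STOP_WORDS)
    counts = Counter(words)
    return [word for word, _ in counts.most_common(n)]
\end{lstlisting}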

Like {\it contains}, {\it first word in sentence} appears largely ineffective with a peak accuracy of 0.61\% in case-sensitive and case-insensitive {\it most frequent}.
Given the example provided in section~\ref{sec:first-word}, this is not entirely surprising.
@@ -221,7 +221,7 @@ \subsection{Experimental Results and Discussion}
\pagebreak
\begin{center}
\begin{table}[h]
- \caption{Indexing results using 2000 features.}
+ \caption{Indexing results using 2,000 features.}
\begin{tabular}{cllll}
\multicolumn{1}{l}{\textbf{\begin{tabular}[c]{@{}c@{}}\label{tab:results-grouped}Sensitivity\end{tabular}}} & \textbf{Primary Feature} & \textbf{Sampling Technique} & \textbf{Accuracy} \\ \hline
\multirow{12}{*}{Case-sensitive} & \multirow{3}{*}{Contains} & Least frequent & 0.20\% \\ \cline{3-4}
@@ -255,7 +255,7 @@ \subsection{Experimental Results and Discussion}
\pagebreak
\begin{center}
\begin{table}[h]
- \caption{Indexing results using 2000 features, sorted by accuracy, descending.}
+ \caption{Indexing results using 2,000 features, sorted by accuracy, descending.}
\begin{tabular}{llll}
\label{tab:results-sorted}
\textbf{Sensitivity} & \textbf{Primary Feature} & \textbf{Sampling Technique} & \textbf{Accuracy $\downarrow$} \\ \hline
4 changes: 2 additions & 2 deletions tex/35-conclusion.tex
@@ -6,9 +6,9 @@ \section{Conclusion}
The gold standard for indexing accuracy is 30\%, as defined by Korycinski and Newell~\cite{automatic-indexing}.
Although this research was unable to reach human levels of indexing accuracy, it did achieve a 9.49\% accuracy from one of the twenty-four tested feature sets.

- The most accurate feature set was created using the 2000 most frequently linked articles' titles from the Wikipedia training data.
+ The most accurate feature set was created using the 2,000 most frequently linked articles' titles from the Wikipedia training data.
Since this study focused on testing a breadth of different feature sets, the nuances of the more successful feature sets were left unexplored.
- More research on optimizing this feature set should be done, perhaps by eliminating stop words, using a different type of supervised classification, or changing the sample size of the feature set (which was locked at 2000 for each of the twenty-four trials).
+ More research on optimizing this feature set should be done, perhaps by eliminating stop words, using a different type of supervised classification, or changing the sample size of the feature set (which was locked at 2,000 for each of the twenty-four trials).

In addition to the sub-optimal accuracy, there are a few overall flaws with using a \naive Bayes classifier---or any document classifier---for automatic indexing.
Document classifiers like the one used for this research are required to label each document once and only once.
2 changes: 1 addition & 1 deletion tex/45-appendices.tex
@@ -2,7 +2,7 @@
\section{Indexed Book Statistics}
\label{appendix:d}

- Book Book Goose~\cite{book-book-goose}, a random book browsing site, was used to generate a list of 100 random books\footnote{By inspecting network activity while on the site, it is evident that Book Book Goose uses Amazon's product API\cite{amazon-products} to randomly pull book information from Amazon.}.
+ Book Book Goose~\cite{book-book-goose}, a random book browsing site, was used to generate a list of 100 random books.\footnote{By inspecting network activity while on the site, it is evident that Book Book Goose uses Amazon's product API\cite{amazon-products} to randomly pull book information from Amazon.}
Of these books, those that lacked an author or a title were discarded, leaving 77 books.
Each of these books was searched in Google Books~\cite{google-books}.
If a book in the search results matched the book searched, the number of pages it contained was recorded.
