find a more sound methodology for classifying lines as "poetry" and "not poetry" #1

aparrish · 2018-08-13T15:28:05Z

Right now this is accomplished using a set of checks based on surface-level textual characteristics, and while these checks produce okay results, they're brittle and unsophisticated. Since this is really just a straightforward text classification task, here's what I think is needed, at minimum:

- a collection of several thousand (or more?) examples of poem lines and non-poem lines, labelled by hand
- a suite of tests to check the accuracy of any classification method (and tweaks to those methods) against the hand-labelled set
- a statistical model that produces high accuracy on the hand-labelled set
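
To make the second item concrete, here's a rough sketch of what an accuracy check against a hand-labelled set might look like. Everything in it is an assumption for illustration: the four example lines, the `surface_check` stand-in (which is not the project's actual set of checks), and the `accuracy` helper are all hypothetical, and a real labelled set would be loaded from a file rather than written inline.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical hand-labelled examples: (line, is_poetry).
# A real set would hold several thousand lines and be loaded from disk.
LABELLED = [
    ("Shall I compare thee to a summer's day?", True),
    ("Thou art more lovely and more temperate:", True),
    ("CHAPTER II. THE VOYAGE OUT", False),
    ("Produced by the Online Distributed Proofreading Team", False),
]


def surface_check(line: str) -> bool:
    """Illustrative stand-in for a surface-level check (not the project's rules).

    Treats a line as poetry if it is non-empty, shorter than 60 characters,
    and not written entirely in capitals.
    """
    stripped = line.strip()
    return bool(stripped) and len(stripped) < 60 and not stripped.isupper()


def accuracy(
    classify: Callable[[str], bool],
    examples: Iterable[Tuple[str, bool]],
) -> float:
    """Return the fraction of examples on which classify() matches the label."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one labelled example")
    correct = sum(1 for line, label in examples if classify(line) == label)
    return correct / len(examples)


if __name__ == "__main__":
    print(f"surface_check accuracy: {accuracy(surface_check, LABELLED):.2f}")
```

Because `accuracy` takes any callable from a line to a boolean, the same check could score the current surface-level rules, a tweaked version of them, or a trained model, which makes it easy to tell whether a change is an improvement or a regression.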

I suspect just like... a random forest classifier trained on n-grams would produce pretty good results. A side benefit of this would be that the same classifier could likely be used to find stretches of poetry even in Project Gutenberg books that aren't labelled as "Poetry" in the subject metadata.
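
A minimal sketch of that idea using scikit-learn, under some assumptions: the n-grams here are character n-grams of length 1 to 3 (word n-grams would be an equally reasonable reading), the hyperparameters are arbitrary, and the handful of training lines is a made-up placeholder for the hand-labelled set. `build_model` is a hypothetical helper name, not something that exists in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline


def build_model(random_state: int = 0):
    """Build an untrained pipeline: character n-gram counts -> random forest."""
    return make_pipeline(
        # Assumption: character 1- to 3-grams with case preserved, so that
        # capitalisation and punctuation patterns are visible to the model.
        CountVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False),
        RandomForestClassifier(n_estimators=100, random_state=random_state),
    )


# Placeholder training data; 1 = poetry, 0 = not poetry.
train_lines = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "CHAPTER II. THE VOYAGE OUT",
    "Produced by the Online Distributed Proofreading Team",
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = build_model()
model.fit(train_lines, train_labels)

# Predict on unseen lines; with so little training data the output is only
# a smoke test, not a meaningful result.
unseen = ["And miles to go before I sleep,", "Release Date: March 1, 2005"]
predictions = model.predict(unseen)
```

With a real labelled set, the interesting numbers would come from held-out lines rather than the training lines, and `predict_proba` could be used to flag stretches of consecutive high-probability lines in books that lack the "Poetry" subject.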
