Right now this is accomplished using a set of checks based on surface-level textual characteristics, and while these checks produce okay results, they're brittle and unsophisticated. Since this is really just a straightforward text classification task, here's what I think is needed, at minimum:
a collection of several thousand (or more?) examples of poem lines and non-poem lines, labelled by hand
a suite of tests to check the accuracy of any classification method (and tweaks to those methods) against the hand-labelled set
a statistical model that produces high accuracy on the hand-labelled set.
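The test-suite idea above could start as something very small: a harness that scores any candidate classifier against the hand-labelled set. A minimal sketch, where the sample lines and the naive length-based heuristic are stand-ins for the real labelled data and the current surface-level checks:

```python
# Score any line classifier (a callable returning True for poem lines)
# against a hand-labelled set of (text, is_poem) pairs.
def accuracy(classify, labelled):
    correct = sum(1 for text, is_poem in labelled if classify(text) == is_poem)
    return correct / len(labelled)

# Stand-in hand-labelled data; the real set would have thousands of lines.
labelled = [
    ("Shall I compare thee to a summer's day?", True),
    ("Thou art more lovely and more temperate:", True),
    ("CHAPTER I. Down the Rabbit-Hole", False),
    ("Alice was beginning to get very tired of sitting by her sister,", False),
]

# Stand-in surface-level heuristic: treat short lines as poetry.
def naive(text):
    return len(text) < 45

print(f"naive heuristic accuracy: {accuracy(naive, labelled):.2f}")
```

The same `accuracy` call would work for any tweaked heuristic or trained model, which makes before/after comparisons cheap.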
I suspect just like... a random forest classifier trained on n-grams would produce pretty good results. A side benefit of this would be that the same classifier could likely be used to find stretches of poetry even in Project Gutenberg books that aren't labelled as "Poetry" in the subject metadata.
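For the model itself, a random-forest-on-n-grams baseline is a few lines with scikit-learn. A sketch, assuming scikit-learn is available; the tiny inline training set is a stand-in for the hand-labelled data:

```python
# Random forest over character n-grams for poem-line detection.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Stand-in labelled data: 1 = poem line, 0 = non-poem line.
lines = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "CHAPTER I. Down the Rabbit-Hole",
    "Alice was beginning to get very tired of sitting by her sister,",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(
    # Word-boundary-aware character n-grams (2 to 4 chars).
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    # bootstrap=False keeps the fit deterministic on this toy set.
    RandomForestClassifier(n_estimators=100, bootstrap=False, random_state=0),
)
clf.fit(lines, labels)
print(clf.predict(["And summer's lease hath all too short a date:"]))
```

With real training data, the pipeline could then be slid over any Gutenberg text line by line to flag runs of predicted poem lines, regardless of the subject metadata.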