find a more sound methodology for classifying lines as "poetry" and "not poetry" #1

Open
3 tasks
aparrish opened this issue Aug 13, 2018 · 0 comments

Comments

@aparrish
Owner

Right now, lines are classified as poetry or not using a set of checks based on surface-level textual characteristics. These checks produce okay results, but they're brittle and unsophisticated. Since this is really just a straightforward text classification task, here's what I think is needed, at minimum:

  • a collection of several thousand (or more?) examples of poem lines and non-poem lines, labelled by hand
  • a suite of tests to check the accuracy of any classification method (and tweaks to those methods) against the hand-labelled set
  • a statistical model that produces high accuracy on the hand-labelled set.
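To make the second item concrete, here is a minimal sketch of what an accuracy check against a hand-labelled set could look like. Everything in it is an assumption for illustration: the `LABELLED` examples, the `surface_check` stand-in heuristic, and the `evaluate` helper are hypothetical and are not the project's actual data or checks.

```python
from typing import Callable, List, Tuple

# Hypothetical hand-labelled examples: (line, is_poetry).
# A real set would hold several thousand of these.
LABELLED: List[Tuple[str, bool]] = [
    ("Shall I compare thee to a summer's day?", True),
    ("Thou art more lovely and more temperate:", True),
    ("CHAPTER II. THE VOYAGE OUT", False),
    ("Produced by the Online Distributed Proofreading Team", False),
]


def surface_check(line: str) -> bool:
    """Stand-in surface-level heuristic (NOT the project's real checks):
    treat short lines that are not all-caps as poetry."""
    stripped = line.strip()
    return bool(stripped) and not stripped.isupper() and len(stripped) < 60


def evaluate(
    classify: Callable[[str], bool],
    examples: List[Tuple[str, bool]],
) -> Tuple[float, List[str]]:
    """Return (accuracy, misclassified lines) for any classifier function."""
    misses = [line for line, label in examples if classify(line) != label]
    accuracy = 1 - len(misses) / len(examples)
    return accuracy, misses


accuracy, misses = evaluate(surface_check, LABELLED)
print(f"accuracy: {accuracy:.2f}")
for line in misses:
    print("misclassified:", line)
```

Because `evaluate` accepts any function from a line to a boolean, the same harness could score the existing checks, tweaks to them, and a statistical model side by side.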

I suspect that something as simple as a random forest classifier trained on n-grams would produce pretty good results. A side benefit of this would be that the same classifier could likely be used to find stretches of poetry even in Project Gutenberg books that aren't labelled as "Poetry" in the subject metadata.
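As a rough sketch of the random forest idea, the following uses scikit-learn. The choice of character n-grams (rather than word n-grams), the n-gram range, and the tiny training set are all assumptions made for illustration; the issue does not specify any of them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical training data; a real run would use the hand-labelled set.
lines = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "CHAPTER II. THE VOYAGE OUT",
    "Produced by the Online Distributed Proofreading Team",
    "This eBook is for the use of anyone anywhere at no cost",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = poetry, 0 = not poetry

# Character 1- to 3-grams within word boundaries feed a random forest.
# Both the analyzer and the n-gram range are assumptions, not tuned values.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(lines, labels)

# Classify unseen lines, e.g. from a book not tagged as "Poetry".
unseen = [
    "And summer's lease hath all too short a date:",
    "End of the Project Gutenberg EBook",
]
predictions = model.predict(unseen)
print(list(zip(unseen, predictions)))
```

With only six training lines the predictions mean very little; the point is the shape of the pipeline, which could be scored against the hand-labelled set once that exists.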
