damionjunk.nlp

This library is a starting point for experimenting with the Stanford CoreNLP package in Clojure. Version 0.3.0 added support for CMU's ark-tweet-nlp.

Implemented Features

CoreNLP has a lot of functionality that I don't use and haven't implemented. I'm primarily interested in the part-of-speech, sentiment, and named entity annotations. I'll add other functionality as the need arises or on request.

POS tagging with CMU's ark-tweet-nlp is implemented. Models can be substituted with an explicit call to (damionjunk.nlp.cmu-ark/initialize :model "modelname").

Dependency Information

For convenience, this library is available on Clojars:

[damionjunk/nlp  "0.3.0"]
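As a rough illustration, the dependency vector above would sit in a Leiningen project.clj like the one below. The project name, project version, and Clojure version are placeholders chosen for this sketch, not values taken from this library.

```clojure
;; project.clj
;; "my-nlp-app", "0.1.0-SNAPSHOT" and the Clojure version are hypothetical.
(defproject my-nlp-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [damionjunk/nlp "0.3.0"]])
```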

This library uses Stanford CoreNLP version 3.5.2, which requires the Java 1.8 runtime. See the CoreNLP History for more details. CoreNLP 3.4.1 was the last version to support Java 1.6 and 1.7. Carnegie Mellon's Ark Tweet NLP library is at version 0.3.2.

Code Examples

CMU ark-tweet-nlp

Parts of speech with CMU's Ark Tweet:

(require '[damionjunk.nlp.cmu-ark :as ark])

(ark/tag "ikr? u r my best friend. :) LOL amirite? #funzone")
;; => ({:token "ikr", :pos "!"} {:token "?", :pos ","} {:token "u", :pos "O"}
;;     {:token "r", :pos "V"} {:token "my", :pos "D"} {:token "best", :pos "A"}
;;     {:token "friend", :pos "N"} {:token ".", :pos ","} {:token ":)", :pos "E"}
;;     {:token "LOL", :pos "!"} {:token "amirite", :pos "!"} {:token "?", :pos ","}
;;     {:token "#funzone", :pos "#"})

CoreNLP Sentiment Annotator

(require '[damionjunk.nlp.stanford :as nlp])

(nlp/sentiment-maps "Hi there. I really hated that movie. Just kidding, I loved it!")

;; => ({:sentiment 2, :text "Hi there."}
;;     {:sentiment 1, :text "I really hated that movie."}
;;     {:sentiment 3, :text "Just kidding, I loved it!"})
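
The numeric :sentiment values can be turned into readable labels with plain Clojure. The sketch below assumes the values are CoreNLP's predicted class index, which runs from 0 (very negative) to 4 (very positive). It works on a literal copy of the output above so it runs without the library; `sentiment-labels` and `label-sentiment` are names made up for this example.

```clojure
;; Sample data copied from the sentiment-maps output above.
(def results
  [{:sentiment 2, :text "Hi there."}
   {:sentiment 1, :text "I really hated that movie."}
   {:sentiment 3, :text "Just kidding, I loved it!"}])

;; Assumption: :sentiment is CoreNLP's class index, 0 through 4.
(def sentiment-labels
  {0 :very-negative, 1 :negative, 2 :neutral, 3 :positive, 4 :very-positive})

(defn label-sentiment
  "Adds a :label key (hypothetical) to a single sentiment map."
  [m]
  (assoc m :label (sentiment-labels (:sentiment m))))

(def labeled (map label-sentiment results))

(map :label labeled)
;; => (:neutral :negative :positive)
```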

CoreNLP Sentiment, POS, and NER

(require '[damionjunk.nlp.stanford :as nlp])

(nlp/sentiment-ner-maps "Here, let me Google that for you.")

;; => ({:sentiment 1,
;;      :text "Here, let me Google that for you.",
;;      :tokens
;;        ({:pos "RB", :ner "O", :token "Here"}
;;         {:pos ",", :ner "O", :token ","}
;;         {:pos "VB", :ner "O", :token "let"}
;;         {:pos "PRP", :ner "O", :token "me"}
;;         {:pos "NNP", :ner "ORGANIZATION", :token "Google"}
;;         {:pos "IN", :ner "O", :token "that"}
;;         {:pos "IN", :ner "O", :token "for"}
;;         {:pos "PRP", :ner "O", :token "you"}
;;         {:pos ".", :ner "O", :token "."})})
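
Since the result is ordinary Clojure data, pulling out the named entities is a short sequence comprehension. The sketch below uses a literal copy of the output above, so it does not need the library loaded. `named-entities` is a helper invented for this example, and it relies on "O" marking tokens that are not part of an entity, as in the output shown.

```clojure
;; Sample data copied from the sentiment-ner-maps output above.
(def result
  [{:sentiment 1
    :text "Here, let me Google that for you."
    :tokens [{:pos "RB",  :ner "O",            :token "Here"}
             {:pos ",",   :ner "O",            :token ","}
             {:pos "VB",  :ner "O",            :token "let"}
             {:pos "PRP", :ner "O",            :token "me"}
             {:pos "NNP", :ner "ORGANIZATION", :token "Google"}
             {:pos "IN",  :ner "O",            :token "that"}
             {:pos "IN",  :ner "O",            :token "for"}
             {:pos "PRP", :ner "O",            :token "you"}
             {:pos ".",   :ner "O",            :token "."}]}])

(defn named-entities
  "Returns {:token .. :ner ..} maps for every token whose :ner tag is not \"O\".
  Hypothetical helper, not part of the library."
  [sentence-maps]
  (for [sentence sentence-maps
        tok      (:tokens sentence)
        :when    (not= "O" (:ner tok))]
    (select-keys tok [:token :ner])))

(named-entities result)
;; => ({:token "Google", :ner "ORGANIZATION"})
```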

Pipelines and Memory

CoreNLP loads a lot of model data into memory and has to do some startup initialization. This library stores each constructed pipeline in an atom for reuse. The atom (damionjunk.nlp.stanford/pipelines) maps keywords to CoreNLP pipeline instances (Java objects).

A different pipeline is built for each type of annotation, because not every annotator is needed for every type of annotation. For example, if you are only interested in sentiment analysis, there is no need to load the POS tagger and NER models.
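
The caching idea can be sketched in a few lines. This is not the library's actual implementation: `build-pipeline` is a hypothetical stand-in that returns a plain map instead of a real CoreNLP object, and `get-pipeline` and `build-count` are names invented for the example. The annotator list is the one shown in the sentiment-maps log output.

```clojure
;; Counts how often a pipeline is actually built (for demonstration only).
(def build-count (atom 0))

(defn build-pipeline
  "Hypothetical stand-in for constructing a CoreNLP pipeline."
  [annotators]
  (swap! build-count inc)
  {:annotators annotators})

;; Keyword -> pipeline cache, in the spirit of the pipelines atom.
(defonce pipelines (atom {}))

(defn get-pipeline
  "Returns the cached pipeline for k, building and storing it on first use."
  [k annotators]
  (or (get @pipelines k)
      (let [p (build-pipeline annotators)]
        (swap! pipelines assoc k p)
        p)))

;; The second call finds the pipeline in the atom and does not rebuild it.
(get-pipeline :sentiment ["tokenize" "ssplit" "parse" "sentiment"])
(get-pipeline :sentiment ["tokenize" "ssplit" "parse" "sentiment"])

@build-count
;; => 1
```

Note that this simple version does not guard against two threads building the same pipeline at once on the very first call; the extra build would be wasted work, but the cache would still end up with a single entry per keyword.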

You will see this on STDERR when making the first call to sentiment-ner-maps:

Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [3.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.9 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [3.5 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Adding annotator sentiment

This is dumped to STDERR when making the first call to sentiment-maps:

Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
Adding annotator sentiment

Subsequent calls reuse the pipeline stored in the atom and do not reload the models. As you can see, the number of models loaded depends on which annotations you need. If you call a different function whose models are already loaded, a new pipeline is constructed, but the models are not reloaded:

(nlp/pos-ner-maps "What part of speech is this?")

;Adding annotator tokenize
;Adding annotator ssplit
;Adding annotator pos
;Adding annotator lemma
;Adding annotator ner

License

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.
