date: 08-11-2022
written by: Wan-Ting Yeh
language: python
libraries: os, pandas, spacy
- batch analysing text within a folder
- using the spaCy library
- tokenisation, lemmatisation, part-of-speech tagging, type-token ratio
- Walk through all the txt files in one folder
- steps:
- tokenise the text
- clean the data (exclude unwanted tokens, e.g., punctuation, symbols)
- lemmatisation (talked, talking --> talk)
- customised lemmatisation (e.g., peeeeeekaboo --> peekaboo)
- count unique words / total words / type-token ratio
- unique word (type): a distinct word, counted once however many times it appears in the text
- total words (tokens): the word count of the text
- type-token ratio = unique words / total words
- list the part of speech of each word (noun, pronoun, adj...)
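
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names (squeeze_elongation, type_token_ratio, analyse_text) are hypothetical, the rule for the customised lemmatisation (squeeze any character repeated three or more times down to two) is only one possible heuristic, and "unique word" is assumed to mean a distinct word, as is usual for a type-token ratio.

```python
import re
from collections import Counter

# Hypothetical rule: squeeze any character repeated 3+ times down to 2,
# so "peeeeeekaboo" becomes "peekaboo". A lookup table may suit real data better.
_ELONGATED = re.compile(r"(.)\1{2,}")


def squeeze_elongation(word):
    """Collapse runs of three or more identical characters to two."""
    return _ELONGATED.sub(r"\1\1", word)


def type_token_ratio(words):
    """Return (unique_count, total_count, ratio) for a list of words."""
    total = len(words)
    unique = len(set(words))  # assumption: unique word = distinct word
    ratio = unique / total if total else 0.0
    return unique, total, ratio


def analyse_text(text, nlp):
    """Tokenise, clean, lemmatise and tag one text with a loaded spaCy pipeline."""
    doc = nlp(text)
    lemmas, pos_tags = [], []
    for token in doc:
        # Cleaning step: skip punctuation, symbols and whitespace tokens.
        if token.is_punct or token.is_space or token.pos_ == "SYM":
            continue
        lemmas.append(squeeze_elongation(token.lemma_.lower()))
        pos_tags.append(token.pos_)
    unique, total, ratio = type_token_ratio(lemmas)
    return {"unique": unique, "total": total, "ttr": ratio, "pos": Counter(pos_tags)}
```

To try it, load the pipeline with nlp = spacy.load("en_core_web_sm") and call analyse_text(some_text, nlp). The counting and squeezing helpers are plain Python, so they can be checked without spaCy installed.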
- output files
- OUTPUT_PATH_final: unique word
- OUTPUT_PATH_pos: part of speech
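
A possible shape for the batch loop and the two output files is sketched below. It is an assumption-laden outline: list_txt_files and write_outputs are hypothetical names, os.walk also descends into subfolders (use os.listdir instead if only the top folder should be read), and CSV is assumed as the output format because the notes do not say which format is written.

```python
import os


def list_txt_files(folder):
    """Return sorted paths of every .txt file under folder (subfolders included)."""
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(".txt"):
                paths.append(os.path.join(root, name))
    return sorted(paths)


def write_outputs(word_rows, pos_rows, output_path_final, output_path_pos):
    """Write the two result tables as CSV; each rows argument is a list of dicts."""
    import pandas as pd  # imported here so the file listing works without pandas

    # One table for the word counts, one for the part-of-speech results.
    pd.DataFrame(word_rows).to_csv(output_path_final, index=False)
    pd.DataFrame(pos_rows).to_csv(output_path_pos, index=False)
```

In use, the script would loop over list_txt_files(folder), collect one dict per file into word_rows and pos_rows, then pass OUTPUT_PATH_final and OUTPUT_PATH_pos to write_outputs.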
- Ensure pandas and spacy are installed in your environment (os is part of the Python standard library and needs no installation)
- Ensure you have also installed a spaCy pipeline
- see the spaCy documentation: https://spacy.io/usage
- install command (run in the command line):
- python -m spacy download en_core_web_sm
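
Before running the batch, the environment can be checked with a short snippet like the one below. It is only a sketch: missing_requirements is a hypothetical helper, and it relies on the fact that a pipeline installed through spacy download is importable as a package under its own name (en_core_web_sm).

```python
import importlib.util


def missing_requirements(names=("pandas", "spacy", "en_core_web_sm")):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_requirements()
if missing:
    # Report what still needs installing instead of failing later in the run.
    print("Missing:", ", ".join(missing))
```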