
german2vec

Language Model and Text Classification for German Language using Deep Learning

Overview

This repository contains documentation and code for building a German language model with the fastai library and applying it to a variety of NLP tasks such as text classification. The language model is based on the 3-layer AWD-LSTM architecture first published by Salesforce Research.
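As a rough illustration (not the repository's notebook code), a language model of this kind could be set up with the fastai text API roughly as sketched below; the file path, DataFrame and column names are assumptions, and the notebooks in scr/ may target an older fastai version.

```python
# Illustrative sketch only: train an AWD-LSTM language model on a German text
# corpus with fastai. File path, DataFrame and column names are assumptions.
import pandas as pd
from fastai.text.all import *

# Assumed: one pre-processed German Wikipedia article per row in column "text".
wiki_df = pd.read_csv("data/german_wikipedia.csv")

dls_lm = TextDataLoaders.from_df(
    wiki_df, text_col="text", is_lm=True, valid_pct=0.1, bs=64
)

# AWD_LSTM is fastai's implementation of the Salesforce AWD-LSTM architecture;
# pretrained=False because the German backbone is trained from scratch here.
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, pretrained=False, metrics=accuracy
)
learn.fit_one_cycle(10, 2e-2)
learn.save_encoder("german_wiki_encoder")  # reused later for classification
```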

The backbone of the model is trained on the German Wikipedia corpus and then transferred to downstream text classification tasks via transfer learning (as described in Universal Language Model Fine-tuning for Text Classification).
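A hedged sketch of that transfer-learning step is shown below, following the ULMFiT recipe of gradual unfreezing with discriminative learning rates; the file name, column names and schedules are assumptions, not the exact code from the sb-10k notebooks.

```python
# Illustrative ULMFiT-style sketch: reuse the saved language-model encoder for
# sentiment classification. File name, column names and schedules are assumptions.
import pandas as pd
from fastai.text.all import *

# Assumed CSV with "text" and "label" columns for the SB-10k tweets.
sentiment_df = pd.read_csv("sb-10k_german_sentiment_classification/sb10k.csv")

# In practice, pass text_vocab=<vocab of the language-model DataLoaders> so that
# token ids match the saved encoder.
dls_clas = TextDataLoaders.from_df(
    sentiment_df, text_col="text", label_col="label", valid_pct=0.1, bs=64
)

# pretrained=False: the encoder comes from the German language model instead of
# fastai's English Wikitext-103 weights.
learn_clas = text_classifier_learner(
    dls_clas, AWD_LSTM, drop_mult=0.5, pretrained=False, metrics=accuracy
)
learn_clas.load_encoder("german_wiki_encoder")  # encoder saved by the LM sketch

# Gradual unfreezing with discriminative learning rates, as in the ULMFiT paper.
learn_clas.fit_one_cycle(1, 2e-2)
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn_clas.unfreeze()
learn_clas.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```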

Update:

A pre-trained language model for the German Wikipedia corpus is available from https://lernapparat.de/german-lm/. Thanks for sharing, Thomas!
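One way to plug such externally pre-trained weights into a fastai learner is via the pretrained_fnames argument, as sketched below; the file names are placeholders for whatever the download provides, and the weights may need conversion depending on the fastai version used.

```python
# Illustrative sketch: load externally pre-trained German LM weights into fastai.
# "german_lm_weights"/"german_lm_itos" are placeholder names for the files from
# https://lernapparat.de/german-lm/; conversion may be needed for newer fastai.
import pandas as pd
from fastai.text.all import *

df = pd.read_csv("data/german_text.csv")  # assumed target corpus
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, bs=64)

# fastai looks for <name>.pth and <name>.pkl under dls_lm.path/"models".
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    pretrained_fnames=["german_lm_weights", "german_lm_itos"],
)
learn.fit_one_cycle(1, 1e-3)  # fine-tune the pre-trained model on the target corpus
```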

Project structure

  • data/ -- pre-trained German language model (available from https://lernapparat.de/german-lm/)
  • doc/ -- documentation and implementation notes
  • sb-10k_german_sentiment_classification/ -- raw data for SB-10k Corpus
  • scr/ -- notebooks for the NLP classification experiments (see the table below)
Notebook | Task
sb-10k-use_pretrained_language_model.ipynb | classifier for SB-10k Corpus (built on pre-trained language model)
sb-10k_small_wikipedia_corpus.ipynb | classifier for SB-10k Corpus (built on self-trained language model using German Wikipedia)
sb-10k-data_preprocessing.ipynb | data pre-processing steps for the SB-10k German Sentiment Corpus

TODO

Future research

to be updated

Contact

For more information, please feel free to contact me via e-mail ([email protected])
