Influx

[early stage development] Prototype for a content-based, self-guided, NLP-enhanced integrated language learning environment emphasizing language exposure and active learning.

Is this usable in its current state?

No, not yet. It technically has a functioning database and text reader, but there is no dictionary or translation integration yet. The UI needs a lot of work. I have yet to figure out a way to package a binary built from Rust with an embedded Python interpreter.

Its current state

Some basic UI and working multilingual sentence segmentation and tokenization:

(preview screenshots)

Also now with phrase support!

(demo GIF)

Links

  • Phase I dev log here
  • Continuous dev log here
  • Proposal here

Development notes

Architecture

  • SurrealDB + Axum + Disk as backend service exposing an API
  • Python + Stanza via PyO3 for NLP
  • Svelte + Tailwind frontend that interacts with the API
  • Tauri as a desktop client
  • fsrs-rs for SRS algorithm
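
To make the architecture concrete, here is a minimal sketch (not the actual code in this repo) of an Axum route that calls into the embedded Python NLP layer through PyO3. The nlp module and tokenize function are hypothetical names, and the PyO3 calls assume the pre-0.21 GIL-ref API.

use axum::{extract::Path, routing::get, Json, Router};
use pyo3::prelude::*;

// Call into the embedded Python interpreter to tokenize text.
// `nlp` and `tokenize` are hypothetical names for illustration;
// embedding requires pyo3's auto-initialize feature.
fn tokenize(text: &str) -> PyResult<Vec<String>> {
    Python::with_gil(|py| {
        py.import("nlp")?.call_method1("tokenize", (text,))?.extract()
    })
}

// GET /docs/:lang/:filename -> tokenized text as JSON
async fn get_doc(Path((_lang, filename)): Path<(String, String)>) -> Json<Vec<String>> {
    // A real handler would read the file from disk and consult SurrealDB;
    // tokenizing the filename keeps this sketch self-contained.
    Json(tokenize(&filename).unwrap_or_default())
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/docs/:lang/:filename", get(get_doc));
    // Axum 0.6-style serve; newer versions use axum::serve with a TcpListener.
    axum::Server::bind(&"127.0.0.1:3000".parse().unwrap())
        .serve(app.into_make_service())
        .await
        .unwrap();
}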

Key issues to decide / address

  • a language table in the database with tokens related to languages, vs. a separate database file for each language
  • how to handle lemmatization? should Stanza's lemma be used as the default? how does the user manually assign a lemma? should lemma and reflexes be separate entries? how should they be related in the database? (see the sketch after this list)
  • how to integrate user-provided dictionaries?
  • how to allow extensions? should there be support for custom NLP scripts?
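
On the lemma question, one possible shape, purely a sketch and not the repo's actual schema, is to store the lemma's orthography on each vocabulary record, so that reflexes point at an ordinary entry. Stanza's lemmatizer could fill the field by default, with manual assignment overriding it.

use serde::{Deserialize, Serialize};

// Hypothetical vocabulary record; field names are illustrative only.
#[derive(Serialize, Deserialize)]
struct TokenEntry {
    lang_identifier: String,
    orthography: String, // always lowercased
    // Orthography of the lemma entry, e.g. Some("be") for "was".
    // None when the token is its own lemma.
    lemma: Option<String>,
}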

Plan

(only a partial plan)

Phase I - Project Skeleton

  • file system content access
  • working vocabulary database
  • allow python scripting for extendable language support
  • text processing: tokenization, lemmatization, and sentence segmentation
  • document query api
  • basic text reader
  • token data write requests and confirmations
  • svelte routing structure
  • read toml language configurations
  • read toml application configurations
  • language-specific file listing
  • ensure uniqueness of vocabulary database entries
  • update added token id if saving unmarked
  • Phrase parsing (see the sketch after this list)
    • some algorithm
    • algorithm on Document
    • efficient algorithm
    • frontend
  • database
    • language database
    • relate language and tokens
    • lemma handling
    • always lowercase tokens
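
As context for the phrase-parsing items above, here is a naive greedy longest-match sketch, not necessarily the algorithm used in this repo, that scans a token sequence for known phrases keyed by normalised orthography:

use std::collections::HashSet;

// Scan `tokens` (lowercased orthographies) for known phrases, trying the
// longest slice first at each position. Returns half-open index ranges
// of the matched slices.
fn find_phrases(
    tokens: &[String],
    phrases: &HashSet<String>,
    max_len: usize,
) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let longest = (2..=max_len.min(tokens.len() - i))
            .rev()
            .find(|&len| phrases.contains(&tokens[i..i + len].join(" ")));
        match longest {
            Some(len) => {
                matches.push((i, i + len)); // tokens in this range are shadowed
                i += len;
            }
            None => i += 1,
        }
    }
    matches
}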

Phase II - Backend & Packaging

Phase III - Frontend Usability

  • dictionary support (pop-up only for now)
  • UI design
  • UI implementation
  • loading indicators
  • feedback messages
  • typescript: export TypeScript types for Rust structs

Phase IV - Frontend Language Learning Features

  • dictionary
  • translation
  • TTS
  • sentence structure analysis?

Phase V - Code Quality

  • better error handling
  • documentation
  • security and accounts?
  • make db item a trait

Phase ? - Future

  • markdown rendering?
  • video support
  • audio support
  • pdf + ocr support?

For future self

  • The current implementation is geared toward rapid development. Change all unwrap calls to proper error handling.
  • Files on disk could lead to race conditions, but these probably won't come up in a single-user situation
  • Language settings could be stored on disk
  • security? accounts? whatever for now, as it's localhost
  • influx_api should be renamed influx_server

Running development server

Running influx server

cd influx_api
cargo run

Running nlp server

Python install

brew install python@3.10
cd Influx
python3.10 -m venv py_venv
source py_venv/bin/activate

check it's the right python

which python

make sure it's .../Influx/py_venv/bin/python

python -m pip install stanza==1.7.0 Flask==3.0.0 nuitka==1.9.7

Compiling NLP server

python -m nuitka --follow-imports --onefile main.py

Running a development server

python main.py --port 3001 --influx_path ../toy_content

Running frontend

cd influx_ui
npm run dev

Building

Running Tauri development server

cargo tauri dev

Build:

cargo tauri build --target aarch64-apple-darwin

Terminology (used in code base)

  • A Document is the entire text, consisting of sentences
  • A sentence is a series of sentence constituents
  • A lexeme is either a token or a phrase
  • Constituent refers to a part as it appears in the document or sentence, whereas a lexeme refers to the entry that is, or would be, in the database
  • A phrase is a list of token orthographies
  • A slice is a sequence of lexemes
  • A token is a single unit word or sub-word
  • Orthography refers to the lowercase written form of a token
  • Normalised orthography of a phrase is the orthographies of its tokens joined by spaces; this is a workaround since JavaScript only likes string keys
  • Text refers to the token's form in the original text, so it may be partially uppercased
  • A whitespace token goes between lexemes within a sentence
  • A whitespace document constituent goes between sentences in a document
  • A composite token is something like Let's, which contains the subword tokens Let and us
  • A single token is a single word like let, which can't be broken down further
  • A phrase token is a phrase pretending to be a token; for example, hello world is a phrase but can also be treated like a grand composite token
  • A token is shadowed if it's part of a bigger token or phrase, e.g. let and us are shadowed by Let's; hello and world are shadowed by hello world
  • Lemma always refers to the orthography of the lemma
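
A rough sketch of how this terminology could map onto Rust types (names are illustrative, not necessarily those used in the code base):

// Illustrative only; the actual types in the code base may differ.
struct Document {
    constituents: Vec<DocumentConstituent>,
}

enum DocumentConstituent {
    Sentence(Vec<SentenceConstituent>),
    Whitespace(String), // goes between sentences
}

enum SentenceConstituent {
    // "let": a single word that can't be broken down further
    SingleToken { text: String, orthography: String },
    // "Let's": contains the subword tokens "let" and "us", which it shadows
    CompositeToken { text: String, subword_orthographies: Vec<String> },
    // "hello world": a phrase pretending to be a token
    PhraseToken { normalised_orthography: String },
    // whitespace between lexemes within a sentence
    Whitespace(String),
}

// Normalised orthography: token orthographies joined by spaces, giving the
// string keys that the JavaScript side wants.
fn normalised_orthography(orthographies: &[String]) -> String {
    orthographies.join(" ")
}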

API design

Method defaults to GET if unspecified

  • / returns something random
  • /settings returns app settings as json
    • /langs returns list of languages in settings
  • /vocab to work with vocabs
    • /vocab/token/{lang_identifier}/{orthography} to query for a single token?
    • POST /vocab/create_token to create a token
    • POST /vocab/update_token to update a token
    • DELETE /vocab/delete_token to delete a token
  • /docs to work with docs
    • /docs/{lang_identifier} returns list of content, with metadata, for the language specified by lang_identifier. Currently only supports markdown content.
      • /docs/{lang_identifier}/{filename} returns a specific piece of content, with metadata, text, lemmatised and tokenised text, and results from querying the vocabulary database
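
A minimal sketch of how these routes could be wired up in Axum (stub handlers, 0.6-style path syntax; the nesting of /langs under /settings is an assumption):

use axum::{
    routing::{delete, get, post},
    Router,
};

// Placeholder handler standing in for the real ones.
async fn stub() -> &'static str {
    "todo"
}

fn router() -> Router {
    Router::new()
        .route("/", get(stub))
        .route("/settings", get(stub))
        .route("/settings/langs", get(stub)) // assuming /langs nests under /settings
        .route("/vocab/token/:lang_identifier/:orthography", get(stub))
        .route("/vocab/create_token", post(stub))
        .route("/vocab/update_token", post(stub))
        .route("/vocab/delete_token", delete(stub))
        .route("/docs/:lang_identifier", get(stub))
        .route("/docs/:lang_identifier/:filename", get(stub))
}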
