Influx

[early stage development] Prototype for a content-based, self-guided, NLP-enhanced integrated language learning environment emphasizing language exposure and active learning.

Is this usable in its current state?

No, not yet. It technically has a functioning database and text reader, but there is no dictionary or translation integration yet. The UI needs a lot of work. I have yet to figure out a way to package a binary built from Rust with an embedded Python interpreter.

Its current state

Some basic UI and working multilingual sentence segmentation and tokenization:

(preview screenshots)

Also now with phrase support!

(demo GIF)

Links

  • Phase I dev log here
  • Continuous dev log here
  • Proposal here

Development notes

Architecture

  • SurrealDB + Axum + Disk as backend service exposing an API
  • Python + Stanza via PyO3 for NLP
  • Svelte + Tailwind frontend that interacts with the API
  • Tauri as a desktop client
  • fsrs-rs for SRS algorithm
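
To make the architecture concrete, here is a minimal sketch (not the actual code in this repo) of an Axum route that calls into the embedded Python NLP layer through PyO3. The nlp module and tokenize function are hypothetical names, and the PyO3 calls assume the pre-0.21 GIL-ref API.

use axum::{extract::Path, routing::get, Json, Router};
use pyo3::prelude::*;

// Call into the embedded Python interpreter to tokenize text.
// `nlp` and `tokenize` are hypothetical names for illustration;
// embedding requires pyo3's auto-initialize feature.
fn tokenize(text: &str) -> PyResult<Vec<String>> {
    Python::with_gil(|py| {
        py.import("nlp")?.call_method1("tokenize", (text,))?.extract()
    })
}

// GET /docs/:lang/:filename -> tokenized text as JSON
async fn get_doc(Path((_lang, filename)): Path<(String, String)>) -> Json<Vec<String>> {
    // A real handler would read the file from disk and consult SurrealDB;
    // tokenizing the filename keeps this sketch self-contained.
    Json(tokenize(&filename).unwrap_or_default())
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/docs/:lang/:filename", get(get_doc));
    // Axum 0.6-style serve; newer versions use axum::serve with a TcpListener.
    axum::Server::bind(&"127.0.0.1:3000".parse().unwrap())
        .serve(app.into_make_service())
        .await
        .unwrap();
}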

Key issues to decide / address

  • a language table in the database with tokens related to languages, vs. a separate database file for each language
  • how to handle lemmatization? should Stanza's lemma be used as the default? how does the user manually assign a lemma? should lemma and reflexes be separate entries? how should they be related in the database? (see the sketch after this list)
  • how to integrate user-provided dictionaries?
  • how to allow extensions? should there be support for custom NLP scripts?
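
On the lemma question, one possible shape, purely a sketch and not the repo's actual schema, is to store the lemma's orthography on each vocabulary record, so that reflexes point at an ordinary entry. Stanza's lemmatizer could fill the field by default, with manual assignment overriding it.

use serde::{Deserialize, Serialize};

// Hypothetical vocabulary record; field names are illustrative only.
#[derive(Serialize, Deserialize)]
struct TokenEntry {
    lang_identifier: String,
    orthography: String, // always lowercased
    // Orthography of the lemma entry, e.g. Some("be") for "was".
    // None when the token is its own lemma.
    lemma: Option<String>,
}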

Plan

(only a partial plan)

Phase I - Project Skeleton

  • file system content access
  • working vocabulary database
  • allow python scripting for extendable language support
  • text processing: tokenization, lemmatization, and sentence segmentation
  • document query api
  • basic text reader
  • token data write requests and confirmations
  • svelte routing structure
  • read toml language configurations
  • read toml application configurations
  • language-specific file listing
  • ensure uniqueness of vocabulary database entries
  • update added token id if saving unmarked
  • Phrase parsing (see the sketch after this list)
    • some algorithm
    • algorithm on Document
    • efficient algorithm
    • frontend
  • database
    • language database
    • relate language and tokens
    • lemma handling
    • always lowercase tokens
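
As context for the phrase-parsing items above, here is a naive greedy longest-match sketch, not necessarily the algorithm used in this repo, that scans a token sequence for known phrases keyed by normalised orthography:

use std::collections::HashSet;

// Scan `tokens` (lowercased orthographies) for known phrases, trying the
// longest slice first at each position. Returns half-open index ranges
// of the matched slices.
fn find_phrases(
    tokens: &[String],
    phrases: &HashSet<String>,
    max_len: usize,
) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let longest = (2..=max_len.min(tokens.len() - i))
            .rev()
            .find(|&len| phrases.contains(&tokens[i..i + len].join(" ")));
        match longest {
            Some(len) => {
                matches.push((i, i + len)); // tokens in this range are shadowed
                i += len;
            }
            None => i += 1,
        }
    }
    matches
}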

Phase II - Backend & Packaging

Phase III - Frontend Usability

  • dictionary support (pop-up only for now)
  • UI design
  • UI implementation
  • loading indicators
  • feedback messages
  • typescript: export TypeScript types for Rust structs

Phase IV - Frontend Language Learning Features

  • dictionary
  • translation
  • TTS
  • sentence structure analysis?

Phase V - Code Quality

  • better error handling
  • documentation
  • security and accounts?
  • make db item a trait

Phase ? - Future

  • markdown rendering?
  • video support
  • audio support
  • pdf + ocr support?

For future self

  • The current implementation is geared toward rapid development. Change all unwrap calls to proper error handling.
  • Files on disk could lead to race conditions, but these probably won't come up in a single-user situation
  • Language settings could be stored on disk
  • security? accounts? whatever for now, as it's localhost
  • influx_api should be renamed influx_server

Running development server

Running influx server

cd influx_api
cargo run

Running nlp server

Python install

brew install python@3.10
cd Influx
python3.10 -m venv py_venv
source py_venv/bin/activate

check it's the right python

which python

make sure it's .../Influx/py_venv/bin/python

python -m pip install stanza==1.7.0 Flask==3.0.0 nuitka==1.9.7

Compiling NLP server

python -m nuitka --follow-imports --onefile main.py

Running a development server

python main.py --port 3001 --influx_path ../toy_content

Running frontend

cd influx_ui
npm run dev

Building

Running Tauri development server

cargo tauri dev

Build:

cargo tauri build --target aarch64-apple-darwin

Terminology (used in code base)

  • A Document is the entire text, consisting of sentences
  • A sentence is a series of sentence constituents
  • A lexeme is either a token or a phrase
  • Constituent refers to a part as it appears in the document or sentence, whereas a lexeme refers to the entry that is, or would be, in the database
  • A phrase is a list of token orthographies
  • A slice is a sequence of lexemes
  • A token is a single unit word or sub-word
  • Orthography refers to the lowercase written form of a token
  • Normalised orthography of a phrase is the orthographies of its tokens joined by spaces; this is a workaround since JavaScript only likes string keys
  • Text refers to the token's form in the original text, so it may be partially uppercased
  • A whitespace token goes between lexemes within a sentence
  • A whitespace document constituent goes between sentences in a document
  • A composite token is something like Let's, which contains the subword tokens Let and us
  • A single token is a single word like let, which can't be broken down further
  • A phrase token is a phrase pretending to be a token; for example, hello world is a phrase but can also be treated like a grand composite token
  • A token is shadowed if it's part of a bigger token or phrase, e.g. let and us are shadowed by Let's; hello and world are shadowed by hello world
  • Lemma always refers to the orthography of the lemma
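
A rough sketch of how this terminology could map onto Rust types (names are illustrative, not necessarily those used in the code base):

// Illustrative only; the actual types in the code base may differ.
struct Document {
    constituents: Vec<DocumentConstituent>,
}

enum DocumentConstituent {
    Sentence(Vec<SentenceConstituent>),
    Whitespace(String), // goes between sentences
}

enum SentenceConstituent {
    // "let": a single word that can't be broken down further
    SingleToken { text: String, orthography: String },
    // "Let's": contains the subword tokens "let" and "us", which it shadows
    CompositeToken { text: String, subword_orthographies: Vec<String> },
    // "hello world": a phrase pretending to be a token
    PhraseToken { normalised_orthography: String },
    // whitespace between lexemes within a sentence
    Whitespace(String),
}

// Normalised orthography: token orthographies joined by spaces, giving the
// string keys that the JavaScript side wants.
fn normalised_orthography(orthographies: &[String]) -> String {
    orthographies.join(" ")
}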

API design

Method defaults to GET if unspecified

  • / returns something random
  • /settings returns app settings as json
    • /langs returns list of languages in settings
  • /vocab to work with vocabs
    • /vocab/token/{lang_identifier}/{orthography} to query for a single token?
    • POST /vocab/create_token to create a token
    • POST /vocab/update_token to update a token
    • DELETE /vocab/delete_token to delete a token
  • /docs to work with docs
    • /docs/{lang_identifier} returns list of content, with metadata, for the language specified by lang_identifier. Currently only supports markdown content.
      • /docs/{lang_identifier}/{filename} returns a specific piece of content, with metadata, text, lemmatised and tokenised text, and results from querying the vocabulary database
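
A minimal sketch of how these routes could be wired up in Axum (stub handlers, 0.6-style path syntax; the nesting of /langs under /settings is an assumption):

use axum::{
    routing::{delete, get, post},
    Router,
};

// Placeholder handler standing in for the real ones.
async fn stub() -> &'static str {
    "todo"
}

fn router() -> Router {
    Router::new()
        .route("/", get(stub))
        .route("/settings", get(stub))
        .route("/settings/langs", get(stub)) // assuming /langs nests under /settings
        .route("/vocab/token/:lang_identifier/:orthography", get(stub))
        .route("/vocab/create_token", post(stub))
        .route("/vocab/update_token", post(stub))
        .route("/vocab/delete_token", delete(stub))
        .route("/docs/:lang_identifier", get(stub))
        .route("/docs/:lang_identifier/:filename", get(stub))
}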
