LanguageStatisticsLibPy

LanguageStatisticsLibPy is a Python library for analyzing and manipulating the language statistics data used in the CrypTool 2 software.

The library supports a broad array of languages and offers functionality for generating and handling n-gram data, in particular for calculating n-gram frequencies from the language statistic files of CrypTool 2. These files are found in the "LanguageStatistics" subdirectory of CrypTool 2; for example, "en-5gram-nocs.gz" is an English 5-gram file that is not case-sensitive and excludes spaces. The library also gives access to CrypTool 2's dictionaries through a "Word Tree," an efficient data structure for rapid word searches within a language. The dictionary files are housed in the same "LanguageStatistics" subdirectory.
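The naming convention described above can be sketched as a small parser. This is only an illustration: the pattern is inferred from the example file names in this README, and `StatFileInfo` and `parse_stat_filename` are hypothetical names that are not part of the library.

```python
import re
from typing import NamedTuple


class StatFileInfo(NamedTuple):
    """Hypothetical container for the parts of a language statistic filename."""
    language: str           # language code, e.g. "en"
    n: int                  # n-gram size, e.g. 5
    case_insensitive: bool  # True if "nocs" appears in the name
    with_space: bool        # True if "sp" appears in the name


# Assumed pattern, inferred only from names such as "en-5gram-nocs.gz"
# and "en-4gram-nocs-sp.gz"; real files may follow additional rules.
_PATTERN = re.compile(r"^([a-z]{2})-(\d)gram(-nocs)?(-sp)?\.gz$")


def parse_stat_filename(name: str) -> StatFileInfo:
    """Split a filename into language, n-gram size, and the two flags."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized language statistic filename: {name!r}")
    language, n, nocs, sp = match.groups()
    return StatFileInfo(language, int(n), nocs is not None, sp is not None)
```

For instance, `parse_stat_filename("en-4gram-nocs-sp.gz")` reports an English 4-gram file that is case-insensitive and includes the space symbol.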

Features

Support for Multiple Languages: The library includes predefined support for fifteen languages, including English, German, Spanish, French, and more, each with its own set of unigram frequencies and alphabets.
N-Gram Loading: Users can load unigrams, bigrams, trigrams, tetragrams, pentagrams, and hexagrams as n-gram objects in supported languages, with the option to include or exclude spaces. Loading requires the language statistic files located in the CrypTool 2/LanguageStatistics directory. Note that CrypTool 2 itself includes n-gram files only for n = 1 to 5, so hexagram files are not among them. All language statistic files are case-insensitive, denoted as "nocs" in the filename. Each language statistic is available in two forms: with the space/blank in the alphabet ("sp" in the filename) and without it (no "sp" in the filename).
Index of Coincidence Calculation: It offers a method to calculate the Index of Coincidence (IoC) for a given piece of plaintext, which is useful for cryptanalysis and language pattern recognition.
Alphabet and Number Mapping: The library provides functionality to map characters to their respective positions in a language's alphabet and vice versa, supporting operations on encoded messages or language data.
Dynamic N-Gram and Word Tree Support: Depending on the available data, the library dynamically supports various n-gram types and can load a word tree structure for efficient word lookups in a specific language.
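To make the alphabet mapping and the Index of Coincidence from the list above more concrete, here is a minimal standalone sketch. It does not use the library and makes no claim about how the library implements these features: the function names, the 26-letter English alphabet, and the unnormalized IoC formula (sum of c * (c - 1) over all symbol counts c, divided by N * (N - 1)) are assumptions chosen for illustration.

```python
from collections import Counter

# Assumed alphabet for this sketch; the library defines its own per language.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def map_text_to_numbers(text, alphabet=ALPHABET):
    """Map each character to its position in the alphabet.

    The text is upper-cased first; characters outside the alphabet are skipped.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    return [index[ch] for ch in text.upper() if ch in index]


def map_numbers_to_text(numbers, alphabet=ALPHABET):
    """Inverse mapping: turn alphabet positions back into characters."""
    return "".join(alphabet[i] for i in numbers)


def index_of_coincidence(numbers):
    """Probability that two symbols drawn without replacement are equal."""
    n = len(numbers)
    if n < 2:
        return 0.0  # the IoC is undefined for fewer than two symbols
    counts = Counter(numbers)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    numbers = map_text_to_numbers("Hello, World")
    print(numbers)
    print(map_numbers_to_text(numbers))
    print(index_of_coincidence(numbers))
```

With this unnormalized definition, long English plaintext typically scores around 0.066, while uniformly random text over 26 letters scores close to 1/26 (about 0.038), which is why the value helps to distinguish plaintext-like text from random-looking ciphertext.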

Usage

Initialization: Start by importing the LanguageStatistics class and specifying the language code for your analysis.
Loading N-Grams: To load n-grams of your chosen type (e.g., unigrams, bigrams) for a specific language, use the create_grams method with the appropriate .gz file from the LanguageStatistics directory in CrypTool 2. For instance, to load English 4-grams that are case-insensitive and include the space/blank symbol, use the file named en-4gram-nocs-sp.gz.
Calculating IoC: Calculate the Index of Coincidence for a given plaintext using the calculate_ioc method.
Word Tree Loading: For advanced language analysis, load a pre-built word tree for a specific language using the load_word_tree method.

You can find example usages in the test.py file.
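For readers unfamiliar with the idea behind a word tree, the following is a minimal sketch of a generic prefix tree (trie). It is not the library's implementation and does not read CrypTool 2 dictionary files; the class and method names (`TrieNode`, `Trie`, `contains`, `has_prefix`) are hypothetical and serve only to show why lookups are fast.

```python
class TrieNode:
    """One node of the tree: child nodes keyed by character."""

    def __init__(self):
        self.children = {}
        self.is_word = False  # True if a word ends at this node


class Trie:
    """Hypothetical word tree supporting exact and prefix lookups."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Add a word, creating nodes for characters not yet present."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text):
        """Follow text character by character; return None if the path ends."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if exactly this word was inserted."""
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        """True if at least one inserted word starts with prefix."""
        return self._walk(prefix) is not None
```

A lookup visits one node per character, so its cost depends on the length of the word rather than on the size of the dictionary. For example, after `Trie(["CODE", "CODEX", "CIPHER"])`, `contains("COD")` is false while `has_prefix("COD")` is true.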

Supported Languages

The library includes predefined configurations for the following languages:

English (en)
German (de)
Spanish (es)
French (fr)
Italian (it)
Hungarian (hu)
Russian (ru)
Czech (cs)
Greek (el)
Latin (la)
Dutch (nl)
Swedish (sv)
Portuguese (pt)
Polish (pl)
Turkish (tr)
