llm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
The documentation is available here.
Install the llm-datasets package with pip:
pip install llm-datasets
To keep the default installation minimal, llm-datasets ships optional dependencies that are only needed for certain use cases.
For example, to enable text extraction for all available datasets, run:
pip install llm-datasets[datasets]
The framework provides the llm-datasets command-line interface (CLI) for running the various processing and utility functions:
usage: llm-datasets <command> [<args>]
positional arguments:
{chunkify,collect_metrics,compose,convert_parquet_to_jsonl,extract_text,hf_upload,print_stats,shuffle,train_tokenizer,render_docs,exact_dedup}
llm-datasets command helpers
chunkify Split the individual datasets into equally-sized file chunks (based
on bytes or rows)
collect_metrics Collect metrics (token count etc.) from extracted texts
compose Compose the final train/validation set based on the individual
datasets
convert_parquet_to_jsonl
Convert Parquet files to JSONL
extract_text Extract text from raw datasets
hf_upload Upload files or directories to Huggingface Hub.
print_stats Print dataset statistics as CSV, Markdown, ...
shuffle Shuffle the individual datasets on the file-chunk level (no global
shuffle!)
train_tokenizer Train a tokenizer (only: sentencepiece supported)
render_docs Render files for documents (overview of datasets, statistics, plots)
exact_dedup Exact deduplication using TLSH locality-sensitive hashing
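To illustrate the idea behind the exact_dedup command, here is a minimal sketch of comparing documents via TLSH digests with the py-tlsh package (installed via `make install-tlsh`, see below). The example documents and the distance threshold are illustrative assumptions, not the command's actual implementation:

```python
import tlsh

# Two lightly edited variants of the same text (illustrative only).
doc_a = (
    "Language model training corpora often contain near-duplicate documents, "
    "such as boilerplate pages, mirrored articles, or lightly edited reposts "
    "of the same underlying text."
).encode("utf-8")
doc_b = (
    "Language model training corpora often contain near-duplicate documents, "
    "such as boilerplate pages, mirrored articles, and lightly edited re-posts "
    "of the same underlying text!"
).encode("utf-8")

# tlsh.hash() needs a minimum input length and returns "TNULL" when the
# input is too short or too uniform to produce a digest.
hash_a = tlsh.hash(doc_a)
hash_b = tlsh.hash(doc_b)
if "TNULL" in (hash_a, hash_b):
    raise ValueError("input too short/uniform for a TLSH digest")

# Lower distance means more similar; 0 means identical digests.
distance = tlsh.diff(hash_a, hash_b)

THRESHOLD = 30  # illustrative threshold, not the value used by exact_dedup
label = "near-duplicate" if distance <= THRESHOLD else "distinct"
print(f"{label} (TLSH distance = {distance})")
```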
To download and extract the plain-text of one or more datasets, run the following command:
llm-datasets extract_text $DATASET_ID $OUTPUT_DIR
By default, output is saved as JSONL files. To change the output format, use the --output_format argument as shown below:
llm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
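The extracted output can be inspected with standard tooling. The following is a small sketch for the default JSONL output; the `output` directory stands in for $OUTPUT_DIR above, and the `text` field name is an assumption for illustration, since the actual schema may vary per dataset:

```python
import json
from pathlib import Path

# "output" stands in for $OUTPUT_DIR above; the "text" field name is an
# assumption for illustration and may differ depending on the dataset.
output_dir = Path("output")

for path in sorted(output_dir.glob("*.jsonl"))[:1]:
    with path.open(encoding="utf-8") as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            print(f"{path.name}[{i}]:", repr(doc.get("text", ""))[:100])
            if i >= 2:  # only show the first three documents
                break
```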
A list or table of all available datasets can be printed with the following command:
llm-datasets print_stats --print_output md
Language | Tokens |
---|---|
bg | 31 B |
ca | 6 B |
code | 212 B |
cs | 42 B |
da | 13 B |
de | 160 B |
el | 63 B |
en | 1 T |
es | 101 B |
et | 9 B |
eu | 1 B |
fi | 19 B |
fr | 84 B |
ga | 274 M |
gl | 231 M |
hr | 11 B |
hu | 52 B |
it | 61 B |
lt | 7 B |
lv | 5 B |
mt | 4 B |
nl | 44 B |
nn | 76 M |
no | 13 B |
pl | 45 B |
pt | 46 B |
ro | 18 B |
sh | 184 M |
sk | 32 B |
sl | 13 B |
sr | 11 B |
sv | 19 B |
uk | 56 B |
Source | Tokens |
---|---|
curlicat | 963 M |
macocu | 74 B |
redpajama | 44 B |
wura | N/A |
wikihow | 99 M |
pes2o | 57 B |
proof_pile | 12 B |
pile_of_law | 111 B |
math_amps | 7 B |
edgarcorpus | N/A |
bulgarian_news | 640 M |
bulnc | 4 B |
openlegaldata | 7 B |
dewac | 3 B |
ga_bilingual_legistation | 4 k |
ga_universal_dependencies | 40 k |
hrwac | 2 B |
styria_news | 432 M |
croatian_news_engri | 1 B |
itwac | 3 B |
korpus_malti | 816 M |
sonar | 746 M |
cc_gigafida | 260 M |
academic_slovene_kas | 3 B |
slwac_web | 3 B |
sk_court_decisions | 24 B |
sk_laws | 105 M |
syn_v9 | 13 B |
cs_en_parallel | 473 M |
danish_gigaword | 2 B |
danewsroom | 835 M |
dk_clarin | 80 M |
cabernet | 599 M |
norwegian_cc | 11 B |
pl_nkjp | 3 M |
pl_parliamentary_corpus | 1 B |
parlamento_pt | 732 M |
brwac | 4 B |
seimas_lt_en | 12 k |
state_related_latvian_web | 52 k |
greek_legal_code | 80 M |
greek_web_corpus | 11 B |
estonian_reference_corpus | 481 M |
enc2021 | 3 B |
ekspress | 723 M |
euscrawl | 831 M |
spanish_legal | 1 B |
ylenews | 286 M |
sv_gigaword | 528 M |
srpkor | 866 M |
marcell_legislative_subcorpus_v2 | 1 B |
uk_laws | 2 B |
eurlex | 41 B |
legal_mc4 | 28 B |
wiki | 21 B |
wikibooks | 313 M |
wikiquote | 247 M |
wikinews | 90 M |
wikisource | 2 B |
wikivoyage | 119 M |
colossal_oscar | 2 T |
starcoder | 212 B |
We provide a web-based application built with Streamlit for browsing all datasets and their text content. To start the app, first clone this repository, install the dependencies, and run the following command:
# clone is needed since streamlit does not support apps from modules yet
git clone https://github.com/malteos/llm-datasets.git
streamlit run src/lm_datasets/viewer/app.py -- \
--raw_datasets_dir=$RAW_DATASETS_DIR \
--output_dir=$PROCESSED_DATASET_DIR
To set up your local development environment, we recommend using conda and cloning the repository. The repository also includes settings and launch scripts for VSCode.
git clone [email protected]:malteos/llm-datasets.git
cd llm-datasets
conda create -n llm-datasets python=3.10
conda activate llm-datasets
make install
# if you want to use content hashing (for deduplication), you need to install TLSH
make install-tlsh
Alternatively, you can install the Python package directly from the dev branch:
pip install git+https://github.com/malteos/llm-datasets.git@dev
This repository uses Ruff to validate code quality and formatting.
make lint
The tests can be executed with:
make test
The work on the llm-datasets software is partially funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) through the project OpenGPT-X (project no. 68GX21007D).
Apache 2.0
(Please note that the actual datasets are released under different licenses.)