Activity
Added example to decontaminate Swissdox in FineWeb (work in progress).
Merge branch 'main' into swissdox
Update the randomize_start argument to randomize_start_duration to ac…
Merge branch 'huggingface:main' into dedup
bugfix pii emails and quality filters default args
fix documents with a lot of paragraphs being removed by the repetitio…
add swissdox and curiavista reader and add first version of SwissAIWr…