Activity

Add commoncrawl robots code

kylematoba created commoncrawl_robots • 100c923 •
on Nov 21, 2024

Added example to decontaminate Swissdox in FineWeb (work in progress).

amarfurt pushed 1 commit to decont • 0b81df3…28c82d2 •
on Jul 15, 2024

res

amarfurt created decont • 0b81df3 •
on Jul 15, 2024

merged main

EmanuelaBoros pushed 30 commits to dedup • 799ee5a…b58fe9e •
on Jun 28, 2024

res

jderiu pushed 3 commits to main • a550e1d…0b81df3 •
on Jun 26, 2024

res

jderiu pushed 1 commit to pii • 0f2f620…0b81df3 •
on Jun 26, 2024

robots

jderiu pushed 1 commit to pii • 3c72f87…0f2f620 •
on Jun 25, 2024

pii

jderiu created pii • 3c72f87 •
on Jun 19, 2024

merge

jderiu pushed 2 commits to main • ba1c6f5…a550e1d •
on Jun 19, 2024

Merge branch 'main' into swissdox

jderiu pushed 62 commits to swissdox • 9fbab55…f423321 •
on Jun 19, 2024

bugfixes

jderiu pushed 1 commit to swissdox • 1d8dd8a…9fbab55 •
on Jun 19, 2024

merge

EmanuelaBoros pushed 9 commits to main • 0f2c69f…ba1c6f5 •
on May 31, 2024

wip multilegal

EmanuelaBoros pushed 1 commit to dedup • 7448ae3…799ee5a •
on May 31, 2024

Update the randomize_start argument to randomize_start_duration to ac…

EmanuelaBoros pushed 18 commits to main • 9f5f7b0…0f2c69f •
on May 31, 2024

Added multilegal pipeline

EmanuelaBoros pushed 1 commit to dedup • a2a6d9f…7448ae3 •
on May 14, 2024

added requirements.txt

EmanuelaBoros pushed 1 commit to dedup • 909f036…a2a6d9f •
on May 14, 2024

formatting

EmanuelaBoros pushed 1 commit to dedup • 5cce35e…909f036 •
on May 14, 2024

Merge pull request huggingface#181 from QasidSaleem/remove_import_Lis…

EmanuelaBoros pushed 4 commits to main • b2b96e4…9f5f7b0 •
on May 13, 2024

Merge branch 'huggingface:main' into dedup

EmanuelaBoros pushed 28 commits to dedup • 1d8dd8a…5cce35e •
on May 7, 2024

Unsigned int tokenizer and srun args (huggingface#154)

EmanuelaBoros pushed 7 commits to main • c72b1e4…b2b96e4 •
on May 7, 2024

add mulit-law-pile pipeline

EmanuelaBoros created dedup • 1d8dd8a •
on May 3, 2024

bugfix pii emails and quality filters default args

EmanuelaBoros pushed 1 commit to main • a8d21e2…c72b1e4 •
on May 3, 2024

fix documents with a lot of paragraphs being removed by the repetitio…

EmanuelaBoros pushed 2 commits to main • 6d06210…a8d21e2 •
on Apr 30, 2024

add mulit-law-pile pipeline

jderiu pushed 1 commit to swissdox • 80bd7b7…1d8dd8a •
on Apr 25, 2024

add swissdox and curiavista reader and add first version of SwissAIWr…

jderiu created swissdox • 80bd7b7 •
on Apr 25, 2024

Update pypi-release.yml

EmanuelaBoros pushed 17 commits to main • 8c7e052…6d06210 •
on Apr 23, 2024

Line dedup min remove words option (huggingface#146)

EmanuelaBoros pushed 1 commit to main • 670fc40…8c7e052 •
on Apr 8, 2024

Fix substring dedup range (huggingface#132)

EmanuelaBoros pushed 1 commit to main • afadc8f…670fc40 •
on Apr 5, 2024