1 change: 1 addition & 0 deletions changelog.md
@@ -11,6 +11,7 @@
### Added

- Support passing multiple paths to `edsnlp.read_parquet`
- New `edsnlp.data.from_huggingface_dataset()` and `edsnlp.data.to_huggingface_dataset()` data connectors, with corresponding `hf_ner` and `hf_text` converters for NER datasets and plain text datasets, respectively.
- New `eds.llm_span_qualifier` component to perform span classification tasks using LLMs

### Changed
36 changes: 36 additions & 0 deletions docs/data/converters.md
@@ -219,6 +219,42 @@ one per entity, that can be used to write to a dataframe. The schema of each pro
heading_level: 4
show_source: false

## Hugging Face Text (`converter="hf_text"`) {: #hf_text }

The `hf_text` converter is designed for simple text datasets from the Hugging Face Hub, where each example contains a single text field (e.g., IMDB, Wikipedia). It tokenizes the raw text and optionally extracts a document ID.
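To make the expected input shape concrete, here is a minimal sketch of the kind of record `hf_text` consumes: one text column (default: `text`) plus an optional id column. The `as_hf_text_record` helper and the `review_id` column are hypothetical, for illustration only:

```python
def as_hf_text_record(example, text_column="text", id_column=None):
    """Pick the text (and optional doc id) out of a raw HF example."""
    record = {"text": example[text_column]}
    if id_column is not None and id_column in example:
        record["id"] = example[id_column]
    return record


example = {"text": "This movie was great.", "review_id": "imdb-42"}
print(as_hf_text_record(example, id_column="review_id"))
# {'text': 'This movie was great.', 'id': 'imdb-42'}
```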

### Converting HF Text data to Doc objects {: #edsnlp.data.converters.HfTextDict2DocConverter }

::: edsnlp.data.converters.HfTextDict2DocConverter
    options:
      heading_level: 4
      show_source: false

### Converting Doc objects to HF Text data {: #edsnlp.data.converters.HfTextDoc2DictConverter }

::: edsnlp.data.converters.HfTextDoc2DictConverter
    options:
      heading_level: 4
      show_source: false

## Hugging Face NER (`converter="hf_ner"`) {: #hf_ner }

The `hf_ner` converter is designed for token-level named entity recognition (NER) datasets from the Hugging Face Hub (e.g., CoNLL-2003, WikiNER). It expects datasets with token lists and corresponding NER tags, and converts them to `Doc` objects with entities extracted using a flexible tagging scheme that supports BIO, BILOU, and other common NER formats.
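To make the tagging scheme concrete, here is a self-contained sketch of how integer tags decode into entity spans under a BIO scheme, given a `tag_order` list. The `decode_bio` helper is purely illustrative, not EDS-NLP's actual implementation:

```python
def decode_bio(tag_ids, tag_order):
    """Decode integer BIO tags into (start, end, label) token spans."""
    entities = []
    start = label = None
    for i, tag_id in enumerate(tag_ids):
        tag = tag_order[tag_id]
        # An entity continues only on "I-" with the same label
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues and label is not None:
            entities.append((start, i, label))
            start = label = None
        # "B-" always opens an entity; a dangling "I-" opens one leniently
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    if label is not None:
        entities.append((start, len(tag_ids), label))
    return entities


tag_order = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
# Tokens: ["John", "Smith", "lives", "in", "Paris"]
print(decode_bio([1, 2, 0, 0, 3], tag_order))
# [(0, 2, 'PER'), (4, 5, 'LOC')]
```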

### Converting HF NER data to Doc objects {: #edsnlp.data.converters.HfNerDict2DocConverter }

::: edsnlp.data.converters.HfNerDict2DocConverter
    options:
      heading_level: 4
      show_source: false

### Converting Doc objects to HF NER data {: #edsnlp.data.converters.HfNerDoc2DictConverter }

::: edsnlp.data.converters.HfNerDoc2DictConverter
    options:
      heading_level: 4
      show_source: false

## Markup (`converter="markup"`) {: #edsnlp.data.converters.MarkupToDocConverter }

This converter is used to convert markup data, such as Markdown or XML, into documents.
67 changes: 67 additions & 0 deletions docs/data/huggingface.md
@@ -0,0 +1,67 @@
# Hugging Face datasets

??? abstract "TLDR"

    ```{ .python .no-check }
    import edsnlp

    nlp = edsnlp.blank("eds")

    # Read from the Hub (streaming) and convert to Docs
    stream = edsnlp.data.from_huggingface_dataset(
        "lhoestq/conll2003",
        split="train",
        converter="hf_ner",
        tag_order=[
            "O",
            "B-PER",
            "I-PER",
            "B-ORG",
            "I-ORG",
            "B-LOC",
            "I-LOC",
            "B-MISC",
            "I-MISC",
        ],
        nlp=nlp,
        load_kwargs={"streaming": True},
    )

    # Optionally process the documents with the pipeline
    stream = stream.map_pipeline(nlp)

    # Export back to a HF IterableDataset of dicts
    hf_iter = edsnlp.data.to_huggingface_dataset(
        stream,
        converter="hf_ner",
        words_column="tokens",
        ner_tags_column="ner_tags",
    )
    ```

Use the Hugging Face Datasets ecosystem as a data source or sink for EDS-NLP pipelines. You can read datasets from the Hub or reuse already loaded `datasets.Dataset` / `datasets.IterableDataset` objects, optionally shuffle them deterministically, loop over them, and map them through any pipeline before writing them back as an `IterableDataset`.

We rely on the `datasets` package. Install it with `pip install datasets` or `pip install "edsnlp[ml]"`.

Typical converters:

- `hf_ner`: expects token and tag columns (defaults: `tokens`, `ner_tags`) and produces Docs with entities. Compatible with BILOU/IOB schemes through `tag_order` or `tag_map`.
- `hf_text`: expects a single text column (default: `text`) and produces plain Docs; optional `id_column` is inferred when present.

When loading a dataset dictionary with multiple splits, pass an explicit `split` (e.g. `"train"`). You can also select a configuration/subset via `name` and forward any `datasets.load_dataset` arguments through `load_kwargs` (e.g. `{"streaming": True}`).
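For instance, the following sketch streams the train split of a plain text dataset (the dataset path is illustrative):

```{ .python .no-check }
import edsnlp

nlp = edsnlp.blank("eds")

# `name` would select a configuration/subset; `load_kwargs` is
# forwarded to `datasets.load_dataset`, here to enable streaming
stream = edsnlp.data.from_huggingface_dataset(
    "stanfordnlp/imdb",  # illustrative dataset path
    split="train",
    converter="hf_text",
    nlp=nlp,
    load_kwargs={"streaming": True},
)
docs = stream.map_pipeline(nlp)
```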

## Reading Hugging Face datasets {: #edsnlp.data.huggingface_dataset.from_huggingface_dataset }

::: edsnlp.data.huggingface_dataset.from_huggingface_dataset
    options:
      heading_level: 3
      show_source: false
      show_toc: false
      show_bases: false

## Writing Hugging Face datasets {: #edsnlp.data.huggingface_dataset.to_huggingface_dataset }

::: edsnlp.data.huggingface_dataset.to_huggingface_dataset
    options:
      heading_level: 3
      show_source: false
      show_toc: false
      show_bases: false
3 changes: 3 additions & 0 deletions docs/data/index.md
@@ -43,6 +43,7 @@ At the moment, we support the following data sources:
| [Pandas](./pandas) | Pandas DataFrame objects |
| [Polars](./polars) | Polars DataFrame objects |
| [Spark](./spark) | Spark DataFrame objects |
| [Hugging Face datasets](./huggingface) | Datasets from the Hugging Face Hub |

and the following schemas:

@@ -51,5 +52,7 @@ and the following schemas:
| [Custom](./converters/#custom) | `converter=custom_fn` |
| [OMOP](./converters/#omop) | `converter="omop"` |
| [Standoff](./converters/#standoff) | `converter="standoff"` |
| [HF Text](./converters/#hf_text) | `converter="hf_text"` |
| [HF NER](./converters/#hf_ner) | `converter="hf_ner"` |
| [Ents](./converters/#edsnlp.data.converters.EntsDoc2DictConverter) | `converter="ents"` |
| [Markup](./converters/#edsnlp.data.converters.MarkupToDocConverter) | `converter="markup"` |
3 changes: 3 additions & 0 deletions edsnlp/core/stream.py
@@ -1087,6 +1087,9 @@ def __repr__(self):
)

if TYPE_CHECKING:
    from edsnlp.data import (
        to_huggingface_dataset as to_huggingface_dataset,  # noqa: F401
    )
    from edsnlp.data import to_iterable as to_iterable  # noqa: F401
    from edsnlp.data import to_pandas as to_pandas  # noqa: F401
    from edsnlp.data import to_polars as to_polars  # noqa: F401
1 change: 1 addition & 0 deletions edsnlp/data/__init__.py
@@ -13,4 +13,5 @@
from .spark import from_spark, to_spark
from .pandas import from_pandas, to_pandas
from .polars import from_polars, to_polars
from .huggingface_dataset import from_huggingface_dataset, to_huggingface_dataset
from .converters import get_dict2doc_converter, get_doc2dict_converter