1 change: 1 addition & 0 deletions changelog.md
@@ -11,6 +11,7 @@
### Added

- Support passing multiple paths to `edsnlp.read_parquet`
- New `edsnlp.data.from_huggingface_dataset()` and `edsnlp.data.to_huggingface_dataset()` data connectors, with corresponding `hf_ner` and `hf_text` converters for NER datasets and plain text datasets, respectively.
- New `eds.llm_span_qualifier` component to perform span classification tasks using LLMs

### Changed
36 changes: 36 additions & 0 deletions docs/data/converters.md
@@ -219,6 +219,42 @@ one per entity, that can be used to write to a dataframe. The schema of each pro
heading_level: 4
show_source: false

## Hugging Face Text (`converter="hf_text"`) {: #hf_text }

The `hf_text` converter is designed for simple text datasets from the Hugging Face Hub, where each example contains a single text field (e.g., IMDB, Wikipedia). It tokenizes the raw text and optionally extracts a document ID.
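To make the expected input shape concrete, here is a minimal sketch of the kind of record `hf_text` consumes: one text column (default: `text`) plus an optional id column. The `as_hf_text_record` helper and the `review_id` column are hypothetical, for illustration only:

```python
def as_hf_text_record(example, text_column="text", id_column=None):
    """Pick the text (and optional doc id) out of a raw HF example."""
    record = {"text": example[text_column]}
    if id_column is not None and id_column in example:
        record["id"] = example[id_column]
    return record


example = {"text": "This movie was great.", "review_id": "imdb-42"}
print(as_hf_text_record(example, id_column="review_id"))
# {'text': 'This movie was great.', 'id': 'imdb-42'}
```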

### Converting HF Text data to Doc objects {: #edsnlp.data.converters.HfTextDict2DocConverter }

::: edsnlp.data.converters.HfTextDict2DocConverter
    options:
      heading_level: 4
      show_source: false

### Converting Doc objects to HF Text data {: #edsnlp.data.converters.HfTextDoc2DictConverter }

::: edsnlp.data.converters.HfTextDoc2DictConverter
    options:
      heading_level: 4
      show_source: false

## Hugging Face NER (`converter="hf_ner"`) {: #hf_ner }

The `hf_ner` converter is designed for token-level named entity recognition (NER) datasets from the Hugging Face Hub (e.g., CoNLL-2003, WikiNER). It expects datasets with token lists and corresponding NER tags, and converts them to `Doc` objects with entities extracted using a flexible tagging scheme that supports BIO, BILOU, and other common NER formats.
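To make the tagging scheme concrete, here is a self-contained sketch of how integer tags decode into entity spans under a BIO scheme, given a `tag_order` list. The `decode_bio` helper is purely illustrative, not EDS-NLP's actual implementation:

```python
def decode_bio(tag_ids, tag_order):
    """Decode integer BIO tags into (start, end, label) token spans."""
    entities = []
    start = label = None
    for i, tag_id in enumerate(tag_ids):
        tag = tag_order[tag_id]
        # An entity continues only on "I-" with the same label
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues and label is not None:
            entities.append((start, i, label))
            start = label = None
        # "B-" always opens an entity; a dangling "I-" opens one leniently
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    if label is not None:
        entities.append((start, len(tag_ids), label))
    return entities


tag_order = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
# Tokens: ["John", "Smith", "lives", "in", "Paris"]
print(decode_bio([1, 2, 0, 0, 3], tag_order))
# [(0, 2, 'PER'), (4, 5, 'LOC')]
```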

### Converting HF NER data to Doc objects {: #edsnlp.data.converters.HfNerDict2DocConverter }

::: edsnlp.data.converters.HfNerDict2DocConverter
    options:
      heading_level: 4
      show_source: false

### Converting Doc objects to HF NER data {: #edsnlp.data.converters.HfNerDoc2DictConverter }

::: edsnlp.data.converters.HfNerDoc2DictConverter
    options:
      heading_level: 4
      show_source: false

## Markup (`converter="markup"`) {: #edsnlp.data.converters.MarkupToDocConverter }

This converter is used to convert markup data, such as Markdown or XML, into documents.
67 changes: 67 additions & 0 deletions docs/data/huggingface.md
@@ -0,0 +1,67 @@
# Hugging Face datasets

??? abstract "TLDR"

    ```{ .python .no-check }
    import edsnlp

    nlp = edsnlp.blank("eds")

    # Read from the Hub (streaming) and convert to Docs
    stream = edsnlp.data.from_huggingface_dataset(
        "lhoestq/conll2003",
        split="train",
        converter="hf_ner",
        tag_order=[
            "O",
            "B-PER",
            "I-PER",
            "B-ORG",
            "I-ORG",
            "B-LOC",
            "I-LOC",
            "B-MISC",
            "I-MISC",
        ],
        nlp=nlp,
        load_kwargs={"streaming": True},
    )

    # Optionally process the documents with the pipeline
    stream = stream.map_pipeline(nlp)

    # Export back to a HF IterableDataset of dicts
    hf_iter = edsnlp.data.to_huggingface_dataset(
        stream,
        converter="hf_ner",
        words_column="tokens",
        ner_tags_column="ner_tags",
    )
    ```

Use the Hugging Face Datasets ecosystem as a data source or sink for EDS-NLP pipelines. You can read datasets from the Hub or reuse already loaded `datasets.Dataset` / `datasets.IterableDataset` objects, optionally shuffle them deterministically, loop over them, and map them through any pipeline before writing them back as an `IterableDataset`.

We rely on the `datasets` package. Install it with `pip install datasets` or `pip install "edsnlp[ml]"`.

Typical converters:

- `hf_ner`: expects token and tag columns (defaults: `tokens`, `ner_tags`) and produces Docs with entities. Compatible with BILOU/IOB schemes through `tag_order` or `tag_map`.
- `hf_text`: expects a single text column (default: `text`) and produces plain Docs; optional `id_column` is inferred when present.

When loading a dataset dictionary with multiple splits, pass an explicit `split` (e.g. `"train"`). You can also select a configuration/subset via `name` and forward any `datasets.load_dataset` arguments through `load_kwargs` (e.g. `{"streaming": True}`).
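For instance, the following sketch streams the train split of a plain text dataset (the dataset path is illustrative):

```{ .python .no-check }
import edsnlp

nlp = edsnlp.blank("eds")

# `name` would select a configuration/subset; `load_kwargs` is
# forwarded to `datasets.load_dataset`, here to enable streaming
stream = edsnlp.data.from_huggingface_dataset(
    "stanfordnlp/imdb",  # illustrative dataset path
    split="train",
    converter="hf_text",
    nlp=nlp,
    load_kwargs={"streaming": True},
)
docs = stream.map_pipeline(nlp)
```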

## Reading Hugging Face datasets {: #edsnlp.data.huggingface_dataset.from_huggingface_dataset }

::: edsnlp.data.huggingface_dataset.from_huggingface_dataset
    options:
      heading_level: 3
      show_source: false
      show_toc: false
      show_bases: false

## Writing Hugging Face datasets {: #edsnlp.data.huggingface_dataset.to_huggingface_dataset }

::: edsnlp.data.huggingface_dataset.to_huggingface_dataset
    options:
      heading_level: 3
      show_source: false
      show_toc: false
      show_bases: false
3 changes: 3 additions & 0 deletions docs/data/index.md
@@ -43,6 +43,7 @@ At the moment, we support the following data sources:
| [Pandas](./pandas) | Pandas DataFrame objects |
| [Polars](./polars) | Polars DataFrame objects |
| [Spark](./spark) | Spark DataFrame objects |
| [Hugging Face datasets](./huggingface) | Datasets from the Hugging Face Hub |

and the following schemas:

@@ -51,5 +52,7 @@ and the following schemas:
| [Custom](./converters/#custom) | `converter=custom_fn` |
| [OMOP](./converters/#omop) | `converter="omop"` |
| [Standoff](./converters/#standoff) | `converter="standoff"` |
| [HF Text](./converters/#hf_text) | `converter="hf_text"` |
| [HF NER](./converters/#hf_ner) | `converter="hf_ner"` |
| [Ents](./converters/#edsnlp.data.converters.EntsDoc2DictConverter) | `converter="ents"` |
| [Markup](./converters/#edsnlp.data.converters.MarkupToDocConverter) | `converter="markup"` |
3 changes: 3 additions & 0 deletions edsnlp/core/stream.py
@@ -1087,6 +1087,9 @@ def __repr__(self):
)

if TYPE_CHECKING:
    from edsnlp.data import (
        to_huggingface_dataset as to_huggingface_dataset,  # noqa: F401
    )
    from edsnlp.data import to_iterable as to_iterable  # noqa: F401
    from edsnlp.data import to_pandas as to_pandas  # noqa: F401
    from edsnlp.data import to_polars as to_polars  # noqa: F401
1 change: 1 addition & 0 deletions edsnlp/data/__init__.py
@@ -13,4 +13,5 @@
from .spark import from_spark, to_spark
from .pandas import from_pandas, to_pandas
from .polars import from_polars, to_polars
from .huggingface_dataset import from_huggingface_dataset, to_huggingface_dataset
from .converters import get_dict2doc_converter, get_doc2dict_converter