Commit 03df97f

cau-git, dolfim-ibm, ceberam, and vagenas authored
feat!: Expose DoclingDocument as main type, move old typing to legacy (#41)
* Fix area method of BoundingBox
  Signed-off-by: Christoph Auer <[email protected]>
* add image placeholder
  Signed-off-by: Michele Dolfi <[email protected]>
* enable picture label
  Signed-off-by: Michele Dolfi <[email protected]>
* refactor captions and markdown
  Signed-off-by: Michele Dolfi <[email protected]>
* add logic to skip repeated caption
  Signed-off-by: Michele Dolfi <[email protected]>
* use DocItemLabel
  Signed-off-by: Michele Dolfi <[email protected]>
* Extend default export labels, add convenience methods
  Signed-off-by: Christoph Auer <[email protected]>
* Introduce ListItem API, with marker and enumerated properties
  Signed-off-by: Christoph Auer <[email protected]>
* add classification and description in PictureData
  Signed-off-by: Michele Dolfi <[email protected]>
* add molecule picture data
  Signed-off-by: Michele Dolfi <[email protected]>
* Fixes for DoclingDocument and aligned methods on legacy doc
  Signed-off-by: Christoph Auer <[email protected]>
* add advanced picture data content
  Signed-off-by: Michele Dolfi <[email protected]>
* Many markdown export fixes, renaming BaseTableData
  Signed-off-by: Christoph Auer <[email protected]>
* Rename module paths doc->legacy_doc, experimental->doc
  Signed-off-by: Christoph Auer <[email protected]>
* feat: imageref with pil_image
  Signed-off-by: Michele Dolfi <[email protected]>
* Small fixes
  Signed-off-by: Christoph Auer <[email protected]>
* docs: remove documentation in markdown to support python 3.13 (#43)
  Since json-schema-for-humans dependency does not support python 3.13, remove the generation of documentation in markdown of main docling types.
  Remove 'ds' prefix from documentation scripts.
  Update README.
  Add python 3.13 in CI/CD workflow checks.
  Signed-off-by: Cesar Berrospi Ramis <[email protected]>
* Fix TableCell model validator
  Signed-off-by: Christoph Auer <[email protected]>
* store list of classes in classification
  Signed-off-by: Michele Dolfi <[email protected]>
* Fixes for DocumentOrigin mimetype validation
  Signed-off-by: Christoph Auer <[email protected]>
* introduce picturedata as list of annotations
  Signed-off-by: Michele Dolfi <[email protected]>
* feat: adapt hierarchical chunker to v2 DoclingDocument [skip-ci]
  Signed-off-by: Panos Vagenas <[email protected]>
* feat: add table support in chunker, incl. captions
  Signed-off-by: Panos Vagenas <[email protected]>
* use Field constraints instead of conlist, refactor chunking types
  Signed-off-by: Panos Vagenas <[email protected]>
* revert unnecessary doc module change
  Signed-off-by: Panos Vagenas <[email protected]>
* align test data with upstream changes
  Signed-off-by: Panos Vagenas <[email protected]>
* Update __init__.py on docling_core.types.doc
  Signed-off-by: Christoph Auer <[email protected]>
* Remove DescriptionItem
  Signed-off-by: Christoph Auer <[email protected]>

---------

Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
1 parent 3194f56 commit 03df97f

87 files changed: +28617 -16884 lines changed

.github/workflows/checks.yml

+1 -1
@@ -6,7 +6,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ['3.9', '3.10', '3.11', '3.12']
+        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
     steps:
       - uses: actions/checkout@v3
       - uses: ./.github/actions/setup-poetry

.pre-commit-config.yaml

+1 -1
@@ -52,7 +52,7 @@ repos:
     hooks:
       - id: docs
         name: Docs
-        entry: poetry run ds_generate_docs docs
+        entry: poetry run generate_docs docs
         pass_filenames: false
         language: system
         files: '\.py$'

README.md

+8 -8
@@ -1,7 +1,7 @@
 # Docling Core
 
 [![PyPI version](https://img.shields.io/pypi/v/docling-core)](https://pypi.org/project/docling-core/)
-![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)
+![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%20%203.11%20%7C%203.12%20%7C%203.13-blue)
 [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
@@ -21,7 +21,7 @@ pip install docling-core
 
 ### Development setup
 
-To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 and Poetry. You can then install from your local clone's root dir:
 ```bash
 poetry install
 ```
@@ -45,14 +45,14 @@ poetry run pytest test
   Document.model_validate_json(data_str)
   ```
 
-- You can generate the JSON schema of a model with the script `ds_generate_jsonschema`.
+- You can generate the JSON schema of a model with the script `generate_jsonschema`.
 
   ```py
   # for the `Document` type
-  ds_generate_jsonschema Document
+  generate_jsonschema Document
 
   # for the use `Record` type
-  ds_generate_jsonschema Record
+  generate_jsonschema Record
   ```
 
 ## Documentation
@@ -61,12 +61,12 @@ Docling supports 3 main data types:
 
 - **Document** for publications like books, articles, reports, or patents. When Docling converts an unstructured PDF document, the generated JSON follows this schema.
   The Document type also models the metadata that may be attached to the converted document.
-  Check [Document](docs/Document.md) for the full JSON schema.
+  Check [Document](docs/Document.json) for the full JSON schema.
 - **Record** for structured database records, centered on an entity or _subject_ that is provided with a list of attributes.
   Related to records, the statements can represent annotations on text by Natural Language Processing (NLP) tools.
-  Check [Record](docs/Record.md) for the full JSON schema.
+  Check [Record](docs/Record.json) for the full JSON schema.
 - **Generic** for any data representation, ensuring minimal configuration and maximum flexibility.
-  Check [Generic](docs/Generic.md) for the full JSON schema.
+  Check [Generic](docs/Generic.json) for the full JSON schema.
 
 The data schemas are defined using [pydantic](https://pydantic-docs.helpmanual.io/) models, which provide built-in processes to support the creation of data that adhere to those models.
 
docling_core/transforms/chunker/__init__.py

+2 -8
@@ -5,11 +5,5 @@
 
 """Define the chunker types."""
 
-from docling_core.transforms.chunker.base import (  # noqa
-    BaseChunker,
-    Chunk,
-    ChunkWithMetadata,
-)
-from docling_core.transforms.chunker.hierarchical_chunker import (  # noqa
-    HierarchicalChunker,
-)
+from docling_core.transforms.chunker.base import BaseChunk, BaseChunker, BaseMeta
+from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker

docling_core/transforms/chunker/base.py

+27 -40
@@ -4,71 +4,58 @@
 #
 
 """Define base classes for chunking."""
-import re
 from abc import ABC, abstractmethod
-from typing import Final, Iterator, Optional
+from typing import Any, ClassVar, Iterator
 
-from pydantic import BaseModel, Field, field_validator
+from pydantic import BaseModel
 
-from docling_core.types import BoundingBox, Document
-from docling_core.types.base import _JSON_POINTER_REGEX
+from docling_core.types.doc import DoclingDocument as DLDocument
 
-# (subset of) JSONPath format, e.g. "$.main-text[84]" (for migration purposes)
-_DEPRECATED_JSON_PATH_PATTERN: Final = re.compile(r"^\$\.([\w-]+)\[(\d+)\]$")
 
+class BaseMeta(BaseModel):
+    """Metadata base class."""
 
-def _create_path(pos: int, path_prefix: str = "main-text") -> str:
-    return f"#/{path_prefix}/{pos}"
+    excluded_embed: ClassVar[list[str]] = []
+    excluded_llm: ClassVar[list[str]] = []
 
+    def export_json_dict(self) -> dict[str, Any]:
+        """Helper method for exporting non-None keys to JSON mode.
 
-class Chunk(BaseModel):
-    """Data model for Chunk."""
+        Returns:
+            dict[str, Any]: The exported dictionary.
+        """
+        return self.model_dump(mode="json", by_alias=True, exclude_none=True)
 
-    path: str = Field(pattern=_JSON_POINTER_REGEX)
-    text: str
-    heading: Optional[str] = None
 
-    @field_validator("path", mode="before")
-    @classmethod
-    def _json_pointer_from_json_path(cls, path: str):
-        if (match := _DEPRECATED_JSON_PATH_PATTERN.match(path)) is not None:
-            groups = match.groups()
-            if len(groups) == 2 and groups[0] is not None and groups[1] is not None:
-                return _create_path(
-                    pos=int(groups[1]),
-                    path_prefix=groups[0],
-                )
-        return path
+class BaseChunk(BaseModel):
+    """Chunk base class."""
 
+    text: str
+    meta: BaseMeta
 
-class ChunkWithMetadata(Chunk):
-    """Data model for Chunk including metadata."""
+    def export_json_dict(self) -> dict[str, Any]:
+        """Helper method for exporting non-None keys to JSON mode.
 
-    page: Optional[int] = None
-    bbox: Optional[BoundingBox] = None
+        Returns:
+            dict[str, Any]: The exported dictionary.
+        """
+        return self.model_dump(mode="json", by_alias=True, exclude_none=True)
 
 
 class BaseChunker(BaseModel, ABC):
-    """Base class for Chunker."""
+    """Chunker base class."""
 
     @abstractmethod
-    def chunk(self, dl_doc: Document, **kwargs) -> Iterator[Chunk]:
+    def chunk(self, dl_doc: DLDocument, **kwargs) -> Iterator[BaseChunk]:
         """Chunk the provided document.
 
         Args:
-            dl_doc (Document): document to chunk
+            dl_doc (DLDocument): document to chunk
 
         Raises:
             NotImplementedError: in this abstract implementation
 
         Yields:
-            Iterator[Chunk]: iterator over extracted chunks
+            Iterator[BaseChunk]: iterator over extracted chunks
         """
         raise NotImplementedError()
-
-    @classmethod
-    def _create_path(cls, pos: int, path_prefix: str = "main-text") -> str:
-        return _create_path(
-            pos=pos,
-            path_prefix=path_prefix,
-        )
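
Note for callers that relied on the removed `Chunk.path` validator: the diff above drops the conversion from the deprecated JSONPath form (e.g. `$.main-text[84]`) to a JSON Pointer. The following is a minimal standalone sketch of that conversion, using only the standard library. The regex and the pointer format are copied from the removed lines; the public function names `create_path` and `json_pointer_from_json_path` are illustrative stand-ins for the removed private helpers and are not part of the docling-core API.

```python
import re
from typing import Final

# Subset of JSONPath accepted by the removed validator, e.g. "$.main-text[84]".
# Pattern copied from the removed lines of base.py.
_DEPRECATED_JSON_PATH_PATTERN: Final = re.compile(r"^\$\.([\w-]+)\[(\d+)\]$")


def create_path(pos: int, path_prefix: str = "main-text") -> str:
    """Build a JSON Pointer fragment such as "#/main-text/84" (hypothetical helper name)."""
    return f"#/{path_prefix}/{pos}"


def json_pointer_from_json_path(path: str) -> str:
    """Convert a deprecated JSONPath string to a JSON Pointer.

    Strings that do not match the deprecated pattern are returned unchanged,
    mirroring the fall-through behavior of the removed validator.
    """
    match = _DEPRECATED_JSON_PATH_PATTERN.match(path)
    if match is None:
        return path
    prefix, pos = match.groups()
    return create_path(pos=int(pos), path_prefix=prefix)


if __name__ == "__main__":
    print(json_pointer_from_json_path("$.main-text[84]"))
    print(json_pointer_from_json_path("#/tables/3"))
```

Whether downstream code still needs this depends on whether it stores the old path strings; the new `BaseChunk` in this commit carries `text` and `meta` only.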
