0.2.1 doc updates (#5)
* Added new github workflow for docs

* Added examples to doc strings

* version bump

* Change log updates

* spelling

* Updated pytesseract doc string examples

* Fixed failed test
SamEdwardes authored Jan 9, 2022
1 parent ebed5b9 commit f995ea1
Showing 10 changed files with 120 additions and 14 deletions.
18 changes: 18 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,18 @@
name: ci
on:
  push:
    branches:
      - main
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: 3.x
      - run: pip install wheel
      - run: pip install mkdocs-material
      - run: pip install mkdocstrings
      - run: pip install mkdocs-include-markdown-plugin
      - run: mkdocs gh-deploy --force
6 changes: 3 additions & 3 deletions docs/api/spacypdfreader.parsers.md
@@ -1,7 +1,7 @@
# spacypdfreader.parsers

::: spacypdfreader.parsers.base
::: spacypdfreader.parsers.base.BaseParser

::: spacypdfreader.parsers.pdfminer
::: spacypdfreader.parsers.pdfminer.PdfminerParser

::: spacypdfreader.parsers.pytesseract
::: spacypdfreader.parsers.pytesseract.PytesseractParser
5 changes: 5 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,10 @@
# Changelog

## 0.2.1 (2022-01-09)

- Added examples to the API docs.
- Added continuous deployment for GitHub pages.

## 0.2.0 (2021-12-10)

- Added support for additional pdf to text extraction engines:
21 changes: 21 additions & 0 deletions mkdocs.yml
@@ -44,8 +44,29 @@ plugins:
        - spacypdfreader
      handlers:
        python:
          setup_commands:
            - import sys
            - from unittest.mock import MagicMock as mock
            - sys.modules["spacy"] = mock()
            - sys.modules["spacy.tokens"] = mock()
            - sys.modules["rich.console"] = mock()
            - sys.modules["rich.progress"] = mock()
            - sys.modules["pdfminer"] = mock()
            - sys.modules["pdfminer.pdfparser"] = mock()
            - sys.modules["pdfminer.pdfpage"] = mock()
            - sys.modules["pdfminer.pdfdocument"] = mock()
            - sys.modules["pdfminer.high_level"] = mock()
            - sys.modules["pytesseract"] = mock()
            - sys.modules["PIL"] = mock()
            - sys.modules["pdf2image"] = mock()
          selection:
            filters:
              - "!^_" # exclude all members starting with _
              - "^__init__$" # but always include __init__ modules and methods
          rendering:
            show_category_heading: false
            show_root_heading: true
            show_signature_annotations: true

extra:
social:
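
Note on the `setup_commands` above: the docs workflow installs only the MkDocs tooling, so the heavy dependencies (spacy, pdfminer, pytesseract, and so on) are replaced with `MagicMock` objects before the package is imported. The sketch below illustrates the general mechanism only. The module name `not_installed_pkg_for_demo` is a hypothetical placeholder, assumed not to be installed, and the snippet is not part of this commit.

```python
import importlib
import sys
from unittest.mock import MagicMock

# Hypothetical package name, assumed NOT to be installed in this environment.
missing = "not_installed_pkg_for_demo"

# Without a stub, importing the package fails.
try:
    importlib.import_module(missing)
    import_failed = False
except ImportError:
    import_failed = True

# Register stand-ins in sys.modules. Submodules need their own entries
# because the import system looks each one up by its full dotted name.
sys.modules[missing] = MagicMock()
sys.modules[missing + ".tokens"] = MagicMock()

# Both imports now succeed and return the registered mocks.
stub = importlib.import_module(missing)
sub = importlib.import_module(missing + ".tokens")

# Attribute access on a MagicMock yields another MagicMock, so names such
# as `Doc` used in type annotations resolve without the real library.
doc_cls = sub.Doc
is_mock = isinstance(doc_cls, MagicMock)
```

Because each dotted path is looked up separately, the configuration lists entries such as `pdfminer.pdfparser` and `pdfminer.pdfpage` individually rather than only the top-level `pdfminer`.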
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "spacypdfreader"
version = "0.2.0"
version = "0.2.1"
description = "A PDF to text extraction pipeline component for spaCy."
authors = ["SamEdwardes <[email protected]>"]
license = "MIT"
2 changes: 1 addition & 1 deletion spacypdfreader/__init__.py
@@ -1,3 +1,3 @@
from spacypdfreader.spacypdfreader import pdf_reader

__version__ = "0.2.0"
__version__ = "0.2.1"
4 changes: 2 additions & 2 deletions spacypdfreader/parsers/base.py
@@ -2,8 +2,8 @@ class BaseParser:
"""The base parser class.
The `BaseParser` is used to extend spacypdfreader with additional PDF to
text parsers. See [Parsers](/parsers) in the documentation for additional
details.
text parsers. See the [Parsers](/parsers) section in the documentation for
additional details.
Attributes:
name: A string name representation of the class. Will only be used for
41 changes: 38 additions & 3 deletions spacypdfreader/parsers/pdfminer.py
@@ -11,9 +11,44 @@ class PdfminerParser(BaseParser):
pdfminer is relatively fast, but has lower accuracy than other parsers such as
[pytesseract](/parsers/#pytesseract).
See the [pdfminer section](/parsers/#pdfminer) in the docs for more
details. For more details on pdfminer see the
[pdfminer docs](https://pdfminersix.readthedocs.io/en/latest/).
Refer to [spacypdfreader.parsers.base.BaseParser][] for a list of attributes
and the `__init__` method.
Examples:
`PdfminerParser` is the default PDF to text parser and will be
automatically used unless otherwise specified.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
To be more explicit import `PdfminerParser` and pass it into the
`pdf_reader` function.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pdfminer import PdfminerParser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PdfminerParser)
For more fine tuning you can pass in additional parameters to pdfminer.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pdfminer import PdfminerParser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> params = {"caching": False}
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PdfminerParser, **params)
Info:
See the [pdfminer section](/parsers/#pdfminer) in the docs for more
details on the implementation of pdfminer. For more details on pdfminer
refer to the
[pdfminer docs](https://pdfminersix.readthedocs.io/en/latest/).
"""

name: str = "pdfminer"
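
Note on the `**params` example in the docstring above: extra keyword arguments given to the reader are handed on to the parser backend. The sketch below shows that forwarding pattern in isolation. `FakeParser` and `read_pages` are hypothetical names invented for this illustration; they are not spacypdfreader's API, and the real implementation may differ.

```python
from typing import Any, List, Type


class FakeParser:
    """Hypothetical stand-in for a parser class; not part of spacypdfreader."""

    def __init__(self, pdf_path: str, page_number: int) -> None:
        self.pdf_path = pdf_path
        self.page_number = page_number

    def pdf_to_text(self, **kwargs: Any) -> str:
        # A real parser would pass kwargs to its backend (e.g. pdfminer).
        # Here they are echoed back so the forwarding is visible.
        options = ", ".join(f"{k}={v!r}" for k, v in sorted(kwargs.items()))
        return f"page {self.page_number} of {self.pdf_path} [{options}]"


def read_pages(
    pdf_path: str,
    n_pages: int,
    parser_cls: Type[FakeParser] = FakeParser,
    **kwargs: Any,
) -> List[str]:
    """Hypothetical reader: one parser per page, kwargs forwarded unchanged."""
    return [
        parser_cls(pdf_path, page).pdf_to_text(**kwargs)
        for page in range(1, n_pages + 1)
    ]


params = {"caching": False}
pages = read_pages("example.pdf", 2, FakeParser, **params)
```

Unpacking a dict with `**params` keeps the reader's own signature small while still letting callers reach backend-specific options.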
33 changes: 30 additions & 3 deletions spacypdfreader/parsers/pytesseract.py
@@ -19,9 +19,36 @@ class PytesseractParser(BaseParser):
the image to extract the text. pytesseract results in the best quality but
can be very slow compared to other parsers.
See the [pytesseract section](/parsers/#pytesseract) in the docs for more
details. For more details on pytesseract see the
[pytesseract docs](https://github.com/madmaze/pytesseract).
Refer to [spacypdfreader.parsers.base.BaseParser][] for a list of attributes
and the `__init__` method.
Examples:
To use `PytesseractParser` it must be explicitly imported and passed
into the `pdf_reader` function.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pytesseract import PytesseractParser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PytesseractParser)
For more fine tuning you can pass in additional parameters to
pytesseract.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pytesseract import PytesseractParser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> params = {"nice": 1}
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PytesseractParser, **params)
Info:
See the [pytesseract section](/parsers/#pytesseract) in the docs for
more details on the implementation of pytesseract. For more details on
pytesseract see the
[pytesseract docs](https://github.com/madmaze/pytesseract).
"""

name: str = "pytesseract"
2 changes: 1 addition & 1 deletion tests/test_spacypdfreader.py
@@ -21,7 +21,7 @@ def pdf_assertions(doc: spacy.tokens.Doc):


def test_version():
assert __version__ == "0.2.0"
assert __version__ == "0.2.1"


def test_get_number_of_pages():