Comparing changes

base repository: sajari/docconv
base: v1.3.5
...
head repository: sajari/docconv
compare: master

Commits on Sep 20, 2022

  1. docd/appengine: use 1.3.5 release (#131)

    Contains the ca-certificates so error reporting can contact its API.
    jonathaningram authored Sep 20, 2022
    ab693bd

Commits on Mar 6, 2023

  1. e0a7210
  2. readme: update code.sajari.com/docconv/docd installation instructions (#132)
    
    ---------
    
    Co-authored-by: Jonathan Ingram <jon.ingram@algolia.com>
    rhaist and jonathaningram authored Mar 6, 2023
    8c168b3

Commits on Sep 13, 2023

  1. 603e780
  2. doc: channel response before return (#134)

    Co-authored-by: Andrea Mugellini <a.mugellini@theswarm.co>
    pixge and Andrea Mugellini authored Sep 13, 2023
    a3b1ff2
  3. a000b5d
  4. 6446419
  5. docd: bump to Go 1.21 and debian:12 (#145)

    Also show Go version in startup logs.
    jonathaningram authored Sep 13, 2023
    48caff1
  6. 1937742

Commits on Oct 30, 2023

  1. bad7570
  2. 07e9902
  3. e9e59ef

Commits on Oct 31, 2023

  1. v2: initial commit (#153)

    - Some methods have been changed to return an error as their last
      argument
    - Log calls inside various functions have been removed
    - Use a v1 tag if you need the previous signature
    jonathaningram authored Oct 31, 2023
    93312f4
  2. go.mod: bump golang.org/x/net from 0.7.0 to 0.17.0 (#154)

    Bumps [golang.org/x/net](https://github.com/golang/net) from 0.7.0 to 0.17.0.
    - [Commits](golang/net@v0.7.0...v0.17.0)
    
    ---
    updated-dependencies:
    - dependency-name: golang.org/x/net
      dependency-type: direct:production
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Oct 31, 2023
    316980e
  3. go.mod: bump google.golang.org/grpc from 1.44.0 to 1.56.3 (#152)

    Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.44.0 to 1.56.3.
    - [Release notes](https://github.com/grpc/grpc-go/releases)
    - [Commits](grpc/grpc-go@v1.44.0...v1.56.3)
    
    ---
    updated-dependencies:
    - dependency-name: google.golang.org/grpc
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Oct 31, 2023
    9cc631a

Commits on Nov 20, 2023

  1. docd: don't panic on broken pipe errors (#155)

    Example log entry that is panicking:
    
    > could not write to response (failed after 3965 bytes): write tcp 172.17.0.5:8080->172.17.0.4:61528: write: broken pipe
    jonathaningram authored Nov 20, 2023
    0865213
  2. go.mod: use github.com/sajari/msoleps@race-fix (#156)

    See [my race fix PR]. Saw the issue in our logs, so using our fork for
    now.
    
    [my race fix PR]: richardlehane/msoleps#6
    jonathaningram authored Nov 20, 2023
    8b59f16
  3. 87d9a67
  4. 7d3f204
  5. 36fd47d
  6. 232da35

Commits on Jan 18, 2024

  1. Merge pull request from GHSA-3qm5-5hmp-8c6w

    * Limit the size of the content type definition in OOXML
    
    * Improve naming in OOXML tests
    
    * Check OOXML error type in tests
    jupenur authored Jan 18, 2024
    6e3f875

Commits on Mar 3, 2024

  1. 785a29a
Showing with 645 additions and 708 deletions.
  1. +13 −5 .github/workflows/docd.yml
  2. +2 −2 .github/workflows/go.yml
  3. +46 −34 README.md
  4. +1 −1 client/cmd/docconv-client/main.go
  5. +10 −15 doc.go
  6. +91 −0 doc_test.go
  7. +2 −3 docconv.go
  8. +9 −9 docd/Dockerfile
  9. +2 −2 docd/appengine/Dockerfile
  10. +4 −4 docd/appengine/app.yaml
  11. +16 −11 docd/convert.go
  12. +25 −0 docd/internal/cloudtrace/context.go
  13. +16 −0 docd/internal/cloudtrace/header.go
  14. +14 −0 docd/internal/cloudtrace/http_handler.go
  15. +72 −0 docd/internal/cloudtrace/logger.go
  16. +15 −0 docd/internal/debug/context.go
  17. +22 −0 docd/internal/debug/http_handler.go
  18. +33 −0 docd/internal/debug/logger.go
  19. +5 −5 docd/internal/recovery.go
  20. +51 −14 docd/main.go
  21. +2 −3 docx.go
  22. +17 −1 docx_test/docx_test.go
  23. BIN docx_test/testdata/decompression_size_limit.docx
  24. +32 −11 go.mod
  25. +50 −491 go.sum
  26. +16 −14 html.go
  27. +3 −10 html_appengine.go
  28. +3 −3 html_test/html_test.go
  29. +5 −5 iWork/TSPArchiveMessages.pb.go
  30. +5 −5 iWork/TSPDatabaseMessages.pb.go
  31. +5 −4 iWork/TSPMessages.pb.go
  32. +1 −1 iWork/pb-schema/TSPArchiveMessages.proto
  33. +1 −2 iWork/pb-schema/TSPDatabaseMessages.proto
  34. +1 −2 iWork/pb-schema/TSPMessages.proto
  35. +1 −1 image.go
  36. +1 −1 image_ocr.go
  37. +1 −1 image_ocr_test.go
  38. +1 −2 local.go
  39. +1 −2 odt.go
  40. +4 −5 pages.go
  41. +1 −2 pdf.go
  42. +15 −22 pdf_ocr.go
  43. +5 −2 pdf_ocr_test.go
  44. +1 −2 pptx.go
  45. +16 −1 pptx_test/pptx_test.go
  46. BIN pptx_test/testdata/decompression_size_limit.pptx
  47. +3 −3 rtf_test/rtf_test.go
  48. +4 −5 snappy/snappy_test.go
  49. BIN testdata/001-test.doc
  50. +1 −2 tidy.go
18 changes: 13 additions & 5 deletions .github/workflows/docd.yml
@@ -10,27 +10,35 @@ jobs:
name: Publish Docker image
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
- name: Checkout
uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- id: meta
uses: docker/metadata-action@v3
uses: docker/metadata-action@v5
with:
images: sajari/docd
labels: |
org.opencontainers.image.description=A tool which exposes code.sajari.com/docconv as a service
org.opencontainers.image.description=A tool which exposes code.sajari.com/docconv/v2 as a service
org.opencontainers.image.title=docd
tags: |
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
type=semver,pattern={{major}}
type=sha,format=long
- uses: docker/build-push-action@v2
- name: Build and push
uses: docker/build-push-action@v5
with:
context: .
file: docd/Dockerfile
platforms: linux/amd64,linux/arm64
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
push: true
4 changes: 2 additions & 2 deletions .github/workflows/go.yml
@@ -11,12 +11,12 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
go: ["1.19", "1.16", "1.15", "1.14"]
go: ["1.21"]
steps:
- uses: actions/checkout@v2

- name: Install dependencies
run: sudo apt install unrtf tidy
run: sudo apt install wv unrtf tidy

- name: Set up Go ${{ matrix.go }}
uses: actions/setup-go@v2
80 changes: 46 additions & 34 deletions README.md
@@ -1,44 +1,61 @@
# docconv

[![Go reference](https://pkg.go.dev/badge/code.sajari.com/docconv.svg)](https://pkg.go.dev/code.sajari.com/docconv)
[![Go reference](https://pkg.go.dev/badge/code.sajari.com/docconv/v2.svg)](https://pkg.go.dev/code.sajari.com/docconv/v2)
[![Build status](https://github.com/sajari/docconv/workflows/Go/badge.svg?branch=master)](https://github.com/sajari/docconv/actions)
[![Report card](https://goreportcard.com/badge/code.sajari.com/docconv)](https://goreportcard.com/report/code.sajari.com/docconv)
[![Sourcegraph](https://sourcegraph.com/github.com/sajari/docconv/-/badge.svg)](https://sourcegraph.com/github.com/sajari/docconv)
[![Report card](https://goreportcard.com/badge/code.sajari.com/docconv/v2)](https://goreportcard.com/report/code.sajari.com/docconv/v2)
[![Sourcegraph](https://sourcegraph.com/github.com/sajari/docconv/v2/-/badge.svg)](https://sourcegraph.com/github.com/sajari/docconv/v2)

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

> **Note for returning users:** the Go import path for this package changed to `code.sajari.com/docconv`.
## Installation

If you haven't setup Go before, you first need to [install Go](https://golang.org/doc/install).

To fetch and build the code:

$ go get code.sajari.com/docconv/...
```console
$ go install code.sajari.com/docconv/v2/docd@latest
```

This will also build the command line tool `docd` into `$GOPATH/bin`. Make sure that `$GOPATH/bin` is in your `PATH` environment variable.
See `go help install` for details on the installation location of the installed `docd` executable. Make sure that the full path to the executable is in your `PATH` environment variable.

## Dependencies

tidy, wv, popplerutils, unrtf, https://github.com/JalfResi/justext
- tidy
- wv
- popplerutils
- unrtf
- https://github.com/JalfResi/justext

Example install of dependencies (not all systems):
### Debian-based Linux

```console
$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext
```

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext
### macOS

```console
$ brew install poppler-qt5 wv unrtf tidy-html5
$ go get github.com/JalfResi/justext
```

### Optional dependencies

To add image support to the `docconv` library you first need to [install and build gosseract](https://github.com/otiai10/gosseract/tree/v2.2.4).

Now you can add `-tags ocr` to any `go` command when building/fetching/testing `docconv` to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...
```console
$ go get -tags ocr code.sajari.com/docconv/v2/...
```

This may complain on macOS, which you can fix by installing [tesseract](https://tesseract-ocr.github.io) via brew:

$ brew install tesseract
```console
$ brew install tesseract
```

## docd tool

@@ -55,24 +72,22 @@ The `docd` tool runs as either:

Optionally you can build it yourself:

```
cd docd
docker build -t docd .
```console
$ cd docd
$ docker build -t docd .
```

3. via the command line.

Documents can be sent as an argument, e.g.

$ docd -input document.pdf
```console
$ docd -input document.pdf
```

### Optional flags

- `addr` - the bind address for the HTTP server, default is ":8888"
- `log-level`
- 0: errors & critical info
- 1: inclues 0 and logs each request as well
- 2: include 1 and logs the response payloads
- `readability-length-low` - sets the readability length low if the ?readability=1 parameter is set
- `readability-length-high` - sets the readability length high if the ?readability=1 parameter is set
- `readability-stopwords-low` - sets the readability stopwords low if the ?readability=1 parameter is set
@@ -83,11 +98,10 @@ The `docd` tool runs as either:

### How to start the service

$ # This will only log errors and critical info
$ docd -log-level 0
$ # This will run on port 8000 and log each request
$ docd -addr :8000 -log-level 1
```console
$ # This runs on port 8000
$ docd -addr :8000
```

## Example usage (code)

@@ -104,15 +118,14 @@ package main

import (
"fmt"
"log"

"code.sajari.com/docconv"
"code.sajari.com/docconv/v2"
)

func main() {
res, err := docconv.ConvertPath("your-file.pdf")
if err != nil {
log.Fatal(err)
// TODO: handle
}
fmt.Println(res)
}
@@ -125,9 +138,8 @@ package main

import (
"fmt"
"log"

"code.sajari.com/docconv/client"
"code.sajari.com/docconv/v2/client"
)

func main() {
@@ -136,14 +148,14 @@ func main() {

res, err := client.ConvertPath(c, "your-file.pdf")
if err != nil {
log.Fatal(err)
// TODO: handle
}
fmt.Println(res)
}
```

Alternatively, via a `curl`:

```
curl -s -F input=your-file.pdf http://localhost:8888/convert
```console
$ curl -s -F input=@your-file.pdf http://localhost:8888/convert
```
2 changes: 1 addition & 1 deletion client/cmd/docconv-client/main.go
@@ -6,7 +6,7 @@ import (
"fmt"
"os"

"code.sajari.com/docconv/client"
"code.sajari.com/docconv/v2/client"
)

var (
25 changes: 10 additions & 15 deletions doc.go
@@ -4,8 +4,6 @@ import (
"bytes"
"fmt"
"io"
"io/ioutil"
"log"
"os"
"os/exec"
"time"
@@ -27,23 +25,24 @@ func ConvertDoc(r io.Reader) (string, map[string]string, error) {
go func() {
defer func() {
if e := recover(); e != nil {
log.Printf("panic when reading doc format: %v", e)
// TODO: Propagate error.
}
}()

meta := make(map[string]string)

doc, err := mscfb.New(f)
if err != nil {
log.Printf("ConvertDoc: could not read doc: %v", err)
// TODO: Propagate error.
mc <- meta
return
}

props := msoleps.New()
for entry, err := doc.Next(); err == nil; entry, err = doc.Next() {
if msoleps.IsMSOLEPS(entry.Initial) {
if oerr := props.Reset(doc); oerr != nil {
log.Printf("ConvertDoc: could not reset props: %v", oerr)
if err := props.Reset(doc); err != nil {
// TODO: Propagate error.
break
}

@@ -73,27 +72,23 @@ func ConvertDoc(r io.Reader) (string, map[string]string, error) {
// Document body
bc := make(chan string, 1)
go func() {

// Save output to a file
outputFile, err := ioutil.TempFile("/tmp", "sajari-convert-")
var buf bytes.Buffer
outputFile, err := os.CreateTemp("/tmp", "sajari-convert-")
if err != nil {
// TODO: Remove this.
log.Println("TempFile Out:", err)
bc <- buf.String()
return
}
defer os.Remove(outputFile.Name())

err = exec.Command("wvText", f.Name(), outputFile.Name()).Run()
if err != nil {
// TODO: Remove this.
log.Println("wvText:", err)
// TODO: Propagate error.
}

var buf bytes.Buffer
_, err = buf.ReadFrom(outputFile)
if err != nil {
// TODO: Remove this.
log.Println("wvText:", err)
// TODO: Propagate error.
}

bc <- buf.String()
91 changes: 91 additions & 0 deletions doc_test.go
@@ -0,0 +1,91 @@
package docconv

import (
"os"
"os/exec"
"path"
"strings"
"testing"
"time"

"github.com/google/go-cmp/cmp"
)

func TestConvertDoc(t *testing.T) {
if _, err := exec.LookPath("wvText"); err != nil {
t.Skip("wvText not installed")
return
}

tests := []struct {
file string
wantTrimmedText string
wantMeta map[string]string
wantErr bool
}{
{
file: "001-test.doc",
wantTrimmedText: "test",
wantMeta: map[string]string{
"AppName": "Microsoft Office Word",
"CharCount": "4",
"Character count": "4",
"CodePage": "1252",
"Company": "",
"CreateTime": "2023-09-13 01:54:00 +0000 UTC",
"CreatedDate": "1694570040",
"Dirty links": "false",
"DocSecurity": "0",
"Document parts": "0",
"EditTime": "1970-01-01 00:00:00 +0000 UTC",
"Heading pair": "0",
"Hyperlinks changed": "false",
"LastAuthor": "cloudconvert_7",
"LastSaveTime": "2023-09-13 01:54:00 +0000 UTC",
"Line count": "1",
"ModifiedDate": "1694570040",
"PageCount": "1",
"Paragraph count": "1",
"RevNumber": "1",
"Scale": "false",
"Shared document": "false",
"Template": "Normal",
"Version": "1048576",
"WordCount": "0",
},
},
}
for _, tt := range tests {
t.Run(tt.file, func(t *testing.T) {
f, err := os.Open(path.Join("testdata", tt.file))
if err != nil {
t.Fatal(err)
}
defer f.Close()

gotText, gotMeta, err := ConvertDoc(f)
if (err != nil) != tt.wantErr {
t.Errorf("ConvertDoc() error = %v, wantErr %v", err, tt.wantErr)
return
}
gotText = strings.TrimSpace(gotText)
if gotText != tt.wantTrimmedText {
t.Errorf("ConvertDoc() text = %v, want %v", gotText, tt.wantTrimmedText)
}
if !cmp.Equal(tt.wantMeta, gotMeta, maybeTimeComparer) {
t.Errorf("ConvertDoc() meta mismatch (-want +got):\n%v", cmp.Diff(tt.wantMeta, gotMeta, maybeTimeComparer))
}
})
}
}

// Compares strings as time.Times if they look like times. Required because
// wvText returns different time formats depending on system clock.
var maybeTimeComparer = cmp.Comparer(func(x, y string) bool {
xt, xterr := time.Parse("2006-01-02 15:04:05 -0700 MST", x)
yt, yterr := time.Parse("2006-01-02 15:04:05 -0700 MST", y)
if xterr == nil && yterr == nil {
return xt.Equal(yt)
}
return x == y
})
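
To make the behaviour of `maybeTimeComparer` easier to see without the `go-cmp` dependency, the same comparison rule can be sketched as a plain function using only the standard library. The names `layout` and `equalMaybeTime` are hypothetical, and the sample timestamps are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// layout is the same time format string used by maybeTimeComparer.
const layout = "2006-01-02 15:04:05 -0700 MST"

// equalMaybeTime compares x and y as instants when both parse as times,
// and falls back to plain string equality otherwise.
func equalMaybeTime(x, y string) bool {
	xt, xerr := time.Parse(layout, x)
	yt, yerr := time.Parse(layout, y)
	if xerr == nil && yerr == nil {
		return xt.Equal(yt)
	}
	return x == y
}

func main() {
	// Same instant written in two different zones.
	fmt.Println(equalMaybeTime(
		"2023-09-13 01:54:00 +0000 UTC",
		"2023-09-13 03:54:00 +0200 CEST",
	))
	// Non-time strings are compared literally.
	fmt.Println(equalMaybeTime("Normal", "Normal"))
	fmt.Println(equalMaybeTime("1", "4"))
}
```

`time.Time.Equal` compares instants rather than wall-clock text, so two renderings of the same moment in different zones are treated as equal.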