Support .pages files #8

mish15 · 2015-04-06T01:43:04Z

Can we easily support the .pages extension?

oprimus · 2015-04-26T09:48:33Z

Not as straightforward as first thought. It is just a zip file, but inside there are Apple's IWA files rather than XML. IWA files are a protobuf stream compressed with snappy - sort of.
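
To make the "just a zip file" part concrete, here is a minimal Go sketch that lists the `.iwa` entries of an archive using only the standard library. The helper names (`iwaEntries`, `sampleArchive`) and the entry names in the demo are made up for illustration; the real layout inside a .pages file may differ between iWork versions.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"strings"
)

// iwaEntries returns the names of all archive entries ending in ".iwa".
func iwaEntries(zr *zip.Reader) []string {
	var names []string
	for _, f := range zr.File {
		if strings.HasSuffix(strings.ToLower(f.Name), ".iwa") {
			names = append(names, f.Name)
		}
	}
	return names
}

// sampleArchive builds a tiny in-memory zip with empty entries so the
// sketch runs without a real document. The entry names are hypothetical.
func sampleArchive(names ...string) (*zip.Reader, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, name := range names {
		if _, err := zw.Create(name); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
}

func main() {
	zr, err := sampleArchive("Index/Document.iwa", "Index/Metadata.iwa", "preview.jpg")
	if err != nil {
		panic(err)
	}
	fmt.Println(iwaEntries(zr))
}
```

For a file on disk, `zip.OpenReader(path)` returns a `*zip.ReadCloser` whose embedded `Reader` can be passed as `iwaEntries(&rc.Reader)`.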

http://stackoverflow.com/questions/27454317/decompressing-snappy-files-missing-stream-identifier-chunk-and-crc-32c-checksum

https://github.com/google/protobuf
https://code.google.com/p/snappy-go/

oprimus · 2015-04-26T14:21:54Z

The snappy-go implementation doesn't seem to be compatible with Apple's butchered implementation. I'm getting over the missing stream identifier by prepending the reader:
snappy.NewReader(io.MultiReader(strings.NewReader("\xff\x06\x00\x00sNaPpY"), file))

The problem now appears to be that Apple is using the old COPY_4 tag which the snappy golang library doesn't support (as in it detects it and says "unsupported COPY_4 tag"). All other golang snappy libraries appear to be based on this one so don't support it either.

I've implemented the COPY_4 tag by porting it from implementations in other languages, in particular https://github.com/gray/compress-snappy/blob/master/src/csnappy_decompress.c . However, it now reports that the input is corrupt, so there must be something else that I can't track down.
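
For reference, here is a rough standalone sketch of a raw snappy block decoder that accepts all four tag types, including the 4-byte-offset copy. It follows the published snappy block format rather than the snappy-go code or the port mentioned above, and the names `decodeBlock`, `errCorrupt` and the `maxDecoded` cap are invented for this sketch.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errCorrupt = errors.New("corrupt snappy block")

// maxDecoded is an arbitrary safety cap for this sketch, not part of the format.
const maxDecoded = 1 << 26

// decodeBlock decodes one raw snappy block (no framing, no checksum).
func decodeBlock(src []byte) ([]byte, error) {
	want, s := binary.Uvarint(src) // preamble: decoded length as a varint
	if s <= 0 || want > maxDecoded {
		return nil, errCorrupt
	}
	dst := make([]byte, 0, want)
	for s < len(src) {
		tag := src[s]
		s++
		length, offset, extra := 0, 0, 0
		switch tag & 0x03 {
		case 0: // literal: copy bytes straight from the input
			length = int(tag >> 2)
			if length >= 60 { // length-1 is stored in the next 1..4 bytes
				n := length - 59
				if s+n > len(src) {
					return nil, errCorrupt
				}
				length = 0
				for i := n - 1; i >= 0; i-- {
					length = length<<8 | int(src[s+i])
				}
				s += n
			}
			length++
			if length <= 0 || length > len(src)-s {
				return nil, errCorrupt
			}
			dst = append(dst, src[s:s+length]...)
			s += length
			continue
		case 1: // COPY_1: 3-bit length, 11-bit offset
			length = 4 + (int(tag>>2) & 0x07)
			offset = int(tag >> 5)
			extra = 1
		case 2: // COPY_2: 6-bit length, 2-byte little-endian offset
			length = 1 + int(tag>>2)
			extra = 2
		case 3: // COPY_4: 6-bit length, 4-byte little-endian offset
			length = 1 + int(tag>>2)
			extra = 4
		}
		if s+extra > len(src) {
			return nil, errCorrupt
		}
		for i := extra - 1; i >= 0; i-- {
			offset = offset<<8 | int(src[s+i])
		}
		s += extra
		if offset <= 0 || offset > len(dst) {
			return nil, errCorrupt
		}
		// Copy byte by byte: source and destination may overlap.
		for i := 0; i < length; i++ {
			dst = append(dst, dst[len(dst)-offset])
		}
	}
	if uint64(len(dst)) != want {
		return nil, errCorrupt
	}
	return dst, nil
}

func main() {
	// 10 bytes of output: literal "abc", then a COPY_4 of length 7 at offset 3.
	block := []byte{0x0a, 0x08, 'a', 'b', 'c', 0x1b, 0x03, 0x00, 0x00, 0x00}
	out, err := decodeBlock(block)
	fmt.Printf("%q %v\n", out, err)
}
```

The demo block decodes to "abcabcabca". Whether Apple's streams deviate from the block format in other ways is not something this sketch can confirm.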

At this point I've seen nobody successfully reading these out there so they pretty much need to be considered a proprietary file format.

If we want to progress I think the next step is to use the C implementation of snappy to see if that reads it. If it doesn't then I'm not sure where to go next.

mish15 · 2015-04-26T18:07:10Z

Does this help? .iwa seem to be the same
https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md

Looks like snappy is kind of followed, but not really.

"they do not include the required Stream Identifier chunk, and compressed
chunks do not include a CRC-32C checksum.
The stream is composed of contiguous chunks prefixed by a 4 byte header.
The first byte indicates the chunk type, which in practice is always 0 for
iWork, indicating a Snappy compressed chunk. The next three bytes are
interpreted as a 24-bit little-endian integer indicating the length of the
chunk. The 4 byte header is not included in the chunk length."

mish15 · 2015-04-26T20:02:43Z

Any idea where the corrupt error comes from, e.g. is it in the header read or in the chunk processing loop? You're hardcoding the decoded length in the stream identifier, which is the first check for corruption.

From what I can read it's definitely doable. Looks like it's in the
snappy "framing format", not pure snappy, so probably needs to be read
and decoded in chunks instead of a single block as per
https://code.google.com/p/snappy/source/browse/trunk/framing_format.txt

Can you upload the WIP branch?

oprimus · 2015-04-27T02:42:33Z

Commit 7ed3c56
The snappy decoder needs to be altered to disable checksums for this to work (see below). With that change we get to the point where we can read the uncompressed stream and find the archive length of the first object. However, when trying to unmarshal the ArchiveInfo I get an "unexpected EOF".

 vi ~/go/src/code.google.com/p/snappy-go/snappy/decode.go

                 case chunkTypeCompressedData:
                        // Section 4.2. Compressed data (chunk type 0x00).
                        //if chunkLen < checksumSize {
                        //      r.err = ErrCorrupt
                        //      return 0, r.err
                        //}
                        buf := r.buf[:chunkLen]
                        if !r.readFull(buf) {
                                return 0, r.err
                        }
                        //checksum := uint32(buf[0]) | uint32(buf[1])<<8 | uint32(buf[2])<<16 | uint32(buf[3])<<24
                        //buf = buf[checksumSize:]

                        n, err := DecodedLen(buf)
                        if err != nil {
                                r.err = err
                                return 0, r.err
                        }
                        if n > len(r.decoded) {
                                r.err = ErrCorrupt
                                return 0, r.err
                        }
                        if _, err := Decode(r.decoded, buf); err != nil {
                                fmt.Println("decode error", err)
                                r.err = err
                                return 0, r.err
                        }
                        //if crc(r.decoded[:n]) != checksum {
                        //      fmt.Println("checksum")
                        //      r.err = ErrCorrupt
                        //      return 0, r.err
                        //}
                        r.i, r.j = 0, n
                        continue
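
One way to narrow down an "unexpected EOF" like this is to check the length prefix against the bytes actually available before handing anything to the protobuf unmarshaller. The sketch below assumes the decompressed stream starts with a varint length followed by that many bytes of ArchiveInfo, which is how I read the iWorkFileFormat notes linked earlier; `nextMessage` is a hypothetical helper, not part of any library.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
)

// nextMessage splits one varint-length-prefixed message off the front of buf
// and returns the message bytes plus whatever follows it.
func nextMessage(buf []byte) (msg, rest []byte, err error) {
	n, used := binary.Uvarint(buf)
	if used <= 0 {
		return nil, nil, fmt.Errorf("bad varint length prefix")
	}
	avail := len(buf) - used
	if n > uint64(avail) {
		// The prefix promises more bytes than were decompressed, which
		// would surface later as an unexpected EOF during unmarshalling.
		return nil, nil, fmt.Errorf("prefix says %d bytes, only %d available: %w",
			n, avail, io.ErrUnexpectedEOF)
	}
	end := used + int(n)
	return buf[used:end], buf[end:], nil
}

func main() {
	// Prefix 0x03, three message bytes, one trailing byte (made-up data).
	msg, rest, err := nextMessage([]byte{0x03, 0x08, 0x01, 0x10, 0xff})
	fmt.Println(len(msg), len(rest), err)
}
```

If the prefix is larger than the remaining data, the likely suspects are that only the first chunk was decompressed or that the decoder stopped early, rather than the protobuf schema itself.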

dhowden · 2015-09-26T22:37:11Z

The snappy tests are failing (no doubt due to the changes you mention here not being compatible with the tests). I have marked the failing tests to be skipped for the moment, but we really need to fix this.

gonedjur · 2018-01-31T13:43:46Z

I see that you include the three cases, if a quickview pdf is available, an xml or the protobuffer iwa.
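
As I understand those three cases, the choice could be sketched roughly as below. This is not the repository's actual code: `pickSource` is a made-up name, the entry names (`Preview.pdf`, `index.xml`) and the preference order are assumptions on my part.

```go
package main

import (
	"fmt"
	"strings"
)

// pickSource decides which representation to read from, given the entry
// names of the .pages archive. Assumed preference: PDF, then XML, then IWA.
func pickSource(names []string) string {
	hasXML, hasIWA := false, false
	for _, name := range names {
		lower := strings.ToLower(name)
		switch {
		case strings.HasSuffix(lower, "preview.pdf"):
			return "pdf" // embedded preview found, nothing else needed
		case lower == "index.xml":
			hasXML = true
		case strings.HasSuffix(lower, ".iwa"):
			hasIWA = true
		}
	}
	switch {
	case hasXML:
		return "xml"
	case hasIWA:
		return "iwa"
	}
	return "unknown"
}

func main() {
	fmt.Println(pickSource([]string{"Index/Document.iwa", "Metadata/Properties.plist"}))
}
```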

Does any of this work for iWork '14 files?

mish15 · 2018-01-31T21:30:08Z

Best thing to do is to test it and see. The pages format is pretty hacky

gonedjur · 2018-02-01T15:10:28Z

Looks like a no.

2018/02/01 14:39:28 Received file: t.pages (application/vnd.apple.pages)
archiveInfo:
2018/02/01 14:39:28 {"body":"","meta":{},"msecs":2}

Edit:

I wonder how these guys do it. https://cloudconvert.com/formats/document/pages

They manage 5.5 in some way. Only guys I've seen to do it...

mish15 · 2018-02-01T20:17:59Z

We welcome pull requests! :)

mish15 · 2018-02-01T20:18:50Z

It’s definitely possible; we just need to play with the encoding. From memory it wasn’t well documented anywhere, but it may be these days.

mish15 assigned oprimus Apr 26, 2015

mish15 added a commit that referenced this issue Apr 27, 2015

Ignore checksums in snappy (Apple does not set them, they will fail)

df758a4

See Issue #8

mish15 added a commit that referenced this issue Apr 27, 2015

Update to use local packages

958ece3

See Issue #8

mish15 added a commit that referenced this issue Apr 27, 2015

Add photo schema files

5f61ea5

These came from: https://github.com/obriensp/iWorkFileFormat/tree/master/iWorkFileInspector/iWorkFileInspector/Messages/Proto See Issue #8

mish15 added a commit that referenced this issue Apr 27, 2015

Some .pages file include a PDF or XML version

3e47cd5

See Issue #8 This captures some of these.

oprimus pushed a commit that referenced this issue Apr 27, 2015

Support older iWorks files by reading the embedded Preview.pdf file #8

6573028

oprimus pushed a commit that referenced this issue Apr 27, 2015

Unwind duplicate commit #8

13ff0ec

mish15 added a commit that referenced this issue May 1, 2015

Merge branch 'issue-8'

876c59f

See Issue #8

dhowden added a commit that referenced this issue Sep 26, 2015

Skip failing tests for now, updated #8 to note this needs fixing

7e783cb

gonedjur unassigned oprimus Jan 31, 2018
