Merge pull request #78 from uddhavphatak/master
Update in documentation
michaelrsweet authored Oct 25, 2024
2 parents 0e45e49 + 0ab291a commit 1e55779
Showing 1 changed file with 154 additions and 0 deletions.
154 changes: 154 additions & 0 deletions doc/pdfio.md
@@ -118,6 +118,159 @@ that are defined in a separate header file:
```c
#include <pdfio-content.h>
```
Understanding PDF Files
-----------------------

A PDF file provides data and commands for displaying pages of graphics and text,
and is structured in a way that allows it to be displayed in the same way across
multiple devices and platforms.
The following PDF file displays "Hello, World!" on a single page:
```
%PDF-1.0 %Header starts here
%âãÏÓ
1 0 obj %Body starts here
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
2 0 obj
<<
/Rotate 0
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]/Type /Page
>>
endobj
3 0 obj
<<
/Font
<<
/F0
<<
/BaseFont /Times-Italic
/Subtype /Type1
/Type /Font
>>
>>
>>
endobj
4 0 obj
<<
/Length 65
>>
stream
1. 0. 0. 1. 50. 700. cm
BT
/F0 36. Tf
(Hello, World!) Tj
ET
endstream
endobj
5 0 obj
<<
/Pages 1 0 R
/Type /Catalog
>>
endobj
xref %Cross-reference table starts here
0 6
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n
trailer %Trailer starts here
<<
/Root 5 0 R
/Size 6
>>
startxref
459
%%EOF
```

### Header
The header is the first line of a PDF file and specifies the version of the PDF
format used, for example `%PDF-1.0`.

Since PDF files almost always contain binary data, they can become corrupted if line
endings are changed (for example, if the file is transferred over FTP in text mode).
To allow legacy file transfer programs to determine that the file is binary, it is
usual to include some bytes with character codes higher than 127 in the header, for
example `%âãÏÓ`. The percent sign marks a second header line (a comment), and the
remaining bytes are arbitrary character codes above 127. The whole header in our
example is:

```
%PDF-1.0
%âãÏÓ
```
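
As a rough illustration of the two header lines, the sketch below reads the version from the first line and checks the second line for bytes above 127. The helpers `parse_pdf_header` and `has_binary_marker` are hypothetical names invented for this example and are not part of PDFio; the code also assumes the start of the file has been copied into a NUL-terminated buffer.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper (not a PDFio function): read the version from a
// "%PDF-M.N" header at the start of buf.  Returns 1 on success, 0 otherwise.
static int
parse_pdf_header(const char *buf, int *major, int *minor)
{
  // "%%" in the format matches the literal percent sign
  return (sscanf(buf, "%%PDF-%d.%d", major, minor) == 2);
}

// Hypothetical helper: return 1 if the second line is a comment line that
// contains at least one byte above 127, 0 otherwise.
static int
has_binary_marker(const char *buf)
{
  const char *p = strpbrk(buf, "\r\n"); // End of the first line

  if (!p)
    return (0);

  p += strspn(p, "\r\n");               // Skip the line ending

  if (*p != '%')
    return (0);

  for (p ++; *p && *p != '\r' && *p != '\n'; p ++)
  {
    if ((unsigned char)*p > 127)
      return (1);
  }

  return (0);
}
```

A reader would typically call `parse_pdf_header` on the first bytes of the file and treat a return value of 0 as "not a PDF file". The binary marker is a convention rather than a requirement, so its absence should not be treated as an error.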

### Body
The file body consists of a sequence of objects, each preceded by an object number,
generation number, and the `obj` keyword on one line, and followed by the `endobj`
keyword on another. For example:

```
1 0 obj
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
```

Here, the object number is 1 and the generation number is 0 (it almost always is).
The content of object 1 lies between the two lines `1 0 obj` and `endobj`.
In this case, it is the dictionary `<</Kids [2 0 R] /Count 1 /Type /Pages>>`.
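
The object header line can be picked apart with a few lines of C. The following is only a sketch: `parse_obj_header` is a hypothetical helper written for this example, not a PDFio function, and it assumes the line is available as a NUL-terminated string.

```c
#include <ctype.h>
#include <stdio.h>

// Hypothetical helper (not a PDFio function): parse "N G obj" at the start
// of line.  Returns 1 and stores the object and generation numbers on
// success, 0 if the line is not an object header.
static int
parse_obj_header(const char *line, unsigned long *number, unsigned *generation)
{
  int consumed = 0; // Set by %n only if the "obj" keyword matched

  if (sscanf(line, "%lu %u obj%n", number, generation, &consumed) != 2)
    return (0);

  if (consumed == 0)
    return (0);     // Two numbers, but not followed by "obj"

  // Reject longer words such as "object"
  return (!isalnum((unsigned char)line[consumed]));
}
```

An indirect reference such as `2 0 R` also starts with two numbers, which is why the sketch checks that the `obj` keyword was actually matched before accepting the line.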

### Cross-Reference Table
The cross-reference table lists the byte offset of each object in the file body.
This allows random access to objects, meaning they don't have to be read in order.
Objects that are not needed are never read, so operations like counting the number
of pages in a PDF document are fast, even for large files.

Each object has an object number and a generation number:

- Generation numbers are used when a cross-reference table entry is reused.
- For simplicity, we will assume generation numbers are always zero and ignore them.

The cross-reference table consists of:

- A header line that gives the first object number and the number of entries.
- A special first entry for object 0.
- One line for each object in the file body.

```
0 6 %Six entries in table, starting at 0
0000000000 65535 f %Special entry
0000000015 00000 n %Object 1 is at byte offset 15
0000000074 00000 n %Object 2 is at byte offset 74
0000000192 00000 n %etc...
0000000291 00000 n
0000000409 00000 n %Object 5 is at byte offset 409
```
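
Each entry line holds a 10-digit byte offset, a 5-digit generation number, and a type character (`n` for an object in use, `f` for a free entry). A minimal sketch of reading one entry follows; the `xref_entry_t` type and the `parse_xref_entry` function are hypothetical names made up for this example and do not reflect how PDFio stores cross-reference data internally.

```c
#include <stdio.h>

// Hypothetical type for this example only
typedef struct
{
  unsigned long offset;     // Byte offset of the object in the file
  unsigned      generation; // Generation number
  char          type;       // 'n' = in use, 'f' = free
} xref_entry_t;

// Hypothetical helper (not a PDFio function): parse one cross-reference
// entry such as "0000000015 00000 n".  Returns 1 on success, 0 otherwise.
static int
parse_xref_entry(const char *line, xref_entry_t *entry)
{
  // Read at most 10 digits for the offset and 5 for the generation
  if (sscanf(line, "%10lu %5u %c", &entry->offset, &entry->generation, &entry->type) != 3)
    return (0);

  return (entry->type == 'n' || entry->type == 'f');
}
```

With the table loaded this way, a reader that wants object 2 can seek directly to the stored offset instead of scanning the file from the start.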

### Trailer
The first line of the trailer is just the `trailer` keyword. It is followed by the
trailer dictionary, which contains at least the `/Size` entry (the number of entries
in the cross-reference table) and the `/Root` entry (the object number of the
document catalog, which is the root element of the graph of objects in the body).
Next comes a line with just the `startxref` keyword, a line with a single number (the
byte offset of the start of the cross-reference table within the file), and finally
the line `%%EOF`, which signals the end of the PDF file.

```
trailer %Trailer keyword
<< %The trailer dictionary
/Root 5 0 R
/Size 6
>>
startxref %startxref keyword
459 %Byte offset of cross-reference table
%%EOF %End-of-file marker
```
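
Because the `startxref` value sits at the very end of the file, a reader usually begins there and works backwards. The sketch below shows the idea under simplifying assumptions: the last part of the file has already been copied into a NUL-terminated buffer, and no comment follows the `startxref` keyword (unlike the annotated listing above). `find_startxref` is a hypothetical helper for this example, not a PDFio function.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not a PDFio function): return the byte offset that
// follows the last "startxref" keyword in tail, or -1 if none is found.
static long
find_startxref(const char *tail)
{
  const char *p;           // Current match
  const char *last = NULL; // Last match found
  char       *end;         // End of the parsed number
  long       offset;       // Parsed offset

  // Keep the last occurrence in case the keyword appears more than once
  for (p = strstr(tail, "startxref"); p; p = strstr(p + 9, "startxref"))
    last = p;

  if (!last)
    return (-1);

  // strtol skips the line ending between the keyword and the number
  offset = strtol(last + 9, &end, 10);

  if (end == last + 9 || offset < 0)
    return (-1);

  return (offset);
}
```

For the example file the function would return 459, which is where the reader then expects to find the `xref` keyword.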


API Overview
@@ -132,6 +285,7 @@ PDFio exposes several types:
- `pdfio_stream_t`: An object stream



Reading PDF Files
-----------------
