Commit
filter control characters from content-level actual text
Seeing backspaces, which are maybe meant to represent underscores. Upsets
XML, which just doesn't allow them.
dwarring committed Oct 19, 2023
1 parent cd8a12b commit f1b67e3
Showing 10 changed files with 32 additions and 45 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -18,7 +18,7 @@ jobs:
#- windows-latest
raku-version:
- 'latest'
- '2021.12'
- '2022.07'
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v2
17 changes: 3 additions & 14 deletions README.md
@@ -188,12 +188,10 @@ my PDF::XObject::Image $img .= open: "t/images/lightbulb.gif";
my $figure = $doc.Figure: $gfx, $img, :position[50, 70], :Alt("A light-bulb");
```

An [PDF::XObject::Form](https://pdf-raku.github.io/PDF-Class-raku/PDF/XObject/Form) may be associated with a marked content
sub-tree. This is achieved by marking the form against a document fragment, then calling `do` to repeatably
render the form, while inserting the fragment, as demonstrated below:
A [PDF::XObject::Form](https://pdf-raku.github.io/PDF-Class-raku/PDF/XObject/Form) may be associated with a document fragment. The
form can then be rendered, and the fragment inserted into the document, by repeatedly calling `do` on the fragment, as demonstrated below:

```raku

use PDF::Tags;
use PDF::Tags::Elem;
use PDF::Class;
@@ -222,20 +220,11 @@ $page.graphics: -> $gfx {
};
}

# multiple rendering of the form, and insertion of its structure tree
# multiple rendering of the form, and insertion into the structure tree
$doc.do($gfx, $form-frag, :position[150, 70]);
$doc.do($gfx, $form-frag, :position[150, 20]);
}

```

To insert an XObject Form that has marked content:

1. Create a new fragment element.
2. Create the Form XObject, marking content against the fragment
3. The `do` method can then be used to both render and insert
a copy of the fragment into the structure tree.

### Links

Links are usually contained in a block element, such as a `Paragraph`. If
2 changes: 1 addition & 1 deletion docs/PDF/Tags/Mark.md
@@ -92,7 +92,7 @@ The Marked Content ID within the content stream. These are usually numbered in s

method value() returns PDF::Content::Tag

The low-level [PDF::Content::Tag](https://pdf-raku.github.io/PDF-Content-raku) object, which contains further details on the tag:
The low-level [PDF::Content::Tag](https://pdf-raku.github.io/PDF-Content-raku/PDF/Content/Tag) object, which contains further details on the tag:

* `canvas` - The owner of the content stream; a PDF::Page or PDF::XObject::Form object.

2 changes: 1 addition & 1 deletion docs/PDF/Tags/Node.md
@@ -13,7 +13,7 @@ Methods

### method cos

Returns the underlying [PDF::Class](https://pdf-raku.github.io/PDF-Class-raku) or [PDF::Content](https://pdf-raku.github.io/PDF-Content-raku) object. The [PDF::Tags::Node](https://pdf-raku.github.io/PDF-Tags-raku/PDF/Tags/Node) subclass and [PDF::COS](https://pdf-raku.github.io/PDF-raku) type are mapped as follows:
Returns the underlying [PDF::Class](https://pdf-raku.github.io/PDF-Class-raku) or [PDF::Content](https://pdf-raku.github.io/PDF-Content-raku/PDF/Content) object. The [PDF::Tags::Node](https://pdf-raku.github.io/PDF-Tags-raku/PDF/Tags/Node) subclass and [PDF::COS](https://pdf-raku.github.io/PDF-raku) type are mapped as follows:

<table class="pod-table">
<thead><tr>
4 changes: 2 additions & 2 deletions docs/PDF/Tags/XML-Writer.md
@@ -12,10 +12,10 @@ Synopsis
--------

use PDF::Class;
use PDF::Tags;
use PDF::Tags::Reader;
use PDF::Tags::XML-Writer;
my PDF::Class $pdf .= open: "t/write-tags.pdf";
my PDF::Tags $tags .= read: :$pdf;
my PDF::Tags::Reader $tags .= read: :$pdf;
my PDF::Tags::XML-Writer $xml-writer .= new: :debug, :root-tag<Docs>;
# atomic write
say $xml-writer.Str($tags);
17 changes: 3 additions & 14 deletions docs/index.md
@@ -188,12 +188,10 @@ my PDF::XObject::Image $img .= open: "t/images/lightbulb.gif";
my $figure = $doc.Figure: $gfx, $img, :position[50, 70], :Alt("A light-bulb");
```

An [PDF::XObject::Form](https://pdf-raku.github.io/PDF-Class-raku/PDF/XObject/Form) may be associated with a marked content
sub-tree. This is achieved by marking the form against a document fragment, then calling `do` to repeatably
render the form, while inserting the fragment, as demonstrated below:
A [PDF::XObject::Form](https://pdf-raku.github.io/PDF-Class-raku/PDF/XObject/Form) may be associated with a document fragment. The
form can then be rendered, and the fragment inserted into the document, by repeatedly calling `do` on the fragment, as demonstrated below:

```raku

use PDF::Tags;
use PDF::Tags::Elem;
use PDF::Class;
@@ -222,20 +220,11 @@ $page.graphics: -> $gfx {
};
}

# multiple rendering of the form, and insertion of its structure tree
# multiple rendering of the form, and insertion into the structure tree
$doc.do($gfx, $form-frag, :position[150, 70]);
$doc.do($gfx, $form-frag, :position[150, 20]);
}

```

To insert an XObject Form that has marked content:

1. Create a new fragment element.
2. Create the Form XObject, marking content against the fragment
3. The `do` method can then be used to both render and insert
a copy of the fragment into the structure tree.

### Links

Links are usually contained in a block element, such as a `Paragraph`. If
14 changes: 12 additions & 2 deletions lib/PDF/Tags/Mark.rakumod
@@ -84,13 +84,23 @@ class PDF::Tags::Mark
fail "todo: update marked content attributes";
callsame();
}

sub sanitize(Str $_) {
# actual text sometimes have backspaces, etc?
.subst(
/<[ \x0..\x8 ]>/,
'',
:g
);
}

method ActualText {
$.attributes unless $!atts-built;
$!actual-text //= PDF::COS::TextString.COERCE: $_
$!actual-text //= sanitize PDF::COS::TextString.COERCE: $_
with %!attributes<ActualText>;
$!actual-text;
}
method remove-actual-text {
method remove-actual-text is DEPRECATED {
with $.ActualText {
$!actual-text = Nil;
$!value.attributes<ActualText>:delete;
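
The filter added in this hunk can be tried in isolation. The following is a minimal standalone sketch: the `sanitize` body follows the hunk above, while the sample string and the `$raw` and `$clean` variables are made up for illustration. XML 1.0 permits only tab, line feed and carriage return below U+0020, which is why a stray backspace (U+0008) in `ActualText` would make the serialized output ill-formed.

```raku
# Strip the control characters U+0000..U+0008, which include backspace (U+0008).
# Same character class as in the hunk above; tab (U+0009) and newline are kept.
sub sanitize(Str $text --> Str) {
    $text.subst(/<[ \x0..\x8 ]>/, '', :g);
}

# Hypothetical input: a backspace embedded in otherwise ordinary text.
my $raw   = "first\x[8]name";
my $clean = sanitize($raw);

say $clean;         # firstname
say $clean.chars;   # 9
```

Note that this range leaves other control characters that XML 1.0 also rejects, such as U+000B or U+001F, untouched; whether those occur in practice is not something the commit states.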
2 changes: 1 addition & 1 deletion lib/PDF/Tags/Node.rakumod
@@ -76,7 +76,7 @@ class PDF::Tags::Node {
Returns the underlying L<PDF::Class> or L<PDF::Content> object. The L<PDF::Tags::Node> subclass and L<PDF::COS> type are mapped as follows:
=begin table
PDF::Tags::Node object | PDF::Class object |Base class | Notes
PDF::Tags::Node object | PDF::Class object |Base class | Notes
=================================================
PDF::Tags | PDF::StructTreeRoot | PDF::Tags::Node::Parent | PDF structure tree root
PDF::Tags::Elem | PDF::StructElem | PDF::Tags::Node::Parent | Intermediate structure element node
2 changes: 1 addition & 1 deletion lib/PDF/Tags/Node/Parent.rakumod
@@ -16,7 +16,7 @@ class PDF::Tags::Node::Parent
has UInt $!elems;
has $.style is rw; # Computed CSS style

method elems is also<Numeric> {
method elems(::?CLASS:D:) is also<Numeric> {
$!elems //= do with $.cos.kids {
when Hash { 1 }
default { .elems }
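
The `::?CLASS:D:` invocant constraint added to `elems` restricts the method to defined instances. As a rough illustration only, using a hypothetical `Basket` class that is not part of PDF::Tags:

```raku
# Hypothetical class, for illustration; not part of PDF::Tags.
class Basket {
    has @.items;

    # ::?CLASS:D: means the invocant must be a defined instance of this class.
    method elems(::?CLASS:D:) { @!items.elems }
}

my $basket = Basket.new(items => [1, 2, 3]);
say $basket.elems;              # 3

# Calling the method on the type object throws, so `try` yields Nil.
my $result = try Basket.elems;
say $result.defined;            # False
```

Without the constraint, a call on the type object would reach the method body and fail there on attribute access instead of at dispatch.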
15 changes: 7 additions & 8 deletions lib/PDF/Tags/XML-Writer.rakumod
@@ -49,16 +49,15 @@ method !chunk(Str $s is copy, UInt $depth = 0) {
method !line(|c) { $!feed = True; self!chunk(|c); $!feed = True; }
method !frag(|c) { $*inline ?? self!chunk(|c) !! self!line(|c) }

sub html-escape(Str $_) {
sub xml-escape(Str $_) {
.trans:
/\&/ => '&amp;',
/\</ => '&lt;',
/\>/ => '&gt;',

}
multi sub str-escape(@a) { @a.map(&str-escape).join: ' '; }
multi sub str-escape(Str $_) {
html-escape($_).trans: /\"/ => '&quote;';
xml-escape($_).trans: /\"/ => '&quote;';
}
multi sub str-escape(Pair $_) { str-escape(.value) }
multi sub str-escape($_) is default { str-escape(.Str) }
@@ -233,7 +232,7 @@ multi method stream-xml(PDF::Tags::Elem $node, UInt :$depth is copy = 0) {
}
}
given html-escape($_) {
given xml-escape($_) {
my $frag = do {
when $omit-tag.so { $_ }
when .so { '<%s%s>%s</%s>'.sprintf($name, $att, $_, $name) }
@@ -287,7 +286,7 @@ multi method stream-xml(PDF::Tags::Mark $node, :$depth!) {
multi method stream-xml(PDF::Tags::Text $_, :$depth!) {
if .Str -> $text {
self!chunk(html-escape($text), $depth);
self!chunk(xml-escape($text), $depth);
}
}
@@ -297,7 +296,7 @@ method !marked-content(PDF::Tags::Mark $node, :$depth!) {
when PDF::Tags::Mark {
my $text = self!marked-content($_, :$depth);
}
when PDF::Tags::Text { html-escape(.Str) }
when PDF::Tags::Text { xml-escape(.Str) }
default { die "unhandled tagged content: {.WHAT.raku}"; }
}
@text.join;
@@ -319,10 +318,10 @@ method !marked-content(PDF::Tags::Mark $node, :$depth!) {
=head2 Synopsis
use PDF::Class;
use PDF::Tags;
use PDF::Tags::Reader;
use PDF::Tags::XML-Writer;
my PDF::Class $pdf .= open: "t/write-tags.pdf";
my PDF::Tags $tags .= read: :$pdf;
my PDF::Tags::Reader $tags .= read: :$pdf;
my PDF::Tags::XML-Writer $xml-writer .= new: :debug, :root-tag<Docs>;
# atomic write
say $xml-writer.Str($tags);
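
For reference, the escaping helper renamed in this file can be exercised on its own. The sketch below reproduces `xml-escape` from the hunk above as a standalone sub; `attr-escape` is a hypothetical name standing in for the `Str` candidate of `str-escape`, and the sample strings are invented. The sketch uses `&quot;`, the double-quote entity predefined by XML, rather than the `&quote;` spelling that appears in the hunk.

```raku
# Escape the three characters that are significant in XML character data.
# .trans works in a single pass, so the '&' in '&lt;' is not escaped again.
sub xml-escape(Str $text --> Str) {
    $text.trans:
        /\&/ => '&amp;',
        /\</ => '&lt;',
        /\>/ => '&gt;';
}

# Hypothetical helper: additionally escape double quotes for attribute values.
sub attr-escape(Str $text --> Str) {
    xml-escape($text).trans: /\"/ => '&quot;';
}

say xml-escape('a < b && c > d');    # a &lt; b &amp;&amp; c &gt; d
say attr-escape('say "hi" & bye');   # say &quot;hi&quot; &amp; bye
```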
