Support to disable East Asian font hints in docx output #9910

Open
TomBener opened this issue Jun 24, 2024 · 23 comments

Comments

@TomBener
Contributor

New issue from #9817.

In my field, we often cite Chinese sources in articles, but they make up only a small part of the whole document. English journals therefore expect the typesetting to follow English conventions rather than Chinese ones, particularly for quotation marks. In this context, could Pandoc provide an option to disable East Asian font hints?

@jgm
Owner

jgm commented Jun 24, 2024

Do you mean disable them globally or in a fine-grained way (e.g. don't put a font hint inside this specially marked span) ?

@TomBener
Contributor Author

Disabling East Asian font hints globally would be fine, matching the behavior of earlier versions (such as Pandoc 3.2).

@jgm
Owner

jgm commented Jun 24, 2024

I'm confused, because you requested this feature in the first place, but when I implemented it you immediately asked for a way to disable it. Is it actually a useful feature?

@TomBener
Contributor Author

I understand your confusion. This is indeed an annoying case, especially for non-CJK users.

  1. When writing an article primarily in Chinese (or Japanese or Korean), there will almost always be some ASCII characters, so I want the East Asian characters to be enclosed with specific font attributes, as currently implemented.
  2. When writing an article primarily in English that includes a few CJK characters, I don't want East Asian font hints around the CJK text, so that punctuation (such as quotation marks) stays consistent throughout the document.

Do you mean disable them globally or in a fine-grained way (e.g. don't put a font hint inside this specially marked span) ?

Specifying the language manually would be feasible, but it is hard to do so for bibliographies.

@jgm
Owner

jgm commented Jun 24, 2024

Would it make sense to add the font hints only when the specified language (e.g., metadata lang, perhaps overridable at the Div or Span level) is a CJK language?

@TomBener
Contributor Author

TomBener commented Jun 24, 2024

Would it make sense to add the font hints only when the specified language (e.g., metadata lang, perhaps overridable at the Div or Span level) is a CJK language?

Sorry, I don't think this is a good idea, because lang can affect other settings in unexpected ways. For example, setting lang: zh-CN makes CSL use a localization that I don't want, as I reported in an old issue.

So I think the current implementation of adding East Asian font hints is good and does not need to change. Perhaps I could write a Lua filter to remove them when writing English articles, if necessary.

@TomBener
Contributor Author

I tried to write a Lua filter as follows:

function traverse(elem)
    if elem.t == "RawBlock" or elem.t == "RawInline" then
        if elem.format == "openxml" then
            elem.text = elem.text:gsub('<w:rFonts w:hint="eastAsia" />', '')
        end
    end

    return elem
end

return {
    { RawBlock = traverse },
    { RawInline = traverse }
}

But it didn't work. Could you please help to diagnose it or give some guidance?

@jgm
Owner

jgm commented Jun 24, 2024

A Lua filter can't remove these because they are added in the writer. Lua filters only affect the AST (which is the input to the writer).

@TomBener
Contributor Author

Thanks for your guidance. Are there any alternative ways?

@jgm
Owner

jgm commented Jun 25, 2024

Nothing will work but postprocessing the docx. (It wouldn't be that hard to find and remove the offending elements from the content in the container.)

Again, I'm open to providing this flexibility in pandoc, but I need to figure out what the best way to do it would be.
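
A minimal sketch of that kind of postprocessing, done outside pandoc with Python's standard zipfile module. It assumes the hints appear literally as `<w:rFonts w:hint="eastAsia" />` in word/document.xml, which is the string quoted elsewhere in this thread; the function name `strip_east_asia_hints` is made up for illustration, and the approach is plain string replacement rather than real XML processing.

```python
import io
import zipfile

# Assumption: the hint is serialized exactly like this in word/document.xml.
HINT = '<w:rFonts w:hint="eastAsia" />'
# Run properties that hold nothing but the hint.
HINT_ONLY_RPR = "<w:rPr>" + HINT + "</w:rPr>"


def strip_east_asia_hints(docx_bytes: bytes) -> bytes:
    """Return a copy of the docx with East Asian font hints removed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "word/document.xml":
                xml = data.decode("utf-8")
                # Drop run properties that only carried the hint, then any
                # hint that sits next to other run properties.
                xml = xml.replace(HINT_ONLY_RPR, "").replace(HINT, "")
                data = xml.encode("utf-8")
            # Every other entry is copied through unchanged.
            dst.writestr(info, data)
    return out.getvalue()
```

To use it, read the generated docx as bytes, pass them through the function, and write the result to a new file. All other archive entries (styles, settings, media) are left as they were.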

@jgm
Owner

jgm commented Jun 25, 2024

Sorry, I don't think this is a good idea as lang may affect other settings, which are unexpected in some cases. For example, when setting lang: zh-CN, CSL will use localization that I don't want, as I have reported in #7022 (comment).

You needn't set the document-wide lang. We could have the feature be sensitive to a lang on a div, for example. So you could put Chinese content inside

::: {lang=zh}
...
:::

and the Word writer could be trained to add the font hints inside that context (unless overridden by an interior span or div with lang=en).
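
The scoping rule described here (hints apply inside a CJK-tagged context unless an interior div or span overrides the language) can be sketched as a lookup of the innermost lang attribute. This only illustrates the proposal and is not pandoc's implementation; the function name `wants_east_asia_hint`, the set of language subtags, and the list-of-langs representation are all assumptions.

```python
from typing import Optional, Sequence

# Assumption: primary language subtags treated as CJK for this sketch.
CJK_PRIMARY_TAGS = {"zh", "ja", "ko"}


def wants_east_asia_hint(lang_stack: Sequence[Optional[str]]) -> bool:
    """Decide whether text gets a hint, given the lang attributes of its
    enclosing elements, outermost first (None means no lang attribute)."""
    for lang in reversed(lang_stack):
        if lang:  # the innermost element that sets lang wins
            primary = lang.replace("_", "-").split("-")[0].lower()
            return primary in CJK_PRIMARY_TAGS
    return False  # no lang anywhere: no hint in this sketch
```

Under this rule, text directly inside a `lang=zh` div would get hints (`["zh"]`), while text in a nested `lang=en` span would not (`["zh", "en"]`).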

@TomBener
Contributor Author

Thanks. I believe the step for post-processing the docx is feasible.

Regarding the language attribute, I think there is no need to change the current implementation, as the East Asian languages should always be enclosed with eastAsia font hints, no matter what the document language is. The need I am describing here is unusual.

@tarleb
Collaborator

tarleb commented Jun 25, 2024

Quick suggestion for post-processing: a binary custom Lua writer, i.e., a custom writer that defines a ByteStringWriter function instead of a Writer function, can be used to do the post-processing in pandoc itself. The pandoc.zip module can be used to unpack and re-pack the output of pandoc.write, and the file entries of the archive can be modified via normal string processing.

@jgm
Owner

jgm commented Jun 25, 2024

the East Asian Languages should always be enclosed with eastAsia font hints, no matter what the document language is.

The difficulty is determining whether quotation marks surrounding a Chinese phrase should themselves be considered East Asian or not. As you've noted, that depends on the context. Hence my suggestion to make this sensitive to language tagging.
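
The ambiguity shows up in Unicode's own character data. The sketch below uses Python's standard unicodedata module to print the East Asian Width property: ideographs are classed as wide and ASCII letters as narrow, but the curly quotation marks fall into the "ambiguous" class, so a character-by-character check cannot settle which convention they belong to. This is the Unicode property only, not the rule Word or pandoc applies, and the helper name `is_width_ambiguous` is made up for illustration.

```python
import unicodedata


def is_width_ambiguous(ch: str) -> bool:
    """True if Unicode marks the character's East Asian Width as ambiguous."""
    return unicodedata.east_asian_width(ch) == "A"


# An ASCII letter, a CJK ideograph, and the four curly quotation marks.
for ch in ["a", "中", "\u201c", "\u201d", "\u2018", "\u2019"]:
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch), is_width_ambiguous(ch))
```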

@TomBener
Contributor Author

TomBener commented Jun 26, 2024

Quick suggestion for post-processing: a binary custom Lua writer, i.e., a custom writer that defines a ByteStringWriter function instead of a Writer function, can be used to do the post-processing in pandoc itself. The pandoc.zip module can be used to unpack and re-pack the output of pandoc.write, and the file entries of the archive can be modified via normal string processing.

That sounds like a promising method, but I don't fully understand it. Could you please provide more details? For example, I'd like to remove <w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr> from document.xml in the unzipped docx. How would I use this method for that? Thanks!

TomBener reopened this Jun 26, 2024
@TomBener
Contributor Author

The difficulty is determining whether quotation marks surrounding a Chinese phrase should themselves be considered East Asian or not. As you've noted, that depends on the context. Hence my suggestion to make this sensitive to language tagging.

All of these issues stem from the fact that Simplified Chinese and English use the same quotation marks (Traditional Chinese does not). I think Pandoc doesn't need to try to handle this tricky issue further.

A Japanese designer has submitted a proposal to add standardized variation sequences for four quotation marks. I hope it can be adopted as soon as possible:

This document is a proposal for adding eight standardized variation sequences (SVSes) for the following four quotation marks that use VS1 (aka U+FE00) and VS2 (aka U+FE01) to distinguish between the forms whose usage varies according to well-established Western versus East Asian conventions:

U+2018 ‘ LEFT SINGLE QUOTATION MARK
U+2019 ’ RIGHT SINGLE QUOTATION MARK
U+201C “ LEFT DOUBLE QUOTATION MARK
U+201D ” RIGHT DOUBLE QUOTATION MARK

@tarleb
Collaborator

tarleb commented Jun 27, 2024

Could you please provide more details?

Sure, here we go:

--- file: docx-no-eahints.lua
-- Copyright: © 2024 Albert Krewinkel
-- License: MIT

local mediabag = require 'pandoc.mediabag'
local path = require 'pandoc.path'
local zip = require 'pandoc.zip'

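-- Binary custom writer: produce a normal docx, then strip the
-- East Asian font hints from document.xml inside the archive.
function ByteStringWriter(doc, opts)
  -- mediabag.fill loads the document's images into the mediabag first.
  local docx = pandoc.write(mediabag.fill(doc), 'docx', opts)
  local archive = zip.Archive(docx)
  for i, entry in ipairs(archive.entries) do
    if path.filename(entry.path) == 'document.xml' then
      -- None of these characters are magic in Lua patterns,
      -- so the string is matched literally.
      local pattern = '<w:rPr><w:rFonts w:hint="eastAsia" /></w:rPr>'
      local newcontent = entry:contents():gsub(pattern, '')
      -- Replace the entry with one holding the cleaned XML.
      archive.entries[i] = zip.Entry(entry.path, newcontent)
    end
  end
  return archive:bytestring()
end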

Use with

pandoc --to=docx-no-eahints.lua -o my-outfile.docx …

It's not really well-tested, but should work. Or, at the very least, should give a better idea of what I meant, and how this could work.

@TomBener
Contributor Author

TomBener commented Jun 28, 2024

Thanks @tarleb, it works. However, I ran into an issue: the page size changed from A4 to US Letter after applying the custom Lua writer. The following XML in document.xml, which comes from the reference docx passed via --reference-doc, was removed:

<w:sectPr w:rsidR="00D3414C">
  <w:pgSz w:h="16840" w:w="11900" />
  <w:pgMar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440" />
  <w:cols w:space="720" />
  <w:docGrid w:linePitch="360" />
</w:sectPr>

BTW, is it possible to use this custom Lua writer with Quarto?

@tarleb
Collaborator

tarleb commented Jun 28, 2024

Thanks @tarleb, it works. However, I ran into an issue: the page size changed from A4 to US Letter after applying the custom Lua writer. The following XML in document.xml, which comes from the reference docx passed via --reference-doc, was removed:

Using --reference-doc with the custom writer should still be possible.

BTW, is it possible to use this custom Lua writer with Quarto?

I don't know, sorry.

@TomBener
Contributor Author

TomBener commented Jun 28, 2024

I've uploaded a folder with files for testing: lua-custom-writer-test.zip

Using the same input file test.md, with custom.docx as the reference doc, running the command:

pandoc test.md -o test.docx --reference-doc custom.docx

generates test.docx with the A4 page size. But running the command:

pandoc test.md -o test.docx --reference-doc custom.docx -t docx-no-eahints.lua

generates test.docx with the US Letter page size. By unzipping test.docx, I was able to confirm that the latter conversion removed the East Asian font hints, but it also unexpectedly removed the following XML that defines the page size:

<w:sectPr w:rsidR="005F2E0E" w:rsidSect="002F2276">
    <w:pgSz w:h="16840" w:w="11900" />
    <w:pgMar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440" />
    <w:cols w:space="720" />
    <w:docGrid w:linePitch="326" />
</w:sectPr>

This behavior seems weird, and I have no idea what the problem is. Could you please help to diagnose the issue, @tarleb?
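
A small sketch for checking this without unzipping by hand: it reads word/document.xml from a docx and reports the page size declared in w:pgSz, converted from twips (1/1440 inch) to millimetres. The function name `page_size_mm` is hypothetical, and the code assumes the usual WordprocessingML namespace URI. It returns None when no w:pgSz element is present, which is the situation described above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Assumption: the standard WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_MM = 1440 / 25.4  # 1440 twips per inch, 25.4 mm per inch


def page_size_mm(docx_bytes: bytes) -> Optional[Tuple[float, float]]:
    """Return (width, height) in mm from the first w:pgSz, or None."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    pg = root.find(f".//{{{W_NS}}}pgSz")
    if pg is None:
        return None  # no page size declared in document.xml
    width = int(pg.get(f"{{{W_NS}}}w")) / TWIPS_PER_MM
    height = int(pg.get(f"{{{W_NS}}}h")) / TWIPS_PER_MM
    return round(width, 1), round(height, 1)
```

For the values quoted above (w:w="11900", w:h="16840") this gives about 209.9 by 297.0 mm, which is A4. US Letter corresponds to 12240 by 15840 twips.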

@tarleb
Collaborator

tarleb commented Jun 28, 2024

Weird. I currently don't have time to debug this, but it would be nice to get to the bottom of this. Does the reference doc get applied at all?

@TomBener
Contributor Author

Weird. I currently don't have time to debug this, but it would be nice to get to the bottom of this. Does the reference doc get applied at all?

Never mind, it's not urgent. The reference doc was applied in both conversions. You can see them in the folder above.

@TomBener
Contributor Author

TomBener commented Jul 4, 2024

@tarleb Can you kindly help to debug the page size issue above?


3 participants