Gettext

The GNU gettext utilities provide a well-established solution for the internationalization of software. They allow users to switch between natural languages without switching executables. Many commercial translation offices can work with GNU gettext message catalogs (Portable Object files, PO), and various editors exist that help with the translation process. The translation process and programming process can happen asynchronously and without knowledge of each other. New translations can be added without recompilation.

The use of GNU gettext in D has been enabled by the mofile package, which this Gettext package builds on. If you used mofile directly, you would depend on the GNU xgettext utility for the task of string extraction, hoping it can parse D code as if it were C code. You would also be dealing with a number of limitations that are inherent to GNU gettext.

This Gettext package removes the need for an external parser and provides a more powerful interface than GNU gettext itself. It combines convenient and reliable string extraction - enabled by D's unique language features - and a comprehensive integration with Dub, while leveraging a well established ecosystem for translation into other natural languages.

Features

  • Concise translation markers that can be aliased to your preference.
  • All marked strings that are seen by the compiler are extracted automatically.
  • All (current and future) D string literal formats are supported.
  • Static initializers of fields, constants, immutables, manifest constants and anonymous enums can be marked as translatable (a D specialty).
  • Translatable strings may be part of generated and mixed in code (another D specialty).
  • Concatenations of translatable strings, untranslated strings and single chars are supported, even in initializers.
  • Arrays of translatable strings are supported, also when statically initialized.
  • Plural forms are language dependent, and play nice with format strings.
  • Multiple identical strings are translated once, unless they are given different contexts.
  • Notes to the translator can be attached to individual translatable strings.
  • Code occurrences of strings are communicated to the translator.
  • Available languages are discovered and selected at run-time.
  • Platform independent, not linked with C libraries.
  • Automated generation of the PO template.
  • Automated merging into existing translations (requires GNU gettext utilities).
  • Automated generation of Machine Object files (MO) (requires GNU gettext utilities).
  • Includes utility for listing unmarked strings in the project.

Installation

Dub configuration

Add the following to your dub.json:

dub.json
    "targetType": "executable",
    "dependencies": {
        "gettext": "~>1"
    },
    "configurations": [
        {
            "name": "default"
        },
        {
            "name": "i18n",
            "preGenerateCommands": [
                "dub run --config=xgettext",
                "dub run gettext:merge -- --popath=po --backup=none",
                "dub run gettext:po2mo -- --popath=po --mopath=mo"
            ],
            "copyFiles": [
                "mo"
            ]
        },
        {
            "name": "xgettext",
            "targetPath": ".xgettext",
            "versions": [ "xgettext" ],
            "subConfigurations": {
                "gettext": "xgettext"
            }
        }
    ]

or its equivalent dub.sdl:

dub.sdl
dependency "gettext" version="~>1"
configuration "default" {
    targetType "executable"
}
configuration "i18n" {
    targetType "executable"
    copyFiles "mo"
    preGenerateCommands \
        "dub run --config=xgettext" \
        "dub run gettext:merge -- --popath=po --backup=none" \
        "dub run gettext:po2mo -- --popath=po --mopath=mo"
}

configuration "xgettext" {
    targetType "executable"
    targetPath ".xgettext"
    subConfiguration "gettext" "xgettext"
    versions "xgettext"
}

This may seem like a lot of boilerplate, but it automates many steps without taking away your control over them. We'll discuss these further below.

Module import

import gettext;

main() function

Insert the following line at the top of your main function:

mixin(gettext.main);

Ignore generated files

The PO template and MO files are generated, and need not be kept under version control. The executable in the .xgettext folder is an artefact of the string extraction process. If you use Git, add these lines to .gitignore:

.xgettext
*.pot
*.mo

Usage

Marking strings

Prepend tr! to every string literal that needs to be translated. For instance:

writeln("This string will remain untranslated.");
writeln(tr!"This string is to be translated");

Note that you may rename tr to whatever you want:

import gettext : _ = tr;
writeln(_!"This string is to be translated");

No additional changes to any configurations are needed to make this work.

Plural forms

Sentences whose wording depends on a number should supply both the singular and the plural form, together with the number, like this:

// Before:
writefln("%d green bottle(s) hanging on the wall", n);
// After:
writeln(tr!("one green bottle hanging on the wall",
            "%d green bottles hanging on the wall")(n));

Note that the format specifier (%d, or %s, etc.) is optional in the singular form.

Many languages have not just two forms like the English language does, and translations in those languages can supply all the forms that the particular language requires. This is handled by the translator, and is demonstrated in the example below.

Marking format strings

Translatable strings can be format strings, used with std.format and std.stdio.writefln etc. These format strings do support plural forms, but the argument that determines the form must be supplied to tr and not to format. The corresponding format specifier will not be seen by format as it will have been replaced with a string by tr. Example:

format(tr!("Welcome %s, you may make a wish",
           "Welcome %s, you may make %d wishes")(n), name);

The format specifier that selects the form is the last specifier in the format string (here %d). In many sentences, however, the specifier that should select the form cannot be the last. In these cases, format specifiers must be given a position argument, where the highest position determines the form:

foreach (i, where; [tr!"hand", tr!"bush"])
    format(tr!("One bird in the %1$s", "%2$d birds in the %1$s")(i + 1), where);

Again, the specifier with the highest position argument will never be seen by format. On a side note, some translations may need a reordering of words, so translators may need to use position arguments in their translated format strings anyway.

Note: Specifiers with and without a position argument must not be mixed.
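
To illustrate positional specifiers on their own, here is a minimal sketch that uses only std.format and not this package. The helper describeBirds and the reordered pattern are hypothetical; the point is that the arguments are matched by position, whatever their order in the format string.

```d
import std.format : format;
import std.stdio : writeln;

// Hypothetical helper: `pattern` stands in for a (possibly translated)
// format string that refers to its arguments by position.
string describeBirds(string pattern, string where, int count)
{
    // Argument 1 is `where`, argument 2 is `count`,
    // regardless of the order in which `pattern` mentions them.
    return format(pattern, where, count);
}

void main()
{
    // Source order: the count comes first in the sentence.
    writeln(describeBirds("%2$d birds in the %1$s", "bush", 2));
    // A translation may reorder the words without any change to the code.
    writeln(describeBirds("In the %1$s: %2$d birds", "bush", 2));
}
```

This is why translators can reorder words freely once the source string uses position arguments.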

Concatenations

Translators will be able to produce the best translations if they get to work with full sentences, like

auto message = format(tr!`Could not open the file "%s" for reading.`, file);

However, in support of legacy code, concatenations of strings do work:

auto message = tr!`Could not open the file "` ~ file ~ tr!`" for reading.`;

Passing attributes

Optionally, two kinds of attributes can be passed to tr as additional template arguments: Comment, for passing notes to the translator, and Context, for disambiguating identical sentences with different meanings.

Passing notes to the translator

Sometimes a sentence can be interpreted to mean different things, and then it is important to be able to clarify things for the translator. Here is an example of how to do this:

auto name = tr!("Walter Bright", Comment("Proper name. Phonetically: ˈwɔltər braɪt"));

The GNU gettext manual has a section about the translation of proper names.

Disambiguate identical sentences

Multiple occurrences of the same sentence are combined into one translation by default. In some cases, that may not work well. Some languages, for example, may need to translate identical menu items in different menus differently. These can be disambiguated by adding a context like so:

auto labelOpenFile    = tr!("Open", Context("Menu|File"));
auto labelOpenPrinter = tr!("Open", Context("Menu|File|Printer"));

Contexts and comments can be combined:

auto message1 = tr!("Review the draft.", Context("document"));
auto message2 = tr!("Review the draft.", Context("nautical"),
                                         Comment(`Nautical term! "Draft" = how deep the bottom` ~
                                                 `of the ship is below the water level.`));

They work on plural forms too:

writeln(tr!("One license.", "%d licenses.", Context("software"),
                                            Comment("Notice to translator."))(n));
writeln(tr!("One license.", "%d licenses.", Context("driver's"))(n));

Selecting a translation

Use the following functions to discover translation tables, get the language code for a table and activate a translation:

string[] availableLanguages(string moPath = null)
string languageCode() @safe
string languageCode(string moFile) @safe
void selectLanguage(string moFile) @safe

Note that any translation that happens before a language is selected results in the hard-coded source string.
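
As a minimal sketch of how the functions listed above might be combined, the following program lists the discovered translations and activates the one the user picks. It assumes the default moPath is suitable and that the user enters a valid number; the menu layout is only an illustration.

```d
import gettext;
import std.conv : to;
import std.stdio : readln, writeln;
import std.string : strip;

void main()
{
    mixin(gettext.main);

    // Discover the available MO files, using the default moPath.
    string[] moFiles = availableLanguages();

    writeln("Please select a language:");
    writeln("[0] default");
    foreach (i, moFile; moFiles)
        writeln("[", i + 1, "] ", languageCode(moFile));

    // Assumption: the input is a valid index; a real program should validate it.
    const choice = readln().strip.to!size_t;
    if (choice > 0)
        selectLanguage(moFiles[choice - 1]);

    // Translated only if a language was selected above.
    writeln(tr!"This string is to be translated");
}
```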

Finding unmarked strings

To get an overview of all string literals in your project that are not marked as translatable, execute the following in your project root folder:

dub run gettext:todo -q

This prints a list of strings with their source file names and line numbers.

Fixing compilation errors

An attempt to translate a static string initializer will cause a compilation error, because the language is only selected at run-time. For example:

const string statically_initialized = tr!"Compile-time translation?";

will produce an error like this:

d:\SARC\gettext\source\gettext.d(285,20): Error: static variable `currentLanguage` cannot be read at compile time
source\mod1.d(7,24):        called from here: `TranslatableString("Compile-time translation?").gettext()`

Unless you're initializing a mutable static variable, the solution is to remove the explicit string type and let the type be inferred:

const statically_initialized = tr!"Compile-time translation!";

The correct translation will then be retrieved at the places where this constant is used, at run-time.

The way this works is that the type of the constant gets to be inferred as TranslatableString, which is a callable struct defined by this package. Whenever an instance of this struct is evaluated, the value of the translation is retrieved.
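
The idea of deferring the lookup can be sketched without this package. The following is a simplified stand-in, not the package's actual TranslatableString: the names LazyText, translated and catalog are hypothetical, and the catalog is a plain associative array instead of an MO file.

```d
import std.stdio : writeln;

// Hypothetical stand-in for the table that is loaded when a language is selected.
string[string] catalog;

// Simplified stand-in for a translatable string: it stores only the source
// text and performs the lookup each time it is read as a string.
struct LazyText
{
    string msgid;

    string translated() const
    {
        if (auto hit = msgid in catalog)
            return *hit;
        return msgid; // No translation available: fall back to the source text.
    }

    alias translated this; // Lets a LazyText be used where a string is expected.
}

// Built at compile time; no lookup happens here.
const greeting = LazyText("Hello");

void main()
{
    string before = greeting; // The lookup runs now; the catalog is still empty.
    writeln(before);          // Hello

    catalog["Hello"] = "Hallo"; // Simulates selecting a language.
    string after = greeting;    // Same constant, new lookup.
    writeln(after);             // Hallo
}
```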

But, there are places where you wouldn't want to change the type away from string, like the initializer of a mutable static variable or an aggregate member. In these cases there is no other way than to move to run-time assignment until after the language has been selected.
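
A minimal sketch of such a run-time assignment follows. The names Dialog, title and statusLine are hypothetical, and the language selection itself is left out.

```d
import gettext;
import std.stdio : writeln;

struct Dialog
{
    // Must stay a plain string, so it cannot be statically initialized with tr!.
    string title;
}

// A mutable module-level variable has the same restriction.
string statusLine;

void main()
{
    mixin(gettext.main);

    // Assumption: a language is selected here, for example with selectLanguage.

    // Assign at run-time, once the language is known.
    statusLine = tr!"Ready.";
    Dialog dialog;
    dialog.title = tr!"Preferences";

    writeln(dialog.title, ": ", statusLine);
}
```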

Added steps to the build process

Since the first configuration in your dub.json is empty (the "default" configuration) nothing special happens when you just do

dub run

So your normal code - compile - run - test cycle is not slowed down by any additional steps.

But when you do

dub run --config=i18n

the preGenerateCommands and copyFiles sections of the i18n configuration kick into action, which cause a couple of tasks to be performed:

  1. Translatable strings are extracted from the sources into a PO template.
  2. Translations in any existing PO files are updated according to the new template.
  3. PO files are converted into binary MO files.
  4. MO files are copied to the target directory.

We'll discuss these in a little more detail below.

Creating/updating the PO template automatically

In other languages, string extraction into a .pot file is done by invoking the xgettext command line tool from the GNU gettext utilities. Because xgettext does not know about all the string literal syntaxes in D, and cannot scan any generated code that may be mixed in, we employ D itself to perform this task.

This is how it works: the dub run --config=xgettext line in the preGenerateCommands section of your Dub configuration compiles your project into an alternative targetPath and runs the result, which executes the code that you have mixed in at the top of your main() function. That code makes smart use of D language features (see credits) to collect all strings that are to be translated, together with information from your Dub configuration and the latest Git tag. The rest of your main() is not executed in this configuration, but its strings are still extracted. In any other configuration the mixin is empty.

By default this creates (or overwrites) the PO template in the po folder of your project. This can be changed with options. To see which options are accepted, run the command with the --help option:

dub run --config=xgettext -- --help

Example

The teohdemo test contained in this package produces the following teohdemo.pot:

# PO Template for teohdemo.
# Copyright © 2022, SARC B.V.
# This file is distributed under the BSL-01 license.
# Bastiaan Veelo, 2022.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: v1.0.4\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-07-09T20:52:52.4027136Z\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#: source/app.d:10(main)
#, c-format
msgid "Selected language: %s"
msgstr ""

#: source/mod1.d:13(fun1) source/mod2.d:15(fun3)
msgid "Identical strings share their translation!"
msgstr ""

#: source/mod1.d:7
#, c-format
msgid "Hello! My name is %s."
msgstr ""

#: source/mod2.d:13(fun3)
msgid "Never used, but nevertheless translated!"
msgstr ""

#: source/mod2.d:8(fun2)
#, c-format
msgid "I'm counting one apple."
msgid_plural "I'm counting %d apples."
msgstr[0] ""
msgstr[1] ""

Updating existing translations automatically

The "dub run gettext:merge -- --popath=po" pre-generate command invokes the merge script that is included as a subpackage. This script runs the msgmerge utility from GNU gettext on the PO files that it finds. When needed, the path to msgmerge can be specified with the --gettextpath option. Any additional options are passed on to msgmerge directly, see its documentation. For example, you can use the --backup=numbered option to keep backups of original translations.

Note that if translatable strings were changed in the source, or new ones were added, the PO file is now incomplete. This is detected by the script, which then prints a warning. Changed strings are marked as #, fuzzy in the PO file, which can be picked up by editors as needing work. If a lookup in an outdated MO file does not succeed, the application will show the string as it occurs in the source.

Converting to binary form automatically

Similar to the previous step, the "dub run gettext:po2mo -- --popath=po --mopath=mo" pre-generate command invokes the po2mo subpackage, which runs the msgfmt utility from GNU gettext. This converts all PO files into MO files in the mo folder. This folder is then copied to the target directory for inclusion in the distribution of your package. Any additional options are passed on to msgfmt directly, see its documentation.

Adding translations

Each natural language that is going to be supported requires a .po file, which is derived from the generated .pot template file. This .po file is then edited to fill in the stubs with the correct translations.

There are various tools to do this, from dedicated stand-alone editors, editor plugins or modes, web applications to command line utilities.

Currently my personal favourite is Poedit. You open the template, select the target language and start translating with real-time suggestions from various online translation engines. Or you let the AI give it its best effort and translate all messages at once, before reviewing the problematic ones (requires subscription). It supports marking translations that need work and adding notes.

Updating translations

Any translations that have fallen behind the template will need to be updated by a translator. To detect any such translations, you can scan for warnings in the output of this command:

dub run -q gettext:merge -- --popath=po

Warnings will also show if GNU gettext detected what it thinks is a mistake. Sadly it sometimes gets it wrong: Weekdays, for example, are capitalized in English but not in many other languages. If a translation string only consists of one word, a weekday, it guesses that it is the start of a sentence and will complain if the translation does not start with a capital letter. Therefore, translatable strings should be full sentences if at all possible.

PO file editors will typically allow translators to quickly jump between strings that need their attention.

After a PO file has been edited, MO files must be regenerated with this command:

dub run gettext:po2mo -- --popath=po --mopath=mo

Of course you can also simply rerun

dub run --config=i18n

once more to execute both of the above commands in succession.

Example

These are some runs of the included teohdemo test:

d:\SARC\gettext\tests\teohdemo>dub run -q
Please select a language:
[0] default
[1] en_GB
[2] nl_NL
[3] uk_UA
1
Hello! My name is Joe.
I'm counting one apple.
Hello! My name is Schmoe.
I'm counting 3 apples.
Hello! My name is Jane.
I'm counting 5 apples.
Hello! My name is Doe.
I'm counting 7 apples.

d:\SARC\gettext\tests\teohdemo>dub run -q
Please select a language:
[0] default
[1] en_GB
[2] nl_NL
[3] uk_UA
3
Привіт! Мене звати Joe.
Я рахую 1 яблуко.
Привіт! Мене звати Schmoe.
Я рахую 3 яблука.
Привіт! Мене звати Jane.
Я рахую 5 яблук.
Привіт! Мене звати Doe.
Я рахую 7 яблук.

Notice how the translation of "apple" in the Ukrainian run takes three different endings, depending on the number of apples.
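
The choice between the three endings follows a rule that the translator states in the PO file header. As an illustration only, the sketch below hard-codes the expression commonly used for Ukrainian in GNU gettext Plural-Forms headers; the function name pluralIndexUk is hypothetical, and the PO file shipped with the test may word the rule differently.

```d
import std.stdio : writeln;

// Index of the plural form for a non-negative count, following the expression
// commonly used in the Plural-Forms header of Ukrainian PO files.
size_t pluralIndexUk(int n)
{
    if (n % 10 == 1 && n % 100 != 11)
        return 0; // 1, 21, 31, ...
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
        return 1; // 2-4, 22-24, ...
    return 2;     // 0, 5-20, 25-30, ...
}

void main()
{
    // The three forms of "apple" seen in the run above.
    immutable forms = ["яблуко", "яблука", "яблук"];
    foreach (n; [1, 3, 5, 7])
        writeln(n, " -> ", forms[pluralIndexUk(n)]);
}
```

In the package itself this rule is not written in D; it is read from the translation at run-time, which is why the source code only needs to supply the two English forms.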

Impact on footprint and performance

The implementation of Gettext keeps generated code to a minimum. Although the tr template is instantiated many times with unique parameters, it does not instantiate a new function each time. All that is left of a tr instantiation after compilation are the references to the strings that were passed in.

The discovery of translatable strings happens at compile time in the xgettext configuration, and the generation of the PO template happens during execution of the result of that compilation. This process takes about as much time as a regular compilation of your project.

There is a run time cost to the lookup of strings in the MO file. Currently, mofile reads the entire file into memory and does a binary search for the untranslated string to find the translated string. In case the cost of this lookup would become noticeable, mofile could easily be modified to cache the search with std.functional.memoize. Even memoizing a small number of lookups could have a big impact on the evaluations in an event loop.
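
To give an impression of what such caching does, here is a minimal sketch of std.functional.memoize applied to a hypothetical slowLookup function. It does not touch mofile; the function and the counter exist only to make the effect visible.

```d
import std.functional : memoize;
import std.stdio : writeln;

// Counts how often the slow path runs.
int slowCalls;

// Hypothetical stand-in for a search through an MO file.
string slowLookup(string msgid)
{
    ++slowCalls;
    return "<" ~ msgid ~ ">";
}

// Results are cached per distinct argument.
alias cachedLookup = memoize!slowLookup;

void main()
{
    foreach (_; 0 .. 1000)
        cachedLookup("Open");
    writeln(slowCalls); // 1
}
```

memoize also accepts a maximum cache size as a second template argument, which bounds memory use. A cache like this would presumably have to be invalidated whenever another language is selected.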

Limitations

Wide strings

Attempts to translate a wstring or dstring will result in a compilation error:

auto w = tr!"Hello"w; // Error: template `gettext.tr` does not match any template declaration

It would be pointless for this package to try and support all string widths. After all, the hello literal above is assembled as an array of UTF-8 chars, which is then converted to wstring. GNU gettext works internally with UTF-8, so it would need to convert the wstring from UTF-16 back to UTF-8, and after translation convert to UTF-16 again before it returns.

This limitation is easily dealt with by converting the translated string after lookup:

auto w = tr!"Hello".to!wstring;

Forced string evaluation

In some cases it may be necessary to forcefully evaluate a translatable string as a string instead of a TranslatableString instance:

static const tr_and_tr = tr!"One " ~ tr!"sentence.";
assert (tr_and_tr.toString == tr!"One sentence.".toString); // Fails without `.toString`.

Justified Strings

Format strings accept a width argument so that

"hi".format!"%10s";                // "        hi";

produces a string of width 10 in which the contents are right justified. However, passing a translatable string directly will not work as intended:

tr!"hi".format!"%10s";             // "hi";

Justification can be made to work by forcing translation of the translatable string before feeding it into format, like so:

tr!"hi".toString.format!"%10s";    // "        hi";

But since std.format is known to be heavy on compile times, it is probably better to use std.string.rightJustify instead, with either of these two alternatives:

tr!"hi".rightJustify!string(10);   // "        hi";
tr!"hi".toString.rightJustify(10); // "        hi";

Note that using rightJustify directly without explicit !string instantiation will not compile due to the isSomeString template requirement of rightJustify.

Named enums

Members of named enums need forced string evaluation, otherwise they resolve to the member identifier name instead:

enum E {member = tr!"translation"}
writeln(E.member);          // "member"
writeln(E.member.toString); // "translation"

In contrast, anonymous enums and manifest constants do not require this treatment:

enum {member = tr!"translation"}
writeln(member); // "translation"
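
The same difference can be observed with plain strings, independent of this package, because D's standard library formats a member of a named enum by its identifier. A minimal sketch, with hypothetical names Named and plain:

```d
import std.conv : to;
import std.stdio : writeln;

enum Named : string { member = "translation" }
enum { plain = "translation" }

void main()
{
    writeln(Named.member.to!string);    // member
    writeln(cast(string) Named.member); // translation
    writeln(plain);                     // translation
}
```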

Credits

This package was sponsored by SARC B.V. The idea for automatic string extraction came from H.S. Teoh [1], [2], with optimizations by Steven Schveighoffer [3]. Reading of MO files was implemented by Roman Chistokhodov [4].

TODO

Investigate the merit of: