    Internationalization

    DocOrigin is designed to support a wide range of languages and countries (locales). All text is handled as multi-byte Unicode text, which allows the use of virtually all standard left-to-right languages, including Chinese, Thai, and other East and Southeast Asian languages. Design allows the locale (a language/country combination) to be specified at the form level and at the individual text string level. Locale-specific word and character analysis routines provide correct word wrap and editing capability.
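To make the idea of locale-specific word analysis concrete, here is a minimal sketch that uses the standard JavaScript `Intl.Segmenter` API. It is not DocOrigin code and says nothing about how DocOrigin implements its own routines. The function name `findWordBreaks` and the sample strings are illustrative assumptions.

```javascript
// Illustrative only: locale-aware word segmentation with the standard
// Intl.Segmenter API. This is NOT a DocOrigin API; findWordBreaks is a
// hypothetical helper name.
function findWordBreaks(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    // Skip spaces and punctuation; keep only word-like segments.
    if (isWordLike) {
      words.push({ word: segment, index });
    }
  }
  return words;
}

// English: words are separated by spaces.
const englishWords = findWordBreaks("Hello wide world", "en");
console.log(englishWords.map((w) => w.word));

// Thai: no spaces between words, so break positions come from
// locale-specific (dictionary-based) analysis. The exact segments
// depend on the runtime's locale data.
const thaiWords = findWordBreaks("ภาษาไทยไม่มีช่องว่าง", "th");
console.log(thaiWords.map((w) => w.word));
```

A word-wrap routine can break a line only at boundaries like these, which is why the analysis has to know the locale of the text.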
    The following sections describe internationalization issues, including Unicode, and highlight the areas where Merge is constrained by its operating environment, printers, and other external components.

    Definitions

    Term

    Definition

    Internationalization

    The engineering of software to support multiple languages and multiple countries (locales), with consideration for areas such as language support (typically through Unicode characters), time, date, and number formatting, and the ability to easily localize (translate) and maintain all components of the software.

    Unicode

    An international standard that assigns a unique numeric code to each character (symbol).

    UTF-32

    An encoding of Unicode text in which each character is stored as a unique 32-bit numeric code. Although very complete, UTF-32 is rarely used.

    UTF-16

    A storage format for text that uses 16-bit values. In its simplest form, this format can store just over 63,000 characters (2^16 − 2048 = 63,488). This is more or less what is known as UCS-2. The full UTF-16 format also provides for more than 63,000 characters through a scheme of "surrogate pairs", in which two 16-bit values encode a single character from the full (UTF-32) Unicode range. The surrogate values come from the range 0xD800-0xDFFF, which accounts for the 2,048 missing codes mentioned above. Surrogate pairs (or just surrogates) are needed to encode some characters in East Asian languages such as Chinese.
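The surrogate-pair arithmetic can be sketched in a few lines of plain JavaScript. The function names `toUtf16` and `fromSurrogatePair` are illustrative assumptions, not DocOrigin functions. The sample character U+20000 is a Chinese ideograph that lies outside the 16-bit range.

```javascript
// Illustrative only: how UTF-16 encodes a character above 0xFFFF as a
// surrogate pair. Function names are hypothetical, not a DocOrigin API.
const HIGH_SURROGATE_START = 0xd800;
const LOW_SURROGATE_START = 0xdc00;
const SUPPLEMENTARY_START = 0x10000;

function toUtf16(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("Not a Unicode code: " + codePoint);
  }
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
    throw new RangeError("Surrogate values are reserved, not characters");
  }
  if (codePoint < SUPPLEMENTARY_START) {
    return [codePoint]; // fits in a single 16-bit value
  }
  const offset = codePoint - SUPPLEMENTARY_START; // a 20-bit number
  return [
    HIGH_SURROGATE_START + (offset >> 10),   // top 10 bits
    LOW_SURROGATE_START + (offset & 0x3ff),  // bottom 10 bits
  ];
}

function fromSurrogatePair(high, low) {
  return (
    SUPPLEMENTARY_START +
    ((high - HIGH_SURROGATE_START) << 10) +
    (low - LOW_SURROGATE_START)
  );
}

// 65,536 possible 16-bit values minus the 2,048 reserved for surrogates.
const singleUnitCharacters = 2 ** 16 - 2048;
console.log(singleUnitCharacters); // 63488

const pair = toUtf16(0x20000);
console.log(pair.map((u) => u.toString(16))); // [ 'd840', 'dc00' ]
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16)); // 20000
```

Each half of the pair carries 10 bits, so the 1,024 high values combined with the 1,024 low values address 1,048,576 additional characters.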

    Considerations

    • DocOrigin Design lets you enter any 16-bit Unicode character by typing its hexadecimal code and pressing Alt+X. For example, in a text label, type "99aa" and then press Alt+X.
    • DocOrigin applications use UTF-16 to store all form text. Most code, including the text word-break logic, recognizes surrogate pairs and handles them correctly.

    • Merge uses TrueType fonts as the source of font metrics (character widths). This format is limited to 65,535 characters and cannot handle surrogates.

    • All file I/O in Merge reads and writes 16-bit characters. (The file I/O routines detect when a single-byte file has been opened and convert it on the fly.)
    • The error messages for Merge are stored in an XML file (16-bit) that can be translated for different locales.
    • PCL Printing: Merge supports only 8-bit PCL printers. Any Unicode character above 127 is mapped through a PCL symbol set mapping that is sent to the printer. Any character that is available in the printer can be printed by using its corresponding Unicode character in the data or form. Merge uses a combination of downloaded TrueType fonts and symbol sets to display other Unicode characters (for example, Chinese).
    • PDF and PostScript Printing: The Merge implementation of PDF and PostScript relies on the Adobe list of standard glyph names. For all other Unicode characters, PDF output uses glyph ID encoding.
    • Script: The JavaScript language can handle 16-bit Unicode characters.
    • Merge includes scripting extensions for date, time, and currency conversions that use the IBM ICU (International Components for Unicode) libraries.
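Several of the considerations above depend on whether text contains characters that need surrogate pairs. The sketch below uses only standard JavaScript string behavior to find such characters. It is not a DocOrigin or Merge function, and the name `findSurrogateCharacters` is an illustrative assumption.

```javascript
// Illustrative only: locate characters that need a surrogate pair in a
// JavaScript (UTF-16) string. Hypothetical helper, not a DocOrigin API.
function findSurrogateCharacters(text) {
  const found = [];
  let unitIndex = 0; // position counted in 16-bit values
  // for...of walks the string character by character, keeping each
  // well-formed surrogate pair together as one character.
  for (const character of text) {
    const codePoint = character.codePointAt(0);
    if (codePoint > 0xffff) {
      found.push({
        character,
        unitIndex,
        hex: "U+" + codePoint.toString(16).toUpperCase(),
      });
    }
    unitIndex += character.length; // 1 normally, 2 for a surrogate pair
  }
  return found;
}

const sample = "A\u{20000}B";
console.log(sample.length);      // 4 (16-bit values)
console.log([...sample].length); // 3 (characters)
console.log(findSurrogateCharacters(sample));
// [ { character: '𠀀', unitIndex: 1, hex: 'U+20000' } ]
```

A check like this can flag text that falls outside the 16-bit range before it reaches a stage that cannot handle surrogates, such as the TrueType-based font metrics noted above.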