Codepage Handling
Filtering is great but it is a darn sight more effective if you are working in the right character set encoding.
The right character set encoding is utf-8.
If your input data is in utf-8, you're golden; you can skip this topic entirely. If not, then your task is to get your input data into utf-8, so it can be filtered and processed properly.
If you are running Merge then the simple answer is to use the -symbolSet option. When that is present, Merge will automatically use uconv to convert your data from the nominated code page to utf-8. In essence, it is the first filter run and allows subsequent filters, and Merge itself, to expect utf-8 based input.
The -symbolSet option is available as of 3.0.002.03. Before then, one would've used uconv ahead of time to convert the data given to Merge or, if using the ConverDatToXml filter, used the -symbolSet parameter that it supports (and still supports).
uconv can also be run standalone, say in Job Processing scripts, if your code structure makes you want to employ it that way.
uconv
Included with DocOrigin is the IBM ICU internationalization library and tools. It is used throughout DocOrigin code, and it is a rightly well-regarded suite.
In the .../DO/Bin folder you will find uconv.exe (just uconv, with no .exe, on unix platforms). You can run:
uconv --help
and get blown away by all the options. It is impressive. Thank you IBM.
uconv encodings
For more fun, try:
uconv -l (that's a lower case L)
Reams of codepage identifiers will spew out. I bet your source data file is encoded in one of those codepages. Once you've located the encoding that applies to your data, you are away to the races.
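If you are unsure which codepage applies, one way to narrow it down is to decode a small sample of your data under a few candidate encodings and eyeball the results. The sketch below does that in Python rather than with uconv. The function name preview_encodings and the candidate list are illustrative assumptions, and Python's codec names (cp1252, for example) will not always match the identifiers that uconv -l prints.

```python
def preview_encodings(sample: bytes, candidates=("utf-8", "cp1252", "cp437", "cp037")):
    """Decode the same bytes under each candidate encoding.

    Returns a dict mapping encoding name -> decoded text, or None when the
    bytes are not valid in that encoding.
    """
    previews = {}
    for name in candidates:
        try:
            previews[name] = sample.decode(name)
        except UnicodeDecodeError:
            previews[name] = None
    return previews


if __name__ == "__main__":
    # "cafe" with an accented e, as a Windows-1252 application would write it:
    # the accented letter is the single byte 0xE9.
    sample = b"caf\xe9"
    for name, text in preview_encodings(sample).items():
        print(f"{name:8} -> {text!r}")
```

The candidate whose preview reads like your real data is the one to look up in the uconv -l listing.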
uconv usage syntax
The standalone usage of uconv is:
uconv -f sourceEncoding -t utf-8 -o outputFile inputFile
Where...
-f stands for from
-t stands for to
-o specifies the output file name
- You need to determine your data file's encoding and use it as the sourceEncoding.
- The target encoding is always utf-8.
- You can nominate any outputFile you like.
- And of course, you need to tell it where your inputFile is.
Try it! Seriously, try it ... standalone. When you can open the result with a tool that handles UTF-8 text, and you see what you want to see, then you are set -- and not before then. Find the right encoding and try it out.
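As a complement to eyeballing the result, you can also confirm that the converted file is well-formed utf-8. A minimal sketch in Python, with hypothetical helper names first_invalid_utf8_offset and check_file. It only checks that the bytes are valid utf-8, not that the sourceEncoding you picked was the right one, so you still need to look at the text.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if clean."""
    try:
        data.decode("utf-8")  # strict decoding raises on any malformed sequence
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_file(path: str) -> Optional[int]:
    """Run the same check on a file, e.g. the outputFile written by uconv."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())
```

A result of None means every byte decoded cleanly; a number tells you where in the file to start looking.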
as a Filter
uconv can be used directly as a filter with the following construct:
$E/uconv -f sourceEncoding -t utf-8 -o %out %in
%in and %out will be automatically supplied by Merge so as to facilitate the chaining of several filters. If %in is used in the first -filter option it will be replaced with the original data file name. For subsequent -filter options %in will be replaced by the output file name of the previous filter. %out is supplied as an automatically generated temporary file name. The %out of the last -filter is what is supplied for Merge to do the data + form merge operation.
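The substitution rules above can be modelled in a few lines. This is only a sketch of the described behaviour, not Merge's actual implementation: the function name expand_filter_chain, the temporary file names, the second filter myFilter, and the windows-1252 source encoding are all made up for illustration.

```python
def expand_filter_chain(filters, data_file, temp_name=lambda i: f"filter_{i}.tmp"):
    """Model the %in / %out substitution described above.

    filters   -- command templates, in the order the -filter options appear
    data_file -- the original data file name
    temp_name -- hypothetical stand-in for Merge's temporary file naming
    Returns (expanded_commands, file_handed_to_merge).
    """
    commands = []
    current_in = data_file
    for index, template in enumerate(filters, start=1):
        current_out = temp_name(index)
        commands.append(template.replace("%in", current_in).replace("%out", current_out))
        current_in = current_out  # the next filter reads what this one wrote
    return commands, current_in


if __name__ == "__main__":
    commands, merge_input = expand_filter_chain(
        [
            "$E/uconv -f windows-1252 -t utf-8 -o %out %in",
            "myFilter %in %out",  # hypothetical second filter
        ],
        "invoices.dat",
    )
    for command in commands:
        print(command)
    print("Merge reads:", merge_input)
```

The first command reads invoices.dat, the second reads the first command's temporary output, and the last temporary file is what the merge step receives.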