Indirect Conversions

When registered conversion routines each provide a path to a common encoding, but not to each other, it can be difficult to get data from one encoding to another. For example, if you only register a conversion routine from SHIFT-JIS to UTF-8, and then from UTF-8 to UTF-32, you have provided no direct path between SHIFT-JIS and UTF-32. But what if it was possible for the library to realize that UTF-8 is a possible substrate between the two encodings? What if it could automatically detect certain “Universal Encodings”, like the Unicode Encodings, and use those as a bridge between 2 disparate encodings?

This is a technique that glibc, musl libc, ICU, libiconv, and many other encoders have used for well over a decade now. They provide a conversion to a common substrate (generally Unicode in the form of UTF-8, UTF-16, or UTF-32, and mostly this last one) and then use it to convert between encodings.

The conversion registry will bridge two such encodings together without a developer needing to write a conversion routine specifically for that encoding pair. This is done by utilizing an intermediate encoding, typically UTF-32, as a go-between for the two functions. This is a technique that is common among text transcoding engines, albeit the process of doing so for other libraries is generally explicit, involved, and sometimes painful.
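The bridging idea can be sketched without the library at all. The following is a minimal, self-contained illustration, not cuneicode's implementation: every `toy_` name is hypothetical, and ISO-8859-1 (Latin-1) stands in for SHIFT-JIS purely because its decoding needs no lookup table. Two single-step routines exist (Latin-1 to UTF-32, and UTF-32 to UTF-8), and the bridged routine chains them so that no dedicated Latin-1 to UTF-8 routine has to be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first hop: decode one Latin-1 byte into a UTF-32 code point.
   Every Latin-1 byte value is equal to its Unicode code point. */
static uint32_t toy_latin1_decode_one(unsigned char byte) {
    return (uint32_t)byte;
}

/* Hypothetical second hop: encode one UTF-32 code point as UTF-8.
   Returns the number of bytes written, or 0 for an invalid code point. */
static size_t toy_utf8_encode_one(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        if (cp >= 0xD800u && cp <= 0xDFFFu) {
            return 0; /* surrogates are not valid scalar values */
        }
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    if (cp <= 0x10FFFFu) {
        out[0] = (unsigned char)(0xF0u | (cp >> 18));
        out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 4;
    }
    return 0;
}

/* Bridged conversion: Latin-1 -> (UTF-32) -> UTF-8.
   Returns the number of UTF-8 bytes written, or (size_t)-1 when the
   output is too small or the intermediate code point is invalid. */
static size_t toy_latin1_to_utf8(const unsigned char* in, size_t in_size,
                                 unsigned char* out, size_t out_capacity) {
    assert(in != NULL && out != NULL);
    size_t written = 0;
    for (size_t i = 0; i < in_size; ++i) {
        unsigned char encoded[4];
        uint32_t cp = toy_latin1_decode_one(in[i]);        /* hop 1 */
        size_t count = toy_utf8_encode_one(cp, encoded);   /* hop 2 */
        if (count == 0 || count > out_capacity - written) {
            return (size_t)-1;
        }
        memcpy(out + written, encoded, count);
        written += count;
    }
    return written;
}
```

Calling `toy_latin1_to_utf8` on the four Latin-1 bytes `c`, `a`, `f`, `0xE9` yields five UTF-8 bytes, because U+00E9 needs two bytes in UTF-8. The caller never sees the UTF-32 intermediate, which is the convenience an indirect conversion is meant to provide.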

When opening a new conversion routine, use the is_indirect member and related information on the cnc_conversion_info structure to find out if the conversion has been opened through an intermediate. Note that this is the only time the routines will tell you this: this information may not be accessible later.

Indirect Liaisons

Indirect encoding paths will not link together arbitrarily long chains of conversion steps to get from one encoding to another: the registry does not attempt to create a connectivity graph between all encodings (though, wouldn’t that be a fun project?). Remember that each intermediate encoding that the data must travel through imposes overhead! So, only one encoding is allowed to be the go-between for any pair of encodings.

There is a priority ordering that decides which encoding is chosen as the indirect liaison (or indirect substrate) when converting text from one encoding to the other. It is as follows:

  1. UTF-32

  2. UTF-32 Unchecked

  3. UTF-8

  4. UTF-8 Unchecked

  5. UTF-16

  6. UTF-16 Unchecked

  7. Everything Else.
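The single-hop rule and the priority ordering above can be sketched as a small lookup. This is an illustration under stated assumptions rather than the registry's actual algorithm: the `toy_` types and functions are hypothetical, the encoding name spellings are made up for the example, and preferring a direct pair over any liaison is assumed. `toy_conversion_info` only mimics the idea of reporting whether an intermediate was used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one registered conversion routine. */
typedef struct toy_pair {
    const char* from;
    const char* to;
} toy_pair;

/* Hypothetical result description, filled in when a conversion is opened. */
typedef struct toy_conversion_info {
    bool is_indirect;          /* true when a liaison encoding was needed */
    const char* indirect_name; /* the liaison, or NULL for a direct path  */
} toy_conversion_info;

/* Priority order from the list above (name spellings are assumptions). */
static const char* const toy_liaison_priority[] = {
    "utf-32", "utf-32-unchecked", "utf-8",
    "utf-8-unchecked", "utf-16", "utf-16-unchecked"
};

static bool toy_has_pair(const toy_pair* pairs, size_t count,
                         const char* from, const char* to) {
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(pairs[i].from, from) == 0 && strcmp(pairs[i].to, to) == 0) {
            return true;
        }
    }
    return false;
}

/* Returns true when a direct path or a path through exactly one
   intermediate exists. Longer chains are never attempted. */
static bool toy_open(const toy_pair* pairs, size_t count, const char* from,
                     const char* to, toy_conversion_info* info) {
    assert(pairs != NULL && info != NULL);
    info->is_indirect = false;
    info->indirect_name = NULL;
    if (toy_has_pair(pairs, count, from, to)) {
        return true; /* direct path, no liaison required */
    }
    size_t priority_count
        = sizeof(toy_liaison_priority) / sizeof(toy_liaison_priority[0]);
    for (size_t i = 0; i < priority_count; ++i) {
        const char* mid = toy_liaison_priority[i];
        if (toy_has_pair(pairs, count, from, mid)
            && toy_has_pair(pairs, count, mid, to)) {
            info->is_indirect = true;
            info->indirect_name = mid;
            return true;
        }
    }
    /* "Everything Else": any other registered encoding, in registration order. */
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(pairs[i].from, from) == 0
            && toy_has_pair(pairs, count, pairs[i].to, to)) {
            info->is_indirect = true;
            info->indirect_name = pairs[i].to;
            return true;
        }
    }
    return false;
}
```

With only `shift-jis` to `utf-8` and `utf-8` to `utf-32` registered, `toy_open` for `shift-jis` to `utf-32` succeeds and reports `utf-8` as the liaison. A chain that would need two intermediates fails, matching the one-liaison restriction described earlier.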

Indirect encoding conversions use the cnc_pivot_info type, which describes a buffer to use as scratch space whenever an algorithm cannot perform a direct conversion. Supplying a pivot avoids relying on an implementation-defined, stack-allocated buffer whose size might be too large for the inputs used (and thus overflow the stack) or too small for the inputs used (and thus require multiple calls to the conversion algorithm).
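To make the role of the scratch space concrete, here is a minimal sketch of a two-hop conversion that stages its intermediate code points in a caller-owned buffer. It does not use cnc_pivot_info or reproduce its layout: `toy_pivot` and `toy_latin1_to_utf16` are hypothetical names, and Latin-1 to UTF-16 through UTF-32 is chosen only because both hops fit in a few lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pivot: scratch space owned by the caller. */
typedef struct toy_pivot {
    uint32_t* code_points; /* intermediate UTF-32 storage   */
    size_t capacity;       /* number of code points it holds */
} toy_pivot;

/* Latin-1 -> (UTF-32 pivot) -> UTF-16, processed one pivot-sized chunk
   at a time. Returns the number of UTF-16 code units written, or
   (size_t)-1 when the output or the pivot is too small to make progress.
   *rounds receives how many times the pivot was filled and drained. */
static size_t toy_latin1_to_utf16(const unsigned char* in, size_t in_size,
                                  uint16_t* out, size_t out_capacity,
                                  toy_pivot pivot, size_t* rounds) {
    assert(in != NULL && out != NULL && rounds != NULL);
    size_t written = 0;
    *rounds = 0;
    if (in_size != 0 && (pivot.code_points == NULL || pivot.capacity == 0)) {
        return (size_t)-1;
    }
    while (in_size != 0) {
        size_t chunk = in_size < pivot.capacity ? in_size : pivot.capacity;
        /* Hop 1: fill the pivot with decoded code points. */
        for (size_t i = 0; i < chunk; ++i) {
            pivot.code_points[i] = (uint32_t)in[i];
        }
        /* Hop 2: drain the pivot into the output. Latin-1 code points
           are all below U+0100, so each one is a single UTF-16 unit. */
        for (size_t i = 0; i < chunk; ++i) {
            if (written == out_capacity) {
                return (size_t)-1;
            }
            out[written++] = (uint16_t)pivot.code_points[i];
        }
        in += chunk;
        in_size -= chunk;
        ++*rounds;
    }
    return written;
}
```

Because the caller chooses the pivot's size, the trade-off is visible at the call site: a 4-element pivot converts a 6-byte input in two rounds, while a larger pivot would finish in one round at the cost of more memory.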