Quick ‘n’ Dirty Tutorial

Setup

Use of this library is officially supported through CMake. Getting an up-to-date CMake can be difficult on non-Windows machines, especially when it comes from your system’s package manager, whose version tends to be several (dozen?) minor revisions, or an entire major revision, out of date. To get a very close to up-to-date CMake, a version is distributed through Python's package manager (pip) that works across all systems. You can get it (and the ninja build system) by using the following command in your favorite command line application (assuming Python is already installed):

python -m pip install --user --upgrade cmake ninja

If you depend on calling these executables using shorthand and not their full path, make sure that the Python “downloaded binaries” folder is contained within the PATH environment variable. Usually this is already done, but if you have trouble invoking cmake --version on your typical command line, please see the Python pip install documentation for more information, in particular about the --user option.

If you do not have Python or CMake/ninja, you must get a recent enough version directly from CMake, build/install it, and have a suitable build system around for CMake to pick up on (MSBuild from installing Visual Studio, make in most GNU distributions / MinGW on Windows, reachable through your PATH environment variable, and/or a personal installation of ninja).

Using CMake

Here’s a sample of the CMakeLists.txt to create a new project and pull in ztd.cuneicode in the simplest possible way:

project(my_app
	VERSION 1.0.0
	DESCRIPTION "My application."
	HOMEPAGE_URL "https://ztdcuneicode.readthedocs.io/en/latest/quick.html"
	LANGUAGES C
)

include(FetchContent)

FetchContent_Declare(ztd.cuneicode
	GIT_REPOSITORY https://github.com/soasis/cuneicode.git
	GIT_SHALLOW    ON
	GIT_TAG        main)
FetchContent_MakeAvailable(ztd.cuneicode)

This will automatically download and set up all the dependencies ztd.cuneicode needs (in this case, simply ztd.cmake, ztd.platform, and ztd.idk). You can override how ztd.cuneicode gets these dependencies using the standard FetchContent mechanisms described in the CMake FetchContent Documentation. When done configuring, simply use CMake’s target_link_libraries(…) to link it into your target:

# …

file(GLOB_RECURSE my_app_sources
	LIST_DIRECTORIES OFF
	CONFIGURE_DEPENDS
	source/*.c
)

add_executable(my_app ${my_app_sources})

target_link_libraries(my_app PRIVATE ztd::cuneicode)

Once you have everything configured and set up the way you like, you can then use ztd.cuneicode in your code, as shown below:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
	if (argc < 2) {
		fprintf(stderr, "A name argument must be given to the program!");
		return 1;
	}

	const char* name      = argv[1];
	const size_t name_len = strlen(name);

	char utf8_name_buffer[4096]    = { 0 };
	const size_t utf8_name_max_len = ztdc_c_string_array_size(utf8_name_buffer);
	char* utf8_name                = utf8_name_buffer;
	const size_t name_len_limit    = (name_len * CNC_C8_MAX);

	if (name_len_limit >= utf8_name_max_len) {
		fprintf(stderr,
		     "The name provided to this program was, unfortunately, too big!");
		return 2;
	}

	size_t utf8_name_len_after = utf8_name_max_len;
	size_t name_len_after      = name_len;
	cnc_mcerr err              = cnc_mcsntomcsn_exec_utf8(
          &utf8_name_len_after, &utf8_name, &name_len_after, &name);
	const size_t input_consumed = name_len - name_len_after;
	const size_t output_written = utf8_name_max_len - utf8_name_len_after;
	if (err != cnc_mcerr_ok) {
		const char* err_str = cnc_mcerr_to_str(err);
		fprintf(stderr,
		     "An error occurred when attempting to transcribe your name from "
		     "the execution encoding to UTF-8, stopping at input element %zu "
		     "and only writing %zu elements "
		     "(error name: %s).",
		     input_consumed, output_written, err_str);
		return 3;
	}

	printf("Hello there, ");
	fwrite(utf8_name_buffer, 1, output_written, stdout);
	printf("!\n");

	return 0;
}
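
The size check in the example above multiplies the input length by CNC_C8_MAX to get a worst-case output size before converting anything. Below is a minimal sketch of that kind of bound in plain C, with a guard against the multiplication overflowing. The names worst_case_output_size and fits_in_buffer are hypothetical helpers written for illustration and are not part of ztd.cuneicode; the per-element maximum is passed in as a parameter rather than assuming the value of CNC_C8_MAX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of ztd.cuneicode): computes
// input_len * max_units_per_input, reporting failure instead of wrapping
// around when the product does not fit in a size_t.
static bool worst_case_output_size(
     size_t input_len, size_t max_units_per_input, size_t* p_result) {
	if (max_units_per_input != 0
	     && input_len > SIZE_MAX / max_units_per_input) {
		return false; // the multiplication would overflow
	}
	*p_result = input_len * max_units_per_input;
	return true;
}

// Hypothetical helper: returns true when a worst-case conversion of
// input_len elements is guaranteed to fit in a buffer that can hold
// buffer_len elements (the same ">=" rejection as the example above).
static bool fits_in_buffer(
     size_t input_len, size_t max_units_per_input, size_t buffer_len) {
	size_t needed = 0;
	if (!worst_case_output_size(input_len, max_units_per_input, &needed)) {
		return false;
	}
	return needed < buffer_len;
}
```

A worst-case bound like this rejects some inputs that would have fit, but it lets the program refuse oversized input before doing any conversion work.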

Let’s get started by digging into some examples!

Note

If you would like to see more examples and additional changes besides what is covered below, please do feel free to make requests for them! This is not a very thorough tutorial, and there is still a lot of functionality that needs explanation!

Simple Conversions

Simple conversions are provided for UTF-8, UTF-16, UTF-32, execution encoding, and wide execution encoding. They let an end-user work directly with typed code unit buffers, such as the ztd_char8_t and ztd_char16_t arrays used below.

To convert from UTF-16 to UTF-8, use the free function in the library whose name is marked with the appropriate c16 and c8 prefixes:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

	const ztd_char16_t utf16_text[] = u"🥺🙏";
	ztd_char8_t utf8_text[9]        = { 0 };

	// Now, actually output it
	const ztd_char16_t* p_input = utf16_text;
	ztd_char8_t* p_output       = utf8_text;
	size_t input_size           = ztdc_c_string_array_size(utf16_text);
	size_t output_size          = ztdc_c_array_size(utf8_text);
	cnc_mcstate_t state         = { 0 };
	// call the function with the right parameters!
	cnc_mcerr err               = cnc_c16snrtoc8sn( // formatting
          &output_size, &p_output,     // output first
          &input_size, &p_input,       // input second
          &state);                     // state parameter
	// input_size started out as the string size (no null terminator)
	const size_t input_consumed
	     = (ztdc_c_string_array_size(utf16_text) - input_size);
	const size_t output_written = (ztdc_c_array_size(utf8_text) - output_size);
	if (err != cnc_mcerr_ok) {
		const char* err_str = cnc_mcerr_to_str(err);
		fprintf(stderr,
		     "An (unexpected) error occurred and the conversion could not "
		     "happen! Error string: %s (code: '%d')\n",
		     err_str, (int)err);
		return 1;
	}

	printf(
	     "Converted %zu UTF-16 code units to %zu UTF-8 code units, giving the "
	     "text:",
	     input_consumed, output_written);
	// requires a capable terminal / output, but will be
	// UTF-8 text!
	fwrite(utf8_text, sizeof(ztd_char8_t), output_written, stdout);
	printf("\n");

	return 0;
}

We use fwrite to print the raw UTF-8 text. It may not appear correctly on a terminal whose encoding is not UTF-8, which may be the case for older Microsoft terminals, some Linux kernel configurations, and deliberately misconfigured Mac OSX terminals. The values updated by the function call also provide some other useful properties:

the amount of data read (using initial_input_size - input_size);
the amount of data written out (using initial_output_size - output_size);
a pointer to any extra input after the operation (p_input);
and, a pointer to any extra output that was not written to after the operation (p_output).
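
To make the four properties in the list above concrete, here is a minimal sketch in plain C. It uses a hypothetical stand-in, toy_copy_n, that only copies bytes but updates its size and pointer arguments in the same in/out style; it is not a ztd.cuneicode function, and toy_progress and toy_copy_and_report are likewise names invented for this illustration.

```c
#include <stddef.h>

// Hypothetical stand-in for a conversion function (NOT a ztd.cuneicode
// API): copies bytes one at a time, decrementing the sizes and advancing
// the pointers as it goes. Returns 0 on success, or 1 when the output ran
// out of space before the input was exhausted.
static int toy_copy_n(size_t* p_output_size, char** p_output,
     size_t* p_input_size, const char** p_input) {
	while (*p_input_size > 0) {
		if (*p_output_size == 0) {
			return 1; // insufficient output space
		}
		**p_output = **p_input;
		*p_output += 1;
		*p_input += 1;
		*p_output_size -= 1;
		*p_input_size -= 1;
	}
	return 0;
}

// The four properties that can be read off after a call.
typedef struct toy_progress {
	size_t input_consumed;  // initial_input_size - input_size
	size_t output_written;  // initial_output_size - output_size
	const char* input_rest; // first unread input element
	char* output_rest;      // first unwritten output element
} toy_progress;

static toy_progress toy_copy_and_report(char* output,
     size_t initial_output_size, const char* input,
     size_t initial_input_size, int* p_err) {
	size_t output_size  = initial_output_size;
	size_t input_size   = initial_input_size;
	char* p_output      = output;
	const char* p_input = input;
	*p_err = toy_copy_n(&output_size, &p_output, &input_size, &p_input);
	toy_progress progress = { initial_input_size - input_size,
		initial_output_size - output_size, p_input, p_output };
	return progress;
}
```

Because the leftover pointers and sizes describe exactly where the call stopped, a caller that runs out of output space can provide a fresh buffer and resume from input_rest rather than starting over.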

One can convert from other forms of UTF-8/16/32 encodings, and from the execution encoding/wide execution encoding (the encodings used by default for const char[] and const wchar_t[] strings, respectively), using the various prefix-based functions.

Counting

More often than not, the exact amount of input and output might not be known beforehand. Therefore, it may be useful to count how many elements of output would be required before allocating exactly that much space to hold the result. In this case, simply passing NULL for the data output pointer instead of a real pointer, while providing a non-NULL pointer for the output size argument, will give an accurate reading for the amount of space that is necessary:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

	const ztd_char16_t utf16_text[] = u"🥺🙏";

	const ztd_char16_t* p_count_input = utf16_text;
	// This size does NOT include the null terminating character.
	size_t count_input_size   = ztdc_c_string_array_size(utf16_text);
	cnc_mcstate_t count_state = { 0 };
	size_t output_size_after  = SIZE_MAX;
	// Use the function but with NULL for the output pointer
	cnc_mcerr count_err = cnc_c16snrtoc8sn(
	     // To get the proper size for this conversion, we use the same
	     // function but with a "NULL" output pointer:
	     &output_size_after, NULL,
	     // input second
	     &count_input_size, &p_count_input,
	     // state parameter
	     &count_state);
	// Compute the needed space:
	const size_t output_size_needed = SIZE_MAX - output_size_after;
	if (count_err != cnc_mcerr_ok) {
		const char* err_str = cnc_mcerr_to_str(count_err);
		fprintf(stderr,
		     "An (unexpected) error occurred and the counting could not "
		     "happen! Error string: %s (code: '%d')\n",
		     err_str, (int)count_err);
		return 1;
	}

	ztd_char8_t* utf8_text = malloc(output_size_needed * sizeof(ztd_char8_t));

	// prepare for potential error return and error handling
	int return_value = 0;

	if (utf8_text == NULL) {
		return_value = 2;
		goto early_exit;
	}
	ztd_char8_t* p_output = utf8_text;
	cnc_mcstate_t state   = { 0 };

	// Now, actually output it
	const ztd_char16_t* p_input = utf16_text;
	// ztdc_c_string_array_size does NOT include the null terminator!
	size_t input_size  = ztdc_c_string_array_size(utf16_text);
	size_t output_size = output_size_needed;
	cnc_mcerr err      = cnc_c16snrtoc8sn(
          // output first
          &output_size, &p_output,
          // input second
          &input_size, &p_input,
          // state parameter
          &state);
	const size_t input_consumed
	     = ztdc_c_string_array_size(utf16_text) - input_size;
	const size_t output_written  = output_size_needed - output_size;
	const bool conversion_failed = err != cnc_mcerr_ok;
	if (conversion_failed) {
		// get error string to describe error code
		const char* err_str = cnc_mcerr_to_str(err);
		fprintf(stderr,
		     "An (unexpected) error occurred and the conversion could not "
		     "happen! The error occurred at UTF-16 input element #%zu, and only "
		     "managed to output %zu UTF-8 elements. Error string: %s (code: "
		     "'%d')\n",
		     input_consumed, output_written, err_str, (int)err);
		return_value = 3;
		goto early_exit;
	}
	// requires a capable terminal / output, but will be
	// UTF-8 text!
	printf("Converted UTF-8 text:\n");
	fwrite(utf8_text, sizeof(ztd_char8_t), output_written, stdout);
	printf("\n");

early_exit:
	if (utf8_text != NULL)
		free(utf8_text);

	return return_value;
}

Here, we find out the output_size_needed by taking the size before the call (SIZE_MAX) and subtracting the decremented size left over after the function call. Then, after checking for errors, we do the actual conversion with a buffer of exactly that size. The buffer does not include a null terminator, so the result is printed to a (UTF-8 capable) terminal using fwrite with an explicit length. Finally, after completing our task, we free the memory and return a proper return code.
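
To see what kind of number a counting pass produces for this input, here is a hand-written sketch in plain C that computes how many UTF-8 code units a UTF-16 sequence needs. It is an illustration only: count_utf8_for_utf16 is a hypothetical function, not a ztd.cuneicode API, and it says nothing about how the library counts internally. The example text 🥺🙏 is two surrogate pairs (four UTF-16 code units), and every surrogate pair needs four UTF-8 code units, for a total of eight.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical hand-written counter (not a ztd.cuneicode API): computes
// how many UTF-8 code units are needed to hold `size` UTF-16 code units.
// Returns false if a surrogate appears without its partner.
static bool count_utf8_for_utf16(
     const uint16_t* input, size_t size, size_t* p_count) {
	size_t count = 0;
	for (size_t i = 0; i < size; ++i) {
		const uint16_t unit = input[i];
		if (unit < 0x80) {
			count += 1; // ASCII range: 1 code unit
		}
		else if (unit < 0x800) {
			count += 2;
		}
		else if (unit >= 0xD800 && unit <= 0xDBFF) {
			// high surrogate: must be followed by a low surrogate
			if (i + 1 >= size || input[i + 1] < 0xDC00
			     || input[i + 1] > 0xDFFF) {
				return false;
			}
			count += 4; // a surrogate pair always needs 4 code units
			++i;        // skip the low surrogate we just checked
		}
		else if (unit >= 0xDC00 && unit <= 0xDFFF) {
			return false; // low surrogate without a preceding high one
		}
		else {
			count += 3; // rest of the Basic Multilingual Plane
		}
	}
	*p_count = count;
	return true;
}
```

Counting first costs an extra pass over the input, but it means the allocation is exact and the second pass cannot run out of space.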

Unbounded Output Writing

Sometimes, it is known ahead of time that there is enough space in a given buffer for a given conversion result, because the inputs are not at all associated with user input or anything user-facing (e.g., static storage duration string literals with known sizes and elements). If that is the case, then a NULL value can be passed in for the output size argument, and the function will assume that there is enough space for writing:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

	const ztd_char16_t utf16_text[] = u"🥺🙏";

	const ztd_char16_t* p_count_input = utf16_text;
	// This size does NOT include the null terminating character.
	size_t count_input_size   = ztdc_c_string_array_size(utf16_text);
	cnc_mcstate_t count_state = { 0 };
	size_t output_size_after  = SIZE_MAX;
	// Use the function but with NULL for the output pointer
	cnc_mcerr count_err = cnc_c16snrtoc8sn(
	     // To get the proper size for this conversion, we use the same
	     // function but with a "NULL" output pointer:
	     &output_size_after, NULL,
	     // input second
	     &count_input_size, &p_count_input,
	     // state parameter
	     &count_state);
	// Compute the needed space:
	const size_t output_size_needed = SIZE_MAX - output_size_after;
	if (count_err != cnc_mcerr_ok) {
		const char* err_str = cnc_mcerr_to_str(count_err);
		fprintf(stderr,
		     "An (unexpected) error occurred and the counting could not "
		     "happen! Error string: %s (code: '%d')\n",
		     err_str, (int)count_err);
		return 1;
	}

	ztd_char8_t* utf8_text = malloc(output_size_needed * sizeof(ztd_char8_t));

	// prepare for potential error return and error handling
	int return_value = 0;

	if (utf8_text == NULL) {
		return_value = 2;
		goto early_exit;
	}
	ztd_char8_t* p_output = utf8_text;
	cnc_mcstate_t state   = { 0 };

	// Now, actually output it
	const ztd_char16_t* p_input = utf16_text;
	// ztdc_c_string_array_size does NOT include the null terminator!
	size_t input_size = ztdc_c_string_array_size(utf16_text);
	cnc_mcerr err     = cnc_c16snrtoc8sn(
          // output first: a NULL size means "assume there is enough space"
          NULL, &p_output,
          // input second
          &input_size, &p_input,
          // state parameter
          &state);
	const size_t input_consumed
	     = ztdc_c_string_array_size(utf16_text) - input_size;
	// no size was passed in, so measure how far the output pointer moved
	const size_t output_written  = (size_t)(p_output - utf8_text);
	const bool conversion_failed = err != cnc_mcerr_ok;
	if (conversion_failed) {
		// get error string to describe error code
		const char* err_str = cnc_mcerr_to_str(err);
		fprintf(stderr,
		     "An (unexpected) error occurred and the conversion could not "
		     "happen! The error occurred at UTF-16 input element #%zu, and only "
		     "managed to output %zu UTF-8 elements. Error string: %s (code: "
		     "'%d')\n",
		     input_consumed, output_written, err_str, (int)err);
		return_value = 3;
		goto early_exit;
	}
	// requires a capable terminal / output, but will be
	// UTF-8 text!
	printf("Converted UTF-8 text:\n");
	fwrite(utf8_text, sizeof(ztd_char8_t), output_written, stdout);
	printf("\n");

early_exit:
	if (utf8_text != NULL)
		free(utf8_text);

	return return_value;
}

This can be useful for performance-oriented scenarios, where writing without doing any bounds checking may result in significantly improved speed.

Validating

Validation is similar to counting, except that both the output pointer and the output size arguments are NULL. This effectively allows someone to check not only whether the input is valid for its encoding, but also whether it can be transcoded to the output encoding, assuming enough space.

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

	const ztd_char16_t utf16_text[] = u"🥺🙏";

	const ztd_char16_t* count_input_ptr = utf16_text;
	// ztdc_c_array_size INCLUDES the null terminator in the size!
	const size_t initial_count_input_size = ztdc_c_array_size(utf16_text);
	size_t count_input_size               = initial_count_input_size;
	cnc_mcstate_t count_state             = { 0 };
	// Use the function but with NULL for both the output size and pointer
	cnc_mcerr err = cnc_c16snrtoc8sn(
	     // To only validate the input, we use the same function but with
	     // "NULL" for both output arguments:
	     NULL, NULL,
	     // input second
	     &count_input_size, &count_input_ptr,
	     // state parameter
	     &count_state);
	size_t input_read = (size_t)(initial_count_input_size - count_input_size);
	if (err != cnc_mcerr_ok) {
		const char* err_str = cnc_mcerr_to_str(err);
		fprintf(stderr,
		     "An (unexpected) error occurred and the counting/validating could "
		     "not happen!\nThe error happened at code unit %zu in the UTF-16 "
		     "input.\nError string: %s (code: '%d')\n",
		     input_read, err_str, (int)err);
		// do not fall through to the "valid" message below
		return 1;
	}

	printf(
	     "The input UTF-16 is valid and consumed all %zu code units (elements) "
	     "of input.\n",
	     input_read);

	return 0;
}

In many instances, simply validating that text can be converted rather than attempting the conversion can provide a far greater degree of speed using specialized algorithms and instruction sets.

Registry-Based Conversions

Conversion registries in cuneicode provide a way to obtain potentially runtime-defined encodings. A user can add encodings to and remove encodings from a registry, and all access to data (save for those which are defined to access global state, such as the char/execution and wchar_t/wide execution encodings) goes straight through the objects created and involved, and should involve no global, mutable state. This should enable users to create, use, and pass around registry objects freely without the burden of pre-allocated or statically-shared state, resulting in programs that are easier to reason about. Here is an example of converting between the Windows-1251 encoding and the UTF-8 encoding by passing both names to `cnc_conv_new(…)`:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

	cnc_conversion_registry* registry = NULL;
	{
		cnc_open_err err
		     = cnc_registry_new(&registry, cnc_registry_options_default);
		if (err != cnc_open_err_ok) {
			const char* err_str = cnc_open_err_to_str(err);
			fprintf(stderr,
			     "An unexpected error has occurred: '%s' (code: '%d')", err_str,
			     (int)err);
			return 1;
		}
	}

	// Now that we've allocated, have a return value
	// just in case
	int return_value     = 0;
	cnc_conversion* conv = NULL;
	{
		cnc_conversion_info conv_info = { 0 };
		cnc_open_err err              = cnc_conv_new(
               registry, "windows-1251", "utf-8", &conv, &conv_info);
		if (err != cnc_open_err_ok) {
			const char* err_str = cnc_open_err_to_str(err);
			fprintf(stderr, "An unexpected error has occurred: %s (code: '%d')",
			     err_str, (int)err);
			return_value = 2;
			goto early_exit0;
		}
		// the conversion info structure can tell us about things
		printf(
		     "Successfully opened a registry conversion between %.*s and "
		     "%.*s!\n",
		     (int)conv_info.from_code_size,
		     (const char*)conv_info.from_code_data, (int)conv_info.to_code_size,
		     (const char*)conv_info.to_code_data);
		if (conv_info.is_indirect) {
			// the strings used for this printf are UTF-8 encoded, but we
			// know the names are ASCII-compatible, charset-invariant strings
			// thanks to the request above, so we don't do the special
			// printing method.
			printf(
			     "(It is an indirect conversion, going from %.*s to %.*s, "
			     "then %.*s to %.*s for the conversion.)\n",
			     (int)conv_info.from_code_size,
			     (const char*)conv_info.from_code_data,
			     (int)conv_info.indirect_code_size,
			     (const char*)conv_info.indirect_code_data,
			     (int)conv_info.indirect_code_size,
			     (const char*)conv_info.indirect_code_data,
			     (int)conv_info.to_code_size,
			     (const char*)conv_info.to_code_data);
		}
		else {
			printf(
			     "(The conversion is a direct conversion and does not take an "
			     "intermediate conversion path.)\n");
		}
	}

	const char input[]
	     = "\xd1\xeb\xe0\xe2\xe0\x20\xd3\xea\xf0\xe0\xbf\xed\xb3\x21\x0a";

	const unsigned char* count_input_last = (const unsigned char*)input;
	size_t count_input_byte_size_leftover = ztdc_c_array_size(input);
	size_t count_output_byte_size         = SIZE_MAX;

	const cnc_mcerr count_err = cnc_conv(conv, &count_output_byte_size, NULL,
	     &count_input_byte_size_leftover, &count_input_last);
	const size_t output_byte_size_needed = SIZE_MAX - count_output_byte_size;
	const size_t count_input_byte_size_consumed
	     = ztdc_c_array_size(input) - count_input_byte_size_leftover;
	if (count_err != cnc_mcerr_ok) {
		const char* err_str = cnc_mcerr_to_str(count_err);
		fprintf(stderr,
		     "The counting step failed with the error %s (code: "
		     "'%d') at byte #%zu in the input (which presently needs %zu bytes "
		     "of output space)",
		     err_str, (int)count_err, count_input_byte_size_consumed,
		     output_byte_size_needed);
		return_value = 3;
		goto early_exit1;
	}

	const unsigned char* input_last = (const unsigned char*)input;
	size_t input_size               = ztdc_c_array_size(input);
	// not strictly necessary to multiply by unsigned char since it's
	// defined to be 1, but it's consistent with other places where
	// malloc gets used...
	unsigned char* output
	     = malloc(output_byte_size_needed * sizeof(unsigned char));
	if (output == NULL) {
		return_value = 4;
		goto early_exit2;
	}
	unsigned char* output_last       = output;
	size_t output_byte_size_leftover = output_byte_size_needed;
	cnc_mcerr err = cnc_conv(conv, &output_byte_size_leftover, &output_last,
	     &input_size, &input_last);
	const size_t output_byte_size_written
	     = output_byte_size_needed - output_byte_size_leftover;
	const size_t input_byte_size_consumed
	     = ztdc_c_array_size(input) - input_size;
	if (err != cnc_mcerr_ok) {
		const char* err_str = cnc_mcerr_to_str(err);
		fprintf(stderr,
		     "The conversion step failed to convert with the error %s "
		     "(code: '%d') at byte #%zu in the input after writing as far as "
		     "byte #%zu in the output",
		     err_str, (int)err, input_byte_size_consumed,
		     output_byte_size_written);
		return_value = 5;
		goto early_exit2;
	}

	printf(
	     "The registry conversion was successful, writing %zu output bytes after "
	     "reading %zu input bytes.\nThe output is:\n",
	     output_byte_size_written, input_byte_size_consumed);
	// It's UTF-8: this should print correctly on a UTF-8 capable terminal.
	// We do not mill through `printf` because it can do a (potentially lossy)
	// conversion.
	fwrite((const char*)output, sizeof(char), output_byte_size_written, stdout);
	printf("\n");

early_exit2:
	if (output != NULL) {
		free(output);
	}
early_exit1:
	if (conv != NULL) {
		cnc_conv_delete(conv);
	}
early_exit0:
	if (registry != NULL) {
		cnc_registry_delete(registry);
	}
	return return_value;
}

Care must be taken that, once one of these types is allocated, it is also deallocated (here, with cnc_conv_delete and cnc_registry_delete). A large amount of additional registry functionality is described in the registry design documentation and the registry API documentation, including an example of registering an encoding which contains its own state and must be stored in a cnc_conversion* handle.

Most important in this short example is that there is no direct conversion between Windows-1251 and UTF-8 in the default offerings of cuneicode. Instead, the registry knows how to negotiate a pathway from the registered Windows-1251 encoding (which goes from itself to UTF-32 and back) to UTF-8 (which goes from itself to UTF-32 and back). This automatic handling of indirection is provided by default and is described in the registry design for indirections.
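
The idea of pivoting through UTF-32 can be sketched by hand in plain C. The code below is only an illustration of the two-step pathway and is not how cuneicode implements it: all three functions are hypothetical, and the Windows-1251 decoder is deliberately partial, covering only ASCII, the contiguous 0xC0 to 0xFF Cyrillic block, and the two extra letters (0xB3 and 0xBF) that appear in the sample input above. Every other byte is mapped to U+FFFD as a simplifying assumption.

```c
#include <stddef.h>
#include <stdint.h>

// Step 1 (hypothetical, partial decoder): one Windows-1251 byte to one
// UTF-32 code point. Unhandled bytes become U+FFFD.
static uint32_t windows_1251_to_utf32(unsigned char byte) {
	if (byte < 0x80) {
		return byte; // ASCII-compatible range
	}
	if (byte >= 0xC0) {
		return 0x0410u + (uint32_t)(byte - 0xC0); // U+0410 to U+044F
	}
	if (byte == 0xB3) {
		return 0x0456u; // CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
	}
	if (byte == 0xBF) {
		return 0x0457u; // CYRILLIC SMALL LETTER YI
	}
	return 0xFFFDu;
}

// Step 2: one UTF-32 code point to UTF-8. Returns the number of bytes
// written (1 to 4), or 0 if the code point is not a Unicode scalar value.
static size_t utf32_to_utf8(uint32_t cp, unsigned char* out) {
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF) {
		return 0; // surrogates are not scalar values
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

// The indirect conversion: chain both steps through the UTF-32 pivot.
// `output` must hold at least 3 * input_size bytes, since this decoder
// never produces a code point above U+FFFD. Returns the bytes written.
static size_t windows_1251_to_utf8(
     const unsigned char* input, size_t input_size, unsigned char* output) {
	size_t written = 0;
	for (size_t i = 0; i < input_size; ++i) {
		const uint32_t cp = windows_1251_to_utf32(input[i]);
		written += utf32_to_utf8(cp, output + written);
	}
	return written;
}
```

With a shared pivot, each encoding only has to know how to get to and from UTF-32, instead of needing a dedicated routine for every pair of encodings.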