Quick ‘n’ Dirty Tutorial
Setup
Use of this library is officially supported through CMake. Getting an up-to-date CMake can be difficult on non-Windows machines, especially when it comes from your system’s package manager, whose distribution tends to be several (dozen?) minor revisions, or even an entire major revision, behind. To get a very nearly up-to-date CMake, use the version distributed as a Python package, which works across all systems. You can get it (and the ninja build system) by running the following command in your favorite command line application (assuming Python is already installed):
python -m pip install --user --upgrade cmake ninja
If you depend on calling these executables using shorthand and not their full path, make sure that the Python “downloaded binaries” folder is contained in the PATH environment variable. Usually this is already done, but if you have trouble invoking cmake --version on your typical command line, please see the Python pip install documentation for more information, in particular about the --user option.
If you do not have Python or CMake/ninja, you must get a recent enough version of CMake directly from CMake, build/install it, and have a suitable build system around for CMake to pick up on (MSBuild from installing Visual Studio, make in most GNU distributions or MinGW on Windows on your PATH environment variable, and/or a personal installation of ninja).
Using CMake
Here’s a sample of the CMakeLists.txt to create a new project and pull in ztd.cuneicode in the simplest possible way:
project(my_app
    VERSION 1.0.0
    DESCRIPTION "My application."
    HOMEPAGE_URL "https://ztdcuneicode.readthedocs.io/en/latest/quick.html"
    LANGUAGES C
)

include(FetchContent)

FetchContent_Declare(ztd.cuneicode
    GIT_REPOSITORY https://github.com/soasis/cuneicode.git
    GIT_SHALLOW ON
    GIT_TAG main)
FetchContent_MakeAvailable(ztd.cuneicode)
This will automatically download and set up all the dependencies ztd.cuneicode needs (in this case, simply ztd.cmake, ztd.platform, and ztd.idk). You can override how ztd.cuneicode gets these dependencies using the standard FetchContent mechanisms described in the CMake FetchContent documentation. When done configuring, simply use CMake’s target_link_libraries(…) to add it to the code:
# …

file(GLOB_RECURSE my_app_sources
    LIST_DIRECTORIES OFF
    CONFIGURE_DEPENDS
    source/*.c
)

add_executable(my_app ${my_app_sources})

target_link_libraries(my_app PRIVATE ztd::cuneicode)
Once you have everything configured and set up the way you like, you can then use ztd.cuneicode in your code, as shown below:
#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        fprintf(stderr, "A name argument must be given to the program!");
        return 1;
    }

    const char* name = argv[1];
    const size_t name_len = strlen(name);

    char utf8_name_buffer[4096] = { 0 };
    const size_t utf8_name_max_len = ztdc_c_string_array_size(utf8_name_buffer);
    char* utf8_name = utf8_name_buffer;
    // worst case: every input element expands to CNC_C8_MAX output elements
    const size_t name_len_limit = (name_len * CNC_C8_MAX);

    if (name_len_limit >= utf8_name_max_len) {
        fprintf(stderr,
            "The name provided to this program was, unfortunately, too big!");
        return 2;
    }

    size_t utf8_name_len_after = utf8_name_max_len;
    size_t name_len_after = name_len;
    cnc_mcerr err = cnc_mcsntomcsn_exec_utf8(
        &utf8_name_len_after, &utf8_name, &name_len_after, &name);
    // both sizes are decremented by the call, so the differences give progress
    const size_t input_consumed = name_len - name_len_after;
    const size_t output_written = utf8_name_max_len - utf8_name_len_after;
    if (err != cnc_mcerr_ok) {
        const char* err_str = cnc_mcerr_to_str(err);
        fprintf(stderr,
            "An error occurred when attempting to transcribe your name from "
            "the execution encoding to UTF-8, stopping at input element %zu "
            "and only writing %zu elements "
            "(error name: %s).",
            input_consumed, output_written, err_str);
        return 3;
    }

    printf("Hello there, ");
    fwrite(utf8_name_buffer, 1, output_written, stdout);
    printf("!\n");

    return 0;
}
Let’s get started by digging into some examples!
Note
If you would like to see more examples and additional changes besides what is covered below, please do feel free to make requests for them here! This is not a very thorough tutorial, and there is a lot of functionality that still needs explanation!
Simple Conversions
Simple conversions are provided for UTF-8, UTF-16, UTF-32, execution encoding, and wide execution encoding. They allow an end-user to use bit-based types.
To convert from UTF-16 to UTF-8, use the appropriately c8- and c16-marked free function in the library:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

    const ztd_char16_t utf16_text[] = u"🥺🙏";
    ztd_char8_t utf8_text[9] = { 0 };

    // Now, actually output it
    const ztd_char16_t* p_input = utf16_text;
    ztd_char8_t* p_output = utf8_text;
    // This size does NOT include the null terminating character.
    size_t input_size = ztdc_c_string_array_size(utf16_text);
    size_t output_size = ztdc_c_array_size(utf8_text);
    cnc_mcstate_t state = { 0 };
    // call the function with the right parameters!
    cnc_mcerr err = cnc_c16snrtoc8sn( // formatting
        &output_size, &p_output,      // output first
        &input_size, &p_input,        // input second
        &state);                      // state parameter
    // subtract from the same initial sizes that were passed in above
    const size_t input_consumed
        = (ztdc_c_string_array_size(utf16_text) - input_size);
    const size_t output_written = (ztdc_c_array_size(utf8_text) - output_size);
    if (err != cnc_mcerr_ok) {
        const char* err_str = cnc_mcerr_to_str(err);
        fprintf(stderr,
            "An (unexpected) error occurred and the conversion could not "
            "happen! Error string: %s (code: '%d')\n",
            err_str, (int)err);
        return 1;
    }

    printf(
        "Converted %zu UTF-16 code units to %zu UTF-8 code units, giving the "
        "text:",
        input_consumed, output_written);
    // requires a capable terminal / output, but will be
    // UTF-8 text!
    fwrite(utf8_text, sizeof(ztd_char8_t), output_written, stdout);
    printf("\n");

    return 0;
}
We use raw fwrite to print the UTF-8 text. It may not appear correctly on a terminal whose encoding is not UTF-8, which may be the case for older Microsoft terminals, some Linux kernel configurations, and deliberately misconfigured Mac OS X terminals. There are also some other properties that can be gained from the use of the function:
- the amount of data read (using initial_input_size - input_size);
- the amount of data written out (using initial_output_size - output_size);
- a pointer to any extra input after the operation (p_input);
- and, a pointer to any extra output that was not written to after the operation (p_output).
One can convert from other forms of UTF-8/16/32 encodings, and from the execution and wide execution encodings (the encodings used by default for const char[] and const wchar_t[] strings), using the various prefix-based functions.
Counting
More often than not, the exact amount of input and output might not be known beforehand. Therefore, it may be useful to count how many elements of output would be required before allocating exactly that much space to hold the result. In this case, simply passing NULL for the data output pointer instead of a real pointer, while providing a non-NULL pointer for the output size argument, will give an accurate reading of the amount of space that is necessary:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

    const ztd_char16_t utf16_text[] = u"🥺🙏";

    const ztd_char16_t* p_count_input = utf16_text;
    // This size does NOT include the null terminating character.
    size_t count_input_size = ztdc_c_string_array_size(utf16_text);
    cnc_mcstate_t count_state = { 0 };
    size_t output_size_after = SIZE_MAX;
    // Use the function but with NULL for the output pointer
    cnc_mcerr count_err = cnc_c16snrtoc8sn(
        // To get the proper size for this conversion, we use the same
        // function but with a NULL output pointer:
        &output_size_after, NULL,
        // input second
        &count_input_size, &p_count_input,
        // state parameter
        &count_state);
    // Compute the needed space:
    const size_t output_size_needed = SIZE_MAX - output_size_after;
    if (count_err != cnc_mcerr_ok) {
        const char* err_str = cnc_mcerr_to_str(count_err);
        fprintf(stderr,
            "An (unexpected) error occurred and the counting could not "
            "happen! Error string: %s (code: '%d')\n",
            err_str, (int)count_err);
        return 1;
    }

    ztd_char8_t* utf8_text = malloc(output_size_needed * sizeof(ztd_char8_t));

    // prepare for potential error return and error handling
    int return_value = 0;

    if (utf8_text == NULL) {
        return_value = 2;
        goto early_exit;
    }
    ztd_char8_t* p_output = utf8_text;
    cnc_mcstate_t state = { 0 };

    // Now, actually output it
    const ztd_char16_t* p_input = utf16_text;
    // ztdc_c_string_array_size EXCLUDES the null terminator from the size!
    size_t input_size = ztdc_c_string_array_size(utf16_text);
    size_t output_size = output_size_needed;
    cnc_mcerr err = cnc_c16snrtoc8sn(
        // output first
        &output_size, &p_output,
        // input second
        &input_size, &p_input,
        // state parameter
        &state);
    const size_t input_consumed
        = ztdc_c_string_array_size(utf16_text) - input_size;
    const size_t output_written = output_size_needed - output_size;
    const bool conversion_failed = err != cnc_mcerr_ok;
    if (conversion_failed) {
        // get error string to describe error code
        const char* err_str = cnc_mcerr_to_str(err);
        fprintf(stderr,
            "An (unexpected) error occurred and the conversion could not "
            "happen! The error occurred at UTF-16 input element #%zu, and only "
            "managed to output %zu UTF-8 elements. Error string: %s (code: "
            "'%d')\n",
            input_consumed, output_written, err_str, (int)err);
        return_value = 3;
        goto early_exit;
    }
    // requires a capable terminal / output, but will be
    // UTF-8 text!
    printf("Converted UTF-8 text:\n");
    fwrite(utf8_text, sizeof(ztd_char8_t), output_written, stdout);
    printf("\n");

early_exit:
    if (utf8_text != NULL)
        free(utf8_text);

    return return_value;
}
Here, we find the output_size_needed by taking the size before the call and subtracting the decremented size from after the function call. Then, after checking for errors, we do the actual conversion with a properly sized buffer. The buffer holds no null terminator because the input size excludes it, which is fine here since fwrite is given an explicit length when printing to a (UTF-8 capable) terminal. Finally, after completing our task, we free the memory and return a proper exit code.
Unbounded Output Writing
Sometimes, it is known ahead of time that there is enough space in a given buffer for a given conversion result because the inputs are not at all associated with user input or anything user-facing (e.g., static storage duration string literals with known sizes and elements). If that is the case, then a NULL value can be passed in for the output size argument, and the function will assume that there is enough space for writing:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

    const ztd_char16_t utf16_text[] = u"🥺🙏";
    // The literal above is 2 surrogate pairs, which become 8 UTF-8 code
    // units: we know ahead of time that this buffer is large enough.
    ztd_char8_t utf8_text[9] = { 0 };

    const ztd_char16_t* p_input = utf16_text;
    ztd_char8_t* p_output = utf8_text;
    // This size does NOT include the null terminating character.
    size_t input_size = ztdc_c_string_array_size(utf16_text);
    cnc_mcstate_t state = { 0 };
    cnc_mcerr err = cnc_c16snrtoc8sn(
        // output first: a NULL size means "assume there is enough space",
        // so no bounds checking is done on the writes
        NULL, &p_output,
        // input second
        &input_size, &p_input,
        // state parameter
        &state);
    const size_t input_consumed
        = ztdc_c_string_array_size(utf16_text) - input_size;
    // there is no output size to subtract from, so use the advanced pointer
    const size_t output_written = (size_t)(p_output - utf8_text);
    if (err != cnc_mcerr_ok) {
        // get error string to describe error code
        const char* err_str = cnc_mcerr_to_str(err);
        fprintf(stderr,
            "An (unexpected) error occurred and the conversion could not "
            "happen! The error occurred at UTF-16 input element #%zu, and only "
            "managed to output %zu UTF-8 elements. Error string: %s (code: "
            "'%d')\n",
            input_consumed, output_written, err_str, (int)err);
        return 1;
    }
    // requires a capable terminal / output, but will be
    // UTF-8 text!
    printf("Converted UTF-8 text:\n");
    fwrite(utf8_text, sizeof(ztd_char8_t), output_written, stdout);
    printf("\n");

    return 0;
}
This can be useful for performance-oriented scenarios, where writing without doing any bounds checking may result in significantly improved speed.
Validating
Validation is similar to counting, except that the output size argument is also NULL. This effectively allows someone to check not only whether the input is valid for that encoding, but also whether it can be transcoded to the output, assuming enough space.

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

    const ztd_char16_t utf16_text[] = u"🥺🙏";

    const ztd_char16_t* count_input_ptr = utf16_text;
    // ztdc_c_array_size INCLUDES the null terminator in the size!
    const size_t initial_count_input_size = ztdc_c_array_size(utf16_text);
    size_t count_input_size = initial_count_input_size;
    cnc_mcstate_t count_state = { 0 };
    // Use the function but with NULL for both output arguments
    cnc_mcerr err = cnc_c16snrtoc8sn(
        // To only validate, we use the same function but with NULL for
        // the output size and the output pointer:
        NULL, NULL,
        // input second
        &count_input_size, &count_input_ptr,
        // state parameter
        &count_state);
    size_t input_read = (size_t)(initial_count_input_size - count_input_size);
    if (err != cnc_mcerr_ok) {
        const char* err_str = cnc_mcerr_to_str(err);
        fprintf(stderr,
            "An (unexpected) error occurred and the counting/validating could "
            "not happen!\nThe error happened at code unit %zu in the UTF-16 "
            "input.\nError string: %s (code: '%d')\n",
            input_read, err_str, (int)err);
        // do not fall through to the success message below
        return 1;
    }

    printf(
        "The input UTF-16 is valid and consumed all %zu code units (elements) "
        "of input.\n",
        input_read);

    return 0;
}
In many instances, simply validating that text can be converted rather than attempting the conversion can provide a far greater degree of speed using specialized algorithms and instruction sets.
Registry-Based Conversions
Conversion registries in cuneicode provide a way to obtain potentially runtime-defined encodings. A registry can be added to and removed from by a user, and all access to data (save for encodings which are defined to access global state, such as the char/execution and wchar_t/wide execution encodings) is referenced straight from the objects created and involved and should involve no global, mutable state. This should enable users to create, use, and pass around registry objects freely without the burden of pre-allocated or statically-shared state, resulting in programs that are easier to reason about. Here is an example of converting between the Windows-1251 encoding and the UTF-8 encoding by passing both names to cnc_conv_new(…):

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main() {

    cnc_conversion_registry* registry = NULL;
    {
        cnc_open_err err
            = cnc_registry_new(&registry, cnc_registry_options_default);
        if (err != cnc_open_err_ok) {
            const char* err_str = cnc_open_err_to_str(err);
            fprintf(stderr,
                "An unexpected error has occurred: '%s' (code: '%d')", err_str,
                (int)err);
            return 1;
        }
    }

    // Now that we've allocated, have a return value
    // just in case
    int return_value = 0;
    cnc_conversion* conv = NULL;
    {
        cnc_conversion_info conv_info = { 0 };
        cnc_open_err err = cnc_conv_new(
            registry, "windows-1251", "utf-8", &conv, &conv_info);
        if (err != cnc_open_err_ok) {
            const char* err_str = cnc_open_err_to_str(err);
            fprintf(stderr, "An unexpected error has occurred: %s (code: '%d')",
                err_str, (int)err);
            return_value = 2;
            goto early_exit0;
        }
        // the conversion info structure can tell us about things
        printf(
            "Successfully opened a registry conversion between %.*s and "
            "%.*s!\n",
            (int)conv_info.from_code_size,
            (const char*)conv_info.from_code_data, (int)conv_info.to_code_size,
            (const char*)conv_info.to_code_data);
        if (conv_info.is_indirect) {
            // the strings used for this printf are UTF-8 encoded, but we
            // know the names are ASCII-compatible, charset-invariant strings
            // thanks to the request above, so we don't do the special
            // printing method.
            printf(
                "(It is an indirect conversion, going from %.*s to %.*s, "
                "then %.*s to %.*s for the conversion.)\n",
                (int)conv_info.from_code_size,
                (const char*)conv_info.from_code_data,
                (int)conv_info.indirect_code_size,
                (const char*)conv_info.indirect_code_data,
                (int)conv_info.indirect_code_size,
                (const char*)conv_info.indirect_code_data,
                (int)conv_info.to_code_size,
                (const char*)conv_info.to_code_data);
        }
        else {
            printf(
                "(The conversion is a direct conversion and does not take an "
                "intermediate conversion path.)\n");
        }
    }

    const char input[]
        = "\xd1\xeb\xe0\xe2\xe0\x20\xd3\xea\xf0\xe0\xbf\xed\xb3\x21\x0a";

    // First pass: count the output space needed (NULL output pointer)
    const unsigned char* count_input_last = (const unsigned char*)input;
    size_t count_input_byte_size_leftover = ztdc_c_array_size(input);
    size_t count_output_byte_size = SIZE_MAX;

    const cnc_mcerr count_err = cnc_conv(conv, &count_output_byte_size, NULL,
        &count_input_byte_size_leftover, &count_input_last);
    const size_t output_byte_size_needed = SIZE_MAX - count_output_byte_size;
    const size_t count_input_byte_size_consumed
        = ztdc_c_array_size(input) - count_input_byte_size_leftover;
    if (count_err != cnc_mcerr_ok) {
        const char* err_str = cnc_mcerr_to_str(count_err);
        fprintf(stderr,
            "The counting step failed with the error %s (code: "
            "'%d') at byte #%zu in the input (which presently needs %zu bytes "
            "of output space)",
            err_str, (int)count_err, count_input_byte_size_consumed,
            output_byte_size_needed);
        return_value = 3;
        goto early_exit1;
    }

    // Second pass: convert into a buffer of exactly the counted size
    const unsigned char* input_last = (const unsigned char*)input;
    size_t input_size = ztdc_c_array_size(input);
    // not strictly necessary to multiply by unsigned char since it's
    // defined to be 1, but it's consistent with other places where
    // malloc gets used...
    unsigned char* output
        = malloc(output_byte_size_needed * sizeof(unsigned char));
    if (output == NULL) {
        return_value = 4;
        goto early_exit2;
    }
    unsigned char* output_last = output;
    size_t output_byte_size_leftover = output_byte_size_needed;
    cnc_mcerr err = cnc_conv(conv, &output_byte_size_leftover, &output_last,
        &input_size, &input_last);
    const size_t output_byte_size_written
        = output_byte_size_needed - output_byte_size_leftover;
    // use the leftover size from THIS call, not from the counting call
    const size_t input_byte_size_consumed
        = ztdc_c_array_size(input) - input_size;
    if (err != cnc_mcerr_ok) {
        const char* err_str = cnc_mcerr_to_str(err);
        fprintf(stderr,
            "The conversion step failed to convert with the error %s "
            "(code: '%d') at byte #%zu in the input after writing as far as "
            "byte #%zu in the output",
            err_str, (int)err, input_byte_size_consumed,
            output_byte_size_written);
        return_value = 5;
        goto early_exit2;
    }

    printf(
        "The registry conversion was successful, writing %zu output bytes after "
        "reading %zu input bytes.\nThe output is:\n",
        output_byte_size_written, input_byte_size_consumed);
    // It's UTF-8: this should print correctly on a UTF-8 capable terminal.
    // We do not mill through `printf` because it can do a (potentially lossy)
    // conversion.
    fwrite((const char*)output, sizeof(char), output_byte_size_written, stdout);
    printf("\n");

early_exit2:
    if (output != NULL) {
        free(output);
    }
early_exit1:
    if (conv != NULL) {
        cnc_conv_delete(conv);
    }
early_exit0:
    if (registry != NULL) {
        cnc_registry_delete(registry);
    }
    return return_value;
}
Care must be taken that, upon allocating one of these types, it is also properly deallocated (as done above with cnc_conv_delete and cnc_registry_delete). A large amount of additional registry functionality is described in the registry design documentation and the registry API documentation, including an example of registering an encoding which contains its own state and must be stored in a cnc_conversion* handle.
Most important in this short example is that there is no direct conversion between Windows-1251 and UTF-8 in the default offerings of cuneicode. Instead, the registry knows how to negotiate a pathway from the registered Windows-1251 encoding (which goes from itself to UTF-32 and back) to UTF-8 (which goes from itself to UTF-32 and back). This automatic handling of indirection is provided by default and is described in the registry design for indirections.