Planted Seeds - Unicode Transcoding in 2021
A report on our efforts to fix Unicode Transcoding functionality at the bottom of the tech stack.
My name is Shepherd, and I have recently taken an interest in helping out with ThePhD’s Unicode work for C++. The end goal of being able to transcode out of the system encoding, among other shenanigans, has not yet been fully realized in a standards document. However, I am pleased to say we have made significant advancements that should make 2021 a successful year for Unicode capabilities in C and C++.
Planting the Seeds
2020 has been a hard year, by any measure. Still, we have been focusing on providing a lot of great Unicode utilities, and have thus planted a lot of seeds that will hopefully blossom into full support for transcoding capabilities. This report does not cover things like bidirectional capabilities, segmentation algorithms, regexen, and other higher-level Unicode facilities. We are strictly focusing on the seeds for the lowest level of functionality, which will enable all developers to at least talk to these Unicode algorithms in the future.
Seed 0: Transcoding Support in C
The C Committee prioritizing a new design for transcoding function calls is the biggest win of 2020. It is a tacit admission that the old design failed and a general acceptance that we can fix these problems by providing new functionality. Making these functions available wherever a conforming C standard library exists is an enormous boon, giving people the tools to properly handle text. Mojibake is present everywhere, from cash register receipts to postal services. By allowing people to go from their execution charset / locale encoding to UTF-8, UTF-16, and UTF-32, we can provide a baseline way out for all application developers interfacing with recent systems.
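For context, here is a rough sketch of what leaving the execution encoding looks like today with the existing C11-era function std::mbrtoc32. This is not the new design discussed above. The helper name execution_to_utf32 is made up for illustration, and the sketch assumes the implementation's char32_t values are UTF-32 code points (C signals this with __STDC_UTF_32__).

```cpp
#include <cstddef>
#include <cuchar>   // std::mbrtoc32, the existing C11-era conversion function
#include <cwchar>   // std::mbstate_t
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not part of any proposal): converts text in the
// current locale's multibyte encoding to UTF-32, one code point at a time.
// Returns std::nullopt if the input is invalid or ends mid-character.
std::optional<std::u32string> execution_to_utf32(std::string_view input) {
    std::u32string output;
    std::mbstate_t state{}; // conversion state threaded through every call
    const char* it = input.data();
    const char* last = it + input.size();
    while (it != last) {
        char32_t code_point = 0;
        const std::size_t result = std::mbrtoc32(
            &code_point, it, static_cast<std::size_t>(last - it), &state);
        if (result == static_cast<std::size_t>(-1) ||
            result == static_cast<std::size_t>(-2)) {
            return std::nullopt; // invalid sequence, or truncated input
        }
        output.push_back(code_point);
        if (result == static_cast<std::size_t>(-3)) {
            continue; // output came from earlier input; no bytes consumed
        }
        // A result of 0 means a null character was converted (one byte).
        it += (result == 0 ? 1 : result);
    }
    return output;
}
```

The result depends on the active locale: in the default "C" locale only plain ASCII is guaranteed to convert, which is part of why this older interface is awkward to rely on.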
An additional “icing on the cake” for us is the C Committee’s encouragement to provide transcoding functions between Unicode formats. Pairwise conversions to and from UTF-8, UTF-16, and UTF-32 are fairly huge for modern ecosystems that have already adopted Unicode. We will be going out and implementing them in existing standard libraries. We hope to finalize the design and standardize something for C23.
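To give a feel for what one of these pairwise conversions involves, here is a minimal sketch of UTF-32 to UTF-8. The function names append_utf8 and utf32_to_utf8 are illustrative only, not names from the C paper; the bit layout follows the standard UTF-8 encoding form.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: appends the UTF-8 encoding of one code point.
// Returns false for surrogates and values above U+10FFFF.
bool append_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false; // not a Unicode scalar value
    }
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return true;
}

// Hypothetical helper: converts a whole UTF-32 string, stopping with
// std::nullopt at the first invalid code point.
std::optional<std::string> utf32_to_utf8(std::u32string_view input) {
    std::string output;
    for (char32_t cp : input) {
        if (!append_utf8(cp, output)) {
            return std::nullopt;
        }
    }
    return output;
}
```

Returning std::nullopt on the first bad code point is just one possible error policy; a real library would likely also offer replacement-character handling.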
You can read more about that functionality in this paper here.
Seed 1: Transcoding Support in C++
ThePhD’s work with C++ is what started it all, kicked off publicly at his CppCon 2019 presentation. Progress on P1629 has been measured, but slower. It is mostly bottlenecked by needing to provide wording for the entire paper. The theory and implementation parts have been taken care of by the paper and the private work we have been using, respectively, as well as a few companies also using it on the outside.
The C++ support is far more robust than the C support. The C paper focuses on just covering the “get me out of the (wide) execution encoding, right now!” use case. The C++ paper focuses on making an encoding system that is infinitely extensible. Rather than needing to hack your standard library to add support for your encoding, the C++ paper includes concepts (normal ones, not C++-style ones) and rules that define what an encoding object is.
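As a loose illustration of the idea (and not the actual P1629 interface), an encoding object can be pictured as a small type that knows how to decode one code point from its code units and encode one code point back. Every name below (ascii_encoding, latin1_encoding, decode_one, encode_one, transcode) is an assumption made for this sketch.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical encoding object for ASCII.
struct ascii_encoding {
    // Decodes one code point from the front of `input`, consuming it on success.
    static std::optional<char32_t> decode_one(std::string_view& input) {
        if (input.empty()) return std::nullopt;
        const unsigned char unit = static_cast<unsigned char>(input.front());
        if (unit > 0x7F) return std::nullopt; // not ASCII
        input.remove_prefix(1);
        return static_cast<char32_t>(unit);
    }
    // Appends the encoding of `cp` to `output`; fails if it is not representable.
    static bool encode_one(char32_t cp, std::string& output) {
        if (cp > 0x7F) return false;
        output.push_back(static_cast<char>(cp));
        return true;
    }
};

// Hypothetical encoding object for ISO-8859-1 (Latin-1), where every byte
// maps directly to the code point with the same value.
struct latin1_encoding {
    static std::optional<char32_t> decode_one(std::string_view& input) {
        if (input.empty()) return std::nullopt;
        const unsigned char unit = static_cast<unsigned char>(input.front());
        input.remove_prefix(1);
        return static_cast<char32_t>(unit);
    }
    static bool encode_one(char32_t cp, std::string& output) {
        if (cp > 0xFF) return false;
        output.push_back(static_cast<char>(cp));
        return true;
    }
};

// Generic transcoding written only against the two operations above:
// decode with From, pass through a code point, encode with To.
template <typename From, typename To>
std::optional<std::string> transcode(std::string_view input, From, To) {
    std::string output;
    while (!input.empty()) {
        const std::optional<char32_t> cp = From::decode_one(input);
        if (!cp || !To::encode_one(*cp, output)) return std::nullopt;
    }
    return output;
}
```

Because transcode only relies on those two operations, a user-defined encoding plugs in without touching the library, which is the extensibility point the paragraph above describes.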
It provides total coverage of the design space and allows anyone, from the tiniest chips to the largest systems, to cover their needs. This is without having to inject charmaps into your system, hack on iconv, or beg your standard library distribution for help. Being able to liberate ourselves from needing any overarching body, from POSIX to WG21, is massive for individual developer ability. It also removes the burden from standard library maintainers. They can focus on providing high-impact encoding objects in extensions (at their leisure, most importantly) that suit their business needs as well.
Seed 2: Fixing Existing Implementations
The start of this for us is going to be supporting ThePhD in sending submissions to musl-libc and glibc. Getting these basic functions, without optimizations, into a 2021 release of glibc and musl is the goal. Optimizations will come later. However, in both of these libraries, optimized subroutines already exist for some pairwise Unicode-to-Unicode conversions. The biggest hurdle will be figuring out how to use the locale tables currently present in both implementations to provide the functionality. They are geared towards the old implementations, so some adjustment of the locale tables may be required.
Seed 3: Publicly Available Libraries
A freely available, standalone library needs to come forth in addition to the modified existing implementations. ThePhD has been working on this for quite some time, and will soon release both the C library called “cuneicode”, matching the C proposal, and the “text” library, matching the C++ proposal, which delivers on many of the promises made in the 5 talks he has given on the subject so far. Top of the ticket is making this code:
#include <encoding>
#include <text>

#include <cstdint>
#include <iostream>
#include <string_view>

int main (int, char*[]) {

    using namespace std::literals;

    // from "execution charset" string literal text to UTF-8.
    std::text::u8text my_text
        = std::text::transcode("안녕하세요 👋"sv, std::text::utf8{});

    std::cout << my_text << std::endl; // prints 안녕하세요 👋 to a capable console

    // print each code point in hexadecimal (std::hex does not zero-pad)
    std::cout << std::hex;
    for (const auto& cp : my_text) {
        std::cout << static_cast<std::uint32_t>(cp) << " ";
    }
    // c548 b155 d558 c138 c694 20 1f44b

    return 0;
}
work, out of the box, on all implementations. The header names and namespace names will be different for the stand-alone library, since names like <encoding> and <text> belong to the standard library itself.
2021
We are looking forward to a successful 2021 for Unicode in Systems Programming! Hopefully, this will be the year everything is better and greater for us all.
What do you think of the article? Get in touch with us or leave us a comment on our social media to discuss!
Edited Title Photo by Binyamin Mellish from Pexels.