
Unicode 15 released

Version 15 of the Unicode standard has been released.

This version adds 4,489 characters, bringing the total to 149,186 characters. These additions include two new scripts, for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs.



Unicode 15 released

Posted Sep 14, 2022 18:16 UTC (Wed) by SLi (subscriber, #53131) [Link] (3 responses)

Why do we get to use the "bottom left part of glyph is damaged" modifier only for hieroglyphs? :(

Unicode 15 released

Posted Sep 14, 2022 21:03 UTC (Wed) by flussence (guest, #85566) [Link] (1 response)

It makes sense given the percentage of source material that needs it, but I agree. Plenty of stuff written on paper could've used that too!

Unicode 15 released

Posted Sep 20, 2022 5:44 UTC (Tue) by willy (subscriber, #9762) [Link]

Or vellum (I may have just done the tourist thing in Dublin and been to see the Book of Kells)

Unicode 15 released

Posted Sep 15, 2022 9:10 UTC (Thu) by n8willis (subscriber, #43041) [Link]

So the good news is that it's not really forbidden or anything; the Script property in UCD is defined for text-processing purposes and TR24 says implementers need to support "out of scope" usage (https://www.unicode.org/reports/tr24/#Out_of_Scope).

Thus the super-secret path to getting it pushed up into the realm of the commonly accepted would probably be to open an issue on HarfBuzz to try and get it explicitly added to the various shapers, then once that's done, insist that a glyph for it be added to some high-profile FOSS fonts, then finally petition Unicode to add it to the Script Extensions on the grounds that it's so common.

Couple years, tops.

Unicode 15 released

Posted Sep 14, 2022 19:05 UTC (Wed) by alspnost (guest, #2763) [Link] (1 response)

Finally, a WiFi emoji!

Unicode 15 released

Posted Sep 14, 2022 21:15 UTC (Wed) by Sesse (subscriber, #53779) [Link]

And one for HONK

Unicode 15 released

Posted Sep 15, 2022 5:02 UTC (Thu) by suckfish (guest, #69919) [Link] (33 responses)

What is the practical limit on the number of Unicode code points, and when are we due to run out on the current trajectory? (Is the growth rate linear, or faster, or slower?)

UTF-8's 4-byte encodings give a 21-bit space (but for whatever reason it is defined to max out at code point 0x10ffff rather than 0x1fffff). If we reach that limit, is it feasible to extend, or does some combination of security/compatibility/stupidity make extension non-feasible?

(If we followed the UTF-8 scheme systematically, we could go to a 7-byte encoding and a 36-bit space, and leave the byte value 0xff as a prefix for even further extension. So it's theoretically possible...)
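
For the curious, here is a rough sketch of what such a systematically extended encoder could look like, following the original FSS-UTF pattern in which an N-byte sequence carries (7 - N) + 6*(N - 1) = 5N + 1 payload bits, so 4 bytes give 21 bits, 6 bytes give 31 bits, and 7 bytes give 36 bits. The 5- and 6-byte forms really existed in RFC 2279; the 7-byte form (lead byte 0xfe) is purely the hypothetical extension described above and not part of any standard.

    /* Sketch only: a generalized UTF-8-style encoder. Real UTF-8 (RFC 3629)
     * stops at 4 bytes / U+10FFFF. Returns the number of bytes written;
     * buf must have room for 7 bytes. */
    #include <stdint.h>
    #include <stddef.h>

    static size_t encode_extended_utf8(uint64_t cp, unsigned char *buf)
    {
        size_t n;
        if (cp < 0x80) {                        /* plain ASCII, 7 bits */
            buf[0] = (unsigned char)cp;
            return 1;
        }
        if      (cp < 0x800)            n = 2;  /* 11 payload bits */
        else if (cp < 0x10000)          n = 3;  /* 16 bits */
        else if (cp < 0x200000)         n = 4;  /* 21 bits (real UTF-8 ends at U+10FFFF) */
        else if (cp < 0x4000000)        n = 5;  /* 26 bits (RFC 2279 only) */
        else if (cp < 0x80000000)       n = 6;  /* 31 bits (RFC 2279 only) */
        else if (cp < 0x1000000000ULL)  n = 7;  /* 36 bits (hypothetical, lead byte 0xfe) */
        else return 0;                          /* 0xff stays reserved for further extension */

        for (size_t i = n - 1; i > 0; i--) {    /* continuation bytes, last first: 10xxxxxx */
            buf[i] = 0x80 | (unsigned char)(cp & 0x3f);
            cp >>= 6;
        }
        buf[0] = (unsigned char)(((0xff00 >> n) & 0xff) | cp);  /* n ones, a zero, then the rest */
        return n;
    }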

Unicode 15 released

Posted Sep 15, 2022 7:36 UTC (Thu) by dh (subscriber, #153) [Link] (14 responses)

Unicode has some 1,100,000 possible code points. With 150,000 assigned and 5,000 new ones per year, we're safe for another 190 years. So while this might lead to a year-2212 problem, I'd say it's a bit early to take action.
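
As a rough cross-check using the exact figures from the article (and assuming, as above, that the recent pace of roughly 5,000 new code points per year continues):

    1,114,112 code points - 2,048 surrogates - 149,186 assigned = 962,878 free
    962,878 / 5,000 per year ≈ 193 years

so the back-of-the-envelope estimate above holds up.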

Unicode 15 released

Posted Sep 15, 2022 14:20 UTC (Thu) by zwol (guest, #126152) [Link] (13 responses)

> I'd say it's a bit early to take actions.

I think it would be better to stop baking the artificial plane-16 (U+10FFFF) limit into every program that processes UTF-8 /now/. The longer we put this change off, the more places will have to be changed and the more painful it will be. It's exactly the same deal as Y2K and Y2038: fix it early or it'll cost more.

Unicode 15 released

Posted Sep 15, 2022 17:40 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (12 responses)

First we would have to persuade Microsoft and Oracle that UTF-16 was a bad idea. For backcompat reasons, that's basically never going to happen (i.e. Windows and Java still use UTF-16 extensively and seemingly have no plans to remove or deprecate it).

Unicode 15 released

Posted Sep 15, 2022 18:20 UTC (Thu) by devslashilly (guest, #124291) [Link] (11 responses)

Good news: since Java 18 it's been UTF-8 by default (https://openjdk.org/jeps/400). Now we just need to wait the 10 years for people to update their JVMs.

Unicode 15 released

Posted Sep 15, 2022 23:10 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (10 responses)

That's nice, but String is still UTF-16 according to https://docs.oracle.com/javase/7/docs/api/java/lang/Strin..., and that's arguably a bigger problem than what charset the OS-level APIs use.

Meanwhile, Windows is never going to give up UTF-16, unless they have a full backcompat break. The problem, you see, looks roughly like this:

0. In the bad old days of everyone using a different ISO-8859 variant (except for East Asia, where they had a wildly different set of encodings because CJK scripts are huge), Windows came in different editions, and the code page was baked into the OS. All of their APIs would transparently use the OS-level code page, and if you wanted to support other code pages, too bad, it was impossible.
1. At some point, they decided that was too ugly, and modularized things to such an extent that you could install multiple locales on the same computer and switch between them. But the old APIs were still around, so the OS-level code page became the application code page, set by default to "whatever the active locale specifies," and the API grew a few functions for changing the code page if desired. Eventually, they also added manifest support so you could just do that at the packaging stage instead of having to write actual code for it.
2. The Unicode Consortium comes along and tells everyone "We're doing this great new encoding, it'll have all the languages and fit into 16 bits." Microsoft decides to go all-in on this, and deploys a brand-new set of APIs which are identical to the old APIs, except they use wchar_t instead of char, and only accept UTF-16 (native endianness, usually LE). The old char APIs are informally deprecated but continue to exist for backcompat reasons. Also, they introduce a whole bunch of preprocessor macros so that you can just code against the two APIs as if they were one API, not think about charsets at all, and then decide which API to use with a single global #define at build time (a sketch of this macro scheme follows the list).
3. Everyone figures out that 16 bits is not enough, and surrogate pairs are born. Microsoft bolts on surrogate pair checking to their existing UTF-16 APIs and calls it a day. (Technically, before surrogate pairs existed, it was called "UCS-2" rather than UTF-16, but I have avoided using that name to prevent confusion. It's exactly the same encoding, other than the existence of surrogate pairs.)
4. Everyone figures out that UTF-8 is far superior to UTF-16. After much hemming and hawing, Microsoft adds a code page for UTF-8, as if it's just another legacy encoding, but eventually they update their documentation to vaguely suggest that maybe using the char API with the UTF-8 code page is better in some circumstances. Also, they make the UTF-8 code page the system-level default in all locales, but that only affects the char API, because the wchar API never used code pages in the first place (it's hard-coded to UTF-16 and always has been).
5. So now, the wchar functions must continue to exist, and must continue to use UTF-16, or else lots of applications will stop working. Microsoft *could* tell everyone to recompile against the char functions with a UTF-8 code page, and then drop wchar support, breaking everyone who didn't recompile, but they are not Apple and can't get away with doing something like that. Also, the wording of their documentation strongly suggests that much of the NT codebase uses UTF-16 internally and that it's the "native" encoding of modern Windows, so changing the APIs would be putting lipstick on a pig anyway.
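
A minimal sketch of the macro scheme from step 2, assuming a plain Win32 C program (MessageBox is just a convenient example; any -A/-W function pair works the same way):

    #include <windows.h>
    #include <tchar.h>

    /* With UNICODE/_UNICODE defined at build time, TCHAR is wchar_t, TEXT()
     * produces L"..." literals, and MessageBox expands to MessageBoxW; without
     * them it's char / "..." / MessageBoxA, using the process's ANSI code page. */
    void greet(void)
    {
        const TCHAR *msg = TEXT("Hello, world");
        MessageBox(NULL, msg, TEXT("Greeting"), MB_OK);
    }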

Unicode 15 released

Posted Sep 16, 2022 0:34 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> Meanwhile, Windows is never going to give up UTF-16, unless they have a full backcompat break.

Windows allows you to use the "-A" functions with UTF-8 now. It just internally translates the strings into WTF-16.

It's possible to flip this around: use UTF-8 internally and translate the WTF-16 into UTF-8 at the system-library border. This would have an impact on kernel-level drivers, but even there a compat layer could provide a smooth transition.

This is also made easier by the microkernel-ish design of Windows, where most ioctls/syscalls are actually done via a sort of message passing, so translation can be done in a central location that is fairly straightforward to maintain.

Unicode 15 released

Posted Sep 16, 2022 18:16 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

Yes, they can replace the internal guts of NT, if they really want to, but they can't get rid of the -W functions without forcing everyone to recompile. Which means that UTF-16 (I'm not aware of a WTF-16 encoding, although I know that WTF-8 is a thing) has to continue existing.

Furthermore, when the -W functions were introduced, the whole point of them was to cover all of Unicode without having to think about code pages ever again. Therefore, it was reasonable for applications at the time to assume that (for example) you could take the string you get from FindFirstFileW/FindNextFileW and pass it directly to CreateFileW, and that everything will round-trip correctly no matter what that string looks like.* Any future version of Windows has to preserve that invariant, which means that future versions of Windows cannot allow invalid-in-UTF-16 characters in filenames (or they can, but then they have to do some weird hack like the old "short file names" tilde nonsense).

* Obviously, there are TOCTTOU concerns here, but I'm assuming that you're not implementing a security boundary and do not actually care to support the use case where some random other process unexpectedly stomps on your %APPDATA% subdirectory.
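
A minimal sketch of that round-trip assumption, with error handling trimmed and the TOCTTOU caveat ignored; enumerating the current directory keeps cFileName usable as-is:

    #include <windows.h>

    /* Enumerate the current directory with the -W API and open each entry by
     * passing the name straight back to CreateFileW, exactly as applications
     * written against the wide API have always assumed they could. */
    void open_everything_here(void)
    {
        WIN32_FIND_DATAW fd;
        HANDLE find = FindFirstFileW(L"*", &fd);
        if (find == INVALID_HANDLE_VALUE)
            return;
        do {
            HANDLE h = CreateFileW(fd.cFileName, GENERIC_READ, FILE_SHARE_READ,
                                   NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
            if (h != INVALID_HANDLE_VALUE)   /* directories such as "." will simply fail to open */
                CloseHandle(h);
        } while (FindNextFileW(find, &fd));
        FindClose(find);
    }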

Unicode 15 released

Posted Sep 16, 2022 20:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> FindFirstFileW/FindNextFileW and pass it directly to CreateFileW
I remember reading that Windows is starting to enforce at least some sanity in CreateFileW and doesn't allow some of the more malformed names. And that's discounting gotchas like "aux.txt".

Unicode 15 released

Posted Sep 16, 2022 20:17 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)

That's beside the point. If you got it from FindFirstFileW in the first place, then it must have been a valid name to begin with, and so you can pass it to CreateFileW.

(Technically, you can create files with invalid names by prefixing the absolute path with \\?\, and then this may fail to work. But that's because \\?\ is the explicit "I know what I'm doing is not valid in the NT universe, just let me do it anyway" magic word, and when you use it, lots of stuff breaks. Barring that sort of chicanery, this always works.)

Unicode 15 released

Posted Sep 16, 2022 20:33 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> If you got it from FindFirstFileW in the first place, then it must have been a valid name to begin with, and so you can pass it to CreateFileW.

That's not true even now. You can create an NTFS filesystem on Linux with incorrect file names, for example. Or it might be a network file system with limits that are different from what you'd expect. Barring that, reparse points might cause some files to be "unfindable".

So in practice the symmetry between FindFirstFile and CreateFile is already kinda wobbly.

Unicode 15 released

Posted Sep 16, 2022 21:35 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

Yeah, but those are all really unlikely in practice, so nobody's going to bother checking for them anyway. OTOH, "The user's name contains a character that is not in the current code page, and so everything under C:\Users\<name> is inaccessible and/or requires the use of a tilde name hack that will look ugly in your UI" is a much bigger problem... but if you're all-in on the -W functions, you probably assume(d) that you were safe from that.

Unicode 15 released

Posted Nov 8, 2022 2:36 UTC (Tue) by vtjnash (subscriber, #141755) [Link]

They could use WTF-8 encoding instead. It is a superset of UTF-8 that also supports round-trip from malformed UTF-16. (The reverse is not fully true, since it can yield different results if two such WTF-8 strings are concatenated and end up yielding a well-formed UTF-16 string after conversion)
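
A minimal sketch of the core trick, for a single unpaired surrogate code unit (a real WTF-8 encoder must also combine properly paired surrogates into one 4-byte sequence first, which is where the concatenation caveat above comes from):

    #include <stdint.h>

    /* Encode one 16-bit unit in 0x0800..0xFFFF, surrogates included, using the
     * ordinary 3-byte pattern. A lone U+D800 becomes ED A0 80 -- bytes a strict
     * UTF-8 decoder rejects, but which let malformed UTF-16 round-trip. */
    static void wtf8_encode_unit(uint16_t unit, unsigned char out[3])
    {
        out[0] = 0xe0 | (unit >> 12);
        out[1] = 0x80 | ((unit >> 6) & 0x3f);
        out[2] = 0x80 | (unit & 0x3f);
    }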

Unicode 15 released

Posted Sep 18, 2022 6:44 UTC (Sun) by jond (subscriber, #37669) [Link] (2 responses)

> That's nice, but String is still UTF-16 according to https://docs.oracle.com/javase/7/docs/api/java/lang/Strin...,

I don’t know whether this is fixed now or not, but that page won’t tell you, because it’s for Java SE 7, and OP was talking about the recently released Java SE 18.

Unicode 15 released

Posted Sep 18, 2022 8:50 UTC (Sun) by ABCD (subscriber, #53650) [Link]

The Java 18 docs for that class at https://docs.oracle.com/en/java/javase/18/docs/api/java.b... seem to indicate that this hasn't changed; it's still UTF-16.

Unicode 15 released

Posted Sep 18, 2022 8:55 UTC (Sun) by dtlin (subscriber, #36537) [Link]

char being a 16-bit value is hard-baked into the JVM, and thus anything that uses a char[] is inherently operating on UTF-16.

Java 9 did add the -XX:+CompactStrings option (JEP 254), which changed the internal representation of String from char[] to byte[], along with a bit determining whether that representation is Latin-1 or UTF-16, with the former taking up half the space. But there was no change to the user-visible API; it is only an implementation detail.

(Java 8 had already added String#codePoints(), which returns an IntStream of code points, but it's unrelated, and you could have implemented it yourself with codePointAt()+offsetByCodePoints() anyway; it's just more convenient.)

Unicode 15 released

Posted Sep 15, 2022 11:08 UTC (Thu) by grawity (subscriber, #80596) [Link] (4 responses)

> UTF-8 and 4 byte encodings give a 21 bit space, (but for whatever reason it is defined to max out at code point 0x10ffff rather than 0x1fffff)

UTF-8 has a 31-bit space (0x7fffffff). It's artificially capped to 0x10ffff because that's the limit for UTF-16 surrogate-pair encoding.
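
For reference, the arithmetic behind that limit: a surrogate pair carries 10 + 10 = 20 payload bits on top of a 0x10000 offset, so the highest reachable code point is 0x10000 + 0xFFFFF = 0x10FFFF. A minimal sketch of the split:

    #include <stdint.h>

    /* Split a supplementary code point (0x10000..0x10FFFF) into a UTF-16
     * surrogate pair; BMP code points are stored directly as a single unit. */
    static void to_surrogate_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
    {
        cp -= 0x10000;                          /* 20 payload bits remain */
        *hi = 0xd800 | (uint16_t)(cp >> 10);    /* high surrogate: D800..DBFF */
        *lo = 0xdc00 | (uint16_t)(cp & 0x3ff);  /* low surrogate:  DC00..DFFF */
    }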

Unicode 15 released

Posted Sep 16, 2022 1:21 UTC (Fri) by scientes (guest, #83068) [Link] (3 responses)

> UTF-8 has a 31-bit space (0x7fffffff).

Only glibc accepts the 5- and 6-byte sequences you are describing (which is documented in the utf-8(7) man page). Other decoders completely reject the following bytes: c0, c1, and f8-ff, as well as the encoded UTF-16 surrogates.

Unicode 15 released

Posted Sep 16, 2022 10:16 UTC (Fri) by grawity (subscriber, #80596) [Link] (2 responses)

> the version of 5 and 6-byte characters you are suggesting

Well, I'm not exactly "suggesting" them as something new – 5-byte and 6-byte sequences were already defined in Pike's original spec [1] back in 1992 and were still present in the RFC 2279 version; they only got removed when RFC 3629, which imposed the artificial 0x10ffff cap, was published. (I've seen Git commits deliberately removing existing support for 5-byte and 6-byte sequences from UTF-8 decoders which used to accept them, and that annoyed me a lot.)

[1]: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

So my original point to @suckfish was:

1) there's nothing weird about UTF-8 being limited to 0x10ffff rather than 0x1fffff, because the 4-byte maximum is not the *reason* for such a codepoint limit, but rather it's the *consequence* of the codepoint limit;

2) it is indeed feasible to extend UTF-8 that way, because it was already designed "pre-extended", it's only being held back to the lowest common denominator by UTF-16.

(Sure, the current UTF-8 decoders won't accept larger codepoints, but… it can't be worse than old UCS-2 decoders mishandling UTF-16 surrogate pairs, can it? After all, lots of programs used to assume that Unicode ended at 0xffff and we extended that eventually.)

Unicode 15 released

Posted Sep 16, 2022 12:54 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

> (Sure, the current UTF-8 decoders won't accept larger codepoints, but… it can't be worse than old UCS-2 decoders mishandling UTF-16 surrogate pairs, can it? After all, lots of programs used to assume that Unicode ended at 0xffff and we extended that eventually.)

Mishandling surrogate pairs is often harmless - in particular if you have a UCS-2 decoder and a UCS-2 encoder, then UTF-16 surrogate pairs will roundtrip correctly. It's only a problem when you split up the pair, and most applications don't split strings at arbitrary points.

That means it was (mostly) safe to start writing UTF-16 documents before applications had been updated to understand UTF-16. Similarly you could use UTF-8 in applications that treated it like ASCII. But the same isn't true of extended-UTF-8 - if you try writing a document with a 5-byte code point, many UTF-8-aware applications will throw an exception or replace it with U+FFFD, so you'll have to update every application before you can start using the new encoding (i.e. it will never happen).

I suspect that if UTF-8 ever needs to be extended above U+10FFFF it may be done similarly to UTF-16 surrogate pairs, where a currently-unused part of the code point space becomes reserved and the new code points are represented as two 4-byte UTF-8 sequences. It's ugly and inefficient but (far more importantly) it would allow gradual adoption without major backward compatibility problems.
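
To put rough, purely illustrative numbers on that idea (no such scheme has actually been proposed): if one currently unassigned plane's 65,536 code points were split into a "high" half and a "low" half of 32,768 each, a pair of them could address 32,768 × 32,768 = 2^30 ≈ 1.07 billion additional code points, at a cost of 8 bytes per character in UTF-8, much as the 1,024 × 1,024 UTF-16 surrogate combinations added the 2^20 code points above the BMP.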

Unicode 15 released

Posted Sep 19, 2022 4:37 UTC (Mon) by grawity (subscriber, #80596) [Link]

> It's only a problem when you split up the pair, and most applications don't split strings at arbitrary points.

They still often do. Even today, in Android's YouTube app, I'm seeing it truncate "collapsed" descriptions at length X, splitting an emoji in half and showing a � at the end...

> if you try writing a document with a 5-byte code point, many UTF-8-aware applications will throw an exception or replace it with U+FFFD, so you'll have to update every application before you can start using the new encoding (i.e. it will never happen).

Yeah, throwing an exception for this would be really annoying; it's what really ticks me off about the entire situation – not that we'll need an extension anytime soon, but still, we *had* this capability and we deliberately removed it. It's like 240.0.0.0/4 all over again.

On the other hand, showing a 5-byte codepoint as U+FFFD isn't too different from what UCS-2 software used to do: show a surrogate-pair-encoded character as two U+FFFDs.

Unicode 15 released

Posted Sep 15, 2022 19:45 UTC (Thu) by plugwash (subscriber, #29694) [Link] (3 responses)

> UTF-8 and 4 byte encodings give a 21 bit space, (but for whatever reason it is defined to max out at code point 0x10ffff rather than 0x1fffff).

"whatever reason" being UTF-16.

> when are due to run out on the current trajectory? (Is the growth rate linear or faster or slower?)

Unicode has 17 planes, numbered 0 to 16. Planes 15 and 16 are reserved for private use, while planes 0-14 are available for public allocations.

Unicode maintains a series of roadmaps on its site showing the status of those planes. Taking a quick look at them:

Plane 0 (the basic multilingual plane) is basically full at this point.
Plane 1 (the supplementary multilingual plane) is about half formally assigned, but much of the rest is "pencilled in" for various scripts.
Plane 2 (the supplementary ideographic plane) is getting pretty full with a series of "CJK unified ideographs extension"s, though there is a little bit of unallocated space (presumably because "CJK unified ideographs extension G" was too big to fit). There don't seem to be any tentative allocations in this plane.
Plane 3 (the tertiary ideographic plane) has "CJK unified ideographs extension G", which takes up a bit over a sixteenth of the plane, plus tentative allocations for various historic scripts used in China. Even taking the tentative allocations into account, it's less than half full.
Plane 14 (the supplementary special-purpose plane) is mostly unused.

Planes 4 through 13 are completely unallocated.

So about two and a half planes' worth of code space is formally assigned, with tentative allocations taking that up to around three and a half. That is less than a quarter of the total encoding space.

> If we reach that limit, is it feasible to extend or does some combination of security/compatibility/stupidity make extension non-feasible?

Basically, you would have to:

1. Define how to extend the existing encodings. For UTF-8 and UTF-32 that is trivial, since both encodings are artificially limited, but UTF-16 would be more difficult.
2. Update software that handles encoding conversion and validation to accept the new versions of the encodings.
3. Get that modified software deployed widely.

It's not impossible, but I think it's unlikely that a sufficiently compelling reason will come along to motivate the global software community to do it.

I think it's more likely that, if/when encoding space becomes a problem, proposers will be encouraged to encode their scripts in a way that uses less encoding space, i.e. making more use of combining characters, variation selectors, and so on.
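
As a small illustration of how combining characters stretch the code space, the same visible text can be built either from a precomposed code point or from a base letter plus a combining mark; both lines below should render identically (a tiny C example, assuming a UTF-8 terminal):

    #include <stdio.h>

    int main(void)
    {
        printf("%s\n", "\xc3\xa9");    /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
        printf("%s\n", "e\xcc\x81");   /* U+0065 'e' + U+0301 COMBINING ACUTE ACCENT */
        return 0;
    }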

Unicode 15 released

Posted Sep 15, 2022 21:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> "whatever reason" being UTF-16.

We should just deprecate it and start moving away from it. UTF-8 is the best choice anyway.

Even Windows is supporting true UTF-8 APIs these days.

Unicode 15 released

Posted Sep 15, 2022 22:50 UTC (Thu) by khim (subscriber, #9252) [Link]

> Even Windows is supporting true UTF-8 APIs these days.

Externally but not internally.

I guess the first step would be to encourage moving programs to UTF-8.

Because that's the most important step, anyway.

Unicode 15 released

Posted Sep 20, 2022 4:03 UTC (Tue) by plugwash (subscriber, #29694) [Link]

I'm not sure what difference a "deprecation" would make. Most people probably don't actively choose their string type; they use whatever their environment uses. If one is designing a new environment from scratch, then UTF-8 is the obvious choice, and I can't see the maintainers of existing environments pushing a horribly painful switch just because Unicode says something is deprecated.

The NT line of Windows has the idea that strings are sequences of 16-bit units as a core part of its design. Whenever Windows documentation refers to "Unicode", assume it means "a series of 16-bit code units that are nominally UTF-16 but may not actually be valid UTF-16".

It's cool that Windows now lets you use UTF-8 as the "ANSI" code page for an application, but it's far too late to make any noticeable difference. It might help a few legacy applications that are still stuck in the dark ages add Unicode support more easily, but any modern Windows software will have long ago moved to the native UTF-16 "W" APIs, and I really can't see them moving back. Also, the documentation seems unclear on the details, such as what happens when a filename is not valid UTF-16.

It's a similar story with Java, .NET, and Qt: their core string type is a sequence of 16-bit units, and changing it would be a major compatibility break.

Unicode 15 released

Posted Sep 16, 2022 17:35 UTC (Fri) by k8to (guest, #15413) [Link] (8 responses)

Somewhere in here is the question of the trajectory of the underlying information source; that is, the rate of creation of human scripts, or the rate at which existing scripts become worth adding to Unicode.

Somehow I imagine we're not actually creating many new scripts and characters, and that the growth is largely just a matter of adding the already-existing scripts, which implies the rate of growth will fall off. So I'm a bit suspicious that the code space may never be filled.

Unicode 15 released

Posted Sep 16, 2022 17:55 UTC (Fri) by sfeam (subscriber, #2841) [Link] (7 responses)

"Somehow I imagine we're not actually creating many new scripts and characters".

You underestimate the allure of cute emojis.

Unicode 15 released

Posted Sep 18, 2022 8:34 UTC (Sun) by Sesse (subscriber, #53779) [Link] (6 responses)

FWIW, the position of the Unicode Consortium is that emoji are a temporary solution that should be replaced by arbitrary “stickers” in the long run (and thus, presumably move out of their realm, as they do not define messaging protocols or file formats—stickers would live outside the concept of character sets).

Unicode 15 released

Posted Sep 18, 2022 19:29 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (5 responses)

The Unicode Consortium could do that tomorrow, if they were really serious about it. Just tell everyone "We're not encoding any new emoji ever again, if you want new emoji then put them in Private Use or use some non-text encoding instead." Then they'd be out of the emoji business forever. Nobody is forcing them to encode emoji.

(There is no deprecation process for Unicode code points, so no other work would be required on the Consortium's side of things. Once a code point is in the standard, that's it, it's there forever. See https://www.unicode.org/policies/stability_policy.html)

Unicode 15 released

Posted Sep 18, 2022 19:36 UTC (Sun) by Sesse (subscriber, #53779) [Link] (4 responses)

“Temporary solution” does not mean “needs to end right now”, though.

Unicode 15 released

Posted Sep 18, 2022 19:37 UTC (Sun) by Sesse (subscriber, #53779) [Link] (1 responses)

Also, there _is_ a deprecation process, it just doesn't end in removal: “The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.”

Unicode 15 released

Posted Sep 19, 2022 19:16 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

That's not a process. It's a one-off announcement. You write "this (range of) code point(s) (is/are) deprecated" on some formal piece of paper and call it a day. Everything else is the implementation's problem, not the standard's problem. So I stand by my claim that this would not be difficult for the Consortium to do immediately.

Unicode 15 released

Posted Sep 19, 2022 18:37 UTC (Mon) by plugwash (subscriber, #29694) [Link] (1 responses)

The cynic in me suspects that "temporary solution" may mean "some members were pushing hard for this, but others were sceptical; let's call it a temporary solution to mollify the sceptics".

Unicode 15 released

Posted Sep 19, 2022 19:14 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

There is nothing more permanent than a temporary solution.

