Unicode 15 released
This version adds 4,489 characters, bringing the total to 149,186 characters. These additions include two new scripts, for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs.
Posted Sep 14, 2022 18:16 UTC (Wed) by SLi (subscriber, #53131) [Link] (3 responses)
Posted Sep 14, 2022 21:03 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)
Posted Sep 20, 2022 5:44 UTC (Tue) by willy (subscriber, #9762) [Link]
Posted Sep 15, 2022 9:10 UTC (Thu) by n8willis (subscriber, #43041) [Link]
Thus the super-secret path to getting it pushed up into the realm of the commonly accepted would probably be to open an issue on HarfBuzz to try and get it explicitly added to the various shapers, then once that's done, insist that a glyph for it be added to some high-profile FOSS fonts, then finally petition Unicode to add it to the Script Extensions on the grounds that it's so common.
Couple years, tops.
Posted Sep 14, 2022 19:05 UTC (Wed) by alspnost (guest, #2763) [Link] (1 responses)
Finally, a WiFi emoji!
Posted Sep 14, 2022 21:15 UTC (Wed) by Sesse (subscriber, #53779) [Link]
Posted Sep 15, 2022 5:02 UTC (Thu) by suckfish (guest, #69919) [Link] (33 responses)
UTF-8 with 4-byte encodings gives a 21-bit space (but for whatever reason it is defined to max out at code point 0x10ffff rather than 0x1fffff). If we reach that limit, is it feasible to extend or does some combination of security/compatibility/stupidity make extension non-feasible?
(If we followed the UTF-8 scheme systematically you can go to 7 byte encoding and a 36 bit space, and leave the byte value 0xff as a prefix for even further extension. So theoretically possible...)
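The bit counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic: `payload_bits` is a made-up helper name, and sequences longer than 4 bytes are not valid in standard UTF-8 today.

```python
def payload_bits(n: int) -> int:
    """Code point bits carried by an n-byte sequence if the UTF-8
    lead-byte pattern is continued systematically (illustrative only;
    standard UTF-8 stops at 4 bytes)."""
    if n == 1:
        return 7                      # 0xxxxxxx
    if not 2 <= n <= 7:
        raise ValueError("sketch covers 1 to 7 bytes")
    lead_bits = 7 - n                 # lead byte: n one-bits, a zero, then payload
    return lead_bits + 6 * (n - 1)    # each continuation byte carries 6 bits


for n in range(1, 8):
    print(n, payload_bits(n))
```

Under that assumption, 4 bytes give 21 bits, 6 bytes give 31, and a 7-byte sequence (lead byte 0xfe) gives 36, leaving 0xff unused.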
Posted Sep 15, 2022 7:36 UTC (Thu) by dh (subscriber, #153) [Link] (14 responses)
Posted Sep 15, 2022 14:20 UTC (Thu) by zwol (guest, #126152) [Link] (13 responses)
I think it would be better to stop baking the artificial 16-plane limit into every program that processes UTF-8 /now/. The longer we put this change off, the more places will have to be changed and the more painful it will be. It's exactly the same deal as Y2K and Y2038: fix it early or it'll cost more.
Posted Sep 15, 2022 17:40 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (12 responses)
Posted Sep 15, 2022 18:20 UTC (Thu) by devslashilly (guest, #124291) [Link] (11 responses)
Posted Sep 15, 2022 23:10 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (10 responses)
Meanwhile, Windows is never going to give up UTF-16, unless they have a full backcompat break. The problem, you see, looks roughly like this:
0. In the bad old days of everyone using a different ISO-8859 variant (except for East Asia, where they had a wildly different set of encodings because CJK scripts are huge), Windows came in different editions, and the code page was baked into the OS. All of their APIs would transparently use the OS-level code page, and if you wanted to support other code pages, too bad, it was impossible.
1. At some point, they decided that was too ugly, and modularized things to such an extent that you could install multiple locales on the same computer and switch between them. But the old APIs were still around, so the OS-level code page became the application code page, set by default to "whatever the active locale specifies," and the API grew a few functions for changing the code page if desired. Eventually, they also added manifest support so you could just do that at the packaging stage instead of having to write actual code for it.
2. The Unicode Consortium comes along and tells everyone "We're doing this great new encoding, it'll have all the languages and fit into 16 bits." Microsoft decides to go all-in on this, and deploys a brand-new set of APIs which are identical to the old APIs, except they use wchar instead of char, and only accept UTF-16 (native endianness, usually LE). The old char APIs are informally deprecated but continue to exist for backcompat reasons. Also, they introduce a whole bunch of preprocessor macros so that you can just code against the two APIs as if they were one API, not think about charsets at all, and then decide which API to use with a single global #define at build time.
3. Everyone figures out that 16 bits is not enough, and surrogate pairs are born. Microsoft bolts on surrogate pair checking to their existing UTF-16 APIs and calls it a day. (Technically, before surrogate pairs existed, it was called "UCS-2" rather than UTF-16, but I have avoided using that name to prevent confusion. It's exactly the same encoding, other than the existence of surrogate pairs.)
4. Everyone figures out that UTF-8 is far superior to UTF-16. After much hemming and hawing, Microsoft adds a code page for UTF-8, as if it's just another legacy encoding, but eventually they update their documentation to vaguely suggest that maybe using the char API with the UTF-8 code page is better in some circumstances. Also, they make the UTF-8 code page the system-level default in all locales, but that only affects the char API, because the wchar API never used code pages in the first place (it's hard-coded to UTF-16 and always has been).
5. So now, the wchar functions must continue to exist, and must continue to use UTF-16, or else lots of applications will stop working. Microsoft *could* tell everyone to recompile against the char functions with a UTF-8 code page, and then drop wchar support, breaking everyone who didn't recompile, but they are not Apple and can't get away with doing something like that. Also, the wording of their documentation strongly suggests that much of the NT codebase uses UTF-16 internally and that it's the "native" encoding of modern Windows, so changing the APIs would be putting lipstick on a pig anyway.
Posted Sep 16, 2022 0:34 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)
Windows allows you to use the "-A" functions with UTF-8 now. It just internally translates the strings into WTF-16.
It's possible to flip this around to using UTF-8 internally and translating the WTF-16 into UTF-8 on the system library border. This will have impact on kernel-level drivers, but even there a compat layer can provide a smooth transition.
This is also made easier because of microkernel-ish design of Windows, where most ioctls/syscalls are actually done via a sort of message passing. So translation can be done in a central location that is fairly straightforward to maintain.
Posted Sep 16, 2022 18:16 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)
Furthermore, when the -W functions were introduced, the whole point of them was to cover all of Unicode without having to think about code pages ever again. Therefore, it was reasonable for applications at the time to assume that (for example) you could take the string you get from FindFirstFileW/FindNextFileW and pass it directly to CreateFileW, and that everything will round-trip correctly no matter what that string looks like.* Any future version of Windows has to preserve that invariant, which means that future versions of Windows cannot allow invalid-in-UTF-16 characters in filenames (or they can, but then they have to do some weird hack like the old "short file names" tilde nonsense).
* Obviously, there are TOCTTOU concerns here, but I'm assuming that you're not implementing a security boundary and do not actually care to support the use case where some random other process unexpectedly stomps on your %APPDATA% subdirectory.
Posted Sep 16, 2022 20:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)
I remember reading that Windows is starting to enforce at least some sanity in CreateFileW and doesn't allow some of the more malformed names. And that's discounting gotchas like "aux.txt".
Posted Sep 16, 2022 20:17 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)
(Technically, you can create files with invalid names by prefixing the absolute path with \\?\, and then this may fail to work. But that's because \\?\ is the explicit "I know what I'm doing is not valid in the NT universe, just let me do it anyway" magic word, and when you use it, lots of stuff breaks. Barring that sort of chicanery, this always works.)
Posted Sep 16, 2022 20:33 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)
That's not true even now. You can create an NTFS filesystem on Linux with incorrect file names, for example. Or it might be a network file system with limits that are different from what you'd expect. Barring that, reparse points might cause some files to be "unfindable".
So in practice the symmetry between FindFirstFile and CreateFile is already kinda wobbly.
Posted Sep 16, 2022 21:35 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)
Posted Nov 8, 2022 2:36 UTC (Tue) by vtjnash (subscriber, #141755) [Link]
Posted Sep 18, 2022 6:44 UTC (Sun) by jond (subscriber, #37669) [Link] (2 responses)
I don’t know whether this is fixed now or not, but that page won’t tell you, because it’s for Java SE 7, and OP was talking about the recently released Java SE 18.
Posted Sep 18, 2022 8:50 UTC (Sun) by ABCD (subscriber, #53650) [Link]
Posted Sep 18, 2022 8:55 UTC (Sun) by dtlin (subscriber, #36537) [Link]
char being a 16-bit value is hard-baked into the JVM, and thus anything that uses a char[] is inherently operating on UTF-16.
Java 9 did add the -XX:+CompactStrings option (JEP 254), which changed the internal representation of String from char[] to byte[], along with a bit determining whether that representation is Latin-1 or UTF-16, with the former taking up half the space. But there was no change to the user-visible API; it is only an implementation detail.
(Java 9 did add String#codePoints() returning an IntStream of code points, but it's unrelated and you could have implemented that yourself with codePointAt()+offsetByCodePoints() anyway, it's just more convenient.)
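To make the space saving concrete, here is a toy model of the Latin-1-or-UTF-16 choice in Python. It is not the JVM's code: `compact_size` and the coder labels are names invented for this sketch, and it only mimics the idea of picking the narrower representation when every character fits in one byte.

```python
def compact_size(s: str) -> tuple[str, int]:
    """Return (coder, byte length) under a Latin-1-or-UTF-16 scheme.

    Toy model only: use one byte per character when the whole string
    fits in Latin-1, otherwise fall back to 16-bit code units.
    """
    try:
        return "LATIN1", len(s.encode("latin-1"))
    except UnicodeEncodeError:
        return "UTF16", len(s.encode("utf-16-le"))


print(compact_size("hello"))   # all ASCII, one byte per character
print(compact_size("h€llo"))   # U+20AC is outside Latin-1, two bytes per unit
```

A single non-Latin-1 character is enough to push the whole string to the wider form, which is why the saving depends heavily on the workload.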
Posted Sep 15, 2022 11:08 UTC (Thu) by grawity (subscriber, #80596) [Link] (4 responses)
UTF-8 has a 31-bit space (0x7fffffff). It's artificially capped to 0x10ffff because that's the limit for UTF-16 surrogate-pair encoding.
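For anyone wondering where 0x10ffff comes from, the surrogate-pair arithmetic can be sketched in Python like this (`to_surrogates` is just an illustrative helper name):

```python
HIGH_START, LOW_START = 0xD800, 0xDC00


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000             # fits in 20 bits
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


# Each half of a pair carries 10 bits, so pairs cover 2**20 code points
# on top of the 0x10000 code points of the BMP.
MAX_CODE_POINT = 0x10000 + (1 << 20) - 1
print(hex(MAX_CODE_POINT))
print([hex(u) for u in to_surrogates(MAX_CODE_POINT)])
```

The last pair available is (0xdbff, 0xdfff), and there is no room left in the scheme after that.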
Posted Sep 16, 2022 1:21 UTC (Fri) by scientes (guest, #83068) [Link] (3 responses)
Only glibc accepts the version of 5- and 6-byte characters you are suggesting (which is documented in the utf8 man page). Other implementations completely reject the following bytes: c0, c1, (UTF-16 surrogate pairs) f8-ff.
Posted Sep 16, 2022 10:16 UTC (Fri) by grawity (subscriber, #80596) [Link] (2 responses)
Well, I'm not exactly "suggesting" them as something new – 5-byte and 6-byte sequences were already defined in Pike's original spec [1] back in 1992 and were still present in the RFC 2279 version; they only got removed at some point when RFC 3629 imposing the artificial 0x10ffff cap got published. (I've seen Git commits deliberately removing existing support for 5-byte and 6-byte sequence support from UTF-8 decoders which used to accept them, and that annoyed me a lot.)
[1]: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
So my original point to @suckfish was:
1) there's nothing weird about UTF-8 being limited to 0x10ffff rather than 0x1fffff, because the 4-byte maximum is not the *reason* for such a codepoint limit, but rather it's the *consequence* of the codepoint limit;
2) it is indeed feasible to extend UTF-8 that way, because it was already designed "pre-extended", it's only being held back to the lowest common denominator by UTF-16.
(Sure, the current UTF-8 decoders won't accept larger codepoints, but… it can't be worse than old UCS-2 decoders mishandling UTF-16 surrogate pairs, can it? After all, lots of programs used to assume that Unicode ended at 0xffff and we extended that eventually.)
Posted Sep 16, 2022 12:54 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)
Mishandling surrogate pairs is often harmless - in particular if you have a UCS-2 decoder and a UCS-2 encoder, then UTF-16 surrogate pairs will roundtrip correctly. It's only a problem when you split up the pair, and most applications don't split strings at arbitrary points.
That means it was (mostly) safe to start writing UTF-16 documents before applications had been updated to understand UTF-16. Similarly you could use UTF-8 in applications that treated it like ASCII. But the same isn't true of extended-UTF-8 - if you try writing a document with a 5-byte code point, many UTF-8-aware applications will throw an exception or replace it with U+FFFD, so you'll have to update every application before you can start using the new encoding (i.e. it will never happen).
I suspect that if UTF-8 ever needs to be extended above U+10FFFF it may be done similarly to UTF-16 surrogate pairs, where a currently-unused part of the code point space becomes reserved and the new code points are represented as two 4-byte UTF-8 sequences. It's ugly and inefficient but (far more importantly) it would allow gradual adoption without major backward compatibility problems.
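The "exception or U+FFFD" behaviour is easy to see with one current decoder. The sketch below uses Python's built-in UTF-8 codec as an example; other decoders may differ in how many replacement characters they emit. The byte sequence is how the old 5-byte form would have encoded U+200000.

```python
# F8 88 80 80 80: the pre-RFC 3629 five-byte form of U+200000.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# Strict decoding (the default) raises an exception.
try:
    five_byte.decode("utf-8")
    rejected = False
except UnicodeDecodeError as exc:
    rejected = True
    print("strict decoder rejects it:", exc.reason)

# Lenient decoding substitutes U+FFFD for the bytes it cannot interpret.
replaced = five_byte.decode("utf-8", errors="replace")
print(all(ch == "\ufffd" for ch in replaced))
```

Either way the original code point is lost, which is the gradual-adoption problem described above.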
Posted Sep 19, 2022 4:37 UTC (Mon) by grawity (subscriber, #80596) [Link]
They still often do, even today in Android's YouTube app I'm seeing it truncate "collapsed" descriptions at length X, splitting an emoji in half, and showing a � at the end...
> if you try writing a document with a 5-byte code point, many UTF-8-aware applications will throw an exception or replace it with U+FFFD, so you'll have to update every application before you can start using the new encoding (i.e. it will never happen).
Yeah, throwing an exception for this would be really annoying; it's what really ticks me off about the entire situation – not that we'll need an extension anytime soon, but still, we *had* this capability and we deliberately removed it. It's like 240.0.0.0/4 all over again.
On the other hand, showing a 5-byte codepoint as U+FFFD isn't too different from showing a surrogate-pair-encoded character as two U+FFFDs, as UCS-2 software used to do.
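The truncation case is easy to reproduce. A rough Python sketch, treating a string as a list of 16-bit units the way UCS-2-era code would (`utf16_units` and `units_to_str` are helper names made up for this example):

```python
import struct


def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s as integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def units_to_str(units: list[int]) -> str:
    """Decode 16-bit units, showing U+FFFD for anything malformed."""
    data = struct.pack(f"<{len(units)}H", *units)
    return data.decode("utf-16-le", errors="replace")


text = "ok \U0001F600"
units = utf16_units(text)             # 'o', 'k', ' ', high surrogate, low surrogate

# Passing the units through untouched round-trips fine...
assert units_to_str(units) == text

# ...but cutting at a fixed unit count can split the pair.
truncated = units_to_str(units[:4])
print(repr(truncated))
```

Truncating at "length X" in code units rather than code points (or grapheme clusters) is all it takes to leave a lone surrogate at the end.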
Posted Sep 15, 2022 19:45 UTC (Thu) by plugwash (subscriber, #29694) [Link] (3 responses)
"whatever reason" being UTF-16.
> when are we due to run out on the current trajectory? (Is the growth rate linear or faster or slower?)
Unicode has 17 planes, numbered 0 to 16. Planes 15 and 16 are reserved for private use, while planes 0-14 are available for public allocations.
Unicode maintains a series of roadmaps on their site showing the status of the various planes. Taking a quick look at them:
Plane 0 (the basic multilingual plane) is basically full at this point.
Plane 1 (the supplementary multilingual plane) is about half formally assigned, but much of the rest is "pencilled in" for various scripts.
Plane 2 (the supplementary ideographic plane) is getting pretty full with a series of "CJK unified ideographs extension"s, though there is a little bit of unallocated space (presumably because "CJK unified ideographs extension G" was too big to fit). There don't seem to be any tentative allocations in this plane.
Plane 3 (the tertiary ideographic plane) has "CJK unified ideographs extension G", which takes up a bit over a 16th of the plane, plus tentative allocations for various historic scripts used in China. Even taking the tentative allocations into account, it's less than half full.
Plane 14 (the supplementary special purpose plane) is mostly unused.
Planes 4 through 13 are completely unallocated.
So about two and a half planes worth of code space are formally assigned with tentative allocations taking that up to around 3 and a half. That is less than a quarter of the total encoding space.
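As a back-of-the-envelope check of those fractions in Python (the 2.5 and 3.5 plane figures are the rough estimates from this comment, not official numbers):

```python
PLANES = 17                  # planes 0 through 16
PLANE_SIZE = 0x10000         # 65,536 code points per plane

total = PLANES * PLANE_SIZE  # whole code space, U+0000 to U+10FFFF
assigned_planes = 2.5        # rough estimate: formally assigned
with_tentative = 3.5         # rough estimate: including tentative allocations

print(total)
print(f"{assigned_planes / PLANES:.0%}")
print(f"{with_tentative / PLANES:.0%}")
```

That works out to 1,114,112 code points in total, with roughly 15% formally assigned and about 21% once the tentative allocations are counted.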
> If we reach that limit, is it feasible to extend or does some combination of security/compatibility/stupidity make extension non-feasible?
Basically you would have to:
1. Define how to extend the existing encodings. For UTF-8 and UTF-32 that is trivial, since both encodings are artificially limited, but UTF-16 would be more difficult.
2. Update software that handles encoding conversion and validation to accept the new versions of the encoding.
3. Get that modified software deployed widely.
It's not impossible, but I think it's unlikely that a sufficiently compelling reason will come along to motivate the global software community to do it.
I think it's more likely that if/when encoding space becomes a problem, proposers will be encouraged to encode their scripts in a way that uses less encoding space, i.e. making more use of combining characters, variation selectors and so on.
Posted Sep 15, 2022 21:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)
We should just deprecate it and start moving away from it. UTF-8 is the best choice anyway.
Even Windows is supporting true UTF-8 APIs these days.
Posted Sep 15, 2022 22:50 UTC (Thu) by khim (subscriber, #9252) [Link]
> Even Windows is supporting true UTF-8 APIs these days.
Externally but not internally.
I guess the first step would be to encourage moving programs to UTF-8.
Because that's the most important step, anyway.
Posted Sep 20, 2022 4:03 UTC (Tue) by plugwash (subscriber, #29694) [Link]
The NT line of Windows has the idea that strings are sequences of 16-bit units as a core part of its design. Whenever Windows documentation refers to "Unicode", assume they mean "a series of 16-bit code units that are nominally UTF-16 but may not actually be valid UTF-16".
It's cool that Windows is letting you use UTF-8 as the "Ansi" codepage for an application now, but it's far too late to make any noticeable difference. It might help a few legacy applications that are still stuck in the dark ages add Unicode support more easily, but any modern Windows software will have long ago moved to the native UTF-16 "w" APIs and I really can't see them moving back. Also, the documentation seems unclear on the details, such as what happens when a filename is not valid UTF-16.
It's a similar story with Java, .NET and Qt: their core string type is a sequence of 16-bit units, and changing it would be a major compatibility break.
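The "not valid UTF-16" case can be illustrated in Python, whose str type also tolerates unpaired surrogates. This is only a sketch of the general problem: it shows Python's own codec behaviour, not what Windows does with such a filename, and the filename itself is made up.

```python
# A made-up name containing an unpaired high surrogate (not valid UTF-16).
lone = "file_\ud800.txt"

# A strict UTF-8 encoder has no valid way to represent it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print("strict UTF-8 refuses:", exc.reason)

# Python's "surrogatepass" handler writes the surrogate out as three bytes
# anyway, which round-trips but is not well-formed UTF-8.
raw = lone.encode("utf-8", errors="surrogatepass")
assert raw.decode("utf-8", errors="surrogatepass") == lone
print(raw)
```

Any UTF-8 view of a 16-bit-unit API has to pick between refusing such names, mangling them, or using a relaxed encoding along these lines.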
Posted Sep 16, 2022 17:35 UTC (Fri) by k8to (guest, #15413) [Link] (8 responses)
Somehow I imagine we're not actually creating many new scripts and characters and that the growth is largely just adding the already existing scripts, which implies the rate of growth will fall off. So I'm a bit suspicious the storage may never be filled.
Posted Sep 16, 2022 17:55 UTC (Fri) by sfeam (subscriber, #2841) [Link] (7 responses)
> Somehow I imagine we're not actually creating many new scripts and characters.
You underestimate the allure of cute emojis.
Posted Sep 18, 2022 8:34 UTC (Sun) by Sesse (subscriber, #53779) [Link] (6 responses)
Posted Sep 18, 2022 19:29 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (5 responses)
(There is no deprecation process for Unicode code points, so no other work would be required on the Consortium's side of things. Once a code point is in the standard, that's it, it's there forever. See https://www.unicode.org/policies/stability_policy.html)
Posted Sep 18, 2022 19:36 UTC (Sun) by Sesse (subscriber, #53779) [Link] (4 responses)
Posted Sep 18, 2022 19:37 UTC (Sun) by Sesse (subscriber, #53779) [Link] (1 responses)
Posted Sep 19, 2022 19:16 UTC (Mon) by NYKevin (subscriber, #129325) [Link]
Posted Sep 19, 2022 18:37 UTC (Mon) by plugwash (subscriber, #29694) [Link] (1 responses)
Posted Sep 19, 2022 19:14 UTC (Mon) by NYKevin (subscriber, #129325) [Link]