[aprssig] APRS Message character sets ?
Matti Aarnio oh2mqk at sral.fi
Wed Jul 9 22:37:11 UTC 2008
- Previous message: [aprssig] Xastir Development Snapshot
- Next message: [aprssig] APRS Message character sets ?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I am asking the aprssig to form a consensus on what the canonical message character set and encoding beyond US-ASCII should be.

The APRS specification 1.0.1 says, in most of the places where it defines a character set at all, that the character set used is ASCII. This is quite understandable in view of TNC2-interfaced systems with a 7E1-type communication channel, which cannot pass any extended characters. However, I have reason to believe that igate systems with such connections are rare.

There are large parts of the world where ASCII is insufficient, and thus a plain TNC2 in text monitor mode is not used (not to mention its tendency to break MIC-e frames as well). Instead we use KISS-interfaced modems, which are able to pass arbitrary binary content.

What people are actually using for messaging between each other apparently includes:

- US-ASCII
- PC DOS charset
- PC Windows charsets (many)
- Many ISO-8859-* variations
- UTF-8
- UTF-16

UTF-16 is much used by AGWtracker; the charsets with 8-bit characters are abundant in other software, depending on which charset the user has configured the keyboard to use (or which one the software vendor has decided is the current default).

This babel of charsets makes reading messages with characters outside US-ASCII somewhat challenging. When both users have e.g. UI-View on Windows, both systems behave the same way: Windows Latin charsets. Take some Linux user into the mix, and they do not see A-umlauts at all, because on Windows the code point is 0x84 (as I recall), while on ISO-8859-* that code point is "reserved unallocated".

Indeed there are about a dozen ISO 8859 charsets, all for widely separate languages with widely separate glyph sets. All of them have one thing in common, though: US-ASCII forms code points 0 to 127 in all of them.
Some examples of these at code point 0xAF:

  iso-8859-1: MACRON
  iso-8859-2: LATIN CAPITAL LETTER Z WITH DOT ABOVE
  iso-8859-7: HORIZONTAL BAR
  iso-8859-9: MACRON
  KOI8-R:     FORMS DOUBLE UP AND LEFT (Cyrillic)
  etc.

(Pick the Windows code pages, and you will have a really merry mess...)

On the wire, UTF-8 would be compatible with US-ASCII, which is its subset for code points 0 through 127. The differences appear at code points 128-255 (and beyond).

There is lots and lots of old software running, but PC software is simpler to update than hardware -- we are still making new hardware with ancient Bell-202 modems on it just because existing systems use that modulation.

For example, UTF-16 has the following nasty habit when encoding ASCII text: every character is represented with two bytes.

  If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.

(If U is bigger, the encoding calls for two integers.) The canonical wire-order presentation of the 16-bit unsigned integer is to put the high byte first, then the low byte. (About UTF-16, see: http://www.ietf.org/rfc/rfc2781.txt )

Encoding the US-ASCII text "message" results in the byte sequence:

  "\000m\000e\000s\000s\000a\000g\000e"

that is, every second byte is NUL, which in carelessly written C programs means that the string ends at the first NUL byte, before the "m" character.

What is more annoying is the wasted space. For west European languages, most of the base alphabet used is 'A' through 'Z' and fancy things like A-umlauts are in minority use, so UTF-16 is wasteful -- not to mention the NUL-byte hazard. For languages where ASCII characters are rare, like Cyrillic, Greek etc., things are different.

In my opinion we are already long overdue in specifying the correct way to extend the character set used beyond ASCII. Ad-hoc de-facto extensions are already causing interoperability problems.

73 de Matti, OH2MQK
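The two effects described above can be reproduced with a minimal Python sketch. It is only an illustration, not anything from the APRS specification: the codec names are the ones Python's standard library happens to use, and c_string_view is a hypothetical helper that merely imitates what a NUL-terminated C string routine would see.

```python
import unicodedata

# One byte value, decoded under several legacy 8-bit charsets.
# The codec names are Python's; the printed character names come from
# Python's Unicode database.
raw = bytes([0xAF])
for charset in ("iso-8859-1", "iso-8859-2", "iso-8859-7", "iso-8859-9", "koi8-r"):
    ch = raw.decode(charset)
    print(f"{charset:11} U+{ord(ch):04X} {unicodedata.name(ch)}")

# The same ASCII text in UTF-8 and in big-endian UTF-16 (high byte first,
# no byte-order mark).
text = "message"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")


def c_string_view(data: bytes) -> bytes:
    """Hypothetical helper: the bytes a NUL-terminated C string routine
    would treat as the string, i.e. everything before the first NUL."""
    return data.split(b"\x00", 1)[0]


print(len(utf8), utf8)                       # 7 bytes, identical to US-ASCII
print(len(utf16), utf16)                     # 14 bytes, every other byte NUL
print(c_string_view(utf8), c_string_view(utf16))
```

For the ASCII-only text the UTF-8 form is byte-for-byte the US-ASCII form, while the UTF-16 form is twice as long and looks empty to a routine that stops at the first NUL byte.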
