[aprssig] APRS Message character sets ?
Matti Aarnio oh2mqk at sral.fi
Thu Jul 10 11:05:54 UTC 2008
- Previous message: [aprssig] APRS Message character sets ?
- Next message: [aprssig] APRS Message character sets ?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, Jul 09, 2008 at 06:41:35PM -0700, Stephen H. Smith wrote:
> From: "Stephen H. Smith" <wa8lmf2 at aol.com>
>
> Matti Aarnio wrote:
> > I am asking the aprssig to form a consensus of what shall be the canonic
> > message character-set and encoding beyond US-ASCII ?
> >
> >
> > The APRS specification 1.0.1 defines that used character set is ASCII in
> > many places where it defines charactersets at all.
> >
> > This is quite understandable in view of TNC2 interfaced systems with
> > 7E1 type communication channel -- which is unable to pass thru any extended
> > characters.
> >
> 1) TNC2-type systems are normally initialized to 8-N-1 (not the
> default 7-E-1) for APRS with the TNC commands ...
> (or equivalent for other TNCs) so that they WILL be transparent to 8-bit
> characters. This is required so that any arbitrary character values
> created by Mic-E encoding will pass. In fact I have modified the TNC2
> firmware Ver 1.1.9 to DEFAULT to 8-N-1 bit transparent mode so no init
> is even required.

My point is more along the lines that even an OLD TNC2 will work when
commanded into KISS mode. I have enough experience with "sanitized 8N1
text monitor" modes to know that they are a very bad idea, so put KISS
mode into use. (And running a bi-directional igate on such a monitor mode
is fraught with unreliability.)

> 2) Since the APRS network infrastructure is heavily based on legacy
> 1980s-1990s packet hardware that doesn't support 16-bit/character
> encoding (i.e. UniCode), I wouldn't hold my breath for 16-bit support
> any time soon.

You are confusing user interfaces with network infrastructure.
You are also confusing a character set with its encoding.

The network infrastructure handles just sequences of 8-bit bytes that
are presentable as text lines, meaning that the byte codes 0x0d 0x0a
designate end of line. Even UTF-16 characters are just pairs of such
8-bit bytes that are taken together as the encoding of a single character
of the Unicode codespace.
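The distinction between a character set and its encoding can be sketched in a few lines of Python. This is only an illustration, not anything taken from an APRS implementation: the sample letter and the names `sample`, `encodings` and `hexdump` are assumptions chosen for the example. It shows one character turning into three different byte sequences, and what happens when the receiver guesses the encoding wrong.

```python
# Illustrative sketch only: one character, three encodings.
# "sample", "encodings" and "hexdump" are hypothetical names for this example.
sample = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, common in Finnish

encodings = ["iso8859_15", "utf_8", "utf_16_le"]


def hexdump(data: bytes) -> str:
    """Render a byte sequence as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


# The transport only ever sees these bytes, never "characters".
encoded = {name: sample.encode(name) for name in encodings}
for name, data in encoded.items():
    print(f"{name:11s} -> {hexdump(data)}")

# A receiver that assumes ISO-8859-15 for bytes that were really UTF-8
# gets two unrelated characters instead of one.
misread = encoded["utf_8"].decode("iso8859_15")
print(repr(misread))
```

The bytes pass through any 8-bit transparent channel unchanged in all three cases; only the receiving application's assumption about the encoding decides what appears on the screen.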

And by the way, US people have traditionally ignored the issue of
international character sets completely. When one does not need
characters outside US-ASCII, all is fine and dandy with limiting
everything to US-ASCII.

Europe is different, Asia even more so. We Finns actively need 4
characters outside US-ASCII, and a few more infrequently. We can do
our things with ISO-8859-15 (which includes the Euro currency symbol.)

  ISO 8859-1   West European languages (Latin-1)
  ISO 8859-2   Central and East European languages (Latin-2)
  ISO 8859-3   Southeast European and miscellaneous languages (Latin-3)
  ISO 8859-4   Scandinavian/Baltic languages (Latin-4)
  ISO 8859-5   Latin/Cyrillic
  ISO 8859-6   Latin/Arabic
  ISO 8859-7   Latin/Greek
  ISO 8859-8   Latin/Hebrew
  ISO 8859-9   Latin-1 modification for Turkish (Latin-5)
  ISO 8859-10  Lappish/Nordic/Eskimo languages (Latin-6)
  ISO 8859-11  Latin/Thai
  ISO 8859-13  Baltic Rim languages (Latin-7)
  ISO 8859-14  Celtic (Latin-8)
  ISO 8859-15  West European languages (Latin-9)
  ISO 8859-16  Romanian (Latin-10)

All are sufficient for their relevant sub-areas, but when you do not
carry info about which character set is in use, Greek text appears
on our screens as a bunch of accented Latin letters.

> Not to mention that the most widely used APRS
> application (UI-View) is now frozen in time and unchangeable.

I do think that UI-View is dead software that should be deprecated,
and as soon as somebody makes modern software of similar quality
THAT GETS SUFFICIENT PROMOTING, it will be irrelevant as far as
message encodings go. (No, I do not write Windows software.)

The real problem is that there is no specification of what to do
when one needs to go beyond US-ASCII, just dozen(s) of ad-hoc
solutions, all incompatible:
  - sending PC DOS characters
  - sending Windows characters
  - sending ISO-8859-XX characters
  - sending KOI8-R encoding characters
  - sending Unicode in UTF-8 sequences
  - sending Unicode in UTF-16 sequences

Most surprises are caused by UTF-16 encoders sending US-ASCII text;
every second byte has the value 0x00. That particular byte value causes
tons of problems for C programmers who treat it as the end-of-string
token. (When coding without explicit string length info.)

People being people, they WILL use APRS messages to send native
language texts, and that will often mean going outside US-ASCII.

For "message" messages I would prefer some updated standard,
preferably an ASCII compatible one, which in practice means UTF-8.

In other types of messages there are some comment fields, which
I have seen(*) carrying characters outside ASCII -- and those pesky
zero bytes.

*) Seen on APRS-IS traffic dumps written to pick _all_ of the traffic.
   Several such occur around the network at any hour.

> --
> Stephen H. Smith wa8lmf (at) aol.com
> Home Page: http://wa8lmf.com --OR-- http://wa8lmf.net

73 de Matti, OH2MQK
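
Two of the failure modes described in this message can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not code from any APRS software: `c_string_view` is a hypothetical helper that imitates a C routine stopping at the first 0x00 byte, and the sample texts are made up for the example.

```python
# Illustrative sketch only; helper name and sample texts are assumptions.

def c_string_view(buf: bytes) -> bytes:
    """Return what a C routine that stops at the first 0x00 byte would see."""
    return buf.split(b"\x00", 1)[0]


ascii_text = "hello"

# UTF-16 (little endian, no byte order mark): every second byte is 0x00.
utf16 = ascii_text.encode("utf_16_le")
print(utf16)                 # b'h\x00e\x00l\x00l\x00o\x00'
print(c_string_view(utf16))  # b'h'

# UTF-8 leaves US-ASCII text byte-for-byte unchanged, so no zero bytes appear.
utf8 = ascii_text.encode("utf_8")
print(utf8 == ascii_text.encode("ascii"))  # True
print(c_string_view(utf8))                 # b'hello'

# Greek text sent as ISO-8859-7 but displayed as ISO-8859-15:
# the bytes are intact, yet the reader sees accented Latin letters.
greek = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"  # Greek "kalimera"
greek_bytes = greek.encode("iso8859_7")
shown_as_latin9 = greek_bytes.decode("iso8859_15")
print(shown_as_latin9)
```

The first half shows why an ASCII-compatible encoding such as UTF-8 is the least disruptive choice for software that treats 0x00 as a terminator; the second half shows why bytes alone are not enough when the character set in use is not agreed upon or signalled.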
