Re: (MAITS.496) html, http, urls and internationalisation
Keld Jørn Simonsen (keld@dkuug.dk)
Sun, 28 Jan 1996 23:16:00 +0100
Francois Yergeau writes:
> >From: keld@dkuug.dk (Keld Jørn Simonsen)
> >Date: Sun, 28 Jan 1996 20:07:11 +0100
> >
> >Dan.Oscarsson@malmo.trab.se writes about URLs in more than ASCII.
> >
> >I would propose that URLs be written in the charset of the
> >document that references the url, possibly enhanced with
> >the extensions that we make to get further characters,
> >for example &a-ring; or &#xxxx;
>
> This is clearly insufficient. If I get the printer to put:
>
> "http://www.alis.com/~François"
>
> on my business card (there's a c-cedilla in there, for those who lost
> the 8th bit), how do you know what charset my business card uses?
> What bit pattern do you send my server to fetch that document?
I don't know which character encoding you use, but I know the
character is a c-cedilla. I can then send you a c-cedilla encoded in my
charset and labelled with my charset, and you can translate it
into your internal charset. It all works on the abstract character
level, as it should.
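
A minimal sketch of the labelled-charset exchange described above, in Python. The function name `transcode` and the choice of ISO-8859-1 on the sending side and UTF-8 on the receiving side are assumptions for illustration; the message does not name specific charsets for either end.

```python
def transcode(octets: bytes, sender_charset: str, receiver_charset: str) -> bytes:
    """Re-encode octets from the sender's labelled charset into the receiver's.

    The octets are first decoded to abstract characters, then encoded
    again, so neither side needs to share the other's internal charset.
    """
    text = octets.decode(sender_charset)   # octets -> abstract characters
    return text.encode(receiver_charset)   # abstract characters -> octets


name = "François"

# Assumed sender: encodes in ISO-8859-1 and labels the data as such.
sent = name.encode("iso-8859-1")

# Assumed receiver: uses UTF-8 internally and translates on arrival.
received = transcode(sent, "iso-8859-1", "utf-8")

print(sent)      # the c-cedilla is the single octet 0xE7
print(received)  # the c-cedilla is the two octets 0xC3 0xA7
```

The octets differ on each side, but both decode to the same abstract character, provided the label travels with the data.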
> There are not that many ways out of this problem: either URLs contain
> an indication of their charset, or the whole world agrees on a single
> implicit charset. Today, the world *has* agreed on the charset for
> octet values < 128, and this must be taken into account for a wider
> solution.
Which one is that, and which encoding form is it? Is it ISO-8859-1,
which is the HTML default, or ISO 10646 in one of its many forms:
UCS-2, UCS-4, UTF-16, UTF-8 or UTF-7? Or is it one of the many
other standards that are used today on the web, such as the other
8859 parts?
> Personally, I like the implicit UTF-8 idea: any non-ASCII character
> must be sent to a server as its UTF-8 encoding, either URL-encoded
> (the %XX hack) or not. A server receiving a non-ASCII octet (or its
> URL-encoding) must interpret it as part of the UTF-8 encoding of a
> character. No ambiguity, no need to tag the charset, and good
> compatibility with today's situation (ASCII only, in practice).
I use 8859-1 all the time here, and many other pages
in Europe contain 8859-1 characters. So please say
"iso-8859-1" instead of "ASCII" - this is actually also the HTML
standard. 8859-1 and UTF-8 conflict, as both use
the 8th bit.
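
To make the conflict concrete, here is a small Python sketch that URL-encodes the business-card path from earlier in the thread under both charsets, using the standard `urllib.parse` module. The variable names are illustrative only, and the two "server" decodings simulate what happens when client and server assume different charsets; they do not describe any particular server.

```python
from urllib.parse import quote, unquote

path = "/~François"

# The same abstract character yields different %XX sequences
# depending on the charset used before URL-encoding.
as_latin1 = quote(path, safe="/~", encoding="iso-8859-1")
as_utf8 = quote(path, safe="/~", encoding="utf-8")

print(as_latin1)  # /~Fran%E7ois
print(as_utf8)    # /~Fran%C3%A7ois

# A server assuming ISO-8859-1 misreads a UTF-8 request as two characters.
misread_as_latin1 = unquote(as_utf8, encoding="iso-8859-1")

# A server assuming UTF-8 finds an invalid sequence in an ISO-8859-1 request;
# with errors="replace" it becomes U+FFFD (the replacement character).
misread_as_utf8 = unquote(as_latin1, encoding="utf-8", errors="replace")

print(misread_as_latin1)  # /~FranÃ§ois
print(misread_as_utf8)
```

Without a label or a prior agreement, the receiving side cannot tell from the octets alone which of the two encodings was meant.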
>
> >I think that just using some kind of UCS would make it hard
> >when we have an environment where the html is in 8859-1 - that
> >would be mixing apples and oranges and thus very hard to maintain.
>
> For one thing, there is plenty of HTML *not* in 8859-1. Apart from
> that, most software (including HTML parsers) recognizes only ASCII as
> syntax-significant, either passing 8-bit characters untouched and
> uninterpreted or chopping off the 8th bit, damaging 8859-1 just as
> surely as UTF-8.
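
The claim about chopping off the 8th bit can be illustrated with a short Python sketch. The helper `strip_high_bit` is hypothetical and simply simulates a 7-bit transport; it does not model any specific piece of software.

```python
def strip_high_bit(octets: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the top bit of every octet."""
    return bytes(b & 0x7F for b in octets)


name = "François"

latin1_damaged = strip_high_bit(name.encode("iso-8859-1"))
utf8_damaged = strip_high_bit(name.encode("utf-8"))

print(latin1_damaged)  # b'Frangois'   (0xE7 becomes 0x67, 'g')
print(utf8_damaged)    # b"FranC'ois"  (0xC3 0xA7 become 0x43 0x27)
```

Both encodings lose the c-cedilla on such a channel; only the pure ASCII characters survive unchanged.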
I have been running in an 8-bit clean environment for years,
and do not recognize your description of a damaging environment.
I do agree that there are a lot of pages out there in 8-bit
charsets other than iso-8859-1. For those pages too, a standard
saying that URLs are always encoded in one fixed charset, say UTF-8,
would be mixing apples and oranges. It would place requirements
on URL writing at the wrong level of abstraction. URLs are specified
with abstract characters, so that they can be written in newspapers,
on business cards, etc.; we do not need to know or specify the
encoding (charset).
Keld