Re: A truly multilingual WWW

James Gosling (jag@scndprsn.Eng.Sun.COM)
Mon, 26 Dec 1994 14:51:53 +0800


I agree wholeheartedly that Unicode is The Answer.  (It may not be
perfect, but it solves a whole pile of problems.)  In the argument over
who should do the conversions, the client or the server, I would vote
for the client, if only because it limits translation to the very last
step, where the true target is known.
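
As a rough sketch of what "translate at the very last step" could look
like, assume the document travels as UTF-8 and the client alone knows
what its display can handle.  The function name and the choice of "?" as
a fallback are illustrative assumptions, not part of any proposal:

```python
def render_for_display(wire_bytes, display_encoding):
    """Convert a UTF-8 document to the client's local encoding.

    Hypothetical helper: the client, which knows its true target,
    performs the only conversion. Characters the display cannot
    represent are replaced with "?" instead of aborting.
    """
    text = wire_bytes.decode("utf-8")
    return text.encode(display_encoding, errors="replace")


wire = "café".encode("utf-8")
latin1_view = render_for_display(wire, "iso-8859-1")  # accent survives
ascii_view = render_for_display(wire, "ascii")        # accent is replaced
```

The server never needs to know which of the two displays it is talking to.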

Your paper deals mostly with transport issues (things that http
could help with), but there is a pile of related issues:

>    There are many issues facing a system claiming to be multilingual,
>    though all issues fall into one of 3 categories:
>     1. Data representation issues
>     2. Data manipulation issues
>     3. Data display issues

4. Data entry issues?
	What about entering data into forms?  This is essentially the dual of
	1 and 2.  Some of the issues are mostly browser specific (like "how
	do I type this Kanji character") but some are not.  Some are
	more universal, like "what character sets will the server
	accept in POSTed data?".  Some are a real quagmire, like "I
	expect a date to be entered here and its value returned to me
	using the ISO conventions".
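
To make the date case concrete, here is a minimal sketch in which the
user types a date in a local convention and the form returns it in ISO
form (YYYY-MM-DD).  The table of entry formats and the locale tags are
assumptions made up for illustration; a real browser would need far more
than three entries:

```python
from datetime import datetime

# Hypothetical per-locale entry conventions (illustrative only).
ENTRY_FORMATS = {
    "en_US": "%m/%d/%Y",
    "en_GB": "%d/%m/%Y",
    "de_DE": "%d.%m.%Y",
}


def to_iso_date(text, locale_tag):
    """Parse a locally formatted date and return it as YYYY-MM-DD."""
    fmt = ENTRY_FORMATS[locale_tag]
    return datetime.strptime(text, fmt).date().isoformat()


us = to_iso_date("12/26/1994", "en_US")
gb = to_iso_date("26/12/1994", "en_GB")
de = to_iso_date("26.12.1994", "de_DE")
```

All three entries yield the same value, which is the point: the server
sees one convention no matter how the date was typed.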

>   1.1 DATA REPRESENTATION ISSUES
>   
>    In general, the major data representation issues are character set
>    selection, and character set encoding.

Other things get represented in a document besides characters: dates
and measures, for example.  This is almost certainly outside of the
realm of http, but might fit in with html-42.0.  A hypothetical document
might contain:

	I vow to lose <measure 10 pounds> this year.
Which when read by someone in the US could come out something like:
	I vow to lose 10 pounds this year.
And when read by someone in Canada could come out something like:
	I vow to lose 4.54 kilograms this year.
Or when read by someone in the UK could come out something like:
	I vow to lose .71 stone this year.

			:-)
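
A minimal sketch of how a reader-side renderer might expand such a
measure.  The function name, the region codes, and the idea that the
measure arrives as a number of pounds are all assumptions for
illustration; only the three renderings themselves come from the
example above:

```python
KILOGRAMS_PER_POUND = 0.45359237  # exact by definition
POUNDS_PER_STONE = 14


def render_measure(pounds, region):
    """Render a weight given in pounds in the reader's customary unit."""
    if region == "CA":
        return f"{pounds * KILOGRAMS_PER_POUND:.2f} kilograms"
    if region == "UK":
        return f"{pounds / POUNDS_PER_STONE:.2f} stone"
    # Hypothetical default: leave the measure as written.
    return f"{pounds:g} pounds"


for region in ("US", "CA", "UK"):
    print(f"I vow to lose {render_measure(10, region)} this year.")
```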

>     3.2.1 Unicode incorporation architecture
>    
>    In order to make multilingual support as painless as possible, it is
>    proposed that all HTTP servers for multilingual documents *should* be
>    able to convert documents from the local character set encoding to
>    UCS-2, UTF-8, and UTF-7 (16, 8 and 7 bit encodings of Unicode). It is
>    also proposed that all HTTP clients *should* be able to parse UCS-2,
>    UTF-8 and UTF-7. It is *recommended* that browsers allow the data to be
>    saved as UTF-7, UTF-8, or UCS-2 (similar to the current ftp
>    interface). If possible, a browser *should* also allow the data to be
>    saved in the local character set encoding, but that might not always
>    be possible (for example, saving a document containing Arabic on an
>    ASCII based system). Documents sent from servers would then use a
>    content type of:
> 
>      Content-Type: text/...; charset=UNICODE-1-1-UTF-7
>      Content-Type: text/...; charset=UNICODE-1-1-UTF-8
>      Content-Type: text/...; charset=UNICODE-1-1-UCS-2
> 
>    Though UTF-8 and UCS-2 will need some additional encoding applied to
>    them in order to be strictly MIME compliant. An alternative is to use
>    an application/* type specifier instead.

But http isn't strictly MIME compliant.  In particular, the full 8-bit
nature of UTF-8 fits in well with http.  While UTF-7 makes sense in the
MIME mail world of corrosive transport mechanisms, it is not needed
in http.  For simplicity I'd recommend a really limited set of
allowed encodings: ISO-8859-1 and UTF-8.
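
To illustrate, here is a sketch of a client that accepts only those two
encodings.  The charset labels "iso-8859-1" and "utf-8" and the fallback
to ISO-8859-1 when no charset is given are assumptions of this sketch
(the paper itself proposes labels of the form UNICODE-1-1-UTF-8):

```python
# Map of accepted charset labels to Python codec names (assumed labels).
ALLOWED = {"iso-8859-1": "latin-1", "utf-8": "utf-8"}


def decode_body(content_type, body):
    """Decode a response body according to its Content-Type charset."""
    charset = "iso-8859-1"  # assumed default when no charset is present
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset":
            charset = value.strip().strip('"').lower()
    if charset not in ALLOWED:
        raise ValueError(f"unsupported charset: {charset}")
    return body.decode(ALLOWED[charset])


# UTF-8 uses bytes with the high bit set; UTF-7 stays within 7 bits.
utf8_bytes = "é".encode("utf-8")
utf7_bytes = "é".encode("utf-7")
uses_high_bit = any(b >= 0x80 for b in utf8_bytes)
stays_in_7_bits = all(b < 0x80 for b in utf7_bytes)
```

Since http carries 8-bit data without damage, the extra armor that UTF-7
provides buys nothing here.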

>     3.2.4 Presentational hints for Unicode
>     
>    While Unicode certainly serves as an excellent lowest common
>    denominator for multilingual documents, systems using Unicode require
>    more information than that contained in the character codes
>    themselves
....    
>    However, high-level tag use (eg. defining them in a DTD) fails for
>    the following reasons:
>     1. It is not transparent. The application processing the data stream
>        must be able to parse the tags, even if it can not do anything
>        with them. This necessarily complicates the parser.
>     2. There are probably a huge number of presentation hints that could
>        be used, and the list is dynamic as societal trends tend to alter
>        languages. Good examples can be found by comparing almost any
>        current written form of a language to that used 100 years ago.
>        Some languages have even changed dramatically in the last 50
>        years.

These problems affect even low-level tags such as those you proposed.
This whole area should be left to standards above http.

>       Method 1: Code-based presentation hints

The big problem with the use of the private use area in this way is that
it is "syntax without semantics".  These numbers are meaningless unless there
is some mechanism for defining how they should be interpreted.  Something
higher-level is required if, for example, a document using one of these
extended characters is ever to be displayed.
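
A small illustration of the point, using Python's standard unicodedata
module (the helper name is made up for this sketch).  An assigned
character carries a name and a meaningful category; a private use code
point carries only the category "Co" and no name, so a receiver has
nothing to go on without some outside agreement:

```python
import unicodedata


def describe(ch):
    """Return (general category, character name or None) for a character."""
    return unicodedata.category(ch), unicodedata.name(ch, None)


assigned = describe("A")          # an ordinary assigned character
private_use = describe("\ue000")  # first code point of the private use area
```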