Re: Charsets revisited

Larry Masinter (masinter@parc.xerox.com)
Wed, 24 Jan 1996 15:35:36 PST


> I agree with this.  However, it is true that (1) the URI wg no longer exists;
> (2) HTTP is the primary consumer/producer of URIs; and (3) a serious problem
> exists w.r.t. handling non-ASCII character data in URIs.  This problem needs
> to be addressed very quickly, so what forum would be best to address it?

Glenn,

In this particular case, the problem is with section 8.2.1 of RFC
1866 (HTML):

>       1. The form field names and values are escaped: space
>       characters are replaced by `+', and then reserved characters
>       are escaped as per [URL]; that is, non-alphanumeric
>       characters are replaced by `%HH', a percent sign and two
>       hexadecimal digits representing the ASCII code of the
>       character. Line breaks, as in multi-line text field values,
>       are represented as CR LF pairs, i.e. `%0D%0A'.

This specification calls for the _characters_ of the form results to
be encoded in a URL. However, the URL encoding (specified in section
2.2 of RFC 1738 (URL)) is a way of encoding octets, not a way of
encoding characters.
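
To make the distinction concrete, here is a minimal sketch (not taken
from either RFC) of the escaping rule quoted above, applied to octets.
The function name escape_octets and the two charsets used in the
example are assumptions chosen purely for illustration. The rule is
only well defined once some charset has turned the characters into
octets.

```python
def escape_octets(octets: bytes) -> str:
    """Escape octets roughly as the quoted form-encoding rule describes.

    Hypothetical helper for illustration: ASCII alphanumerics pass
    through, a space becomes '+', every other octet becomes %HH.
    """
    out = []
    for b in octets:
        if b < 0x80 and chr(b).isalnum():
            out.append(chr(b))
        elif b == 0x20:
            out.append("+")
        else:
            out.append("%%%02X" % b)
    return "".join(out)


# The same character yields different octets, and so different
# escaped text, depending on the charset chosen (an assumption here).
as_latin1 = escape_octets("\u00e9".encode("iso-8859-1"))
as_utf8 = escape_octets("\u00e9".encode("utf-8"))
```

With these inputs the single character U+00E9 escapes to one %HH
triplet under ISO-8859-1 and to two under UTF-8.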

It is this disconnect that leaves the ambiguity that we're worried
about here: when a user fills out a form and the values in that form
are transmitted, what is the character set used in the transmission?
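
The receiving side sees the same ambiguity in reverse. The sketch
below uses Python's urllib.parse.unquote_to_bytes to recover the
octets from an escaped value, then interprets them under two
charsets. The sample value and both charsets are assumptions for
illustration only; nothing in the transmitted data says which
reading the sender intended.

```python
from urllib.parse import unquote_to_bytes

# Escaped form value as a server might receive it (made-up sample).
received = "%C3%A9"

# Unescaping is unambiguous: it yields octets, not characters.
octets = unquote_to_bytes(received)

# Turning the octets into characters requires a charset that the
# transmission itself does not carry.
read_as_utf8 = octets.decode("utf-8")
read_as_latin1 = octets.decode("iso-8859-1")
```

Both decodings succeed without error, so the receiver cannot detect
a wrong guess from the data alone.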

As such, I think this issue must be addressed in the HTML working
group as a technical review issue for RFC 1866. As we've discussed in
numerous other venues, there is no easy solution to the problem in
general, although RFC 1867 (file-upload) gives some relief in many
instances.