Re: Charsets revisited
Larry Masinter (masinter@parc.xerox.com)
Wed, 24 Jan 1996 15:35:36 PST
> I agree with this. However, it is true that (1) the URI wg no longer exists;
> (2) HTTP is the primary consumer/producer of URIs; and (3) a serious problem
> exists w.r.t. handling non-ASCII character data in URIs. This problem needs
> to be addressed very quickly, so what forum would be best to address it?
Glenn,
In this particular case, the problem is with section 8.2.1 of RFC
1866 (HTML):
> 1. The form field names and values are escaped: space
> characters are replaced by `+', and then reserved characters
> are escaped as per [URL]; that is, non-alphanumeric
> characters are replaced by `%HH', a percent sign and two
> hexadecimal digits representing the ASCII code of the
> character. Line breaks, as in multi-line text field values,
> are represented as CR LF pairs, i.e. `%0D%0A'.
This specification calls for the _characters_ of the form results to
be encoded in a URL. However, the URL encoding (specified in section
2.2 of RFC 1738 (URL)) is a way of encoding octets, not a way of
encoding characters.
It is this disconnect that leaves the ambiguity that we're worried
about here: when a user fills out a form and the values in that form
are transmitted, what is the character set used in the transmission?
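To make the ambiguity concrete, here is a minimal sketch in Python. The sample form value and the choice of ISO-8859-1 and UTF-8 as the two candidate charsets are assumptions made for illustration; neither RFC 1866 nor RFC 1738 names a charset for this step, which is exactly the gap.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical form value containing one non-ASCII character.
value = "café au lait"

# %HH escapes octets, so the characters must first be turned into octets.
# Two senders using different charsets produce two different URLs.
as_latin1 = quote_plus(value, encoding="iso-8859-1")
as_utf8 = quote_plus(value, encoding="utf-8")

print(as_latin1)  # caf%E9+au+lait
print(as_utf8)    # caf%C3%A9+au+lait

# Nothing in the URL says which charset was used, so a receiver that
# guesses wrong recovers different characters from the same octets.
misread = unquote_plus(as_utf8, encoding="iso-8859-1")
print(misread)    # cafÃ© au lait
```

Both encoded strings follow the quoted rule (space becomes `+', other non-alphanumerics become `%HH'), yet they differ, and the receiver has no way to tell them apart from the URL alone.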
As such, I think this issue must be addressed in the HTML working
group as a technical review issue for RFC 1866. As we've discussed in
numerous other venues, there is no easy solution to the problem in
general, although RFC 1867 (file-upload) gives some relief in many
instances.