Closed Bug 577945 Opened 14 years ago Closed 13 years ago

Mapping of 0x7F..0x9F in ISO-8859 encodings to U+7F..U+9F vs. U+FFFD

Categories

(Core :: Internationalization, defect)

RESOLVED FIXED
mozilla5

People

(Reporter: pub-mozilla, Assigned: Ms2ger)

Details

Attachments

(1 file, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16
Build Identifier: 

Firefox currently maps the range 0x7F..0x9F in most ISO-8859 encodings to U+FFFD.

Safari, Opera and IE generally map to U+7F..U+9F instead, and Firefox actually does the same for ISO 8859/10 and ISO 8859/16.

RFC1345 explicitly suggests that the ISO-8859 character sets should be supplemented with U+0..U+9F:

      Often the ISO registration number does not cover all the
   codes of a character set in use, but for instance only the graphical
   characters, where another ISO registration number covers the control
   characters; in the case of the 8-bit character sets the ISO
   registration only covers the upper graphical characters (GR).  The
   ISO registration number is here taken to indicate the full coded
   character set including control characters and lower half of the
   graphical characters, normally ISO 6429 and ASCII, respectively.
   [...]
      If the coded character set is
   a 96-character set, it is tabled with the relevant GL set (normally
   ISO-IR-6) and with ISO 6429 as C0 and C1 

RFC1345 follows this practice in its encoding tables.

I think Firefox should map 0x7F..0x9F to U+7F..U+9F.

(If I have overlooked a good reason for the current mapping of 0x7F..0x9F to U+FFFD, that would presumably apply to ISO 8859/10 and ISO 8859/16 as well.)
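
To make the two candidate mappings concrete, here is a minimal sketch using Python's standard codecs rather than Firefox's decoders. It assumes Python's iso8859_2 codec, which passes 0x7F..0x9F through to U+007F..U+009F; the helper names decode_passthrough and decode_with_replacement are hypothetical and exist only for this illustration.

```python
# Illustration with Python's built-in codecs, not Firefox's decoder code.
C1_RANGE = bytes(range(0x7F, 0xA0))  # the bytes 0x7F..0x9F inclusive


def decode_passthrough(data: bytes, encoding: str = "iso8859_2") -> str:
    """Proposed behavior: each byte 0x7F..0x9F becomes the same code point."""
    return data.decode(encoding)


def decode_with_replacement(data: bytes, encoding: str = "iso8859_2") -> str:
    """Behavior described for Firefox at the time: U+FFFD for 0x7F..0x9F."""
    text = data.decode(encoding)
    return "".join("\ufffd" if 0x7F <= ord(ch) <= 0x9F else ch for ch in text)


if __name__ == "__main__":
    for name, fn in (("passthrough", decode_passthrough),
                     ("replacement", decode_with_replacement)):
        decoded = fn(b"\x85")
        print(name, "U+%04X" % ord(decoded))
```

The replacement variant is applied after decoding only to keep the sketch short; a real decoder would make this choice in its mapping table.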

Reproducible: Always
The rationale for mapping 0x7F..0x9F to U+FFFD is that U+007F..U+009F are invalid in HTML.

I agree that we should be consistent. We should also check if that is still true in HTML5.
Thanks for explaining the rationale. I had forgotten about that particular HTML validity constraint.

HTML5 disallows U+007F..U+009F, but also non-whitespace in the range U+0000..U+001F, surrogates U+FDD0 to U+FDEF and non-characters U+*FFF{E,F}. As far as I can tell, there is no difference between C0 and C1 control characters according to HTML5. The HTML5 parser throws an error for all disallowed characters, but only substitutes U+FFFD for U+0000 and surrogates. (HTML5 did at one point make an attempt to avoid control characters in the DOM, but this has long since been abandoned.)

It can be argued that HTML validity concerns should not influence encoding vectors that are used for plain-text documents as well. The validity argument is also somewhat undermined by the fact that the same characters encoded as UTF-8 will not be replaced by U+FFFD in Firefox.

SUMMARY:
This issue may have looked a bit different a couple of years ago. I now think the right thing to do is to retain characters U+007F..U+009F.

ADDENDUM:
This issue also affects the Windows-125* and Windows-874 encodings. Currently, Firefox maps otherwise undefined byte values in the range 0x7F..0x9F to U+007F..U+009F for Windows-1252, but to U+FFFD for the others. It would be reasonable to map to U+007F..U+009F for all of them, both for consistency with the ISO-8859 encodings and in line with what Microsoft products actually do and what Microsoft has published as 'best-fit' mappings. This also seems to be the only consistent solution if (as appears to be the case) mapping to U+FFFD is not an option for Windows-1252.
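
As a rough sketch of the 'best-fit' idea for the Windows code pages, the following uses Python's cp1252 codec, which leaves a few bytes such as 0x81 undefined and raises an error for them by default. The error handler name c1_passthrough and the helper decode_windows are hypothetical names chosen for this example; they are not part of any browser or of Python itself.

```python
import codecs


def c1_passthrough(exc):
    """Map undecodable bytes in 0x7F..0x9F to the code point of the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x7F <= b <= 0x9F for b in bad):
            # Return the replacement text and the position to resume decoding at.
            return "".join(chr(b) for b in bad), exc.end
    raise exc


# Register the handler so it can be selected through the errors= argument.
codecs.register_error("c1_passthrough", c1_passthrough)


def decode_windows(data: bytes, encoding: str = "cp1252") -> str:
    """Decode a Windows code page, keeping undefined 0x7F..0x9F bytes as controls."""
    return data.decode(encoding, errors="c1_passthrough")


if __name__ == "__main__":
    # 0x80 is defined in Windows-1252 (euro sign); 0x81 is not.
    print([hex(ord(ch)) for ch in decode_windows(b"\x80\x81")])
```

Defined bytes keep their usual mapping; only the undefined slots fall back to the identity mapping, which is the consistent behavior argued for above.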
(In reply to comment #2)
> surrogates U+FDD0 to U+FDEF

Correction: surrogates U+D800 to U+DFFF and non-characters U+FDD0 to U+FDEF
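
With that correction applied, the classification described in comment #2 can be sketched as below. This is only a reading of the ranges listed in this thread, not a transcription of the HTML5 spec; the function names are hypothetical, and the set of whitespace controls (tab, LF, FF, CR) is an assumption about what "whitespace" means here.

```python
# C0 controls treated as whitespace (assumed: tab, LF, FF, CR).
HTML_WHITESPACE_C0 = {0x09, 0x0A, 0x0C, 0x0D}


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_disallowed(cp: int) -> bool:
    """Code points the thread describes as triggering a parse error."""
    if cp <= 0x1F:
        return cp not in HTML_WHITESPACE_C0
    return 0x7F <= cp <= 0x9F or is_surrogate(cp) or is_noncharacter(cp)


def is_replaced(cp: int) -> bool:
    """Code points the thread says the parser replaces with U+FFFD."""
    return cp == 0x0000 or is_surrogate(cp)


if __name__ == "__main__":
    for cp in (0x0000, 0x000A, 0x0085, 0xD800, 0xFDD0, 0x1FFFF):
        print("U+%04X" % cp, is_disallowed(cp), is_replaced(cp))
```

Under this reading, U+0085 is disallowed but not replaced, which is the gap between validity and decoder behavior that the bug is about.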
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #1)
> The rationale for mapping 0x7F..0x9F to U+FFFD is that U+007F..U+009F are
> invalid in HTML.
I think it should be a HTML parser's job, not a decoder.

> I agree that we should be consistent. We should also check if that is still
> true in HTML5.
In HTML5, ISO-8859-1 decoder should never be called at all because of "misinterpret as Windows-1252 for compatibility".
http://www.w3.org/TR/html5/parsing.html#character-encodings-0
(In reply to comment #4)
> In HTML5, ISO-8859-1 decoder should never be called at all because of
> "misinterpret as Windows-1252 for compatibility".

Indeed, but ISO 8859/1 is only one of many ISO-8859 encodings.
Per the HTML5 parsing rules, only U+0000 and orphaned surrogate code points are replaced by U+FFFD.
http://www.w3.org/TR/html5/parsing.html#preprocessing-the-input-stream
Note that "parse error" doesn't mean the offending characters are replaced by U+FFFD. For example, "misinterpreted for compatibility" is also a parse error.
HTML 4 error handling was "do whatever the user-agent thinks reasonable".
http://www.w3.org/TR/html4/appendix/notes.html#h-B.1
> Since user agents may vary in how they handle error conditions, authors
> and users must not rely on specific error recovery behavior.
HTML5 error handling, however, is not left to the user agent; it is very strictly defined. So I have to conclude that the current behavior violates the HTML5 spec.
(In reply to comment #4)
> (In reply to comment #1)
> > The rationale for mapping 0x7F..0x9F to U+FFFD is that U+007F..U+009F are
> > invalid in HTML.
> I think it should be a HTML parser's job, not a decoder.

If we want this bug to be fixed, the code change should definitely go into the decoders (and not to the parser) due to performance and code complexity concerns.
Attached patch Patch v1 (obsolete) — Splinter Review
Anne van Kesteren's research (<https://bitbucket.org/annevk/webencodings/>) suggests that all other browsers already do this. I think we should match.
Assignee: smontagu → Ms2ger
Status: NEW → ASSIGNED
Attachment #501082 - Flags: review?
Attachment #501082 - Flags: review? → review?(smontagu)
Attachment #501082 - Flags: review?(smontagu) → review+
Attachment #501082 - Flags: approval2.0?
Comment on attachment 501082 [details] [diff] [review]
Patch v1

I do too, but I don't want the risk now. This should wait for the next cycle.
Attachment #501082 - Flags: approval2.0? → approval2.0-
Makes sense.
Depends on: post2.0
Whiteboard: [needs landing]
Version: unspecified → Trunk
Attachment #501082 - Attachment is obsolete: true
Keywords: checkin-needed
Whiteboard: [needs landing]
Whiteboard: [not-ready-for-cedar]
Whiteboard: [not-ready-for-cedar]
http://hg.mozilla.org/projects/cedar/rev/2cea7ec12733
Flags: in-testsuite+
Whiteboard: fixed-in-cedar
Keywords: checkin-needed
http://hg.mozilla.org/mozilla-central/rev/2cea7ec12733
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
No longer depends on: post2.0
Resolution: --- → FIXED
Whiteboard: fixed-in-cedar
Target Milestone: --- → mozilla2.2