Closed Bug 211343 Opened 22 years ago Closed 16 years ago

Need a cross-platform way to detect nbsp

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

RESOLVED INCOMPLETE

People

(Reporter: akkzilla, Assigned: smontagu)

Details

Inspired by the patch to bug 14871: The following sequence appears in several places in mozilla:

#ifdef XP_MAC
#define IS_NBSP_CHAR(c) (((unsigned char)0xca)==(c))
#else
#define IS_NBSP_CHAR(c) (((unsigned char)0xa0)==(c))
#endif

In lots of other places, we just check for 0xa0 and not for 0xca. (Of course, in still other places we check for space and don't even bother to check for nbsp at all.)

We need

(1) A reliable call (macro or otherwise) that works on any platform and is callable from anywhere -- perhaps it just expands to the lines above -- so programmers don't have to know to duplicate that code all over. I can attach a patch for this, but I'd appreciate a suggestion as to where it best belongs, and whether in string or xpcom.

(2) Extra credit: nsString nbsp-aware APIs (last I checked, the various space detection methods don't treat nbsp as whitespace).

Cc'ing I18n people -- is the macro above sufficient? Is anything else needed? Are there other spaces besides ' ' and nbsp (and perhaps tab) which need to be tested for in cases where we really need to check whitespace? (Of course, whitespace detection isn't a replacement for nsIWordBreaker/nsILineBreaker.)
just random thoughts:

zwj   zero width joiner
zwnj  zero width non-joiner
rs    record separator
us    unit separator

but what's the purpose of this check?

(I picked these from: [w2k] start>run>context menu>insert unicode control character)
what the heck is nbsp? I mean, I know it as a HTML DTD macro, but not as a character... care to explain a bit? I'm not sure nsString is appropriate for this because we'd want consistent behavior across platforms, but nsCRT might be good...
In the dom, strings of characters often include 0xa0 (on windows and linux; apparently it's 0xca on mac?) -- my understanding is that the parser is mapping &nbsp; in input html into that character in the dom, and in addition, surprisingly often we get that character embedded in files or coming in pastes or drops from other programs.
A non-breaking space is a valid Unicode character, and is entirely separate from the regular space. Its Unicode code point is 0xA0, so according to comment #3 win/linux are correct. DOMStrings, which are Unicode, should use 0xA0. Perhaps the native Mac charset uses 0xca, but that should be handled during drawing or somewhere else late in the game.
I've only looked at the relevant code rather sketchily, but it seems we use this on text fragments that have been converted from Unicode into a platform charset. I think the current #ifdef is just wrong, because not even all Mac charsets have no-break space at codepoint 0xca (e.g. MacJapanese where 0xca is HALFWIDTH KATAKANA LETTER HA), and there are other charsets that have it at other codepoints (e.g. KOI8-R at 0x9a). Could we hack the converters to fold all Unicode space characters into 0x20, or do it in a pre-conversion pass?
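To illustrate why a single compile-time #ifdef cannot be right, here is a minimal sketch (assuming C++17) in which the no-break space byte is looked up per charset at run time. The names `NbspByteFor` and `IsNbspByte` and the charset label strings are hypothetical and are not Mozilla APIs. The 0xCA and 0x9A values come from the comments in this bug; 0xA0 for ISO-8859-1 and windows-1252 is the standard assignment.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not a Mozilla API): returns the single byte that
// encodes U+00A0 NO-BREAK SPACE in the named charset, or std::nullopt when
// this sketch does not know of one for that charset.
inline std::optional<std::uint8_t> NbspByteFor(std::string_view charset) {
  if (charset == "ISO-8859-1" || charset == "windows-1252") {
    return std::uint8_t{0xA0};
  }
  if (charset == "macintosh") {  // MacRoman
    return std::uint8_t{0xCA};
  }
  if (charset == "KOI8-R") {
    return std::uint8_t{0x9A};
  }
  // Anything else, e.g. MacJapanese, where 0xCA is a katakana letter.
  return std::nullopt;
}

// True only if `byte` is the no-break space of `charset`.
inline bool IsNbspByte(std::uint8_t byte, std::string_view charset) {
  const std::optional<std::uint8_t> nbsp = NbspByteFor(charset);
  return nbsp.has_value() && *nbsp == byte;
}
```

The check takes the charset as an argument, so the same byte 0xCA is treated as no-break space for MacRoman but not for a charset the table does not list.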
Just to add to the complication: over in bug 14871 sfraser says that nbsp is not in fact different on Mac. So what is 0xca? Is it something that gets pasted erroneously from some mac app? Something specific to the spell checker? Should we even have it in the spellchecker code? Is it a problem that the spellchecker code isn't checking for 0xa0 on mac?
This

#ifdef XP_MAC
#define IS_NBSP_CHAR(c) (((unsigned char)0xca)==(c))
#else
#define IS_NBSP_CHAR(c) (((unsigned char)0xa0)==(c))
#endif

only appears in one place in the mozilla code: the spellcheck wrapper code that rods checked in.

On Mac, you get a 0xCA character when you type option-space in text editors. We probably need to filter this, converting it to an NBSP at places where we convert platform-charset data to DOM text. I think this character should never make it into the DOM.
The macro is used in the spellchecker in a post-conversion pass after converting from Unicode to the spellchecker's charset, so we won't lose any performance by changing to a pre-conversion pass that does

<pseudocode>
for PRUnichar in aStr
  if PRUnichar == 0x00A0 // maybe check for other whitespace characters as well
    PRUnichar = 0x0020
</pseudocode>

If non-break space appeared in the DOM as anything but 0x00A0, it would be a serious error, so (1) in comment 0 is unnecessary.

Generalizing whitespace detection as in (2) is a tricky issue, because not all whitespace characters have the semantics of whitespace in all contexts.
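The pre-conversion pass in the pseudocode above could look roughly like the following. This is only a sketch: it uses `std::u16string` and `char16_t` as stand-ins for the Mozilla string class and `PRUnichar`, and `FoldNbspToSpace` is a hypothetical name, not an existing function. It folds only U+00A0, leaving the question of other whitespace characters open.

```cpp
#include <string>

// Hypothetical helper: replace every U+00A0 NO-BREAK SPACE with U+0020 SPACE,
// in place, before the text is converted to the spellchecker's charset.
inline void FoldNbspToSpace(std::u16string& text) {
  for (char16_t& ch : text) {
    if (ch == u'\u00A0') {
      ch = u' ';
    }
  }
}
```

Because the fold runs on Unicode text, the code after conversion only ever needs to look for 0x20, whatever the target charset is.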
over to smontagu, who seems to have a better handle on the i18n implications (if it's charset-dependent, then it's probably not my area!)
Assignee: alecf → smontagu
QA Contact: scc → string
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → INCOMPLETE
Component: String → XPCOM