Bug 211343
Opened 22 years ago
Closed 16 years ago
Need a cross-platform way to detect nbsp
Categories
(Core :: XPCOM, defect)
RESOLVED
INCOMPLETE
People
(Reporter: akkzilla, Assigned: smontagu)
Details
Inspired by the patch to bug 14871:
The following sequence appears in several places in mozilla:
#ifdef XP_MAC
#define IS_NBSP_CHAR(c) (((unsigned char)0xca)==(c))
#else
#define IS_NBSP_CHAR(c) (((unsigned char)0xa0)==(c))
#endif
In lots of other places, we just check for 0xa0 and not for 0xca. (Of course,
in still other places we check for space and don't even bother to check for nbsp
at all.)
We need
(1) A reliable call (macro or otherwise) that works on any platform and is
callable from anywhere -- perhaps it just expands to the lines above -- so
programmers don't have to know to duplicate that code all over. I can attach a
patch for this, but I'd appreciate a suggestion as to where it best belongs, and
whether in string or xpcom.
(2) Extra credit: nsString nbsp-aware APIs (last I checked, the various space
detection methods don't treat nbsp as whitespace).
Cc'ing I18n people -- is the macro above sufficient? Is anything else needed?
Are there other spaces besides ' ' and nbsp (and perhaps tab) which need to be
tested for in cases where we really need to check whitespace? (Of course,
whitespace detection isn't a replacement for nsIWordBreaker/nsILineBreaker.)
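A minimal sketch of what (1) might look like if the check is made on UTF-16 code units rather than on platform-charset bytes. The names IsNbsp, IsSpaceOrNbsp and ContainsNbsp are hypothetical, not existing Mozilla APIs, and the sketch assumes the text is still Unicode, where no-break space is U+00A0 on every platform:

```cpp
#include <string>

// Hypothetical helpers for illustration; not existing Mozilla APIs.
// Assumes UTF-16 input, where NO-BREAK SPACE is always U+00A0.
constexpr bool IsNbsp(char16_t c) {
  return c == 0x00A0;
}

// Treats space, tab and no-break space as whitespace. Whether other
// characters belong here depends on the caller's context.
constexpr bool IsSpaceOrNbsp(char16_t c) {
  return c == 0x0020 || c == 0x0009 || IsNbsp(c);
}

// Returns true if aStr contains at least one no-break space.
inline bool ContainsNbsp(const std::u16string& aStr) {
  return aStr.find(char16_t(0x00A0)) != std::u16string::npos;
}
```

Because the check never looks at platform bytes, no #ifdef is needed in this form.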
just random thoughts:
zwj zero width joiner
zwnj zero width non-joiner
rs record separator
us unit separator
But what's the purpose of this check? (I picked these from [w2k]
start > run > context menu > insert unicode control character.)
Comment 2•22 years ago
What the heck is nbsp? I mean, I know it as an HTML DTD macro, but not as a
character... care to explain a bit? I'm not sure nsString is appropriate for
this because we'd want consistent behavior across platforms, but nsCRT might be
good...
Comment 3•22 years ago (Reporter)
In the dom, strings of characters often include 0xa0 (on windows and linux;
apparently it's 0xca on mac?) -- my understanding is that the parser is mapping
&nbsp; in input html into that character in the dom, and in addition,
surprisingly often we get that character embedded in files or coming in pastes
or drops from other programs.
Comment 4•22 years ago
A non-breaking space is a valid Unicode character, and is entirely separate from
the regular space. Its Unicode code point is 0xA0 (U+00A0), so according to
comment #3 win/linux are correct. DOMStrings, which are Unicode, should use 0xA0.
Perhaps the native mac charset uses 0xca, but that should be handled during
drawing or somewhere else late in the game.
Comment 5•22 years ago (Assignee)
I've only looked at the relevant code rather sketchily, but it seems we use this
on text fragments that have been converted from Unicode into a platform charset.
I think the current #ifdef is just wrong, because not even all Mac charsets have
no-break space at codepoint 0xca (e.g. MacJapanese where 0xca is HALFWIDTH
KATAKANA LETTER HA), and there are other charsets that have it at other
codepoints (e.g. KOI8-R at 0x9a).
Could we hack the converters to fold all Unicode space characters into 0x20, or
do it in a pre-conversion pass?
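To illustrate why a per-platform #ifdef cannot be right, here is a small sketch of a per-charset lookup for the byte that encodes no-break space. NbspByteFor is a hypothetical name, the table is deliberately tiny, and the charset labels are only illustrative keys; a real converter would hold this information itself:

```cpp
#include <cstring>

// Hypothetical lookup, for illustration only: the byte that encodes
// NO-BREAK SPACE (U+00A0) in a few single-byte charsets.
// Returns -1 for any charset that is not in this small table.
inline int NbspByteFor(const char* aCharset) {
  struct Entry {
    const char* name;
    int byte;
  };
  static const Entry kTable[] = {
    {"ISO-8859-1", 0xA0},
    {"windows-1252", 0xA0},
    {"macintosh", 0xCA},  // Mac Roman
    {"KOI8-R", 0x9A},
  };
  for (const Entry& e : kTable) {
    if (std::strcmp(aCharset, e.name) == 0) {
      return e.byte;
    }
  }
  return -1;
}
```

The answer depends on the target charset, not on the build platform, which is why folding spaces before conversion looks simpler than testing bytes afterwards.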
Comment 6•22 years ago (Reporter)
Just to add to the complication: over in bug 14871 sfraser says that nbsp is not
in fact different on Mac.
So what is 0xca? Is it something that gets pasted erroneously from some mac
app? Something specific to the spell checker? Should we even have it in the
spellchecker code? Is it a problem that the spellchecker code isn't checking
for 0xa0 on mac?
Comment 7•22 years ago
This
#ifdef XP_MAC
#define IS_NBSP_CHAR(c) (((unsigned char)0xca)==(c))
#else
#define IS_NBSP_CHAR(c) (((unsigned char)0xa0)==(c))
#endif
appears in only one place in the mozilla code: the spellcheck wrapper code that
rods checked in.
On Mac, you get a 0xCA character when you type option-space in text editors. We
probably need to filter this, converting it to an NBSP at places where we
convert platform-charset data to DOM text. I think this character should never
make it into the DOM.
Comment 8•22 years ago (Assignee)
The macro is used in the spellchecker in a post-conversion pass after converting
from Unicode to the spellchecker's charset, so we won't lose any performance by
changing to a pre-conversion pass that does
<pseudocode>
for PRUnichar in aStr
  if PRUnichar == 0x00A0 // maybe check for other whitespace characters as well
    PRUnichar = 0x0020
</pseudocode>
If non-break space appeared in the DOM as anything but 0x00A0, it would be a
serious error, so (1) in comment 0 is unnecessary.
Generalizing whitespace detection as in (2) is a tricky issue, because not all
whitespace characters have the semantics of whitespace in all contexts.
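The pre-conversion pass in the pseudocode above could be sketched in C++ roughly as follows. FoldNbspToSpace is a hypothetical name, and std::u16string stands in for whatever string class the spellchecker wrapper actually uses:

```cpp
#include <string>

// Hypothetical helper: folds NO-BREAK SPACE (U+00A0) to SPACE (U+0020)
// in place. Intended to run on the Unicode text before it is converted
// to the spellchecker's charset.
inline void FoldNbspToSpace(std::u16string& aStr) {
  for (char16_t& c : aStr) {
    if (c == 0x00A0) {  // other whitespace characters could be added here
      c = 0x0020;
    }
  }
}
```

After this pass the converted text contains only plain spaces, so no charset-specific nbsp byte has to be recognized afterwards.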
Comment 9•22 years ago
Over to smontagu, who seems to have a better handle on the i18n implications (if
it's charset-dependent, then it's probably not my area!)
Assignee: alecf → smontagu
Updated•16 years ago
QA Contact: scc → string
Updated•16 years ago
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → INCOMPLETE
Updated•5 years ago
Component: String → XPCOM