Universal Character Set characters
"Unicode characters" redirects here. For a complete list of UCS characters,
see List of Unicode characters.
Index of predominant national and selected regional or minority scripts
Alphabetic    [L]ogographic and [S]yllabic    Abjad     Abugida
Latin         Hanzi [L]                       Arabic    North Indic
Cyrillic      Kana [S] / Kanji [L]            Hebrew    South Indic
Greek         Hanja (b) [L]                             Ethiopic
Armenian                                                Thaana
Georgian                                                Canadian syllabic
Hangul (a)
(a) Featural-alphabetic. (b) Limited.
The Unicode Consortium and ISO/IEC JTC 1/SC 2/WG 2 collaborate
on the list of the characters in the Universal Coded Character
Set. The Universal Coded Character Set, most commonly called the Universal
Character Set (abbr. UCS, official designation: ISO/IEC 10646), is an
international standard that maps characters (discrete symbols used in natural
language, mathematics, music, and other domains) to unique
machine-readable data values. By creating this mapping, the UCS enables
computer software vendors to interoperate and to interchange UCS-encoded
text strings with one another. Because it is a universal map, it can
be used to represent multiple languages at the same time. This avoids the
confusion of using multiple legacy character encodings, in which the same
sequence of codes can have multiple interpretations depending on the
character encoding in use, resulting in mojibake if the wrong one is chosen.
UCS has a potential capacity of over 1 million characters. Each UCS
character is abstractly represented by a code point, an integer between 0 and
1,114,111 (1,114,112 = 2^20 + 2^16 or 17 × 2^16 = 0x110000 code points), used to
represent each character within the internal logic of text processing software.
As of Unicode 15.0, released in September 2022, 293,168 (26%) of these
code points are allocated, 149,251 (13%) have been assigned characters,
137,468 (12.3%) are reserved for private use, 2,048 are used to enable the
mechanism of surrogates, and 66 are designated as noncharacters, leaving
the remaining 820,944 (74%) unallocated. The number of encoded characters
is made up as follows:
149,014 graphical characters (some of which do not have a visible glyph, but are still counted as graphical)
237 special purpose characters for control and formatting.
ISO maintains the basic mapping of characters from character name to code
point. Often, the terms character and code point will be used interchangeably.
However, when a distinction is made, a code point refers to the integer of the
character: what one might think of as its address. Meanwhile, a character in
ISO/IEC 10646 includes the combination of the code point and its name.
Unicode adds many other useful properties to the character set, such
as block, category, script, and directionality.
In addition to the UCS, the supplementary Unicode Standard (not a joint
project with ISO, but rather a publication of the Unicode Consortium) provides
other implementation details such as:
1. mappings between UCS and other character sets
2. different collations of characters and character strings for different languages
3. an algorithm for laying out bidirectional text ("the BiDi algorithm"), where text on the same line may shift between left-to-right ("LTR") and right-to-left ("RTL")
4. a case-folding algorithm
Computer software end users enter these characters into programs through
various input methods, for example, physical keyboards or virtual character
palettes.
The UCS can be divided in various ways, such as by plane, block, character
category, or character property.[1]
Character reference overview
See also: List of XML and HTML character entity references and Unicode
input
An HTML or XML numeric character reference refers to a character by
its Universal Character Set/Unicode code point, and uses the format
&#nnnn;
or
&#xhhhh;
where nnnn is the code point in decimal form, and hhhh is the code point
in hexadecimal form. The x must be lowercase in XML documents. The nnnn
or hhhh may be any number of digits and may include leading zeros. The
hhhh may mix uppercase and lowercase, though uppercase is the usual style.
In contrast, a character entity reference refers to a character by the
name of an entity which has the desired character as its replacement text.
The entity must either be predefined (built into the markup language) or
explicitly declared in a Document Type Definition (DTD). The format is the
same as for any entity reference:
&name;
where name is the case-sensitive name of the entity. The semicolon is
required.
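As a rough illustration of the two reference styles described above, the following Python sketch builds numeric character references and decodes them with the standard library's html.unescape. The helper name numeric_reference is made up for this example and is not part of any standard.

```python
import html


def numeric_reference(char: str, hexadecimal: bool = True) -> str:
    """Return a numeric character reference for a single character.

    Hypothetical helper for illustration only.
    """
    code_point = ord(char)
    if hexadecimal:
        # Lowercase "x" (required by XML), uppercase hex digits (usual style).
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


print(numeric_reference("é"))                     # &#xE9;
print(numeric_reference("é", hexadecimal=False))  # &#233;

# Decimal, hexadecimal, zero-padded hexadecimal, and a predefined HTML
# entity all decode to the same character.
print(html.unescape("&#233; &#xE9; &#x00e9; &eacute;"))  # é é é é
```

Note that html.unescape follows HTML rules; an XML parser only predefines a handful of named entities, so &eacute; would need a DTD declaration there.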
Planes
Main article: Plane (Unicode)
Unicode and ISO divide the set of code
points into 17 planes, each capable of
containing 65,536 distinct characters, or
1,114,112 in total. As of 2022 (Unicode 15.0)
ISO and the Unicode Consortium have only
allocated characters and blocks in seven of
the 17 planes. The others remain empty and
reserved for future use.
Most characters are currently assigned to
the first plane: the Basic Multilingual Plane.
This is to help ease the transition for legacy
software since the Basic Multilingual Plane
is addressable with just two octets. The
characters outside the first plane usually
have very specialized or rare use.
Each plane corresponds with the value of
the one or two hexadecimal digits (0–9, A–F)
preceding the four final ones: hence
U+24321 is in Plane 2, U+4321 is in Plane 0
(implicitly read U+04321), and U+10A200
would be in Plane 16 (hex 10 = decimal 16).
Within one plane, the range of code points is
hexadecimal 0000–FFFF, yielding a
maximum of 65,536 code points. Planes
restrict code points to a subset of that range.
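The digit rule above amounts to splitting a code point into its upper bits (the plane) and its lower 16 bits (the position within the plane). A minimal Python sketch, using the hypothetical helper names plane_of and offset_in_plane:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the UCS code space")
    return code_point >> 16  # drop the four final hexadecimal digits


def offset_in_plane(code_point: int) -> int:
    """Return the position within the plane (0x0000 to 0xFFFF)."""
    return code_point & 0xFFFF


for cp in (0x4321, 0x24321, 0x10A200):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, offset {offset_in_plane(cp):04X}")
# U+4321: plane 0, offset 4321
# U+24321: plane 2, offset 4321
# U+10A200: plane 16, offset A200
```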
Blocks
Main article: Unicode block
Unicode adds a block property to UCS that
further divides each plane into separate
blocks. Each block is a grouping of
characters by their use such as
"mathematical operators" or "Hebrew script
characters". When assigning characters to
previously unassigned code points, the
Consortium typically allocates entire blocks
of similar characters: for example all the
characters belonging to the same script or
all similarly purposed symbols get assigned
to a single block. Blocks may also maintain
unassigned or reserved code points when
the Consortium expects a block to require
additional assignments.
The first 256 code points in the UCS
correspond with those of ISO 8859-1, the
most popular 8-bit character encoding in
the Western world. As a result, the first 128
characters are also identical to ASCII.
Though Unicode labels these two blocks
(Basic Latin and Latin-1 Supplement) as Latin
script blocks, they contain many
characters that are commonly useful outside
of the Latin script. In general, not all
characters in a given block need be of the
same script, and a given script can occur in
several different blocks.
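The correspondence between the first 256 code points and ISO 8859-1 can be checked directly. The sketch below assumes Python's "latin-1" codec, which maps every byte value 0 to 255 (including the C0 and C1 control ranges) to the code point with the same number.

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Decode them as ISO 8859-1 and look at the resulting code points.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

print(code_points == list(range(256)))                   # True
# The first 128 characters are the same ones ASCII defines.
print(decoded[:128] == all_bytes[:128].decode("ascii"))  # True
```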
Categories
Unicode assigns to every UCS character
a general category and subcategory. The
general categories are: letter, mark, number,
punctuation, symbol, or control (in other
words a formatting or non-graphical
character).
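To see the general category and subcategory of individual characters, one can query Python's unicodedata module, which returns two-letter codes (the first letter is the general category, the second the subcategory). The MAJOR table and the describe helper below are illustrative names; the table also lists the separator category (Z), which the Unicode Character Database defines in addition to those named above.

```python
import unicodedata

# First letter of the two-letter General Category code.
MAJOR = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other (control, format, surrogate, private use, unassigned)",
}


def describe(char: str) -> tuple:
    """Return (two-letter code, general category name) for a character."""
    code = unicodedata.category(char)
    return code, MAJOR[code[0]]


for ch in ("A", "\u0308", "7", "!", "+", " ", "\u200E"):
    code, name = describe(ch)
    print(f"U+{ord(ch):04X} {code} {name}")
```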
Types include:
Modern, Historic, and Ancient Scripts.
As of 2022 (Unicode 15.0), the UCS
identifies 161 scripts that are, or have
been, used throughout the world.
Many more are in various approval
stages for future inclusion in the UCS.[2]
International Phonetic Alphabet. The
UCS devotes several blocks (over 300
characters) to characters for
the International Phonetic Alphabet.
Combining Diacritical Marks. An
important advance conceived by
Unicode in designing the UCS and
related algorithms for handling text was
the introduction of combining diacritic
marks. By providing accents that can
combine with any letter character,
Unicode and the UCS significantly
reduce the number of characters
needed. While the UCS also includes
precomposed characters, these were
included primarily to facilitate support
within UCS for non-Unicode text
processing systems.
Punctuation. Along with unifying
diacritical marks, the UCS also sought to
unify punctuation across scripts. Many
scripts nevertheless contain their own
punctuation where that punctuation has no
similar semantics in other scripts.
Symbols. Many mathematics, technical,
geometrical and other symbols are
included within the UCS. This provides
distinct symbols with their own code
point or character rather than relying on
switching fonts to provide symbolic
glyphs.
o Currency.
o Letterlike. These symbols look
like combinations of common
Latin script letters, such as ℅.
Unicode designates many of the
letterlike symbols as compatibility
characters, usually because they can
be represented in plain text by a
composing sequence of characters
for which a glyph is substituted:
for example, substituting
the glyph ℅ for the composed
sequence of characters c/o.
o Number Forms. Number forms
primarily consist of precomposed
fractions and Roman numerals. As with
other composing sequences
of characters, the Unicode approach
prefers the flexibility of composing
fractions by combining characters
together. In this case, to create
fractions, one combines numbers
with the fraction slash character
(U+2044). As an example of the
flexibility this approach provides,
there are nineteen precomposed
fraction characters included within
the UCS, yet there is an
infinity of possible fractions, and no
character set could include code
points for every one of them. By using
composing characters, the infinity of
fractions is handled by 11 characters
(0-9 and the fraction slash).
Ideally a text system should
present the same glyphs for a
fraction whether it is one of the
precomposed fractions (such as ⅓)
or a composing sequence of
characters (such as 1⁄3). Doing so
ensures that precomposed fractions
and combining sequence fractions
will appear compatible next to each
other. However, web browsers are
not typically that sophisticated with
Unicode and text handling.
o Arrows.
o Mathematical.
o Geometric Shapes.
o Legacy Computing.
o Control Pictures Graphical
representations of many control
characters.
o Box Drawing.
o Block Elements.
o Braille Patterns.
o Optical Character Recognition.
o Technical.
o Dingbats.
o Miscellaneous Symbols.
o Emoticons.
o Symbols and Pictographs.
o Alchemical Symbols.
o Game Pieces (chess, checkers, go,
dice, dominoes, mahjong, playing
cards, and many others).
o Chess Symbols
o Tai Xuan Jing.
o Yijing Hexagram Symbols.
CJK. Devoted to ideographs and other
characters to support languages in
China, Japan, Korea (CJK), Taiwan,
Vietnam, and Thailand.
o Radicals and Strokes.
o Ideographs. By far the largest
portion of the UCS is devoted to
ideographs used in languages of
Eastern Asia. While the glyph
representations of these ideographs
have diverged in the languages that
use them, the UCS unifies these Han
characters in what Unicode refers to
as Unihan (for Unified Han). With
Unihan, the text layout software must
work together with the available fonts
and these Unicode characters to
produce the appropriate glyph for the
appropriate language. Despite
unifying these characters, the UCS
still includes over 97,000 Unihan
ideographs.
Musical Notation.
Duployan shorthands.
Sutton SignWriting.
Compatibility Characters. Several
blocks in the UCS are devoted almost
entirely to compatibility characters.
Compatibility characters are those
included for support of legacy text
handling systems that do not make a
distinction between character and glyph
the way Unicode does. For example,
many Arabic letters are represented by a
different glyph when the letter appears at
the end of a word than when the letter
appears at the beginning of a word.
Unicode's approach prefers to have
these letters mapped to the same
character for ease of internal machine
text processing and storage. To
complement this approach, the text
software must select different glyph
variants for display of the character
based on its context. Over 4000
characters are included for such
compatibility reasons.
Control Characters.
Surrogates. The UCS includes 2048
code points in the Basic Multilingual
Plane (BMP) for surrogate code point
pairs. Together these surrogates allow
any code point in the sixteen other
planes to be addressed by using two
surrogate code points. This provides a
simple built-in method for encoding the
roughly 20.1-bit UCS code space within a
16-bit encoding such as UTF-16. In this
way UTF-16 can represent any character
within the BMP with a single 16-bit code
unit. Characters outside the BMP are then
encoded using two 16-bit code units
(4 octets total) using the surrogate pairs.
Private Use. The consortium provides
several private use blocks and planes
that can be assigned characters within
various communities, as well as
operating system and font vendors.
Noncharacters. The consortium
guarantees certain code points will never
be assigned a character and calls these
noncharacter code points. The last two
code points of each plane (ending in FFFE
and FFFF) are such code points. A further
32, U+FDD0 through U+FDEF, lie in the
Basic Multilingual Plane, the first plane.
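The Surrogates and Noncharacters entries above lend themselves to a short worked sketch in Python. The function names to_surrogates and is_noncharacter are illustrative. The surrogate arithmetic follows the UTF-16 encoding form, and the noncharacter test covers the last two code points of every plane plus the range U+FDD0 to U+FDEF.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def is_noncharacter(code_point: int) -> bool:
    """True for the last two code points of a plane and for U+FDD0..U+FDEF."""
    return (code_point & 0xFFFE) == 0xFFFE or 0xFDD0 <= code_point <= 0xFDEF


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Python's own UTF-16 encoder produces the same two code units.
print("\U0001F600".encode("utf-16-be").hex().upper())  # D83DDE00

# Counting over the whole code space gives the 66 noncharacters.
print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
```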
Special-purpose characters
See also: Unicode control characters
Unicode codifies over a hundred thousand
characters. Most of those represent
graphemes for processing as linear text.
Some, however, either do not represent
graphemes, or, as graphemes, require
exceptional treatment.[3][4] Unlike the ASCII
control characters and other characters
included for legacy round-trip capabilities,
these other special-purpose characters
endow plain text with important semantics.
Some special characters can alter the layout
of text, such as the zero-width joiner and
zero-width non-joiner, while others do not
affect text layout at all, but instead affect the
way text strings are collated, matched or
otherwise processed. Other special-purpose
characters, such as the mathematical
invisibles, generally have no effect on text
rendering, though sophisticated text layout
software may choose to subtly adjust
spacing around them.
Unicode does not specify the division of
labor between font and text layout software
(or "engine") when rendering Unicode text.
Because the more complex font formats,
such as OpenType or Apple Advanced
Typography, provide for contextual
substitution and positioning of glyphs, a
simple text layout engine might rely entirely
on the font for all decisions of glyph choice
and placement. In the same situation a more
complex engine may combine information
from the font with its own rules to achieve its
own idea of best rendering. To implement all
recommendations of the Unicode
specification, a text engine must be
prepared to work with fonts of any level of
sophistication, since contextual substitution
and positioning rules do not exist in some
font formats and are optional in the rest.
The fraction slash is an example: complex
fonts may or may not supply positioning
rules in the presence of the fraction slash
character to create a fraction, while fonts in
simple formats cannot.
Byte order mark
When appearing at the head of a text file or
stream, the byte order mark (BOM) U+FEFF
hints at the encoding form and its byte order.
If the stream's first byte is 0xFE and the
second 0xFF, then the stream's text is not
likely to be encoded in UTF-8, since those
bytes are invalid in UTF-8. It is also not likely
to be UTF-16 in little-endian byte order
because 0xFE, 0xFF read as a 16-bit
little-endian word would be U+FFFE, which is
a noncharacter. The sequence also has no
meaning in any arrangement of UTF-32
encoding, so, in summary, it serves as a
fairly reliable indication that the text stream
is encoded as UTF-16 in big-endian byte
order. Conversely, if the first two bytes are
0xFF, 0xFE, then the text stream may be
assumed to be encoded as UTF-16LE
because, read as a 16-bit little-endian value,
the bytes yield the expected 0xFEFF byte
order mark. This assumption becomes
questionable, however, if the next two bytes
are both 0x00; either the text begins with a
null character (U+0000), or the correct
encoding is actually UTF-32LE, in which the
full 4-byte sequence FF FE 00 00 is one
character, the BOM.
The UTF-8 sequence corresponding to
U+FEFF is 0xEF, 0xBB, 0xBF. This
sequence has no meaning in other Unicode
encoding forms, so it may serve to indicate
that the stream is encoded as UTF-8.
The Unicode specification does not require
the use of byte order marks in text streams.
It further states that they should not be used
in situations where some other method of
signaling the encoding form is already in
use.
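The byte patterns discussed in this section can be turned into a small detection routine. The following Python sketch uses the BOM constants from the standard codecs module; the function name sniff_bom is hypothetical, and the result is only a hint, since a stream without a BOM (or one where another signaling method is in use) yields nothing.

```python
import codecs

# Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so the longer patterns are tested first.
_BOMS = (
    ("utf-32-le", codecs.BOM_UTF32_LE),  # FF FE 00 00
    ("utf-32-be", codecs.BOM_UTF32_BE),  # 00 00 FE FF
    ("utf-8", codecs.BOM_UTF8),          # EF BB BF
    ("utf-16-le", codecs.BOM_UTF16_LE),  # FF FE
    ("utf-16-be", codecs.BOM_UTF16_BE),  # FE FF
)


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None."""
    for name, bom in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xfe\xff\x00A"))     # utf-16-be
print(sniff_bom(b"\xff\xfeA\x00"))     # utf-16-le
print(sniff_bom(b"\xef\xbb\xbfhi"))    # utf-8
# Ambiguous case from the text: this sketch prefers UTF-32LE, although
# the bytes could also be a UTF-16LE BOM followed by U+0000.
print(sniff_bom(b"\xff\xfe\x00\x00"))  # utf-32-le
```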
Mathematical invisibles
Primarily for mathematics, the Invisible
Separator (U+2063) provides a separator
between characters where punctuation or
space may be omitted such as in a
two-dimensional index like ij. Invisible Times
(U+2062) and Function Application
(U+2061) are useful in mathematics text
where the multiplication of terms or the
application of a function is implied without
any glyph indicating the operation. Unicode
5.1 introduced the Invisible Plus character
(U+2064) as well, which may
indicate that an integral number followed by
a fraction should denote their sum, but not
their product.
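As a quick check on the four characters named above, the sketch below prints their formal names and general category using Python's unicodedata module. It assumes an interpreter whose Unicode database is version 5.1 or later, so that U+2064 is present.

```python
import unicodedata

# U+2061 through U+2064: the mathematical invisible operators.
invisibles = [chr(cp) for cp in range(0x2061, 0x2065)]

for ch in invisibles:
    # All four are format characters (category Cf) with no visible glyph.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.category(ch)}")
# U+2061 FUNCTION APPLICATION Cf
# U+2062 INVISIBLE TIMES Cf
# U+2063 INVISIBLE SEPARATOR Cf
# U+2064 INVISIBLE PLUS Cf
```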
Fraction slash
Figure: Example of fraction slash use. This typeface (Apple
Chancery) shows the synthesized common fraction
on the left and the precomposed fraction glyph on
the right as a rendering of the plain text string "1 1⁄4
1¼". Depending on the text environment, the single
string "1 1⁄4" might yield either result, the one on the
right through substitution of the fraction sequence
with the single precomposed fraction glyph.
Figure: A more elaborate example of fraction slash usage:
plain text "4 221⁄225" rendered in Apple Chancery.
This font supplies the text layout software with
instructions to synthesize the fraction according to
the Unicode rule described in this section.
The fraction slash character (U+2044) has
special behavior in the Unicode Standard
(section 6.2, Other Punctuation):[5]
The standard form of a fraction built using
the fraction slash is defined as follows: any
sequence of one or more decimal digits
(General Category = Nd), followed by the
fraction slash, followed by any sequence of
one or more decimal digits. Such a fraction
should be displayed as a unit, such as ¾. If
the displaying software is incapable of
mapping the fraction to a unit, then it can
also be displayed as a simple linear
sequence as a fallback (for example, 3/4). If
the fraction is to be separated from a
previous number, then a space can be used,
choosing the appropriate width (normal, thin,
zero width, and so on). For example, 1
+ ZERO WIDTH SPACE + 3 + FRACTION SLASH + 4
is displayed as 1¾.
By following this Unicode recommendation,
text processing systems yield sophisticated
symbols from plain text alone. Here the
presence of the fraction slash character
instructs the layout engine to synthesize a
fraction from all consecutive digits preceding
and following the slash. In practice, results
vary because of the complicated interplay
between fonts and layout engines. Simple
text layout engines tend not to synthesize
fractions at all, and instead draw the glyphs
as a linear sequence as described in the
Unicode fallback scheme.
More sophisticated layout engines face two
practical choices: they can follow Unicode's
recommendation, or they can rely on the
font's own instructions for synthesizing
fractions. By ignoring the font's instructions,
the layout engine can guarantee Unicode's
recommended behavior. By following the
font's instructions, the layout engine can
achieve better typography because
placement and shaping of the digits will be
tuned to that particular font at that particular
size.
The problem with following the font's
instructions is that the simpler font formats
have no way to specify fraction synthesis
behavior. Meanwhile, the more complex
formats do not require the font to specify
fraction synthesis behavior and therefore
many do not. Most fonts of complex formats
can instruct the layout engine to replace a
plain text sequence such as "1⁄2" with the
precomposed "½" glyph. But because many
of them will not issue instructions to
synthesize fractions, a plain text string such
as "221⁄225" may well render as 22½25
(with the ½ being the substituted
precomposed fraction, rather than
synthesized). In the face of problems like
this, those who wish to rely on the
recommended Unicode behavior should
choose fonts known to synthesize fractions
or text layout software known to produce
Unicode's recommended behavior
regardless of font.
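The "standard form" quoted above (decimal digits, fraction slash, decimal digits) is easy to detect in plain text, even though rendering it is up to the font and layout engine. A minimal Python sketch follows; it relies on the fact that \d in a Python str pattern matches any character of General Category Nd. The name find_fractions is illustrative.

```python
import re
import unicodedata

FRACTION_SLASH = "\u2044"

# One or more decimal digits, the fraction slash, one or more decimal digits.
STANDARD_FRACTION = re.compile(rf"\d+{FRACTION_SLASH}\d+")


def find_fractions(text: str) -> list:
    """Return every digit/fraction-slash/digit run found in the text."""
    return STANDARD_FRACTION.findall(text)


# "4 221⁄225" uses a space, "1" + ZERO WIDTH SPACE + "3⁄4" uses U+200B,
# to separate the whole number from the fraction.
print(find_fractions("4 221\u2044225 and 1\u200b3\u20444"))
# ['221⁄225', '3⁄4']

# The precomposed ¼ is a compatibility character whose decomposition is
# exactly such a sequence: 1, FRACTION SLASH, 4.
print(unicodedata.normalize("NFKD", "\u00bc") == "1\u20444")  # True
```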
Bidirectional neutral formatting
Writing direction is the direction glyphs are
placed on the page in relation to forward
progression of characters in the Unicode
string. English and other languages of Latin
script have left-to-right writing direction.
Several major writing scripts, such
as Arabic and Hebrew, have right-to-left
writing direction. The Unicode specification
assigns a directional type to each character
to inform text processors how sequences of
characters should be ordered on the page.
While lexical characters (that is, letters) are
normally specific to a single writing script,
some symbols and punctuation marks are
used across many writing scripts. Unicode
could have created duplicate symbols in the
repertoire that differ only by directional type,
but chose instead to unify them and assign
them a neutral directional type. They acquire
direction at render time from adjacent
characters. Some of these characters also
have a bidi-mirrored property indicating the
glyph should be rendered in mirror-image
when used in right-to-left text.
The render-time directional type of a neutral
character can remain ambiguous when the
mark is placed on the boundary between
directional changes. To address this,
Unicode includes characters that have
strong directionality, have no glyph
associated with them, and are ignorable by
systems that do not process bidirectional
text:
Arabic letter mark (U+061C)
Left-to-right mark (U+200E)
Right-to-left mark (U+200F)
Surrounding a bidirectionally neutral
character by the left-to-right mark will force
the character to behave as a left-to-right
character while surrounding it by the right-to-
left mark will force it to behave as a right-to-
left character. The behavior of these
characters is detailed in Unicode's
Bidirectional Algorithm.
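The directional types and the bidi-mirrored property described in this section can be inspected with Python's unicodedata module. The sketch below only reads the properties; it does not implement the Bidirectional Algorithm. The helper name bidi_info is made up for the example.

```python
import unicodedata


def bidi_info(char: str) -> tuple:
    """Return (bidirectional class, mirrored flag) for a character."""
    return unicodedata.bidirectional(char), bool(unicodedata.mirrored(char))


samples = [
    "A",       # Latin letter: strong left-to-right
    "\u05D0",  # Hebrew letter alef: strong right-to-left
    "\u0639",  # Arabic letter ain: strong right-to-left (Arabic)
    "(",       # neutral, and mirrored in right-to-left text
    "!",       # neutral, not mirrored
    "\u200E",  # left-to-right mark
    "\u200F",  # right-to-left mark
    "\u061C",  # Arabic letter mark
]

for ch in samples:
    bidi_class, mirrored = bidi_info(ch)
    print(f"U+{ord(ch):04X} class={bidi_class} mirrored={mirrored}")
```

The three marks report the strong classes L, R and AL respectively, which is what lets them lend a direction to an adjacent neutral character.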
Bidirectional general formatting
Further information: Bidirectional text
While Unicode is designed to handle
multiple languages, multiple writing systems
and even text that flows either left-to-right or
right-to-left with minimal author intervention,
there are special circumstances where the
mix of bidirectional text can become intricate,
requiring more author control. For these
circumstances, Unicode includes nine other
characters to control the complex
embedding of left-to-right text within
right-to-left text and vice versa:
Left-to-right embedding (U+202A)
Right-to-left embedding (U+202B)
Pop directional formatting (U+202C)
Left-to-right override (U+202D)
Right-to-left override (U+202E)
Left-to-right isolate (U+2066)
Right-to-left isolate (U+2067)
First strong isolate (U+2068)
Pop directional isolate (U+2069)
Interlinear annotation characters
Interlinear Annotation Anchor (U+FFF9)
Interlinear Annotation Separator (U+FFFA)
Interlinear Annotation Terminator (U+FFFB)
Script-specific
Prefixed format control
o Arabic Number Sign (U+0600)
o Arabic Sign Sanah (U+0601)
o Arabic Footnote Marker (U+0602)
o Arabic Sign Safha (U+0603)
o Arabic Sign Samvat (U+0604)
o Arabic Number Mark Above
(U+0605)
o Arabic End of Ayah (U+06DD)
o Syriac Abbreviation Mark (U+070F)
o Arabic Pound Mark Above (U+0890)
o Arabic Piastre Mark Above (U+0891)
o Kaithi Number Sign (U+110BD)
o Kaithi Number Sign Above
(U+110CD)
Egyptian Hieroglyphs
o Egyptian Hieroglyph Vertical Joiner
(U+13430)
o Egyptian Hieroglyph Horizontal
Joiner (U+13431)
o Egyptian Hieroglyph Insert At Top
Start (U+13432)
o Egyptian Hieroglyph Insert At Bottom
Start (U+13433)
o Egyptian Hieroglyph Insert At Top
End (U+13434)
o Egyptian Hieroglyph Insert At Bottom
End (U+13435)
o Egyptian Hieroglyph Overlay Middle
(U+13436)
o Egyptian Hieroglyph Begin Segment
(U+13437)
o Egyptian Hieroglyph End Segment
(U+13438)
o Egyptian Hieroglyph Insert At Middle
(U+13439)
o Egyptian Hieroglyph Insert At Top
(U+1343A)
o Egyptian Hieroglyph Insert At Bottom
(U+1343B)
o Egyptian Hieroglyph Begin Enclosure
(U+1343C)
o Egyptian Hieroglyph End Enclosure
(U+1343D)
o Egyptian Hieroglyph Begin Walled
Enclosure (U+1343E)
o Egyptian Hieroglyph End Walled
Enclosure (U+1343F)
Brahmi
o Brahmi Number Joiner (U+1107F)
Brahmi-derived script dead-character
formation (Virama and similar diacritics)
o Devanagari Sign Virama (U+094D)
o Bengali Sign Virama (U+09CD)
o Gurmukhi Sign Virama (U+0A4D)
o Gujarati Sign Virama (U+0ACD)
o Oriya Sign Virama (U+0B4D)
o Tamil Sign Virama (U+0BCD)
o Telugu Sign Virama (U+0C4D)
o Kannada Sign Virama (U+0CCD)
o Malayalam Sign Vertical Bar Virama
(U+0D3B)
o Malayalam Sign Circular Virama
(U+0D3C)
o Malayalam Sign Virama (U+0D4D)
o Sinhala Sign Al-Lakuna (U+0DCA)
o Thai Character Phinthu (U+0E3A)
o Thai Character Yamakkan (U+0E4E)
o Lao Sign Pali Virama (U+0EBA)
o Myanmar Sign Virama (U+1039)
o Tagalog Sign Virama (U+1714)
o Tagalog Sign Pamudpod (U+1715)
o Hanunoo Sign Pamudpod (U+1734)
o Khmer Sign Viriam (U+17D1)
o Khmer Sign Coeng (U+17D2)
o Tai Tham Sign Sakot (U+1A60)
o Tai Tham Sign Ra Haam (U+1A7A)
o Balinese Adeg Adeg (U+1B44)
o Sundanese Sign Pamaaeh
(U+1BAA)
o Sundanese Sign Virama (U+1BAB)
o Batak Pangolat (U+1BF2)
o Batak Panongonan (U+1BF3)
o Syloti Nagri Sign Hasanta (U+A806)
o Syloti Nagri Sign Alternate Hasanta
(U+A82C)
o Saurashtra Sign Virama (U+A8C4)
o Rejang Virama (U+A953)
o Javanese Pangkon (U+A9C0)
o Meetei Mayek Virama (U+AAF6)
o Kharoshthi Virama (U+10A3F)
o Brahmi Virama (U+11046)
o Brahmi Sign Old Tamil Virama
(U+11070)
o Kaithi Sign Virama (U+110B9)
o Chakma Virama (U+11133)
o Sharada Sign Virama (U+111C0)
o Khojki Sign Virama (U+11235)
o Khudawadi Sign Virama (U+112EA)
o Grantha Sign Virama (U+1134D)
o Newa Sign Virama (U+11442)
o Tirhuta Sign Virama (U+114C2)
o Siddham Sign Virama (U+115BF)
o Modi Sign Virama (U+1163F)
o Takri Sign Virama (U+116B6)
o Ahom Sign Killer (U+1172B)
o Dogra Sign Virama (U+11839)
o Dives Akuru Sign Halanta (U+1193D)
o Dives Akuru Virama (U+1193E)
o Nandinagari Sign Virama (U+119E0)
o Zanabazar Square Sign Virama
(U+11A34)
o Zanabazar Square Subjoiner
(U+11A47)
o Soyombo Subjoiner (U+11A99)
o Bhaiksuki Sign Virama (U+11C3F)
o Masaram Gondi Sign Halanta
(U+11D44)
o Masaram Gondi Virama (U+11D45)
o Gunjala Gondi Virama (U+11D97)
o Kawi Sign Killer (U+11F41)
o Kawi Conjoiner (U+11F42)
Historical Viramas with other functions
o Tibetan Mark Halanta (U+0F84)
o Myanmar Sign Asat (U+103A)
o Limbu Sign Sa-I (U+193B)
o Meetei Mayek Apun Iyek (U+ABED)
o Chakma Maayyaa (U+11134)
Mongolian Variation Selectors
o Mongolian Free Variation Selector
One (U+180B)
o Mongolian Free Variation Selector
Two (U+180C)
o Mongolian Free Variation Selector
Three (U+180D)
o Mongolian Vowel Separator
(U+180E)
Generic Variation Selectors
o Variation Selector-1 through -16
(U+FE00–U+FE0F)
o Variation Selector-17 through -256
(U+E0100–U+E01EF)
Tag characters (U+E0001 and
U+E0020–U+E007F)
Tifinagh
o Tifinagh Consonant Joiner (U+2D7F)
Ogham
o Ogham Space Mark (U+1680)
Ideographic
o Ideographic variation indicator
(U+303E)
o Ideographic Description (U+2FF0–
U+2FFB)
Musical Format Control
o Musical Symbol Begin Beam
(U+1D173)
o Musical Symbol End Beam
(U+1D174)
o Musical Symbol Begin Tie
(U+1D175)
o Musical Symbol End Tie (U+1D176)
o Musical Symbol Begin Slur
(U+1D177)
o Musical Symbol End Slur (U+1D178)
o Musical Symbol Begin Phrase
(U+1D179)
o Musical Symbol End Phrase
(U+1D17A)
Shorthand Format Control
o Shorthand Format Letter Overlap
(U+1BCA0)
o Shorthand Format Continuing
Overlap (U+1BCA1)
o Shorthand Format Down Step
(U+1BCA2)
o Shorthand Format Up Step
(U+1BCA3)
Deprecated Alternate Formatting
o Inhibit Symmetric Swapping
(U+206A)
o Activate Symmetric Swapping
(U+206B)
o Inhibit Arabic Form Shaping
(U+206C)
o Activate Arabic Form Shaping
(U+206D)
o National Digit Shapes (U+206E)
o Nominal Digit Shapes (U+206F)
Others
Object Replacement Character (U+FFFC)
Replacement Character (U+FFFD)
Characters vs code points
The term "character" is not well defined, and
what we are referring to most of the time is
the grapheme. A grapheme is represented
visually by its glyph. The typeface (often
erroneously referred to as font) used can
depict visual variations of the same
character. It is possible that two different
graphemes can have the exact same glyph
or are visually so close that the average
reader cannot tell them apart.
A grapheme is almost always represented
by one code point; for example, LATIN
CAPITAL LETTER A is represented by the
single code point U+0041.
The grapheme LATIN CAPITAL LETTER A WITH
DIAERESIS (Ä) is an example where a
character can be represented by more than
one code point. It can be U+00C4, or the
sequence U+0041 U+0308. U+0041 is the
familiar A and U+0308 is COMBINING
DIAERESIS (◌̈), a combining diacritical mark.
When a combining mark follows a code point
that is not a combining mark, text
rendering applications should superimpose
the combining mark onto the glyph
represented by the other code point to form
a grapheme according to a set of rules.[6]
The word BÄM would therefore be three
graphemes. It may be made up of three
code points or more depending on how the
characters are actually composed.
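The BÄM example can be reproduced in Python, where the length of a str is its number of code points. The sketch uses the standard unicodedata.normalize function to convert between the precomposed and decomposed spellings (normalization forms NFC and NFD).

```python
import unicodedata

precomposed = "B\u00C4M"   # Ä as the single code point U+00C4
decomposed = "BA\u0308M"   # A followed by COMBINING DIAERESIS U+0308

# Same three graphemes, different numbers of code points.
print(len(precomposed), len(decomposed))   # 3 4
print(precomposed == decomposed)           # False

# Normalizing to a common form makes the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```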
Whitespace, joiners, and separators
Main article: Whitespace character
Unicode provides a list of characters it
deems whitespace characters for
interoperability support. Software
implementations and other standards may
use the term to denote a slightly different set
of characters. For example, Java does not
consider U+00A0 NO-BREAK
SPACE or U+0085 <control-0085> (NEXT
LINE) to be whitespace, even though
Unicode does. Whitespace characters are
characters typically designated for
programming environments. Often they have
no syntactic meaning in such programming
environments and are ignored by the
machine interpreters. Unicode designates
the legacy control characters U+0009
through U+000D and U+0085 as whitespace
characters, as well as all characters whose
General Category property value is
Separator. There are 25 total whitespace
characters as of Unicode 15.0.
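The definition just given (the legacy controls U+0009 through U+000D and U+0085, plus every character whose General Category is a Separator) can be applied mechanically. The sketch below does so with Python's unicodedata module; is_whitespace is an illustrative name, and the resulting count depends on the Unicode version bundled with the interpreter (with version 6.3 or later it should match the figure of 25 above).

```python
import sys
import unicodedata

# Legacy control characters that Unicode designates as whitespace.
LEGACY_CONTROLS = set(range(0x0009, 0x000E)) | {0x0085}


def is_whitespace(code_point: int) -> bool:
    """Apply the definition from the text to one code point."""
    if code_point in LEGACY_CONTROLS:
        return True
    # Separator subcategories: space (Zs), line (Zl), paragraph (Zp).
    return unicodedata.category(chr(code_point)) in ("Zs", "Zl", "Zp")


whitespace = [cp for cp in range(sys.maxunicode + 1) if is_whitespace(cp)]
print(len(whitespace))
print(" ".join(f"U+{cp:04X}" for cp in whitespace))
```

Zero Width Space (U+200B) is absent from the result because its category is Cf (format), not a Separator.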
Grapheme joiners and non-joiners
The zero-width joiner (U+200D) and zero-
width non-joiner (U+200C) control the
joining and ligation of glyphs. The joiner
does not cause characters that would not
otherwise join or ligate to do so, but when
paired with the non-joiner these characters
can be used to control the joining and
ligating properties of the surrounding two
joining or ligating characters. The Combining
Grapheme Joiner (U+034F) is used to
distinguish two base characters as one
common base or digraph, mostly for
underlying text processing, collation of
strings, case folding and so on.
Word joiners and separators
The most common word separator is a
space (U+0020). However, there are other
word joiners and separators that also
indicate a break between words and
participate in line-breaking algorithms. The
No-Break Space (U+00A0) also produces a
baseline advance without a glyph but inhibits
rather than enabling a line-break. The Zero
Width Space (U+200B) allows a line-break
but provides no space: in a sense joining,
rather than separating, two words. Finally,
the Word Joiner (U+2060) inhibits line
breaks and also involves none of the white
space produced by a baseline advance.
| | Baseline advance | No baseline advance |
|---|---|---|
| Allow line-break (Separators) | Space U+0020 | Zero Width Space U+200B |
| Inhibit line-break (Joiners) | No-Break Space U+00A0 | Word Joiner U+2060 |
Other separators
Line Separator (U+2028)
Paragraph Separator (U+2029)
These provide Unicode with native
paragraph and line separators independent
of the legacy control
characters such as line feed
(U+000A), carriage return (U+000D), and Next Line
(U+0085). Unicode does not provide native
equivalents for the other ASCII formatting
control characters, which presumably are
therefore not part of the
Unicode plain text processing model. These
legacy formatting control characters include
Tab (U+0009), Line Tabulation or Vertical
Tab (U+000B), and Form Feed (U+000C),
which is also thought of as a page break.
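As an illustration of how one implementation treats these characters, Python's built-in `str.splitlines()` recognizes the Unicode line and paragraph separators alongside the legacy controls. The sample string below is made up for the example.

```python
# A string mixing Unicode-native separators with legacy line endings
text = "one\u2028two\u2029three\u0085four\nfive\r\nsix"

# splitlines() breaks on U+2028, U+2029, U+0085, LF and CR LF alike
lines = text.splitlines()
print(lines)  # ['one', 'two', 'three', 'four', 'five', 'six']
```

Note that `str.splitlines()` also treats Vertical Tab (U+000B) and Form Feed (U+000C) as line boundaries, which is a choice made by Python rather than something required by the description above.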
Spaces
Further information: Space (punctuation)
The space character (U+0020) typically
input by the space bar on a keyboard serves
semantically as a word separator in many
languages. For legacy reasons, the UCS
also includes spaces of varying sizes that
are compatibility equivalents for the space
character. While these spaces of varying
width are important in typography, the
Unicode processing model calls for such
visual effects to be handled by rich text,
markup and other such protocols. They are
included in the Unicode repertoire primarily
to handle lossless roundtrip transcoding
from other character set encodings. These
spaces include:
1. En Quad (U+2000)
2. Em Quad (U+2001)
3. En Space (U+2002)
4. Em Space (U+2003)
5. Three-Per-Em Space (U+2004)
6. Four-Per-Em Space (U+2005)
7. Six-Per-Em Space (U+2006)
8. Figure Space (U+2007)
9. Punctuation Space (U+2008)
10. Thin Space (U+2009)
11. Hair Space (U+200A)
12. Medium Mathematical Space
(U+205F)
Aside from the original ASCII space, the
other spaces are all compatibility characters.
In this context this means that they
effectively add no semantic content to the
text, but instead provide styling control.
Within Unicode, this non-semantic styling
control is often referred to as rich text and is
outside the thrust of Unicode's goals. Rather
than using different spaces in different
contexts, this styling should instead be
handled through intelligent text layout
software.
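The compatibility relationship can be observed with a short Python sketch: under compatibility normalization (NFKC), each of the twelve spaces listed above folds to the ordinary space U+0020. The list name `SPACES` is illustrative, and only the standard-library `unicodedata` module is assumed.

```python
import unicodedata

# The twelve fixed-width spaces listed above: U+2000..U+200A and U+205F
SPACES = list(range(0x2000, 0x200A + 1)) + [0x205F]

for cp in SPACES:
    ch = chr(cp)
    # NFKC applies compatibility mappings, which discard the width distinction
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{cp:04X} {unicodedata.name(ch)} -> U+{ord(folded):04X}")
```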
Three other writing-system-specific word
separators are:
Mongolian Vowel Separator (U+180E)
Ideographic Space (U+3000): behaves
as an ideographic separator and is
generally rendered as white space of the
same width as an ideograph.
Ogham Space Mark (U+1680): this
character is sometimes displayed with a
glyph and other times as only white
space.
Line-break control characters
Several characters are designed to help
control line breaks, either by discouraging
them (no-break characters) or by suggesting
them, as does the soft hyphen
(U+00AD) (sometimes called the "shy
hyphen"). Such characters, though designed
for styling, are probably indispensable for
the intricate types of line-breaking they
make possible.
Break inhibiting
1. Non-breaking hyphen (U+2011)
2. No-break space (U+00A0)
3. Tibetan Mark Delimiter Tsheg Bstar
(U+0F0C)
4. Narrow no-break space (U+202F)
The break inhibiting characters are meant to
be equivalent to a character sequence
wrapped in the Word Joiner U+2060.
However, the Word Joiner may be
placed before or after any character that
would otherwise allow a line break, in order
to inhibit that break.
Break enabling
1. Soft hyphen (U+00AD)
2. Tibetan Mark Intersyllabic Tsheg
(U+0F0B)
3. Zero-width space (U+200B)
Both the break inhibiting and break enabling
characters participate with other punctuation
and whitespace characters to enable text
imaging systems to determine line breaks
within the Unicode Line Breaking Algorithm. [7]
Types of code point
All code points given some kind of purpose
or use are considered designated code
points. A designated code point is either
assigned to an abstract character or
designated for some other purpose.
Assigned characters
The majority of code points in actual use
have been assigned to abstract characters.
This includes private-use characters, which
though not formally designated by the
Unicode standard for a particular purpose,
require a sender and recipient to have
agreed in advance how they should be
interpreted for meaningful information
interchange to take place.
Private-use characters
Main article: Private Use Areas
The UCS includes 137,468 private-use
characters, which are code points for private
use spread across three different blocks,
each called a Private Use Area (PUA). The
Unicode standard recognizes code points
within PUAs as legitimate Unicode character
codes, but does not assign them any
(abstract) character. Instead, individuals,
organizations, software vendors, operating
system vendors, font vendors and
communities of end-users are free to use
them as they see fit. Within closed systems,
characters in the PUA can operate
unambiguously, allowing such systems to
represent characters or glyphs not defined in
Unicode.[8] In public systems their use is
more problematic, since there is no registry
and no way to prevent several organizations
from adopting the same code points for
different purposes. One example of such a
conflict is Apple's use of U+F8FF for the
Apple logo, versus the ConScript Unicode
Registry's use of U+F8FF as KLINGON
MUMMIFICATION GLYPH in the Klingon script.[9]
The Basic Multilingual Plane (Plane 0)
contains 6,400 private-use characters in the
block named simply Private Use Area,
which ranges from U+E000 to U+F8FF.
The Private Use Planes, Plane 15 and Plane
16, each have their own PUAs of 65,534
private-use characters (with the final two
code points of each plane being
noncharacters). These are Supplementary
Private Use Area-A, which ranges from
U+F0000 to U+FFFFD, and Supplementary
Private Use Area-B, which ranges from
U+100000 to U+10FFFD.
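The figures above can be checked with a few lines of Python. This is only a sketch: the table `PRIVATE_USE_AREAS` and the helper `is_private_use` are hypothetical names, and the ranges are copied from the text.

```python
import unicodedata

# (block name, first code point, last code point), as given in the text
PRIVATE_USE_AREAS = [
    ("Private Use Area", 0xE000, 0xF8FF),
    ("Supplementary Private Use Area-A", 0xF0000, 0xFFFFD),
    ("Supplementary Private Use Area-B", 0x100000, 0x10FFFD),
]

def is_private_use(cp):
    """True if cp falls inside one of the three ranges above."""
    return any(first <= cp <= last for _, first, last in PRIVATE_USE_AREAS)

sizes = {name: last - first + 1 for name, first, last in PRIVATE_USE_AREAS}
total = sum(sizes.values())
print(sizes)
print(total)  # 137468

# Private-use code points carry the General Category "Co"
print(unicodedata.category(chr(0xE000)))  # Co
```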
PUAs are a concept inherited from certain
Asian encoding systems. These systems
had private use areas to encode what the
Japanese call gaiji (rare characters not
normally found in fonts) in application-
specific ways.
Surrogates
The UCS uses surrogates to address
characters outside the initial Basic
Multilingual Plane without resorting to
representations wider than 16 bits.[10] There are
1024 "high" surrogates (D800–DBFF) and
1024 "low" surrogates (DC00–DFFF). By
combining a pair of surrogates, the
remaining characters in all the other planes
can be addressed (1024 × 1024 = 1,048,576
code points in the other 16 planes). In
UTF-16, they must always appear in pairs, as a
high surrogate followed by a low surrogate,
thus using 32 bits to denote one code point.
A surrogate pair denotes the code point
0x10000 + (H − 0xD800) × 0x400 + (L − 0xDC00)
where H and L are the numeric values of
the high and low surrogates respectively.
Since high surrogate values in the range
DB80–DBFF always produce values in
the Private Use planes, the high
surrogate range can be further divided
into (normal) high surrogates (D800–
DB7F) and "high private use surrogates"
(DB80–DBFF).
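The surrogate arithmetic can be sketched in Python as follows. The function names `to_surrogates` and `from_surrogates` are hypothetical; the decoding function is a direct transcription of the formula above, and the result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    # The upper 10 bits go into the high surrogate, the lower 10 into the low
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
assert from_surrogates(high, low) == 0x1F600

# Cross-check against Python's own big-endian UTF-16 encoder
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert chr(0x1F600).encode("utf-16-be") == expected

# The first "high private use surrogate" lands at the start of Plane 15
print(hex(from_surrogates(0xDB80, 0xDC00)))  # 0xf0000
```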
Isolated surrogate code points have no
general interpretation; consequently, no
character code charts or names lists are
provided for this range. In the Python
programming language, individual
surrogate codes are used to embed
undecodable bytes in Unicode strings.[11]
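A short sketch of that Python behaviour, using the `surrogateescape` error handler introduced by PEP 383 (the byte string is an invented example):

```python
# Latin-1 encoded bytes; 0xE9 is not valid at this position in UTF-8
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is smuggled in as the lone surrogate U+DCE9
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'caf\udce9.txt'

# Encoding with the same handler restores the original bytes exactly
assert text.encode("utf-8", errors="surrogateescape") == raw
```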
Noncharacters
The unhyphenated term "noncharacter"
refers to 66 code points (labeled <not a character>)
permanently reserved for internal use,
and therefore guaranteed to never be
assigned to a character.[12] Each of the 17
planes has its two ending code points
set aside as noncharacters. So,
noncharacters are: U+FFFE and
U+FFFF on the BMP, U+1FFFE and
U+1FFFF on Plane 1, and so on, up to
U+10FFFE and U+10FFFF on Plane 16,
for a total of 34 code points. In addition,
there is a contiguous range of another
32 noncharacter code points in the BMP:
U+FDD0..U+FDEF. Software
implementations are therefore free to
use these code points for internal use.
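The two rules above are easy to express as a predicate. The following Python sketch (the name `is_noncharacter` is hypothetical) enumerates all matching code points and confirms the total of 66.

```python
def is_noncharacter(cp):
    """True for the last two code points of each plane and U+FDD0..U+FDEF."""
    return (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF

# Scan the whole code space, U+0000 through U+10FFFF
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66
```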
One particularly useful example of a
noncharacter is the code point U+FFFE.
This code point has the reverse
UTF-16/UCS-2 byte sequence of the byte
order mark (U+FEFF). If a stream of text
contains this noncharacter, this is a good
indication the text has been interpreted
with the incorrect endianness.
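The byte-order effect can be reproduced in Python by encoding the byte order mark in one endianness and decoding it in the other. This is a minimal sketch using the standard `utf-16-le` and `utf-16-be` codecs.

```python
# The byte order mark U+FEFF serialized as little-endian UTF-16
bom_le = "\ufeff".encode("utf-16-le")
print(bom_le)  # b'\xff\xfe'

# Reading the same two bytes as big-endian yields the noncharacter U+FFFE
misread = bom_le.decode("utf-16-be")
print(f"U+{ord(misread):04X}")  # U+FFFE
```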
Versions of the Unicode standard from
3.1.0 to 6.3.0 claimed that noncharacters
"should never be
interchanged". Corrigendum #9 of the
standard later stated that this was
leading to "inappropriate over-rejection",
clarifying that "[Noncharacters] are not
illegal in interchange nor do they cause
ill-formed Unicode text", and removing
the original claim.
Reserved code points
All other code points, being those not
designated, are referred to as being
reserved. These code points may be
assigned for a particular use in future
versions of the Unicode standard.
Characters, grapheme clusters and glyphs
Whereas many other character sets
assign a character for every possible
glyph representation of the character,
Unicode seeks to treat characters
separately from glyphs. This distinction
is not always unambiguous, but a
few examples will help illustrate it.
Often two or more characters may be
combined typographically to improve the
readability of the text. For example, the
three-letter sequence "ffi" may be treated
as a single glyph. Other character sets
would often assign a code point to this
glyph in addition to the individual letters
"f" and "i".
In addition, Unicode
approaches diacritic modified letters as
separate characters that, when
rendered, become a single glyph. For
example, an "o" with diaeresis: "ö".
Traditionally, other character sets
assigned a unique character code point
for each diacritic modified letter used in
each language. Unicode seeks to create
a more flexible approach by allowing
combining diacritic characters to
combine with any letter. This has the
potential to significantly reduce the
number of active code points needed for
the character set. As an example,
consider a language that uses the Latin
script and combines the diaeresis with
the upper- and lower-case letters "a",
"o", and "u". With the Unicode approach,
only the diaeresis diacritic character
needs to be added to the character set
to use with the Latin letters: "a", "A", "o",
"O", "u", and "U": seven characters in all.
A legacy character set needs to add
six precomposed letters with a
diaeresis in addition to the six code
points it uses for the letters without
diaeresis: twelve character code points
in total.
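The example above can be sketched in Python: a single combining diaeresis (U+0308) is appended to each of the six base letters, and NFC normalization shows that each two-code-point sequence corresponds to one precomposed letter. The names `COMBINING_DIAERESIS` and `composed_forms` are illustrative.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"

# Map each base letter to the NFC form of "letter + combining diaeresis"
composed_forms = {}
for base in "aAoOuU":
    sequence = base + COMBINING_DIAERESIS
    composed_forms[base] = unicodedata.normalize("NFC", sequence)

for base, composed in composed_forms.items():
    # Two code points in, one precomposed code point out
    print(base, f"U+{ord(composed):04X}", unicodedata.name(composed))
```

Unicode also encodes these six precomposed letters, for compatibility with older character sets, which is why the normalization above finds a single-code-point form for each.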
Compatibility characters
UCS includes thousands of characters
that Unicode designates as compatibility
characters. These are characters that
were included in UCS in order to provide
distinct code points for characters that
other character sets differentiate, but
would not be differentiated in the
Unicode approach to characters.
The chief reason for this differentiation
was that Unicode makes a distinction
between characters and glyphs. For
example, when writing English in
a cursive style, the letter "i" may take
different forms whether it appears at the
beginning of a word, the end of a word,
the middle of a word or in isolation.
Languages such as Arabic written in an
Arabic script are always cursive. Each
letter has many different forms. UCS
includes 730 Arabic form characters that
decompose to just 88 unique Arabic
characters. However, these additional
Arabic characters are included so that
text processing software may translate
text from other character sets to UCS
and back again without any loss of
information crucial for non-Unicode
software.
However, for UCS and Unicode in
particular, the preferred approach is to
always encode or map a letter to the
same character no matter where it
appears in a word. The distinct
forms of each letter are then determined by
the font and the text layout software.
In this way, the stored
representation of a character remains
identical regardless of where the
character appears in a word. This greatly
simplifies searching, sorting and other
text processing operations.
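A brief Python sketch of this mapping, assuming only the standard-library `unicodedata` module: the four positional presentation forms of the Arabic letter beh all fold to the single character U+0628 under NFKC normalization, and the "ffi" ligature character U+FB03 folds to the three letters it represents. The name `BEH_FORMS` is illustrative.

```python
import unicodedata

# Isolated, final, initial and medial presentation forms of Arabic beh
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

for form in BEH_FORMS:
    # NFKC replaces each compatibility form with the underlying letter
    letter = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(letter):04X}")

# The Latin "ffi" ligature is likewise a compatibility character
print(unicodedata.normalize("NFKC", "\ufb03"))  # ffi
```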
Character properties
Main article: Unicode character property
Every character in Unicode is defined by
a large and growing set of properties.
Most of these properties are not part of
the Universal Character Set. The properties
facilitate text processing including
collation or sorting of text, identifying
words, sentences and graphemes,
rendering or imaging text and so on.
Below is a list of some of the core
properties. There are many others
documented in the Unicode Character
Database.[13]
| Property | Example | Details |
|---|---|---|
| Name | LATIN CAPITAL LETTER A | This is a permanent name assigned by the joint cooperation of Unicode and the ISO UCS. A few known poorly chosen names exist and are acknowledged (e.g. U+039B GREEK CAPITAL LETTER LAMDA, which is misspelled; it should be LAMBDA) but will not be changed, in order to ensure specification stability.[14] |
| Code Point | U+0041 | The Unicode code point is a number also permanently assigned along with the "Name" property and included in the companion UCS. The usual custom is to represent the code point as a hexadecimal number with the prefix "U+" in front. |
| Representative Glyph | [15] | The representative glyphs are provided in code charts.[16] |
| General Category | Uppercase_Letter | The general category[17] is expressed as a two-letter sequence such as "Lu" for uppercase letter or "Nd" for decimal digit number. |
| Combining Class | Not_Reordered (0) | Since diacritics and other combining marks can be expressed with multiple characters in Unicode, the "Combining Class" property allows characters to be differentiated by the type of combining character they represent. The combining class can be expressed as an integer between 0 and 254 or as a named value. The integer values allow the combining marks to be reordered into a canonical order to make string comparison of identical strings possible. |
| Bidirectional Category | Left_To_Right | Indicates the type of character for applying the Unicode bidirectional algorithm. |
| Bidirectional Mirrored | no | Indicates the character's glyph must be reversed or mirrored within the bidirectional algorithm. Mirrored glyphs can be provided by font makers, extracted from other characters related through the "Bidirectional Mirroring Glyph" property, or synthesized by the text rendering system. |
| Bidirectional Mirroring Glyph | N/A | This property indicates the code point of another character whose glyph can serve as the mirrored glyph for the present character when mirroring within the bidirectional algorithm. |
| Decimal Digit Value, Digit Value, Numeric Value | NaN, NaN, NaN | For numerals, these properties indicate the numeric value of the character. Decimal digits have all three values set to the same value; presentational rich text compatibility characters and other Arabic-Indic non-decimal digits typically have only the latter two properties set to the numeric value of the character; numerals unrelated to Arabic-Indic digits, such as Roman numerals or Hangzhou/Suzhou numerals, typically have only the "Numeric Value" indicated. |
| Ideographic | False | Indicates the character is a CJK ideograph: a logograph in the Han script.[18] |
| Default Ignorable | False | Indicates the character is ignorable for implementations and that no glyph, last resort glyph, or replacement character need be displayed. |
| Deprecated | False | Unicode never removes characters from the repertoire, but on occasion Unicode has deprecated a small number of characters. |
Unicode provides an online
database[19] to interactively query the
entire Unicode character repertoire by
the various properties.
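Several of the properties in the table can also be queried offline. The following Python sketch uses the standard-library `unicodedata` module; the helper `describe` is hypothetical, and missing numeric values are reported as `None` here rather than as NaN.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "code_point": f"U+{ord(ch):04X}",
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        # The second argument is returned when the property is not set
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }

# A letter, a decimal digit, a superscript digit and a Roman numeral
for ch in ("A", "7", "\u00b2", "\u2164"):
    print(describe(ch))
```

For "A" this reports the general category "Lu", combining class 0 and bidirectional type "L", matching the example column of the table. The three numeric fields show the pattern described in the table: all three set for "7", only the digit and numeric values for SUPERSCRIPT TWO, and only the numeric value for ROMAN NUMERAL FIVE.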
See also
ConScript Unicode Registry
Unicode compatibility characters
References
1. ^ "The Unicode Standard". The Unicode
Consortium. Retrieved 2016-08-09.
2. ^ "Roadmaps to Unicode". The Unicode
Consortium. Retrieved 2021-09-15.
3. ^ "Section 2.13: Special
Characters" (PDF). The Unicode Standard.
The Unicode Consortium. September
2022.
4. ^ "Section 4.12: Characters with Unusual
Properties" (PDF). The Unicode Standard.
The Unicode Consortium. September
2022.
5. ^ "Section 6.2: General
Punctuation" (PDF). The Unicode
Standard. The Unicode Consortium.
September 2022.
6. ^ "UTN #2: A General Method for
Rendering Combining
Marks". www.unicode.org.
Retrieved 2020-12-16.
7. ^ "UAX #14: Unicode Line Breaking
Algorithm". The Unicode Consortium.
2016-06-01. Retrieved 2016-08-09.
8. ^ "Section 23.5: Private-Use
Characters" (PDF). The Unicode Standard.
The Unicode Consortium. September
2022.
9. ^ Michael Everson (2004-01-15). "Klingon:
U+F8D0 - U+F8FF".
10. ^ "Section 23.6: Surrogates
Area" (PDF). The Unicode Standard. The
Unicode Consortium. September 2022.
11. ^ v. Löwis, Martin (2009-04-22). "Non-
decodable Bytes in System Character
Interfaces". Python Enhancement
Proposals. PEP 383. Retrieved 2016-08-
09.
12. ^ "Section 23.7:
Noncharacters" (PDF). The Unicode
Standard. The Unicode Consortium.
September 2022.
13. ^ "Unicode Character Database". The
Unicode Consortium. Retrieved 2016-08-
09.
14. ^ Freytag, Asmus; McGowan, Rick;
Whistler, Ken. "Unicode Technical Note
#27 — Known Anomalies in Unicode
Character Names". Unicode Consortium.
15. ^ Not the official Unicode representative
glyph, but merely a representative glyph.
To see the official Unicode representative
glyph, see the code charts.
16. ^ "Character Code Charts". The Unicode
Consortium. Retrieved 2016-08-09.
17. ^ "UAX #44: Unicode Character
Database". General Category Values. The
Unicode Consortium. 2014-06-05.
Retrieved 2016-08-09.
18. ^ Davis, Mark; Iancu, Laurențiu; Whistler,
Ken. "Table 9. Property Table §
PropList.txt". Unicode Standard Annex
#44 — Unicode Character Database.
Unicode Consortium.
19. ^ "Unicode Utilities: Character Property
Index". The Unicode Consortium.
Retrieved 2015-06-09.
External links
Wikimedia Commons has media related
to Unicode.
Unicode Consortium
decodeunicode.org Unicode Wiki
with all 98884 graphic characters of
Unicode 5.0 as gifs, full text search
Unicode Characters by Property
Categories:
IEC standards
Unicode
This page was last edited on 16 September 2022, at 20:37 (UTC).
Text is available under the Creative Commons Attribution-ShareAlike License 3.0; additional
terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy.
Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit
organization.