|`\u{X…XXXXXX}` (1 to 6 hex characters)|A Unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two Unicode symbols, taking 4 bytes. This way we can insert long codes. |
+|`\xXX`|A character whose [Unicode](https://en.wikipedia.org/wiki/Unicode) code point is `U+00XX`. `XX` is always two hexadecimal digits with value between `00` and `FF`, so `\xXX` notation can be used only for the first 256 Unicode characters (including all 128 ASCII characters). For example, `"\x7A"` is the same as `"z"` (Unicode code point `U+007A`).|
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A character with any given Unicode code point (a character with the given hex code in UTF-32 encoding). `X…XXXXXX` is a hex value between `0` and `10FFFF` (the highest code point defined by Unicode). This notation was added to the language in ECMAScript 2015 (ES6) standard and allows us to easily represent all existing Unicode characters without need for surrogate pairs. Unlike previous two notations, there is no need to add leading zeros for characters with "small" code point values: `"\u{7A}"`, `"\u{007A}"` and `"\u{00007A}"` are all acceptable.|
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
```
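As a quick check of the escape notations touched by this hunk, here is a minimal sketch. It assumes a runtime with ES2015 support and uses `console.log` instead of `alert` so it also runs outside a browser; the variable names are made up for illustration.

```js
// Three ways to write the same letter "z" (code point U+007A)
const viaHex = "\x7A";      // \xXX: exactly two hex digits
const viaU4 = "\u007A";     // \uXXXX: exactly four hex digits
const viaBraces = "\u{7A}"; // \u{…}: 1 to 6 hex digits, no leading zeros needed
const sameLetter = viaHex === "z" && viaU4 === "z" && viaBraces === "z";
console.log(sameLetter); // true

// Above U+FFFF a single \u{…} escape replaces an explicit surrogate pair
const face = "\u{1F60D}";
const faceAsPair = "\uD83D\uDE0D";
console.log(face === faceAsPair); // true
console.log(face.length);         // 2 (two UTF-16 code units)
```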
@@ -407,9 +407,9 @@ There are 3 methods in JavaScript to get a substring: `substring`, `substr` and
```
`str.substring(start [, end])`
-: Returns the part of the string *between*`start` and `end`.
+: Returns the part of the string *between*`start` and `end` (not including the greater of them).

-This is almost the same as `slice`, but it allows `start` to be greater than `end`.
+This is almost the same as `slice`, but it allows `start` to be greater than `end` (in this case it simply swaps `start` and `end` values).
For instance:
@@ -452,7 +452,7 @@ Let's recap these methods to avoid any confusion:
| method | selects... | negatives |
|--------|-----------|-----------|
|`slice(start, end)`| from `start` to `end` (not including `end`) | allows negatives |
-|`substring(start, end)`| between `start` and `end`| negative values mean `0`|
+|`substring(start, end)`| between `start` and `end`(not including the greater of them)| negative values mean `0`|
|`substr(start, length)`| from `start` get `length` characters | allows negative `start`|
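
The rows of this table can be checked side by side. The sketch below is only an illustration: the sample string `"stringify"` and the variable names are assumptions, and `console.log` stands in for `alert`.

```js
const str = "stringify"; // indexes: s0 t1 r2 i3 n4 g5 i6 f7 y8

// substring swaps its arguments when start > end; slice does not
const sub = str.substring(2, 6);        // "ring"
const subSwapped = str.substring(6, 2); // "ring"
const sliced = str.slice(2, 6);         // "ring"
const slicedReversed = str.slice(6, 2); // ""

// Negative arguments: substring treats them as 0, slice counts from the end
const subNegative = str.substring(-3, 4); // same as substring(0, 4): "stri"
const sliceNegative = str.slice(-4, -1);  // "gif"

// substr takes a length instead of an end position
const byLength = str.substr(-4, 2); // "gi"

console.log(sub, subSwapped, sliced, JSON.stringify(slicedReversed));
console.log(subNegative, sliceNegative, byLength);
```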
```smart header="Which one to choose?"
@@ -486,19 +486,21 @@ To understand what happens, let's review the internal representation of strings
All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). That is: each character has a corresponding numeric code. There are special methods that allow to get the character for the code and back.
`str.codePointAt(pos)`
-: Returns the code for the character at position `pos`:
+: Returns a decimal number representing the code for the character at position `pos`:
```js run
// different case letters have different codes
-alert( "z".codePointAt(0) ); // 122
alert( "Z".codePointAt(0) ); // 90
+alert( "z".codePointAt(0) ); // 122
+alert( "z".codePointAt(0).toString(16) ); // 7a (if we need a more commonly used hex value of the code)
```
`String.fromCodePoint(code)`
: Creates a character by its numeric `code`
```js run
alert( String.fromCodePoint(90) ); // Z
+alert( String.fromCodePoint(0x5a) ); // Z (we can also use a hex value as an argument)
```
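The two methods are inverses of each other, which a short round trip makes visible. This is a sketch rather than part of the proposed change; the variable names are illustrative and `console.log` replaces `alert`.

```js
// Character -> decimal code -> hex string -> character again
const code = "Z".codePointAt(0);          // 90 (a plain number)
const hex = code.toString(16);            // "5a"
const back = String.fromCodePoint(code);  // "Z"

// A hex string has to be parsed back into a number before use
const fromHex = String.fromCodePoint(parseInt(hex, 16)); // "Z"

console.log(code, hex, back, fromHex);
```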
We can also add Unicode characters by their codes using `\u` followed by the hex code:
@@ -600,6 +602,11 @@ In the case above:
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835, between 0xd800 and 0xdbff
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3, between 0xdc00 and 0xdfff
+
+// codePointAt is surrogate-pair aware, but with its own specificity
+
+alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair and returns the correct code for the symbol 𝒳
+alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3, returns only the code for the second part of the surrogate pair
```
You will find more ways to deal with surrogate pairs later in the chapter <info:iterable>. There are probably special libraries for that too, but nothing famous enough to suggest here.
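One such way, sketched here under the assumption of an ES2015 runtime, relies on the string iterator, which walks a string by code points rather than by 16-bit units. The sample text and variable names are made up for illustration.

```js
// "𝒳", "😍" and "z": the first two are stored as surrogate pairs
const text = "\u{1D4B3}\u{1F60D}z";

const units = text.length; // 5 code units: 2 + 2 + 1
const symbols = [...text]; // spread uses the iterator: 3 whole symbols
const codes = symbols.map(ch => ch.codePointAt(0).toString(16));

console.log(units);          // 5
console.log(symbols.length); // 3
console.log(codes);          // [ '1d4b3', '1f60d', '7a' ]
```

Calling `codePointAt(0)` on each iterated symbol avoids the second-half case shown above, because every element starts at the beginning of a symbol.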
@@ -608,9 +615,9 @@ You will find more ways to deal with surrogate pairs later in the chapter <info:
In many languages, there are symbols that are composed of the base character with a mark above/under it.

-For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the UTF-16 table. But not all of them, because there are too many possible combinations.
+For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the Unicode table. But not all of them, because there are too many possible combinations.

-To support arbitrary compositions, UTF-16 allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
+To support arbitrary compositions, Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.

For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.
In reality, this is not always the case. The reason being that the symbol `Ṩ` is "common enough", so UTF-16 creators included it in the main table and gave it the code.
+In reality, this is not always the case. The reason being that the symbol `Ṩ` is "common enough", so Unicode creators included it in the main table and gave it the code.
If you want to learn more about normalization rules and variants -- they are described in the appendix of the Unicode standard: [Unicode Normalization Forms](http://www.unicode.org/reports/tr15/), but for most practical purposes the information from this section is enough.
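For the `Ṩ` case, a minimal sketch with the built-in `str.normalize()` (which defaults to the NFC form) shows the effect. The variable names are illustrative and `console.log` stands in for `alert`.

```js
// The same visible symbol built with the marks in two different orders
const dotAboveFirst = "S\u0307\u0323"; // S + dot above + dot below
const dotBelowFirst = "S\u0323\u0307"; // S + dot below + dot above

const rawEqual = dotAboveFirst === dotBelowFirst; // false: different code sequences
const normalizedEqual =
  dotAboveFirst.normalize() === dotBelowFirst.normalize(); // true

// Both sequences collapse into the single precomposed character U+1E68
const composed = dotAboveFirst.normalize();

console.log(rawEqual, normalizedEqual);
console.log(composed.length);       // 1
console.log(composed === "\u1e68"); // true
```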