Commit 026b1c4

Update String type chapter
Update descriptions for Unicode escape sequences. Add minor fixes for different string methods.
1 parent 53b35c1 commit 026b1c4

1 file changed

+19
-12
lines changed

1-js/05-data-types/03-string/article.md

Lines changed: 19 additions & 12 deletions
@@ -86,14 +86,14 @@ Here's the full list:
8686
|`\\`|Backslash|
8787
|`\t`|Tab|
8888
|`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- kept for compatibility, not used nowadays. |
89-
|`\xXX`|Unicode character with the given hexadecimal Unicode `XX`, e.g. `'\x7A'` is the same as `'z'`.|
90-
|`\uXXXX`|A Unicode symbol with the hex code `XXXX` in UTF-16 encoding, for instance `\u00A9` -- is a Unicode for the copyright symbol `©`. It must be exactly 4 hex digits. |
91-
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A Unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two Unicode symbols, taking 4 bytes. This way we can insert long codes. |
89+
|`\xXX`|A character whose [Unicode](https://en.wikipedia.org/wiki/Unicode) code point is `U+00XX`. `XX` is always two hexadecimal digits with a value between `00` and `FF`, so the `\xXX` notation can be used only for the first 256 Unicode characters (including all 128 ASCII characters). For example, `"\x7A"` is the same as `"z"` (Unicode code point `U+007A`).|
90+
|`\uXXXX`|A character whose Unicode code point is `U+XXXX` (a character with the hex code `XXXX` in UTF-16 encoding). `XXXX` must be exactly 4 hex digits with a value between `0000` and `FFFF`, so the `\uXXXX` notation can be used for the first 65536 Unicode characters. Characters with a code point greater than `U+FFFF` can also be represented with this notation, but in that case we need to use a so-called surrogate pair (we will talk about surrogate pairs later in this chapter). For instance, `"\u00A9"` is the copyright symbol `©` (Unicode code point `U+00A9`), but for the smiling cat face 😺 we have to use the surrogate pair `"\uD83D\uDE3A"` (because its Unicode code point `U+1F63A` is greater than `U+FFFF`).|
91+
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A character with any given Unicode code point (a character with the given hex code in UTF-32 encoding). `X…XXXXXX` is a hex value between `0` and `10FFFF` (the highest code point defined by Unicode). This notation was added to the language in the ECMAScript 2015 (ES6) standard and allows us to easily represent all existing Unicode characters without the need for surrogate pairs. Unlike the previous two notations, there is no need to add leading zeros for characters with "small" code point values: `"\u{7A}"`, `"\u{007A}"` and `"\u{00007A}"` are all acceptable.|
9292

9393
Examples with Unicode:
9494

9595
```js run
96-
alert( "\u00A9" ); // ©
96+
alert( "\u00A9" ); // ©, we will get the very same result with alert( "\xA9" ) and alert( "\u{A9}" )
9797
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
9898
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
9999
```
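
The equivalences described in the table can be checked directly. The following is a minimal sketch (variable names are illustrative and not part of the article) that compares the three escape notations for the same character and shows that a surrogate pair written with `\uXXXX` produces the same string as the `\u{…}` form:

```js
// Three notations for the same character, U+007A ("z")
const viaX = "\x7A";        // two hex digits, code points 00..FF only
const viaU4 = "\u007A";     // exactly four hex digits
const viaBraces = "\u{7A}"; // 1 to 6 hex digits, no leading zeros needed

console.log(viaX === viaU4 && viaU4 === viaBraces); // true

// A character above U+FFFF: surrogate pair vs. the braces notation
const catPair = "\uD83D\uDE3A";
const catBraces = "\u{1F63A}";

console.log(catPair === catBraces); // true
console.log(catBraces.length);      // 2, because length counts UTF-16 code units
```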
@@ -407,9 +407,9 @@ There are 3 methods in JavaScript to get a substring: `substring`, `substr` and
407407
```
408408

409409
`str.substring(start [, end])`
410-
: Returns the part of the string *between* `start` and `end`.
410+
: Returns the part of the string *between* `start` and `end` (not including the greater of them).
411411

412-
This is almost the same as `slice`, but it allows `start` to be greater than `end`.
412+
This is almost the same as `slice`, but it allows `start` to be greater than `end` (in this case it simply swaps `start` and `end` values).
413413

414414
For instance:
415415

@@ -452,7 +452,7 @@ Let's recap these methods to avoid any confusion:
452452
| method | selects... | negatives |
453453
|--------|-----------|-----------|
454454
| `slice(start, end)` | from `start` to `end` (not including `end`) | allows negatives |
455-
| `substring(start, end)` | between `start` and `end` | negative values mean `0` |
455+
| `substring(start, end)` | between `start` and `end` (not including the greater of them)| negative values mean `0` |
456456
| `substr(start, length)` | from `start` get `length` characters | allows negative `start` |
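
The differences in the recap table can be seen side by side in a short sketch. The sample string and variable names below are illustrative assumptions, not taken from the article:

```js
const str = "stringify";

// slice and substring agree when start < end
const sliced = str.slice(2, 6);          // "ring"
const sub = str.substring(2, 6);         // "ring"

// substring swaps the arguments when start > end; slice returns ""
const subSwapped = str.substring(6, 2);  // "ring"
const sliceSwapped = str.slice(6, 2);    // ""

// negative arguments: slice counts from the end, substring treats them as 0
const sliceNegative = str.slice(-4, -1); // "gif"
const subNegative = str.substring(-3, 4); // same as substring(0, 4): "stri"

// substr takes a length and allows a negative start
const substrNegative = str.substr(-4, 2); // "gi"

console.log(sliced, sub, subSwapped, JSON.stringify(sliceSwapped));
console.log(sliceNegative, subNegative, substrNegative);
```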
457457

458458
```smart header="Which one to choose?"
@@ -486,19 +486,21 @@ To understand what happens, let's review the internal representation of strings
486486
All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). That is: each character has a corresponding numeric code. There are special methods that let us get the character for a code and back.
487487
488488
`str.codePointAt(pos)`
489-
: Returns the code for the character at position `pos`:
489+
: Returns a decimal number representing the code for the character at position `pos`:
490490
491491
```js run
492492
// different case letters have different codes
493-
alert( "z".codePointAt(0) ); // 122
494493
alert( "Z".codePointAt(0) ); // 90
494+
alert( "z".codePointAt(0) ); // 122
495+
alert( "z".codePointAt(0).toString(16) ); // 7a (if we need a more commonly used hex value of the code)
495496
```
496497
497498
`String.fromCodePoint(code)`
498499
: Creates a character by its numeric `code`
499500
500501
```js run
501502
alert( String.fromCodePoint(90) ); // Z
503+
alert( String.fromCodePoint(0x5a) ); // Z (we can also use a hex value as an argument)
502504
```
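
Since code points are usually written in the `U+XXXX` form, it can be handy to combine `codePointAt` with `toString(16)`. The helper below is a hypothetical sketch (the name `toUPlus` is not part of the article or of any standard API); it formats the code point of the first character and pads it to at least four hex digits:

```js
// Hypothetical helper: format the first character's code point as "U+XXXX"
function toUPlus(char) {
  const code = char.codePointAt(0); // decimal number
  return "U+" + code.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUPlus("z"));  // U+007A
console.log(toUPlus("©"));  // U+00A9
console.log(toUPlus("😺")); // U+1F63A

// and back: String.fromCodePoint accepts the same number, decimal or hex
console.log(String.fromCodePoint(0x1F63A) === "😺"); // true
```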
503505
504506
We can also add Unicode characters by their codes using `\u` followed by the hex code:
@@ -600,6 +602,11 @@ In the case above:
600602

601603
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835, between 0xd800 and 0xdbff
602604
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3, between 0xdc00 and 0xdfff
605+
606+
// codePointAt is surrogate-pair aware, but only when pos points at the first part of the pair
607+
608+
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair and returns the correct code for the symbol 𝒳
609+
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3, returns only the code for the second part of the surrogate pair
603610
```
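
The ranges mentioned in the comments follow from how UTF-16 splits a code point above `U+FFFF` into two 16-bit units. The following sketch shows that arithmetic under the standard UTF-16 rules; the function name `toSurrogatePair` is hypothetical and used only for illustration:

```js
// Hypothetical helper: split a code point above U+FFFF into its surrogate pair
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be between 0x10000 and 0x10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits, 0xd800..0xdbff
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits, 0xdc00..0xdfff
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D4B3); // code point of 𝒳

console.log(high.toString(16)); // d835
console.log(low.toString(16));  // dcb3
console.log(String.fromCharCode(high, low) === "𝒳"); // true

// Spreading a string iterates by code points, so the pair counts as one symbol
console.log("𝒳".length);      // 2
console.log([..."𝒳"].length); // 1
```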
604611

605612
You will find more ways to deal with surrogate pairs later in the chapter <info:iterable>. There are probably special libraries for that too, but nothing famous enough to suggest here.
@@ -608,9 +615,9 @@ You will find more ways to deal with surrogate pairs later in the chapter <info:
608615

609616
In many languages, there are symbols that are composed of the base character with a mark above/under it.
610617

611-
For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the UTF-16 table. But not all of them, because there are too many possible combinations.
618+
For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" characters have their own code in the Unicode table. But not all of them, because there are too many possible combinations.
612619

613-
To support arbitrary compositions, UTF-16 allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
620+
To support arbitrary compositions, the Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
614621

615622
For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.
616623

@@ -657,7 +664,7 @@ alert( "S\u0307\u0323".normalize().length ); // 1
657664
alert( "S\u0307\u0323".normalize() == "\u1e68" ); // true
658665
```
659666

660-
In reality, this is not always the case. The reason being that the symbol `Ṩ` is "common enough", so UTF-16 creators included it in the main table and gave it the code.
667+
In reality, this is not always the case. The reason is that the symbol `Ṩ` is "common enough", so Unicode creators included it in the main table and gave it its own code.
661668

662669
If you want to learn more about normalization rules and variants -- they are described in the appendix of the Unicode standard: [Unicode Normalization Forms](http://www.unicode.org/reports/tr15/), but for most practical purposes the information from this section is enough.
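
As a practical illustration of this section, strings that look the same but use different mark orders can be compared after normalization. The sketch below assumes a hypothetical helper named `sameText`; `normalize()` without arguments uses the composed form (NFC), while `normalize("NFD")` decomposes a character into its base and marks:

```js
// Hypothetical helper: compare two strings after Unicode normalization
function sameText(a, b) {
  return a.normalize() === b.normalize();
}

const s1 = "S\u0307\u0323"; // S + dot above + dot below
const s2 = "S\u0323\u0307"; // S + dot below + dot above

console.log(s1 === s2);         // false, different code sequences
console.log(sameText(s1, s2));  // true, both normalize to "\u1e68"

// The decomposed form (NFD) splits the single character back into three
const decomposed = "\u1e68".normalize("NFD");
console.log(decomposed.length); // 3
```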
663670
