Strings, bytes, runes and characters in Go

Last Updated : 05 Feb, 2025

In Go, strings are sequences of bytes, not characters. Understanding bytes, runes, and encoding is crucial for handling text correctly. This article explores their differences and key concepts that every developer should know.

1. String

A string in Go is, in effect, a read-only slice of bytes. Strings are immutable: once a string is created, its bytes cannot be changed in place. Any operation that appears to modify a string (concatenation, slicing, replacement) produces a new string value and leaves the original untouched.

Here’s an example of a string:

package main

import "fmt"

func main() {
    var str = "Hello, World!"
    fmt.Println(str) // Output: Hello, World!
}
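
To make the immutability point concrete, here is a minimal sketch. The helper name replaceFirstByte is hypothetical and exists only for this illustration; it shows that "changing" a string means copying its bytes into a mutable slice and building a new string from the result.

```go
package main

import "fmt"

// replaceFirstByte is a hypothetical helper for illustration only. It returns
// a copy of s whose first byte is replaced by b; s itself is never modified.
func replaceFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s) // copies the string's bytes into a mutable slice
	buf[0] = b
	return string(buf) // copies the bytes again into a new string
}

func main() {
	original := "Hello, World!"
	// original[0] = 'J' // would not compile: string bytes cannot be assigned

	changed := replaceFirstByte(original, 'J')
	fmt.Println(original) // Hello, World!
	fmt.Println(changed)  // Jello, World!
}
```

Replacing a single byte like this is only safe when the first character is ASCII; for other text, work with runes instead.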

String Literals vs. Byte Slices

In Go, source files are UTF-8, so a string literal (enclosed in double quotes) holds UTF-8 encoded text unless it contains byte-level escapes such as \xff. A byte slice, by contrast, is just a collection of arbitrary bytes, which could represent text in any encoding, not necessarily UTF-8.

Example:

// String literal
str := "Hello, World!" // This is a UTF-8 encoded string

// Byte slice
bytes := []byte{72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33} // Same content, raw bytes
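
The following sketch compares the two forms and shows that a string can hold bytes that are not valid UTF-8. The helper describe is a hypothetical name introduced for this example; utf8.ValidString comes from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper that reports the byte length of s and
// whether its bytes form valid UTF-8.
func describe(s string) string {
	return fmt.Sprintf("%d bytes, valid UTF-8: %t", len(s), utf8.ValidString(s))
}

func main() {
	text := "Hello, World!"
	raw := []byte{72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33}

	// Converting the byte slice to a string yields the same content.
	fmt.Println(text == string(raw)) // true

	// A string built from \x escapes can contain arbitrary bytes.
	broken := "\xff\xfe"
	fmt.Println(describe(text))   // 13 bytes, valid UTF-8: true
	fmt.Println(describe(broken)) // 2 bytes, valid UTF-8: false
}
```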

Understanding UTF-8 and String Encoding

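Go source code is UTF-8, so string literals are UTF-8 encoded, and the standard library treats strings as UTF-8 by convention. Each character therefore occupies one or more bytes, depending on its Unicode code point. The language itself does not enforce this: a string value can hold any sequence of bytes.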

UTF-8 Encoding

UTF-8 is a variable-length character encoding that uses one to four bytes for each character. Characters from the ASCII set (U+0000 to U+007F) use a single byte, while characters from other scripts like Chinese or emojis can require multiple bytes.

For example, the character A (U+0041) is represented as the single byte 0x41 in UTF-8. However, a character like ⌘ (U+2318) takes three bytes (e2 8c 98).

package main

import "fmt"

func main() {
    str := "⌘" // Unicode character U+2318 (Place of Interest Sign)
    fmt.Println(len(str)) // Output: 3 because '⌘' is 3 bytes in UTF-8
}
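
To see the one-to-four-byte range in practice, the sketch below encodes one sample code point of each length. The helper encode is a hypothetical name; utf8.EncodeRune and utf8.UTFMax are part of the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode is a hypothetical helper that returns the UTF-8 bytes of one rune.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest possible encoding
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	// One sample code point for each encoded length.
	for _, r := range []rune{'A', 'π', '⌘', '😀'} {
		b := encode(r)
		fmt.Printf("%c U+%04X -> % x (%d)\n", r, r, b, len(b))
	}
	// A U+0041 -> 41 (1)
	// π U+03C0 -> cf 80 (2)
	// ⌘ U+2318 -> e2 8c 98 (3)
	// 😀 U+1F600 -> f0 9f 98 80 (4)
}
```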

Why Indexing a String in Go Doesn’t Return a Character

Go strings are slices of bytes, meaning that when you index a string, you get the individual byte values, not the characters. This can be confusing because, in many programming languages, strings are treated as sequences of characters. In Go, however, a character could span more than one byte, as seen in UTF-8 encoded strings.

package main

import "fmt"

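func main() {
    str := "⌘"
    // %c treats the byte 226 (0xE2) as the code point U+00E2, so it prints 'â', not '⌘'
    fmt.Printf("Character at position 0: %c\n", str[0]) // Output: Character at position 0: â
    fmt.Printf("Character at position 0 (byte value): %d\n", str[0]) // Output: Character at position 0 (byte value): 226
}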

Here, we see that str[0] returns the first byte (226), but that byte alone doesn't represent the character ⌘, which is a three-byte sequence.
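
If the goal is the first character rather than the first byte, one option is to decode it explicitly. This is a minimal sketch: firstRune is a hypothetical wrapper, while utf8.DecodeRuneInString is the standard library function that does the decoding.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune is a hypothetical helper that decodes the first rune of s and
// reports how many bytes it occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	str := "⌘"
	r, size := firstRune(str)
	fmt.Printf("%c occupies %d bytes\n", r, size) // ⌘ occupies 3 bytes
	fmt.Printf("% x\n", str[:size])               // e2 8c 98
}
```

For an empty string the decoder returns utf8.RuneError with a size of 0, and for an invalid byte it returns utf8.RuneError with a size of 1, so callers can tell the two cases apart.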

2. Runes

Go provides the rune type to represent Unicode code points. A rune is an alias for the int32 type, and it holds a single code point as a number, regardless of how many bytes that code point takes in UTF-8 encoding.

Rune and Code Point

In the context of Unicode, a code point is a unique number assigned to each character. A rune in Go is a 32-bit integer that holds a Unicode code point. For instance, the symbol ⌘ has the Unicode code point U+2318, which is the rune value 0x2318 in Go.

Example:

package main

import "fmt"

func main() {
    // Declare a rune (character constant)
    var r rune = '⌘'

    // Print the rune value and its Unicode code point
    fmt.Printf("Rune value: %c\n", r)            // Output: Rune value: ⌘
    fmt.Printf("Unicode code point: U+%04X\n", r) // Output: Unicode code point: U+2318
}
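
Because rune is only an alias for int32, ordinary integer operations work on it. The sketch below illustrates this; the helper shift is a hypothetical name used only here.

```go
package main

import "fmt"

// shift is a hypothetical helper that moves a rune n code points forward.
func shift(r rune, n int32) rune {
	return r + n // rune and int32 are the same type, so no conversion is needed
}

func main() {
	var r rune = '⌘'
	var n int32 = r // compiles as-is because rune is an alias for int32

	fmt.Println(n)                    // 8984, the decimal value of U+2318
	fmt.Println(string(r))            // ⌘ (converting a rune to a string UTF-8 encodes it)
	fmt.Printf("%c\n", shift('a', 1)) // b
}
```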

For-Range Loop with Runes

Go has built-in support for iterating over strings with the for range loop. It decodes one UTF-8 encoded rune per iteration, so multi-byte characters are handled correctly, and it yields the byte index at which each rune starts.

package main

import "fmt"

func main() {
    str := "日本語" // Japanese characters

    // Using for-range to loop over runes
    for i, runeValue := range str {
        fmt.Printf("Rune %c at byte position %d\n", runeValue, i)
    }
}

Output:

Rune 日 at byte position 0
Rune 本 at byte position 3
Rune 語 at byte position 6

In this example, for range iterates over the string, decoding the UTF-8 bytes into Unicode code points (runes). The index is the byte offset where each rune starts, which is why it advances by 3 for these three-byte characters.
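
For contrast, the sketch below walks the same string once by byte index and once with for range, and also shows what for range does with a byte that is not valid UTF-8. The helper runeOffsets is a hypothetical name introduced for this illustration.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper that returns the byte offset at which
// each rune of s starts, as reported by a for-range loop.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	str := "日本語"

	// A classic index loop walks bytes: 9 iterations for 3 characters.
	for i := 0; i < len(str); i++ {
		fmt.Printf("%x ", str[i])
	}
	fmt.Println()

	// for-range walks runes: 3 iterations.
	fmt.Println(runeOffsets(str)) // [0 3 6]

	// An invalid byte is reported as U+FFFD and consumes exactly one byte.
	for i, r := range "a\xffb" {
		fmt.Printf("%d: %U\n", i, r)
	}
}
```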

Bytes, Runes, and Characters in Go

What’s the Difference Between Bytes, Runes, and Characters?

  • Bytes: A byte represents 8 bits of data. In a UTF-8 string, each byte is either a complete ASCII character or one part of a multi-byte character.
  • Runes: A rune is an alias for int32 and represents a single Unicode code point. It's used in Go to handle characters that may span more than one byte in UTF-8.
  • Characters: While we often think of characters as being individual letters or symbols, the concept is fuzzy in computing because characters can be composed of one or more code points (like accented characters).

In Go:

  • A string holds arbitrary bytes, and indexing into it retrieves bytes, not individual characters.
  • A rune holds a single Unicode code point, which often, but not always, corresponds to one visible character.
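
The gap between code points and characters can be shown with an accented letter written two ways. This is a minimal sketch; the helper counts is a hypothetical name, and the escapes \u00e9 and \u0301 are used so the two spellings stay distinct in the source.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper returning the byte length and rune count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	combined := "e\u0301"   // e followed by U+0301 COMBINING ACUTE ACCENT

	b1, r1 := counts(precomposed)
	b2, r2 := counts(combined)
	fmt.Println(b1, r1) // 2 1
	fmt.Println(b2, r2) // 3 2

	// Both render as one character, but the byte sequences differ.
	fmt.Println(precomposed == combined) // false
}
```

Counting runes therefore gives the number of code points, not the number of characters a reader sees. Comparing such strings reliably requires Unicode normalization, which is outside the scope of the built-in types.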

Practical Example: Converting Between Runes, Bytes, and Strings

Here’s a practical example where we convert between runes, bytes, and strings in Go:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // Original string
    str := "Hello, 世界" // "Hello, " followed by the Chinese word for "world"

    // Convert string to byte slice
    bytes := []byte(str)
    fmt.Printf("Byte slice: %x\n", bytes)

    // Iterate over the string using a for-range loop
    fmt.Println("Iterating over string (runes):")
    for i, runeValue := range str {
        fmt.Printf("Rune: %c, at byte position %d\n", runeValue, i)
    }

    // Convert string to rune slice
    runes := []rune(str)
    fmt.Printf("Rune slice: %v\n", runes)

    // Convert the rune slice back to a string
    backToString := string(runes)
    fmt.Printf("Converted back to string: %s\n", backToString)

    // Find the length of the string and the number of runes
    fmt.Printf("String length (in bytes): %d\n", len(str))
    fmt.Printf("Number of runes: %d\n", utf8.RuneCountInString(str))
}

Output:

Byte slice: 48656c6c6f2c20e4b896e7958c
Iterating over string (runes):
Rune: H, at byte position 0
Rune: e, at byte position 1
Rune: l, at byte position 2
Rune: l, at byte position 3
Rune: o, at byte p...
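
As a further use of the []rune conversion shown above, here is a sketch of reversing a string without splitting multi-byte characters. The helper reverseRunes is a hypothetical name, not something the article's program defines.

```go
package main

import "fmt"

// reverseRunes is a hypothetical helper that reverses s code point by code
// point, so multi-byte characters stay intact.
func reverseRunes(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

func main() {
	fmt.Println(reverseRunes("Hello, 世界")) // 界世 ,olleH
}
```

Reversing the raw bytes instead would scramble the three-byte sequences of 世 and 界. Note that this approach works on code points, so characters built from several code points (such as a letter plus a combining accent) would still be pulled apart.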

In Go, understanding strings, bytes, and runes is crucial for handling text, especially in multilingual applications. Strings store arbitrary bytes, while runes represent Unicode code points. Working byte by byte is safe for ASCII-only data, but decoding into runes is necessary for any other UTF-8 text. Using Go’s built-in types and the unicode/utf8 package, you can convert and manipulate text correctly across different languages.

