Before we can even begin to talk about the complex topic of encodings, let’s first talk about characters. What is a character? Searching on Google yields the following definition:
A character is a symbol representing a letter or number.
And that is absolutely true, for a toddler that is. But what about LF (line feed) from the ASCII table? Does it represent a letter or a number? PS: That was a rhetorical question.
To be compact and precise: a character is any symbol that represents information. Information like what? We cannot really answer that without a little history lesson. While I will shed light on only a small part of it, you can read Joel Spolsky’s excellent primer on character encoding to get the complete picture.
A Brief History
In computing terms, everything is represented as a sequence of 1s and 0s (duh!!) and characters are no exception. But how do we represent these characters in memory? Enter the granddaddy of character sets, ASCII (yes, we still remember you, EBCDIC).
ASCII
ASCII maps every English letter, along with digits and punctuation symbols, to a number between 32 and 127. That means every character can be represented by a single byte (8 bits); in fact, since all the mappings are below 128, 7 bits are enough to store them (nerds rejoicing!!). Codes below 32 were called unprintable and were used as control characters: for example, 7 made your computer beep, and 10 is LF (line feed).
Then came the pain. ASCII defined only 128 mappings, while a byte can represent 256 values, so debate began over what to do with the remaining 128 slots. Since ASCII covered only English letters, many people had the same idea: use the remaining slots for the characters of their own language (because why not!!!). And thus the concept of Code Pages was invented.
Code Page
A Code Page is basically a list of character codes and their corresponding symbols, but with more than 128 mappings. Every language, or rather every region, had a Code Page mapping the symbols of that region to numbers. Code Pages were similar to ASCII in one way only: below 128, the mappings of every Code Page were the same as in ASCII. For example, the English letter A mapped to 65 on every Code Page. But above 127, there was no guessing which symbol a given number represented in which Code Page. Lots of Code Pages, but not a single way to uniquely map every character in the world.
Unicode
To create a single character set that included every reasonable writing system on the planet, Unicode was developed.
In Unicode, the letter A is a platonic letter:
A
This platonic A is different from B, and different from a, but the same as the A you see in Times New Roman and the A you see in Helvetica: fonts and text formatting do not influence the Unicode character set.
Each platonic letter in every language is assigned a unique number by the Unicode folks, written like U+0639. This number is called a Unicode Code Point. The U+ stands for “Unicode” and the digits are hexadecimal. For example, the English letter A is U+0041. With this system, there is no real limit on the number of characters that Unicode can define.
Let’s say we have a string Hello, which in Unicode corresponds to the following 5 Code points:
H e l l o
U+0048 U+0065 U+006C U+006C U+006F
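Jumping ahead a little to Go (which we’ll meet properly below), you can see these code points for yourself with the %U verb of fmt.Printf, which prints the U+ notation:

package main

import "fmt"

func main() {
	// Ranging over a string yields Unicode Code Points (runes).
	for _, r := range "Hello" {
		fmt.Printf("%c = %U\n", r, r)
	}
}

This prints H = U+0048, e = U+0065, and so on.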
Now that we have a single way of defining characters, how do we represent them in memory, given that the code point is merely an abstract concept?
Encodings
Encoding, in simple terms, is a way to convert one form of data into another. So in our case, a way to convert the Unicode code point U+0048 into some other form of data. One such form of encoding (albeit the most popular one) is UTF-8.
UTF-8 is an encoding system used for storing Unicode Code Points, like U+0048, in memory using 8-bit bytes. In UTF-8, every code point from 0–127 is stored in a single byte. Code points 128 and above are stored using 2, 3, or up to 4 bytes. Thus, all the accented characters and other miscellaneous symbols are represented by multiple bytes. An added benefit is that English text looks exactly the same in UTF-8 as it does in ASCII, so we can safely say that ASCII and UTF-8 are identical for code points below 128.
Unicode, UTF-8, and Go
Now that we have established that all Unicode characters can be represented using 1 to 4 bytes, how do we represent them in Go?
Go has a built-in data type specifically for storing Unicode Code Points, called a rune. While a byte is an alias for uint8, a rune is an alias for int32, representing 4 bytes of memory, since 4 bytes is the maximum length of a UTF-8 encoded Code Point. The default type for character values is rune, which means that if we don’t declare a type explicitly when assigning a character value to a variable, Go will infer the type as rune.
package main

import (
	"fmt"
)

func main() {
	byteVal := byte('a')
	runeVal := 'a' // no explicit type, so the default type rune is inferred
	fmt.Printf("%c = %d %c = %U", byteVal, byteVal, runeVal, runeVal)
}
Running the above code yields:
a = 97 a = U+0061
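We can also verify the claim that code points above 127 take multiple bytes in UTF-8. Here is a minimal sketch using the standard unicode/utf8 package (the sample characters are my own picks):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// len() on a string counts bytes, not characters.
	fmt.Println(len("A"), utf8.RuneLen('A')) // 1 1 (ASCII, one byte)
	fmt.Println(len("é"), utf8.RuneLen('é')) // 2 2 (accented, two bytes)
	fmt.Println(len("€"), utf8.RuneLen('€')) // 3 3 (euro sign, three bytes)
	fmt.Println(len("𝄞"), utf8.RuneLen('𝄞')) // 4 4 (G clef, four bytes)
}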
Now, how about streaming data from a file? We can read the data via the io.Reader interface:

type Reader interface {
	Read(p []byte) (n int, err error)
}
This interface reads bytes from the file into p. But how can we possibly decode characters from the bytes read, when we do not know in advance how many bytes each character occupies in UTF-8? For example, let’s say we read 65 from the file, which is the letter A and, being less than 128, is represented by a single byte. But what about 165, which can only appear as part of a multi-byte sequence?
For that, we make use of the excellent bufio package, which provides the following method:

ReadRune() (r rune, size int, err error)

The bufio package wraps the file with buffer storage ([]byte) of some size N, so as to provide buffered I/O operations. The ReadRune method returns the next rune from the buffer along with its size in bytes.
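Here is a minimal sketch of ReadRune in action; I’m using strings.NewReader as a stand-in for a file, but any io.Reader (such as an *os.File) works:

package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

func main() {
	r := bufio.NewReader(strings.NewReader("h€llo"))
	for {
		ru, size, err := r.ReadRune()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("%c takes %d byte(s)\n", ru, size)
	}
}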
As we start to read the file, the buffer storage is first filled with N bytes. Then we traverse the buffer from the start: we get a rune by using the utf8.DecodeRune function and advance the index by the width of the returned rune.
DecodeRune(p []byte) (r rune, size int)
The function returns the first valid UTF-8 encoded rune in p along with its width (number of bytes). It reads the first byte of p and checks whether it alone represents a valid UTF-8 Code Point (i.e., it is less than 128). If yes, it returns it. If not, it checks whether the first two bytes form a valid Code Point, and so on up to 4 bytes. If no valid UTF-8 Code Point can be formed from the first 4 bytes, it simply returns the Unicode replacement character \uFFFD with width 1 (yes, that dreaded symbol � devised by Satan himself).
Let’s say the buffer contains the following bytes at any point in time:
104 226 130 172 108 108 111
DecodeRune first checks 104, which is a valid single-byte Code Point; it represents h and is returned immediately. Then it checks 226, which, being greater than 127, cannot represent a character in UTF-8 on its own. So it checks whether 226 130 represent a valid Code Point, which they still don’t. Finally, 226 130 172 represent a valid Code Point, €, and it is returned. Then come the two 108s, each of which is l, and finally 111, which is o. Thus we get the string: h€llo
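We can replay that exact walkthrough with utf8.DecodeRune, advancing the index by the returned width each time:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	buf := []byte{104, 226, 130, 172, 108, 108, 111} // the bytes from above
	for i := 0; i < len(buf); {
		r, width := utf8.DecodeRune(buf[i:])
		fmt.Printf("%c decoded from %d byte(s)\n", r, width)
		i += width // move past the bytes we just consumed
	}
}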
What happens if the buffer ends?
For those wondering what happens if the buffer ends before we can check all 4 byte combinations, let’s say the buffer ends with the following 3 bytes:

104 226 130

We can read the single-byte character 104, which maps to h, but 226 130 on their own don’t represent any character, and the end of the buffer has been reached. So we simply copy the bytes 226 130 to the start of the buffer, fill the remaining N-2 bytes with new bytes from the file, and start reading from the beginning of the buffer again, until we get EOF. And that’s how you read characters from a stream of bytes in Go.
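If you want to see that refill dance in code, here is a rough sketch with a deliberately tiny 4-byte buffer, sized so that the € in “he€llo” straddles two reads. It uses utf8.FullRune to detect the incomplete tail; this is a simplification of what bufio does for real:

package main

import (
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

func main() {
	src := strings.NewReader("he€llo")
	buf := make([]byte, 4) // tiny on purpose; must be at least utf8.UTFMax
	leftover := 0
	for {
		n, err := src.Read(buf[leftover:])
		n += leftover
		i := 0
		// Decode complete runes; at EOF, decode whatever remains
		// (an invalid tail becomes U+FFFD).
		for i < n && (utf8.FullRune(buf[i:n]) || err == io.EOF) {
			r, width := utf8.DecodeRune(buf[i:n])
			fmt.Printf("%c", r)
			i += width
		}
		// Copy the incomplete tail (e.g. 226 130) to the start of the
		// buffer and refill behind it on the next iteration.
		leftover = copy(buf, buf[i:n])
		if err != nil {
			break // io.EOF; a real program would check for other errors
		}
	}
	fmt.Println() // prints: he€llo
}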
Storing characters
Now that we understand how characters are represented in Go, what about storing them? For that, we have the string datatype.
For Go, UTF-8 is the default encoding for storing characters in a string.
var a = "hello world"
Here a is of type string and stores hello world in UTF-8 encoding. But that does not mean a string can only contain valid UTF-8 characters.
In Go, a string is simply a read-only slice of bytes. It is not required to hold UTF-8 or any other predefined encoding format. The only data it holds is some arbitrary bytes. Let’s take an example:
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
While sample is a valid string, if we try to print it, it’ll yield
��=� ⌘
Furthermore, if we loop over the bytes of the string sample:
for i := 0; i < len(sample); i++ {
	fmt.Printf("%x ", sample[i])
}
It will yield
bd b2 3d bc 20 e2 8c 98
thus proving that a string is nothing but a slice of bytes in Go.
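For contrast, a for…range loop treats the string as UTF-8: at each step it decodes a rune (handing you its starting byte index) and substitutes the replacement character U+FFFD for invalid bytes:

package main

import "fmt"

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
	// i is the byte index where the rune starts; r is the decoded rune.
	for i, r := range sample {
		fmt.Printf("%d:%q ", i, r)
	}
	// Output: 0:'�' 1:'�' 2:'=' 3:'�' 4:' ' 5:'⌘'
}

Note how the string itself is untouched; the replacement only happens as the bytes are decoded.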
Final Thoughts
I hope this little effort gives you a better insight into what a character is, what encoding is, and how UTF-8 is a central part of character representation in Go. There’s still much more to say about character encoding, UTF-8, and the humongous world of text processing, but I think it can wait for another post, or perhaps a series 😏 ? Till then,
HAPPY ENCODING!!!
Resources
[1]: https://blog.golang.org/strings
[2]: https://medium.com/golangspec/go-code-is-utf-8-encoded-b24b30c24c48