Before we can even begin to talk about the complex topic of encodings, let’s first talk about characters. What is a character? Searching on Google yields the following definition:
A character is a symbol representing a letter or number.
And that is absolutely true, for a toddler that is. But what about LF (line feed) from the ASCII table? Does it represent a letter or a number? PS: That was a rhetorical question.
To be compact and precise: a character is any symbol that represents information. Information like what? We cannot really answer that without a little history lesson. While I will shed light on only a small part of it, you can read Joel Spolsky’s excellent primer on character encoding to get the complete picture.
A Brief History
In computing terms, everything is represented as a sequence of 1s and 0s (duh!!) and characters are no exception. But how do we represent these characters in memory? Enter the granddaddy of character sets, ASCII (yes, we still remember you, EBCDIC).
ASCII
ASCII maps every English letter, along with digits and punctuation symbols, to a number between 32 and 127. That means every character can be represented by a single byte (8 bits); in fact, since all the mappings are below 128, 7 bits are enough to store them (nerds rejoicing!!). Codes below 32 were called unprintable and were used as control characters: for example, 7 made your computer beep, and 10 is LF (line feed).
Then came the pain. ASCII defined only 128 mappings, while a byte can represent 256 values, so debate began over what to do with the remaining 128 slots. Since ASCII covered only English letters, many people had the same idea: use the remaining slots for the characters of their own language (because why not!!!). And thus the concept of Code Pages was invented.
Code Page
A Code Page is basically a list of character codes and their corresponding symbols, but with more than 128 mappings. Every language, or rather every region, had a Code Page mapping the symbols of that region to numbers. Code Pages were similar to ASCII in one way only: below 128, the mappings of every Code Page were the same as in ASCII. For example, the English letter A mapped to 65 on every Code Page. But above 127, there was no guessing which symbol a given number represented in which Code Page. Lots of Code Pages, but not a single way to uniquely map every character in the world.
Unicode
To create a single character set that included every reasonable writing system on the planet, Unicode was developed.
In Unicode, the letter A is a platonic letter:
A
This platonic A is different from B, and different from a, but the same as the A you see in Times New Roman and the A you see in Helvetica: fonts and text formatting do not influence the Unicode character set.
Each platonic letter in every language is assigned a unique number by the Unicode folks, written like U+0639. This number is called a Unicode Code Point. The U+ stands for “Unicode” and the digits are hexadecimal. For example, the English letter A is U+0041. With this system, there is no real limit on the number of characters that Unicode can define.
Let’s say we have a string Hello, which in Unicode corresponds to the following 5 Code points:
H e l l o
U+0048 U+0065 U+006C U+006C U+006F
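Jumping ahead a little to Go (which we’ll meet properly below), you can see these code points for yourself with the %U verb of fmt.Printf, which prints the U+ notation:

package main

import "fmt"

func main() {
	// Ranging over a string yields Unicode Code Points (runes).
	for _, r := range "Hello" {
		fmt.Printf("%c = %U\n", r, r)
	}
}

This prints H = U+0048, e = U+0065, and so on.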
Now that we have a single way of defining characters, how do we represent them in memory, given that the code point is merely an abstract concept?
Encodings
Encoding, in simple terms, is a way to convert one form of data into another. So in our case, a way to convert the Unicode code point U+0048 into some other form of data. One such form of encoding (albeit the most popular one) is UTF-8.
UTF-8 is an encoding system used for storing Unicode Code Points, like U+0048, in memory using 8-bit bytes. In UTF-8, every code point from 0–127 is stored in a single byte. Code points 128 and above are stored using 2, 3, or up to 4 bytes. Thus, all the accented characters and other miscellaneous symbols are represented by multiple bytes. An added benefit is that English text looks exactly the same in UTF-8 as it does in ASCII, so we can safely say that ASCII and UTF-8 are identical for code points below 128.
Unicode, UTF-8, and Go
Now that we have established that all Unicode characters can be represented using 1 to 4 bytes, how do we represent them in Go?
Go has a built-in data type specifically for storing Unicode Code Points, called a rune. While a byte is an alias for uint8, a rune is an alias for int32, representing 4 bytes of memory, since 4 bytes is the maximum length of a UTF-8 encoded Code Point. The default type for character values is rune, which means that if we don’t declare a type explicitly when assigning a character value to a variable, Go will infer the type as rune.
package main

import (
	"fmt"
)

func main() {
	byteVal := byte('a')
	runeVal := 'a' // no explicit type, so the default type rune is inferred
	fmt.Printf("%c = %d %c = %U", byteVal, byteVal, runeVal, runeVal)
}
Running the above code yields:
a = 97 a = U+0061
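We can also verify the claim that code points above 127 take multiple bytes in UTF-8. Here is a minimal sketch using the standard unicode/utf8 package (the sample characters are my own picks):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// len() on a string counts bytes, not characters.
	fmt.Println(len("A"), utf8.RuneLen('A')) // 1 1 (ASCII, one byte)
	fmt.Println(len("é"), utf8.RuneLen('é')) // 2 2 (accented, two bytes)
	fmt.Println(len("€"), utf8.RuneLen('€')) // 3 3 (euro sign, three bytes)
	fmt.Println(len("𝄞"), utf8.RuneLen('𝄞')) // 4 4 (G clef, four bytes)
}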
Now, how about streaming data from a file? We can read the data via the io.Reader interface:

type Reader interface {
	Read(p []byte) (n int, err error)
}
This interface reads bytes from the file into p. But how can we possibly decode characters from the bytes read, when we do not know in advance how many bytes each character occupies in UTF-8? For example, let’s say we read 65 from the file, which is the letter A and, being less than 128, is represented by a single byte. But what about 165, which can only appear as part of a multi-byte sequence?
For that, we make use of the excellent bufio package, which provides the following method:

ReadRune() (r rune, size int, err error)

The bufio package wraps the file with buffer storage ([]byte) of some size N, so as to provide buffered I/O operations. The ReadRune method returns the next rune from the buffer along with its size in bytes.
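Here is a minimal sketch of ReadRune in action; I’m using strings.NewReader as a stand-in for a file, but any io.Reader (such as an *os.File) works:

package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

func main() {
	r := bufio.NewReader(strings.NewReader("h€llo"))
	for {
		ru, size, err := r.ReadRune()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("%c takes %d byte(s)\n", ru, size)
	}
}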
As we start to read the file, the buffer storage is first filled with N bytes. Then we traverse the buffer from the start: we get a rune by using the utf8.DecodeRune function and advance the index by the width of the returned rune.
DecodeRune(p []byte) (r rune, size int)
The function returns the first valid UTF-8 encoded rune in p along with its width (number of bytes). It reads the first byte of p and checks whether it alone represents a valid UTF-8 Code Point (i.e., it is less than 128). If yes, it returns it. If not, it checks whether the first two bytes form a valid Code Point, and so on up to 4 bytes. If no valid UTF-8 Code Point can be formed from the first 4 bytes, it simply returns the Unicode replacement character \uFFFD with width 1 (yes, that dreaded symbol � devised by Satan himself).
Let’s say the buffer contains the following bytes at any point in time:
104 226 130 172 108 108 111
DecodeRune first checks 104, which is a valid single-byte Code Point; it represents h and is returned immediately. Then it checks 226, which, being greater than 127, cannot represent a character in UTF-8 on its own. So it checks whether 226 130 represent a valid Code Point, which they still don’t. Finally, 226 130 172 represent a valid Code Point, €, and it is returned. Then come the two 108s, each of which is l, and finally 111, which is o. Thus we get the string: h€llo
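We can replay that exact walkthrough with utf8.DecodeRune, advancing the index by the returned width each time:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	buf := []byte{104, 226, 130, 172, 108, 108, 111} // the bytes from above
	for i := 0; i < len(buf); {
		r, width := utf8.DecodeRune(buf[i:])
		fmt.Printf("%c decoded from %d byte(s)\n", r, width)
		i += width // move past the bytes we just consumed
	}
}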
What happens if the buffer ends?
For those wondering what happens if the buffer ends before we can check all 4 byte combinations, let’s say the buffer ends with the following 3 bytes:

104 226 130

We can read the single-byte character 104, which maps to h, but 226 130 on their own don’t represent any character, and the end of the buffer has been reached. So we simply copy the bytes 226 130 to the start of the buffer, fill the remaining N-2 bytes with new bytes from the file, and start reading from the beginning of the buffer again, until we get EOF. And that’s how you read characters from a stream of bytes in Go.
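If you want to see that refill dance in code, here is a rough sketch with a deliberately tiny 4-byte buffer, sized so that the € in “he€llo” straddles two reads. It uses utf8.FullRune to detect the incomplete tail; this is a simplification of what bufio does for real:

package main

import (
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

func main() {
	src := strings.NewReader("he€llo")
	buf := make([]byte, 4) // tiny on purpose; must be at least utf8.UTFMax
	leftover := 0
	for {
		n, err := src.Read(buf[leftover:])
		n += leftover
		i := 0
		// Decode complete runes; at EOF, decode whatever remains
		// (an invalid tail becomes U+FFFD).
		for i < n && (utf8.FullRune(buf[i:n]) || err == io.EOF) {
			r, width := utf8.DecodeRune(buf[i:n])
			fmt.Printf("%c", r)
			i += width
		}
		// Copy the incomplete tail (e.g. 226 130) to the start of the
		// buffer and refill behind it on the next iteration.
		leftover = copy(buf, buf[i:n])
		if err != nil {
			break // io.EOF; a real program would check for other errors
		}
	}
	fmt.Println() // prints: he€llo
}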
Storing characters
Now that we understand how characters are represented in Go, what about storing them? For that, we have the string datatype.
For Go, UTF-8 is the default encoding for storing characters in a string.
var a = "hello world"
Here a is of type string and stores hello world in UTF-8 encoding. But that does not mean a string can only contain valid UTF-8 characters.
In Go, a string is simply a read-only slice of bytes. It is not required to hold UTF-8 or any other predefined encoding format. The only data it holds is some arbitrary bytes. Let’s take an example:
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
While sample is a valid string, if we try to print it, it’ll yield
��=� ⌘
Furthermore, if we loop over the bytes of the string sample:
for i := 0; i < len(sample); i++ {
	fmt.Printf("%x ", sample[i])
}
It will yield
bd b2 3d bc 20 e2 8c 98
thus proving that a string is nothing but a slice of bytes in Go.
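For contrast, a for…range loop treats the string as UTF-8: at each step it decodes a rune (handing you its starting byte index) and substitutes the replacement character U+FFFD for invalid bytes:

package main

import "fmt"

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
	// i is the byte index where the rune starts; r is the decoded rune.
	for i, r := range sample {
		fmt.Printf("%d:%q ", i, r)
	}
	// Output: 0:'�' 1:'�' 2:'=' 3:'�' 4:' ' 5:'⌘'
}

Note how the string itself is untouched; the replacement only happens as the bytes are decoded.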
Final Thoughts
I hope this little effort gives you a better insight into what a character is, what encoding is, and how UTF-8 is a central part of character representation in Go. There’s still much more to say about character encoding, UTF-8, and the humongous world of text processing, but I think it can wait for another post, or perhaps a series 😏 ? Till then,
HAPPY ENCODING!!!
Resources
[1]: https://blog.golang.org/strings
[2]: https://medium.com/golangspec/go-code-is-utf-8-encoded-b24b30c24c48