Petter Holt Juliussen

Encoding

2019-04-04

Unicode

Unicode contains a listing of characters from nearly every world script, each identified by a code point. However, this listing, the Universal Coded Character Set (UCS), is just one part of the Unicode Standard. The standard also includes rules for rendering, ordering, normalising and encoding these characters.
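As a small illustration of code points and normalisation, here is a minimal Python 3 sketch using the standard unicodedata module. The letter "å" is an arbitrary choice for the example; it can be written either as one code point or as "a" followed by a combining mark, and normalisation maps between the two forms.

```python
import unicodedata

composed = "\u00e5"      # "å" as a single code point (U+00E5)
decomposed = "a\u030a"   # "a" followed by COMBINING RING ABOVE (U+030A)

# Both render as "å", but they are different code point sequences.
print(composed == decomposed)                                # False

# NFC composes, NFD decomposes; after normalising, the strings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Every code point has a number and a name in the character listing.
print(hex(ord(composed)))          # 0xe5
print(unicodedata.name(composed))  # LATIN SMALL LETTER A WITH RING ABOVE
```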

UTF-8 is one of the three standard character encodings used to represent Unicode as computer text (the others being UTF-16 and UTF-32). It is currently the dominant UCS encoding. UTF-8 is a variable-width encoding that uses one to four bytes per code point, designed for backward compatibility with ASCII and for avoiding the complications of endianness and byte-order marks found in UTF-16 and UTF-32.
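The properties above can be observed directly. The following is a minimal Python 3 sketch; the sample string is only an example, and the codec names are the ones shipped with Python's standard library.

```python
import codecs

text = "hei på deg"  # 10 code points; only "å" (U+00E5) is outside ASCII

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf32_le = text.encode("utf-32-le")

# Variable width: ASCII characters take one byte in UTF-8, "å" takes two.
print(len(utf8))            # 11
print("å".encode("utf-8"))  # b'\xc3\xa5'

# UTF-16 uses 2 bytes for each of these characters, UTF-32 always uses 4.
print(len(utf16_le), len(utf32_le))  # 20 40

# ASCII compatibility: pure ASCII text encodes to the same bytes.
print("hei".encode("utf-8") == "hei".encode("ascii"))  # True

# Endianness: the same code point has two possible byte orders in UTF-16.
print("å".encode("utf-16-le"))  # b'\xe5\x00'
print("å".encode("utf-16-be"))  # b'\x00\xe5'

# The generic "utf-16" codec prepends a byte-order mark to record the order used.
bom = text.encode("utf-16")[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
```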

Unicode in Python

In Python 2.x there are two types that deal with text: (1) str for strings of bytes and (2) unicode for strings of Unicode code points.

Unicode strings are expressed as instances of the unicode type. The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings. The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so bytes with values greater than 127 are treated as errors. The errors argument controls how such input is handled: 'strict' (the default) raises a UnicodeDecodeError, 'replace' substitutes U+FFFD, and 'ignore' drops the offending bytes.

# -*- coding: utf-8 -*-
# Python 2; assumes this source file is saved as UTF-8

s = 'hei på deg'                            # <type 'str'> (UTF-8 encoded bytes)
print s                                     # hei på deg

s = unicode('hei på deg')                   # no encoding given, so ASCII is used and this raises
                                            # UnicodeDecodeError: 'ascii' codec can't decode byte
                                            # 0xc3 in position 5: ordinal not in range(128)

s = unicode('hei på deg', encoding='utf-8') # <type 'unicode'>
print s                                     # hei på deg
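For readers on Python 3, where bytes takes the place of the Python 2 str type and bytes.decode() takes the place of unicode(), the same behaviour can be sketched as follows. This is an illustrative counterpart rather than a line-by-line port; the variable name raw is chosen for the example.

```python
raw = "hei på deg".encode("utf-8")  # bytes, comparable to a Python 2 str

# Decoding as ASCII fails on the first byte above 127, as in the example above.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Naming the correct encoding recovers the text.
print(raw.decode("utf-8"))  # hei på deg

# The errors argument trades the exception for lossy output.
replaced = raw.decode("ascii", errors="replace")  # each bad byte becomes U+FFFD
ignored = raw.decode("ascii", errors="ignore")    # bad bytes are dropped
print(replaced)
print(ignored)              # hei p deg
```

Because "å" is two bytes in UTF-8, the 'replace' handler produces two replacement characters, one per undecodable byte.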

Since Python 3.0, the language features a str type that contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode. The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal.
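A minimal Python 3 sketch of what this means in practice, assuming the source file is saved as UTF-8 (the default):

```python
s = "hei på deg"  # a non-ASCII character written directly in the literal

print(type(s))    # <class 'str'>
print(len(s))     # 10, counted in code points rather than bytes
print(ord(s[5]))  # 229, the code point of "å" (U+00E5)

# An escape sequence produces the same string as the literal character.
print("p\u00e5" == "på")  # True

# Converting to and from bytes is always explicit.
encoded = s.encode("utf-8")
print(type(encoded))                 # <class 'bytes'>
print(len(encoded))                  # 11
print(encoded.decode("utf-8") == s)  # True
```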