Unicode is not a new topic on this blog. Last time, while writing a PHP script on Windows, I ran into a problem reading files with Unicode file names: the script passed the file name encoded in UTF-8, while Windows (I am using Windows 7) encodes Chinese characters in CP936/GBK. I will not repeat that here; you can find the solution in that post. Today I am going to talk about Unicode in Python. Python is a great language for writing crawlers. After fetching a page, we usually save the content and parse it later. Here is a piece of example code that fetches the content of a URL.
    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        data = response.read()
The example above comes from the official Python documentation and uses Python 3. You can find more examples in the HOWTO “Fetch Internet Resources Using The urllib Package”. To build a crawler in Python, you can also consider the Requests library or the Scrapy framework. Let’s come back to the Unicode issue.
The code above fetches the content of the URL and stores the result in the variable data. At this point, data holds the raw bytes of the response (call type(data) to check; it is bytes, not str). We can write these bytes to a file directly. For example:
    f = open("a.html", "wb")
    f.write(data)
    f.close()
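This works regardless of the page's encoding, because the bytes are written unchanged: the file simply keeps whatever encoding the server sent. But what if the page is not UTF-8 encoded and we want a UTF-8 file? First we need to know the original charset. The Content-Type response header often carries it (for example, text/html; charset=gbk):
    response.info().get("Content-Type")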
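To pull just the charset out of that header value, the standard library's email.message.Message can do the parsing. The following is a minimal sketch: the helper name charset_from_content_type and the "utf-8" fallback are assumptions made for illustration, not part of the original code.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    `default` is a hypothetical fallback for headers without a charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


gbk_charset = charset_from_content_type("text/html; charset=GBK")
missing_charset = charset_from_content_type("text/html")
print(gbk_charset, missing_charset)
```

With a live response, response.headers is itself a message object, so response.headers.get_content_charset() should give the same answer without the helper. Keep in mind that some servers omit the charset or report a wrong one.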
Saving GBK Content with UTF-8 Encoding
In my example, the original page is encoded in GBK. To save it as UTF-8, we first decode the bytes with the original encoding and then write the resulting string to a file. Here is the example code:
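    htmlStr = data.decode("gbk")
    f = open("a.html", "w")  # no encoding argument: the locale encoding is used
    f.write(htmlStr)
    f.close()
The bytes are decoded with the GBK codec, the resulting string is assigned to htmlStr, and the string is then written to a.html. Note that in text mode, open() uses the platform's locale encoding when no encoding argument is given (unless Python's UTF-8 mode is enabled), so the file is only saved as UTF-8 when that happens to be the locale encoding. On a Simplified Chinese edition of Windows it is typically CP936. To make sure the file is saved in UTF-8, we can encode the string explicitly and write the bytes:
    htmlStr = data.decode("gbk")
    f = open("a.html", "wb")
    f.write(htmlStr.encode("utf8"))
    f.close()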
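An alternative is to stay in text mode and pass the encoding to open() explicitly. The sketch below uses in-memory sample bytes instead of a downloaded page; the helper name save_as_utf8 and the temporary directory are assumptions made so the example can run on its own.

```python
import os
import tempfile


def save_as_utf8(raw, source_encoding, path):
    """Decode raw bytes with source_encoding and write them to path as UTF-8."""
    text = raw.decode(source_encoding)
    # encoding="utf-8" removes the dependency on the platform's locale;
    # newline="" writes line endings exactly as they appear in the text.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


raw_gbk = "大家好".encode("gbk")  # stand-in for the bytes returned by response.read()
path = os.path.join(tempfile.mkdtemp(), "a.html")
save_as_utf8(raw_gbk, "gbk", path)

with open(path, "rb") as f:
    saved = f.read()
print(len(raw_gbk), len(saved))
```

The same three characters take 6 bytes in GBK and 9 bytes in UTF-8, which is an easy way to confirm that the file really was re-encoded.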
Saving Unicode to a UTF-8 Encoded File
Many people confuse Unicode and UTF-8, but they are different things. Unicode is a character set: it assigns a number (a code point) to every character. UTF-8 is an encoding that can represent every code point in the Unicode character set as a sequence of 1 to 4 bytes. The following table shows how UTF-8 encodes Unicode characters.
Unicode Code Point Range (Hex) | UTF-8 Encoding (1 to 4 Bytes, Binary) | Notes |
---|---|---|
0000 0000-0000 007F | 0xxxxxxx | UTF-8 uses 1 byte to represent Unicode characters in this range. Characters in this range are the ASCII characters. |
0000 0080-0000 07FF | 110xxxxx 10xxxxxx | UTF-8 uses 2 bytes to represent Unicode characters in this range. |
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | UTF-8 uses 3 bytes to represent Unicode characters in this range. |
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | UTF-8 uses 4 bytes to represent Unicode characters in this range. |
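The table can be checked directly in Python by encoding one character from each range and printing the resulting bytes in binary. This is only an illustrative sketch; the four sample characters are my own choice.

```python
# One sample character per row of the table:
# U+0041 (A), U+0394 (Greek capital delta), U+5927 (a CJK character), U+1F600 (an emoji).
samples = ["A", "\u0394", "\u5927", "\U0001F600"]

rows = []
for ch in samples:
    encoded = ch.encode("utf-8")
    # Format every byte as 8 binary digits to compare with the bit patterns in the table.
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    rows.append((f"U+{ord(ch):04X}", len(encoded), bits))

for code_point, length, bits in rows:
    print(f"{code_point}: {length} byte(s) -> {bits}")
```

The leading bits of each printed byte (0, 110, 1110, 11110, and 10 for continuation bytes) match the patterns in the table.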
A Unicode character is identified by its code point, usually written in hexadecimal. Code points up to FFFF fit in 16 bits (4 hex digits), but the full range goes up to 10FFFF, as the table shows. A code point can be written as “U+0394” (the standard notation) or, inside a Python string literal, as the escape sequence “\u0394” (4 hex digits; the 8-digit form, such as “\U0001F600”, covers code points above FFFF). Keep in mind that this hex value is the code point, not the UTF-8 encoded bytes. The following Python code finds the corresponding UTF-8 bytes:
    a = '\u0394'          # a is a Unicode string (str)
    b = a.encode('utf8')  # b holds the UTF-8 encoded bytes: b'\xce\x94'
In the example above, I declare a Unicode string with an escape sequence in the source code. Real cases can be more complicated. For example, an Ajax call may return the text “u+0394”: a plain string of 6 characters, not a Unicode escape sequence. In this case, we need to convert the hex digits into an integer, the Unicode code point, and then turn the code point into a character:
    escCode = 'u+0394'
    codePoint = int(escCode[2:], 16)  # 0x0394 == 916
    uniStr = chr(codePoint)           # the one-character string 'Δ'
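Input from outside does not always use the same prefix or the same number of digits, so a slightly more defensive version may be useful. The following is a sketch under my own assumptions: the function name parse_code_point and the set of accepted prefixes are hypothetical, not something the Ajax example prescribes.

```python
def parse_code_point(text):
    """Turn text such as 'u+0394', 'U+1F600' or '\\u0394' into a one-character string."""
    cleaned = text.strip().lower()
    # Accepted prefixes are an assumption for this sketch.
    for prefix in ("u+", "\\u", "0x"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    code_point = int(cleaned, 16)  # raises ValueError for non-hex input
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"code point out of range: {text!r}")
    return chr(code_point)


delta = parse_code_point("u+0394")
emoji = parse_code_point("U+1F600")
print(delta, delta.encode("utf-8"))
```

Slicing off a fixed two characters, as in the shorter version, is fine when the format is guaranteed; the range check here only guards against values that chr() would reject anyway.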
Still confused? Let’s look at the following example, which covers everything mentioned in this article.
    uniString = "大家好"  # this is a Unicode string
    uniString = "\u5927\u5bb6\u597d"  # the same value as above
    uniString = str("\u5927\u5bb6\u597d")  # the same value as above
    uniEscapeBytes = uniString.encode("unicode_escape")  # the Unicode escape sequences, as bytes
    uniEscape = uniEscapeBytes.decode("ascii")  # the escape sequences as a string
    uniEscapeBytes = uniEscape.encode("ascii")  # convert the escape-sequence string back into bytes
    uniString = uniEscapeBytes.decode("unicode_escape")  # back to the Unicode string
    uniUTFBytes = uniString.encode("utf8")  # the UTF-8 encoded bytes
Additional Discussion
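In this article we have talked about Unicode, Unicode escape sequences, Unicode code points, and UTF-8. When working on web applications, there is one more concept: URL encoding (percent-encoding), which makes a value safe to transfer in a URL. Basically, it encodes the Unicode string as UTF-8 and then transfers the UTF-8 bytes as ASCII text, writing each byte that is not a safe ASCII character as % followed by two hex digits. The following example shows the hex values of the UTF-8 bytes that this representation is built from: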
    import binascii
    uniString = "大家好"  # the Unicode string
    uniUTF = uniString.encode('utf8')  # the UTF-8 bytes
    asciiHex = binascii.hexlify(uniUTF).decode('ascii')  # the hex digits of the UTF-8 bytes as text: 'e5a4a7e5aeb6e5a5bd'
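For actual URL encoding, the standard library already provides urllib.parse.quote and unquote, which use UTF-8 by default. Here is a short sketch that reuses the same sample string; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

uni_string = "大家好"

# quote() encodes the string as UTF-8 and writes each byte as %XX.
url_encoded = quote(uni_string)
print(url_encoded)  # %E5%A4%A7%E5%AE%B6%E5%A5%BD

# unquote() reverses the process and decodes the bytes as UTF-8 again.
round_trip = unquote(url_encoded)
print(round_trip == uni_string)
```

The hex digits after each % sign are the same ones hexlify produces, just in upper case and separated by percent signs.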