
Unicode String Handling

A common problem associated with I/O handling is that of dealing with international
characters represented as Unicode. If you have a string s of raw bytes containing an
encoded representation of a Unicode string, use the s.decode([encoding
[,errors]]) method to convert it into a proper Unicode string. To convert a Unicode
string, u, to an encoded byte string, use the string method u.encode([encoding [,
errors]]). Both of these conversion methods rely on an encoding name that specifies
how Unicode character values are mapped to a sequence of 8-bit bytes in byte strings,
and vice versa. The encoding parameter is specified as a string and is one of more
than a hundred different character encodings. The following values, however, are
most common:
Value                        Description
'ascii'                      7-bit ASCII
'latin-1' or 'iso-8859-1'    ISO 8859-1 Latin-1
'cp1252'                     Windows 1252 encoding
'utf-8'                      8-bit variable-length encoding
'utf-16'                     16-bit variable-length encoding (may be little or big
                             endian)
'utf-16-le'                  UTF-16, little endian encoding
'utf-16-be'                  UTF-16, big endian encoding
'unicode-escape'             Same format as Unicode literals u"string"
'raw-unicode-escape'         Same format as raw Unicode literals ur"string"

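As a rough illustration of how the encoding name changes the resulting bytes, the sketch below encodes one string three ways. It assumes Python 3, where str holds Unicode text and bytes holds encoded data; the variable names are only illustrative.

```python
# Assumes Python 3: str is Unicode text, bytes is encoded data.
text = "Jalape\u00f1o"  # 8 characters, one of them non-ASCII

utf8_bytes = text.encode("utf-8")          # n-tilde becomes two bytes: 0xC3 0xB1
latin1_bytes = text.encode("latin-1")      # n-tilde becomes one byte: 0xF1
utf16le_bytes = text.encode("utf-16-le")   # two bytes per character here, no BOM

# Decoding with the matching encoding name recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

print(utf8_bytes, latin1_bytes, len(utf16le_bytes), round_trip == text)
```

Decoding with a different encoding than the one used to encode typically yields either an error or garbled text, which is why the encoding name has to travel with the data.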
The default encoding is set in the site module and can be queried using
sys.getdefaultencoding(). In many cases, the default encoding is 'ascii', which
means that ASCII characters with values in the range [0x00,0x7f] are directly mapped
to Unicode characters in the range [U+0000, U+007F]. However, 'utf-8' is also a
very common setting. Technical details concerning common encodings appear in a
later section.

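A minimal sketch of querying the default and checking the direct ASCII mapping described above. It assumes Python 3, where sys.getdefaultencoding() reports 'utf-8'; on a Python 2 interpreter the same call usually reports 'ascii'.

```python
import sys

# Query the interpreter-wide default encoding.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Every byte in [0x00, 0x7f] decodes to the code point with the same value.
ascii_bytes = bytes(range(0x80))
ascii_text = ascii_bytes.decode("ascii")
code_points = [ord(ch) for ch in ascii_text]
print(code_points == list(range(0x80)))
```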
When using the s.decode() method, it is always assumed that s is a string of bytes.
In Python 2, this means that s is a standard string, but in Python 3, s must be a special
bytes type. Similarly, the result of t.encode() is always a byte sequence. One caution
if you care about portability is that these methods are a little muddled in Python 2. For
instance, Python 2 strings have both decode() and encode() methods, whereas in
Python 3, strings only have an encode() method and the bytes type only has a
decode() method. To simplify code in Python 2, make sure you only use encode() on
Unicode strings and decode() on byte strings.

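The Python 3 side of this split can be checked directly. The following is a small sketch, assuming a Python 3 interpreter; the sample bytes are just an illustration.

```python
# Assumes Python 3.
raw = b"caf\xc3\xa9"            # UTF-8 encoded bytes
text = raw.decode("utf-8")      # bytes -> str
encoded = text.encode("utf-8")  # str -> bytes

# In Python 3 the two methods live on different types.
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")

print(text, encoded == raw, str_has_decode, bytes_has_encode)
```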
When string values are being converted, a UnicodeError exception might be raised
if a character that can't be converted is encountered. For instance, if you are trying to
encode a string into 'ascii' and it contains a Unicode character such as U+1F28, you
will get an encoding error because this character value is too large to be represented in
the ASCII character set. The errors parameter of the encode() and decode() methods
determines how encoding errors are handled. It's a string with one of the following
values:
Value                  Description
'strict'               Raises a UnicodeError exception for encoding and decoding
                       errors.
'ignore'               Ignores invalid characters.
'replace'              Replaces invalid characters with a replacement character
                       (U+FFFD in Unicode, '?' in standard strings).
'backslashreplace'     Replaces invalid characters with a Python character escape
                       sequence. For example, the character U+1234 is replaced
                       by '\u1234'.
'xmlcharrefreplace'    Replaces invalid characters with an XML character reference.
                       For example, the character U+1234 is replaced by
                       '&#4660;'.

The default error handling is ‘strict’.
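The error policies can be compared side by side. This is a sketch assuming Python 3; the sample string and the results dictionary are illustrative names, not part of any API.

```python
# Assumes Python 3. U+1234 has no ASCII representation.
text = "x\u1234y"

results = {}
for policy in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    results[policy] = text.encode("ascii", errors=policy)
    print(policy, results[policy])

# With the default 'strict' policy, the same call raises an exception.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeError:
    strict_failed = True

# When decoding, 'replace' substitutes U+FFFD for each undecodable byte.
decoded = b"x\xffy".decode("ascii", errors="replace")
```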

The 'xmlcharrefreplace' error handling policy is often a useful way to embed
international characters into ASCII-encoded text on web pages. For example, if you
output the Unicode string 'Jalape\u00f1o' by encoding it to ASCII with
'xmlcharrefreplace' handling, browsers will almost always correctly render the output
text as “Jalapeño” and not some garbled alternative.

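A small sketch of that case, assuming Python 3. The call to html.unescape() from the standard library stands in here for what a browser does with the character reference; it is not how a browser is actually implemented.

```python
import html

# Assumes Python 3.
text = "Jalape\u00f1o"

# Encode to pure ASCII, replacing the n-tilde with an XML character reference.
html_safe = text.encode("ascii", errors="xmlcharrefreplace")
print(html_safe)

# Resolving the character reference recovers the original string.
rendered = html.unescape(html_safe.decode("ascii"))
print(rendered == text)
```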
To keep your brain from exploding, encoded byte strings and unencoded strings
should never be mixed together in expressions (for example, using + to concatenate).
Python 3 prohibits this altogether, but Python 2 will silently go ahead with such operations
by automatically promoting byte strings to Unicode according to the default
encoding setting. This behavior is often a source of surprising results or inexplicable
error messages. Thus, you should carefully try to maintain a strict separation between
encoded and unencoded character data in your program.
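The Python 3 behavior can be seen in a short sketch (assuming Python 3; the names are illustrative). Mixing the two types raises a TypeError, and the fix is to decode explicitly before combining.

```python
# Assumes Python 3.
encoded = "Jalape\u00f1o".encode("utf-8")  # bytes
label = "Pepper: "                         # str

# Concatenating str and bytes is rejected outright.
try:
    label + encoded
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decode at the boundary, then work only with text.
combined = label + encoded.decode("utf-8")
print(mixed_ok, combined)
```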

