What exactly happens when a Python str object is encoded, and what is the purpose of encoding?

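In Python 2, calling encode() on a unicode string still returns a str object; in Python 3, calling str.encode() on a (Unicode) str returns a bytes object. At the interpreter level, what does the encode method actually do to the str object?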


encode: characters to bytes. decode: bytes back to characters.

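Encoding turns logical characters into binary data so that they can be stored and transmitted. (How the characters are stored before encoding or after decoding is an internal implementation detail of Python. Only Python needs to worry about it; you do not. It is the same as not caring what an integer looks like in Python's memory: yet when you save an integer or send it over the network, you have to decide whether to convert it to a decimal string, a 32-bit unsigned little-endian value, a 64-bit signed value in network byte order, and so on.)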
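To make the analogy concrete, here is a minimal sketch. The sample text, the number and the variable names are chosen purely for illustration; it only shows that one logical value can have several byte representations.

```python
import struct

text = "中文"  # a str: a sequence of Unicode code points

# The same two characters become different byte sequences
# depending on which encoding is chosen.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Decoding with the matching codec recovers the original characters.
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text

# The integer analogy: one number, several stored representations.
n = 1000
as_decimal_text = str(n).encode("ascii")  # decimal digits as ASCII bytes
as_u32_little = struct.pack("<I", n)      # 32-bit unsigned, little-endian
as_i64_network = struct.pack("!q", n)     # 64-bit signed, network (big-endian) order

print(utf8_bytes, gbk_bytes)
print(as_decimal_text, as_u32_little, as_i64_network)
```

In both cases the in-memory object (the str or the int) stays the same; only the chosen serialization differs.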

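Python 2 encodes and decodes implicitly (with ASCII as the default codec) whenever str and unicode are mixed, which is why the whole picture is such a mess there.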


Python's string stores its characters internally as Unicode, and every string operation, such as find() and replace(), also treats the content as Unicode.
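A small sketch of what "operating on Unicode" means in Python 3 (the sample string is an arbitrary choice): length, searching and replacing all count characters, not bytes.

```python
s = "Python 編碼"

# All of these work on code points, regardless of how many bytes
# each character would take in any particular encoding.
char_count = len(s)                    # 9 characters
position = s.find("編")                # index 7, counted in characters
replaced = s.replace("編碼", "encoding")

# The byte length only appears once an encoding is chosen.
utf8_length = len(s.encode("utf-8"))   # 7 ASCII bytes + 2 * 3 bytes = 13

print(char_count, position, replaced, utf8_length)
```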

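In a text file, however, the content may be stored in any encoding, such as GBK or Big5. When Python reads a GBK-encoded text, what it gets is a piece of GBK-encoded "content". Python 2 had no dedicated container for this "content", so it simply treated it as a string (in Python 2 terms, a byte-oriented str that still has to be decoded into a unicode object). We should not treat non-Unicode "content" as if it were a Unicode string, so until you decode it, any operation on this string holding the wrong content is risky. As a result, when handling non-Unicode text, Python 2 always first obtained a string with "wrong" content and then decoded it into a correct string. Things stayed this awkward for quite a while.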

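Python 3 resolves this awkwardness. What you read in is not a string but a sequence of bytes (bytes or bytearray). These bytes are not text: they carry no notion of characters and do not care which encoding they are in; they are just bytes. Only when you call decode on them do you get a string.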
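A minimal Python 3 sketch of this flow. The in-memory `io.BytesIO` object stands in for a GBK-encoded file on disk, and its content is made up for the example.

```python
import io

# Stand-in for a GBK-encoded text file (hypothetical content).
raw_file = io.BytesIO("你好, 世界".encode("gbk"))

raw = raw_file.read()
print(type(raw).__name__)  # bytes

# The bytes object does not know its own encoding; we have to supply it.
text = raw.decode("gbk")
print(type(text).__name__, len(text))

# Python 3 refuses to mix text and bytes implicitly.
mixing_error = None
try:
    "prefix: " + raw
except TypeError as exc:
    mixing_error = exc
print("mixing str and bytes failed:", mixing_error)
```

Opening a real file in binary mode (`open(path, "rb")`) yields bytes in the same way, while text mode with an explicit `encoding=` argument performs the decode step for you.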

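Encoding works the same way. When you use encode("Big5") to convert a string to a specific encoding, Python 2 still gives you a string, but its content is no longer in Unicode form: it is a string with the "wrong encoding". Python 3 is different: encode() returns not a string but bytes. In other words, you get a pile of bytes, and Python does not care whether they are Big5-encoded text or a picture.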
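The Python 3 behaviour can be sketched as follows. The sample text is arbitrary, and latin-1 is used as the "wrong" codec only because it accepts any byte sequence, which makes the mismatch visible without raising an error.

```python
text = "編碼"

big5 = text.encode("big5")
print(type(big5).__name__, len(big5))  # bytes 4

# The result is plain bytes: nothing in it records that Big5 was used.
round_trip = big5.decode("big5")

# Decoding with a different codec "works" but yields unrelated characters,
# because each byte is mapped to one latin-1 character.
wrong = big5.decode("latin-1")

print(round_trip == text, wrong == text)  # True False
```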

Look at it from another angle. If Python had many internal string types, say Big5String, GBKString and so on, then encode could return a string type matching the encoding you chose.

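In reality Python has only one such type, the Unicode string. Converting a string to a non-Unicode encoding with encode therefore goes beyond what the built-in string can represent. Python 2 lacked a container for the converted content, so it still called the result a string (even though the content was wrong). In Python 3, encoding a string yields bytes, which roughly means: "My string type does not support non-Unicode encodings anyway, so I will not care what the converted result is. Whatever it is, it is not a string!"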


Read the Python documentation carefully, line by line.

Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points. String literals are written in a variety of ways:

Single quotes: 'allows embedded "double" quotes'

Double quotes: "allows embedded 'single' quotes".

Triple quoted: '''Three single quotes''', """Three double quotes"""

Triple quoted strings may span multiple lines - all associated whitespace will be included in the string literal.

String literals that are part of a single expression and have only whitespace between them will be implicitly converted to a single string literal. That is, ("spam " "eggs") == "spam eggs".

See String and Bytes literals for more about the various forms of string literal, including supported escape sequences, and the r ("raw") prefix that disables most escape sequence processing.

Strings may also be created from other objects using the str constructor.

Since there is no separate "character" type, indexing a string produces strings of length 1. That is, for a non-empty string s, s[0] == s[0:1].

There is also no mutable string type, but str.join() or io.StringIO can be used to efficiently construct strings from multiple fragments.
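The last two points can be checked directly. This is only an illustrative sketch with made-up sample values.

```python
import io

s = "spam"

# Indexing gives a str of length 1, not a separate character type.
first = s[0]
print(type(first).__name__, len(first))  # str 1
print(s[0] == s[0:1])                    # True

# str is immutable: item assignment is rejected.
try:
    s[0] = "S"
except TypeError as exc:
    print("cannot modify in place:", exc)

# Building a string from fragments with str.join() ...
fragments = ["spam", " ", "eggs"]
joined = "".join(fragments)

# ... or with io.StringIO as an in-memory text buffer.
buffer = io.StringIO()
for fragment in fragments:
    buffer.write(fragment)
built = buffer.getvalue()

print(joined == built)  # True
```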

Changed in version 3.3: For backwards compatibility with the Python 2 series, the u prefix is once again permitted on string literals. It has no effect on the meaning of string literals and cannot be combined with the r prefix.

4.8.1. Bytes

Bytes objects are immutable sequences of single bytes. Since many major binary protocols are based on the ASCII text encoding, bytes objects offer several methods that are only valid when working with ASCII compatible data and are closely related to string objects in a variety of other ways.

Firstly, the syntax for bytes literals is largely the same as that for string literals, except that a b prefix is added:

Single quotes: b'still allows embedded "double" quotes'

Double quotes: b"still allows embedded 'single' quotes".

Triple quoted: b'''3 single quotes''', b"""3 double quotes"""

Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.

As with string literals, bytes literals may also use a r prefix to disable processing of escape sequences. See String and Bytes literals for more about the various forms of bytes literal, including supported escape sequences.

While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256 (attempts to violate this restriction will trigger ValueError). This is done deliberately to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data (blindly applying text processing algorithms to binary data formats that are not ASCII compatible will usually lead to data corruption).
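A short sketch of the "sequence of integers" behaviour described above (sample data chosen arbitrarily):

```python
data = b"abc"

# Indexing yields an int, while slicing yields a bytes object.
first = data[0]          # 97, the value of the byte for "a"
first_slice = data[0:1]  # b"a"
values = list(data)      # [97, 98, 99]

# Values outside 0 <= x < 256 are rejected.
error_message = None
try:
    bytes([256])
except ValueError as exc:
    error_message = str(exc)

print(first, first_slice, values)
print("out-of-range value rejected:", error_message)
```

Note the contrast with str, where indexing returns another str of length 1.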

In addition to the literal forms, bytes objects can be created in a number of other ways:

A zero-filled bytes object of a specified length: bytes(10)

From an iterable of integers: bytes(range(20))

Copying existing binary data via the buffer protocol: bytes(obj)

Also see the bytes built-in.
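The three construction forms listed above can be sketched like this. A bytearray is used here as one example of an object that supports the buffer protocol.

```python
# A zero-filled bytes object of a specified length.
zeros = bytes(10)

# From an iterable of integers, each in range(256).
from_ints = bytes(range(20))

# Copying existing binary data via the buffer protocol.
source = bytearray(b"abc")
copied = bytes(source)

# The copy is independent: changing the source does not affect it.
source[0] = ord("X")

print(zeros)
print(list(from_ints))
print(source, copied)
```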

Since 2 hexadecimal digits correspond precisely to a single byte, hexadecimal numbers are a commonly used format for describing binary data. Accordingly, the bytes type has an additional class method to read data in that format:
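The quoted passage breaks off here. The class method it refers to is `bytes.fromhex()`; a brief sketch of its use, together with the inverse `bytes.hex()` method (available since Python 3.5), might look like this:

```python
# Parse a string of hexadecimal digits; whitespace between bytes is ignored.
data = bytes.fromhex("2Ef0 F1f2")
print(data)  # b'.\xf0\xf1\xf2'

# Convert back to a lowercase hexadecimal string.
hex_text = data.hex()
print(hex_text)  # 2ef0f1f2
```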

