前兩天的Python思考題，給大家一個參考答案

01-24

題目在這裡：給大家留一道Python的思考題 - 知乎專欄

參考代碼傳到了GitHub

hubo1016/pychecktype

代碼當中有許多doctest，如果已經自己寫了，可以嘗試運行一下看看：

def check_type(value, type): """ Generic type checking. :param type: could be: - a Python type. Notice that `object` matches all types, including None. There are a few special rules: int or long type always match both int and long value; str or unicode type always match both str and unicode value; int type CANNOT match bool value. - a tuple of type, means that data can match any subtype. When multiple subtypes can be matched, the first matched subtype is used. - a empty tuple () means any data type which is not None - None, means None. Could be used to match nullable value e.g. `(str, None)`. Equal to types.NoneType - a list, means that data should be a list, or a single item which is converted to a list of length 1. Tuples are also converted to lists. - a list with exact one valid `type`, means a list which all items are in `type`, or an item in `type` which is converted to a list. Tuples are also converted to lists. - a dict, means that data should be a dict - a dict with keys and values. Values should be valid `type`. If a key starts with "?", it is optional and "?" is removed. If a key starts with "!", it is required, and "!" is removed. If a key starts with "~", the content after "~" should be a regular expression, and any keys in `value` which matches the regular expression (with re.search) and not matched by other keys must match the corresponding type. The behavior is undefined when a key is matched by multiple regular expressions. If a key does not start with "?", "!" or "~", it is required, as if "!" is prepended. :param value: the value to be checked. It is guaranteed that this value is not modified. :return: the checked and converted value. An exception is raised (usually TypeMismatchException) when `value` is not in `type`. The returned result may contain objects from `value`. Some examples:: >>> check_type("abc", str) "abc" >>> check_type([1,2,3], [int]) [1, 2, 3] >>> check_type((1,2,3), [int]) [1, 2, 3] >>> check_type(1, ()) 1 >>> check_type([[]], ()) [[]] >>> check_type(None, ()) Traceback (most recent call last): ... TypeMismatchException: None cannot match type () >>> check_type([1,2,"abc"], [int]) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: "abc" cannot match type <... "int"> >>> check_type("abc", [str]) ["abc"] >>> check_type(None, str) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: None cannot match type <... "str"> >>> check_type(None, (str, None)) is None True >>> check_type([1,2,"abc",["def","ghi"]], [(int, [str])]) [1, 2, ["abc"], ["def", "ghi"]] >>> check_type({"abc":123, "def":"ghi"}, {"abc": int, "def": str}) == {"abc":123, "def":"ghi"} True >>> check_type({"abc": {"def": "test", "ghi": 5}, "def": 1}, {"abc": {"def": str, "ghi": int}, "def": [int]}) == {"abc": {"def": "test", "ghi": 5}, "def": [1]} True >>> a = [] >>> a.append(a) >>> check_type(a, a) [[...]] >>> r = _ >>> r[0] is r True >>> check_type(1, None) Traceback (most recent call last): ... TypeMismatchException: 1 cannot match type None >>> check_type(a, ()) [[...]] >>> check_type(True, int) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: True cannot match type <... "int"> >>> check_type(1, bool) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: 1 cannot match type <... "bool"> >>> check_type([1], [list]) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: 1 cannot match type <... "list"> >>> check_type(1, 1) Traceback (most recent call last): ... InvalidTypeException: 1 is not a valid type: Unrecognized type >>> my_type = [] >>> my_type.append(([str], my_type)) >>> >>> my_data = ["abc"] >>> my_data.append(my_data) >>> >>> check_type(my_data, my_type) [["abc"], [...]] >>> r = _ >>> r[1] is r True >>> my_type = {} >>> my_type["abc"] = my_type >>> my_type["def"] = [my_type] >>> my_data = {} >>> my_data["abc"] = my_data >>> my_data["def"] = my_data >>> r = check_type(my_data, my_type) >>> r["abc"] is r True >>> r["def"][0] is r True >>> my_obj = [] >>> my_obj2 = [my_obj] >>> my_obj.append(my_obj2) >>> my_obj.append(1) >>> my_type = [] >>> my_type.append(my_type) >>> check_type(my_obj, (my_type, [(my_type, int)])) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: [[[...]], 1] cannot match type ([[...]], [([[...]], <... "int">)]) >>> my_type = [] >>> my_type.append(my_type) >>> check_type(1, my_type) Traceback (most recent call last): ... TypeMismatchException: 1 cannot match type [[...]] >>> check_type(True, bool) True >>> check_type(1, [[[[[[[[[[int]]]]]]]]]]) [[[[[[[[[[1]]]]]]]]]] >>> check_type([], [int, str]) # doctest: +ELLIPSIS Traceback (most recent call last): ... InvalidTypeException: [<... "int">, <... "str">] is not a valid type: list must contain 0 or 1 valid inner type >>> check_type([], []) [] >>> check_type([1,2,3], []) [1, 2, 3] >>> check_type([1,"abc"], []) [1, "abc"] >>> check_type((1, "abc"), []) [1, "abc"] >>> check_type({"a": 1}, []) [{"a": 1}] >>> check_type(1, {}) Traceback (most recent call last): ... TypeMismatchException: 1 cannot match type {} >>> check_type([], {}) Traceback (most recent call last): ... TypeMismatchException: [] cannot match type {} >>> check_type({"a":1}, {}) {"a": 1} >>> check_type({"a":1}, {"b": int}) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: {"a": 1} cannot match type {"b": <... "int">}: key "b" is required >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int}) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: "abc" cannot match type <... "int"> >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int, "abe": str}) == {"abc": 1, "abd": 2, "abe": "abc"} True >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int, "?abe": str}) == {"abc": 1, "abd": 2, "abe": "abc"} True >>> check_type({"abc": 1, "def": "abc"}, {"abc": int}) == {"abc": 1, "def": "abc"} True >>> check_type({"abc": 1, "abc": 2, "bcd": "abc", "bce": "abd"}, {"~a.*": int, "~b.*": str}) == {"abc": 1, "abc": 2, "bcd": "abc", "bce": "abd"} True >>> my_type = (str, []) >>> my_type[1].append(my_type) >>> check_type(1, my_type) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: 1 cannot match type (<... "str">, [(...)]) >>> my_obj = [] >>> my_obj.append(my_obj) >>> my_obj.append(1) >>> check_type(my_obj, my_type) # doctest: +ELLIPSIS Traceback (most recent call last): ... TypeMismatchException: [[...], 1] cannot match type (<... "str">, [(...)]) >>> my_obj = [] >>> my_obj.append(my_obj) >>> my_obj.append("abc") >>> check_type(my_obj, my_type) [[...], "abc"] >>> my_type = [] >>> my_type2 = {"a": my_type, "b": my_type} >>> my_type.append(my_type2) >>> my_obj = {} >>> my_obj["a"] = my_obj >>> my_obj["b"] = my_obj >>> r = check_type(my_obj, my_type) >>> r[0]["a"][0] is r[0]["b"][0] True >>> r[0]["a"][0] is r[0] True >>> r = check_type(my_obj, my_type2) >>> r["a"][0] is r["b"][0] True >>> r["a"][0] is r True >>> my_obj2 = [] >>> my_obj2.append(my_obj2) >>> my_obj2.append(1) >>> my_obj = [my_obj2, my_obj2] >>> my_type = [] >>> my_type.append((int, my_type)) >>> check_type(my_obj, my_type) [[[...], 1], [[...], 1]] >>> r = _ >>> r[0] is r[1] True >>> my_type = [] >>> my_type.append(([int], my_type)) >>> check_type(my_obj, my_type) [[[...], [1]], [[...], [1]]] >>> r = _ >>> r[0] is r[1] True """ return _check_type_inner(value, type)

我們來解讀一下比較關鍵的實現細節：

try: _long = longexcept Exception: _long = inttry: _unicode = unicodeexcept Exception: _unicode = strdef _check_type_inner(value, type_, _recursive_check = None): # print("Check type:", value, id(value), type_, id(type_)) if _recursive_check is None: # current, succeeded, failed, listloop _recursive_check = ({}, {}, {}, set()) current_check, succeded_check, failed_check, list_loop = _recursive_check # Use (id(value), id(type)) to store matches that are done before check_id = (id(value), id(type_)) if check_id in succeded_check: # This match is already done, return the result # print("Hit succedded cache:", succeded_check[check_id], id(succeeded_check[check_id])) return succeded_check[check_id] elif check_id in failed_check: # This match is already failed, raise the exception raise failed_check[check_id] elif check_id in current_check: # print("Hit succedded cache:", current_check[check_id], id(current_check[check_id])) # This match is in-operation. The final result is depended by itself. Return the object itself to form a recursive structure. return current_check[check_id] return_value = None try: if type_ == None: # Match None only if value is not None: raise TypeMismatchException(value, type_) else: return_value = value elif type_ == (): if value is None: raise TypeMismatchException(value, type_) else: return_value = value elif type_ is int or type_ is _long: # Enhanced behavior when matching int type: long is also matched; bool is NOT matched if not isinstance(value, bool) and (isinstance(value, int) or isinstance(value, _long)): return_value = value else: raise TypeMismatchException(value, type_) elif type_ is str or type_ is _unicode: # Enhanced behavior when matching str: unicode is always matched (even in Python2) if isinstance(value, str) or isinstance(value, _unicode): return_value = value else: raise TypeMismatchException(value, type_) elif isinstance(type_, type): if isinstance(value, type_): return_value = value else: raise TypeMismatchException(value, type_) elif isinstance(type_, tuple): for subtype in type_: try: return_value = _check_type_inner(value, subtype, _recursive_check) except TypeMismatchException: continue else: break else: raise TypeMismatchException(value, type_) elif isinstance(type_, list): if len(type_) > 1: raise InvalidTypeException(type_, "list must contain 0 or 1 valid inner type") if not type_: # matches any list or tuple if isinstance(value, list) or isinstance(value, tuple): return_value = list(value) else: return_value = [value] else: subtype = type_[0] if isinstance(value, list) or isinstance(value, tuple): # matches a list or tuple with all inner objects matching subtype current_result = [] # save the reference to the list current_check[check_id] = current_result # backup succedded check: it may depends on current result. If the type match fails, revert all succeeded check _new_recursive_check = (current_check, dict(succeded_check), failed_check, set()) current_result.extend(_check_type_inner(o, subtype, _new_recursive_check) for o in value) # copy succeeded checks succeded_check.clear() succeded_check.update(_new_recursive_check[1]) else: # a non-list value like "abc" cannot match an infinite looped [[...]] # when a non-list value is replaced to a list, we must prevent it from forming an infinite loop if check_id in list_loop: raise TypeMismatchException(value, type_) list_loop.add(check_id) try: current_result = [_check_type_inner(value, subtype, _recursive_check)] finally: list_loop.discard(check_id) return_value = current_result elif isinstance(type_, dict): if not isinstance(value, dict): raise TypeMismatchException(value, type_) if not type_: return_value = dict(value) else: required_keys = dict((k[1:] if isinstance(k, str) and k.startswith("!") else k, v) for k,v in type_.items() if not isinstance(k, str) or (not k.startswith("?") and not k.startswith("~"))) optional_keys = dict((k[1:], v) for k, v in type_.items() if k.startswith("?")) regexp_keys = [(k[1:], v) for k, v in type_.items() if k.startswith("~")] # check required keys for k in required_keys: if k not in value: raise TypeMismatchException(value, type_, "key " + repr(k) + " is required") optional_keys.update(required_keys) current_result = {} # save the reference to the dict current_check[check_id] = current_result # backup succedded check: it may depends on current result. If the type match fails, revert all succeeded check _new_recursive_check = (current_check, dict(succeded_check), failed_check, set()) for k, v in value.items(): if k in optional_keys: current_result[k] = _check_type_inner(v, optional_keys[k], _new_recursive_check) else: for rk, rv in regexp_keys: if re.search(rk, k): current_result[k] = _check_type_inner(v, rv, _new_recursive_check) break else: current_result[k] = v # copy succeeded checks succeded_check.clear() succeded_check.update(_new_recursive_check[1]) return_value = current_result else: raise InvalidTypeException(type_, "Unrecognized type") except Exception as exc: # This match fails, store the exception failed_check[check_id] = exc if check_id in current_check: del current_check[check_id] raise else: # This match succeeded if check_id in current_check: del current_check[check_id] # Only store the succeded_check if necessary. succeded_check[check_id] = return_value return return_value

這個實現中最困難的部分就是對遞歸結構的處理，我們可以看到，其中關鍵的細節在於，在遞歸調用過程中，傳遞了一個這樣的參數：

if _recursive_check is None: # current, succeeded, failed, listloop _recursive_check = ({}, {}, {}, set())

其中其實包括了三個dict和一個set。它們的key，是(id(value), id(type_))，id在Python中會返回對象的一個唯一標識，之所以使用id號而不是直接使用對象，是因為像dict, list這樣的對象是unhashable的，不能放進set或者map當中，使用dict則沒有這種問題。雖然一般來說，使用id號標識一個對象是不推薦的，因為對象可能會被GC釋放，然後其他對象可能會重新佔據這個id號，但是在我們這個過程中，對象不會被修改，id也只做臨時使用，所以沒有什麼大問題。在Python系統庫的json、pickle等實現中，也使用了這種方法。

在這裡使用值、類型的元組作為標識也是一個要點，並不能直接用值，因為不同的值可能會匹配到不同的類型，產生的結果也是不同的。但是在值和類型不改變的時候，匹配的結果則是唯一的。

這四個對象分別代表這樣的意思：

current - 記錄了遞歸到這個位置時，哪些匹配尚未完成，它們實際上就是調用當前這個過程的祖先。這個map的值指向一個list或者dict的對象，這個對象在匹配過程中正在被構建出來，是個不完整的對象。如果出現了循環，則直接使用current中的值來替代進一步的遞歸過程，這樣就防止出現無限循環
succeeded - 記錄了所有已經成功匹配的對象。之所以需要這一項，是為了在原始數據中相同引用的對象可以得到相同的結果，比如[my_obj, my_obj]這樣的結構，我們會希望返回列表中的兩個對象也是同一個引用。map的值指向已經成功生成的對象。
failed - 記錄已經失敗的匹配。一般來說沒有這一項也可以實現演算法，但有這一項在某些情況下可以減少嘗試的次數。由於存在用元組表示多種類型之一的語法，子結構匹配失敗並不總是會導致整體結構匹配失敗。
listloop - 這個記錄用來防止無限嵌套結構與非無限嵌套結構的匹配

----注意這裡的大坑 ----

第四點是非常容易被忽略的，我們來考慮這個測試例：

my_type = []my_type.append(my_type)check_type(1, my_type)

在第一個版本的實現里，這個表達式是可以成功匹配的！匹配的結果是[[...]]

原因在於我們規定了，如果是非列表的量，它可以自動轉換為只有一項的列表。也就是說，1可以匹配int, 也可以匹配[int]，進一步可以匹配[[int]]，進一步可以匹配[[[int]]]……

所以……它也許也能匹配[[...]]這個無限嵌套的列表呢？畢竟有無限層，也許最裡面有個int呢……

只特別判斷一下如果內層還有列表的時候不能無限嵌套行不行呢？也是不行的

my_type = []my_type.append((str, my_type))check_type(1, my_type)

解決這個問題的要點在於，在匹配過程中，如果我們曾經將一個非列表值轉化為列表，在它成功匹配到非列表類型之前，不能再重新匹配到同一個列表類型——這句話有點繞，表現為代碼就是這最後一個set。當我們將一個值轉化為列表的時候，會設置一個標識在這個set裡面，在繼續遞歸的時候，如果發現這個標識已經設置了，說明我們遇到了一個無限循環的嵌套結構，必須按照匹配失敗來返回。如果遞歸調用沒有失敗而是成功返回，則清除這個標識。

但是，如果說已經成功匹配到了一個非列表項，就需要在繼續遞歸的時候清除這個標識，考慮這個測試例：

>>> my_type = []>>> my_type2 = {"a": my_type, "b": my_type}>>> my_type.append(my_type2)>>> my_obj = {}>>> my_obj["a"] = my_obj>>> my_obj["b"] = my_obj>>> check_type(my_obj, my_type)[{"a": [{...}], "b": [{...}]}]

我們將一個不包含列表的無限嵌套結構，轉化成了每一層都增加了一層列表的無限嵌套結構，這個匹配是合法的。我們在dict的遞歸調用的時候，使用了這條語句：

# backup succedded check: it may depends on current result. If the type match fails, revert all succeeded check _new_recursive_check = (current_check, dict(succeded_check), failed_check, set())

在這裡提供了一個新的set替代以前的set。為什麼不使用clear()？如果這個調用失敗了，拋出異常並返回到上層的時候，需要重新恢復以前的標識。

另一個注意點在succeeded的使用上。在數據包含遞歸結構的時候，succeeded當中記錄的結果，可能包括了current中的尚未完成的匹配——這些匹配並不保證最終能夠成功。所以，在遞歸調用的時候，如果我們創建了一個current項，而這個項最終匹配失敗了，為了安全起見，我們必須清空在遞歸調用過程中增加的所有的succeeded里的緩存，因為其中的某些項可能包含了已經失敗了這個current匹配。這就是_new_recursive_check中創建了一個新的succeeded_check的map，只有成功的時候才將它寫進以前的map的原因。

failed沒有這個問題，如果某個匹配失敗了，它在任何情況下都會失敗。

代碼中有一些細節上的實現改動：

無論Python2還是Python3，int和long視作同一種類型；str和unicode視作同一種類型。
不允許bool值（True，False）匹配int類型。在Python中，bool是int的一個子類，True實際上就是1，因此isinstance(True, int)會返回True。但在我們的程序中做了特殊處理。

從今天起請叫我廣義表匹配狂魔