[Python Network Programming] gevent httpclient and web page encoding

http://blog.csdn.net/yueguanghaidao/article/details/27688047

2014

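A while ago I came across the geventhttpclient project (https://github.com/gwik/geventhttpclient). The official documentation says it is very fast because responses are parsed in C, so I have long wanted to use it in a project.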

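I have been wrestling with it for the past two days. Frankly, it is fairly hard to use and the wrapping it provides is thin. Its biggest shortcomings are: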

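1. Redirects are not supported; you have to write the redirect handling yourself, which is tedious.

2. A newly created httpclient object can only send requests to a single domain.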

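This is a real pain, so I spent some time wrapping it to solve both problems, and also added automatic decoding and re-encoding of the page. The code is as follows: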


    #!/usr/bin/env python
    # -*- encoding: UTF-8 -*-
    # Python 2 code: relies on urlparse, dict.iteritems() and the print statement.
    import re

    from geventhttpclient.url import URL
    from geventhttpclient.client import HTTPClient, HTTPClientPool
    from urlparse import urljoin
    # from core.common import urljoin

    # Default headers, added to every request unless the caller overrides them.
    HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
    }

    DEFAULT_METHOD = "GET"
    MAX_REDIRECT_TIME = 10
    DEFAULT_PAGE_ENCODING = "utf8"


    class DifferDomainException(Exception):
        """
        if request different domain url, geventhttpclient will throw it,
        see gevent.client "raise ValueError("Invalid host in URL")"
        """
        def __init__(self, uri):
            # Let Exception store args as a tuple instead of assigning a bare string.
            Exception.__init__(self, uri)
            self.uri = uri


    class MaxRedirectException(Exception):
        def __init__(self, response):
            Exception.__init__(self, response)
            self.response = response


    class HTTP(HTTPClient):

        def request(self, request_uri, method=DEFAULT_METHOD, body=b"", headers=None,
                    follow_redirect=True, redirects=MAX_REDIRECT_TIME):
            if body and method == DEFAULT_METHOD:
                method = "POST"
            # Copy the caller's headers so neither their dict nor a shared default is mutated.
            headers = dict(headers or {})
            h = [k.title() for k in headers.iterkeys()]
            headers.update(dict([(k, v) for k, v in HEADERS.iteritems() if k not in h]))
            response = super(HTTP, self).request(method, request_uri, body, headers)
            if follow_redirect and response.status_code in (301, 302, 303, 307) \
                    and response.method in ("GET", "POST"):
                if redirects:
                    location = (response.get("location") or response.get("content-location")
                                or response.get("uri"))
                    if location:
                        location = urljoin(request_uri, location)
                        # This client is bound to one host; hand other hosts back to the caller.
                        if not location.startswith(self._base_url_string):
                            raise DifferDomainException(location)
                        return self.request(location, method, body, headers,
                                            follow_redirect, redirects - 1)
                else:
                    raise MaxRedirectException(response)
            return response


    class HTTPPool(HTTPClientPool):

        def get_client(self, url):
            if not isinstance(url, URL):
                url = URL(url)
            # One client per (host, port) pair.
            client_key = url.host, url.port
            try:
                return self.clients[client_key]
            except KeyError:
                client = HTTP.from_url(url, **self.client_args)
                self.clients[client_key] = client
                return client


    _POOL = HTTPPool(network_timeout=100, connection_timeout=100)

    META_CHARSET_REGEX = re.compile(
        r'(?si)<head>.*<meta http-equiv="?content-type"?[^>]+charset=(?P<result>[^">]+).*</head>')


    def decodePage(content, content_type):
        httpCharset, metaCharset = None, None
        if content_type and content_type.find("charset=") != -1:
            httpCharset = content_type.split("charset=")[-1]
        match = META_CHARSET_REGEX.search(content)
        if match:
            metaCharset = match.group("result")
        print httpCharset, metaCharset
        # The HTTP header wins, then the <meta> tag, then the default.
        charset = httpCharset or metaCharset or DEFAULT_PAGE_ENCODING
        return content.decode(charset).encode(DEFAULT_PAGE_ENCODING)


    def request(request_uri, method=DEFAULT_METHOD, body=b"", headers=None,
                follow_redirect=True, auto_read=True):
        client = _POOL.get_client(request_uri)
        response = None
        try:
            response = client.request(request_uri, method, body, headers, follow_redirect)
        except DifferDomainException as e:
            print "DifferDomainException:" + e.uri
            # Retry through the pool with a client for the new host; the body is
            # read once, below, so the nested call must not read it as well.
            response = request(e.uri, method, body, headers, follow_redirect, auto_read=False)
        except MaxRedirectException as e:
            print "max redirect"
            response = e.response  # will return previous response, of course redirect response
        except Exception as e:
            print str(e)
        if auto_read and response:
            with response:
                response.content = decodePage(response.read(), response.get("content-type"))
        return response


    def test():
        # print request("http://127.0.0.1/re.php", follow_redirect=False)
        # print request("http://127.0.0.1/re.php", follow_redirect=True).content
        r = request("http://www.baidu.com/", follow_redirect=False)
        # baidu utf8 utf8
        print r.content[:10]
        r = request("http://www.163.com/", follow_redirect=False)
        # 163 gbk gb2312
        print r.content[:10]


    if __name__ == "__main__":
        test()
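
The redirect handling in `HTTP.request` comes down to two steps: resolve the `Location` header against the request URL, then check whether the target still belongs to the host the client was created for. Below is a minimal standalone sketch of that decision, written for Python 3 with only the standard library. The helper names `origin` and `resolve_redirect` are hypothetical and are not part of geventhttpclient.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port) for a URL, filling in the default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port


def resolve_redirect(request_url, location):
    """Resolve a Location header value against the request URL.

    Returns (absolute_target, same_origin), where same_origin tells the
    caller whether the existing client can be reused for the redirect.
    """
    target = urljoin(request_url, location)
    return target, origin(target) == origin(request_url)


if __name__ == "__main__":
    print(resolve_redirect("http://example.com/a/b", "../login"))
    print(resolve_redirect("http://example.com/a/b", "http://www.example.com/"))
```

Comparing parsed components rather than string prefixes avoids treating `http://example.com.evil.test/` as the same site as `http://example.com`.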

While testing page encodings I ran into a few problems; see below:

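Because the response headers arrive first, we generally assume that the encoding of the returned content is taken from the header first, and that the encoding declared in the page is consulted only when the header gives none.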
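
That priority order can be sketched on its own, independent of any HTTP client. The following is a minimal Python 3 sketch, assuming the body is still raw bytes and the `Content-Type` header value is a plain string. The names `header_charset` and `detect_charset` are hypothetical, and the 4096-byte scan window is an arbitrary choice for illustration.

```python
import re

DEFAULT_PAGE_ENCODING = "utf-8"

# Matches charset=... inside a <meta> tag. It works on bytes because the
# page cannot be decoded before its charset is known.
META_CHARSET_RE = re.compile(rb"(?i)<meta[^>]+charset=[\"']?([\w-]+)")


def header_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("\"' ")
    return None


def detect_charset(body, content_type):
    """Header charset first, then the <meta> charset, then the default."""
    from_header = header_charset(content_type)
    if from_header:
        return from_header
    match = META_CHARSET_RE.search(body[:4096])
    if match:
        return match.group(1).decode("ascii")
    return DEFAULT_PAGE_ENCODING
```

For a response with the header `text/html; charset=GBK` and a page declaring `gb2312`, `detect_charset` returns the header value and never looks at the page.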

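Take NetEase's encoding: the header says gbk and the page says gb2312, yet decoding with gb2312 actually fails. I don't understand this; can anyone explain why?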

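Decoding with the header's gbk works fine, which also shows that the header encoding takes priority. In theory the page encoding tells the browser to render the page as gb2312, but that clearly causes problems, so how does the browser manage it?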
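
A likely explanation, offered as an assumption rather than something verified against the pages above: GBK is a superset of GB2312, and pages that declare gb2312 often contain characters that only GBK covers. Browsers that follow the WHATWG Encoding Standard treat the label `gb2312` as GBK, so they never run a strict GB2312 decoder at all. The Python 3 sketch below reproduces the failure with one such character; `decode_like_browser` and `BROWSER_ALIASES` are hypothetical names for illustration.

```python
# Labels that browsers decode with the GBK decoder instead of a strict one.
BROWSER_ALIASES = {"gb2312": "gbk"}


def decode_like_browser(raw, charset):
    """Decode bytes, widening gb2312 to gbk the way browsers do."""
    codec = BROWSER_ALIASES.get(charset.strip().lower(), charset)
    return raw.decode(codec)


if __name__ == "__main__":
    # U+9555 exists in GBK but not in GB2312.
    raw = "\u9555".encode("gbk")
    try:
        raw.decode("gb2312")
    except UnicodeDecodeError as exc:
        print("strict gb2312 failed:", exc.reason)
    print(decode_like_browser(raw, "gb2312") == "\u9555")
```

Applied to the wrapper above, this would mean mapping a detected `gb2312` to `gbk` before calling `decode`, whichever of the header or the page supplied the label.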

Now look at Sina, which frustrates me even more. Can anyone rescue me?

