How do I set a proxy (GoAgent/GoAgentX) when a Python crawler uses requests?

12-28

# -*- coding: utf-8 -*-
import requests

s = requests.session()
login_data = {"email": "myEmail", "password": "psw"}
s.post("https://www.facebook.com/", login_data)
r = s.get("http://www.facebook.com/people/someone/about")
print(r.text.encode("utf-8"))

This code works fine when I crawl Zhihu.

So many people are discussing this in the comments, so why does nobody give it an upvote..

————————

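Thanks for the invite. This question is a year old, so I guess readers still haven't worked it out from the other answers. 三不青年's approach is actually correct; it just adds login_data. I happen to have been summarizing these basic anti-anti-crawling techniques lately. If you're interested, have a look at my open-source project: Anti-Anti-Spider/Forge_head/requests at master · KCPClub/Anti-Anti-Spider · GitHub

In short: pass in url and headers. The ip in proxies is the proxy's IP address + ":" + port, and it can also be passed in as a function argument, which lets you rotate IPs.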

# coding: utf-8
import requests

# Edit the header fields here. Press F12 in Chrome to see your own browser's
# request headers; headers written for the target site work better.
headers = {
    # "Host" must match the site being requested; change or remove it for other targets.
    "Host": "map.baidu.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0"
}


# Send the request
def get_request(url, headers, ip=None):
    """Fetch url with the given headers, optionally through the proxy ip ("host:port")."""
    # Optional proxy: both http:// and https:// URLs go through the same HTTP proxy
    proxies = None
    if ip:
        proxies = {
            "http": "http://" + ip,
            "https": "http://" + ip,
        }
    html = requests.get(url, headers=headers, timeout=10, proxies=proxies).text
    print(html)
    return html


if __name__ == "__main__":
    url = "https://www.urlteam.org"
    get_request(url, headers)

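One more tip: the proxies I use for scraping have to be very stable, and the ones inside China are really poor. This overseas free proxy platform lists very few proxies, but they are very stable. http://www.gatherproxy.com/zh/

That's all. Questions about crawling and scraping are welcome, and I'll do my best to answer.

The code didn't fit in the comment section, so I moved it here. The proxy in it may no longer work.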

# encoding=utf8
import sys

import requests
from bs4 import BeautifulSoup

# Python 2 only: make utf-8 the default string encoding
reload(sys)
sys.setdefaultencoding("utf-8")
fs_encoding = sys.getfilesystemencoding()

s = requests.session()
proxie = {"http": "http://122.193.14.102:80"}
url = "http://www.stilllistener.com/checkpoint1/test11/"
header = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36",
}
print(url)
response = s.get(url, headers=header, verify=False, proxies=proxie, timeout=20)
print(response.text)
# Save the raw response body
f = open("FC.html", "wb")
f.write(response.content)
f.close()

s = requests.session()
login_data = {"email": "myEmail", "password": "psw"}
# One entry per URL scheme; both point at the same HTTP proxy
proxies = {"http": "http://x.x.x.x:xx", "https": "http://x.x.x.x:xx"}
s.post("https://www.facebook.com/", login_data)
r = s.get("http://www.facebook.com/people/someone/about", proxies=proxies)
print(r.text.encode("utf-8"))
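For the GoAgent case in the title, here is a minimal sketch. It assumes GoAgent is listening on 127.0.0.1:8087 (its usual default as far as I know; check your own proxy.ini), and the names GOAGENT_ADDRESS, goagent_proxies and make_goagent_session are made up for illustration. Setting the proxies on the session means every request made through it, including a login POST, goes through the proxy. GoAgent re-signs HTTPS traffic with its own CA, so verify probably has to be the path to that CA file, or False if you accept skipping certificate checks.

```python
# Assumption: GoAgent's local listener; adjust to match your proxy.ini.
GOAGENT_ADDRESS = "127.0.0.1:8087"


def goagent_proxies(address=GOAGENT_ADDRESS):
    """Return a requests-style proxies dict pointing both schemes at the local proxy."""
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def make_goagent_session(address=GOAGENT_ADDRESS, verify=True):
    """Create a requests session whose requests all go through the local proxy.

    verify may be True, False, or the path to a CA bundle file.
    """
    import requests  # third-party; imported here so the helper above works without it

    session = requests.Session()
    session.proxies.update(goagent_proxies(address))
    session.verify = verify
    return session


if __name__ == "__main__":
    s = make_goagent_session(verify=False)  # hypothetical: skips certificate checks
    r = s.get("http://www.facebook.com/people/someone/about", timeout=10)
    print(r.status_code)
```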

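Hi, in everything I've read there are at most two proxies like this, proxies={"http":"http://x.x.x.x:xx", "https":"http://x.x.x.x:xx"}. If I want to add a dozen or so proxies, what is the format? And what do "http" and "https" mean? Thanks, I'm a beginner and hoping for an answer.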
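A rough sketch that may help with this question. In requests, the keys "http" and "https" are the scheme of the URL being fetched, not the protocol spoken by the proxy, so one dict holds at most one proxy per scheme. A dozen proxies therefore can't go into a single dict; one way is to keep them in a list and build a fresh dict for each request. The addresses in PROXY_POOL are placeholders, and proxies_for and fetch_with_rotation are hypothetical helper names.

```python
import random

# Placeholder addresses (documentation range); replace with real "host:port" proxies.
PROXY_POOL = [
    "192.0.2.10:8080",
    "192.0.2.11:8080",
    "192.0.2.12:3128",
]


def proxies_for(address):
    """Build a proxies dict that sends http:// and https:// URLs through one HTTP proxy."""
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_rotation(url, pool=PROXY_POOL, attempts=3, timeout=10):
    """Try up to `attempts` randomly chosen proxies and return the first response."""
    import requests  # third-party; imported here so proxies_for works without it

    last_error = None
    for _ in range(attempts):
        address = random.choice(pool)
        try:
            return requests.get(url, proxies=proxies_for(address), timeout=timeout)
        except requests.RequestException as exc:
            last_error = exc  # this proxy failed; try another one
    if last_error is None:
        raise ValueError("attempts must be at least 1")
    raise last_error
```

Picking at random is only one option; cycling through the list in order, or dropping proxies that fail, would work just as well.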

HTTP proxy:
A request sent directly to an HTTP server normally starts with the line
METHOD path PROTOCOL
for example
GET /foo/bar HTTP/1.0

A request sent to an HTTP proxy looks like this instead:
GET http://www.foobar.com/foo/bar HTTP/1.0
Host: www.foobar.com
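To make the difference concrete, here is a small sketch that builds both forms of the request text from a URL. build_get_request is a made-up helper, not part of any library; it only formats the text and sends nothing.

```python
from urllib.parse import urlsplit


def build_get_request(url, via_proxy=False, version="HTTP/1.0"):
    """Return the text of a GET request for url.

    Direct requests put only the path on the first line; requests sent to an
    HTTP proxy put the full URL there so the proxy knows where to forward it.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    if via_proxy:
        target = "{}://{}{}".format(parts.scheme, parts.netloc, path)
    else:
        target = path
    lines = [
        "GET {} {}".format(target, version),
        "Host: {}".format(parts.netloc),  # host name only, never the full URL
        "",
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_get_request("http://www.foobar.com/foo/bar"))
    print(build_get_request("http://www.foobar.com/foo/bar", via_proxy=True))
```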

HTTPS tunnel proxy:
CONNECT www.foobar.com:443 HTTP/1.1
Host: www.foobar.com:443

This kind of proxy amounts to a TCP tunnel.

Tunnel proxies can of course carry other TCP connections as well.
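A minimal sketch of the tunnel handshake using plain sockets. The helper names are hypothetical, and it assumes the proxy allows CONNECT without authentication. After a 2xx reply the socket is a raw TCP pipe to the target, which is where a TLS handshake (or any other TCP protocol) would start.

```python
import socket


def build_connect_request(host, port):
    """Return the bytes of a CONNECT request asking the proxy for a tunnel to host:port."""
    authority = "{}:{}".format(host, port)
    return "CONNECT {0} HTTP/1.1\r\nHost: {0}\r\n\r\n".format(authority).encode("ascii")


def read_status_code(response_head):
    """Extract the numeric status code from the proxy's reply."""
    status_line = response_head.split(b"\r\n", 1)[0]
    return int(status_line.split(b" ")[1])


def open_tunnel(proxy_host, proxy_port, host, port, timeout=10):
    """Connect to the proxy, request a tunnel, and return the connected socket."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        sock.sendall(build_connect_request(host, port))
        head = b""
        # Read until the end of the proxy's response headers.
        while b"\r\n\r\n" not in head:
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("proxy closed the connection")
            head += chunk
        code = read_status_code(head)
        if not 200 <= code < 300:
            raise ConnectionError("proxy refused the tunnel: {}".format(code))
    except Exception:
        sock.close()
        raise
    return sock
```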

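HTTPS man-in-the-middle proxy: this kind of proxy usually disguises itself as a tunnel proxy, so the client connects to it exactly as it would to a tunnel proxy.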

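SOCKS5 tunnel proxy:
SOCKS5 is a binary protocol, but in practice it is used much like the tunnel proxy above. The protocol details are rather interesting, but I won't paste them here.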
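For anyone curious about those details, here is a sketch of the first two client messages as I understand them from RFC 1928: a greeting offering "no authentication", then a CONNECT request addressed by domain name. The function names are made up, and the sketch only builds the bytes; it does not handle the server's replies.

```python
import struct

SOCKS_VERSION = 0x05
METHOD_NO_AUTH = 0x00
CMD_CONNECT = 0x01
ATYP_DOMAIN = 0x03


def build_socks5_greeting():
    """Version, number of offered auth methods, then the methods themselves."""
    return bytes([SOCKS_VERSION, 1, METHOD_NO_AUTH])


def build_socks5_connect(host, port):
    """Version, command, reserved byte, address type, length-prefixed host, port."""
    name = host.encode("ascii")
    if len(name) > 255:
        raise ValueError("host name too long for SOCKS5")
    header = bytes([SOCKS_VERSION, CMD_CONNECT, 0x00, ATYP_DOMAIN, len(name)])
    return header + name + struct.pack("!H", port)  # port is big-endian, 2 bytes


if __name__ == "__main__":
    print(build_socks5_greeting().hex())
    print(build_socks5_connect("www.foobar.com", 443).hex())
```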

Thanks for the invite! Here is a simple method (using the urllib2 module):

# -*- coding:utf-8 -*-
# Python 2 only: urllib2 became urllib.request in Python 3.
import urllib2


def httpRequest(url, proxy=None):
    """
    @summary: Make a network request, optionally through an HTTP proxy ("host:port")
    """
    ret = None
    SockFile = None
    try:
        request = urllib2.Request(url)
        request.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)")
        request.add_header("Pragma", "no-cache")
        if proxy:
            # Route this request through the given HTTP proxy
            request.set_proxy(proxy, "http")
        opener = urllib2.build_opener()
        SockFile = opener.open(request)
        ret = SockFile.read()
    finally:
        # Always close the response, even if reading failed
        if SockFile:
            SockFile.close()
    return ret


if __name__ == "__main__":
    print(httpRequest("http://www.baidu.com", "112.12.4.38:8000"))
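urllib2 exists only on Python 2. On Python 3 the same idea could be sketched with urllib.request and a ProxyHandler, roughly as below. The names build_proxy_opener and http_request are my own, and unlike the urllib2 version this sketch also routes https:// URLs through the proxy, which is an assumption about what you want.

```python
import urllib.request

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"


def build_proxy_opener(proxy=None):
    """Return an opener that sends requests through proxy ("host:port") if given."""
    handlers = []
    if proxy:
        proxy_url = "http://" + proxy
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}))
    return urllib.request.build_opener(*handlers)


def http_request(url, proxy=None, timeout=10):
    """Fetch url and return the response body as bytes."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": USER_AGENT, "Pragma": "no-cache"},
    )
    opener = build_proxy_opener(proxy)
    # The with-block closes the response even if reading fails.
    with opener.open(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # The proxy address is the one from the example above and may no longer work.
    print(http_request("http://www.baidu.com", "112.12.4.38:8000"))
```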