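Python crawler image download problem: why can't the images on some websites be fetched?
URL: http://www.variflight.com/flight/PEK-SYX.html?AE71649A58c77fdate=20160518
See the screenshot above; the image I want to fetch is the part marked in red. With urllib.urlretrieve in Python I can download images from Baidu Tieba, but it does not work on this site. What I tried: copying the image's URL and opening it in a browser. It usually fails to open and only occasionally works. The crawler has never succeeded, not even once.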
Could someone explain what causes this and how to solve this kind of problem? I used the code below. It works for the test site, but not for the images on this site.
# -*- coding: utf-8 -*-
# Python 2 code (urllib2, cookielib, print statement)
import cookielib
import urllib
import urllib2
from PIL import Image

def show_image(target="1.jpg"):
    # Open the saved file with PIL; this fails if the file is not a valid image
    try:
        img = Image.open(target)
        img.show()
    except IOError:
        print "pic not exist"

def get_image(url="http://hbimg.b0.upaiyun.com/d2024a8a998c8d3e4ba842e40223c23dfe1026c8bbf3-OudiPA_fw658"):
    # Plain download with urllib.urlretrieve
    target = "urllib.jpg"
    urllib.urlretrieve(url, target)
    show_image(target)

def get_picture(url="http://hbimg.b0.upaiyun.com/d2024a8a998c8d3e4ba842e40223c23dfe1026c8bbf3-OudiPA_fw658"):
    # Download through a cookie-aware opener with a browser User-Agent
    cj = cookielib.LWPCookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
    req = urllib2.Request(url, headers=headers)
    operate = opener.open(req)
    data = operate.read()
    target = "urllib2.jpg"
    with open(target, "wb") as f:
        f.write(data)
    show_image(target)

get_picture()
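When a download "fails" like this, it can help to check what the server actually returned before handing the file to PIL. The following is only a minimal sketch in Python 3: `sniff_image_type` is a hypothetical helper, not part of the code above, and it only recognizes three common formats by their leading bytes.

```python
# Hypothetical helper (an assumption for illustration, not from the question):
# decide whether downloaded bytes look like an image by their file signature.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def sniff_image_type(data):
    """Return 'png', 'jpeg' or 'gif' if data starts with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None
```

If this returns None for the saved bytes, the server most likely sent something other than the image (for example an HTML error page), which points at the request itself rather than at the file-writing code.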
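I suggest fetching the image with a urllib2 Request instead.
The reason: if you capture the traffic, you will see that the browser's request for the image carries the headers below, including a Referer and a session Cookie, which a plain urllib.urlretrieve call does not send:
GET /flight/detail/productImgs=VDlzbXFpVEtiakZqbHZIaGtpQUR6YWp1RkVzPQ==w=50h=28fontSize=14fontColor=2f3032background=ffffff?AE71649A58c77= HTTP/1.1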
Host: www.variflight.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:46.0) Gecko/20100101 Firefox/46.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://www.variflight.com/flight/PEK-SYX.html?AE71649A58c77fdate=20160518
Cookie: PHPSESSID=plc269662oe5lbjqho47l9gc12; salt=5745c5aa1ee87; Hm_lvt_d1f759cd744b691c20c25f874cadc061=1464189809; Hm_lpvt_d1f759cd744b691c20c25f874cadc061=1464189835
Connection: keep-alive
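As a rough illustration of that suggestion, here is a sketch in Python 3 (`urllib.request` is the Python 3 counterpart of `urllib2`). The function name `build_image_request` and its parameters are assumptions made for this example; the header values are copied from the capture above. Whether the site checks the Referer, the Cookie, or both is not confirmed here.

```python
import urllib.request

def build_image_request(image_url, page_url, cookie=None):
    """Build a Request that carries browser-like headers for an image.

    image_url: URL of the image to fetch
    page_url:  URL of the page that embeds the image, sent as Referer
    cookie:    optional raw Cookie header value copied from the browser
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:46.0) "
                      "Gecko/20100101 Firefox/46.0",
        "Accept": "image/png,image/*;q=0.8,*/*;q=0.5",
        "Referer": page_url,
    }
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(image_url, headers=headers)
```

The request can then be passed to `urllib.request.urlopen(...)` and the result written to a file opened in `"wb"` mode. `Accept-Encoding: gzip, deflate` is left out on purpose, because urllib does not decompress a gzip-encoded response body automatically.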
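I agree with the answer above: add all of the header information so that the request fully imitates a browser, and then the image can be fetched. I am currently using urllib from Python 3. Below is the code I used; it retrieved the image quite easily.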
import urllib.request
import http.cookiejar
import re
# Cookie-aware opener, installed globally so that urlopen() keeps the site's cookies
cj = http.cookiejar.MozillaCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
# Browser-like headers for the page request
HEADER = {
    "Host": "www.variflight.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:46.0) Gecko/20100101 Firefox/46.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Cache-Control": "max-age=0",
}
# Send the headers with the page request, then read the HTML
req = urllib.request.Request("http://www.variflight.com/flight/PEK-SYX.html?AE71649A58c77fdate=20160518", headers=HEADER)
rt = urllib.request.urlopen(req)
html = rt.read().decode()
imglist = re.findall("&
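The two-step flow described in this answer (load the page first so that the session cookies are set, then request the image through the same opener) might look roughly like the sketch below. `make_opener` and `fetch_image` are hypothetical names, and the assumption that the image is served only to a session that has already visited the page is not verified here.

```python
import http.cookiejar
import urllib.request

def make_opener():
    """Return (opener, jar); the opener stores cookies and replays them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def fetch_image(opener, page_url, image_url, user_agent):
    """Visit page_url, then fetch image_url with the same opener; return the image bytes."""
    # Step 1: load the page so the server can set its session cookies.
    page_req = urllib.request.Request(page_url, headers={"User-Agent": user_agent})
    opener.open(page_req).read()
    # Step 2: request the image with a Referer pointing back at the page.
    img_req = urllib.request.Request(
        image_url,
        headers={"User-Agent": user_agent, "Referer": page_url},
    )
    return opener.open(img_req).read()
```

A typical call would be `opener, jar = make_opener()` followed by `data = fetch_image(opener, page_url, image_url, ua)`, with `data` then written to a file in binary mode. Passing the opener in as a parameter keeps the function easy to try out with a stub instead of the live site.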
Idea: use XPath to get the SRC attribute of this image, then join the host from the request headers with that SRC value. (I just tried it quickly in Chrome, and it works.)
Joined URL: http://www.variflight.com/flight/detail/productImgs=Q2pPVXArR3FPV0RJdFN2V3VOSlo5QTJhTFRJPQ==w=50h=28fontSize=14fontColor=2f3032background=ffffff?AE71649A58c77=
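A minimal sketch of that idea in Python 3 follows. The answer uses XPath; since the exact expression is not given, this example swaps in the standard library's `HTMLParser` and simply collects the src of every img tag, then joins each one with the page URL via `urljoin`. The names `ImgSrcCollector` and `absolute_image_urls` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def absolute_image_urls(page_url, html):
    """Return absolute URLs for all <img src=...> values found in html."""
    parser = ImgSrcCollector()
    parser.feed(html)
    # urljoin resolves a site-relative src such as "/flight/detail/..."
    # against the scheme and host of page_url.
    return [urljoin(page_url, src) for src in parser.sources]
```

Using `urljoin` rather than plain string concatenation means a src that is already absolute is left unchanged, while a site-relative one picks up the host of the page.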