Scraping the jandan.net 妹子圖 gallery with Python
I'd been meaning for a long time to write a crawler that grabs every picture from jandan.net's 妹子圖 section; after all, some of them are pretty risqué.
If you don't care about the walkthrough, just copy and paste the code below. Provided the dependencies are installed, a full crawl should finish in about one night. It's a single-threaded crawler with no speed optimizations, since hogging too many server resources wouldn't be nice (honestly, I was just too lazy to write that part).
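The only third-party dependencies are requests and BeautifulSoup; if they are missing, something along these lines should install them (the exact command depends on your Python setup):

pip install requests beautifulsoup4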
from bs4 import BeautifulSoup as bs
import os
import re
import requests
import time

url_list = []
url_undown_list = []
image_undown_list = []
path = "d:/Picture/jd/"
log = "d:/Picture/jdlog.txt"

def init():
    home_url = "http://jandan.net/ooxx/page-2227#comments"
    header = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    }
    url_list.append(home_url)
    return header

def get_page_id(html):
    # The "Older Comments" link points to the next (older) page.
    next_page_url = html.find("a", {"title": "Older Comments"}).attrs["href"]
    print(next_page_url)
    url_list.append(next_page_url)
    return 0

def get_img_url(html):
    # Full-size images are wrapped in <a class="view_img_link"> tags.
    image_url = html.find_all("a", {"class": "view_img_link"})
    return image_url

def down_save_img(image_url_list, page_id):
    # Capture the file extension (".jpg", ".gif", ...) at the end of the URL.
    re_picture_format = re.compile(r"\.\w{3,5}$")
    i = 0
    for item in image_url_list:
        image_url1 = item.attrs["href"]
        image_url1 = image_url1[2:]  # strip the leading "//" of the protocol-relative URL
        print("downloading")
        print(image_url1)
        try:
            pic = requests.get("http://" + image_url1, timeout=1000)
        except requests.exceptions.ConnectionError:
            print("this pic cannot be downloaded")
            continue
        picture_format = re_picture_format.findall(image_url1)
        string = path + "page" + page_id + "pic" + str(i) + picture_format[0]
        fp = open(string, "wb")
        try:
            fp.write(pic.content)
        except:
            print("cannot write this pic")
        fp.close()
        i += 1

def main():
    header = init()
    i = 150
    while True:
        for item in url_list:
            try:
                req = requests.get(item, headers=header, timeout=1000)
                html = bs(req.text, "html.parser")
            except requests.exceptions.ConnectionError:
                print(item)
                print("cannot connect")
                url_list.remove(item)
                time.sleep(3)
                continue
            except requests.exceptions.HTTPError:
                print(item)
                print("http error:" + str(requests.exceptions.HTTPError))
                url_list.remove(item)
                time.sleep(3)
                continue
            print("get__" + item)
            get_page_id(html)
            image_url_list = get_img_url(html)
            down_save_img(image_url_list, page_id=str(i))
            url_list.remove(item)
            time.sleep(1)
            i += 1
            fp = open(log, "a+")
            try:
                fp.write(item + "__has been downloaded" + "\n")
                fp.write("___________________________________\n")
            except:
                print("write log failed")
            fp.close()
    return 0

if __name__ == "__main__":
    main()
The page structure of jandan.net's 妹子圖 section is dead simple.
For example:
http://jandan.net/ooxx/page-2438#comments
That's the front-page address; for the next page you just change page-2438 to page-2437. Still, to avoid tripping over gaps in the page numbering, the crawler reads the next page's address out of the HTML instead:
def get_page_id(html):
    next_page_url = html.find("a", {"title": "Older Comments"}).attrs["href"]
    print(next_page_url)
    url_list.append(next_page_url)
    return 0
This one function does the job: once the next-page link is found, it is appended straight to url_list, and the crawler pulls its page URLs from that list.
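As a quick illustration of what that find() call matches, here is a minimal sketch run against a hand-written fragment; the snippet below is made up for demonstration, but the real pages carry a link with this title attribute:

from bs4 import BeautifulSoup as bs

# Hypothetical fragment imitating the pagination link on a list page.
sample = '<a title="Older Comments" href="http://jandan.net/ooxx/page-2437#comments">2437</a>'
html = bs(sample, "html.parser")
print(html.find("a", {"title": "Older Comments"}).attrs["href"])
# http://jandan.net/ooxx/page-2437#comments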
After opening a page, pick out the image addresses:
def get_img_url(html):
    image_url = html.find_all("a", {"class": "view_img_link"})
    return image_url
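The hrefs matched here are protocol-relative (they start with "//"), which is why down_save_img below strips the first two characters before prepending "http://". A minimal sketch with a made-up fragment (the image host and filename are hypothetical):

from bs4 import BeautifulSoup as bs

# Hypothetical fragment; real pages wrap each full-size image in an
# <a class="view_img_link"> tag whose href starts with "//".
sample = '<a class="view_img_link" href="//ww1.example.com/large/abc123.jpg">[view_img]</a>'
html = bs(sample, "html.parser")
for a in html.find_all("a", {"class": "view_img_link"}):
    href = a.attrs["href"]
    print(href)      # //ww1.example.com/large/abc123.jpg
    print(href[2:])  # ww1.example.com/large/abc123.jpg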
Then save the images to disk:
def down_save_img(image_url_list, page_id):
    re_picture_format = re.compile(r"\.\w{3,5}$")  # trailing file extension, e.g. ".jpg"
    i = 0
    for item in image_url_list:
        image_url1 = item.attrs["href"]
        image_url1 = image_url1[2:]  # strip the leading "//"
        print("downloading")
        print(image_url1)
        try:
            pic = requests.get("http://" + image_url1, timeout=1000)
        except requests.exceptions.ConnectionError:
            print("this pic cannot be downloaded")
            continue
        picture_format = re_picture_format.findall(image_url1)
        string = path + "page" + page_id + "pic" + str(i) + picture_format[0]
        fp = open(string, "wb")
        try:
            fp.write(pic.content)
        except:
            print("cannot write this pic")
        fp.close()
        i += 1
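The only cryptic bit is the regular expression: it grabs the trailing file extension so the saved file keeps the right suffix, and the filename is then built from the page number and a per-page counter. A small standalone check, using a made-up URL:

import re

re_picture_format = re.compile(r"\.\w{3,5}$")  # matches ".jpg", ".jpeg", ".gif", ...

image_url1 = "ww1.example.com/large/abc123.jpg"  # hypothetical image URL
picture_format = re_picture_format.findall(image_url1)
print(picture_format[0])  # .jpg
print("d:/Picture/jd/" + "page150" + "pic0" + picture_format[0])
# d:/Picture/jd/page150pic0.jpg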