
Crawling the girl pictures on 煎蛋網 (jandan.net) with Python

My graduation project is more or less done, so I can finally start playing with crawlers.

I had been meaning for a long time to write a crawler that grabs all of the girl pictures on jandan.net; they are pretty racy, after all.

If you don't want to read through the process, just copy and paste the code below. With the dependencies installed (requests and beautifulsoup4, per the imports), one night should be enough to crawl everything. It is a single-threaded crawler with no speed optimization; hogging too much of the server's resources wouldn't be nice anyway (honestly, I was just too lazy to write it).

from bs4 import BeautifulSoup as bs
import os
import re
import requests
import time

url_list = []
url_undown_list = []
image_undown_list = []
path = "d:/Picture/jd/"
log = "d:/Picture/jdlog.txt"


def init():
    # Start from a fixed page and send browser-like headers.
    home_url = "http://jandan.net/ooxx/page-2227#comments"
    header = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    }
    url_list.append(home_url)
    return header


def get_page_id(html):
    # Pull the "Older Comments" link out of the page and queue it as the next page.
    next_page_url = html.find("a", {"title": "Older Comments"}).attrs["href"]
    print(next_page_url)
    url_list.append(next_page_url)
    return 0


def get_img_url(html):
    # The full-size images sit behind <a class="view_img_link"> anchors.
    image_url = html.find_all("a", {"class": "view_img_link"})
    return image_url


def down_save_img(image_url_list, page_id):
    re_picture_format = re.compile(r"\.\w{3,5}$")  # grab the file extension, e.g. ".jpg"
    i = 0
    for item in image_url_list:
        image_url1 = item.attrs["href"]
        image_url1 = image_url1[2:]  # drop the leading "//" of the protocol-relative URL
        print("downloading")
        print(image_url1)
        try:
            pic = requests.get("http://" + image_url1, timeout=1000)
        except requests.exceptions.ConnectionError:
            print("this pic cannot be downloaded")
            continue
        picture_format = re_picture_format.findall(image_url1)
        string = path + "page" + page_id + "pic" + str(i) + picture_format[0]
        fp = open(string, "wb")
        try:
            fp.write(pic.content)
        except:
            print("cannot write this pic")
        fp.close()
        i += 1


def main():
    header = init()
    i = 150  # counter used only for naming the saved files
    while True:
        for item in url_list:
            try:
                req = requests.get(item, headers=header, timeout=1000)
                html = bs(req.text, "html.parser")
            except requests.exceptions.ConnectionError:
                print(item)
                print("cannot connect")
                url_list.remove(item)
                time.sleep(3)
                continue
            except requests.exceptions.HTTPError:
                print(item)
                print("http error:" + str(requests.exceptions.HTTPError))
                url_list.remove(item)
                time.sleep(3)
                continue
            print("get__" + item)
            get_page_id(html)                     # queue the next page
            image_url_list = get_img_url(html)
            down_save_img(image_url_list, page_id=str(i))
            url_list.remove(item)                 # done with this page
            time.sleep(1)
            i += 1
            fp = open(log, "a+")
            try:
                fp.write(item + "__has been download" + "\n")
                fp.write("___________________________________\n")
            except:
                print("write log failed")
            fp.close()
    return 0


if __name__ == "__main__":
    main()

The page structure of Jandan's girl-picture section is extremely simple.

For example:

http://jandan.net/ooxx/page-2438#comments

That is the address of the front page. To reach the next page you could simply change page-2438 to page-2437, but to avoid running into gaps in the page numbering, the crawler instead pulls the next page's address out of the HTML:

def get_page_id(html):
    next_page_url = html.find("a", {"title": "Older Comments"}).attrs["href"]
    print(next_page_url)
    url_list.append(next_page_url)
    return 0

That is all this one function does: once the next-page URL is found, it is appended to url_list, which is the list the crawler takes its page links from.
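For comparison, the naive alternative mentioned above, simply decrementing the page number in the URL, would look roughly like this sketch. It assumes the page-NNNN pattern from the example URL and no gaps in the numbering, which is exactly why the script prefers reading the link from the HTML; the helper name is hypothetical:

import re

# Hypothetical helper: derive the next (older) page by decrementing the number
# in a URL like http://jandan.net/ooxx/page-2438#comments.
def next_page_by_number(url):
    page = int(re.search(r"page-(\d+)", url).group(1))
    return re.sub(r"page-\d+", "page-%d" % (page - 1), url)

print(next_page_by_number("http://jandan.net/ooxx/page-2438#comments"))
# -> http://jandan.net/ooxx/page-2437#comments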

After opening a page, find the addresses of the images in it:

def get_img_url(html):
    image_url = html.find_all("a", {"class": "view_img_link"})
    return image_url
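A minimal illustration of what this returns, using a made-up snippet of markup (the href is hypothetical, but judging from the [2:] slice and the "http://" prefix in down_save_img, the real view_img_link hrefs are protocol-relative, i.e. they start with //):

from bs4 import BeautifulSoup

# Made-up markup for illustration only.
sample = '<a class="view_img_link" href="//wx1.sinaimg.cn/large/example.jpg">[查看原图]</a>'
soup = BeautifulSoup(sample, "html.parser")
links = soup.find_all("a", {"class": "view_img_link"})
print(links[0].attrs["href"])  # //wx1.sinaimg.cn/large/example.jpg (protocol-relative)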

Then save the images locally:

def down_save_img(image_url_list, page_id):
    re_picture_format = re.compile(r"\.\w{3,5}$")  # grab the file extension, e.g. ".jpg"
    i = 0
    for item in image_url_list:
        image_url1 = item.attrs["href"]
        image_url1 = image_url1[2:]  # drop the leading "//" of the protocol-relative URL
        print("downloading")
        print(image_url1)
        try:
            pic = requests.get("http://" + image_url1, timeout=1000)
        except requests.exceptions.ConnectionError:
            print("this pic cannot be downloaded")
            continue
        picture_format = re_picture_format.findall(image_url1)
        string = path + "page" + page_id + "pic" + str(i) + picture_format[0]
        fp = open(string, "wb")
        try:
            fp.write(pic.content)
        except:
            print("cannot write this pic")
        fp.close()
        i += 1
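A quick check of what the extension regex matches on a hypothetical image URL; the match, including the dot, is what gets appended to the saved filename:

import re

re_picture_format = re.compile(r"\.\w{3,5}$")
# Hypothetical URLs, already stripped of their leading "//":
print(re_picture_format.findall("wx1.sinaimg.cn/large/example.jpg"))   # ['.jpg']
print(re_picture_format.findall("wx1.sinaimg.cn/large/example.jpeg"))  # ['.jpeg']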

Pages that have already been downloaded are recorded in the log file, so if the crawl gets interrupted, just look up the breakpoint in the log next time and change the starting URL. The whole thing was written in a hurry without much thought; it works, and that's good enough.
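The script has no automatic resume, but since main() appends each finished page URL to the log with a "__has been download" suffix, a hypothetical helper for finding the breakpoint could look like this (log path and line format taken from the script above; the returned URL would then replace home_url in init()):

# Hypothetical resume helper, not part of the original script.
def last_downloaded_url(log_path="d:/Picture/jdlog.txt"):
    last = None
    try:
        with open(log_path, encoding="utf-8") as fp:
            for line in fp:
                if "__has been download" in line:
                    last = line.split("__has been download")[0].strip()
    except FileNotFoundError:
        pass
    return last

print(last_downloaded_url())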

