[Crawler, Python] Scraping the Maoyan Movies TOP100 with requests and Regular Expressions

Following a Taobao tutorial, I wrote a Python crawler that uses requests and regular expressions to scrape the Maoyan Movies TOP100 board, extracting each film's rank, title, cast, release date, and score, and saving the results to a txt file.

This example covers the most basic task: extracting plain text from a page's HTML source.


Analyzing the Target Site

    • Page elements: the site navigation bar; the main list with each film's rank, title, starring actors, release date, and score; and the pager at the bottom, where each page has a different URL (see the sketch after this list)
    • Page source: locate the HTML fragments that carry each of these fields for every movie entry
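The pager observation gives the URL scheme directly: each page offsets the list by 10 movies, so all 10 page URLs can be generated up front. A minimal sketch:

# Each board page is addressed by an offset of 10 movies:
#   page 1: http://maoyan.com/board/4?offset=0
#   page 2: http://maoyan.com/board/4?offset=10
urls = ['http://maoyan.com/board/4?offset=' + str(i * 10) for i in range(10)]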

Workflow

    • Scraping a single page
      • requests.get() should verify that the fetch succeeded (via the status code) and handle exceptions (a sketch follows the headers snippet below)
      • Q: Maoyan blocks automated crawlers, so requests are sometimes refused

        A: call requests.get(url, headers=MaoYanHeaders) to pass the browser's request headers through the get() parameters

# Request headers copied from the browser
MaoYanHeaders = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "keep-alive",
    "Cookie": "uuid=1A6E888B4A4B29B16FBA1299108DBE9CDFE0F270F2640051092C5B91D4925C7A; _lx_utm=utm_source%3Dbaidu%26utm_medium%3Dorganic; __mta=219052582.1507114997794.1507115797315.1507118482776.11; _lxsdk_s=af27c2388b4347ab08f2353fe7c8%7C%7C4",
    "Host": "maoyan.com",
    "Referer": "http://maoyan.com/board/4?offset=90/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
}

response = requests.get(url, headers=MaoYanHeaders)  # fetch the page source
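The status check and exception handling mentioned above wrap this same call; a minimal sketch, matching the full source below:

from requests.exceptions import RequestException

try:
    response = requests.get(url, headers=MaoYanHeaders)
    if response.status_code == 200:   # 200: page fetched successfully
        html = response.text
    else:
        html = None                   # e.g. 403 when Maoyan refuses the crawler
except RequestException:              # network-level errors (timeout, DNS failure, ...)
    html = None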

  • Extract the movie details with a regular expression
    • Clean up the captured fields and assemble them into a dict (the pattern itself is shown after the snippet below)

# Extract the matched fields and assemble each movie into a dict
for movie in movies_Info:
    yield {
        'index': movie[0],
        'poster': movie[1],
        'name': movie[2],
        'actors': movie[3].strip()[3:],  # drop the "主演:" prefix
        'date': movie[4][5:],            # drop the "上映时间:" prefix
        'score': movie[5] + movie[6]     # integer part + fractional part
    }
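The movies_Info tuples above come from re.findall with the pattern below (also in the full source); the re.S flag lets . match newlines, so one pattern can span an entire <dd> block:

pattern = re.compile(
    '<dd>.*?board-index.*?>(.*?)<.*?data-src="(.*?)".*?name.*?<a.*?title="(.*?)".*?'
    + 'star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>',
    re.S)
movies_Info = re.findall(pattern, urltext)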

  • Save the results to a file, one JSON record per line (a sketch of the writer follows the snippet below)
    • To avoid appending duplicate records, check for a result file left over from an earlier run before writing, and delete it if it exists

import os
if os.path.exists('result.txt'):
    os.remove('result.txt')
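The writer itself appends each dict as one JSON line; ensure_ascii=False keeps the Chinese titles readable in the file:

import json

def write_to_file(content):
    # Append one movie per line; ensure_ascii=False preserves the Chinese text
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')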

  • Call the scraper in a loop to fetch all 10 pages
  • Run the pages in parallel with a process pool (multiprocessing.Pool; see the note after the snippet below)

# Parallel run with a process pool
from multiprocessing import Pool
pool = Pool()
pool.map(main, [i * 10 for i in range(10)])
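One caveat, which the full source below follows: the Pool should be created under the if __name__ == '__main__': guard, because on start methods that re-import the module (e.g. spawn on Windows) the worker processes would otherwise recursively create pools of their own:

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])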

Source Code

import json
import os
import re
import requests
from requests.exceptions import RequestException
from multiprocessing import Pool

# page 1: http://maoyan.com/board/4?offset=0
# page 2: http://maoyan.com/board/4?offset=10

# Fetch the page source
def get_one_page(url):
    # Request headers copied from the browser so Maoyan does not block the crawler
    MaoYanHeaders = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Cookie": "uuid=1A6E888B4A4B29B16FBA1299108DBE9CDFE0F270F2640051092C5B91D4925C7A; _lx_utm=utm_source%3Dbaidu%26utm_medium%3Dorganic; __mta=219052582.1507114997794.1507115797315.1507118482776.11; _lxsdk_s=af27c2388b4347ab08f2353fe7c8%7C%7C4",
        "Host": "maoyan.com",
        "Referer": "http://maoyan.com/board/4?offset=90/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    }

    # Check the result via the status code and catch network errors with RequestException
    try:
        response = requests.get(url, headers=MaoYanHeaders)  # fetch the page source
        if response.status_code == 200:
            return response.text
        else:
            return None
    except RequestException:
        return None

# Extract the movie entries from the page source
def find_movies_Info(urltext):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)<.*?data-src="(.*?)".*?name.*?<a.*?title="(.*?)".*?'
        + 'star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>',
        re.S)

    movies_Info = re.findall(pattern, urltext)  # find every movie block
    # Assemble the matched fields into a dict per movie
    for movie in movies_Info:
        yield {
            'index': movie[0],
            'poster': movie[1],
            'name': movie[2],
            'actors': movie[3].strip()[3:],  # drop the "主演:" prefix
            'date': movie[4][5:],            # drop the "上映时间:" prefix
            'score': movie[5] + movie[6]     # integer part + fractional part
        }

# Append one record to the result file as a JSON line
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    # Fetch the page
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # fetch failed or was blocked
        return

    # Extract the movie info and write each record to the result file
    for movie in find_movies_Info(html):
        print(movie)
        write_to_file(movie)

if __name__ == '__main__':
    if os.path.exists('result.txt'):  # delete any result file left over from a previous run
        os.remove('result.txt')

    # Single-process run
    for i in range(10):
        main(i * 10)

    # Parallel run with a process pool (use instead of the loop above,
    # otherwise the ten pages are scraped and written twice)
    # pool = Pool()
    # pool.map(main, [i * 10 for i in range(10)])

