Python Web Scraping (7) - Deep Crawling with CrawlSpider

Contents:

  • Python Web Scraping (1) - Getting Started
  • Python Web Scraping (2) - urllib Crawler Examples
  • Python Web Scraping (3) - Advanced Crawling
  • Python Web Scraping (4) - XPath
  • Python Web Scraping (5) - Requests and Beautiful Soup
  • Python Web Scraping (6) - The Scrapy Framework
  • Python Web Scraping (7) - Deep Crawling with CrawlSpider
  • Python Web Scraping (8) - Building a Simple Translator with the Youdao Dictionary

Before getting into deep crawling, here is a simple, practical library worth recommending: fake-useragent. It generates spoofed User-Agent values for your request headers.

# Installation
pip install fake-useragent

# Usage
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}
url = 'URL of the page to crawl'  # replace with the target page
resp = requests.get(url, headers=headers)

1. Deep crawling with CrawlSpider

scrapy.spiders.CrawlSpider

Create a project:   scrapy startproject <project_name>

Create a spider:    scrapy genspider -t crawl <spider_name> <domains>

Core rule handling:    from scrapy.spiders import CrawlSpider, Rule
Core link extraction:  from scrapy.linkextractors import LinkExtractor
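For reference, the crawl template produces a spider skeleton roughly like the following (the exact contents vary across Scrapy versions, and the commented XPath lines are just placeholders). The rules attribute and parse_item() callback mentioned below come from this skeleton:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]').extract()
        # item['description'] = response.xpath('//div[@id="description"]').extract()
        return item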

  • rules: a list of Rule objects, each built around regular expressions, that tells the spider which links to follow.
  • Each Rule also takes a callback used to parse the downloaded responses; the parse_item() method in the generated template above shows how to pull data out of a response.
  • You can fetch a page and experiment interactively with the shell: scrapy shell http://baidu.com (see the sketch below).
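A minimal sketch of what such a shell session looks like (the URL and XPath below are only placeholders):

# In a terminal:  scrapy shell "http://baidu.com"
# The shell drops you into a Python prompt with the downloaded page bound to `response`:

response.status                                    # HTTP status of the fetched page
response.xpath('//title/text()').extract_first()   # e.g. grab the page title
fetch('http://baidu.com/s?wd=python')              # fetch a different URL in the same session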

2. Link extraction: LinkExtractor

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(
    allow = (),             # only links matching these regexes are extracted
    deny = (),              # links matching these regexes are never extracted
    allow_domains = (),     # only extract links from these domains
    deny_domains = (),      # never extract links from these domains
    deny_extensions = (),
    restrict_xpaths = (),   # only extract links inside these XPath regions; works together with allow
    tags = (),              # tags to extract links from
    attrs = (),             # attributes to extract links from
    canonicalize = (),
    unique = True,          # drop duplicate links
    process_value = None
)
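SgmlLinkExtractor is the spelling used by older Scrapy releases; newer versions expose the same functionality as scrapy.linkextractors.LinkExtractor with largely the same parameters. A small sketch of using one on its own, assuming you are inside a scrapy shell session so a response object is available (the allow pattern is only an illustration):

from scrapy.linkextractors import LinkExtractor

# keep only pagination links whose URL matches curPage=<number>
page_link = LinkExtractor(allow=(r'curPage=\d+',))

# extract_links() returns Link objects with .url and .text attributes
for link in page_link.extract_links(response):
    print(link.url, link.text)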

3. Crawling rules: rules

rules = [
    Rule(
        link_extractor,        # a LinkExtractor object
        callback=None,         # callback invoked with each downloaded response; do not name it parse
        cb_kwargs=None,        # extra keyword arguments passed to the callback
        follow=None,           # boolean: whether to follow links extracted from the response
        process_links=None,    # filters the list of links from the LinkExtractor; called for every extracted list
        process_request=None   # filters requests; called for every extracted request
    )
]
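As an illustration of process_links, here is a hypothetical filter that drops links outside a given URL path before they are scheduled (the '/zhaopin/' check and the regex are assumptions made up for this example):

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


def filter_links(links):
    # receives the list of Link objects produced by the LinkExtractor and
    # returns the (possibly reduced) list that should actually be followed
    return [link for link in links if '/zhaopin/' in link.url]


rules = (
    Rule(
        LinkExtractor(allow=(r'curPage=\d+',)),
        callback='parse_content',     # callback given as a string: the method name on the spider
        follow=True,
        process_links=filter_links,
    ),
)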

4. Running a spider directly from PyCharm

1. Create a start.py file in the project root

# -*- coding:utf-8 -*-
from scrapy import cmdline  # import the command-line helper

cmdline.execute('scrapy crawl dang'.split())
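An alternative sketch that skips cmdline entirely is to drive the crawl through Scrapy's crawler API; 'dang' is the same spider name used above:

# start.py -- run the spider through CrawlerProcess instead of cmdline
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('dang')                             # spider name, equivalent to `scrapy crawl dang`
process.start()                                   # blocks until the crawl finishes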

2. Set up a run configuration in PyCharm:

Click Edit Configurations.

Add a Python run configuration that points at start.py.

Once the configuration is complete, click OK.

Click Run.

After all that configuration, it turns out you can simply run start.py directly; none of the extra setup is actually needed.


5. Using CrawlSpider to scrape Python job listings from Liepin

  • Create the project

scrapy startproject liep

  • Generate the spider file under spiders/ automatically

scrapy genspider lp liepin.com

  • items.py

# -*- coding: utf-8 -*-

import scrapy


class LiepItem(scrapy.Item):

    name = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()
    address = scrapy.Field()
    # application feedback time
    experience = scrapy.Field()

  • pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class LiepPipeline(object):
    def __init__(self):
        self.file = open('liepin.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(text)
        print('QAQ ----> writing item')
        return item

    def close_spider(self, spider):
        self.file.close()

  • lp.py

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from liep.items import LiepItem
import re


class LpSpider(CrawlSpider):
    reg = re.compile(r'\s*')  # used to strip whitespace from the job title
    name = 'lp'
    allowed_domains = ['www.liepin.com']
    start_urls = ['https://www.liepin.com/zhaopin/?pubTime=&ckid=6f6956c5d999c17e&fromSearchBtn=2&compkind=&isAnalysis=&init=-1&searchType=1&dqs=020&industryType=&jobKind=&sortFlag=15&degradeFlag=0&industries=040&salary=0%240&compscale=&key=python&clean_condition=&headckid=7a006343bdb04f47&curPage=0']

    # rule describing which links (pagination) to extract and follow
    page_link = LinkExtractor(allow=(r'&curPage=\d+',))
    # crawling rules
    rules = (
        Rule(page_link, callback='parse_content', follow=True),
    )

    # parsing callback
    def parse_content(self, response):
        # grab the region of the page that holds the job data
        job_list = response.xpath('//div[@class="job-info"]')
        for job in job_list:
            # one Item per job posting
            item = LiepItem()
            name = job.xpath('.//h3/a')
            item['name'] = self.reg.sub('', name.xpath('string(.)').extract()[0])
            item['company'] = job.xpath('..//p[@class="company-name"]/a/text()').extract()
            item['salary'] = job.xpath('.//span[@class="text-warning"]/text()').extract()
            item['address'] = job.xpath('.//p[@class="condition clearfix"]//a/text()').extract()
            item['experience'] = job.xpath('.//p[@class="condition clearfix"]//span[3]/text()').extract()

            yield item
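Once the pipeline is enabled in settings.py (next step), the spider can be started from the project directory, either with the start.py approach from section 4 or directly on the command line:

scrapy crawl lp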

  • settings.py

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36',
}

# uncomment ITEM_PIPELINES to enable the pipeline
ITEM_PIPELINES = {
    'liep.pipelines.LiepPipeline': 300,
}

  • The scraped results: liepin.json

{
    "salary": "12-24萬",
    "company": "嗨皮(上海)網路科技股份有限公司",
    "name": "python開發工程師",
    "experience": "3年工作經驗",
    "address": "上海"
}{
    "salary": "14-28萬",
    "company": "第一彈",
    "name": "Python後端開發",
    "experience": "3年工作經驗",
    "address": "上海"
}{
    "salary": "12-18萬",
    "company": "易路軟體",
    "name": "Python中級開發工程師",
    "experience": "3年工作經驗",
    "address": "上海-閔行區"
}{
    "salary": "11-21萬",
    "company": "信用飛/首付游",
    "name": "Python開發工程師(風控方向)",
    "experience": "1年工作經驗",
    "address": "上海-徐匯區"
}{
    "salary": "13-24萬",
    "company": "聯車科技",
    "name": "python開發",
    "experience": "3年工作經驗",
    "address": "上海"
}{
    "salary": "12-24萬",
    "company": "尋仟信息",
    "name": "Python開發工程師",
    "experience": "1年工作經驗",
    "address": "上海"
}{
    "salary": "12-22萬",
    "company": "ifuwo",
    "name": "Python開發工程師",
    "experience": "1年工作經驗",
    "address": "上海-浦東新區"
}{
    "salary": "12-24萬",
    "company": "小葫蘆",
    "name": "python開發工程師",
    "experience": "1年工作經驗",
    "address": "上海"
}{
    "salary": "14-24萬",
    "company": "ifuwo",
    "name": "python後台工程師",
    "experience": "2年工作經驗",
    "address": "上海-浦東新區"
}{
    "salary": "面議",
    "company": "森浦資訊",
    "name": "Python開發工程師",
    "experience": "2年工作經驗",
    "address": "上海"
}{
    "salary": "14-24萬",
    "company": "優刻得",
    "name": "OPL-python運維開發",
    "experience": "2年工作經驗",
    "address": "上海"
}{
    "salary": "面議",
    "company": "上海聰牛金融信息服務有限公司",
    "name": "python開發工程師",
    "experience": "2年工作經驗",
    "address": "上海"
}{
    "salary": "12-30萬",
    "company": "進馨網路",
    "name": "python開發工程師",
    "experience": "3年工作經驗",
    "address": "上海"
}{
    "salary": "12-18萬",
    "company": "載信軟體",
    "name": "Python工程師",
    "experience": "1年工作經驗",
    "address": "上海"
}{
    "salary": "14-24萬",
    "company": "優刻得",
    "name": "OPL-python運維開發J10605",
    "experience": "1年工作經驗",
    "address": "上海"
}{
    "salary": "10-24萬",
    "company": "上海霄騁信息科技有限公司",
    "name": "Python爬蟲開發工程師",
    "experience": "2年工作經驗",
    "address": "上海"
}{
    "salary": "面議",
    "company": "五五海淘",
    "name": "Python",
    "experience": "1年工作經驗",
    "address": "上海"
}
.................
.................

6. Setting request headers and proxies with middleware

See the Scrapy API documentation for its description of downloader middleware.

  • settings.py

# -*- coding: utf-8 -*-


BOT_NAME = 'tea'

SPIDER_MODULES = ['tea.spiders']
NEWSPIDER_MODULE = 'tea.spiders'

# log file: everything the program reports is saved to this file
LOG_FILE = 's.log'
# log level: DEBUG is the most verbose and records everything -- then INFO, WARNING...
# full log <DEBUG> -> summary <INFO> -> warnings <WARNING> -> errors <ERROR> ...
LOG_LEVEL = 'INFO'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tea (+http://www.yourdomain.com)'
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
    "Opera/8.0 (Windows NT 5.1; U; en)",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) ",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E) ",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) "
]


# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'tea.middlewares.MyCustomDownloaderMiddleware': 543,
    'tea.middlewares.UseragentMiddleware': 543,
    'tea.middlewares.ProxyMiddleware': 600,
}

PROXY = [
    {"ip_port": "178.62.47.236:80"},
    {"ip_port": "125.77.25.116:80"},
    {"ip_port": "13.58.249.76:8080"},
    {"ip_port": "37.204.253.2:8081"},
    {"ip_port": "78.47.174.243:3128"},
    {"ip_port": "139.59.235.243:3128", "user_password": "admin:123123"}
]

  • middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

import random
import base64

from tea.settings import USER_AGENTS, PROXY


# A custom downloader middleware -- it only takes effect once it is enabled in settings.py
class UseragentMiddleware(object):
    # process_request handles every outgoing request; it receives the request object and the spider.
    # It must return None or a Request: None means processing is done and the request is passed on
    # to the following middlewares; a Request is handed back to the engine and re-scheduled.
    def process_request(self, request, spider):
        print('----QAQ-----')
        # pick a random User-Agent
        useragent = random.choice(USER_AGENTS)
        # attach it to the request headers
        request.headers.setdefault('User-Agent', useragent)
        print('---->headers successful')
        return None


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        print('------->-_-')
        proxy = random.choice(PROXY)
        # attach the proxy to the request; Scrapy expects a full proxy URL including the scheme
        print(proxy['ip_port'], proxy.get('user_password', None))
        request.meta['proxy'] = 'http://' + proxy.get('ip_port')

        # proxy authentication, if credentials are configured
        if proxy.get('user_password', None):
            b64 = base64.b64encode(proxy.get('user_password').encode('utf-8')).decode('utf-8')
            print(b64)
            request.headers['Proxy-Authorization'] = 'Basic ' + b64
            print('======proxy======')

In the log output you can see that the spoofed request headers and the proxy IP have been applied.

Author: _知幾, columnist for the Python愛好者社區 (Python Enthusiasts Community). Please do not repost without permission.

Jianshu homepage: jianshu.com/u/9dad6621d

Blog column: _知幾的博客專欄

Companion video course: Python3爬蟲三大案例實戰分享 (three hands-on Python 3 scraping case studies: Maoyan Movies, Toutiao street photos, Taobao food)

WeChat public account: Python愛好者社區 (WeChat ID: python_shequ). Follow it for more instalments in this series.
