Simple scraping of Tianyancha data (with code)
1. Routine packet-capture analysis
For example, suppose we want to scrape the basic company information from a Tianyancha page (Tianyancha is a platform for querying enterprise registration, business, and credit information, and for discovering relationships between people and companies).
Capturing the traffic with Firefox's developer tools shows that all of the target data sits in a single JSON response.
Looking at that request, we can imitate a browser and try to fetch the file directly:
```python
import requests

header = {
    "Host": "www.tianyancha.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    "Tyc-From": "normal",
    "CheckError": "check",
    "Connection": "keep-alive",
    "Referer": "http://www.tianyancha.com/company/2310290454",
    "Cache-Control": "max-age=0",
    "Cookie": "_pk_id.1.e431=5379bad64f3da16d.1486514958.5.1486693046.1486691373.; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1486514958,1486622933,1486624041,1486691373; _pk_ref.1.e431=%5B%22%22%2C%22%22%2C1486691373%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D95IaKh1pPrhNKUe5nDCqk7dJI9ANLBzo-1Vjgi6C0VTd9DxNkSEdsM5XaEC4KQPO%26wd%3D%26eqid%3Dfffe7d7e0002e01b00000004589c1177%22%5D; aliyungf_tc=AQAAAJ5EMGl/qA4AKfa/PDGqCmJwn9o7; TYCID=d6e00ec9b9ee485d84f4610c46d5890f; tnet=60.191.246.41; _pk_ses.1.e431=*; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1486693045; token=d29804c0b88842c3bb10c4bc1d48bc80; _utm=55dbdbb204a74224b2b084bfe674a767; RTYCID=ce8562e4e131467d881053bab1a62c3a"
}
r = requests.get("http://www.tianyancha.com/company/2310290454.json", headers=header)
print(r.text)
print(r.status_code)
```
The response comes back as follows:
The status code is 403, so the straightforward approach fails. Consider the alternative below.
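The fallback decision just described can be sketched as a tiny helper. This is illustrative only; `pick_fetcher` is not part of the original script:

```python
def pick_fetcher(status_code):
    # 403 means the server rejected the disguised requests call,
    # so fall back to a real browser engine (selenium + PhantomJS).
    if status_code == 403:
        return "selenium+phantomjs"
    return "requests"
```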
2. Fetching the data with selenium + PhantomJS
First, download PhantomJS and place phantomjs.exe in a directory listed in the system environment variables (I put the file under D:/Anaconda2/).
Then add a user-agent to PhantomJS (in my tests, without a user agent the scraped content comes back garbled):
```python
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
)
driver = webdriver.PhantomJS(executable_path="D:/Anaconda2/phantomjs.exe", desired_capabilities=dcap)
```
Get the rendered page source:
```python
import time

driver.get("http://www.tianyancha.com/company/2310290454")
# Wait 5 seconds; adjust to however long the dynamic page takes to load
time.sleep(5)
# Grab the rendered page content
content = driver.page_source.encode("utf-8")
driver.close()
print(content)
```
Checking against the live page, the scraped source is correct. Next we parse it to extract the fields we want.
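Before wiring CSS selectors into a full script, it helps to try them on a small offline HTML fragment. The markup below is a made-up stand-in for Tianyancha's page structure, just to show how `soup.select` pulls out individual fields:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the class names used on the real page
html = """
<div class="company_info_text"><p class="ng-binding">
  Example Co., Ltd.
</p></div>
<div class="td-legalPersonName-value"><p><a>Zhang San</a></p></div>
"""
soup = BeautifulSoup(html, "html.parser")
# Same selector and whitespace-stripping pattern as in the full script below
company = soup.select("div.company_info_text > p.ng-binding")[0].text.replace("\n", "").replace(" ", "")
fddbr = soup.select(".td-legalPersonName-value > p > a")[0].text
print(company)  # -> ExampleCo.,Ltd.
print(fddbr)    # -> Zhang San
```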
Here is a quick script that pulls the basic information:
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def driver_open():
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    driver = webdriver.PhantomJS(executable_path="D:/Anaconda2/phantomjs.exe", desired_capabilities=dcap)
    return driver

def get_content(driver, url):
    driver.get(url)
    # Wait 5 seconds; adjust to however long the dynamic page takes to load
    time.sleep(5)
    # Grab the rendered page content
    content = driver.page_source.encode("utf-8")
    driver.close()
    soup = BeautifulSoup(content, "lxml")
    return soup

def get_basic_info(soup):
    company = soup.select("div.company_info_text > p.ng-binding")[0].text.replace("\n", "").replace(" ", "")
    fddbr = soup.select(".td-legalPersonName-value > p > a")[0].text
    zczb = soup.select(".td-regCapital-value > p")[0].text
    zt = soup.select(".td-regStatus-value > p")[0].text.replace("\n", "").replace(" ", "")
    zcrq = soup.select(".td-regTime-value > p")[0].text
    basics = soup.select(".basic-td > .c8 > .ng-binding")
    hy = basics[0].text
    qyzch = basics[1].text
    qylx = basics[2].text
    zzjgdm = basics[3].text
    yyqx = basics[4].text
    djjg = basics[5].text
    hzrq = basics[6].text
    tyshxydm = basics[7].text
    zcdz = basics[8].text
    jyfw = basics[9].text
    print u"Company name: " + company
    print u"Legal representative: " + fddbr
    print u"Registered capital: " + zczb
    print u"Company status: " + zt
    print u"Registration date: " + zcrq
    print u"Industry: " + hy
    print u"Business registration number: " + qyzch
    print u"Enterprise type: " + qylx
    print u"Organization code: " + zzjgdm
    print u"Business term: " + yyqx
    print u"Registration authority: " + djjg
    print u"Approval date: " + hzrq
    print u"Unified social credit code: " + tyshxydm
    print u"Registered address: " + zcdz
    print u"Business scope: " + jyfw

def get_gg_info(soup):
    # Senior executives
    ggpersons = soup.find_all(attrs={"event-name": "company-detail-staff"})
    ggnames = soup.select("table.staff-table > tbody > tr > td.ng-scope > span.ng-binding")
    for i in range(len(ggpersons)):
        ggperson = ggpersons[i].text
        ggname = ggnames[i].text
        print (ggperson + " " + ggname)

def get_gd_info(soup):
    # Shareholders
    tzfs = soup.find_all(attrs={"event-name": "company-detail-investment"})
    for i in range(len(tzfs)):
        tzf_split = tzfs[i].text.replace("\n", "").split()
        tzf = "".join(tzf_split)
        print tzf

def get_tz_info(soup):
    # Outbound investments
    btzs = soup.select("a.query_name")
    for i in range(len(btzs)):
        btz_name = btzs[i].select("span")[0].text
        print btz_name

if __name__ == "__main__":
    url = "http://www.tianyancha.com/company/2310290454"
    driver = driver_open()
    soup = get_content(driver, url)
    print "---- Basic info ----"
    get_basic_info(soup)
    print "---- Executives ----"
    get_gg_info(soup)
    print "---- Shareholders ----"
    get_gd_info(soup)
    print "---- Outbound investments ----"
    get_tz_info(soup)
```
This is only a single-page demo; turning it into a complete crawler project is something to refine later.
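As a rough sketch of that extension, one could build company URLs from a list of ids and run one fetch per page. The ids and the injected `fetch` callable below are placeholders, not part of the original script; injecting `fetch` keeps the loop testable without a browser:

```python
import time

BASE = "http://www.tianyancha.com/company/"

def company_url(company_id):
    # Tianyancha profile pages are keyed by a numeric company id
    return BASE + str(company_id)

def crawl_all(company_ids, fetch, delay=5):
    # fetch would wrap e.g. get_content(driver, url) in the real crawler
    results = {}
    for cid in company_ids:
        results[cid] = fetch(company_url(cid))
        time.sleep(delay)  # be polite between page loads
    return results
```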
--------------------------------------------------------------------
Author: 簡單的Happy
Source: 簡單的happy