Lagou Data Analyst Postings: A Python Scraper

I don't talk much and my writing isn't great, so I won't ramble — please head straight to the content below the divider.

Staying up after work to write this was tiring; if you repost it, please credit the source. Thanks.


Project home: Github - maysiuYang

Included files:

Code documentation

Full source code

Data scraped on March 13

ChromeSetup.exe (Chrome browser installer)

chromedriver.exe (the ChromeDriver executable)


Tools: Python 3.6

Chrome browser

chromedriver.exe (place it in the same folder as the code)


Approach: scraping a dynamic website with Python + Selenium + BeautifulSoup

Because the pages render their data dynamically through embedded JavaScript, we use Selenium to drive a real browser session (including the login) and scrape the information from it.

BeautifulSoup is a Python library whose main job is extracting data from web pages. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses the document and hands you the data you want, sparing you from writing tangled regular expressions.
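To make that concrete, here is a minimal, self-contained sketch on a made-up HTML fragment (not part of the scraper) showing the find/select style of navigation used throughout the code below:

from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment purely to illustrate the API
demo_html = '<div class="job"><span class="company_name">SomeCompany</span><a class="position_link" href="/jobs/1.html">Data Analyst</a></div>'
demo_soup = BeautifulSoup(demo_html, 'lxml')

print(demo_soup.find(class_='company_name').text)   # -> SomeCompany
print(demo_soup.select('a')[0].attrs['href'])        # -> /jobs/1.html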

Navigation path to the Data Analyst category on Lagou: National site > Design > User Research > Data Analyst

Scraper workflow:

Open the Lagou homepage --> pause 30 seconds with time.sleep(30) --> manually enter your account/password to log in --> open page 1 of the "Data Analyst" category --> build the pagination URLs --> scrape the data

Lagou homepage:

https://www.lagou.com/

"數據分析師"導航第一頁url:

數據分析師招聘-招聘求職信息-拉勾網


Import the modules we need

# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import os
import pandas as pd

Launch the Selenium-controlled Chrome browser

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome('./chromedriver', chrome_options=chrome_options)
driver.implicitly_wait(10)
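A side note: chrome_options is created here but no options are actually set on it. If you do want to pass startup flags to Chrome, the usual pattern looks like the sketch below (illustrative only; the flag shown is just an example and is not required by this scraper):

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--start-maximized')   # example flag: open the browser maximized
driver = webdriver.Chrome('./chromedriver', chrome_options=chrome_options)
driver.implicitly_wait(10)   # wait up to 10 seconds when locating elements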

Log in to Lagou manually

Lagou only lets you browse listings when logged in, so the first step is completing the login. Fortunately Lagou's login flow is simple: the script pauses for 30 seconds, during which you type in your account and password by hand.

# Open the Lagou homepage and complete the login by hand
def login_in():
    homepage = 'https://www.lagou.com/'
    driver.get(homepage)
    print("Please complete the following within 30 seconds:")
    print("1. Switch to the national site")
    print("2. Click Login and enter your account & password")
    # account =
    # password =
    time.sleep(30)   # pause 30 seconds; the login must be finished within this window

To view a page's source in the browser: right-click --> Inspect

Paginating and collecting job links

Once inside the Data Analyst category, we page through the listings and scrape each job's basic information. The crucial field is the link to each job, which we later open to reach the detail page and collect more information.

The total number of pages is not fixed, so the loop bound cannot simply be hard-coded to 30; instead the script locates the pager element and reads the page count from it.

While paging, we first collect the basic fields: company name / position title / publication time / job link.

The most important field is positionlink; in the next step we open it to reach the detail page and collect the remaining data.

# Page through the listing and collect each job's basic info
def getdata_turnpage():
    url = 'https://www.lagou.com/zhaopin/shujufenxishi/'   # entry URL: National site > Design > User Research > Data Analyst
    driver.get(url)
    time.sleep(2)
    htm = driver.page_source
    htm = re.sub(r'\n|\xa0', '', htm)   # strip newlines and non-breaking spaces
    sou = BeautifulSoup(htm, 'lxml')
    t = sou.find(class_='pager_container').select('a')[-2].text   # read the total page count from the pager
    t = int(t) + 1
    positionlist = []
    for i in range(1, t):   # loop over the pages
        print("Current page: {}".format(str(i)))
        turl = url + str(i) + '/'
        driver.get(turl)
        time.sleep(1)
        html = driver.page_source
        html = re.sub(r'\n|\xa0', '', html)
        soup = BeautifulSoup(html, 'lxml')
        contents = soup.find(class_='s_position_list').find(class_='item_con_list').find_all(class_='con_list_item default_list')
        for content in contents:
            try:
                company = content.find(class_='company').find(class_='company_name').text
                company = re.sub(r'该企业已上传营业执照并通过资质验证审核', '', company)   # drop the "business licence verified" badge text
            except:
                company = ''
            try:
                position = content.find(class_='position').find(class_='p_top').find(class_='position_link').text
                position = re.sub(r'\[.*', '', position)   # drop the bracketed city suffix in the title
            except:
                position = ''
            try:
                pubdate = content.find(class_='position').find(class_='p_top').find(class_='format-time').text
                pubdate = re.sub(r'发布', '', pubdate)   # drop the "published" suffix
            except:
                pubdate = ''
            try:
                positionlink = content.find(class_='position').find(class_='position_link').attrs['href']
            except:
                positionlink = ''
            jsondata = {}
            jsondata['company'] = company
            jsondata['position'] = position
            jsondata['pubdate'] = pubdate
            jsondata['positionlink'] = positionlink
            positionlist.append(jsondata)
    return positionlist

Scrape the detail pages

Through getdata_turnpage() above, we already have each job's link, stored under the key positionlink in the dictionary jsondata. Each of these dictionaries is now passed into getitemdetails(data), which reads data['positionlink'] to open the detail page.

# Scrape one job's detail page; 'data' is a dict produced by getdata_turnpage()
def getitemdetails(data):
    joblabels = salary = salary_min = salary_max = city = experience = education = fulltime = positionlabels = p_label1 = p_label2 = p_label3 = p_label4 = p_label5 = p_label6 = advantage = description = address = field = develop_stage = scale = company_website = ''
    purl = data['positionlink']   # the job URL collected in the previous step
    driver.get(purl)
    time.sleep(1)
    htm2 = driver.page_source
    htm2 = re.sub(r'\n|\xa0', '', htm2)
    soup = BeautifulSoup(htm2, 'lxml')
    try:
        joblabels = soup.find(class_='job_request').p.select('span')
    except Exception as e:
        print(e)
        print('problem with joblabels')
    try:
        salary = joblabels[0].text
    except:
        pass
    try:
        salary_min = re.sub(r'-.*', '', salary)    # keep the part before the dash
        salary_min = re.sub(r' ', '', salary_min)
        salary_min = salary_min.rstrip('k')
        salary_min = salary_min.rstrip('K')
        salary_min = float(salary_min)
    except:
        pass
    try:
        salary_max = re.sub(r'.*-', '', salary)    # keep the part after the dash
        salary_max = re.sub(r' ', '', salary_max)
        salary_max = salary_max.rstrip('k')
        salary_max = salary_max.rstrip('K')
        salary_max = float(salary_max)
    except:
        pass
    try:
        city = joblabels[1].text
        city = re.sub(r'/', '', city)
    except:
        pass
    try:
        experience = joblabels[2].text
        experience = re.sub(r'/', '', experience)
    except:
        pass
    try:
        education = joblabels[3].text
        education = re.sub(r'/', '', education)
    except:
        pass
    try:
        fulltime = joblabels[4].text
    except:
        pass
    try:
        positionlabels = soup.find(class_='job_request').find(class_='position-label').select('li')
    except Exception as e:
        print(e)
        print('problem with positionlabels')
    try:
        p_label1 = positionlabels[0].text
    except:
        pass
    try:
        p_label2 = positionlabels[1].text
    except:
        pass
    try:
        p_label3 = positionlabels[2].text
    except:
        pass
    try:
        p_label4 = positionlabels[3].text
    except:
        pass
    try:
        p_label5 = positionlabels[4].text
    except:
        pass
    try:
        p_label6 = positionlabels[5].text
    except:
        pass
    try:
        advantage = soup.find(class_='job_detail').find(class_='job-advantage').text
        advantage = re.sub(r'职位诱惑:', '', advantage)   # strip the "perks" label
    except:
        pass
    try:
        description = soup.find(class_='job_detail').find(class_='job_bt').div.text
    except:
        pass
    try:
        address = soup.find(class_='job-address').find(class_='work_addr').text
        address = re.sub(r'查看地图', '', address)   # strip the "view map" link text
    except:
        pass
    try:
        for inf in soup.find(class_='content_r').find(class_='job_company').find(class_='c_feature').select('li'):
            if inf.text.find('领域') != -1:          # company field / industry
                field = re.sub(r'领域', '', inf.text)
                field = field.lstrip(' ')
    except:
        pass
    try:
        for inf in soup.find(class_='content_r').find(class_='job_company').find(class_='c_feature').select('li'):
            if inf.text.find('发展阶段') != -1:      # funding / development stage
                develop_stage = re.sub(r'发展阶段', '', inf.text)
                develop_stage = develop_stage.lstrip(' ')
    except:
        pass
    try:
        for inf in soup.find(class_='content_r').find(class_='job_company').find(class_='c_feature').select('li'):
            if inf.text.find('规模') != -1:          # company size
                scale = re.sub(r'规模', '', inf.text)
                scale = scale.lstrip(' ')
    except:
        pass
    try:
        for inf in soup.find(class_='content_r').find(class_='job_company').find(class_='c_feature').select('li'):
            if inf.text.find('公司主页') != -1:      # company website
                company_website = inf.a.attrs['href']
    except:
        pass
    data['salary'] = salary
    data['salary_min(k)'] = salary_min
    data['salary_max(k)'] = salary_max
    data['city'] = city
    data['experience'] = experience
    data['education'] = education
    data['fulltime'] = fulltime
    data['p_label1'] = p_label1
    data['p_label2'] = p_label2
    data['p_label3'] = p_label3
    data['p_label4'] = p_label4
    data['p_label5'] = p_label5
    data['p_label6'] = p_label6
    data['advantage'] = advantage
    data['description'] = description
    data['address'] = address
    data['field'] = field
    data['develop_stage'] = develop_stage
    data['scale'] = scale
    data['company_website'] = company_website
    print(data)
    return data

Export the data

# Export the scraped records to an Excel file
def convert_excel(datas):
    name1 = '拉勾网_数据分析师_全国'
    xls_name = name1 + '(' + str(time.strftime('%Y%m%d', time.localtime(time.time()))) + ')' + '.xlsx'
    exp_file_name = os.path.join('E:/', xls_name)
    writer = pd.ExcelWriter(exp_file_name)
    df = pd.DataFrame(datas)
    df.to_excel(writer, columns=['company', 'position', 'pubdate', 'positionlink', 'salary', 'salary_min(k)', 'salary_max(k)',
                                 'city', 'experience', 'education', 'fulltime', 'p_label1', 'p_label2', 'p_label3', 'p_label4', 'p_label5', 'p_label6',
                                 'advantage', 'description', 'address', 'field', 'develop_stage', 'scale', 'company_website'], index=False)
    writer.save()
    writer.close()
    print("Export finished; the file is in the root of drive E:")

Set up the entry point

if __name__ == '__main__' means: when the module is run directly, the block below executes; when the module is imported, the block does not run.
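A minimal illustration of that behaviour, using a hypothetical file name unrelated to the scraper:

# Suppose this file is saved as demo.py (hypothetical name)
def greet():
    print("hello")

if __name__ == '__main__':
    greet()   # runs on `python demo.py`, but not on `import demo`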

if __name__ == '__main__':
    start_time = time.time()
    login_in()
    datalist = getdata_turnpage()
    enddatats = []
    z = 1
    for jsondata in datalist:
        print("Current record: {}".format(str(z)))
        try:
            enddatats.append(getitemdetails(jsondata))
        except Exception as e:
            print(e)
            print(jsondata)
        z += 1
    driver.close()
    end_time = time.time()
    print('Finished processing; took %.5f seconds' % (end_time - start_time))
    convert_excel(enddatats)

To check that the program is running normally, I print every record as it is scraped, so the collection progress and data quality can be seen at a glance.


When collection finishes, the data is saved straight to drive E: as an Excel file.


This code runs as-is on Python 3.

If you would like to try the scraper, the full code is on my Github and can be run directly; if you are more interested in analysis, feel free to use the data I scraped on 3/13.
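If you take the analysis route, the exported spreadsheet loads straight back into pandas. A minimal sketch (the file name below just follows the pattern convert_excel() uses and is hypothetical; adjust it to whatever was actually produced):

import pandas as pd

df = pd.read_excel('E:/拉勾网_数据分析师_全国(20180313).xlsx')   # hypothetical file name
print(df.shape)                            # rows x columns scraped
print(df['city'].value_counts().head())   # e.g. which cities post the most openings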

Feel free to contact me with any questions, though work has been busy lately, so replies may be slow.

