A Brief Look at Selenium and PhantomJS for Data Collection with Python

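I had always thought of Selenium and PhantomJS, tools that come from the ops world, as heavy machinery: a big gun you do not pull out unless you have no other choice.

But as soon as JavaScript and Ajax rendering are involved, Requests is completely lost!

Coming back to these tools recently, I found that the heavy machinery is actually much handier than I remembered.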

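1. Installing Selenium and PhantomJS

Selenium can be installed directly with pip (pip install selenium). PhantomJS, on the other hand, is a standalone executable (an .exe on Windows) that you download and unzip. When you use it, pass the absolute path of that executable to Selenium.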
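Since a wrong path is the most common reason PhantomJS fails to start, it can help to resolve and check the path before handing it to Selenium. The following is a minimal sketch using only the standard library; the function name find_phantomjs and its arguments are made up for illustration.

```python
import os
import shutil


def find_phantomjs(explicit_path=None, name="phantomjs"):
    """Return an absolute path to the PhantomJS executable, or None if not found.

    explicit_path: where the download was unpacked (hypothetical argument).
    name: executable name to look up on PATH when no explicit path is given.
    """
    if explicit_path:
        candidate = os.path.abspath(os.path.expanduser(explicit_path))
        # Only accept the path if a file really exists there.
        return candidate if os.path.isfile(candidate) else None
    # shutil.which searches the directories listed in PATH.
    found = shutil.which(name)
    return os.path.abspath(found) if found else None


print(find_phantomjs("./phantomjs.exe"))
```

The returned value, when it is not None, is what you would pass as executable_path to webdriver.PhantomJS.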

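2. Basic Selenium and PhantomJS Setup

import time
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Copy the preset so the shared DesiredCapabilities.PHANTOMJS dict is not modified
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/4.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"
# A different User-Agent gets a different responsive layout. What a trap!
driver = webdriver.PhantomJS(desired_capabilities=dcap)
# Set timeouts; there is no need to wait for a page to load completely before scraping it
driver.set_page_load_timeout(10)
driver.set_script_timeout(10)

# Fetch the page source (inurl is the URL to scrape, defined elsewhere)
try:
    driver.get(inurl)
    content = driver.page_source
    # print(content)
    time.sleep(1)
except TimeoutException:
    # Loading took too long: stop it and keep whatever has rendered so far
    driver.execute_script("window.stop()")
    content = driver.page_source
driver.close()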
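The complete examples in section 5 wrap the preset in dict(DesiredCapabilities.PHANTOMJS) before setting the User-Agent. A small sketch of why the copy matters, using a stand-in class (FakeCapabilities is hypothetical, not part of Selenium): the preset is a dictionary shared at class level, so editing it in place changes it for everything else in the same process.

```python
class FakeCapabilities:
    """Hypothetical stand-in for a class that exposes shared preset dictionaries."""
    PHANTOMJS = {"browserName": "phantomjs", "javascriptEnabled": True}


UA_KEY = "phantomjs.page.settings.userAgent"

# Editing a copy leaves the shared preset untouched.
safe = dict(FakeCapabilities.PHANTOMJS)
safe[UA_KEY] = "Mozilla/5.0 (example)"
preset_after_copy = dict(FakeCapabilities.PHANTOMJS)

# Editing the preset itself changes it for every later user in the process.
shared = FakeCapabilities.PHANTOMJS
shared[UA_KEY] = "Mozilla/5.0 (example)"

print(UA_KEY in preset_after_copy)           # False
print(UA_KEY in FakeCapabilities.PHANTOMJS)  # True
```

In a scraper that creates several drivers with different User-Agents, the copy keeps each driver's settings independent.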

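3. Basic Selenium and PhantomJS Operations

If your network and machine are fast enough, you rarely need to wait for the page to render.

Otherwise you do have to wait, and time.sleep() is a rather clumsy way to do it. An explicit wait is better: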

# Wait for the page to finish rendering
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
...
try:
    # Wait until a specific element shows up: do not release the hawk until the prey is in sight
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    # Release the hawk
    print(driver.find_element_by_id("content").text)
    driver.close()
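To see why an explicit wait beats a fixed time.sleep(), here is a rough sketch of the polling idea behind WebDriverWait. It is not Selenium's implementation; wait_until and the simulated element_is_present check are invented for illustration. The point is that the loop returns as soon as the condition holds instead of always sleeping for the full duration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Simulated page: the "element" shows up on the third check.
checks = {"count": 0}


def element_is_present():
    checks["count"] += 1
    return "loadedButton" if checks["count"] >= 3 else None


print(wait_until(element_is_present, timeout=2, poll_interval=0.01))
```

With a real driver, the condition would be something like a lambda that looks for the element and returns it once it exists.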

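Alternatively, detect navigation through a stale element reference:

# elem was obtained before the redirect, e.g. elem = driver.find_element_by_tag_name("html")
try:
    # Any command sent through the old reference raises StaleElementReferenceException
    # once that element is gone, which means the page has navigated away.
    elem.is_enabled()
except StaleElementReferenceException:
    return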

Other built-in driver methods can be found by reading the source code or through PyCharm's autocompletion.

4. Locating HTML Tags with XPath

1. By id: find_element_by_id(self, id_)
2. By name: find_element_by_name(self, name)
3. By class: find_element_by_class_name(self, name)
4. By tag: find_element_by_tag_name(self, name)
5. By link text: find_element_by_link_text(self, link_text)
6. By partial link text: find_element_by_partial_link_text(self, link_text)
7. By XPath: find_element_by_xpath(self, xpath)
8. By CSS selector: find_element_by_css_selector(self, css_selector)
9. By id (multiple): find_elements_by_id(self, id_)
10. By name (multiple): find_elements_by_name(self, name)
11. By class (multiple): find_elements_by_class_name(self, name)
12. By tag (multiple): find_elements_by_tag_name(self, name)
13. By link text (multiple): find_elements_by_link_text(self, text)
14. By partial link text (multiple): find_elements_by_partial_link_text(self, link_text)
15. By XPath (multiple): find_elements_by_xpath(self, xpath)
16. By CSS selector (multiple): find_elements_by_css_selector(self, css_selector)
17. Generic: find_element(self, by=By.ID, value=None)
18. Generic (multiple): find_elements(self, by=By.ID, value=None)
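Newer Selenium releases (4.3 and later) dropped the find_element_by_* helpers, leaving only the two generic forms at the end of the list, so it is worth knowing how the old names map onto them. The sketch below is an illustration only: locate and RecordingDriver are hypothetical names, and the fake driver just echoes its arguments so the snippet runs without a browser. The strategy strings are the values behind the By constants (By.ID, By.XPATH and so on).

```python
# Strategy strings; these are the values behind By.ID, By.XPATH, and so on.
STRATEGIES = {
    "id": "id",
    "name": "name",
    "class_name": "class name",
    "tag_name": "tag name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "xpath": "xpath",
    "css_selector": "css selector",
}


def locate(driver, how, value, many=False):
    """Dispatch to driver.find_element / driver.find_elements.

    `how` is the suffix of the old helper name, e.g. "id" for find_element_by_id.
    """
    by = STRATEGIES[how]
    finder = driver.find_elements if many else driver.find_element
    return finder(by, value)


class RecordingDriver:
    """Hypothetical stand-in for a WebDriver; it only echoes its arguments."""

    def find_element(self, by, value):
        return (by, value)

    def find_elements(self, by, value):
        return [(by, value)]


driver = RecordingDriver()
print(locate(driver, "id", "content"))
print(locate(driver, "xpath", "//a[@id='zh-load-more']", many=True))
```

With a real driver, locate(driver, "id", "content") would do the same job as the old driver.find_element_by_id("content").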

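The element methods are singular and return a single element directly. The elements methods are plural, as anyone who has studied English will recognize: they locate a group of elements and return a list. Think of findall in the re module.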
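The comparison with re can be made concrete with a few lines of standard-library code. The HTML string below is made up for the example. One difference to keep in mind: when nothing matches, re.search returns None, whereas Selenium's find_element raises NoSuchElementException; find_elements, like findall, simply returns an empty list.

```python
import re

html = '<a href="/p/1">one</a> <a href="/p/2">two</a>'
link = re.compile(r'href="([^"]+)"')

first = link.search(html)    # singular: one match object, or None
every = link.findall(html)   # plural: always a list, possibly empty

print(first.group(1))                    # /p/1
print(every)                             # ['/p/1', '/p/2']
print(link.findall("<p>no links</p>"))   # []
```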

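If an element still cannot be located or clicked, the last resort is JavaScript; see the post "Selenium2+python自動化46-js解決click失效問題" by 上海-悠悠 on 博客園 (cnblogs). That author has since published a book, which is inexpensive and worth considering. If you want to practice this further, I also recommend the cnblogs blogger 蟲師. He works in ops, I first learned Selenium from his PDF tutorial, and he has published a book as well. With an ID like that, you can tell he is no ordinary expert, haha...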
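The JavaScript approach usually boils down to asking the page itself to click the element through execute_script. A minimal sketch follows; js_click and ScriptRecorder are hypothetical names, and the recorder stands in for a real driver so the snippet runs without a browser.

```python
def js_click(driver, element):
    """Click `element` by running JavaScript in the page instead of simulating a mouse click."""
    # arguments[0] is the first extra argument passed to execute_script.
    return driver.execute_script("arguments[0].click();", element)


class ScriptRecorder:
    """Hypothetical stand-in for a WebDriver that records scripts instead of running them."""

    def __init__(self):
        self.calls = []

    def execute_script(self, script, *args):
        self.calls.append((script, args))


recorder = ScriptRecorder()
js_click(recorder, "fake-element")
print(recorder.calls)
```

With a real driver you would pass the WebElement returned by one of the locator methods, typically after a normal element.click() has failed.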

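5. Complete Examples

These examples follow the standard, by-the-book routine. In practice you can simplify them as needed and combine them with the XPath locators above.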

# Fetch an Ajax page with a custom User-Agent
from selenium import webdriver
import time
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

# Set the PhantomJS User-Agent
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")

cap_dict = driver.desired_capabilities  # Inspect all available desired_capabilities entries
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()

# Wait for the page to finish rendering
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

# Handle a JavaScript redirect
from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    # Keep a reference to the <html> element of the page we start on
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # Raises StaleElementReferenceException once the original page is gone
            elem.is_enabled()
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)

# Drag and drop
from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path="phantomjs/bin/phantomjs")
driver.get("http://pythonscraping.com/pages/javascript/draggableDemo.html")

print(driver.find_element_by_id("message").text)

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

print(driver.find_element_by_id("message").text)

# Take a screenshot
driver.get_screenshot_as_file("tmp/pythonscraping.png")

# Log in to Zhihu, then automatically click "More" at the bottom of the page to load more content
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
import time
import sys

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.get("http://www.zhihu.com/#signin")
#driver.find_element_by_name("email").send_keys("your email")
driver.find_element_by_xpath('//input[@name="password"]').send_keys("your password")
#driver.find_element_by_xpath('//input[@name="password"]').send_keys(Keys.RETURN)
time.sleep(2)
driver.get_screenshot_as_file("show.png")
#driver.find_element_by_xpath('//button[@class="sign-button"]').click()
driver.find_element_by_xpath('//form[@class="zu-side-login-box"]').submit()

try:
    # Wait for the page to finish loading
    dr = WebDriverWait(driver, 5)
    dr.until(lambda the_driver: the_driver.find_element_by_xpath('//a[@class="zu-top-nav-userinfo "]').is_displayed())
except Exception:
    print("Login failed")
    sys.exit(0)
driver.get_screenshot_as_file("show.png")
#user = driver.find_element_by_class_name("zu-top-nav-userinfo ")
#webdriver.ActionChains(driver).move_to_element(user).perform()  # Move the mouse over my user name
loadmore = driver.find_element_by_xpath('//a[@id="zh-load-more"]')
actions = ActionChains(driver)
actions.move_to_element(loadmore)
actions.click(loadmore)
actions.perform()
time.sleep(2)
driver.get_screenshot_as_file("show.png")
print(driver.current_url)
print(driver.page_source)
driver.quit()

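More hands-on techniques for this topic will be covered in the worked examples.

A glue language is vast and deep,

and I can only show newcomers a thing or two along the way.

Veterans can head back to the column: Python中文社區

Newcomers can browse the index of past posts:

Python數據分析及可視化實例目錄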

