Translating the rvest package: web scraping with R

01-27

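Because I am very interested in how R can scrape information from web pages, I chose the rvest package documentation for this translation assignment.

Title: rvest

Author: Hadley Wickham

Text: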

rvest helps you scrape information from web pages. It is designed to work with the magrittr pipe (%>%) to make it easy to express common web scraping tasks, and is inspired by libraries like Python's beautiful soup.

library(rvest)  ## load the package
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")  ## URL of the page to scrape

rating <- lego_movie %>%  ## the %>% pipe passes each result to the next call
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
#> [1] 7.8

cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast  ## inspect cast
#> [1] "Will Arnett" "Elizabeth Banks" "Craig Berry"
#> [4] "Alison Brie" "David Burrows" "Anthony Daniels"
#> [7] "Charlie Day" "Amanda Farinos" "Keith Ferguson"
#> [10] "Will Ferrell" "Will Forte" "Dave Franco"
#> [13] "Morgan Freeman" "Todd Hansen" "Jonah Hill"

poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster  ## URL of the poster image
#> [1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"
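To make the role of the pipe concrete, here is a minimal sketch that runs without network access. The inline HTML fragment is a hypothetical stand-in for the IMDb page, and the names piped and nested are illustrative only. It shows that a %>% chain reads left to right but computes the same value as the equivalent nested call.

```r
library(rvest)

# Hypothetical inline fragment standing in for the live page (assumption).
page <- read_html("<p><strong><span>7.8</span></strong></p>")

# Piped form: each step receives the previous result as its first argument.
piped <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# Nested form: the same calls, written inside out.
nested <- as.numeric(html_text(html_nodes(page, "strong span")))

print(piped)
print(identical(piped, nested))
```

The piped form is usually easier to extend, since another step can be appended without rewrapping the whole expression.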

Overview

The most important functions in rvest are:

Create an HTML document from a URL, a file on disk, or a string containing HTML with read_html().

Select parts of a document using CSS selectors: html_nodes(doc, "table td") (or if you're a glutton for punishment, use XPath selectors with html_nodes(doc, xpath = "//table//td")). If you haven't heard of selectorgadget, make sure to read vignette("selectorgadget") to learn about it.

Extract components with html_tag() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).

(You can also use rvest with XML files: parse with xml(), then extract components using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_tag().)

Parse tables into data frames with html_table().

Extract, modify and submit forms with html_form(), set_values() and submit_form().

Detect and repair encoding problems with guess_encoding() and repair_encoding().

Navigate around a website as if you're in a browser with html_session(), jump_to(), follow_link(), back(), forward(), submit_form() and so on. (This is still a work in progress, so I'd love your feedback.)
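The selection and extraction functions listed above can be sketched on a small inline document. The HTML string and all variable names are hypothetical, chosen so the example runs offline. One assumption to note: html_tag() appears in the text above, but this sketch uses html_name(), which is how recent rvest releases expose the tag name.

```r
library(rvest)

# Hypothetical inline page, so the example needs no network access.
page <- read_html('<html><body>
  <h1 id="title" class="main">The Lego Movie</h1>
  <ul class="cast">
    <li><a href="/name/arnett">Will Arnett</a></li>
    <li><a href="/name/banks">Elizabeth Banks</a></li>
  </ul>
</body></html>')

# CSS selector: every link inside the cast list.
links <- html_nodes(page, "ul.cast a")

# The same nodes selected with an XPath expression.
links_xpath <- html_nodes(page, xpath = "//ul[@class='cast']//a")

cast_names  <- html_text(links)                    # text inside each tag
cast_urls   <- html_attr(links, "href")            # one attribute per node
title_attrs <- html_attrs(html_nodes(page, "h1"))  # all attributes, one named vector per node
tag_names   <- html_name(links)                    # tag name of each node

print(cast_names)
print(cast_urls)
print(title_attrs[[1]])
```

Both selectors return the same two nodes here; CSS is normally shorter, while XPath can express conditions that CSS cannot.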

To see examples of these functions in use, check out the demos.
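As a small offline illustration of html_table(), the sketch below parses a hypothetical inline table. The second row is invented sample data, and the variable names are illustrative only. The table node is selected first with html_nodes(), so the result is a list with one data frame per matched table.

```r
library(rvest)

# Hypothetical inline table; "Example Film" and its rating are invented sample data.
page <- read_html("<table>
  <tr><th>title</th><th>rating</th></tr>
  <tr><td>The Lego Movie</td><td>7.8</td></tr>
  <tr><td>Example Film</td><td>6.5</td></tr>
</table>")

# Select the table node(s), then convert each one to a data frame.
tables  <- html_table(html_nodes(page, "table"))
ratings <- tables[[1]]

print(names(ratings))
print(nrow(ratings))
```

Because the first row uses th cells, it is treated as the header, and the rating column is converted to numeric values.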

Installation

Install the release version from CRAN:

install.packages("rvest")

Or the development version from GitHub:

# install.packages("devtools")
devtools::install_github("hadley/rvest")

Inspirations

Python: RoboBrowser, beautiful soup.