A translation of the "dplyr" article by data analysis expert Hadley Wickham

tidyverse/dplyr: github.com/tidyverse/dplyr

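This post translates an article about "dplyr" by Hadley Wickham, a leading figure in data analysis with R. The link to the original is at the top of this post.

First, here is the original text: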

Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation "by group". You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").

dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. Install the dbplyr package then read vignette("databases", package = "dbplyr").

If you are new to dplyr, the best place to start is the data import chapter in R for data science.

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

If you encounter a clear bug, please file a minimal reproducible example on github. For questions and other discussion, please use the manipulatr mailing list.

Usage

library(dplyr)

starwars %>%
  filter(species == "Droid")
#> # A tibble: 5 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 C-3PO 167 75.0 <NA> gold yellow 112 <NA>
#> 2 R2-D2 96 32.0 <NA> white, blue red 33.0 <NA>
#> 3 R5-D4 97 32.0 <NA> white, red red NA <NA>
#> 4 IG-88 200 140 none metal red 15.0 none
#> 5 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>

starwars %>%
  select(name, ends_with("color"))
#> # A tibble: 87 x 4
#> name hair_color skin_color eye_color
#> <chr> <chr> <chr> <chr>
#> 1 Luke Skywalker blond fair blue
#> 2 C-3PO <NA> gold yellow
#> 3 R2-D2 <NA> white, blue red
#> 4 Darth Vader none white yellow
#> 5 Leia Organa brown light brown
#> # ... with 82 more rows

starwars %>%
  mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%
  select(name:mass, bmi)
#> # A tibble: 87 x 4
#> name height mass bmi
#> <chr> <int> <dbl> <dbl>
#> 1 Luke Skywalker 172 77.0 26.0
#> 2 C-3PO 167 75.0 26.9
#> 3 R2-D2 96 32.0 34.7
#> 4 Darth Vader 202 136 33.3
#> 5 Leia Organa 150 49.0 21.8
#> # ... with 82 more rows

starwars %>%
  arrange(desc(mass))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabba … 175 1358 <NA> green-tan,… orange 600 herma…
#> 2 Grievo… 216 159 none brown, whi… green, ye… NA male
#> 3 IG-88 200 140 none metal red 15.0 none
#> 4 Darth … 202 136 none white yellow 41.9 male
#> 5 Tarfful 234 136 brown brown blue NA male
#> # ... with 82 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)
#> # A tibble: 9 x 3
#> species n mass
#> <chr> <int> <dbl>
#> 1 Droid 5 69.8
#> 2 Gungan 3 74.0
#> 3 Human 35 82.8
#> 4 Kaminoan 2 88.0
#> 5 Mirialan 2 53.1
#> # ... with 4 more rows


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

dplyr

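Overview

The dplyr package is a grammar of data manipulation. It provides a consistent set of verbs that help you solve the most common data manipulation challenges:

· mutate() adds new variables that are functions of existing variables.
· select() picks variables based on their names.
· filter() picks cases based on their values.
· summarise() reduces multiple values down to a single summary.
· arrange() changes the ordering of the rows.

All of these combine naturally with group_by(), which lets you perform any operation "by group". You can learn more about them in vignette("dplyr"). In addition to these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").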
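To make the idea of two-table verbs concrete, here is a minimal sketch. The two small tables (`people` and `worlds`) and their values are invented for illustration and do not come from the original article; `left_join()` and `anti_join()` are two of the two-table verbs that dplyr provides.

```r
library(dplyr)

# Hypothetical toy tables, made up for this illustration
people <- tibble(
  name  = c("Luke", "Leia", "Chewbacca"),
  world = c("Tatooine", "Alderaan", "Kashyyyk")
)
worlds <- tibble(
  world   = c("Tatooine", "Alderaan"),
  climate = c("arid", "temperate")
)

# Mutating join: keep every row of `people` and add the matching
# columns from `worlds` (unmatched rows get NA)
joined <- left_join(people, worlds, by = "world")

# Filtering join: keep only the rows of `people` with no match in `worlds`
unmatched <- anti_join(people, worlds, by = "world")

print(joined)
print(unmatched)
```

A mutating join adds columns, while a filtering join only keeps or drops rows, which is why `unmatched` has the same columns as `people`.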

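dplyr is designed to abstract over how the data is stored. This means that, in addition to local data frames, you can also work with remote database tables using exactly the same R code.

Install the dbplyr package, then read:

vignette("databases", package = "dbplyr")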
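As a rough sketch of what "exactly the same R code" can look like against a database, the example below uses an in-memory SQLite database. It assumes the DBI, RSQLite and dbplyr packages are installed; the choice of SQLite and of the built-in `mtcars` data set is an assumption made for illustration and is not part of the original article.

```r
library(dplyr)
library(dbplyr)

# Assumption: DBI and RSQLite are installed; the database lives only in memory
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a local data frame into the database as a table named "mtcars"
copy_to(con, mtcars, "mtcars")

# tbl() creates a lazy reference to the remote table; no rows are pulled yet
cars_db <- tbl(con, "mtcars")

# The same verbs used on local data frames build up a query
by_cyl <- cars_db %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

show_query(by_cyl)          # print the SQL generated for this pipeline
result <- collect(by_cyl)   # run the query and bring the result into R

DBI::dbDisconnect(con)
```

The pipeline is only executed in the database when `collect()` is called (or when the result is printed), so large tables do not need to be loaded into R first.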

If you are new to dplyr, the best place to start is the data import chapter in R for Data Science.

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

If you encounter a clear bug, please file a minimal reproducible example on GitHub. For questions and other discussion, please use the manipulatr mailing list.

Usage

library(dplyr)

starwars %>%
  filter(species == "Droid")
#> # A tibble: 5 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 C-3PO 167 75.0 <NA> gold yellow 112 <NA>
#> 2 R2-D2 96 32.0 <NA> white, blue red 33.0 <NA>
#> 3 R5-D4 97 32.0 <NA> white, red red NA <NA>
#> 4 IG-88 200 140 none metal red 15.0 none
#> 5 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>

starwars %>%
  select(name, ends_with("color"))
#> # A tibble: 87 x 4
#> name hair_color skin_color eye_color
#> <chr> <chr> <chr> <chr>
#> 1 Luke Skywalker blond fair blue
#> 2 C-3PO <NA> gold yellow
#> 3 R2-D2 <NA> white, blue red
#> 4 Darth Vader none white yellow
#> 5 Leia Organa brown light brown
#> # ... with 82 more rows

starwars %>%
  mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%
  select(name:mass, bmi)
#> # A tibble: 87 x 4
#> name height mass bmi
#> <chr> <int> <dbl> <dbl>
#> 1 Luke Skywalker 172 77.0 26.0
#> 2 C-3PO 167 75.0 26.9
#> 3 R2-D2 96 32.0 34.7
#> 4 Darth Vader 202 136 33.3
#> 5 Leia Organa 150 49.0 21.8
#> # ... with 82 more rows

starwars %>%
  arrange(desc(mass))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabba … 175 1358 <NA> green-tan,… orange 600 herma…
#> 2 Grievo… 216 159 none brown, whi… green, ye… NA male
#> 3 IG-88 200 140 none metal red 15.0 none
#> 4 Darth … 202 136 none white yellow 41.9 male
#> 5 Tarfful 234 136 brown brown blue NA male
#> # ... with 82 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)
#> # A tibble: 9 x 3
#> species n mass
#> <chr> <int> <dbl>
#> 1 Droid 5 69.8
#> 2 Gungan 3 74.0
#> 3 Human 35 82.8
#> 4 Kaminoan 2 88.0
#> 5 Mirialan 2 53.1
#> # ... with 4 more rows
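The last example combines group_by() with summarise(), but grouping also works with the other verbs. The sketch below is an illustrative addition rather than part of the original README: it uses mutate() on grouped data, so the mean is computed separately within each species. The column name `mass_ratio` is made up for this example.

```r
library(dplyr)

# Express each character's mass relative to the mean mass of its species.
# mean() is evaluated per group because the data is grouped by species.
rel <- starwars %>%
  group_by(species) %>%
  mutate(mass_ratio = mass / mean(mass, na.rm = TRUE)) %>%
  ungroup() %>%
  select(name, species, mass, mass_ratio)

print(rel)
```

Unlike summarise(), a grouped mutate() keeps one row per original row, and calling ungroup() afterwards avoids carrying the grouping into later steps by accident.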

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
