Translation: Data Cleaning with R

01-27

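On the recommendation of Teacher Houzi (猴子老師), I visited the website of the author of R's dplyr package, and it felt like discovering a new continent: R for Data Science
This post translates an article on data cleaning in R; the English original is the tidy-data chapter of R for Data Science
To keep the post from growing so long that I would never reread it, I trimmed it moderately; the outline is unchanged
A hands-on article for the tuberculosis case study at the end of this post: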

https://zhuanlan.zhihu.com/p/30831984?utm_source=qq&utm_medium=social

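1. Introduction

This chapter introduces tidy data and the accompanying tools in the tidyr package. For more on the underlying theory, see this paper: http://www.jstatsoft.org/v59/i10/paper (PDF, in English)

1.1. Installing the packages

The tidyr package provides tools to help tidy messy datasets. tidyr is one component of the tidyverse. The original article loads it as follows: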

library(tidyverse)

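The original author's approach did not work for me (library() only loads a package that is already installed), so I used the following instead: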

install.packages("tidyr")
library(tidyr)

In addition, to make sure the %>% pipe operator works, I install and load the magrittr package:

install.packages("magrittr")
library(magrittr)
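As a minimal sketch of what the pipe does (the numbers are made up for illustration), x %>% f(y) is evaluated as f(x, y), so a nested call can be rewritten left to right:

```r
library(magrittr)

# The left-hand side of %>% becomes the first argument of the call on the right.
nested <- sum(sqrt(c(1, 4, 9)))
piped  <- c(1, 4, 9) %>% sqrt() %>% sum()

# Both forms compute the same value.
identical(nested, piped)
```

The piped form reads in the order the operations happen, which is why the rest of this post chains tidyr calls with %>%.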

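2. Tidy data

You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables (country, year, population, and cases), but each is organised differently.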

table1
#> # A tibble: 6 × 4
#>       country  year  cases population
#>         <chr> <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583
table2
#> # A tibble: 12 × 4
#>       country  year       type     count
#>         <chr> <int>      <chr>     <int>
#> 1 Afghanistan  1999      cases       745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000      cases      2666
#> 4 Afghanistan  2000 population  20595360
#> 5      Brazil  1999      cases     37737
#> 6      Brazil  1999 population 172006362
#> # ... with 6 more rows
table3
#> # A tibble: 6 × 3
#>       country  year              rate
#> *       <chr> <int>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

# Spread across two tibbles
table4a  # cases
#> # A tibble: 3 × 3
#>       country `1999` `2000`
#> *       <chr>  <int>  <int>
#> 1 Afghanistan    745   2666
#> 2      Brazil  37737  80488
#> 3       China 212258 213766
table4b  # population
#> # A tibble: 3 × 3
#>       country     `1999`     `2000`
#> *       <chr>      <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2      Brazil  172006362  174504898
#> 3       China 1272915272 1280428583

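These are all representations of the same underlying data, but they are not equally easy to use.

There are three interrelated rules that make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Figure 12.1 shows the rules visually.

Figure 12.1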

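These three rules are interrelated. In practice they reduce to a simpler pair of instructions:

Put each dataset in a tibble.
Put each variable in a column.

In this example, only table1 is tidy. It is the only representation in which each column is a variable.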

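Why make sure that your data is tidy? There are two main advantages:

There is a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it is easier to learn the tools that work with it.
There is a specific advantage to placing variables in columns: it lets R's vectorised nature do the work.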
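To illustrate the second point, here is a minimal base R sketch. The data frame tb is a hypothetical two-row excerpt of the values shown in table1; because each variable is a whole column, a single expression computes the rate for every row.

```r
# Hypothetical excerpt of table1, with one variable per column.
tb <- data.frame(
  country    = c("Afghanistan", "Brazil"),
  cases      = c(745, 37737),
  population = c(19987071, 172006362)
)

# Each column is a vector, so this one expression is applied element-wise
# to all rows at once; no explicit loop is needed.
tb$rate <- tb$cases / tb$population * 10000
round(tb$rate, 3)
```

If the cases and population values were interleaved in one column (as in table2), the same calculation would first require pulling the values apart.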

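dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.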

library(dplyr)  # mutate() and count() come from dplyr

# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)
#> # A tibble: 6 × 5
#>       country  year  cases population  rate
#>         <chr> <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071 0.373
#> 2 Afghanistan  2000   2666   20595360 1.294
#> 3      Brazil  1999  37737  172006362 2.194
#> 4      Brazil  2000  80488  174504898 4.612
#> 5       China  1999 212258 1272915272 1.667
#> 6       China  2000 213766 1280428583 1.669

# Compute cases per year
table1 %>%
  count(year, wt = cases)
#> # A tibble: 2 × 2
#>    year      n
#>   <int>  <int>
#> 1  1999 250740
#> 2  2000 296920

# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country))

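2.1 Exercises

1. Describe how the variables and observations are organised in each of the sample tables.
2. Compute the rate for table2, and for table4a + table4b. You will need to perform four operations:

Extract the number of TB cases per country per year.
Extract the matching population per country per year.
Divide cases by population, and multiply by 10000.
Store the result back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?

3. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?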

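3. gather() and spread()

The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you will need to consult the people who originally generated the data. The second step is to resolve one of two common problems:

One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.

Typically a dataset will only suffer from one of these problems; it will suffer from both only if you are unlucky. To fix these problems, you will need the two most important functions in tidyr: gather() and spread().

3.1 gather()

A common problem is a dataset where some of the column names are not names of variables, but values of a variable. Take table4a: the column names 1999 and 2000 represent values of the year variable, and each row represents two observations, not one.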

table4a
#> # A tibble: 3 × 3
#>       country `1999` `2000`
#> *       <chr>  <int>  <int>
#> 1 Afghanistan    745   2666
#> 2      Brazil  37737  80488
#> 3       China 212258 213766

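To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:

The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
The name of the variable whose values form the column names. This is the key, and here it is year.
The name of the variable whose values are spread over the cells. This is the value, and here it is cases.

Together those parameters make up the call to gather():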

table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
#> # A tibble: 6 × 3
#>       country  year  cases
#>         <chr> <chr>  <int>
#> 1 Afghanistan  1999    745
#> 2      Brazil  1999  37737
#> 3       China  1999 212258
#> 4 Afghanistan  2000   2666
#> 5      Brazil  2000  80488
#> 6       China  2000 213766

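The columns to gather are specified with dplyr::select() style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names (because they don't start with a letter), so we have to surround them in backticks. To learn about other ways to select columns, see the documentation for select().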
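As a sketch of those other selection styles, the calls below name the same two columns of tidyr::table4a in three ways: individually, as a range, and by excluding the one column that should stay. The object names by_name, by_range, and by_excl are made up for this illustration.

```r
library(tidyr)

# List the columns individually (the form used in this post).
by_name  <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")

# Give a range of adjacent columns.
by_range <- gather(table4a, `1999`:`2000`, key = "year", value = "cases")

# Exclude the column that should not be gathered.
by_excl  <- gather(table4a, -country, key = "year", value = "cases")

# All three selections should describe the same result.
all.equal(by_name, by_range)
all.equal(by_name, by_excl)
```

The range and exclusion forms become more convenient as the number of columns grows, as in the case study at the end of this post.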

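Figure 2: Applying the gather() function

In the final result, the gathered columns are dropped, and we get new key and value columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure 2. We can use gather() to tidy table4b in a similar fashion. The only difference is the variable stored in the cell values: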

table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
#> # A tibble: 6 × 3
#>       country  year population
#>         <chr> <chr>      <int>
#> 1 Afghanistan  1999   19987071
#> 2      Brazil  1999  172006362
#> 3       China  1999 1272915272
#> 4 Afghanistan  2000   20595360
#> 5      Brazil  2000  174504898
#> 6       China  2000 1280428583

To combine the tidied versions of table4a and table4b into a single tibble, we need to use dplyr::left_join(), which you will learn about in the relational data chapter.

tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
#> Joining, by = c("country", "year")
#> # A tibble: 6 × 4
#>       country  year  cases population
#>         <chr> <chr>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2      Brazil  1999  37737  172006362
#> 3       China  1999 212258 1272915272
#> 4 Afghanistan  2000   2666   20595360
#> 5      Brazil  2000  80488  174504898
#> 6       China  2000 213766 1280428583
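The "Joining, by = ..." message in the output above reports the key columns that left_join() picked automatically. A minimal variation, assuming dplyr is installed, is to name those keys explicitly with the by argument, which documents the join and suppresses the message. The name joined is introduced here only for illustration.

```r
library(tidyr)
library(dplyr)

tidy4a <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")
tidy4b <- gather(table4b, `1999`, `2000`, key = "year", value = "population")

# State the join keys explicitly instead of relying on the shared column names.
joined <- left_join(tidy4a, tidy4b, by = c("country", "year"))
dim(joined)  # 6 rows, 4 columns
```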

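3.2 spread()

spread() is the opposite of gather(). You use it when an observation is scattered across multiple rows. For example, take table2: an observation is a country in a year, but each observation is spread across two rows.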

table2
#> # A tibble: 12 × 4
#>       country  year       type     count
#>         <chr> <int>      <chr>     <int>
#> 1 Afghanistan  1999      cases       745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000      cases      2666
#> 4 Afghanistan  2000 population  20595360
#> 5      Brazil  1999      cases     37737
#> 6      Brazil  1999 population 172006362
#> # ... with 6 more rows

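To tidy this up, we first analyse the representation in a similar way to gather(). This time, however, we only need two parameters:

The column that contains variable names, the key column. Here, it is type.
The column that contains values from multiple variables, the value column. Here, it is count.

Once we have figured that out, we can use spread(), as shown in the code below and visualised in Figure 3.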

spread(table2, key = type, value = count)
#> # A tibble: 6 × 4
#>       country  year  cases population
#> *       <chr> <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583

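Figure 3

As you might have guessed from the shared key and value arguments, spread() and gather() are complements. gather() makes wide tables narrower and longer; spread() makes long tables shorter and wider.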
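The change in shape can be checked directly with dim(). This short sketch uses the tables that ship with tidyr; the dimensions in the comments are the ones shown in the printed outputs above.

```r
library(tidyr)

# gather(): a wide table becomes longer.
dim(table4a)                                                          # 3 rows, 3 columns
long4a <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")
dim(long4a)                                                           # 6 rows, 3 columns

# spread(): a long table becomes shorter and wider.
dim(table2)                                                           # 12 rows, 4 columns
wide2 <- spread(table2, key = type, value = count)
dim(wide2)                                                            # 6 rows, 4 columns
```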

3.3 Exercises

1. Why are gather() and spread() not perfectly symmetrical?

Carefully consider the following example:

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`)

(Hint: look at the variable types and think about column names.)

Both spread() and gather() have a convert argument. What does it do?

2. Why does this code fail?

table4a %>%
  gather(1999, 2000, key = "year", value = "cases")
#> Error in combine_vars(vars, ind_list): Position must be between 0 and n

3. Why does spreading this tibble fail? How could you add a new column to fix the problem?

people <- tribble(
  ~name,             ~key,     ~value,
  #-----------------|---------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

4. Tidy the simple tibble below. Do you need to gather it or spread it?

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

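4. separate() and unite()

So far you have learned how to tidy table2 and table4, but not table3. table3 has a different problem: one column (rate) contains two variables (cases and population). To fix this problem, we need the separate() function.

4.1 separate()

separate() pulls apart one column into multiple columns by splitting wherever a separator character appears. Take table3: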

table3
#> # A tibble: 6 × 3
#>       country  year              rate
#> *       <chr> <int>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

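The rate column contains both the cases and population variables, and we need to split it into two variables. separate() takes the name of the column to separate and the names of the columns to separate it into, as shown in Figure 4 and the code below.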

table3 %>%
  separate(rate, into = c("cases", "population"))
#> # A tibble: 6 × 4
#>       country  year  cases population
#> *       <chr> <int>  <chr>      <chr>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583

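Figure 4

By default, separate() splits values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, separate() split the values of rate at the forward slash characters. If you wish to use a specific character to separate a column, you can pass that character to the sep argument of separate(). For example, we could rewrite the code above as: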

table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")

(Formally, sep is a regular expression, which you will learn more about in the strings chapter.)
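Because sep is a regular expression, characters with a special meaning need escaping. The following is a small sketch on made-up data (the versions data frame and its column names are hypothetical): a dot matches any character in a regular expression, so splitting on a literal dot uses "\\.".

```r
library(tidyr)

# Hypothetical data: version strings of the form "major.minor".
versions <- data.frame(v = c("1.10", "2.3"), stringsAsFactors = FALSE)

# "." alone would match any character, so the literal dot is escaped.
parts <- separate(versions, v, into = c("major", "minor"), sep = "\\.")
parts
```

The forward slash in table3 has no special meaning in a regular expression, which is why sep = "/" works without escaping.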

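Look carefully at the column types: you will notice that cases and population are character columns. This is the default behaviour of separate(): it leaves the type of the column as is. Here, however, that is not very useful because those really are numbers. We can ask separate() to try to convert to better types using convert = TRUE: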

table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)
#> # A tibble: 6 × 4
#>       country  year  cases population
#> *       <chr> <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583

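You can also pass a vector of integers to sep. separate() will interpret the integers as positions to split at. Positive values start at 1 on the far left of the string; negative values start at -1 on the far right of the string. When using integers to separate strings, the length of sep should be one less than the number of names in into.

You can use this arrangement to separate the last two digits of each year. This makes the data less tidy, but is useful in other cases.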

table3 %>%
  separate(year, into = c("century", "year"), sep = 2)
#> # A tibble: 6 × 4
#>       country century  year              rate
#> *       <chr>   <chr> <chr>             <chr>
#> 1 Afghanistan      19    99      745/19987071
#> 2 Afghanistan      20    00     2666/20595360
#> 3      Brazil      19    99   37737/172006362
#> 4      Brazil      20    00   80488/174504898
#> 5       China      19    99 212258/1272915272
#> 6       China      20    00 213766/1280428583

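4.2 unite()

unite() is the inverse of separate(): it combines multiple columns into a single column. You will need it much less frequently than separate(), but it is still a useful tool.

Figure 5

We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved as tidyr::table5. unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style: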

table5 %>%
  unite(new, century, year)
#> # A tibble: 6 × 3
#>       country   new              rate
#> *       <chr> <chr>             <chr>
#> 1 Afghanistan 19_99      745/19987071
#> 2 Afghanistan 20_00     2666/20595360
#> 3      Brazil 19_99   37737/172006362
#> 4      Brazil 20_00   80488/174504898
#> 5       China 19_99 212258/1272915272
#> 6       China 20_00 213766/1280428583

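In this case we also need to use the sep argument. The default places an underscore (_) between the values from different columns. Here we don't want any separator, so we use "":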

table5 %>%
  unite(new, century, year, sep = "")
#> # A tibble: 6 × 3
#>       country   new              rate
#> *       <chr> <chr>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

4.3 Exercises

1. What do the extra and fill arguments do in separate()? Experiment with the two toy datasets below.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
  separate(x, c("one", "two", "three"))

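2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

3. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

5. Missing values

A value in a dataset can be missing in one of two ways:

Explicitly, i.e. flagged with NA.
Implicitly, i.e. simply not present in the data.

For example: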

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

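There are two missing values in this dataset:

The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be contains NA instead.
The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

One way to think about the difference is with this Zen-like koan: an explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence. The way a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns: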

stocks %>%
  spread(year, return)
#> # A tibble: 4 × 3
#>     qtr `2015` `2016`
#> * <dbl>  <dbl>  <dbl>
#> 1     1   1.88     NA
#> 2     2   0.59   0.92
#> 3     3   0.35   0.17
#> 4     4     NA   2.66

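Because these explicit missing values may not be important in other representations of the data, you can set na.rm = TRUE in gather() to turn explicit missing values implicit: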

stocks %>%
  spread(year, return) %>%
  gather(year, return, `2015`:`2016`, na.rm = TRUE)
#> # A tibble: 6 × 3
#>     qtr  year return
#> * <dbl> <chr>  <dbl>
#> 1     1  2015   1.88
#> 2     2  2015   0.59
#> 3     3  2015   0.35
#> 4     2  2016   0.92
#> 5     3  2016   0.17
#> 6     4  2016   2.66

Another important tool for making missing values explicit in tidy data is complete():

stocks %>%
  complete(year, qtr)
#> # A tibble: 8 × 3
#>    year   qtr return
#>   <dbl> <dbl>  <dbl>
#> 1  2015     1   1.88
#> 2  2015     2   0.59
#> 3  2015     3   0.35
#> 4  2015     4     NA
#> 5  2016     1     NA
#> 6  2016     2   0.92
#> # ... with 2 more rows

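complete() takes a set of columns and finds all their unique combinations. It then ensures the original dataset contains all of those combinations, filling in explicit NAs where necessary.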
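A quick way to see the effect is to count rows and missing values before and after. This sketch rebuilds the stocks data from above as a plain data frame so that it runs on its own; the name full is introduced only for illustration.

```r
library(tidyr)

stocks <- data.frame(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# 2 years x 4 quarters = 8 combinations; only 7 are present in the data.
full <- complete(stocks, year, qtr)

nrow(stocks)              # 7 rows before
nrow(full)                # 8 rows after
sum(is.na(full$return))   # 2: the original NA plus the newly explicit one
```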

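There is one other important tool that you should know for working with missing values. Sometimes, when a data source has primarily been used for data entry, a missing value indicates that the previous value should be carried forward: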

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

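You can fill in these missing values with fill(). It takes a set of columns in which you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).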

treatment %>%
  fill(person)
#> # A tibble: 4 × 3
#>             person treatment response
#>              <chr>     <dbl>    <dbl>
#> 1 Derrick Whitmore         1        7
#> 2 Derrick Whitmore         2       10
#> 3 Derrick Whitmore         3        9
#> 4  Katherine Burke         1        4
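To make "last observation carried forward" concrete, here is a base R sketch of the idea. The helper locf() is hypothetical and written only for illustration; it is not how tidyr implements fill().

```r
# Hypothetical helper: replace each NA with the closest preceding value.
locf <- function(x) {
  # Start at the second element; a leading NA has nothing to copy from.
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

person <- c("Derrick Whitmore", NA, NA, "Katherine Burke")
filled <- locf(person)
filled
```

Applied to the person column, this reproduces the values in the fill() output above: the two NA entries take the name from the row before them.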

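5.1 Exercises

Compare and contrast the fill arguments to spread() and complete().
What does the direction argument to fill() do?

6. Case study

Let's apply what we have learned to a real problem: the tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report (the original post links to the raw data). You can also run head(tidyr::who) to take a look.

There is a wealth of epidemiological information in this dataset, but it is challenging to work with the data in the form in which it is provided: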

who
#> # A tibble: 7,240 × 60
#>       country  iso2  iso3  year new_sp_m014 new_sp_m1524 new_sp_m2534
#>         <chr> <chr> <chr> <int>       <int>        <int>        <int>
#> 1 Afghanistan    AF   AFG  1980          NA           NA           NA
#> 2 Afghanistan    AF   AFG  1981          NA           NA           NA
#> 3 Afghanistan    AF   AFG  1982          NA           NA           NA
#> 4 Afghanistan    AF   AFG  1983          NA           NA           NA
#> 5 Afghanistan    AF   AFG  1984          NA           NA           NA
#> 6 Afghanistan    AF   AFG  1985          NA           NA           NA
#> # ... with 7,234 more rows, and 53 more variables: new_sp_m3544 <int>,
#> #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
#> #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
#> #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
#> #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
#> #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
#> #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
#> #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
#> #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
#> #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
#> #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
#> #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
#> #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
#> #   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
#> #   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
#> #   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
#> #   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
#> #   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
#> #   newrel_f65 <int>

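This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we will need multiple steps to tidy it. As with dplyr, the best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we have got:

It looks like country, iso2, and iso3 are three variables that redundantly specify the country.
year is clearly also a variable.
We don't know what all the other columns are yet, but given the structure in the variable names (e.g. new_sp_m014, new_ep_m014, new_ep_f014) these are likely to be values, not variables.

So we need to gather together all the columns from new_sp_m014 to newrel_f65. We don't know what those values represent yet, so we give them the generic name "key". We know the cells represent the count of cases, so we use the variable cases. There are a lot of missing values in the current representation, so for now we use na.rm just so we can focus on the values that are present.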

who1 <- who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
#> # A tibble: 76,046 × 6
#>       country  iso2  iso3  year         key cases
#> *       <chr> <chr> <chr> <int>       <chr> <int>
#> 1 Afghanistan    AF   AFG  1997 new_sp_m014     0
#> 2 Afghanistan    AF   AFG  1998 new_sp_m014    30
#> 3 Afghanistan    AF   AFG  1999 new_sp_m014     8
#> 4 Afghanistan    AF   AFG  2000 new_sp_m014    52
#> 5 Afghanistan    AF   AFG  2001 new_sp_m014   129
#> 6 Afghanistan    AF   AFG  2002 new_sp_m014    90
#> # ... with 7.604e+04 more rows

We can get some hint of the structure of the values in the new key column by counting them:

who1 %>%
  count(key)
#> # A tibble: 56 × 2
#>            key     n
#>          <chr> <int>
#> 1  new_ep_f014  1032
#> 2 new_ep_f1524  1021
#> 3 new_ep_f2534  1021
#> 4 new_ep_f3544  1021
#> 5 new_ep_f4554  1017
#> 6 new_ep_f5564  1017
#> # ... with 50 more rows

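You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:

rel stands for cases of relapse
ep stands for cases of extrapulmonary TB
sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers give the age group. The dataset groups cases into seven age groups:
014 = 0 - 14 years old
1524 = 15 - 24 years old
2534 = 25 - 34 years old
3544 = 35 - 44 years old
4554 = 45 - 54 years old
5564 = 55 - 64 years old
65 = 65 or older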
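The structure described by the data dictionary can be checked on a few codes with a regular expression in base R. This is only an illustrative sketch: the pattern and the parsed data frame are made up here, and the tidying in this post uses separate() instead. The optional underscore (_?) covers the newrel columns, which lack the underscore after "new".

```r
# Three column names taken from the who dataset.
codes <- c("new_sp_m014", "new_ep_f65", "newrel_m2534")

# new / type of TB / sex / age group, with an optional underscore after "new".
pattern <- "^(new)_?(sp|sn|ep|rel)_([mf])([0-9]+)$"

parsed <- data.frame(
  code = codes,
  type = sub(pattern, "\\2", codes),
  sex  = sub(pattern, "\\3", codes),
  age  = sub(pattern, "\\4", codes),
  stringsAsFactors = FALSE
)
parsed
```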

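We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent, because instead of new_rel we have newrel (it is hard to spot here, but if you don't fix it we will get errors in subsequent steps). You will learn about str_replace() in the strings chapter, but the basic idea is simple: replace the characters "newrel" with "new_rel". This makes all variable names consistent.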

who2 <- who1 %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
#> # A tibble: 76,046 × 6
#>       country  iso2  iso3  year         key cases
#>         <chr> <chr> <chr> <int>       <chr> <int>
#> 1 Afghanistan    AF   AFG  1997 new_sp_m014     0
#> 2 Afghanistan    AF   AFG  1998 new_sp_m014    30
#> 3 Afghanistan    AF   AFG  1999 new_sp_m014     8
#> 4 Afghanistan    AF   AFG  2000 new_sp_m014    52
#> 5 Afghanistan    AF   AFG  2001 new_sp_m014   129
#> 6 Afghanistan    AF   AFG  2002 new_sp_m014    90
#> # ... with 7.604e+04 more rows

We can separate the values in each code with two passes of separate(). The first pass splits the codes at each underscore.

who3 <- who2 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
#> # A tibble: 76,046 × 8
#>       country  iso2  iso3  year   new  type sexage cases
#> *       <chr> <chr> <chr> <int> <chr> <chr>  <chr> <int>
#> 1 Afghanistan    AF   AFG  1997   new    sp   m014     0
#> 2 Afghanistan    AF   AFG  1998   new    sp   m014    30
#> 3 Afghanistan    AF   AFG  1999   new    sp   m014     8
#> 4 Afghanistan    AF   AFG  2000   new    sp   m014    52
#> 5 Afghanistan    AF   AFG  2001   new    sp   m014   129
#> 6 Afghanistan    AF   AFG  2002   new    sp   m014    90
#> # ... with 7.604e+04 more rows

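Then we might as well drop the new column, because it is constant in this dataset. While we are dropping columns, let's also drop iso2 and iso3, since they are redundant.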

who3 %>%
  count(new)
#> # A tibble: 1 × 2
#>     new     n
#>   <chr> <int>
#> 1   new 76046
who4 <- who3 %>%
  select(-new, -iso2, -iso3)

Next we separate sexage into sex and age by splitting after the first character:

who5 <- who4 %>%
  separate(sexage, c("sex", "age"), sep = 1)
who5
#> # A tibble: 76,046 × 6
#>       country  year  type   sex   age cases
#> *       <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan  1997    sp     m   014     0
#> 2 Afghanistan  1998    sp     m   014    30
#> 3 Afghanistan  1999    sp     m   014     8
#> 4 Afghanistan  2000    sp     m   014    52
#> 5 Afghanistan  2001    sp     m   014   129
#> 6 Afghanistan  2002    sp     m   014    90
#> # ... with 7.604e+04 more rows

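The who dataset is now tidy!

I have shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn't how you would work interactively. Instead, you would gradually build up a complex pipe: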

who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)

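6.1 Exercises

In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What is the difference between an NA and zero?
What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))
Are iso2 and iso3 really redundant with country? Confirm this.
For each country, year, and sex, compute the total number of TB cases. Make an informative visualisation of the data.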

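7. Non-tidy data

If your data naturally fits into a rectangular structure composed of observations and variables, tidy data should be your default choice. But there are good reasons to use other structures: many useful and well-founded data structures are not tidy data. Tidy data is not the only way.

The original author recommends this blog post: http://simplystatistics.org/2016/02/17/non-tidy-data/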