Learn R | String Processing with the stringr Package (Part 1)

01-27

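Preface

In the data preprocessing stage, the dplyr, reshape2, and tidyr packages we covered earlier handle most data-manipulation tasks comfortably. But the more you use R, the more unavoidable string processing becomes. Base R does have basic string-handling capabilities, but they are not very convenient to use. So today we look at the stringr package. It greatly simplifies converting, searching, detecting, locating, matching, replacing, extracting, and splitting strings in R, and it wraps some of base R's more cumbersome string functions. And of course, the author of the package is once again the great Hadley Wickham.

This article is organized as follows:

String concatenation functions
String calculation functions
String matching functions
String transformation functions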

I. String Concatenation Functions

1. Extracting words from a sentence: word()

Usage:

> word(string, start = 1L, end = start, sep = fixed(" "))
# sep: the separator between words; defaults to a single space

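> data <- "A new way to explore the world"
# Extract the last word
> word(data, -1)
[1] "world"

# Starting from the first word, extract the first 1 to 7 words (end is vectorised, so one result is returned per value)
> word(data, 1, 1:7)
[1] "A"                              "A new"
[3] "A new way"                      "A new way to"
[5] "A new way to explore"           "A new way to explore the"
[7] "A new way to explore the world"
# The reverse pattern: from each of the 7 words through to the last word
> word(data, 1:7, -1)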
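To make the start, end, and sep arguments more concrete, here is a minimal sketch. The comma-separated string csv_line is made up for illustration and is not from the original examples.

```r
library(stringr)

sentence <- "A new way to explore the world"

# Negative positions count from the end: take the last two words
last_two <- word(sentence, -2, -1)

# A different separator: treat "," as the word boundary instead of a space
# (csv_line is a hypothetical input used only for this sketch)
csv_line <- "alpha,beta,gamma"
second_field <- word(csv_line, 2, sep = fixed(","))

print(last_two)
print(second_field)
```

Because end defaults to start, passing a single position returns exactly one word.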

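2. Wrapping paragraphs: str_wrap()

Usage:

str_wrap(string, width = 80, indent = 0, exdent = 0)
# width: the target width of each line
# indent: indentation of the first line of each paragraph; no indentation by default
# exdent: indentation of every line after the first in each paragraph; no indentation by default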

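# string holds the paragraph used in this section (reconstructed here from the printed output)
> string <- "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war."
# The default line width is 80 characters
> str_wrap(string)
[1] "Now we are engaged in a great civil war, testing whether that nation, or any\nnation so conceived, and so dedicated, can long endure. We are met on a great\nbattle field of that war."

# The wrapped lines are joined by the newline character \n
# cat() concatenates and prints its arguments, so the \n shows up as a real line break
> cat(str_wrap(string), sep = "\n")
Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived, and so dedicated, can long endure. We are met on a great
battle field of that war.

# Indent the first line of the paragraph by four spaces
> cat(str_wrap(string, indent = 4))
    Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived, and so dedicated, can long endure. We are met on a great
battle field of that war.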
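The width and exdent arguments can be tried the same way. The following is only a sketch: the width of 40 and the exdent of 2 are arbitrary values chosen for illustration.

```r
library(stringr)

text <- "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure."

# Wrap to a narrower width and indent every line after the first by two spaces
wrapped <- str_wrap(text, width = 40, exdent = 2)
cat(wrapped, sep = "\n")

# str_wrap() returns one string with embedded "\n";
# split on them to inspect the individual lines
lines <- strsplit(wrapped, "\n", fixed = TRUE)[[1]]
print(lines)
```

Splitting on the newline is a handy way to check where the breaks landed without relying on how the console renders them.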

3. Trimming surrounding whitespace: str_trim()

Usage:

> str_trim(string, side = c("both", "left", "right"))

> data <- "  A new way to explore the world  "
# both: remove the whitespace at both the start and the end
> str_trim(data, side = "both")
[1] "A new way to explore the world"
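For comparison, here is a small sketch of the other side values. It also uses str_squish(), which is not covered in the original text and is assumed to be available in the installed stringr version (it was added in a later release).

```r
library(stringr)

padded <- "  A new   way to explore  "

# Remove whitespace on one side only
left_only  <- str_trim(padded, side = "left")
right_only <- str_trim(padded, side = "right")

# str_squish() (assumed available) trims both ends and also
# collapses runs of inner whitespace to a single space
squished <- str_squish(padded)

print(left_only)
print(right_only)
print(squished)
```

str_trim() never touches whitespace inside the string, which is why the triple space after "new" survives in the first two results.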

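4. Joining strings: str_c()

Usage:

str_c(..., sep = "", collapse = NULL)
# sep: the string inserted between the arguments, similar to paste()
# collapse: if not NULL, the string used to join the elements of the result into a single string; when only one vector is passed there is nothing for sep to separate, so only collapse has an effect

> str_c("x", c(1:10), ":")
 [1] "x1:"  "x2:"  "x3:"  "x4:"  "x5:"  "x6:"  "x7:"  "x8:"  "x9:"  "x10:"
> str_c(c(2016, 09, 01), collapse = "-")
[1] "2016-9-1"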
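The difference between sep and collapse is easiest to see side by side. This is a minimal sketch; the ids vector is made up for illustration.

```r
library(stringr)

ids <- c("a", "b", "c")

# sep joins the arguments element by element: one result per element
pairs <- str_c(ids, 1:3, sep = "_")

# collapse then joins those results into one single string
joined <- str_c(ids, 1:3, sep = "_", collapse = ", ")

# Unlike paste(), a missing value makes the whole result missing
with_na <- str_c("x", NA)

print(pairs)
print(joined)
print(with_na)
```

If the NA behaviour is not what you want, replace the missing values before joining.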

5. Padding strings: str_pad()

Usage:

> str_pad(string, width, side = c("left", "right", "both"), pad = " ")
# side: which side to pad on; pads on the left by default
# pad: the padding character; a space by default

> data <- "Jason_lab"
# If width is smaller than the length of string, the original string is returned unchanged
> str_pad(data, width = 19, side = "both", pad = "*")
[1] "*****Jason_lab*****"
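Two cases worth trying yourself are a width that is too small and zero-padding of numeric codes. A minimal sketch follows; the codes vector is a made-up example.

```r
library(stringr)

name <- "Jason_lab"   # 9 characters

# width smaller than the string length: the string comes back unchanged
unchanged <- str_pad(name, width = 5, pad = "*")

# Left-pad (the default side) number-like strings with zeros to a fixed width
codes <- str_pad(c("7", "42", "123"), width = 3, pad = "0")

print(unchanged)
print(codes)
```

str_pad() never truncates, so it is safe to apply to a column whose values may already be long enough.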

6. Duplicating strings: str_dup()

Usage:

> str_dup(string, times)

# c() coerces 100 to the string "100" here
> fruit <- c("apple", "banana", "pear", 100)
> str_dup(fruit, 2)
[1] "appleapple"   "bananabanana" "pearpear"     "100100"
> str_dup(fruit, 1:4)
[1] "apple"        "bananabanana" "pearpearpear" "100100100100"

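7. Extracting substrings: str_sub()

str_sub() differs from word() in that the former extracts a substring by character position, while the latter extracts whole words. str_sub() can also be used to replace part of a string.

> data <- "A new way to explore the world"
> str_sub(data, 1, 4)
[1] "A ne"
> word(data, 1, 4)
[1] "A new way to"

# Extract the characters at positions 1 to 7
> str_sub(data, end = 7)
[1] "A new w"
# Extract from position 7 to the end of the string
> str_sub(data, 7)
[1] "way to explore the world"
# Negative positions, counted from the end of the string, work the same way in both cases

# Assign to the extracted substring to replace it
> data <- "A new way to explore the world"
> str_sub(data, 1, 5) <- "a old"; data
[1] "a old way to explore the world"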
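The negative positions mentioned above can be sketched like this. The replacement word "globe" is an arbitrary choice for illustration.

```r
library(stringr)

data <- "A new way to explore the world"

# -1 is the last character, so -5 starts five characters from the end
last_five <- str_sub(data, -5)

# Drop the last six characters (" world") by ending at position -7
without_last_word <- str_sub(data, 1, -7)

# Replacement works with negative positions too
str_sub(data, -5, -1) <- "globe"

print(last_five)
print(without_last_word)
print(data)
```

Counting from the end is convenient when the tail of the string has a fixed width but the head does not.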

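II. String Calculation Functions

1. String length: str_length()

This function is similar to nchar(), but it returns NA for a missing value, whereas nchar() returned 2 in the R version used for the output below. In recent versions of R, nchar() with its default keepNA = NA also returns NA for a missing string when counting characters.

> fruit <- c("apple", "banana", "pear", NA)
> str_length(fruit)
[1]  5  6  4 NA
# nchar() output from the R version used at the time; recent versions print NA for the last element
> nchar(fruit)
[1] 5 6 4 2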
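To reproduce the difference regardless of the R version's defaults, one option is to set keepNA explicitly. This is a sketch that assumes an R version whose nchar() has the keepNA argument.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", NA)

# str_length() keeps the missing value missing
lengths <- str_length(fruit)

# keepNA = FALSE asks nchar() to count NA as the two printed characters "NA"
legacy_counts <- nchar(fruit, keepNA = FALSE)

print(lengths)
print(legacy_counts)
```

Keeping NA as NA is usually what you want, since a length of 2 would silently pass a filter such as "length greater than 1".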

2. Counting matches: str_count()

Usage:

str_count(string, pattern = "")
# pattern: the pattern to match; the default "" counts every character, which gives the length of each string

> fruit <- c("apple", "banana", "pear", 100)
> str_count(fruit, "a")
[1] 1 3 1 0
# \\d: matches digits
> str_count(fruit, "\\d")
[1] 0 0 0 3
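Because pattern is treated as a regular expression by default, it helps to compare a regex with a literal match. A minimal sketch, with the "a.b.c" string made up for illustration:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "100")

# A regex character class: count the vowels in each string
vowels <- str_count(fruit, "[aeiou]")

# fixed() matches the pattern literally, so only the two dots are counted
literal_dots <- str_count("a.b.c", fixed("."))

# As a regex, "." matches any character, so all five are counted
any_char <- str_count("a.b.c", ".")

print(vowels)
print(literal_dots)
print(any_char)
```

Wrapping the pattern in fixed() is the simplest way to avoid escaping regex metacharacters.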

3. Sorting character vectors: str_order(), str_sort()

Usage:

## Returns the indices that would sort the vector
> str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
## Returns the sorted values themselves
> str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
# decreasing: the sort direction; ascending (FALSE) by default
# na_last: whether missing values are placed at the end; TRUE by default

> fruit <- c("apple", "banana", "pear", "orange", "pineapple")
> str_sort(fruit, decreasing = TRUE)
[1] "pineapple" "pear"      "orange"    "banana"    "apple"
> str_order(fruit, decreasing = TRUE)
[1] 5 3 4 2 1
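The relationship between the two functions, and the effect of na_last, can be checked with a short sketch. The vector below is made up for illustration.

```r
library(stringr)

fruit <- c("pear", "apple", NA, "banana")

# str_order() gives the positions that would sort the vector ...
idx <- str_order(fruit)

# ... so indexing with them reproduces what str_sort() returns
sorted <- str_sort(fruit)
same <- identical(fruit[idx], sorted)

# na_last = FALSE moves the missing value to the front instead
na_first <- str_sort(fruit, na_last = FALSE)

print(idx)
print(sorted)
print(same)
print(na_first)
```

str_order() is the one to reach for when a second vector or the rows of a data frame need to be rearranged in the same order.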

To be continued