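Learn R | String Processing with the stringr Package (Part 1)
Preface
For data preprocessing, the dplyr, reshape2, and tidyr packages covered in earlier posts handle most data manipulation tasks comfortably. The more you use R, though, the more string processing becomes unavoidable. Base R has basic string handling built in, but it is not very convenient to use. That is why today we look at the stringr package. It greatly simplifies converting, searching, detecting, locating, matching, replacing, extracting, and splitting strings in R, and it wraps some of R's more cumbersome string functions. The author of the package is, once again, the great Hadley Wickham.
The series is organized as follows:
- String concatenation functions
- String computation functions
- String matching functions
- String transformation functions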
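I. String concatenation functions
1. Extracting words from a sentence: word()
Usage:
> word(string, start = 1L, end = start, sep = fixed(" "))
# sep: the separator between words; defaults to a single space

> data <- "A new way to explore the world"
# Extract the last word
> word(data, -1)
[1] "world"

# Starting at the first word, extract the first 1 to 7 words
# (start and end are vectorized, so one call returns all seven results)
> word(data, 1, 1:7)
[1] "A"                              "A new"
[3] "A new way"                      "A new way to"
[5] "A new way to explore"           "A new way to explore the"
[7] "A new way to explore the world"
# The reverse: from each of words 1 to 7 through the last word
> word(data, 1:7, -1)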
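As a small supplement to the calls above, here is a sketch of two things the examples only hint at: negative positions on both ends, and a separator other than a space. The strings `sentence` and `csv_line` are made up for illustration and do not come from the original post.

```r
library(stringr)

# Hypothetical example strings (assumptions, not from the original post)
sentence <- "A new way to explore the world"
csv_line <- "alpha,beta,gamma,delta"

# Negative positions count from the end: the last two words
last_two <- word(sentence, -2, -1)

# A custom separator: the 2nd through 3rd comma-separated fields
middle <- word(csv_line, 2, 3, sep = fixed(","))

print(last_two)  # "the world"
print(middle)    # "beta,gamma"
```

Note that word() returns the original text between the two word boundaries, so the separator is kept inside the result.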
2. Wrapping paragraphs: str_wrap()
Usage:
str_wrap(string, width = 80, indent = 0, exdent = 0)
# width: the target width of each line
# indent: indentation of the first line of each paragraph; none by default
# exdent: indentation of every line after the first in each paragraph; none by default

# `string` is assumed to hold the paragraph shown in the output below
# The default line width is 80 characters
> str_wrap(string)
[1] "Now we are engaged in a great civil war, testing whether that nation, or any\nnation so conceived, and so dedicated, can long endure. We are met on a great\nbattle field of that war."

# Join the fixed-width lines with the newline character \n
# cat(): concatenates its arguments and prints them
> cat(str_wrap(string), sep = "\n")
Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived, and so dedicated, can long endure. We are met on a great
battle field of that war.

# Indent the first line of the paragraph by four characters
> cat(str_wrap(string, indent = 4))
    Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived, and so dedicated, can long endure. We are met on a great
battle field of that war.
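The examples above use `indent` but not `exdent`. The following is a minimal sketch of a hanging indent, assuming a narrower width of 40 so that the text wraps onto several lines; the variable names `text`, `wrapped`, and `lines` are chosen here for illustration.

```r
library(stringr)

# Hypothetical input, a shortened version of the paragraph used above
text <- "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure."

# Wrap at 40 characters and indent every line after the first by two spaces
wrapped <- str_wrap(text, width = 40, exdent = 2)

# str_wrap() returns one string with embedded newlines; cat() renders them
cat(wrapped, "\n")

# Split on the newlines to inspect the individual lines
lines <- str_split(wrapped, "\n")[[1]]
print(lines)
```

The result is still a single string per input element, so splitting on the newline is only needed when the lines have to be processed one by one.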
3. Trimming extra whitespace: str_trim()
Usage:
> str_trim(string, side = c("both", "left", "right"))

> data <- "  A new way to explore the world  "
# both: remove both leading and trailing whitespace
> str_trim(data, side = "both")
[1] "A new way to explore the world"
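For comparison, here is a short sketch of the other two values of `side`. The input string `padded` is a made-up example with two spaces on each end.

```r
library(stringr)

# Hypothetical input: two spaces before and after the text
padded <- "  A new way  "

left_trimmed  <- str_trim(padded, side = "left")   # trailing spaces kept
right_trimmed <- str_trim(padded, side = "right")  # leading spaces kept
both_trimmed  <- str_trim(padded)                  # side defaults to "both"

print(left_trimmed)   # "A new way  "
print(right_trimmed)  # "  A new way"
print(both_trimmed)   # "A new way"
```

In all three cases the spaces between the words are left alone; str_trim() only touches the ends of the string.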
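4. Concatenating strings: str_c()
Usage:
str_c(..., sep = "", collapse = NULL)
# sep: the separator placed between the arguments, as in paste()
# collapse: if not NULL, the separator used to join the elements of the result into a single string; sep has no effect when only one vector is passed

> str_c("x", 1:10, ":")
 [1] "x1:"  "x2:"  "x3:"  "x4:"  "x5:"  "x6:"  "x7:"  "x8:"  "x9:"  "x10:"
# The values are numeric, so the leading zeros of 09 and 01 are lost
> str_c(c(2016, 09, 01), collapse = "-")
[1] "2016-9-1"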
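The difference between `sep` and `collapse` is easy to mix up, so here is a small sketch that puts them side by side. The vector `parts` is an illustrative assumption; it stores the date components as strings so that the leading zeros survive.

```r
library(stringr)

# Hypothetical date components kept as character strings
parts <- c("2016", "09", "01")

# collapse joins the elements of one vector into a single string
date_string <- str_c(parts, collapse = "-")

# sep goes between the arguments, element by element
labels <- str_c("x", 1:3, sep = "_")

# Both together: sep first, then collapse over the result
joined <- str_c("x", 1:3, sep = "_", collapse = "+")

# With a single vector argument there is nothing for sep to separate
unchanged <- str_c(parts, sep = "-")

print(date_string)  # "2016-09-01"
print(labels)       # "x_1" "x_2" "x_3"
print(joined)       # "x_1+x_2+x_3"
print(unchanged)    # "2016" "09" "01"
```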
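5. Padding strings: str_pad()
Usage:
> str_pad(string, width, side = c("left", "right", "both"), pad = " ")
# side: which side to pad; padding is added on the left by default
# pad: the padding character; a space by default

> data <- "Jason_lab"
# If width is smaller than the length of string, the original string is returned unchanged
> str_pad(data, width = 19, side = "both", pad = "*")
[1] "*****Jason_lab*****"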
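A common use of left padding is zero-filling identifiers. The sketch below assumes a made-up vector `ids` and also shows the case mentioned in the comment above, where a string is already longer than `width`.

```r
library(stringr)

# Hypothetical identifiers of different lengths
ids <- c("7", "42", "1234")

# Left-pad with zeros to width 3; "1234" is already longer and is left as is
left_padded <- str_pad(ids, width = 3, pad = "0")

# Right-pad with dots
right_padded <- str_pad("ab", width = 5, side = "right", pad = ".")

print(left_padded)   # "007" "042" "1234"
print(right_padded)  # "ab..."
```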
6. Duplicating strings: str_dup()
Usage:
> str_dup(string, times)

# The number 100 is coerced to the string "100" inside the character vector
> fruit <- c("apple", "banana", "pear", 100)
> str_dup(fruit, 2)
[1] "appleapple"   "bananabanana" "pearpear"     "100100"
> str_dup(fruit, 1:4)
[1] "apple"        "bananabanana" "pearpearpear" "100100100100"
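One practical use of str_dup(), sketched here under made-up names (`rule`, `indented`), is building separator lines and indentation by repeating a short string.

```r
library(stringr)

# A horizontal rule made of ten "=" characters
rule <- str_dup("=", 10)

# Indentation that grows with nesting depth; zero repetitions give ""
indented <- str_c(str_dup("  ", 0:2), c("root", "child", "leaf"))

print(rule)      # "=========="
print(indented)  # "root" "  child" "    leaf"
```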
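7. Extracting substrings: str_sub()
The difference between str_sub() and word() is that str_sub() extracts a substring by character position, while word() extracts whole words. str_sub() can also be used to replace part of a string.
> data <- "A new way to explore the world"
> str_sub(data, 1, 4)
[1] "A ne"
> word(data, 1, 4)
[1] "A new way to"

# Extract the characters at positions 1 through 7
> str_sub(data, end = 7)
[1] "A new w"
# Extract from position 7 to the end of the string
> str_sub(data, 7)
[1] "way to explore the world"
# Negative positions, counted from the end, work the same way in both cases

# Assign to the extracted substring to replace it
> data <- "A new way to explore the world"
> str_sub(data, 1, 5) <- "a old"; data
[1] "a old way to explore the world"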
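The negative positions mentioned above are not demonstrated in the transcript, so here is a minimal sketch. It reuses the sentence from the examples; the names `last_five`, `article`, `pieces`, and `edited` are introduced only for illustration.

```r
library(stringr)

data <- "A new way to explore the world"

# -1 is the last character, so -5 starts five characters from the end
last_five <- str_sub(data, -5)

# Both ends can be negative: characters 9 to 7 counted from the end
article <- str_sub(data, -9, -7)

# start and end are vectorized: two substrings in one call
pieces <- str_sub(data, c(1, 7), c(5, 9))

# Replacement also accepts negative positions
edited <- data
str_sub(edited, -5, -1) <- "globe"

print(last_five)  # "world"
print(article)    # "the"
print(pieces)     # "A new" "way"
print(edited)     # "A new way to explore the globe"
```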
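II. String computation functions
1. String length: str_length()
This function is similar to nchar(), but it always returns NA for a missing value. nchar() returns 2 for a bare NA, because a logical NA is coerced to the string "NA". In versions of R before 3.3.0 it also returned 2 for a missing string inside a character vector; current versions return NA there.
> fruit <- c("apple", "banana", "pear", NA)
> str_length(fruit)
[1]  5  6  4 NA
# Output from R older than 3.3.0; newer versions print 5 6 4 NA
> nchar(fruit)
[1] 5 6 4 2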
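To see the NA behavior regardless of R version, the following sketch sets nchar()'s `keepNA` argument explicitly. It assumes an R version recent enough to have that argument; the variable names are chosen here for illustration.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", NA)

# str_length() propagates the missing value
lengths_stringr <- str_length(fruit)

# keepNA = FALSE makes nchar() count the two printed characters of "NA"
lengths_legacy <- nchar(fruit, keepNA = FALSE)

# A bare logical NA is coerced to the string "NA" before counting
bare_na <- nchar(NA)

print(lengths_stringr)  # 5 6 4 NA
print(lengths_legacy)   # 5 6 4 2
print(bare_na)          # 2
```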
2. Counting matches: str_count()
Usage:
str_count(string, pattern = "")
# pattern: the pattern to match; the default "" counts the characters in each string

> fruit <- c("apple", "banana", "pear", 100)
> str_count(fruit, "a")
[1] 1 3 1 0
# \\d: matches any digit
> str_count(fruit, "\\d")
[1] 0 0 0 3
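Because `pattern` is a regular expression by default, a character such as `.` does not mean a literal dot. The sketch below contrasts a regex character class, a regex dot, and a literal dot wrapped in fixed(); the variable names are illustrative.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "100")

# A character class counts every vowel in each string
vowels <- str_count(fruit, "[aeiou]")

# As a regular expression, "." matches any character
any_char <- str_count("a.b.c", ".")

# fixed() turns the pattern into a literal string
literal_dots <- str_count("a.b.c", fixed("."))

print(vowels)        # 2 3 2 0
print(any_char)      # 5
print(literal_dots)  # 2
```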
3. Sorting character vectors: str_order(), str_sort()
Usage:
## Returns the indices that put x in sorted order
> str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
## Returns the sorted values themselves
> str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
# decreasing: sort direction; ascending by default
# na_last: whether missing values are placed last; TRUE by default

> fruit <- c("apple", "banana", "pear", "orange", "pinapple")
> str_sort(fruit, decreasing = TRUE)
[1] "pinapple" "pear"     "orange"   "banana"   "apple"
> str_order(fruit, decreasing = TRUE)
[1] 5 3 4 2 1
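The indices from str_order() are mainly useful for reordering something else in step with the strings. Here is a minimal sketch, assuming a made-up `prices` vector aligned with `fruit` and an explicit English locale so the result does not depend on the system settings.

```r
library(stringr)

fruit  <- c("apple", "banana", "pear", "orange", "pinapple")
prices <- c(3, 1, 4, 2, 5)  # hypothetical values, one per fruit

# Indices that sort fruit in ascending order
idx <- str_order(fruit, locale = "en")

# Indexing with idx gives the same result as str_sort()
sorted_fruit <- fruit[idx]

# The same indices reorder the companion vector
sorted_prices <- prices[idx]

print(idx)            # 1 2 4 3 5
print(sorted_fruit)   # "apple" "banana" "orange" "pear" "pinapple"
print(sorted_prices)  # 3 1 2 4 5
```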
To be continued.