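Learn R | String Processing with the stringr Package (Part 1)

Preface

In the data preprocessing stage, the dplyr, reshape2, and tidyr packages covered earlier handle most data manipulation tasks comfortably. As you use R more, though, string processing becomes unavoidable. Base R does offer basic string handling, but it is not very convenient to use. So today we look at the stringr package, which greatly simplifies converting, searching, detecting, locating, matching, replacing, extracting, and splitting strings in R, and wraps some of base R's more cumbersome string functions. The author of the package is, once again, Hadley Wickham.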

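This article is organized as follows:

  • String concatenation functions
  • String calculation functions
  • String matching functions
  • String transformation functions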

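I. String Concatenation Functions

1. Extracting words from a sentence: word()

Usage:

> word(string, start = 1L, end = start, sep = fixed(" "))
# sep: the separator between words; defaults to a single space

> data <- "A new way to explore the world"
# Extract the last word
> word(data, -1)
[1] "world"

# Starting from the first word, extract the first 1 to 7 words
# (start and end are vectorised, so one call returns all seven results)
> word(data, 1, 1:7)
[1] "A"                              "A new"
[3] "A new way"                      "A new way to"
[5] "A new way to explore"           "A new way to explore the"
[7] "A new way to explore the world"
# The reverse: from each of words 1 to 7 through the last word
> word(data, 1:7, -1)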
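As a small supplement to the word() examples, the sketch below shows negative positions on both ends and a non-default separator. The `path` value and the variable names are illustrative assumptions, not part of the original article.

```r
library(stringr)

sentence <- "A new way to explore the world"

# Negative positions count from the end: take the last two words.
last_two <- word(sentence, -2, -1)

# A custom separator (hypothetical input): split on "/" instead of a space.
path <- "usr/local/lib/R"
second_part <- word(path, 2, sep = fixed("/"))

print(last_two)     # "the world"
print(second_part)  # "local"
```

Wrapping the separator in fixed() keeps it a literal string, so characters such as "." or "|" are not interpreted as regular expression syntax.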

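2. Wrapping paragraphs: str_wrap()

Usage:

str_wrap(string, width = 80, indent = 0, exdent = 0)
# width: target width of each line
# indent: indentation of the first line of each paragraph; no indentation by default
# exdent: indentation of every line after the first in each paragraph; no indentation by default

# Sample text, reconstructed here from the output shown below
> string <- "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war."
# The default line width is 80 characters
> str_wrap(string)
[1] "Now we are engaged in a great civil war, testing whether that nation, or any\nnation so conceived, and so dedicated, can long endure. We are met on a great\nbattle field of that war."

# Print the wrapped text; cat() concatenates and prints its arguments,
# so each "\n" becomes an actual line break
> cat(str_wrap(string), sep = "\n")
Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived, and so dedicated, can long endure. We are met on a great
battle field of that war.

# Indent the first line of the paragraph by four characters
> cat(str_wrap(string, indent = 4))
    Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived, and so dedicated, can long endure. We are met on a great
battle field of that war.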
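The examples above leave width and exdent at their defaults. The following sketch combines all three layout arguments on a shorter excerpt of the same text; the chosen values (40, 2, 4) and the variable names are arbitrary, for illustration only.

```r
library(stringr)

text <- "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure."

# Narrower lines, first line indented by 2, following lines indented by 4.
wrapped <- str_wrap(text, width = 40, indent = 2, exdent = 4)
cat(wrapped, "\n")

# Split the result back into lines to inspect the indentation.
wrapped_lines <- str_split(wrapped, fixed("\n"))[[1]]
print(wrapped_lines)
```

A hanging indent like this (small indent, larger exdent) can be handy for printing bulleted or numbered notes in the console.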

3. Trimming surplus whitespace: str_trim()

Usage:

> str_trim(string, side = c("both", "left", "right"))

> data <- "  A new way to explore the world  "
# both: remove both leading and trailing whitespace
> str_trim(data, side = "both")
[1] "A new way to explore the world"
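To make the effect of the side argument concrete, here is a short sketch with a made-up input string. It also shows that str_trim() only touches the ends of a string and leaves interior whitespace alone.

```r
library(stringr)

padded <- "  A new way  "

left_trimmed  <- str_trim(padded, side = "left")   # leading spaces removed
right_trimmed <- str_trim(padded, side = "right")  # trailing spaces removed

# Whitespace inside the string is not affected.
interior <- str_trim(" a   b ")

print(left_trimmed)   # "A new way  "
print(right_trimmed)  # "  A new way"
print(interior)       # "a   b"
```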

4. Joining strings: str_c()

Usage:

str_c(..., sep = "", collapse = NULL)
# sep: the string inserted between the arguments, as in paste()
# collapse: if not NULL, the string used to join the elements of the resulting vector into a single string

> str_c("x", c(1:10), ":")
 [1] "x1:"  "x2:"  "x3:"  "x4:"  "x5:"  "x6:"  "x7:"  "x8:"  "x9:"  "x10:"
> str_c(c(2016, 09, 01), collapse = "-")
[1] "2016-9-1"
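The difference between sep and collapse is easy to mix up, so here is a minimal sketch contrasting the two, plus how str_c() treats missing values compared with paste(). The inputs and variable names are illustrative.

```r
library(stringr)

parts <- c("2016", "09", "01")

# sep is placed between separate arguments.
by_sep <- str_c("2016", "09", "01", sep = "-")

# collapse joins the elements of a single vector into one string.
by_collapse <- str_c(parts, collapse = "-")

# Both together: sep is applied element-wise first, then collapse.
labels <- str_c("x", 1:3, sep = "_", collapse = ", ")

# Missing values propagate in str_c(), whereas paste() turns them into "NA".
with_na_strc  <- str_c("a", NA)
with_na_paste <- paste("a", NA)

print(by_sep)         # "2016-09-01"
print(by_collapse)    # "2016-09-01"
print(labels)         # "x_1, x_2, x_3"
print(with_na_strc)   # NA
print(with_na_paste)  # "a NA"
```

Keeping the date parts as character strings preserves the leading zeros, which the numeric vector c(2016, 09, 01) loses.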

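5. Padding strings: str_pad()

Usage:

> str_pad(string, width, side = c("left", "right", "both"), pad = " ")
# side: the side on which padding is added; defaults to "left"
# pad: the padding character; defaults to a space

> data <- "Jason_lab"
# If width is smaller than the length of string, the original string is returned unchanged
> str_pad(data, width = 19, side = "both", pad = "*")
[1] "*****Jason_lab*****"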
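A common use of padding is zero-filling identifiers to a fixed width. The sketch below uses a made-up `ids` vector for that, and also checks the "width smaller than the string" case described above.

```r
library(stringr)

ids <- c("7", "42", "123")

# Left-pad with zeros to a fixed width of 5.
zero_padded <- str_pad(ids, width = 5, side = "left", pad = "0")

# Right-pad with dots.
right_padded <- str_pad("ab", width = 5, side = "right", pad = ".")

# width shorter than the string: the input comes back unchanged.
unchanged <- str_pad("Jason_lab", width = 5)

print(zero_padded)   # "00007" "00042" "00123"
print(right_padded)  # "ab..."
print(unchanged)     # "Jason_lab"
```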

6. Duplicating strings: str_dup()

Usage:

> str_dup(string, times)

# The number 100 is coerced to the string "100" when the vector is created
> fruit <- c("apple", "banana", "pear", 100)
> str_dup(fruit, 2)
[1] "appleapple"   "bananabanana" "pearpear"     "100100"
> str_dup(fruit, 1:4)
[1] "apple"        "bananabanana" "pearpearpear" "100100100100"
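One practical use of str_dup() is building separators or indentation. The following is only a sketch with invented inputs: it draws a rule of a given length and indents items by nesting depth.

```r
library(stringr)

# A horizontal rule made of ten "=" characters.
rule <- str_dup("=", 10)

# Indent each item by two spaces per nesting level (levels 0, 1, 2).
items <- c("a", "b", "c")
indented <- str_c(str_dup("  ", 0:2), items)

print(rule)      # "=========="
print(indented)  # "a"     "  b"   "    c"
```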

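7. Extracting substrings: str_sub()

The difference between str_sub() and word() is that the former extracts a substring by character position, while the latter extracts whole words. str_sub() can also be used to replace part of a string.

> data <- "A new way to explore the world"
> str_sub(data, 1, 4)
[1] "A ne"
> word(data, 1, 4)
[1] "A new way to"

# Extract the characters at positions 1 to 7
> str_sub(data, end = 7)
[1] "A new w"
# Extract from position 7 to the end of the string
> str_sub(data, 7)
[1] "way to explore the world"
# Negative positions, which count from the end of the string, work in both cases as well

# Assign to the extracted substring
> data <- "A new way to explore the world"
> str_sub(data, 1, 5) <- "a old"; data
[1] "a old way to explore the world"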
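The negative positions mentioned above are not demonstrated in the text, so here is a brief sketch. The replacement word "city" and the variable names are illustrative choices.

```r
library(stringr)

s <- "A new way to explore the world"

# The last five characters.
last_five <- str_sub(s, -5)

# Everything except the last six characters (" world").
without_tail <- str_sub(s, end = -7)

# Negative positions also work on the assignment form.
str_sub(s, -5) <- "city"

print(last_five)     # "world"
print(without_tail)  # "A new way to explore the"
print(s)             # "A new way to explore the city"
```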

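II. String Calculation Functions

1. String length: str_length()

This function is similar to nchar(), but it returns NA for a missing string. In R versions before 3.3.0, nchar() returned 2 for a missing string; current versions also return NA with the default type = "chars", although nchar(NA) still returns 2.

> fruit <- c("apple", "banana", "pear", NA)
> str_length(fruit)
[1]  5  6  4 NA
# Output from R versions before 3.3.0
> nchar(fruit)
[1] 5 6 4 2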
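The behaviour of nchar() on missing values depends on the R version and on its keepNA argument, so the following sketch spells out the cases explicitly. It assumes a reasonably recent R in which nchar() has the keepNA argument.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", NA)

# str_length() keeps the missing value missing.
lens <- str_length(fruit)

# A bare logical NA is coerced to the string "NA", which has two characters.
bare_na <- nchar(NA)

# keepNA = FALSE asks nchar() to count a missing string as 2,
# reproducing the output shown in the article.
legacy <- nchar(fruit, keepNA = FALSE)

print(lens)     # 5 6 4 NA
print(bare_na)  # 2
print(legacy)   # 5 6 4 2
```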

2. Counting matches: str_count()

Usage:

str_count(string, pattern = "")
# pattern: the pattern to match; the default "" counts the characters in each string

> fruit <- c("apple", "banana", "pear", 100)
> str_count(fruit, "a")
[1] 1 3 1 0
# "\\d" matches a digit
> str_count(fruit, "\\d")
[1] 0 0 0 3
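Because pattern is treated as a regular expression by default, characters such as "." have a special meaning. The sketch below, with illustrative inputs, counts a character class and contrasts a regex dot with a literal dot wrapped in fixed().

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "100")

# Count vowels using a character class.
vowels <- str_count(fruit, "[aeiou]")

# As a regex, "." matches any character; fixed(".") matches a literal dot.
any_char    <- str_count("a.b.c", ".")
literal_dot <- str_count("a.b.c", fixed("."))

print(vowels)       # 2 3 2 0
print(any_char)     # 5
print(literal_dot)  # 2
```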

3. Sorting character vectors: str_order(), str_sort()

Usage:

## Returns the indices that would sort the vector
> str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
## Returns the sorted values themselves
> str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
# decreasing: sort direction; ascending by default
# na_last: whether to place missing values at the end; defaults to TRUE

> fruit <- c("apple", "banana", "pear", "orange", "pineapple")
> str_sort(fruit, decreasing = TRUE)
[1] "pineapple" "pear"      "orange"    "banana"    "apple"
> str_order(fruit, decreasing = TRUE)
[1] 5 3 4 2 1
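The two functions are closely related: indexing a vector by the result of str_order() gives the same values as str_sort(). The sketch below checks this on a made-up vector that includes a missing value, and shows the effect of na_last.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "orange", NA)

# Indices that put the vector in ascending order (NA last by default).
idx <- str_order(fruit)

# Indexing by those positions reproduces str_sort().
sorted_by_index <- fruit[idx]
sorted_directly <- str_sort(fruit)

# na_last = FALSE moves missing values to the front.
na_first <- str_sort(fruit, na_last = FALSE)

print(idx)              # 1 2 4 3 5
print(sorted_by_index)  # "apple" "banana" "orange" "pear" NA
print(na_first)         # NA "apple" "banana" "orange" "pear"
```

str_order() is the one to reach for when a second vector, or the rows of a data frame, must be rearranged in the same order as the strings.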

To be continued

