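Translation: Reshaping Data in R
This post is my attempt at translating an article by Hadley Wickham on reshape, a package for reshaping data. The package is widely used in data processing, and its successor, reshape2, is even more popular. Take a look if you are interested; my translation is rough, so corrections and criticism are welcome.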
——————————————————————————————————————————
Reshaping Data in R
R語言之——數據的重塑
Hadley Wickham. reshape. had.co.nz
Abstract
摘要
Restructuring data is a common task in practical data analysis, and it is usually unintuitive and tedious. Data often has multiple levels of grouping (nested treatments, split plot designs, or repeated measurements) and typically requires investigation at multiple levels. For example, from a long-term clinical study we may be interested in investigating relationships over time, or between times or patients or treatments. Performing these investigations fluently requires the data to be reshaped in different ways.
在實際數據分析中,對數據進行重組往往是一個繁瑣並且乏味的通用步驟。這些數據往往有著多層次的分組(嵌套處理,裂區設計,或重複測量),並且通常需要多個水平的處理分析。例如,對於一個長期臨床研究,我們可能會比較關注在臨床試驗研究期間給葯次數、患者、治療方案之間的聯繫。想要達到這些研究目的就經常需要使用不同的方法從而對原始數據進行重塑。
Currently R supplies a reshape function that can perform some of these tasks, but confounds multiple steps in the process and is hard to use. We propose a new conceptual framework for reshaping operations and an R package to "deshape" data frames and then flexibly "reshape" them to meet your needs. This framework also produces contingency tables, cross-tabulations, and summary statistics.
目前,R提供了一個重塑函數,該函數確實可以實現某些功能,但當程序進行多個步驟的語句運行時,該函數往往會混淆語句,故應用性較差。為此,我們提出了一個新的概念框架用於操作重塑,開發出了一個R包用於數據的重塑並且靈活地「重塑」它們從而來滿足您的需求。該框架還能夠實現生成列聯表,交叉表、以及統計總結功能。
1 Introduction
1.前言
This paper discusses a conceptual framework for data reshaping, and describes an implementation of these principles in an R package, reshape.
本文討論了一個有關數據重塑的概念性框架,並描述了一項應用reshape R包開展的實施事例。
Data reshaping is easiest to define with respect to aggregation. Aggregation is a common and familiar task where data is reduced and rearranged into a smaller, more convenient form, with a concomitant reduction in the amount of information. One commonly used aggregation procedure is Excel's Pivot tables. Reshaping involves a similar rearrangement, but preserves all original information. Where aggregation reduces many cells in the original data set to one cell in the new dataset, reshaping preserves a one-to-one connection.
相對於聚合而言,數據重塑是最為容易的。聚合是一個常見的並為人熟悉的數據處理步驟,所謂聚合,就是數據被還原並重新排列成更小、更方便的形式,同時減少其信息量。Excel的樞軸表就是一個常用的聚合過程。重塑的原理類似於一種重排,但其保留所有原始信息。如果說聚合的套路是減少原始數據中的許多單元格而在新數據集中形成一個單元格,那麼重塑的模式則是保存新數據集與原始數據一對一的連接。
There are a number of general R functions that can aggregate data, for example tapply, by and aggregate, and a function specifically for reshaping data, reshape. Each of these functions tends to deal well with one or two specific scenarios, and each requires slightly different input arguments. In practice, careful thought is usually required to piece together the correct sequence of operations to arrange the data how you want. The reshape package overcomes these problems by using the conceptual framework defined below to solve a general set of problems using just two functions, reshape and deshape.
在R中有許多常見的函數可以進行數據聚合,例如 tapply, by 以及 aggregate函數,還有數據重塑的專屬函數reshape。以上每種函數都能處理一至二個具體方案,並且均需要輸入略有區別的參數。在實際操作中,我們需要通過仔細的思考將所需的數據按照正確操作順序進行排列。Reshape包則可以解決上述問題,通過使用reshape以及deshape這兩個函數定義的概念性框架,一般問題都能得以解決。
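The contrast between aggregation and reshaping can be sketched with base R alone. The data frame below is hypothetical toy data, not from the paper, and `stats::reshape` is the existing base function the paper mentions, not the package function it proposes.

```r
# Hypothetical toy data: two subjects, two replicates each.
scores <- data.frame(
  subject = c("a", "a", "b", "b"),
  rep     = c(1, 2, 1, 2),
  value   = c(5, 7, 2, 4)
)

# Aggregation: several cells collapse into one cell per subject.
means <- tapply(scores$value, scores$subject, mean)
agg   <- aggregate(value ~ subject, data = scores, FUN = mean)

# Reshaping: every original value survives, only the layout changes.
wide <- stats::reshape(scores, idvar = "subject", timevar = "rep",
                       direction = "wide")

print(means)
print(agg)
print(wide)
```

Note that `tapply`, `aggregate` and `stats::reshape` each take their inputs in a different form, which is the inconvenience the paragraph above describes.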
2 Conceptual framework
2 概念性框架
To help us think about all the ways we might rearrange a data set it is useful to think about data in a somewhat unusual fashion. Usually, we think about data in terms of a matrix or data frame, where we have observations in the rows and variables in the columns. In this form it is difficult to investigate relationships between other facets of the data: between subjects, or treatments, or replicates. Reshaping the data allows us to explore these other relationships while still being able to use the familiar tools that operate on columns. Reshaping is an important (but often unrecognised) part of practical data analysis and is often necessary when exploring, displaying and analysing data.
為了有助於我們找出所有可適用於重新排列數據集的方法,採用一種不尋常的方式來思考數據是很有用的。通常來說,我們會對矩陣或數據框架中的數據進行考察,考察行、列的變數。然而在這種形式中,很難調查出數據在其他方面之間,例如受試者之間、處理方式之間的關係。而將數據重塑就可以探索這些其他關係,同時仍然能夠使用熟悉的工具,對列操作。重塑是一個重要的(但往往不被認可)的實際數據分析的步驟,並且在探索、顯示和分析數據時顯得非常必要。
For the purposes of reshaping, we can divide the variables into two groups: identifier and measured variables.
為了重塑,我們可以將變數分為兩組:標識符和測量變數。
1. Identifier, or id, variables identify the unit that measurements take place on. Id variables are usually discrete, and are typically fixed by design. In ANOVA notation (Yijk), id variables are the indices on the variables (i, j, k).
1.標識符或者id變數能夠識別測量所發生的單位。 Id變數通常是離散的並且往往是通過設計來確定的。 在ANOVA符號(Yijk)中,id變數是變數(i,j,k)上的索引。
2. Measured variables represent what is measured on that unit (Y). It is possible to take this abstraction a step further and say there are only id variables and a value, where the id variables now also identify what measured variable the value represents. For example, we could represent this table:
2.測量的變數表示在該單元上測量的(Y)。 我們還可以進一步地提取這個抽象體,並且說明只有id變數和值,其中id變數現在也標識了測量變數所代表的值。 以下表為例:
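Each row now represents one observation of one variable. This is what I will refer to as "deshaped" data. Compared with the original data, it has a new id variable, "variable", and a new column, "value", which holds the observed values. We now have data in a form in which there is no distinction between our original measured variables and the other id variables.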
This form, in itself, is not terribly useful, but it is easy to manipulate. Another interesting feature of this form is that we do not need to store missing values explicitly; instead, they are reconstructed as necessary when the data is reshaped.
這種形式本身並不是非常有用的,但它很容易操作。 該形式的另一個有意思的特徵是我們不需要明確地存儲缺失值,而是在數據重塑時根據需要進行相關重構。
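As a rough illustration of the id-variables-plus-value form, here is a minimal base R sketch. The data and column names are hypothetical, chosen only for illustration, and the stacking is done by hand rather than with the package's deshape function.

```r
# Hypothetical wide data: id variables (subject, time) and two measured variables.
wide <- data.frame(
  subject = c("a", "a", "b", "b"),
  time    = c(1, 2, 1, 2),
  age     = c(20, 21, 30, 31),
  weight  = c(60, 61, 70, NA)
)

id_vars       <- c("subject", "time")
measured_vars <- setdiff(names(wide), id_vars)

# Stack the measured columns: one row per (id variables, variable, value).
long <- do.call(rbind, lapply(measured_vars, function(v) {
  data.frame(wide[id_vars], variable = v, value = wide[[v]],
             stringsAsFactors = FALSE)
}))

# Missing values do not need to be stored explicitly in this form.
long <- long[!is.na(long$value), ]
rownames(long) <- NULL

print(long)
```

The eight cells of measured data become seven rows, because the one missing weight is simply absent from the long form.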
3 Implementation
3 實施
With this conceptual framework established, I will discuss particular details of the implementation in R. Ideally, we want easy-to-use tools to restructure data frames that use the insights from the ideas above. I will discuss why we need a new package to reshape data, and how we can specify the form of the reshaped data.
建立這個概念框架後,我將在R中討論實現的具體細節。理想情況下,我們希望便於操作的工具來重構使用上述想法的數據框架。 我將討論為什麼我們需要一個新的包來重塑數據,以及如何指定重塑數據的形式。
The first step is to "deshape" the data. This is essentially a trivial operation, and very similar to the existing R function stack. The next challenge is to specify how we want the data to look with the reshape function. A natural way to do this is to specify which variables should form the columns and which should form the rows. In the usual data frame, the "variable" id variable forms the columns, while all other id variables form the rows. Aggregation occurs when the variables do not uniquely identify one row, and in this case we need an aggregation function to reduce the data. Examples later in the chapter will make this concrete.
第一步是「去塑造」數據。 這本質上是一個微不足道的操作,非常類似於現有的R函數棧。 下一個挑戰是指定如何使用重塑功能來查看數據。 常用的方法是指定哪些變數應該形成列,哪些應該形成行。 在通常的數據框中,「variable」id變數形成列,而所有其他id變數形成行。 當變數不僅僅標識一行時,就得採用聚合方法,在這種情況下,我們需要一個聚合函數來減少數據。 本章後面的例子會具體講到。
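The idea of choosing row and column variables, with an aggregation function stepping in when they do not identify a single value, can be sketched with base R's `tapply`. The long-form data below is hypothetical, and this is only an approximation of the behaviour described, not the package's reshape function.

```r
# Hypothetical long-form data: subject "a" has two potato scores,
# subject "b" has two grassy scores.
long <- data.frame(
  subject  = c("a", "a", "a", "b", "b", "b"),
  variable = c("potato", "potato", "grassy", "potato", "grassy", "grassy"),
  value    = c(6, 8, 1, 5, 2, 4)
)

# Rows are given first, columns second. Because (subject, variable) does not
# uniquely identify one row, an aggregation function (here mean) is required.
cast <- tapply(long$value,
               list(subject = long$subject, variable = long$variable),
               mean)

print(cast)
```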
The order in which the row and column variables are specified is very important. As with a contingency table there are many possible ways of displaying the same variables, and the way they are organised reveals different patterns in the data. Variables specified first vary slowest, and those specified last vary fastest. Because comparisons are made most easily between adjacent cells, the variable you are most interested in making comparisons between should be specified last, and the early variables should be thought of as conditioning variables. An additional constraint is that displays have limited width but essentially infinite length, so variables with many levels must be specified as row variables. It is also desirable to adhere to common conventions, so where possible, "variable" should appear in the column specification.
指定行和列變數的順序非常重要。 與應急表一樣,有許多可能的方式來顯示相同的變數,它們的組織方式顯示了數據的不同模式。 最先指定的變數變化最慢,最後指定的變數變化最快。 因為在相鄰單元之間最容易進行比較,所以您最感興趣進行比較的變數應該最後指定,早期變數被認為是條件變數。 一個附加的約束是顯示具有有限的寬度但是基本上是無限長度,因此具有許多級別的變數必須被指定為行變數。 按照常規慣例,所以在可能的情況下,「變數」應該出現在列規範中。
3.1 Deshaping
3.1 去塑造化
The R command to deshape a data set is deshape. If you don't specify either measured or id variables, the function will try to guess which are id variables: any factors, integers or columns with 5 or fewer different values. If you specify only the measured variables, it assumes the remainder are identifier variables, and vice versa.
將數據集去塑造化的R命令是deshape。 如果不指定測量值或id變數,函數將嘗試猜測哪些是id變數:任何因子,整數或具有5個或更少值的列。 如果僅指定測量變數,則假定餘數為標識符變數,反之亦然。
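The guessing rule described above can be sketched as a small base R helper. The function name `guess_id_vars` and the example data are hypothetical; this is an illustration of the stated rule, not the package's actual code.

```r
# Hypothetical helper: treat a column as an id variable if it is a factor,
# an integer vector, or has 5 or fewer distinct values.
guess_id_vars <- function(df) {
  is_id <- vapply(df, function(col) {
    is.factor(col) || is.integer(col) || length(unique(col)) <= 5
  }, logical(1))
  names(df)[is_id]
}

# Hypothetical example data.
df <- data.frame(
  treatment = factor(c(1, 1, 2, 2, 3, 3)),    # factor: id
  time      = 1:6,                            # integer: id
  rep       = c(1, 2, 1, 2, 1, 2),            # double, 2 distinct values: id
  potato    = c(2.9, 14.0, 11.0, 9.9, 1.2, 8.8) # double, 6 distinct values: measured
)

id_vars       <- guess_id_vars(df)
measured_vars <- setdiff(names(df), id_vars)

print(id_vars)
print(measured_vars)
```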
One complication of this design is that all values must be of the same type. This is not usually a big problem because most of the time you are dealing with numeric data. I have been experimenting with storing this data in a list for maximum flexibility - this however makes later code more complicated as we can no longer rely on straightforward vectorisation.
這種設計的一個複雜因素是所有值必須是相同的類型。 這通常不是一個大問題,因為大多數時候你正在處理數字型數據。 我一直在嘗試將這些數據存儲在一個列表中,以獲得最大的靈活性 - 然而這卻使得後來的代碼更加複雜,因為我們不再依賴於簡單的向量化。
3.2 Functions that return multiple values
3.2返回多個值的函數
Occasionally it is useful to aggregate with a function that returns multiple values, e.g. range, summary, etc. This can be thought of as combining multiple reshapes, each with an aggregation function that returns one variable. We do this with an additional variable, result_variable, that differentiates the multiple return values. This result variable uses names if available; otherwise it will create names of the form X1, X2, ... By default, this new id variable will be shown as the last column variable, but you can specify the position manually by including result_variable in the list of row and column variables.
有時候,使用返回多個值的函數(例如,範圍,摘要等)進行聚合是有用的。這可以被認為是將多組重塑與每個返回單變數的聚合函數相結合。 我們用一個額外的變數來執行,採用結果變數進行多個返回值的區分。此結果變數能使用自定義的名稱,否則將自行創建X1,X2,...形式的名稱。默認情況下,此新的id變數將顯示為最後一列變數,但您也可以通過將結果變數包含在列和列變數列表中來指定位置。
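To illustrate aggregating with a function that returns several values, here is a minimal base R sketch on hypothetical data. It mimics the naming behaviour described above (use names when available, otherwise X1, X2, ...) by hand; it does not call the package.

```r
# Hypothetical long-form data.
long <- data.frame(
  variable = c("potato", "potato", "potato", "grassy", "grassy"),
  value    = c(2, 9, 5, 0, 1)
)

groups <- split(long$value, long$variable)

# A function that returns named values: the names label the result columns.
named <- t(sapply(groups, function(x) c(min = min(x), max = max(x))))

# range() returns unnamed values, so fall back to names of the form X1, X2, ...
unnamed <- t(sapply(groups, range))
colnames(unnamed) <- paste0("X", seq_len(ncol(unnamed)))

print(named)
print(unnamed)
```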
3.3 Row and column names
3.3 行、列名稱
There are two ways to think about the results from an aggregation command, as either a matrix of numbers with some attributes that describe the row and column names, or as a data frame with the row names as columns. Most current R aggregation functions return the first, implicit, form, whereas reshape returns the explicit data frame form. Why the difference? The implicit form is often inconvenient to deal with – rownames are data too.
有兩種方法可以從聚合命令中考慮結果,即具有描述行和列名稱的某些屬性的數字矩陣,或作為列名稱的數據幀。大多數當前的R聚合函數返回第一個,隱式的形式,而重塑則返回顯式的數據框架。為什麼有這樣的區別呢?那是因為隱含的形式通常不便於處理 - 行名稱也是數據。
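The difference between the implicit and explicit forms can be sketched in base R. The data is hypothetical; `tapply` stands in for "most current R aggregation functions", and the conversion to a data frame is done by hand.

```r
# Implicit form: a matrix whose row and column names are attributes.
m <- tapply(c(6, 8, 5, 3),
            list(subject = c("a", "a", "b", "b"), time = c(1, 2, 1, 2)),
            sum)

# Explicit form: the row names become an ordinary column of a data frame.
explicit <- data.frame(subject = rownames(m), m, check.names = FALSE)
rownames(explicit) <- NULL

print(m)
print(explicit)
```

With the explicit form, the subject labels can be filtered, merged or plotted like any other column.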
3.4 Example
3.4 示例
The reshape package is available on CRAN and can be installed using the R command install.packages("reshape"). This section will work through some techniques using the reshape package with an example data set (french fries). The data is from a sensory experiment investigating the effect of different frying oils on the taste of french fries over time. There are three different types of frying oils (treatment), each in two different fryers (rep), tested by 12 people (subject) on 10 different days (time). The sensory attributes recorded, in order of desirability, are potato, buttery, grassy, rancid and painty flavours. The first few rows of the data look like:
該reshape包可從在CRAN上獲取,可以使用R命令install.packages(「reshape」)進行安裝。 本節將通過使用具有示例數據集(炸薯條)的數據進行重塑包練習。 數據來自於某個感官實驗,調查不同煎炸油對薯條的味道隨時間的影響。 有三種不同類型的煎炸油(treatment),分別在兩個不同的炸鍋(rep)中,由12人(subject)在10個不同日期(time)進行測試。 根據需要記錄的感官屬性是土豆、黃油、酸敗、grassy、顏色、味道。 數據的前幾行如下所示:
One of the first things we might be interested in is how balanced this design is, and whether there are many different missing values. We can investigate this using length as our aggregation function:
我們可能感興趣的第一件事情是這種設計的平衡性如何,以及是否有許多不同的缺失值。為此,我們可以使用長度作為聚合函數進行研究:
ff_d <- deshape(french_fries, id=1:4)
reshape(ff_d, subject ~ time, length)
subject X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
3 30 30 30 30 30 30 30 30 30 NA
10 30 30 30 30 30 30 30 30 30 30
15 30 30 30 30 25 30 30 30 30 30
16 30 30 30 30 30 30 30 29 30 30
19 30 30 30 30 30 30 30 30 30 30
31 30 30 30 30 30 30 30 30 NA 30
51 30 30 30 30 30 30 30 30 30 30
52 30 30 30 30 30 30 30 30 30 30
63 30 30 30 30 30 30 30 30 30 30
78 30 30 30 30 30 30 30 30 30 30
79 30 30 30 30 30 30 29 28 30 NA
86 30 30 30 30 30 30 30 30 NA 30
Of course we can also create our own aggregation function. Each subject should have had 30 observations at each time, so by displaying the difference we can more easily see where the data is missing.
當然我們也可以創建自己的聚合函數。 一個觀測目標每次應該有30次觀察,所以通過差異性顯示,我們可以更容易地看到數據丟失的位置。
reshape(ff_d, subject ~ time, function(x) 30 - length(x))
subject X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
3 0 0 0 0 0 0 0 0 0 NA
10 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 5 0 0 0 0 0
16 0 0 0 0 0 0 0 1 0 0
19 0 0 0 0 0 0 0 0 0 0
31 0 0 0 0 0 0 0 0 NA 0
51 0 0 0 0 0 0 0 0 0 0
52 0 0 0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0 0 0 0
78 0 0 0 0 0 0 0 0 0 0
79 0 0 0 0 0 0 1 2 0 NA
86 0 0 0 0 0 0 0 0 NA 0
We can also easily see the range of values that each variable takes:
我們還可以輕鬆地看到每個變數所需值的範圍:
variable max min
buttery 11.2 0
grassy 11.1 0
painty 13.1 0
potato 14.9 0
rancid 14.9 0
Since the data is fairly well balanced, we can do some (crude) investigation as
to the effects of the different treatments. For example, we can calculate the overall
means for each sensory attribute for each treatment:
由於數據相當平衡,我們可以針對對不同處理的效果做一些(粗略的)調查。 例如,我們可以對每項處理方式的感官屬性進行總體方式的計算:
reshape(ff_d, treatment ~ variable, mean,
margins=c("grand_col", "grand_row"))
treatment potato buttery grassy rancid painty NA.
1 6.89 1.78 0.649 4.07 2.58 3.19
2 7.00 1.97 0.663 3.62 2.46 3.15
3 6.97 1.72 0.681 3.87 2.53 3.15
. 6.95 1.82 0.664 3.85 2.52 3.16
Note the row and column margins. We can also produce margins at different levels. The following example shows the results broken down for subjects 3 and 10, with both overall means and means for each subject:
注意行和列邊距。 我們也可以在不同方面得出餘量。 以下示例顯示了針對主題3和11的分析結果,所用到的總體方式以及每個主題所用到的處理方式:
reshape(ff_d, treatment + subject ~ variable, mean,
margins="treatment", subset=subject %in% c(3,10))
treatment subject potato buttery grassy rancid painty
1 3 6.22 0.372 0.1889 2.11 3.11
10 9.95 6.750 0.5850 4.02 1.37
. 8.18 3.729 0.3974 3.11 2.20
2 3 6.74 0.589 0.1056 3.14 2.48
10 10.00 6.980 0.4750 2.15 0.82
. 8.45 3.953 0.3000 2.62 1.61
3 3 5.29 0.767 0.0944 2.86 2.87
10 10.03 6.450 0.1450 3.11 0.69
. 7.79 3.758 0.1211 2.99 1.72
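The margins in the two tables above can be approximated in base R. This sketch uses hypothetical toy data and assumes that each margin is obtained by applying the aggregation function to the pooled raw values, not by averaging the cell means; the paper does not spell this out, so treat that as an assumption.

```r
# Hypothetical long-form data.
long <- data.frame(
  treatment = c(1, 1, 1, 2, 2, 2),
  variable  = c("potato", "potato", "grassy", "potato", "grassy", "grassy"),
  value     = c(6, 8, 1, 5, 2, 4)
)

# Cell means: treatment in rows, variable in columns.
cells <- tapply(long$value, list(long$treatment, long$variable), mean)

# Margins computed from the raw values (assumption, see above).
col_margin <- tapply(long$value, long$treatment, mean) # extra column
row_margin <- tapply(long$value, long$variable, mean)  # extra row
grand      <- mean(long$value)                         # corner cell

with_margins <- rbind(cbind(cells, all = col_margin),
                      all = c(row_margin, grand))

print(with_margins)
```

In unbalanced data the two definitions differ: for treatment 1 the pooled mean is 5, while the mean of the two cell means (1 and 7) is 4.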
Finally, since we have a repetition over treatments, we might be interested in how reliable each subject is: are the scores for the two reps highly correlated? We can explore this graphically by reshaping the data and using a lattice plot. Our graphical tools work best when the things we want to compare are in different columns, so we'll reshape the data so we now have a column for each rep.
最後,由於我們重複了處理方式,我們可能會對每個受試者的可靠程度感興趣 - 兩位代表的得分是否高度相關? 我們可以通過重新形成數據並使用格子圖來圖形化地探索。 當我們想要比較的事情在不同的列中時,圖形工具是最有效的,所以我們將數據進行重塑,因而對每個代表均有列。
xyplot(X1 ~ X2 | variable, reshape(ff_d, ... ~ rep), aspect="iso")
If we wanted to explore the relationships between subjects or times or treatments we could follow similar steps.
如果我們想探索主體與時間或著處理方式之間的關係,我們可以採用類似的步驟。
4 Conclusion
4 結論
This paper has presented a useful framework with which to think about reshaping data, and an intuitive implementation in R.
本文提出了一個有用的框架,用於數據的重塑,並在R中直觀地實現。
Future work includes generalising the algorithms to deal better with non-numeric data and large data sets. There are also interesting opportunities to explore using the reshape tool to produce graphical summaries, and creating a GUI to make reshaping data easier. Stolte et al. [1] have explored some of these ideas in their Polaris software. It would also be useful to be able to link in to relational databases so that as much aggregation as possible can be pushed off to a program that doesn't require all the data to fit in memory.
未來的工作包括推廣該演算法以更好地處理非數字型數據和大數據集。還有一些有趣的機會探索使用重塑工具生成圖形摘要,並創建一個GUI從而使重塑數據更為容易。 Stotle等人[1]在Polaris軟體中探討了其中的一些想法。能夠鏈接到關係資料庫也是有用的,這樣儘可能多的聚合可以被推送到那些不需要所有數據適應內存的程序中。
References
參考文獻
[1] Chris Stolte, Diane Tang, and Pat Hanrahan. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE Transactions on Visualization and Computer Graphics, 8(1), 2002.