Translation: "R for Data Science", Chapter 1, Section 1.1

1 Introduction

Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.

1.1 What you will learn

Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. Our model of the tools needed in a typical data science project looks something like this:

[Figure: the data science workflow. Import, then tidy, then a repeated cycle of transform, visualise, and model, then communicate, with programming surrounding every step.]

First you must import your data into R. This typically means that you take data stored in a file, database, or web API, and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!
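As a small illustration of the import step, the sketch below writes a made-up CSV file to a temporary location and reads it back into a data frame. It uses base R's `read.csv()` only so that it runs without extra packages. The file contents and the column names (`city`, `year`, `population`) are invented for the example.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The cities and numbers are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "city,year,population",
  "Springfield,2016,30000",
  "Springfield,2017,30500",
  "Shelbyville,2016,21000"
), csv_path)

# Import: read the file from disk into a data frame.
cities <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column names and types that R inferred.
str(cities)
```

Data from a database or a web API would arrive through different functions, but the end result would be the same kind of object: a data frame in R.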

Once you've imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
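To make "each column is a variable, and each row is an observation" concrete, here is a minimal sketch in base R. The table `wide` is a hypothetical untidy dataset in which the year, which is really a variable, is hidden in the column names. The reshaping is done by hand for transparency, which is an assumption of this example and not necessarily how the book does it.

```r
# A hypothetical untidy table: values of the "year" variable live in column names.
wide <- data.frame(
  city     = c("Springfield", "Shelbyville"),
  pop_2016 = c(30000, 21000),
  pop_2017 = c(30500, 21400),
  stringsAsFactors = FALSE
)

# The columns that all hold the same variable (population) for different years.
value_cols <- c("pop_2016", "pop_2017")

# Tidy form: one row per (city, year) observation, one column per variable.
tidy <- data.frame(
  city       = rep(wide$city, times = length(value_cols)),
  year       = rep(as.integer(sub("pop_", "", value_cols)), each = nrow(wide)),
  population = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Sort for readability and reset the row names.
tidy <- tidy[order(tidy$city, tidy$year), ]
rownames(tidy) <- NULL

print(tidy)
```

In the tidy version, questions such as "what was the population of each city in 2017?" become a simple row filter, because `year` is now an ordinary column.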

Once you have tidy data, a common first step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling, because getting your data in a form that's natural to work with often feels like a fight!
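The three kinds of transformation can be sketched in a few lines of base R. The `trips` data frame and its columns (`city`, `distance_km`, `time_h`) are made up for this example, and base R is used only to keep the sketch free of package dependencies.

```r
# Hypothetical data: one row per trip.
trips <- data.frame(
  city        = c("A", "A", "B", "B", "B"),
  distance_km = c(10, 24, 5, 12, 30),
  time_h      = c(0.5, 1.0, 0.25, 0.5, 1.5),
  stringsAsFactors = FALSE
)

# 1. Narrow in on observations of interest: only the trips in city B.
city_b <- trips[trips$city == "B", ]

# 2. Create a new variable that is a function of existing variables.
trips$speed_kmh <- trips$distance_km / trips$time_h

# 3. Calculate summary statistics: a count and a mean per city.
trip_counts <- table(trips$city)
mean_speed  <- aggregate(speed_kmh ~ city, data = trips, FUN = mean)

print(city_b)
print(trip_counts)
print(mean_speed)
```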

Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.

Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you're asking the wrong question, or you need to collect different data. Visualisations can surprise you, but don't scale particularly well because they require a human to interpret them.

Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
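One way to see how the two tools complement each other is the sketch below, which uses made-up data and base R only. A straight-line model is fitted to data that actually follow a curve. The model returns coefficients without complaint, because linearity is one of its assumptions. A plot of the residuals is what reveals the problem.

```r
# Made-up data with a curved (quadratic) relationship between x and y.
x <- 1:20
y <- x^2

# A linear model assumes a straight line and fits one regardless.
fit <- lm(y ~ x)
print(coef(fit))

# The model cannot question its own assumption, but a plot can expose it:
# the residuals form a clear U shape instead of scattering around zero.
plot(x, resid(fit),
     xlab = "x", ylab = "residual",
     main = "Residuals from a straight-line fit")
abline(h = 0, lty = 2)
```

Seeing the U shape might prompt a more precise question (is the relationship quadratic?), which a revised model can then answer. This is one turn of the iteration between visualisation and modelling.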

The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn't matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.

Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.

You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.

