Will Artificial Intelligence Replace "Data Analysis"?

01-29

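The original article comes from Data Column, the NCSU student blog. I enjoy translating in my spare time and am sharing it here for study and discussion.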

In 2011, McKinsey & Company released a report about the shortage of analytical talent, marking the beginning of the era of big data. The following year, an article in the Harvard Business Review called data scientist "the sexiest job of the 21st century." Businesses and organizations, in the private and public sectors alike, continue to express interest in hiring analytical talent, and some even report difficulty finding it.

While the future seems bright for analytical talent, those who have just set their minds on becoming analysts probably share the same deep fear that I do: will my job as a data analyst be automated in the future? This is not just a question for data analysts. The world changes faster than we can keep up with or even imagine. Everyone in their early 20s should consider how technology could affect their careers before deciding which one to pursue.

After reflecting on my own experience in the MSA (Master of Science in Analytics) program at NC State over the last six months, I would like to share some insights on this question, and hopefully alleviate some of the anxiety.

First, let's consider why people think analytics jobs could be automated by computers. It is no secret that data analysis and modeling rely heavily on computers. Nowadays, computers can not only show the modeling results but also help analysts pick the best model by applying each model to a test or validation set, or by using techniques such as cross-validation.

However, to pick the best model, we need to decide on the criteria for picking it, and this is where human intervention comes in. Often in analytics there is no single answer that fits all situations (this is where the running joke "it depends" at the Institute comes from), and thus the final call still relies on human judgment.
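To make the "it depends" point concrete, here is a minimal sketch of cross-validation in plain Python. Everything in it is an illustrative assumption rather than anything from the MSA program: the two toy "models" (predict the training mean, predict the training median), the made-up usage numbers, and the helper names `k_fold_splits`, `cv_score`, and `pick_best`. The computer does the scoring, but a person still has to choose whether mean absolute error or root mean squared error is the criterion, and the winner may differ between the two.

```python
from math import sqrt
from statistics import mean, median


def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


# Two deliberately simple "models": each learns one constant from the training data.
MODELS = {"mean": mean, "median": median}


def cv_score(fit, values, metric, k=3):
    """Average the chosen metric over k held-out folds."""
    scores = []
    for train, test in k_fold_splits(len(values), k):
        constant = fit([values[i] for i in train])
        actual = [values[i] for i in test]
        scores.append(metric(actual, [constant] * len(actual)))
    return mean(scores)


def pick_best(values, metric, k=3):
    """Return the model name with the lowest cross-validated score, plus all scores."""
    scores = {name: cv_score(fit, values, metric, k) for name, fit in MODELS.items()}
    return min(scores, key=scores.get), scores


usage = [10, 11, 9, 10, 30, 10, 11, 9, 10]  # made-up daily usage with one spike
for metric in (mae, rmse):
    best, scores = pick_best(usage, metric)
    print(metric.__name__, "->", best, scores)
```

The selection itself is a one-line `min`, which is the part that automates easily. Deciding which metric matches the business question is the part that does not.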

In my opinion, though, the biggest obstacle that prevents analytical jobs from being automated lies in the nature of data itself. To be more specific, let's talk about two of the Four Vs of Big Data: Variety and Veracity (the other two are Volume and Velocity).

In most real-world analytics projects, a large amount of effort goes into getting the data into an analytics-ready format. The resources required for this stage vary with the sources of your data. Let's assume you would like to predict power usage over the next week for an energy company. The information you need may be saved in a single database (in which case you're extremely lucky), or it could be spread across several databases in different departments of the organization, each in a different format.
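As a rough illustration of what "different formats" can mean, the sketch below harmonizes two hypothetical departmental exports into one schema. The department names, field names, date formats, units, and values are all made up for the example. The point is that someone has to know that one source records kWh with ISO dates and the other records MWh with US-style dates before any code can be written.

```python
from datetime import datetime

# Hypothetical exports from two departments (all names and values are made up).
billing_rows = [
    {"date": "2016-01-04", "kwh": "1250.5"},
    {"date": "2016-01-05", "kwh": "1310.0"},
]
operations_rows = [
    {"day": "01/06/2016", "usage_mwh": "1.42"},
    {"day": "01/07/2016", "usage_mwh": "1.38"},
]


def from_billing(row):
    """Billing already uses ISO dates and kWh; only the types need converting."""
    return {
        "date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
        "kwh": float(row["kwh"]),
    }


def from_operations(row):
    """Operations uses month/day/year dates and MWh, so both need converting."""
    return {
        "date": datetime.strptime(row["day"], "%m/%d/%Y").date(),
        "kwh": float(row["usage_mwh"]) * 1000.0,  # MWh -> kWh
    }


def combine(billing, operations):
    """Map both sources to the common schema and sort by date."""
    records = [from_billing(r) for r in billing]
    records += [from_operations(r) for r in operations]
    return sorted(records, key=lambda r: r["date"])


for record in combine(billing_rows, operations_rows):
    print(record["date"].isoformat(), record["kwh"])
```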

Sometimes the information you need doesn't exist in the company's databases. For example, weather is highly related to power usage, so you would like to include weather data in your analysis, but it is not available in the current data set. You may need to scrape a website that provides such information, convert it into the same format as your current data set, and integrate the two.

All of this is just the tip of the iceberg when it comes to the variety of data. As you can see, pulling all the necessary information together and transforming it into an analytics-ready format requires a lot of human intervention, not to mention all the data cleaning work (missing values, etc.) once the data is put together.
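For the missing-value part of that cleaning work, here is a small sketch under assumed conditions: a made-up list of hourly readings in which `None` marks a gap, and two common but not universally appropriate treatments (dropping the gaps, or carrying the last known value forward). The function names are hypothetical. Which treatment is acceptable depends on why the values are missing, and that is again a human call.

```python
def find_missing(readings):
    """Return the positions where a reading is missing (None)."""
    return [i for i, value in enumerate(readings) if value is None]


def drop_missing(readings):
    """Discard missing readings; this shortens the series."""
    return [value for value in readings if value is not None]


def forward_fill(readings):
    """Replace each gap with the most recent known value.

    A gap at the very start has nothing to copy from and stays None.
    """
    filled, last = [], None
    for value in readings:
        if value is not None:
            last = value
        filled.append(last)
    return filled


hourly_kwh = [52.0, None, 55.5, None, None, 60.0]  # made-up readings
print("missing at:", find_missing(hourly_kwh))
print("dropped:", drop_missing(hourly_kwh))
print("forward-filled:", forward_fill(hourly_kwh))
```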

Let's move on to the veracity of the data, one of data analysts' biggest nightmares. The quality and accuracy of the data can only be judged by a human. After all, a computer is just a machine; it takes whatever data you feed in, and it does not have the ability to question the quality of that data. In many cases, data analysts are not involved in the data collection process, and the data they're given may not be suitable for answering the questions that are posed.
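A simple sanity check shows where the human judgment sits. In the sketch below, the function `flag_suspect`, the readings, and the plausible range of 0 to 500 are all assumptions made up for illustration. The code can flag values outside a range, but it has no way of knowing what range is plausible unless a person who understands the data supplies it.

```python
def flag_suspect(readings, low, high):
    """Return (index, value, reason) for readings outside the plausible range.

    The bounds `low` and `high` are not learned here; they encode what a
    person familiar with the data considers believable.
    """
    flags = []
    for i, value in enumerate(readings):
        if value < low:
            flags.append((i, value, "below plausible range"))
        elif value > high:
            flags.append((i, value, "above plausible range"))
    return flags


meter_kwh = [50.0, -3.0, 48.0, 9999.0]  # made-up meter readings
for index, value, reason in flag_suspect(meter_kwh, low=0.0, high=500.0):
    print(f"row {index}: {value} is {reason}")
```

Flagged rows still need someone to decide whether they are sensor faults, data-entry errors, or genuine extremes.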

Sometimes the analyst needs to communicate with those who designed the data collection to assess the quality of the data. Another factor that complicates the issue is privacy. Here at the Institute, every student is assigned to a practicum project and given the chance to work on a real-world problem for a sponsor.

Because of privacy concerns, the data handed to the students must not contain personal identifiers, which sometimes poses extra challenges for the analysis. For example, if you can't tell that two purchases were made by the same person, how could you find purchasing patterns at the individual level? As a result, analyzing data that has been masked to protect personal privacy requires a lot of human intervention.

So it seems that data analysis is nowhere near being automated, at least not in the next five years, and the demand for analytical talent might be larger than you think. If you think analytics is the right career for you, I would encourage you to pursue this path.

Again, one should never underestimate the disruptive power of technology. All of these arguments are based on current technologies. If some unexpected technology comes into play, they may no longer be valid.

Summary

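Data analysis faces challenges on many fronts, such as the variety and veracity of data, and current technology is nowhere near able to do without human involvement, so there is no need to worry about data analysts losing their jobs in the next few years. As for the future, the day may come when computers can solve the existing difficulties and take over many people's jobs.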