What do you need to learn to become a data analyst?

01-07

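I've just started my first job, in the games industry, and I'm honestly lost. My background is in statistics, with a bachelor's degree. Thanks in advance...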

This reminds me of a saying:

Big data is like teen sex.

Everyone talks about it,

Nobody really knows how to do it,

Everyone thinks everyone else is doing it,

So everyone claims they are doing it...

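The same blunt truth applies to data analysts, who in the US might be called data scientists or quant analysts. I once saw someone on Quora cite statistics saying that 90% of such job postings can't strictly be called data scientist roles. The title is also far too broad: anyone who touches data even a little could arguably claim it. But practice really does vary enormously by industry. Internet companies, for example, don't use SAS or SPSS; Python and R are the workhorses, and on the data storage side they expect you to know Hadoop. Some of the big industry giants, though, may well use SAS, and their databases are usually SQL-based. I've never heard of anyone using SPSS or Excel as their main tool.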

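If the OP can pin down what kind of position they mean by "data analyst", here's a simple method: look up the hiring pages of your dream companies and read the job descriptions and requirements.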

I'll paste two quickly. The first is from The New York Times: media industry, data-driven decision making.

- Frame and conduct complex analyses and experiments using tremendously large (e.g. 10^6 to 10^10 records), complex (not always well-structured, highly variable) data sets

- Answer product questions by using appropriate statistical techniques on available data
- Analyze and interpret the results of product experiments;
- Design and implement scripts, programs, databases, and other software components
- Draw conclusions and effectively communicate findings with both technical and non-technical team members

- Review relevant academic and industry research to identify useful algorithms, techniques, libraries, etc.
- Share your knowledge by communicating your findings through sensible presentation and accessible language and mentor staff on relevant procedures and techniques
- Work closely with a product engineering team to identify and answer important product questions;
- Develop best practices for instrumentation and experimentation and communicate those to product engineering teams.
Requirements:
- Bachelors Degree in Math, Statistics, Computer Science, or other quantitative discipline
- 3+ years experience in R, Python, Java, or other languages appropriate for large scale analysis of numerical and textual data
- 2+ years experience with data mining, machine learning, statistical modeling tools and underlying algorithms
- 2+ years experience with relational databases and SQL

- 2+ years experience working with extremely large data sets
- 2+ years experience with Hadoop or similar distributed computing and storage platforms
- Proficiency with Unix/Linux environments.

Qualifications:
- Graduate degree in Math, Statistics, Computer Science, or other quantitative discipline
- Extremely strong analytical and problem-solving skills
- A strong passion for empirical research and for answering hard questions with data;
- Proven track record of solving challenging problems in both academia and industry
- Experience with real-time/online recommender systems

The second is from LinkedIn, an internet company whose data team is well known across the industry.

- Extract and analyze LinkedIn data to drive product strategy.
- Formulate success metrics for completely novel products.
- Design, build and use tools for understanding how our products perform.
- Design and analyze experiments to test new product ideas.
- Develop new algorithms and models to improve our products.
LinkedIn's amazingly rich data about our worldwide network of professionals is a fantastic playground for a Data Scientist. You'll have the opportunity to work with some of the best data people anywhere in an environment which truly values (read: is obsessed with) data-driven decisions.
Required qualifications include:

- BS/MS in a quantitative discipline: Statistics, Applied Mathematics, Operations Research, Computer Science, Engineering, Economics, etc.
- 1+ years experience working with large amounts of real data.
- Expertise in applied statistics, including regression models.
- Able to translate business objectives into actionable analyses.
- Able to communicate findings clearly to both technical and non-technical audiences
- Expertise in at least one statistical software package, preferably R.
- Proficiency in SQL.
- Advanced skills in a scripting language such as Python.
Preferred Qualifications include:

- PhD in a quantitative discipline: Statistics, Applied Mathematics, Operations Research, Computer Science, Engineering, Economics, etc.
- 3+ years experience working with large amounts of real data
- Advanced skills in Java/C++
- Experience in Hadoop or other MapReduce paradigms and associated languages such as Pig, Sawzall, etc.

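Data is the lowest-level material in the DIKW Pyramid (Data, Information, Knowledge, Wisdom). Data engineering is the whole process of collecting data, processing it, and extracting value from it (turning it into I or K). First, let me introduce the roles involved: Data Engineer, Data Scientist, and Data Analyst. The three overlap heavily in their tasks and have to work closely together, but each is responsible for a slightly different area. At most companies, people in these roles wear several hats depending on their individual strengths, so the roles can be hard to tell apart.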

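Data Engineer: Analyzing data inevitably means using computers and various tools to automate data processing, including format conversion, storage, updates, and queries. The data engineer's job is to build the tools that carry out this automation; it belongs to the Infrastructure/Tools layer.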

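This role doesn't show up very often. With mature database technologies such as MySQL and Oracle available off the shelf, many large companies get by with just DBAs. And now that Hadoop and NoSQL stores such as MongoDB are open source, even big-data scenarios leave little for an engineer to do; the work usually goes to the scientists. As far as I know, Facebook has a dedicated database team because its data volume is extraordinary and its business is unusual; Square has a Data Engineering team because its requirements for data stability are strict; and Google goes without saying: just bow to the names GFS, BigTable, and MapReduce.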

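Data Scientist: The data scientist is the middle role, the one tied to mathematics. They use mathematical methods to process raw data and uncover higher-level data that the naked eye can't see, typically with Statistical Machine Learning methods, and lately Deep Learning has become fashionable too. Some call the Data Scientist a Programming Statistician: they need a solid grounding in statistics, but they also take part in developing a lot of learning programs (built on top of the Infrastructure), and a great many Data Scientist positions now also require doubling as a Data Engineer. Data Scientists are the main force that turns D into I or K.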

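Data Analyst: Engineers and scientists do a great deal of work to extract as much value (I/K) as possible with computer programs, but drawing deeper value out of the data takes rich industry experience and insight, and that requires human involvement. A Data Analyst needs a deep understanding of the business and must be fluent with the tools at hand (whether Excel, SPSS, Python/R, or tools the engineers have built for you; when necessary you also have to act as your own engineer and scientist and do whatever it takes to get the tools you need) in order to analyze data in a targeted way. They then have to present their findings substantively to other functions so that the findings ultimately turn into action. This is how data finally becomes Wisdom.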

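This position isn't very common either. Many companies have no such role, because the C-level people or the product managers do the data analysis themselves. The only places I know of where such positions exist in large numbers are Wall Street and the NSA, because there are huge numbers of cases to handle and each case needs someone to analyze it.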

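Worth mentioning: in its early days PayPal had to deal with fraud internally and built up a great deal of experience in fraud analysis. PayPal co-founder Peter Thiel later founded Palantir, which builds data analysis tool platforms and has successfully helped many organizations in the US with problems such as counterterrorism and human trafficking that need expert involvement. One of Palantir's slogans is "Surface data, not mining it" (present the data rather than mine it). It's a rather interesting point of view :)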

First posted as an answer to: How do you become a data analyst? What skills do you need?

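Be proficient in statistical software such as Excel, SPSS, and SAS, know SQL, be familiar with the standard analysis methods, and ideally be able to build models too. That's about it in short, for your reference.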

A feel for data + Excel + PPT...

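Software is beside the point. As long as you know your statistical analysis methods and your modeling methods (and understand the principles behind them, of course) and can operate one statistical package (such as SPSS or SAS), you're covered everywhere.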

Of course you also need basic SQL, so that you can export data from a database.
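To make "basic SQL to export data" concrete, here's a minimal sketch in Python using the standard-library sqlite3 and csv modules. Everything specific in it is an assumption for illustration only: the `payments` table, its columns, the sample rows, and the helper names `build_demo_db` and `export_query_to_csv` are all made up, and a real company database would be reached through its own driver rather than an in-memory SQLite file.

```python
import csv
import io
import sqlite3


def build_demo_db():
    """Create an in-memory database with a made-up payments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (player_id INTEGER, amount REAL, pay_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO payments VALUES (?, ?, ?)",
        [
            (1, 6.0, "2015-01-05"),
            (1, 30.0, "2015-01-06"),
            (2, 6.0, "2015-01-06"),
            (3, 98.0, "2015-01-07"),
        ],
    )
    return conn


# Filter, group, and aggregate: the handful of SQL clauses an analyst uses daily.
QUERY = """
SELECT player_id, COUNT(*) AS n_payments, SUM(amount) AS total_amount
FROM payments
WHERE pay_date >= ?
GROUP BY player_id
ORDER BY player_id
"""


def export_query_to_csv(conn, query, params=()):
    """Run a SELECT and return the result as CSV text, header row included."""
    cur = conn.execute(query, params)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)  # one CSV line per result row
    return buf.getvalue()


if __name__ == "__main__":
    print(export_query_to_csv(build_demo_db(), QUERY, ("2015-01-06",)))
```

The `?` placeholder passes the date as a parameter instead of pasting it into the query string, which is a habit worth picking up early.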

It's best to know a few programming languages as well; no skill is ever a burden.

As for languages, once you know one, the rest follow. If you learned C++ or Java at university, other languages are easy to read.

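The big data analyst course is technology-driven and aims to train big data analysts. It starts from data analysis fundamentals and an introduction to the Linux operating system, systematically covers the theory behind Hadoop, HDFS, MapReduce, Hive, and HBase as well as Spark-based big data analysis, demonstrates in detail how to install and configure Hadoop in its three modes, and uses case studies to focus on Spark-based analyses such as regression and clustering. The core of the course is the thinking and architecture design behind big data analysis on the Hadoop architecture. Through cases such as meteorological data analysis, analysis of massive web logs, and smart-highway data analysis, it lets students grasp the real value of big data analysis in a short time and learn how to apply the Hadoop architecture to it, so that they can quickly grow into big data analysts with both theory and hands-on experience and better meet the strong demand for such analysts in today's internet economy. The curriculum and teaching approach guide students deeper step by step, and suit beginners with no background who are set on entering the big data industry. Rongxin Education (融信教育) focuses on training big data talent; details at http://www.rxzxedu.com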

Learn some Python; it's useful for things like processing data.
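For a taste of what "processing data" with Python can look like, here's a small sketch that uses only the standard library. The data is invented for illustration (a hypothetical game log with `player_id`, `level`, and `session_minutes` columns), and `summarize_by_level` is just a name I made up; in practice you'd probably reach for pandas once the files get bigger.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median

# Made-up sample data; note that player 4 has a missing session length.
RAW = """player_id,level,session_minutes
1,12,35
2,3,8
3,12,50
4,3,
5,7,22
6,3,12
"""


def summarize_by_level(text):
    """Group session lengths by level, skipping rows with a missing value."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        if row["session_minutes"] == "":
            continue  # drop incomplete records instead of guessing a value
        groups[int(row["level"])].append(float(row["session_minutes"]))
    return {
        level: {"n": len(values), "mean": mean(values), "median": median(values)}
        for level, values in sorted(groups.items())
    }


if __name__ == "__main__":
    for level, stats in summarize_by_level(RAW).items():
        print(level, stats)
```

Cleaning out bad rows, grouping, and computing a few summary statistics covers a surprising share of day-to-day analysis work.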