The Data Analysis Training Journey: Which Stop Are You At?

01-25

Lian Yujun (Zhihu | Jianshu | Gitee | GitHub)

This article is very interesting, so I am reposting it.
Original title: Software for Researchers: New Data and Applications
Author: Anton Tarasenko

Original post: January 14, 2016

Amazing! Take a look at this figure first and find where you currently stand.

The tools mentioned here help manage reproducible research and handle new types of data. Why should you go after new data? New data provides new insights. For example, the recent Clark Medal winners used unconventional data in their major works. The data was large and unstructured, so Excel, Word, and email wouldn't do the job.

I write for economists, but other social scientists can also find these recommendations useful. These tools have a steep learning curve and pay off over time. Some improve small-data analysis as well, but most gains come from new sources and real-time analysis.

Each section ends with a recommended reading list.

Standard Tools

LaTeX and Dropbox streamline collaboration. The recommended LaTeX editor is LyX. Zotero and its browser plugin manage references. LyX supports Zotero via another plugin.

Stata and Matlab do numerical computations. Both are paid and have good support and documentation. Free alternatives: IPython and RStudio for Stata, Octave for Matlab.

Mathematica does symbolic computations. Sage is a free alternative.

Frain, "Applied LaTeX for Economists, Social Scientists and Others." Or a shorter intro to LaTeX by another author.
UCLA, Stata Tutorial. This tutorial fits the economist's goals. To make it shorter, study Stata's very basic functionality and then google specific questions.
Varian, "Mathematica for Economists." Written 20 years ago. Mathematica has become more powerful since then. See their tutorials.

New Data Sources

The most general source is the Internet itself. Scraping information from websites sometimes requires permission (see the website's terms of use and robots.txt).
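As a minimal sketch of the robots.txt check mentioned above, Python's standard library can parse the rules and answer whether a given URL may be fetched. The rules, the URLs, and the agent name "research-bot" below are made up for illustration. A real scraper would read the file from the site itself and would still need to respect the terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at the root of the website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def allowed(robots_txt: str, url: str, agent: str = "research-bot") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/data/table.html",
                "https://example.com/private/users.html"):
        print(url, allowed(ROBOTS_TXT, url))
```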

Some websites have APIs, which send data in structured formats but limit the number of requests. Site owners may alter the limit by agreement. When the website has no API, Kimono and Import.io extract structured data from webpages. When they can't, BeautifulSoup and similar parsers can.
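To give a rough idea of what such a parser does, here is a small sketch that pulls the rows of an HTML table into Python lists. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages. The HTML fragment and the names TableParser and extract_rows are invented for this example.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded webpage.
HTML = """
<table>
  <tr><th>country</th><th>gdp_growth</th></tr>
  <tr><td>A</td><td>2.1</td></tr>
  <tr><td>B</td><td>-0.4</td></tr>
</table>
"""


def extract_rows(html: str) -> list:
    """Return the table cells of `html` as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in extract_rows(HTML):
        print(row)
```

BeautifulSoup offers the same result with less code and tolerates messier markup, which is why it is the usual choice in practice.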

Other sources include industrial software, custom data collection systems (like surveys on Amazon Mechanical Turk), and physical media. Text recognition systems require little manual labor, so digitizing analog sources is easy now.

Socrata, data.gov, Quandl, and FRED2 maintain the most comprehensive collections of public datasets. But the universe is much bigger, and exotic data hides elsewhere.

Varian, "Big Data."
Glaeser et al., "Big Data and Big Cities."
Athey and Imbens, "Big Data and Economics, Big Data and Economies."
National Academy of Sciences, Drawing Causal Inference from Big Data [videos]
StackExchange, Open Data. A website for data requests.

One Programming Language

A general purpose programming language can manage data that comes in peculiar formats or requires cleaning.

Use Python by default. Its packages also replicate core functionality of Stata, Matlab, and Mathematica. Other packages handle GIS, NLP, visual, and audio data.
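A small sketch of the kind of cleaning this refers to, using only the standard library. The semicolon-separated sample, the column names, and the assumption that slash dates are month/day/year are all invented for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export with stray spaces, currency symbols, and mixed date formats.
RAW = """firm;revenue;date
 Acme Corp ;$1,200.50;2015-03-01
Beta LLC;n/a;03/02/2015
gamma inc;  980 ;2015-03-03
"""

# Assumption: dates written with slashes are month/day/year.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_amount(text):
    """Turn '$1,200.50' into 1200.5; return None when the value is not a number."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_date(text):
    """Try each known format in turn; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean(raw):
    """Read the raw text and return a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(raw), delimiter=";")
    return [
        {
            "firm": row["firm"].strip(),
            "revenue": parse_amount(row["revenue"]),
            "date": parse_date(row["date"]),
        }
        for row in reader
    ]


if __name__ == "__main__":
    for record in clean(RAW):
        print(record)
```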

Python comes as a standalone installation or in special distributions like Anaconda. For easier troubleshooting, I recommend the standalone installation. Use pip for package management.

Python is slow compared to other popular languages, but certain tweaks make it fast enough to avoid learning other languages, like Julia or Java. Generally, execution time is not an issue. Computing power per dollar has roughly doubled every couple of years (Moore's Law), while the coder's time gets more expensive.

Command line interfaces make massive operations on files easier. For Macs and other *nix systems, learn bash. For Windows, see cmd.exe.
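For readers who have not met the shell yet, the sketch below shows the sort of bulk file operation meant here, written in Python so that it runs on any system. It is an illustration only: the file names, the prefix, and the helper names add_prefix and demo are made up, and a short bash loop would do the same job.

```python
import tempfile
from pathlib import Path


def add_prefix(folder, pattern, prefix):
    """Rename every file in `folder` matching `pattern` by prepending `prefix`."""
    renamed = []
    # sorted() lists the matches up front, so renamed files are not picked up again.
    for path in sorted(Path(folder).glob(pattern)):
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed


def demo():
    """Create a few hypothetical files in a temporary folder and rename the CSVs."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(3):
            (Path(tmp) / f"survey_{i}.csv").write_text("id,answer\n")
        (Path(tmp) / "notes.txt").write_text("keep me\n")
        renamed = add_prefix(tmp, "*.csv", "2016_")
        remaining = sorted(p.name for p in Path(tmp).iterdir())
        return renamed, remaining


if __name__ == "__main__":
    renamed, remaining = demo()
    print(renamed)
    print(remaining)
```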

Kevin Sheppard, "Introduction to Python for Econometrics, Statistics and Data Analysis."
McKinney, Python for Data Analysis. [free demo code from the book]
Sargent and Stachurski, "Quantitative Economics with Python." The major project using Python and Julia in economics. Check their lectures, use cases, and open source library.
Gentzkow and Shapiro, "What Drives Media Slant?" Natural language processing in media economics.
Dell, "GIS Analysis for Applied Economists." Use of Python for GIS data. Outdated in technical details, but demonstrates the approach.
Dell, "Trafficking Networks and the Mexican Drug War." Also see other works in economic geography by Dell.
Repository awesome-python. Best practices.

Version Control and Repository

Version control tracks changes in files. It includes:

showing changes made in text files: for taking control over multiple revisions
reverting and accepting changes: for reviewing contributions by coauthors
support for multiple branches: for tracking versions for different seminars and data sources
synchronizing changes across computers: for collaboration and remote processing
forking: for other researchers to replicate and extend your work

Version control by Git is a de facto standard. GitHub.com is the largest service that maintains Git repositories. It offers free storage for open projects and paid storage for private repositories.
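To make "showing changes made in text files" concrete, here is a minimal sketch that compares two revisions of a paragraph and prints the difference in the unified format that git diff also displays. It uses Python's difflib, not Git itself, and the file name and text are hypothetical.

```python
import difflib


def show_changes(old: str, new: str, name: str = "paper.tex") -> str:
    """Return a unified diff between two revisions of the same text file."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )


# Two hypothetical revisions of the same draft.
OLD = "Introduction.\nWe estimate the model with OLS.\nConclusion.\n"
NEW = "Introduction.\nWe estimate the model with IV.\nConclusion.\n"

if __name__ == "__main__":
    print(show_changes(OLD, NEW))
```

Lines starting with a minus sign were removed and lines starting with a plus sign were added, which is how a coauthor's edit shows up during review.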

Kramer, How to use GitHub and the terminal: a guide
Video introductions to GitHub

Sharing

Storage

A GitHub repository is a one-click solution for both code and data. No problems with university servers, relocated personal pages, or sending large files via email.

When your project goes north of 1 GB, you can use GitHub's Large File Storage or alternatives: AWS, Google Cloud, mega.nz, or torrents.

How to efficiently manage a statistical analysis project?

Demonstration

Jupyter notebooks combine text, code, and output on the same page. See examples:

QuantEcon』s notebooks.
Repository of data-science-ipython-notebooks. Machine learning applications.
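A notebook is stored as a JSON file, which is why it travels well through Git and GitHub. The sketch below builds a minimal two-cell notebook by hand, assuming the version 4 notebook layout. The helper name make_notebook and the cell contents are invented, and in practice the Jupyter interface writes this file for you.

```python
import json


def make_notebook(markdown_text: str, code_text: str) -> dict:
    """Build a minimal notebook dict with one text cell and one code cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            {
                "cell_type": "code",
                "metadata": {},
                "source": code_text,
                "execution_count": None,  # not run yet
                "outputs": [],            # filled in when the cell is executed
            },
        ],
    }


if __name__ == "__main__":
    notebook = make_notebook("# Results\nMean wage by year.", "print(1 + 1)")
    # Saving this text under a name ending in .ipynb would give a file Jupyter can open.
    print(json.dumps(notebook, indent=1))
```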

Beamer for LaTeX is a standard solution for slides. TikZ for LaTeX draws diagrams and graphics.

Beamer theme gallery
Goulding, "Usepackage{TikZ} for Economists." Or a similar intro by Cremer.
Tantau, TikZ and PGF Manual. [pdf]

Remote Server

Remote servers store large datasets in memory. They do numerical optimization and Monte Carlo simulations. GPU-based servers train artificial neural networks much faster and require less coding. These things save time.
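As a toy illustration of the Monte Carlo workload mentioned above, the sketch below simulates the sampling distribution of a sample mean and compares its spread with the textbook value 1/sqrt(n). The sample size, the number of replications, and the function name are arbitrary choices for this example. Real simulations are far heavier, which is where a remote server helps.

```python
import math
import random
import statistics


def simulate_mean_sd(n_obs: int = 50, n_reps: int = 2000, seed: int = 0) -> float:
    """Draw `n_reps` samples of size `n_obs` from N(0, 1) and
    return the standard deviation of their sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_obs)) / n_obs
        for _ in range(n_reps)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    simulated = simulate_mean_sd()
    theoretical = 1 / math.sqrt(50)
    print(f"simulated: {simulated:.4f}, theoretical: {theoretical:.4f}")
```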

If campus servers have peculiar limitations, third-party companies offer scalable solutions (AWS and Google Cloud). Users pay for storage and processor power, so exploratory analysis goes quickly.

A typical workflow with version control:

Creating a Git repository
Taking a small sample of data
Coding and debugging research on a local computer
Executing an instance on a remote server
Syncing the code between two locations via Git
Running the code on the full sample on the server
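The step "taking a small sample of data" can be sketched as follows. This is one possible approach, not necessarily the author's: reservoir sampling draws a fixed number of rows from a stream of any length without loading the whole dataset into memory. The function name and the toy data are made up.

```python
import random


def reservoir_sample(rows, k: int, seed: int = 0) -> list:
    """Return `k` rows chosen uniformly at random from the iterable `rows`,
    reading it once and keeping only `k` rows in memory."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = row         # replace with probability k / (i + 1)
    return sample


if __name__ == "__main__":
    # Stand-in for the lines of a large file; a real run would pass an open file.
    big_data = (f"row {i}" for i in range(100_000))
    print(reservoir_sample(big_data, k=5))
```

Fixing the seed matters here: the local sample stays the same across runs, so debugging results can be compared before the code moves to the full sample on the server.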

Some services allow writing code in a browser and running it right on their servers.

EC2 AMI for scientific computing in Python and R. Read the last paragraph first.
Amazon, Scientific Computing Using Spot Instances
Google, Datalab

Real-time Applications

Real-time analysis requires optimization for performance. I exemplify with industrial applications:

Jordan, On Computational Thinking, Inferential Thinking and Big Data. A general talk about getting better results faster.
Google, Economics and Electronic Commerce research
Microsoft, Economics and Computation research

The Map

A map for learning new data technologies by Swami Chandrasekaran:

Source
