Github 上那些頂尖的貢獻組織 - 2017
簡評:讓我們來看一下在 2017 年 Github 上哪些項目最熱門,哪一家公司貢獻得最多?
這裡作者研究了 2017 年截止今天 Github 上的所有 PushEvents,對於每個用戶都盡量分辨了其屬於哪個組織,並且以下列出的倉庫都是在 2017 年增長超過 20 個 star 的(比如,Apache 目前在 Github 有超過 1500 個倉庫,但只有 205 個在今年獲得了 20 個以上的 star。因此,這裡也就只分析了這 205 個倉庫的提交情況)。
更詳細的內容,大家可以直接看這個可交互的版本:The top contributors to GitHub (2017)。
怎麼做到的?
1. 數據來源:GitHub Archive。
2. 怎麼判斷 Github 用戶屬於哪個公司?
這裡主要通過用戶提交記錄中的郵件地址來進行判斷。當然不是每個人都會用自己組織的郵箱地址,有很多人在 Github 上都選擇如 gmail.com, users.noreply.github.com, 或其他的郵箱,這種情況就沒有辦法了。
3. 所用工具:Google Big Query。
4. 具體的 SQL 代碼:
#standardSQLWITHperiod AS ( SELECT * FROM `githubarchive.month.2017*` a),repo_stars AS ( SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name FROM period WHERE type="WatchEvent" GROUP BY 1 HAVING stars>20), pushers_guess_emails_and_top_projects AS ( SELECT *, REGEXP_EXTRACT(email, r"@(.*)") domain FROM ( SELECT actor.id , APPROX_TOP_COUNT(actor.login,1)[OFFSET(0)].value login , APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, "$.commits[0].author.email"),1)[OFFSET(0)].value email , COUNT(*) c , ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name,stars))) repos FROM period a JOIN repo_stars b ON a.repo.id=b.id WHERE type="PushEvent" GROUP BY 1 HAVING c>3 ))SELECT * FROM ( SELECT domain , githubers , (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to , ARRAY( SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, "$.repo_name") repo_name , CAST(JSON_EXTRACT_SCALAR(repo, "$.stars") AS INT64) stars , COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY 1, 2 HAVING githubers_from_domain>1 ORDER BY stars DESC LIMIT 3 ) top , (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, "$.stars") AS INT64)) FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to FROM ( SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos FROM pushers_guess_emails_and_top_projects #WHERE domain IN UNNEST(SPLIT("google.com|microsoft.com|amazon.com", "|")) WHERE domain NOT IN UNNEST(SPLIT("gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru", "|")) # email hosters GROUP BY 1 HAVING githubers > 30 ) WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY repo))>4 # second filter email hosters)ORDER BY githubers DESC
原文:The top contributors to GitHub?—?2017
日報擴展閱讀:
- Github 將對你的軟體職業生涯產生什麼影響?
推薦閱讀: