[Hands-on] Can't find the right programming book? I built a search site for popular programming books myself

01-28

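Original author: Vlad Wetzel

Compiled and translated by the CDA translation team

This article is an original work by CDA Data Analyst; reprinting requires permission.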

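A programmer in the United States took all the programming books recommended on Stack Overflow, the well-known programming Q&A site, and built his own website for searching popular programming books.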

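Choosing the right programming book is never easy.

As a developer, your time is limited, and reading a book takes a lot of it. You could spend that time writing code, resting, or doing many other things. Instead, you are spending it on reading and improving your skills.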

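So which books should you read? My colleagues and I often discuss this, but I noticed that our opinions of any given book differ widely.

So I decided to dig deeper into the question: how do you choose the right programming book?

I turned to Stack Overflow (the well-known programming Q&A site), where many experienced developers recommend books. I planned to analyze the Stack Overflow data about programming books and find out which books are recommended most often.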

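Fortunately, Stack Exchange (Stack Overflow's parent company) had recently published its data dump. Using it, I built http://dev-books.com, where you can search by keyword and find the programming books most recommended on Stack Overflow. The site now has more than 100,000 users.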

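Overall, if you are curious, I recommend reading "Working Effectively with Legacy Code", and "Design Patterns: Elements of Reusable Object-Oriented Software" is also a good choice. The titles may sound dry, but the content is substantial. You can sort books by tag (JavaScript, C, graphics, and so on). This is obviously not an exhaustive list of recommendations, but if you are just getting started with programming or want to broaden your knowledge, these two books are a good place to start.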

Below I describe how the site was built.

Getting and importing the data

I got the Stack Exchange database dump from http://archive.org.

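From the start, I realized it would not be possible to import the 48GB XML file into a freshly created PostgreSQL database with a common approach such as myxml := pg_read_file('path/to/my_file.xml'), because my server did not have 48GB of RAM. So I decided to use a SAX parser, which streams the file instead of loading it whole.

All the values are stored as attributes of <row> tags, so I used a Python script to parse them: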

    def startElement(self, name, attributes):
        # Every post is one <row> element whose fields are XML attributes.
        # attributes.get() returns None for a missing attribute, which is
        # inserted as NULL.
        if name == 'row':
            self.cur.execute(
                "INSERT INTO posts (Id, Post_Type_Id, Parent_Id, Accepted_Answer_Id, "
                "Creation_Date, Score, View_Count, Body, Owner_User_Id, "
                "Last_Editor_User_Id, Last_Editor_Display_Name, Last_Edit_Date, "
                "Last_Activity_Date, Community_Owned_Date, Closed_Date, Title, Tags, "
                "Answer_Count, Comment_Count, Favorite_Count) "
                "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                (
                    attributes.get('Id'),
                    attributes.get('PostTypeId'),
                    attributes.get('ParentID'),  # misspelled: the dump uses 'ParentId'
                    attributes.get('AcceptedAnswerId'),
                    attributes.get('CreationDate'),
                    attributes.get('Score'),
                    attributes.get('ViewCount'),
                    attributes.get('Body'),
                    attributes.get('OwnerUserId'),
                    attributes.get('LastEditorUserId'),
                    attributes.get('LastEditorDisplayName'),
                    attributes.get('LastEditDate'),
                    attributes.get('LastActivityDate'),
                    attributes.get('CommunityOwnedDate'),
                    attributes.get('ClosedDate'),
                    attributes.get('Title'),
                    attributes.get('Tags'),
                    attributes.get('AnswerCount'),
                    attributes.get('CommentCount'),
                    attributes.get('FavoriteCount'),
                )
            )

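After almost three days of importing (by which time nearly half of the XML had been loaded), I realized I had made a mistake: the ParentID attribute should have been ParentId.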
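A mistake like this can be caught before a multi-day import by streaming a small sample through the same kind of SAX handler and comparing the attribute names that actually occur with the names the import code looks up. The following is only a minimal sketch under assumptions: the sample XML, the EXPECTED set, and the AttributeAudit class are hypothetical and not part of the original script.

```python
import io
import xml.sax

# Attribute names the import code looks up (hypothetical subset,
# including the misspelled 'ParentID').
EXPECTED = {'Id', 'PostTypeId', 'ParentID', 'Score'}


class AttributeAudit(xml.sax.ContentHandler):
    """Collect every attribute name that appears on <row> elements."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def startElement(self, name, attributes):
        if name == 'row':
            self.seen.update(attributes.getNames())


# Tiny stand-in for the real dump, which has one <row> per post.
SAMPLE = b"""<posts>
  <row Id="1" PostTypeId="1" Score="10" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" />
</posts>"""

audit = AttributeAudit()
# xml.sax.parse streams its input, so memory use does not grow with file size.
xml.sax.parse(io.BytesIO(SAMPLE), audit)

# Names the code expects but the data never contains are likely typos.
never_seen = sorted(EXPECTED - audit.seen)
print(never_seen)
```

Attribute lookups are case-sensitive, so 'ParentID' silently yields NULL for every row instead of raising an error. That is why the problem only shows up when the data is inspected.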

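I did not want to waste another week, so I moved from an AMD E-350 (2 x 1.35GHz) to an Intel G2020 (2 x 2.90GHz). But this still did not speed up the process.

My next decision was to insert in batches: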

    import xml.sax
    from io import StringIO

    class docHandler(xml.sax.ContentHandler):
        def __init__(self, cursor):
            super().__init__()
            self.cursor = cursor
            self.queue = 0
            self.output = StringIO()

        def startElement(self, name, attributes):
            if name == 'row':
                # One line per post in COPY text format: fields are separated
                # by tabs, \N stands for NULL, and special characters in the
                # text fields are backslash-escaped.
                self.output.write(
                    attributes['Id'] + '\t' +
                    (attributes['PostTypeId'] if 'PostTypeId' in attributes else '\\N') + '\t' +
                    (attributes['ParentId'] if 'ParentId' in attributes else '\\N') + '\t' +
                    (attributes['AcceptedAnswerId'] if 'AcceptedAnswerId' in attributes else '\\N') + '\t' +
                    (attributes['CreationDate'] if 'CreationDate' in attributes else '\\N') + '\t' +
                    (attributes['Score'] if 'Score' in attributes else '\\N') + '\t' +
                    (attributes['ViewCount'] if 'ViewCount' in attributes else '\\N') + '\t' +
                    (attributes['Body'].replace('\\', '\\\\').replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t') if 'Body' in attributes else '\\N') + '\t' +
                    (attributes['OwnerUserId'] if 'OwnerUserId' in attributes else '\\N') + '\t' +
                    (attributes['LastEditorUserId'] if 'LastEditorUserId' in attributes else '\\N') + '\t' +
                    (attributes['LastEditorDisplayName'].replace('\n', '\\n') if 'LastEditorDisplayName' in attributes else '\\N') + '\t' +
                    (attributes['LastEditDate'] if 'LastEditDate' in attributes else '\\N') + '\t' +
                    (attributes['LastActivityDate'] if 'LastActivityDate' in attributes else '\\N') + '\t' +
                    (attributes['CommunityOwnedDate'] if 'CommunityOwnedDate' in attributes else '\\N') + '\t' +
                    (attributes['ClosedDate'] if 'ClosedDate' in attributes else '\\N') + '\t' +
                    (attributes['Title'].replace('\\', '\\\\').replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t') if 'Title' in attributes else '\\N') + '\t' +
                    (attributes['Tags'].replace('\n', '\\n') if 'Tags' in attributes else '\\N') + '\t' +
                    (attributes['AnswerCount'] if 'AnswerCount' in attributes else '\\N') + '\t' +
                    (attributes['CommentCount'] if 'CommentCount' in attributes else '\\N') + '\t' +
                    (attributes['FavoriteCount'] if 'FavoriteCount' in attributes else '\\N') + '\n'
                )
                self.queue += 1
                # Send the buffered rows to PostgreSQL every 100,000 posts.
                if self.queue >= 100000:
                    self.queue = 0
                    self.flush()

        def flush(self):
            # Rewind the buffer, bulk-load it with COPY, then start a new one.
            # flush() also has to be called once after parsing ends,
            # otherwise the last partial batch is never written.
            self.output.seek(0)
            self.cursor.copy_from(self.output, 'posts')
            self.output.close()
            self.output = StringIO()

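StringIO gives you a file-like object in memory, which is what copy_from needs in order to load the rows with COPY. This way, the whole import took just one night.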
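The escaping matters because COPY's text format uses a tab as the column separator, a newline as the row separator, and \N for NULL, so those characters have to be escaped inside values. Here is a minimal sketch of the row formatting on its own. The helper names copy_escape and copy_line are hypothetical, and the sketch only fills the buffer without touching a database.

```python
from io import StringIO


def copy_escape(value):
    """Escape one value for PostgreSQL's COPY text format; None becomes \\N."""
    if value is None:
        return '\\N'
    return (value.replace('\\', '\\\\')  # backslash first, so later escapes survive
                 .replace('\n', '\\n')
                 .replace('\r', '\\r')
                 .replace('\t', '\\t'))


def copy_line(values):
    """Join escaped values into one tab-separated, newline-terminated row."""
    return '\t'.join(copy_escape(v) for v in values) + '\n'


buffer = StringIO()
buffer.write(copy_line(['1', None, 'a\tb']))
buffer.write(copy_line(['2', '1', 'line one\nline two']))
buffer.seek(0)  # rewind so a reader starts from the first row
rows = buffer.read().split('\n')
print(rows)
```

With psycopg2, a buffer like this can be handed to cursor.copy_from(buffer, 'posts') after the seek(0); its default separator (tab) and NULL marker (\N) match this format.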

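Next came creating the indexes. In theory, GiST indexes are slower than GIN but take up less space, so I decided to use GiST. A day later I had an index that took up 70GB.

When I tried a few queries, I found that they took a very long time to process. The cause was disk IO wait. A GOODRAM C40 120GB SSD helped a lot, even though it is not the fastest SSD available.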

I created a brand new PostgreSQL cluster on the SSD:

    initdb -D /media/ssd/postgres/data

Then I changed the paths in the service configuration (I was using Manjaro):

    vim /usr/lib/systemd/system/postgresql.service

    Environment=PGROOT=/media/ssd/postgres
    PIDFile=/media/ssd/postgres/data/postmaster.pid

Then I reloaded the configuration and started PostgreSQL:

    systemctl daemon-reload
    systemctl start postgresql

This time I used GIN, and the import took only a few hours. The index took up 20GB on the SSD, and queries took less than a minute.
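The article does not show the index definitions. Purely as an illustration, the sketch below generates the DDL for both index types, assuming a trigram index from the pg_trgm extension on the body column of posts. That choice is an assumption: pg_trgm is one way to speed up LIKE '%...%' searches and it offers operator classes for both GIN and GiST. The function name and index naming scheme are hypothetical.

```python
def trigram_index_ddl(method, table='posts', column='body'):
    """Return CREATE INDEX DDL for a pg_trgm index using 'gin' or 'gist'."""
    if method not in ('gin', 'gist'):
        raise ValueError('method must be "gin" or "gist"')
    return (
        f'CREATE INDEX {table}_{column}_{method}_idx '
        f'ON {table} USING {method} ({column} {method}_trgm_ops);'
    )


statements = [
    'CREATE EXTENSION IF NOT EXISTS pg_trgm;',  # needed once per database
    trigram_index_ddl('gin'),
]
for statement in statements:
    print(statement)
```

Switching between the two index types is then a one-word change, which makes it easy to compare build time and size on a sample of the data before committing to a full build.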

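Extracting book information from the database

With the data finally imported, I started looking for posts that mentioned books, then copied them into a separate table using SQL:

    CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE '%book%';

The next step was to find all the posts that contained hyperlinks:

    CREATE TABLE http_books AS SELECT * FROM posts WHERE body LIKE '%http%';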

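At this point I discovered that Stack Overflow proxies all such links through its own domain, like this:

    http://rads.stackoverflow.com/[$isbn]/

I created another table containing all the posts with these links:

    CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE '%http://rads.stackoverflow.com%';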

Then I used regular expressions to extract all the ISBNs. I extracted the Stack Overflow tags into another table using regexp_split_to_table.
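The same extraction can be sketched outside the database. The example below is a rough Python equivalent under several assumptions that the article does not confirm: the ISBN is taken to be the first 10-character ISBN-like run in the path of a rads.stackoverflow.com link, tags are assumed to be stored as "<tag1><tag2>", and each post body is assumed to be already paired with the tags of its question. The sample posts and all names are hypothetical.

```python
import re
from collections import Counter

# Assumed link shape: the first ISBN-10-like run in the URL path is the ISBN.
ISBN_LINK = re.compile(r'https?://rads\.stackoverflow\.com/[^\s"<>]*?(\d{9}[\dXx])\b')
# Assumed tag format: "<python><django>".
TAG = re.compile(r'<([^<>]+)>')


def extract_isbns(body):
    """Return every ISBN found in proxied book links, in order of appearance."""
    return ISBN_LINK.findall(body)


def split_tags(tags):
    """Split a "<a><b>" tag string into ['a', 'b']; None gives an empty list."""
    return TAG.findall(tags or '')


# Hypothetical (body, tags) pairs standing in for rows of rads_posts.
posts = [
    ('<a href="http://rads.stackoverflow.com/amzn/click/0131177052">book</a>',
     '<c++><legacy>'),
    ('See http://rads.stackoverflow.com/amzn/click/0201633612 and '
     'http://rads.stackoverflow.com/amzn/click/0131177052',
     '<c++>'),
]

# Count in how many posts each ISBN is mentioned, per tag.
counts = Counter()
for body, tags in posts:
    for tag in split_tags(tags):
        for isbn in set(extract_isbns(body)):
            counts[(tag, isbn)] += 1

print(counts[('c++', '0131177052')])
```

Using set() counts each book once per post, so a post that links the same book twice does not inflate its popularity.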

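Once the popular tags had been extracted and counted, I could get the 20 most recommended books (the list is attached at the end of the article).

Next step: refining the tags.

The idea was to take the top 20 books from each tag and exclude the books that had already been processed.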
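The selection rule can be sketched in a few lines of Python before looking at how it was done in SQL. This is only an illustration under assumptions: the counts are held in a plain dictionary, tags are processed from most to least popular, and the function name and sample data are hypothetical.

```python
def top_books_per_tag(counts, tag_order, limit=20):
    """Pick the top `limit` books per tag, dropping books seen under earlier tags.

    counts:    {tag: {isbn: number_of_posts}}
    tag_order: tags sorted from most to least popular
    """
    seen = set()
    result = {}
    for tag in tag_order:
        # Sort by mention count (descending), then by ISBN for a stable order.
        ranked = sorted(counts.get(tag, {}).items(), key=lambda kv: (-kv[1], kv[0]))
        top = [isbn for isbn, _ in ranked[:limit]]
        result[tag] = [isbn for isbn in top if isbn not in seen]
        seen.update(top)
    return result


sample = {
    'java': {'A': 5, 'B': 3, 'C': 1},
    'c#': {'A': 4, 'D': 3, 'B': 2},
}
picked = top_books_per_tag(sample, ['java', 'c#'], limit=2)
print(picked)
```

With limit=2, book A is in the top two for both tags, so it is listed only under the more popular tag, and the second tag is left with D.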

Since it was a one-off job, I decided to use PostgreSQL arrays. I wrote a script to build a query like this:

    SELECT *
        -- Books left for each tag after removing the ones already processed
        , ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude))
        , ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1)
    FROM (
        SELECT *
            , ARRAY['isbn1', 'isbn2', 'isbn3'] AS to_exclude
        FROM (
            SELECT
                  tag
                , ARRAY_AGG(DISTINCT isbn) AS isbns
                , COUNT(DISTINCT isbn)
            FROM (
                -- One subquery per tag: the top 20 books of the Nth most popular tag
                SELECT *
                FROM (
                    SELECT
                          it.*
                        , t.popularity
                    FROM isbn_tags AS it
                    LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                    LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                    WHERE it.tag IN (
                        SELECT tag
                        FROM tags
                        ORDER BY popularity DESC
                        LIMIT 1 OFFSET 0
                    )
                    ORDER BY post_count DESC
                    LIMIT 20
                ) AS t1
                UNION ALL
                SELECT *
                FROM (
                    SELECT
                          it.*
                        , t.popularity
                    FROM isbn_tags AS it
                    LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                    LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                    WHERE it.tag IN (
                        SELECT tag
                        FROM tags
                        ORDER BY popularity DESC
                        LIMIT 1 OFFSET 1
                    )
                    ORDER BY post_count DESC
                    LIMIT 20
                ) AS t2
                UNION ALL
                SELECT *
                FROM (
                    SELECT
                          it.*
                        , t.popularity
                    FROM isbn_tags AS it
                    LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                    LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                    WHERE it.tag IN (
                        SELECT tag
                        FROM tags
                        ORDER BY popularity DESC
                        LIMIT 1 OFFSET 2
                    )
                    ORDER BY post_count DESC
                    LIMIT 20
                ) AS t3
                ...
                UNION ALL
                SELECT *
                FROM (
                    SELECT
                          it.*
                        , t.popularity
                    FROM isbn_tags AS it
                    LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                    LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                    WHERE it.tag IN (
                        SELECT tag
                        FROM tags
                        ORDER BY popularity DESC
                        LIMIT 1 OFFSET 78
                    )
                    ORDER BY post_count DESC
                    LIMIT 20
                ) AS t79
            ) AS tt
            GROUP BY tag
            ORDER BY MAX(popularity) DESC
        ) AS ttt
    ) AS tttt
    -- Tags with the most remaining books come first
    ORDER BY ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1) DESC;
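The article mentions a script that produced this query but does not show it. A hypothetical sketch of how the repeated part, one subquery per tag joined with UNION ALL, could be generated in Python is given below. The function names are assumptions, and the table and column names are simply the ones used in the query above.

```python
def tag_subquery(offset, limit=20):
    """SQL for the top `limit` books of the tag at rank `offset` (0-based)."""
    return f"""SELECT *
FROM (
    SELECT it.*, t.popularity
    FROM isbn_tags AS it
    LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
    LEFT OUTER JOIN tags AS t ON t.tag = it.tag
    WHERE it.tag IN (
        SELECT tag FROM tags ORDER BY popularity DESC LIMIT 1 OFFSET {offset}
    )
    ORDER BY post_count DESC
    LIMIT {limit}
) AS t{offset + 1}"""


def build_union(tag_count, limit=20):
    """Join one subquery per tag with UNION ALL."""
    return '\nUNION ALL\n'.join(tag_subquery(n, limit) for n in range(tag_count))


inner = build_union(79)
print(inner.count('UNION ALL'))
```

The offsets and the integer limit are generated by the script itself, so plain string formatting is acceptable here. Values that come from users should be passed as query parameters instead.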

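With this data in hand, I started building the website.

Building the web app

Since I am neither a web developer nor a web UI expert, I decided to build a very simple single-page app based on the default Bootstrap theme.

I created a "search by tag" option, then extracted the most popular tags so that each one can be clicked to run a search.

I visualized the search results with a bar chart. I tried Highcharts and D3, but they are better suited to dashboards. They also had some responsiveness problems and were quite complex to configure. So I built my own responsive chart based on SVG. To keep it responsive, it has to be redrawn when the screen orientation changes: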

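    var w = $('#plot').width();
    var bars = "";
    var imgs = "";
    var texts = "";
    var rx = 10;  // x position of the next bar and cover image
    var tx = 25;  // x position of the next label
    // Each bar takes 60px, so this is how many books fit on the screen.
    var max = Math.floor(w / 60);
    var maxPop = 0;
    var i, obj, h, dt;

    // First pass: find the highest popularity among the visible books.
    for (i = 0; i < max; i++) {
      if (i > books.length - 1) {
        break;
      }
      obj = books[i];
      if (maxPop < Number(obj.pop)) {
        maxPop = Number(obj.pop);
      }
    }

    // Second pass: build the SVG markup for the bars, covers, and labels.
    for (i = 0; i < max; i++) {
      if (i > books.length - 1) {
        break;
      }
      obj = books[i];
      // Scale so that the most popular book gets a 180px bar.
      h = Math.floor((180 / maxPop) * obj.pop);
      // Nudge one-digit and three-digit labels so they stay centered.
      dt = 0;
      if (String(obj.pop).length == 1) {
        dt = 5;
      }
      if (String(obj.pop).length == 3) {
        dt = -3;
      }
      var scrollTo = 'onclick="scrollTo(' + obj.id + '); return false;"';
      // class "cla" picks up the fill colors from the <style> block below.
      bars += '<rect id="rect' + obj.id + '" class="cla" x="' + rx + '" y="' + (180 - h + 30) + '" width="50" height="' + h + '" ' + scrollTo + '>';
      bars += '<title>' + obj.name + '</title>';
      bars += '</rect>';
      imgs += '<image height="70" x="' + rx + '" y="220" href="img/ol/jpeg/' + obj.id + '.jpeg" onmouseout="unhoverbar(' + obj.id + ');" onmouseover="hoverbar(' + obj.id + ');" width="50" ' + scrollTo + '>';
      imgs += '<title>' + obj.name + '</title>';
      imgs += '</image>';
      texts += '<text x="' + (tx + dt) + '" y="' + (180 - h + 20) + '" class="bar-label" style="font-size: 16px;" ' + scrollTo + '>' + obj.pop + '</text>';
      rx += 60;
      tx += 60;
    }

    $('#plot').html(
        // The opening <svg> tag is reconstructed; its height of 300 is an
        // assumption that fits the content, which extends to y = 290.
        '<svg width="' + w + '" height="300">'
      + '<defs>'
      + '<style type="text/css"><![CDATA['
      + '.cla {'
      + 'fill: #337ab7;'
      + '}'
      + '.cla:hover {'
      + 'fill: #5bc0de;'
      + '}'
      + ']]></style>'
      + '</defs>'
      + '<g>'
      + bars
      + '</g>'
      + '<g>'
      + imgs
      + '</g>'
      + '<g>'
      + texts
      + '</g>'
      + '</svg>');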
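The layout arithmetic in this snippet is easier to check in isolation. The sketch below restates it in Python, used here only to keep the added examples in one language. The function name, parameter names, and return shape are hypothetical, while the constants (60px per slot, 180px maximum bar height, 30px top offset, first bar at x = 10) are the ones from the snippet.

```python
import math


def layout_bars(plot_width, pops, slot=60, max_height=180, top=30, first_x=10):
    """Return (x, y, height) for each bar that fits into plot_width pixels."""
    # Only as many books as there are 60px slots are drawn.
    visible = pops[:math.floor(plot_width / slot)]
    if not visible or max(visible) <= 0:
        return []
    max_pop = max(visible)
    bars = []
    for index, pop in enumerate(visible):
        # The most popular visible book gets the full height; others scale down.
        height = math.floor((max_height / max_pop) * pop)
        # SVG y grows downward, so taller bars start higher up.
        bars.append((first_x + index * slot, max_height - height + top, height))
    return bars


bars = layout_bars(200, [90, 45, 30, 10])
print(bars)
```

A 200px-wide plot has room for three slots, so the fourth book is dropped. Recomputing the layout with the new width after an orientation change is what makes the chart responsive.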

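Web server failure

Right after I published http://dev-books.com, a lot of users visited the site. Apache could not serve more than 500 visitors at the same time, so I quickly switched to Nginx. I was really surprised when the number of real-time visitors reached 800.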