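The Ultimate Way to Grab Data from a Web Page in Python
From the column: 編程實驗室 (Programming Lab)
Suppose you are searching the web for the raw data a project needs. The bad news is that the data lives inside a web page, and there is no API for getting at it.
So now you have to waste 30 minutes writing a script to fetch the data (and it ends up taking 2 hours). It isn't hard, but it is a waste of time.
The pandas library has a built-in method called read_html() that extracts tabular data from HTML pages: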
import pandas as pd

tables = pd.read_html("https://apps.sandiego.gov/sdfiredispatch/")
print(tables[0])
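It's that simple! Pandas finds every significant HTML table on the page and returns each one as a new DataFrame object, collected in a list (hence tables[0] for the first table).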
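To see the list-of-tables behaviour without hitting the network, here is a minimal sketch that parses an inline HTML string. The HTML snippet and its values are made up for illustration, and the sketch assumes an HTML parser backend such as lxml (or beautifulsoup4 with html5lib) is installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; the contents are invented for illustration.
html = """
<table>
  <tr><th>Unit</th><th>Call Type</th></tr>
  <tr><td>E17</td><td>Medical</td></tr>
  <tr><td>M34</td><td>Medical</td></tr>
</table>
<table>
  <tr><th>Station</th><th>Crew</th></tr>
  <tr><td>17</td><td>4</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html))

print(len(tables))
print(tables[0])
```

Wrapping the string in StringIO hands read_html() a file-like object, which it accepts just like a URL or a file path.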
Next, tell it that row 0 of the table holds the column headers, and ask it to convert the text-based dates into datetime objects:
import pandas as pd

# header=0 uses the first row as column names; parse_dates converts "Call Date" to datetimes.
calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])
print(calls_df)
This gives:
Call Date            Call Type  Street       Cross Streets              Unit
2017-06-02 17:27:58  Medical    HIGHLAND AV  WIGHTMAN ST/UNIVERSITY AV  E17
2017-06-02 17:27:58  Medical    HIGHLAND AV  WIGHTMAN ST/UNIVERSITY AV  M34
2017-06-02 17:23:51  Medical    EMERSON ST   LOCUST ST/EVERGREEN ST     E22
2017-06-02 17:23:51  Medical    EMERSON ST   LOCUST ST/EVERGREEN ST     M47
2017-06-02 17:23:15  Medical    MARAUDER WY  BARON LN/FROBISHER ST      E38
2017-06-02 17:23:15  Medical    MARAUDER WY  BARON LN/FROBISHER ST      M41
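The trailing comma in `calls_df, = ...` is ordinary Python sequence unpacking: it pulls the single DataFrame out of the list that read_html() returns, and it fails loudly if the page holds more or fewer than one table. A small sketch of that behaviour, with plain lists standing in for the returned DataFrames (the string values are placeholders):

```python
# Stand-ins for the list that pd.read_html() returns; the strings are placeholders.
one_table = ["calls table"]
two_tables = ["calls table", "some other table"]

# Unpack exactly one element; equivalent to the `calls_df, = ...` form.
(calls,) = one_table
print(calls)

# With more than one table, the unpacking raises instead of silently picking one.
try:
    (calls_again,) = two_tables
except ValueError as exc:
    error_message = str(exc)
    print("unpacking failed:", error_message)
```

If a page may contain several tables, indexing the list explicitly (for example tables[0]) is the more forgiving choice.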
That took a single line of code. With one more call, the data is available as JSON records:
import pandas as pd

calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])
print(calls_df.to_json(orient="records", date_format="iso"))
Run the code above and you get nice JSON output (with proper ISO 8601 date formatting, no less):
[
  {
    "Call Date": "2017-06-02T17:34:00.000Z",
    "Call Type": "Medical",
    "Street": "ROSECRANS ST",
    "Cross Streets": "HANCOCK ST/ALLEY",
    "Unit": "M21"
  },
  {
    "Call Date": "2017-06-02T17:34:00.000Z",
    "Call Type": "Medical",
    "Street": "ROSECRANS ST",
    "Cross Streets": "HANCOCK ST/ALLEY",
    "Unit": "T20"
  },
  {
    "Call Date": "2017-06-02T17:30:34.000Z",
    "Call Type": "Medical",
    "Street": "SPORTS ARENA BL",
    "Cross Streets": "CAM DEL RIO WEST/EAST DR",
    "Unit": "E20"
  }
  // etc...
]
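The JSON step can be tried offline as well. The sketch below builds a two-row DataFrame by hand, modelled on the dispatch output earlier in the article (the row selection is only for illustration), and parses the result back with the standard library. Whether a trailing "Z" appears on the timestamps may depend on the pandas version, so the example does not rely on it.

```python
import json

import pandas as pd

# Two hand-built rows modelled on the dispatch output, so no network access is needed.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(["2017-06-02 17:27:58", "2017-06-02 17:23:51"]),
        "Call Type": ["Medical", "Medical"],
        "Unit": ["E17", "E22"],
    }
)

# orient="records" emits one JSON object per row; date_format="iso" writes ISO 8601 strings.
payload = calls_df.to_json(orient="records", date_format="iso")

# Parse it back with the standard library to confirm it is valid JSON.
records = json.loads(payload)
print(records[0]["Call Date"])
print(records[1]["Unit"])
```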
You can even save the data to a CSV or XLS file:
import pandas as pd

calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])
# index=False leaves the DataFrame's row index out of the file.
calls_df.to_csv("calls.csv", index=False)
Run it, then double-click calls.csv to open it in a spreadsheet application.
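As a quick check on what ends up in the file, here is a small sketch of a CSV round trip done entirely in memory. The two rows are hand-built for illustration, and the text is read back through StringIO rather than from calls.csv.

```python
from io import StringIO

import pandas as pd

# Hand-built sample rows; the values are for illustration only.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(["2017-06-02 17:27:58", "2017-06-02 17:23:51"]),
        "Call Type": ["Medical", "Medical"],
        "Unit": ["E17", "E22"],
    }
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = calls_df.to_csv(index=False)
print(csv_text)

# Reading it back: parse_dates restores the datetime column.
restored = pd.read_csv(StringIO(csv_text), parse_dates=["Call Date"])
print(restored.dtypes)
```

For Excel output, DataFrame.to_excel() plays the same role, though it needs an extra engine package such as openpyxl to be installed.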
And of course, pandas also makes it simple to filter, sort, or otherwise process the data:
>>> calls_df.describe()
                  Call Date Call Type      Street           Cross Streets Unit
count                    69        69          69                      64   69
unique                   29         2          29                      27   60
top     2017-06-02 16:59:50   Medical  CHANNEL WY  LA SALLE ST/WESTERN ST   E1
freq                      5        66           5                       5    2
first   2017-06-02 16:36:46       NaN         NaN                     NaN  NaN
last    2017-06-02 17:41:30       NaN         NaN                     NaN  NaN

>>> calls_df.groupby("Call Type").count()
                       Call Date  Street  Cross Streets  Unit
Call Type
Medical                       66      66             61    66
Traffic Accident (L1)          3       3              3     3

>>> calls_df["Unit"].unique()
array([E46, MR33, T40, E201, M6, E34, M34, E29, M30, M43, M21, T20, E20,
       M20, E26, M32, SQ55, E1, M26, BLS4, E17, E22, M47, E38, M41, E5,
       M19, E28, M1, E42, M42, E23, MR9, PD, LCCNOT, M52, E45, M12, E40,
       MR40, M45, T1, M23, E14, M2, E39, M25, E8, M17, E4, M22, M37, E7,
       M31, E9, M39, SQ56, E10, M44, M11], dtype=object)
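The session above summarises and groups the data. Filtering and sorting could look like the following sketch, which uses a three-row DataFrame built by hand (the combination of values is invented for illustration, not taken from the live page):

```python
import pandas as pd

# Hand-built sample; the rows are made up for illustration.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            ["2017-06-02 17:27:58", "2017-06-02 17:23:51", "2017-06-02 17:30:34"]
        ),
        "Call Type": ["Medical", "Traffic Accident (L1)", "Medical"],
        "Unit": ["E17", "E22", "E20"],
    }
)

# Filter: keep only the medical calls using a boolean mask.
medical = calls_df[calls_df["Call Type"] == "Medical"]

# Sort: newest call first.
newest_first = calls_df.sort_values("Call Date", ascending=False)

# Group: number of rows per call type.
per_type = calls_df.groupby("Call Type").size()

print(medical)
print(newest_first["Unit"].tolist())
print(per_type)
```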
Original article: https://medium.com/@ageitgey/quick-tip-the-easiest-way-to-grab-data-out-of-a-web-page-in-python-7153cecfca58