How do I write a Python program to scrape the Sina military forum?

12-28

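For a recent lab project I need to scrape the Sina military forum: first bring the data back, then run big-data analysis on it.
I have run into trouble scraping the forum posts. I defined a regular expression, but it only captures the first paragraph of a post. Take this post, for example: [Original] Why did Obama declare that the Diaoyu Islands are covered by the US-Japan Security Treaty?
The code with my regular expression is: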
import re

def getPageContext(page_data):
    # define the regex
    #context_re = r"&(\n*)|(.*)&"
    #context_re = r"&(.*)&"
    #print(type(page_data))
    # remove the nbsp markers, replacing them with a space
    page_data = page_data.replace("nbsp;", " ")
    # strip trailing whitespace
    page_data = page_data.rstrip()
    #page_data = page_data.replace("&", "")
    #page_data = page_data.replace("", "")
    print(page_data)
    context_re = r"&(.*?)&"

    font_begin = page_data.index("&")
    font_end = page_data.index("&")
    print(font_begin)
    print(font_end)

    return re.compile(context_re).findall(page_data)

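But the result I get is:

Clearly this is not what I want.
So my questions are:
(1) How should the regular expression be written here?
(2) Is there a mature framework for scraping forums?
Finally, I am a beginner who has been learning Python for less than a week. Any guidance is much appreciated. Thank you.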

context_re = r"&(.*?)&"

This regular expression of yours is truncated! It stops at the & here, so it can only capture the first paragraph.
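One plausible way this happens can be sketched on a made-up fragment. The `<p>` tags and the `sample` string below are assumptions for illustration, not the forum's real markup. By default `.` does not match a newline, so a paragraph that spans lines is never matched; passing `re.DOTALL` lets the same non-greedy pattern collect every paragraph.

```python
import re

# Hypothetical post body; the real forum markup may differ.
sample = "<p>first paragraph</p>\n<p>second\nparagraph</p>"

pattern = r"<p>(.*?)</p>"

# Without DOTALL, "." stops at newlines, so the multi-line
# second paragraph is never matched.
without_dotall = re.findall(pattern, sample)

# With DOTALL, "." also matches newlines and both paragraphs are found.
with_dotall = re.findall(pattern, sample, re.DOTALL)

print(without_dotall)
print(with_dotall)
```

If the real page wraps paragraphs differently, only the delimiters in `pattern` need to change; the `re.DOTALL` flag is the part that matters.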

Scraping the Sina military forum takes three steps:

1.

Go to Wang Hai's column on CSDN, http://blog.csdn.net/column/details/why-bug.html, and study it.

2.

Press F12 and inspect the front-end markup.

3.

from bs4 import BeautifulSoup
import requests

# hard-coded thread URL
response = requests.get("http://club.mil.news.sina.com.cn/thread-666013-1-1.html?retcode=0")
response.encoding = "gb18030"  # Chinese encoding used by the page
soup = BeautifulSoup(response.text, "html.parser")  # build the BeautifulSoup object
divs = soup("div", "mainbox")  # one div per floor (post)

# open the file once and write inside the loop, so every floor is saved
with open("Sina_Military_Club.txt", "a", encoding="utf-8") as f:
    for div in divs:
        comments = div.find_all("div", "cont f14")  # body text of each floor
        f.write("\n" + str(comments) + "\n")
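The same selection logic can be tried offline before hitting the site. The sketch below runs on a made-up HTML string that only borrows the class names `mainbox` and `cont f14` from the code above; the helper `extract_posts` is a hypothetical name, and the real page structure may differ. It returns plain text via `get_text` instead of writing the raw tag list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used above.
html = """
<div class="mainbox">
  <div class="cont f14">First post</div>
</div>
<div class="mainbox">
  <div class="cont f14">Second <b>post</b></div>
</div>
"""

def extract_posts(soup):
    """Return the plain text of every post body, one string per post."""
    posts = []
    for div in soup.find_all("div", "mainbox"):        # each floor
        for body in div.find_all("div", "cont f14"):   # its body text
            # join the text pieces with a space and trim each piece
            posts.append(body.get_text(" ", strip=True))
    return posts

soup = BeautifulSoup(html, "html.parser")
posts = extract_posts(soup)
print(posts)
```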

Finally, my wish for you: nothing classified goes online, and nothing online is classified. Goodbye!

import requests
from bs4 import BeautifulSoup

r = requests.get("http://club.mil.news.sina.com.cn/thread-666013-1-1.html")
r.encoding = r.apparent_encoding  # let requests guess the page encoding
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find(attrs={"class": "cont f14"})  # first post body on the page
print(result.text)

This requires requests and BeautifulSoup4:

pip install requests
pip install beautifulsoup4

References:
Quickstart, Requests 1.1.0 documentation
Beautiful Soup 4.2.0 documentation

First I scraped the data with BeautifulSoup:

# -*- coding:utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup

url = "http://club.mil.news.sina.com.cn/viewthread.php?tid=666013&extra=page%3D1&page=1"
req = requests.get(url)
req.encoding = req.apparent_encoding  # let requests guess the page encoding
html = req.text
soup = BeautifulSoup(html, "html.parser")

# write each post body under a numbered separator line
with open("sina_club.txt", "w", encoding="utf-8") as file:
    x = 1
    for tag in soup.find_all("div", attrs={"class": "cont f14"}):
        word = tag.get_text()
        line1 = "---------------Comment " + str(x) + "---------------------" + "\n"
        line2 = word + "\n"
        file.write(line1 + line2)
        x += 1

Here is the result:

10 comments in total.
Afterwards I wondered how it would go with regular expressions:

r = "&(.*)?&[ s]+?&

?=s

_/:-;"".,()%""{}]+")

    items_1 = re.findall(pattern_1, item)
    for line in items_1:
        file.write(line + "\n")

file.close()

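Here is the result:

The data we need is in there, but so is some useless extra data. That can be processed further to get what we want, but it is not as simple as the first method.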
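As a rough idea of what that further processing might look like, here is a minimal sketch that cleans one regex match. The helper name `clean_fragment` and the `raw` string are assumptions for illustration; the actual leftovers in the forum output may need different rules.

```python
import html
import re

def clean_fragment(fragment):
    """Strip leftover tags and decode entities from one regex match."""
    text = re.sub(r"<[^>]+>", "", fragment)   # drop any remaining tags
    text = html.unescape(text)                 # decode &nbsp;, &amp;, ...
    text = text.replace("\xa0", " ")           # non-breaking space -> space
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

# Hypothetical match with a stray tag, entities and a line break.
raw = 'Hello&nbsp;<font color="red">world</font> &amp; <br/>\n friends'
cleaned = clean_fragment(raw)
print(cleaned)
```

Stripping tags with a regex is only safe on small fragments like this; for whole pages an HTML parser is the more robust choice.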

Just use BeautifulSoup. That much regex gives me a headache just looking at it.

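As it happens, a few hours ago I was writing a small program to scrape member (company) profiles from a website.
I won't answer the specific programming question, since it has nothing to do with which language you write the code in. The key is to analyze the HTML structure of the page carefully and write a suitable regular expression to match it. To keep things simple, you can match in stages (for example, first get the content of the first & inside &, which is the address of the original post, and then process it further).
As for big-data analysis, I don't know how to do that; I'd welcome your advice.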
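The staged matching described above can be sketched as follows. The `page` string, the `post` class name and the link layout are all made up for illustration, since the real tag names are not shown in this thread.

```python
import re

# Hypothetical page fragment; tag and class names are assumptions.
page = """
<div class="post"><a href="/thread-1.html">Title one</a><span>x</span></div>
<div class="post"><a href="/thread-2.html">Title two</a></div>
"""

# Stage 1: isolate each outer block.
blocks = re.findall(r'<div class="post">(.*?)</div>', page, re.DOTALL)

# Stage 2: inside each block, pull out the address of the first link.
links = []
for block in blocks:
    match = re.search(r'<a href="([^"]+)"', block)
    if match:
        links.append(match.group(1))

print(links)
```

Each stage uses a short pattern that is easy to check on its own, which is the main benefit over one long expression.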

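Sigh, scrape away if you must. Once you publish the paper, could you tell me the journal issue and page numbers so I can have a look? We haven't even done big-data analysis ourselves...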

You need pyquery, which lets you use jQuery-style syntax. You deserve it.
https://pythonhosted.org/pyquery/