Using a Python crawler with MySQL for big data storage

02-04

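I have wanted to try big data storage for a long time but never had the chance. Today I suddenly had the idea of collecting pages with a Python crawler and then writing the results to a database with the MySQLdb module. Having all this data to work with later would be wonderful.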

#!/usr/bin/python
# -*- coding:utf-8 -*-
# Python 2 script: urllib2, commands and MySQLdb are Python 2 modules.
import urllib2
import MySQLdb as mdb
import re
import commands

HTML_FILE = "/tmp/temphtmlfile.txt"   # the most recently downloaded page
QUEUE_FILE = "/tmp/urllist.txt"       # URLs waiting to be crawled; seed it with a start URL

def urlfindall():
    # Collect every http://a.b.c style URL from the saved page and
    # append the unique ones to the queue file.
    ZZ = r"http://[a-zA-Z0-9]{1,20}\.[a-zA-Z0-9]{1,20}\.[a-zA-Z0-9]{1,20}"
    prots = re.compile(ZZ)
    fileurllist = []
    file = open(HTML_FILE, "r")
    for htmldate in file:   # read to end of file; a blank line no longer stops the scan
        fileurllist += prots.findall(htmldate)
    file.close()
    filewriteurl = open(QUEUE_FILE, "a")
    for url in set(fileurllist):
        filewriteurl.write(url + "\n")
    filewriteurl.close()

def get_file_url():
    conn = mdb.connect(host="127.0.0.1", port=3306, user="root", passwd="guanji", db="database_site", charset="utf8")
    cursor = conn.cursor()
    while 1:
        # Take the first URL in the queue; stop when the queue is empty.
        file = open(QUEUE_FILE, "r")
        domain = file.readline().strip("\n")
        file.close()
        if not domain:
            break
        cursor.execute("select url from site where url = %s;", [domain])
        if cursor.fetchone() is None:
            # Not crawled yet: fetch it, harvest its links, then record it.
            get_url(domain)
            cursor.execute("insert into site(url) VALUES(%s);", [domain])
            conn.commit()
        # Drop the processed first line from the queue.
        commands.getoutput("sed -i 1d " + QUEUE_FILE)
    cursor.close()
    conn.close()

def get_url(url):
    headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"}
    req = urllib2.Request(url, data=None, headers=headers)
    try:
        date = urllib2.urlopen(req).read()
    except urllib2.URLError as e:
        # Unreachable or blocked site: skip it instead of aborting the crawl.
        print "skip %s: %s" % (url, e)
        return
    file = open(HTML_FILE, "w")
    file.write(date)
    file.close()
    urlfindall()

get_file_url()
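The two core steps of the script, pulling URLs out of a page and skipping the ones already stored, can be tried without a MySQL server. The sketch below is only an illustration under some assumptions: the function names `extract_urls` and `store_new_urls` are made up here, and the standard-library `sqlite3` module stands in for MySQL so the example runs anywhere. It keeps the `site(url)` table from the script but lets a primary key do the duplicate check instead of a separate `select`.

```python
import re
import sqlite3

# Same three-label shape as the pattern in the script, with the dots escaped
# so that "." only matches a literal dot.
URL_PATTERN = re.compile(
    r"http://[a-zA-Z0-9]{1,20}\.[a-zA-Z0-9]{1,20}\.[a-zA-Z0-9]{1,20}"
)


def extract_urls(html):
    """Return the unique http://a.b.c style URLs found in html, sorted."""
    return sorted(set(URL_PATTERN.findall(html)))


def store_new_urls(conn, urls):
    """Insert URLs that are not stored yet and return the ones that were added.

    Assumption: sqlite3 is used here in place of MySQL. The table name and
    column follow the script (site, url); the PRIMARY KEY is an addition.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS site (url TEXT PRIMARY KEY)")
    added = []
    for url in urls:
        # INSERT OR IGNORE skips rows that would violate the primary key,
        # so rowcount is 1 for a new URL and 0 for a duplicate.
        cur = conn.execute("INSERT OR IGNORE INTO site (url) VALUES (?)", (url,))
        if cur.rowcount == 1:
            added.append(url)
    conn.commit()
    return added


if __name__ == "__main__":
    page = '<a href="http://www.example.com/a">a</a> http://blog.example.org'
    db = sqlite3.connect(":memory:")
    print(store_new_urls(db, extract_urls(page)))
```

With MySQL, the rough equivalent would be a unique index on `url` together with `INSERT IGNORE`, which saves one round trip per URL compared with select-then-insert.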

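There are still things I don't understand, though: I found that some websites block crawlers. Looks like I need to find some books to read.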
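One common way a site states its crawler policy is a robots.txt file, and a crawler can check it before fetching a page. This is only a sketch of that one mechanism and does not cover other kinds of blocking such as rate limits or User-Agent filtering. The helper name `is_allowed` and the sample rules are made up for illustration; the parser itself is from the Python standard library (`urllib.robotparser` in Python 3, `robotparser` in Python 2).

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url.

    robots_txt is the text of the site's robots.txt file, which the caller
    is assumed to have downloaded already.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules: everything under /private/ is off limits.
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "mycrawler", "http://www.example.com/index.html"))
    print(is_allowed(rules, "mycrawler", "http://www.example.com/private/data"))
```

A check like this could run just before the page download, so disallowed URLs are dropped from the queue without being requested.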

OK, that's all for today.

Time: Wed Aug 23 00:25:17 CST 2017