Implementing a Multithreaded HTTP Downloader in Python

04-21

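Hi everyone, I'm oYabea. Today I'll walk through writing a multithreaded HTTP downloader in Python. To make it easier to use, I also package it as an executable (*.exe) with the py2exe module.

Environment: Windows/Linux + Python 2.7.x

Before getting to multithreading, let's start with the single-threaded version. (This article explains things mainly through code.)

The code for this project is on my GitHub.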

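Single-threaded

The single-threaded version works in these steps:

Parse the URL;
Connect to the web server;
Build the HTTP request packet;
Download the file.

Parsing the URL

The URL entered by the user is parsed to obtain the path (path), the host name (host), the port number (port), and the file name (filename).

A few notes:

If the parsed path is empty, it is set to /;
If the port number is empty, it is set to 80;
The name of the downloaded file can be changed if the user wishes (enter y to change it; any other input keeps it).

A few of the parsing functions are listed below. For more code see: analysisUrl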

    import urllib

    def analyHostAndPath(totalUrl):
        # Strip the scheme, then split "host[:port]" from the path
        protocol, s1 = urllib.splittype(totalUrl)
        host, path = urllib.splithost(s1)
        if path == '':
            path = '/'
        return host, path

    def analysisPort(host):
        host, port = urllib.splitport(host)
        if port is None:
            return 80
        # splitport returns the port as a string; socket.connect needs an int
        return int(port)

    def analysisFilename(path):
        filename = path.split('/')[-1]
        if '.' not in filename:
            return None
        return filename
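For readers on Python 3, the same parsing rules can be sketched with urllib.parse.urlsplit. This is only an illustration, not the project's code: the function name parse_url and the example URLs are made up here, and it assumes a plain http:// URL.

```python
from urllib.parse import urlsplit

def parse_url(total_url):
    """Return (host, port, path, filename) for an http:// URL."""
    parts = urlsplit(total_url)
    host = parts.hostname        # host name without the ":port" suffix
    port = parts.port or 80      # default to 80 when no port is given
    path = parts.path or "/"     # default to "/" when the path is empty
    filename = path.split("/")[-1]
    if "." not in filename:      # same rule as analysisFilename
        filename = None
    return host, port, path, filename

print(parse_url("http://example.com:8080/img/logo.png"))
# ('example.com', 8080, '/img/logo.png', 'logo.png')
print(parse_url("http://example.com"))
# ('example.com', 80, '/', None)
```

Note that urlsplit keeps the query string separately from the path, so a real downloader would probably want to append it back to the path before building the request.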

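Connecting to the web server

The URL has been parsed, so what is the parsed data used for?

This is where it comes in. Using the socket module, connect to the web server with the host and port obtained from the URL: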

    import socket
    from analysisUrl import port, host

    # Resolve the host name, then open a TCP connection to the server
    ip = socket.gethostbyname(host)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((ip, port))
    print "success connected webServer"
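As a side note, the resolve-then-connect steps can also be done in one call with socket.create_connection. The sketch below is written for Python 3; the helper name connect and the 10-second timeout are assumptions for illustration, not part of the project.

```python
import socket

def connect(host, port, timeout=10.0):
    """Open a TCP connection to (host, port) and return the socket."""
    # create_connection resolves the host name and connects in one call;
    # the timeout keeps the call from hanging on an unreachable server.
    return socket.create_connection((host, port), timeout=timeout)
```

A call such as connect("example.com", 80) would return a socket that can be used with send and recv, just like s in the code above.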

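Building the HTTP request packet

With the connection to the web server established, the next step is to send it a request. Before sending, the HTTP request packet has to be built. How?

Once again the parsed URL comes in handy: following the format of an HTTP request, build the packet from the parsed path, host, and port: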

    from analysisUrl import path, host, port

    # Request line, Host header, then a blank line that ends the header
    packet = 'GET ' + path + ' HTTP/1.1\r\nHost: ' + host + '\r\n\r\n'
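To make the packet layout easier to see, here is a small Python 3 sketch that assembles the request line by line. The function name build_request and the extra "Connection: close" header are assumptions added for illustration; the original code sends only the request line and the Host header.

```python
def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,
        "Connection: close",  # assumption: ask the server to close after one response
        "",                   # the two empty entries produce the final
        "",                   # blank line that ends the header section
    ]
    return "\r\n".join(lines).encode("ascii")

print(build_request("example.com", "/img/logo.png"))
```

Every header line ends with \r\n, and an empty line marks the end of the header, which is why the packet must finish with \r\n\r\n.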

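Downloading the file

With the connection open and the request packet built, the request can finally be sent to the web server.

Saving the data the server sends back in response is all it takes to download the file.

To make sure the downloaded file is complete and correct, first grab "Content-Length" from the response header: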

    def getLength(self):
        s.send(packet)
        print "send success!"
        # The first chunk holds the response header (and maybe some of the body)
        buf = s.recv(1024)
        print buf
        p = re.compile(r'Content-Length: (\d*)')
        length = int(p.findall(buf)[0])
        return length, buf
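The header lookup can be tried on its own, without a server. The following Python 3 sketch runs the same kind of regular expression over a hand-written response header; the function name content_length and the sample bytes are made up for the example. It matches case-insensitively, since HTTP header names are not case-sensitive.

```python
import re

# Bytes pattern, because socket.recv returns bytes in Python 3
CONTENT_LENGTH = re.compile(rb"Content-Length:\s*(\d+)", re.IGNORECASE)

def content_length(response_head):
    """Return the Content-Length value as an int, or None if it is absent."""
    match = CONTENT_LENGTH.search(response_head)
    return int(match.group(1)) if match else None

sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: image/png\r\n"
          b"Content-Length: 2048\r\n"
          b"\r\n")
print(content_length(sample))  # 2048
```

Returning None instead of raising lets the caller decide what to do when a server does not send the header at all.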

Next, download the file and measure how long the download takes:

    def download(self):
        file = open(self.filename, 'wb')
        length, buf = self.getLength()
        # Skip the response header; the body starts after the blank line
        packetIndex = buf.index('\r\n\r\n')
        buf = buf[packetIndex+4:]
        file.write(buf)
        sum = len(buf)
        # Keep receiving until Content-Length bytes have been written
        while sum < length:
            buf = s.recv(1024)
            if not buf:
                # The server closed the connection early
                break
            file.write(buf)
            sum = sum + len(buf)
        file.close()
        print "Success!!"

    if __name__ == "__main__":
        start = time.time()
        down = downloader()
        down.download()
        end = time.time()
        print "The time spent on this program is %f s" % (end - start)
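The receive loop is the part most worth testing in isolation. Below is a Python 3 sketch of the same logic that takes the recv function as a parameter, so it can be exercised with canned data instead of a live socket. The name read_body and its parameters are hypothetical, and like the code above it assumes the whole header arrives in the first chunk.

```python
def read_body(recv, first_chunk, length):
    """Collect exactly `length` body bytes.

    recv        -- callable like socket.recv; returns b"" when the peer closes
    first_chunk -- bytes already received: the header plus maybe some body
    length      -- the Content-Length value
    """
    head_end = first_chunk.index(b"\r\n\r\n") + 4  # body starts after the blank line
    body = bytearray(first_chunk[head_end:])
    while len(body) < length:
        chunk = recv(1024)
        if not chunk:        # connection closed before the body was complete
            break
        body.extend(chunk)
    return bytes(body[:length])

# Simulate a socket that delivers the rest of the body in two pieces
pieces = [b"lo wo", b"rld", b""]
first = b"HTTP/1.1 200 OK\r\nContent-Length: 11\r\n\r\nhel"
print(read_body(lambda n: pieces.pop(0), first, 11))  # b'hello world'
```

With a real connection, s.recv would be passed as the recv argument.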

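With that, the single-threaded HTTP downloader is complete. [Figure: screenshot of a sample run]

For the single-threaded code see singleThreadDownload

Multithreaded

With the single-threaded version done, the next step is the multithreaded one. How is it done?

As before, first grab the "Content-Length" field from the response header; then divide the download range among the threads according to the thread count; after that, download the segments in parallel, taking a lock around each write to the file.

Unlike the single-threaded version, all the code here is combined into a single file, and it makes more use of Python's built-in modules.

Getting "Content-Length":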

    def getLength(self):
        opener = urllib2.build_opener()
        req = opener.open(self.url)
        meta = req.info()
        length = int(meta.getheaders("Content-Length")[0])
        return length

Using the length obtained, divide the ranges according to the number of threads:

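    def get_range(self):
        ranges = []
        length = self.getLength()
        offset = int(int(length) / self.threadNum)
        for i in range(self.threadNum):
            if i == (self.threadNum - 1):
                # Last thread: an open-ended range reads to the end of the file
                ranges.append((i*offset, ''))
            else:
                # HTTP byte ranges are inclusive, so stop one byte before
                # the next thread's start
                ranges.append((i*offset, (i+1)*offset - 1))
        return ranges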
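A small worked example may help here. The Python 3 sketch below is a standalone variant of the range split (the name split_ranges is made up) that returns explicit inclusive end offsets, so it is easy to check that the ranges cover every byte exactly once. In an HTTP Range header both ends are inclusive: "bytes=0-2" asks for three bytes.

```python
def split_ranges(length, thread_num):
    """Split `length` bytes into thread_num inclusive (start, end) ranges."""
    offset = length // thread_num
    ranges = []
    for i in range(thread_num):
        start = i * offset
        if i == thread_num - 1:
            end = length - 1             # last range absorbs the remainder
        else:
            end = (i + 1) * offset - 1   # stop just before the next start
        ranges.append((start, end))
    return ranges

print(split_ranges(10, 3))  # [(0, 2), (3, 5), (6, 9)]
```

For a 10-byte file and 3 threads, the first two threads fetch 3 bytes each and the last fetches the remaining 4. The sketch assumes the file is at least thread_num bytes long.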

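There are two things to watch out for when implementing the multithreaded download:

When writing to the file, protect the write with a lock, using with lock instead of lock.acquire() ... lock.release();
Use file.seek() to set the file offset, so that each block is written at the correct position.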

    def downloadThread(self, start, end):
        req = urllib2.Request(self.url)
        # Ask the server for this thread's byte range only
        req.headers['Range'] = 'bytes=%s-%s' % (start, end)
        f = urllib2.urlopen(req)
        offset = start
        buffer = 1024
        while 1:
            block = f.read(buffer)
            if not block:
                break
            # seek + write must happen as one step, so hold the lock
            with lock:
                self.file.seek(offset)
                self.file.write(block)
                offset = offset + len(block)

    def download(self):
        filename = self.getFilename()
        self.file = open(filename, 'wb')
        thread_list = []
        n = 1
        for ran in self.get_range():
            start, end = ran
            print 'starting:%d thread ' % n
            n += 1
            thread = threading.Thread(target=self.downloadThread, args=(start, end))
            thread.start()
            thread_list.append(thread)
        # Wait for every thread to finish before closing the file
        for i in thread_list:
            i.join()
        print 'Download %s Success!' % filename
        self.file.close()
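The lock-plus-seek pattern can be tried without any network access. In the Python 3 sketch below, each thread copies one slice of an in-memory bytes object into a shared file, with the slice standing in for a ranged HTTP read. The names write_in_parallel and demo are hypothetical, and this is only an illustration of the locking idea, not the project's downloader.

```python
import os
import tempfile
import threading

def write_in_parallel(data, path, thread_num=4):
    """Write `data` to `path` with up to thread_num threads, one slice each."""
    lock = threading.Lock()
    chunk = max(1, -(-len(data) // thread_num))  # ceiling division

    with open(path, "wb") as out:
        def worker(start):
            block = data[start:start + chunk]  # stands in for a ranged HTTP read
            with lock:                         # seek + write must not interleave
                out.seek(start)
                out.write(block)

        threads = [threading.Thread(target=worker, args=(s,))
                   for s in range(0, len(data), chunk)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                           # all slices written before close

def demo():
    """Round-trip some bytes through a temp file; True if they match."""
    data = bytes(range(256)) * 10
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_in_parallel(data, path)
        with open(path, "rb") as f:
            return f.read() == data
    finally:
        os.remove(path)

print(demo())  # True
```

Without the lock, one thread could call seek, be interrupted, and then write at the position another thread had just set, which is exactly the corruption the two points above guard against.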

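Result: [Figure: screenshot of the multithreaded run]

To show the result, I again used downloading an image as the example. Of course, you can download larger files too; in my own testing I downloaded the Python installer, and it worked.

For the multithreaded code see: multiThreadDownload

Converting the .py file into an .exe executable

At this point we have a working multithreaded HTTP downloader. But that raises another question: how can people who don't have Python installed use this tool?

That requires converting the .py file into an .exe file.

How is that done?

This was my first time doing it. After looking up some material online, I settled on Python's py2exe module. Since this is my first time using it, here is a brief introduction:

py2exe is a tool that converts Python scripts into standalone executables (*.exe) for Windows, so the program can run on Windows without Python installed.

Next, in the same directory as multiThreadDownload.py, create a file named mysetup.py containing: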

    from distutils.core import setup
    import py2exe

    setup(console=["multiThreadDownload.py"])

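Then run the command: python mysetup.py py2exe

When the command finishes, dist and build folders appear in the same directory. multiThreadDownload.exe is in the dist folder; double-click it to run. [Figure: screenshot of the executable running]

That is how I implemented a multithreaded HTTP downloader in Python.

If you have any ideas or suggestions, I'd be glad to hear them! If you found this useful, please give it a like.

This is my first article on Zhihu, so comments and discussion are welcome.

My GitHub: https://github.com/Yabea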