谷歌爬蟲主要是用C++開發嗎?
聽說是以C++為主,以Python為輔,是這樣嗎?覺得Python寫起來比較方便··
從谷歌創始人的論文《The Anatomy of a Large-Scale Hypertextual Web Search Engine
》里可以看到,谷歌一開始的爬蟲是用Python寫的。
http://infolab.stanford.edu/pub/papers/google.pdf
4.3 Crawling the Web
In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains a its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
據說現在改成了C++,但是我沒有找到明確的材料。
目前是的
Why did Google move from Python to C++ for use in its crawler?
推薦閱讀:
※Linux下connect函數埠連到自己, 這種問題怎麼解決?
※C++ 中的命名空間和類有什麼區別?
※單元測試到底是什麼?應該怎麼做?
※怎麼理解API和MFC的關係?
※C++ 中為什麼要有. -> ::這幾種成員訪問操作符?