4. Command line tool

New in version 0.10.

Scrapy is controlled through the scrapy command-line tool, to be referred to here as the "Scrapy tool" to differentiate it from the sub-commands, which we just call "commands" or "Scrapy commands".

The Scrapy tool provides several commands, for multiple purposes, and each one accepts a different set of arguments and options.

(The scrapy deploy command has been removed in 1.0 in favor of the standalone scrapyd-deploy. See Deploying your project.)

Configuration settings

Scrapy will look for configuration parameters in ini-style scrapy.cfg files in standard locations:

/etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide),

~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-wide) settings, and

scrapy.cfg inside a Scrapy project's root (see next section).

Settings from these files are merged in the listed order of preference: user-defined values have higher priority than system-wide defaults, and project-wide settings will override all others, when defined.

Scrapy also understands, and can be configured through, a number of environment variables. Currently these are:

SCRAPY_SETTINGS_MODULE (see Designating the settings)

SCRAPY_PROJECT

SCRAPY_PYTHON_SHELL (see Scrapy shell)
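For example, SCRAPY_SETTINGS_MODULE is what tells Scrapy which settings module to load when Scrapy code is driven from a plain Python script instead of the scrapy tool. The lines below are a minimal sketch, assuming a hypothetical project named myproject whose package is importable (e.g. the script is run from the project root); it is only one of several ways to designate the settings.

# Minimal sketch: pointing Scrapy at a settings module via the environment.
# "myproject.settings" is a hypothetical module path from the layout shown below.
import os

os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "myproject.settings")

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get("BOT_NAME"))  # prints the project name defined in settings.py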

Default structure of Scrapy projects

Before delving into the command-line tool and its sub-commands, let's first understand the directory structure of a Scrapy project.

Though it can be modified, all Scrapy projects have the same file structure by default, similar to this:

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name of the Python module that defines the project settings. Here is an example:

[settings]
default = myproject.settings
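The modules in the default layout are ordinary Python files. As a rough illustration of one of them (the field names here are hypothetical, not something Scrapy generates), items.py typically holds Item definitions such as:

# myproject/items.py - a minimal sketch with made-up fields
import scrapy

class MyprojectItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()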

Using the scrapy tool

You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]

The first line will print the currently active project if you're inside a Scrapy project. In this example it was run from outside a project. If run from inside a project it would have printed something like this:

Scrapy X.Y - project: myproject

Usage:
  scrapy <command> [options] [args]
[...]

Creating projects

The first thing you typically do with the scrapy tool is create your Scrapy project:

scrapy startproject myproject [project_dir]

That will create a Scrapy project under the project_dir directory. If project_dir wasn't specified, project_dir will be the same as myproject.

Next, you go inside the new project directory:

cd project_dir

And you're ready to use the scrapy command to manage and control your project from there.

Controlling projects

You use the scrapy tool from inside your projects to control and manage them.

For example, to create a new spider:

scrapy genspider mydomain mydomain.com

Some Scrapy commands (like crawl) must be run from inside a Scrapy project. See the commands reference below for more information on which commands must be run from inside projects, and which not.

Also keep in mind that some commands may have slightly different behaviours when running them from inside projects. For example, the fetch command will use spider-overridden behaviours (such as the user_agent attribute to override the user-agent) if the URL being fetched is associated with some specific spider. This is intentional, as the fetch command is meant to be used to check how spiders are downloading pages.

Available tool commands

This section contains a list of the available built-in commands with a description and some usage examples. Remember, you can always get more info about each command by running:

scrapy <command> -h

And you can see all available commands with:

scrapy -h

There are two kinds of commands: those that only work from inside a Scrapy project (project-specific commands) and those that also work without an active Scrapy project (global commands), though they may behave slightly differently when running from inside a project (as they would use the project overridden settings).

Global commands:

startproject

genspider

settings

runspider

shell

fetch

view

version

Project-only commands:

crawl

check

list

edit

parse

bench

startproject

Syntax: scrapy startproject <project_name> [project_dir]

Requires project: no

Creates a new Scrapy project named project_name, under the project_dir directory. If project_dir wasn't specified, project_dir will be the same as project_name.

Usage example:

$ scrapy startproject myproject

genspider

Syntax: scrapy genspider [-t template] <name> <domain>

Requires project: no

Create a new spider in the current folder or in the current project's spiders folder, if called from inside a project. The <name> parameter is set as the spider's name, while <domain> is used to generate the allowed_domains and start_urls spider attributes.

Usage example:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

This is just a convenience shortcut command for creating spiders based on pre-defined templates, but certainly not the only way to create spiders. You can just create the spider source code files yourself, instead of using this command.
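For reference, a hand-written spider file only needs a scrapy.Spider subclass. The sketch below approximates what the basic template produces for scrapy genspider example example.com (a paraphrase, not the verbatim template):

# myproject/spiders/example.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # extraction logic goes here
        pass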

crawl

Syntax: scrapy crawl <spider>

Requires project: yes

Start crawling using a spider.

Usage examples:

$ scrapy crawl myspider

[ ... myspider starts crawling ... ]
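If you prefer to start the same crawl from a Python script rather than the command line, a minimal sketch using Scrapy's CrawlerProcess looks like this (assuming the script is run from the project root so the project settings can be found; "myspider" is the hypothetical spider name from the example above):

# run.py - start the "myspider" crawl programmatically
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("myspider")  # same name you would pass to `scrapy crawl`
process.start()            # blocks until the crawl finishes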

check

Syntax: scrapy check [-l] <spider>

Requires project: yes

Run contract checks.

Usage examples:

$ scrapy check -l

first_spider

* parse

* parse_item

second_spider

* parse

* parse_item

$ scrapy check

[FAILED] first_spider:parse_item

>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse

>>> Returned 92 requests, expected 0..4
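The failures above come from contracts declared in the docstrings of spider callbacks. A minimal sketch of what such a contract-annotated method might look like (the URL, selectors, and field names here are hypothetical):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first_spider"

    def parse_item(self, response):
        """Parse a product page.

        @url http://www.example.com/some/page.html
        @returns items 1 1
        @returns requests 0 0
        @scrapes name RetailPricex
        """
        yield {
            "name": response.css("h1::text").extract_first(),
            "RetailPricex": response.css(".price::text").extract_first(),
        }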

list

Syntax: scrapy list

Requires project: yes

List all available spiders in the current project. The output is one spider per line.

Usage example:

$ scrapy list

spider1

spider2

edit

Syntax: scrapy edit <spider>

Requires project: yes

Edit the given spider using the editor defined in the EDITOR environment variable or (if unset) the EDITOR setting.

This command is provided only as a convenience shortcut for the most common case; the developer is of course free to choose any tool or IDE to write and debug spiders.

Usage example:

$ scrapy edit spider1

fetch

Syntax: scrapy fetch <url>

Requires project: no

Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

The interesting thing about this command is that it fetches the page the way the spider would download it. For example, if the spider has a USER_AGENT attribute which overrides the User Agent, it will use that one.

So this command can be used to "see" how your spider would fetch a certain page.

If used outside a project, no particular per-spider behaviour would be applied and it will just use the default Scrapy downloader settings.
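As a concrete illustration of such per-spider behaviour, one common way to give a spider its own user agent is the custom_settings class attribute; the spider name, domain, and agent string below are hypothetical:

import scrapy

class FetchDemoSpider(scrapy.Spider):
    name = "fetchdemo"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    # Per-spider settings override the project defaults, so `scrapy fetch`
    # run inside this project for a URL this spider handles would send
    # this User-Agent instead of the default one.
    custom_settings = {
        "USER_AGENT": "my-custom-agent/1.0",
    }

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)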

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider

--headers: print the response's HTTP headers instead of the response's body

--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

Usage examples:

$ scrapy fetch --nolog http://www.example.com/some/page.html

[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263'],
 'Connection': ['close'],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

view

Syntax: scrapy view <url>

Requires project: no

Opens the given URL in a browser, as your Scrapy spider would "see" it. Sometimes spiders see pages differently from regular users, so this can be used to check what the spider "sees" and confirm it's what you expect.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider

--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

Usage example:

$ scrapy view http://www.example.com/some/page.html

[ ... browser starts ... ]

shell

Syntax: scrapy shell [url]

Requires project: no

Starts the Scrapy shell for the given URL (if given) or empty if no URL is given. Also supports UNIX-style local file paths, either relative with ./ or ../ prefixes or absolute file paths. See Scrapy shell for more info.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider

-c code: evaluate the code in the shell, print the result and exit

--no-redirect: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, fetch(url) will still follow HTTP redirects by default.

Usage example:

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')

parse

Syntax: scrapy parse <url> [options]

Requires project: yes

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider

--a NAME=VALUE: set spider argument (may be repeated)

--callback or -c: spider method to use as callback for parsing the response

--meta or -m: additional request meta that will be passed to the callback request. This must be a valid json string. Example: --meta='{"foo" : "bar"}'

--pipelines: process items through pipelines

--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response

--noitems: don't show scraped items

--nolinks: don't show extracted links

--nocolour: avoid using pygments to colorize the output

--depth or -d: depth level for which the requests should be followed recursively (default: 1)

--verbose or -v: display information for each depth level

Usage example:

$ scrapy parse http://www.example.com/ -c parse_item

[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<

# Scraped Items ------------------------------------------------------------

[{'name': u'Example item',
 'category': u'Furniture',
 'length': u'12 cm'}]

# Requests -----------------------------------------------------------------

[]
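The scraped item shown above would come from whatever the parse_item callback yields. A hypothetical callback producing output of that shape might look like this (the CSS selectors are invented for illustration; the real page structure is not shown in this section):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def parse_item(self, response):
        # Placeholder selectors for a product-like page
        yield {
            "name": response.css("h1::text").extract_first(),
            "category": response.css(".category::text").extract_first(),
            "length": response.css(".length::text").extract_first(),
        }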

I'll stop here; see the official documentation for the remaining commands.

