In the newly created articleSpider.py file, write the following code:
from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
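Both spiders in this section import Article from wikiSpider/items.py. That file is not shown here, but a minimal sketch of it, assuming only the title field that the spider code above populates, would look like this:

```python
# wikiSpider/items.py -- minimal sketch, not the book's full file;
# the single title field is assumed from item['title'] in the spider
import scrapy

class Article(scrapy.Item):
    # every field the spider sets must be declared with scrapy.Field()
    title = scrapy.Field()
```

Trying to assign to an undeclared field (for example item['url']) raises a KeyError, which is how Scrapy items catch typos that a plain dict would silently accept.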
You can run ArticleSpider from the wikiSpider home directory with the following command:
$ scrapy crawl article
This command calls the spider by the entry name defined in name (here, article).
Among the debug messages that scroll by, you should see these two lines:
Title is: Main Page
Title is: Python (programming language)
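The selector expression '//h1/text()' simply grabs the text of the first h1 tag on the page. Outside of Scrapy, the same idea can be sketched with Python's standard library (the HTML snippet below is made up for illustration; ElementTree only supports a limited XPath subset, so find() plus .text stands in for text()):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a Wikipedia page (hypothetical content)
html = "<html><body><h1>Main Page</h1><p>Welcome</p></body></html>"
root = ET.fromstring(html)

# Rough stdlib equivalent of response.xpath('//h1/text()')[0].extract()
title = root.find(".//h1").text
print("Title is: " + title)
```

Real pages are rarely well-formed XML, which is one reason Scrapy ships its own selector machinery instead of relying on ElementTree.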
To make the spider follow links on the pages it visits instead of stopping after the listed URLs, rewrite articleSpider.py as a CrawlSpider with a link-extraction rule:

from scrapy.contrib.spiders import CrawlSpider, Rule
from wikiSpider.items import Article
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    rules = [Rule(SgmlLinkExtractor(allow=('(/wiki/)((?!:).)*$'),),
                  callback="parse_item", follow=True)]

    def parse_item(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
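The allow pattern in the Rule above, '(/wiki/)((?!:).)*$', matches article URLs under /wiki/ while rejecting namespace pages whose path contains a colon (Category:, Talk:, and so on). A quick check with Python's re module, using hypothetical example paths:

```python
import re

# Same allow pattern as in the Rule above
pattern = re.compile(r'(/wiki/)((?!:).)*$')

# Plain article paths match...
assert pattern.search('/wiki/Python_%28programming_language%29') is not None
# ...but namespaced pages (colon anywhere after /wiki/) do not
assert pattern.search('/wiki/Category:Programming_languages') is None
assert pattern.search('/wiki/Talk:Main_Page') is None
```

The negative lookahead (?!:) forbids a colon at every position after /wiki/, and the trailing $ forces the match to cover the rest of the path, so a single colon anywhere disqualifies the URL.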
This spider is started with the same command as the previous one, but it will not stop on its own (or at least not for a very long time) unless you kill the process with Ctrl+C.
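If you just want a bounded test run rather than a full crawl of Wikipedia, Scrapy's CloseSpider extension settings can stop the spider automatically. A settings fragment, assuming it goes in wikiSpider/settings.py and with values chosen for illustration:

```python
# wikiSpider/settings.py -- stop the crawl automatically (assumed values)
CLOSESPIDER_PAGECOUNT = 100   # close after 100 responses have been crawled
CLOSESPIDER_TIMEOUT = 60      # or after 60 seconds, whichever comes first
```

The same settings can be passed on the command line for a one-off run, e.g. scrapy crawl article -s CLOSESPIDER_PAGECOUNT=10.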