python - Scrapy crawl all sitemap links -


I want to crawl the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted the URLs in the sitemap; now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)

You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many rules as you want. For instance, say you have a page named http://www.xyz.nl//x/ — you would create a rule like this:

from scrapy.spiders import SitemapSpider
from scrapy.selector import Selector

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of tuples - this example contains one page
    # (the callback is given by name, as a string)
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        sel = Selector(response)
        paragraph = sel.xpath('//p').extract()
        return paragraph
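To see what the rule matching is doing, here is a minimal standard-library sketch of the two steps SitemapSpider performs internally: pulling the loc entries out of a sitemap and routing each URL to the first matching rule's callback. The inline sitemap XML and the helper names (extract_urls, match_rules) are made up for illustration; they are not part of Scrapy's API.

import re
import xml.etree.ElementTree as ET

# A tiny hypothetical sitemap, standing in for http://www.xyz.nl/sitemap.xml
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.xyz.nl/x/page-1</loc></url>
  <url><loc>http://www.xyz.nl/y/page-2</loc></url>
</urlset>"""

def extract_urls(sitemap_xml):
    # Sitemap files use the sitemaps.org namespace; collect every <loc> entry.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

def match_rules(urls, rules):
    # Mimic sitemap_rules: pair each URL with the callback name of the
    # first rule whose regex pattern matches anywhere in the URL.
    matched = []
    for url in urls:
        for pattern, callback in rules:
            if re.search(pattern, url):
                matched.append((url, callback))
                break
    return matched

urls = extract_urls(SITEMAP_XML)
print(match_rules(urls, [('/x/', 'parse_x')]))
# prints [('http://www.xyz.nl/x/page-1', 'parse_x')]

URLs that match no rule are simply skipped, which is why only the /x/ page is paired with parse_x here.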
