python - Scrapy crawl all sitemap links


I want to crawl the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider, and so far I've extracted the URLs in the sitemap. Now I want to crawl through each link of the sitemap. Any help would be highly useful. My code so far is:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)

You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many rules as you want. For instance, say you have a page named http://www.xyz.nl//x/ and you want to create a rule for it:

from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of tuples - this example contains a rule for one page
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        sel = Selector(response)
        paragraphs = sel.xpath('//p').extract()
        # callbacks should yield items or requests, not a bare list of strings
        yield {'paragraphs': paragraphs}
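As a rough sketch of how several rules can be combined (the /y/ pattern and the parse_y and parse_other callbacks below are hypothetical, not from the answer above), note that Scrapy applies only the first rule whose regex matches a URL, so more specific patterns should come before a catch-all:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # only the first matching rule is used for each URL,
    # so put specific patterns first and a catch-all last
    sitemap_rules = [
        ('/x/', 'parse_x'),      # pages under /x/, as in the answer above
        ('/y/', 'parse_y'),      # hypothetical second section
        ('', 'parse_other'),     # everything else in the sitemap
    ]

    def parse_x(self, response):
        yield {'url': response.url,
               'paragraphs': response.xpath('//p').extract()}

    def parse_y(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}

    def parse_other(self, response):
        yield {'url': response.url}

The spider can then be run from the project directory with scrapy crawl xyz -o output.json to collect the yielded items into a file.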
