python - Scrapy: crawl all sitemap links
I want to crawl the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've managed to extract the URLs in the sitemap, but now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many rules as you want. For instance, say you have a page named http://www.xyz.nl//x/; for it you would create a rule like this:
from scrapy.spiders import SitemapSpider
from scrapy.selector import Selector

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of (regex, callback) tuples - this example contains one rule
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        sel = Selector(response)
        paragraph = sel.xpath('//p').extract()
        # return a dict so Scrapy can treat it as a scraped item
        return {'paragraph': paragraph}
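For completeness, here is a minimal sketch (not part of the original answer) of the same spider with several rules plus a catch-all, each yielding a dict that Scrapy can export. The '/y/' pattern, the extra callback names, and the field names are illustrative assumptions, not something the question requires.

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
    # Rules are checked in order; the first regex that matches a sitemap URL
    # decides which callback handles that page.
    sitemap_rules = [
        ("/x/", "parse_x"),
        ("/y/", "parse_y"),   # hypothetical second section of the site
        ("", "parse_other"),  # empty pattern matches every remaining URL
    ]

    def parse_x(self, response):
        yield {"url": response.url,
               "paragraphs": response.xpath("//p/text()").getall()}

    def parse_y(self, response):
        yield {"url": response.url,
               "title": response.xpath("//title/text()").get()}

    def parse_other(self, response):
        yield {"url": response.url}

Run it from the Scrapy project directory with: scrapy crawl xyz -o pages.json
This writes the yielded dicts to pages.json, so you can see which callback handled each sitemap URL.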