python - Scrapy: crawl all sitemap links
I want to crawl the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've managed to extract the URLs in the sitemap, but now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many rules as you want. For instance, say you have a page named http://www.xyz.nl//x/; for it you would create a rule like this:
from scrapy.spiders import SitemapSpider
from scrapy.selector import Selector

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of (regex, callback) tuples - this example contains one rule
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        sel = Selector(response)
        paragraph = sel.xpath('//p').extract()
        # return a dict so Scrapy can treat it as a scraped item
        return {'paragraph': paragraph}
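For completeness, here is a minimal sketch (not part of the original answer) of the same spider with several rules plus a catch-all, each yielding a dict that Scrapy can export. The '/y/' pattern, the extra callback names, and the field names are illustrative assumptions, not something the question requires.

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
    # Rules are checked in order; the first regex that matches a sitemap URL
    # decides which callback handles that page.
    sitemap_rules = [
        ("/x/", "parse_x"),
        ("/y/", "parse_y"),   # hypothetical second section of the site
        ("", "parse_other"),  # empty pattern matches every remaining URL
    ]

    def parse_x(self, response):
        yield {"url": response.url,
               "paragraphs": response.xpath("//p/text()").getall()}

    def parse_y(self, response):
        yield {"url": response.url,
               "title": response.xpath("//title/text()").get()}

    def parse_other(self, response):
        yield {"url": response.url}

Run it from the Scrapy project directory with: scrapy crawl xyz -o pages.json
This writes the yielded dicts to pages.json, so you can see which callback handled each sitemap URL.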