regex - Scrapy not crawling all links
I am using Scrapy to crawl a site on a fixed domain. I want to crawl only the pages whose URLs match a fixed regular expression and ignore the rest. The code works, but it returns only around 10-15 pages out of at least 1000. The code is:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector

    class XyzSpider(CrawlSpider):
        name = "xyz"
        allowed_domains = ["xyz.com"]
        start_urls = ["http://www.xyz.com"]
        rules = (
            Rule(SgmlLinkExtractor(allow=[r'\/v-\d{7}\/[\w\s]+']), callback='parse_item'),
        )

        def parse_item(self, response):
            sel = Selector(response)
            title = sel.xpath("//h1[@class='no-bd']/text()").extract()
            print title
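As a quick sanity check, the allow pattern can be exercised directly with Python's re module, independent of Scrapy. The URLs below are hypothetical examples, not taken from the actual site:

    import re

    # The allow pattern from the rule: a /v- segment, exactly 7 digits,
    # then a slash and one or more word or whitespace characters.
    pattern = re.compile(r'\/v-\d{7}\/[\w\s]+')

    print(bool(pattern.search("http://www.xyz.com/v-1234567/blue widget")))  # matches
    print(bool(pattern.search("http://www.xyz.com/v-123/blue-widget")))      # only 3 digits, no match
    print(bool(pattern.search("http://www.xyz.com/about-us")))               # no /v-NNNNNNN/ segment, no match

If real item URLs contain characters other than word characters and whitespace (hyphens, for instance, beyond the first word), the [\w\s]+ part may still match a prefix, so the pattern is more permissive than it looks.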
Can someone please tell me what I am doing wrong?
I don't think the code is the problem. The site you are crawling probably limits how many requests it will handle from a specific IP in a given period. Try adding sleep(2) calls between requests and see if that makes a difference.
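Rather than calling sleep() directly (which blocks Scrapy's event loop), the idiomatic way to throttle a Scrapy crawl is through the project settings. A minimal sketch of the relevant settings.py entries, assuming a 2-second delay is enough for this site:

    # settings.py -- throttle the crawl so the target site does not rate-limit our IP.
    DOWNLOAD_DELAY = 2                   # wait 2 seconds between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay (0.5x-1.5x) so requests look less mechanical
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # issue one request at a time per domain

These settings achieve the same pacing as sleep(2) without stalling the rest of the crawler.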