regex - Scrapy not crawling all links


I am using Scrapy to crawl a site on a fixed domain. I want to crawl only the pages whose URLs match a fixed regular expression and ignore the rest. The code works, but it returns only around 10-15 pages out of at least 1000. Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class XyzSpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.com"]
    start_urls = ["http://www.xyz.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=[r'\/v-\d{7}\/[\w\s]+']), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        title = sel.xpath("//h1[@class='no-bd']/text()").extract()
        print title
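As a quick sanity check on the pattern itself (the URLs below are made up for illustration), note that `[\w\s]` does not match hyphens, so the pattern only matches part of a hyphenated slug; link extractors apply the pattern with a search rather than a full match, so a partial hit is still enough to extract the link:

```python
import re

# The allow pattern from the rule above
pattern = re.compile(r'\/v-\d{7}\/[\w\s]+')

# Hypothetical URLs for illustration only
print(bool(pattern.search("http://www.xyz.com/v-1234567/blue-widget")))  # True: matches up to the hyphen
print(bool(pattern.search("http://www.xyz.com/v-123456/widget")))        # False: only 6 digits
print(bool(pattern.search("http://www.xyz.com/about")))                  # False: no /v-NNNNNNN/ segment
```

If the item URLs on the site deviate from this shape even slightly (fewer digits, extra path segments), the rule will silently skip them, which is worth ruling out before blaming the server.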

Can anyone tell me what I am doing wrong?

I don't think the code is the problem; the site you are crawling may have a limit on how many requests it handles from a specific IP in a given period. Try implementing sleep(2)-style delays between requests and see if that makes a difference.
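If throttling is the cause, Scrapy has built-in support for delaying requests, which is more idiomatic than calling sleep() yourself; a minimal sketch of the relevant settings.py entries (the values shown are assumptions to tune for the target site):

```python
# settings.py (sketch): throttle the crawl instead of sleeping in the callback
DOWNLOAD_DELAY = 2           # wait ~2 seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to observed server load
```

Because the scheduler applies the delay, the spider code itself needs no changes.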

