regex - Scrapy not crawling all links -


I am using a Scrapy CrawlSpider to scrape a site on a fixed domain. I want to crawl only the pages whose URLs match a fixed regular expression and ignore the rest. The code works, but it returns only around 10-15 pages out of at least 1000. The code is:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector

    class XyzSpider(CrawlSpider):
        name = "xyz"
        allowed_domains = ["xyz.com"]
        start_urls = ["http://www.xyz.com"]

        rules = (
            Rule(SgmlLinkExtractor(allow=[r'/v-\d{7}/[\w\s]+']),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            sel = Selector(response)
            title = sel.xpath("//h1[@class='no-bd']/text()").extract()
            print title
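To rule out the pattern itself, it can help to test the `allow` regex against sample URLs with plain `re` before running the spider. This is a minimal sketch; the URLs below are hypothetical examples, not actual pages from the site. Note that the link extractor uses an unanchored search, so the pattern only needs to match somewhere inside the URL:

```python
import re

# The same pattern passed to SgmlLinkExtractor(allow=[...]).
pattern = re.compile(r'/v-\d{7}/[\w\s]+')

# A URL with a seven-digit item id and a word-character slug matches.
assert pattern.search('http://www.xyz.com/v-1234567/Some_Item_Title')

# URLs without the /v-NNNNNNN/ segment are ignored, as intended.
assert pattern.search('http://www.xyz.com/about-us') is None

# A six-digit id also fails, since \d{7} requires exactly seven digits.
assert pattern.search('http://www.xyz.com/v-123456/Some_Item') is None
```

If candidate URLs from the site pass this check, the filtering regex is not the bottleneck and the problem lies elsewhere (for example, server-side throttling).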

Can anyone please tell me what I am doing wrong?

I don't think the code is the problem. The site you are crawling probably limits how many requests it will handle from a specific IP in a given period. Try adding sleep(2) calls between requests and see if that makes a difference.

