regex - Scrapy not crawling all links
I am using Scrapy to crawl a site on a fixed domain. I want to crawl only the pages whose URLs match a fixed regular expression and ignore the rest. The code works, but it returns only around 10-15 pages out of at least 1000. The code is:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector

    class XyzSpider(CrawlSpider):
        name = "xyz"
        allowed_domains = ["xyz.com"]
        start_urls = ["http://www.xyz.com"]
        rules = (
            Rule(SgmlLinkExtractor(allow=[r'\/v-\d{7}\/[\w\s]+']), callback='parse_item'),
        )

        def parse_item(self, response):
            sel = Selector(response)
            title = sel.xpath("//h1[@class='no-bd']/text()").extract()
            print title
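As a quick sanity check, the allow pattern can be exercised directly with Python's re module, independent of Scrapy. The URLs below are hypothetical examples, not taken from the actual site:

    import re

    # The allow pattern from the rule: a /v- segment, exactly 7 digits,
    # then a slash and one or more word or whitespace characters.
    pattern = re.compile(r'\/v-\d{7}\/[\w\s]+')

    print(bool(pattern.search("http://www.xyz.com/v-1234567/blue widget")))  # matches
    print(bool(pattern.search("http://www.xyz.com/v-123/blue-widget")))      # only 3 digits, no match
    print(bool(pattern.search("http://www.xyz.com/about-us")))               # no /v-NNNNNNN/ segment, no match

If real item URLs contain characters other than word characters and whitespace (hyphens, for instance, beyond the first word), the [\w\s]+ part may still match a prefix, so the pattern is more permissive than it looks.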
Can someone please tell me what I am doing wrong?
I don't think the code is the problem. The site you are crawling probably limits how many requests it will handle from a specific IP in a given period. Try adding sleep(2) calls between requests and see if that makes a difference.
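Rather than calling sleep() directly (which blocks Scrapy's event loop), the idiomatic way to throttle a Scrapy crawl is through the project settings. A minimal sketch of the relevant settings.py entries, assuming a 2-second delay is enough for this site:

    # settings.py -- throttle the crawl so the target site does not rate-limit our IP.
    DOWNLOAD_DELAY = 2                   # wait 2 seconds between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay (0.5x-1.5x) so requests look less mechanical
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # issue one request at a time per domain

These settings achieve the same pacing as sleep(2) without stalling the rest of the crawler.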