python - Pythonic way to process a 200 million element data set?


I have a directory of 1,000 files. Each file has many lines, and each line is an ngram varying from 4 to 8 bytes. I'm trying to parse the files to get the distinct ngrams for a header row, and then, for each file, I want to write a row that has the frequency of each ngram sequence occurring within that file.

The following code made it through gathering the headers, but hit a memory error when trying to write the headers to a CSV file. I'm running this on an Amazon EC2 instance with 30GB of RAM. Can anyone provide recommendations for optimizations I'm unaware of?

import collections
import csv
import os

# Note: a combination of a list and a set is used to maintain the order of the metadata
# headers but still get set performance, since the non-meta headers do not need to maintain order
header_list = []
header_set = set()
header_list.extend(meta_list)
for ngram_dir in ngram_dirs:
    ngram_files = os.listdir(ngram_dir)
    for ngram_file in ngram_files:
        with open(ngram_dir + ngram_file, 'r') as file:
            for line in file:
                if '.' not in line and line.rstrip('\n') not in ignore_list:
                    header_set.add(line.rstrip('\n'))

header_list.extend(header_set)  # memory error occurred here

outfile = open(model_dir + model_file_name, 'w')
csvwriter = csv.writer(outfile)
csvwriter.writerow(header_list)

# Convert the ngram representations to a vector model of frequencies
for ngram_dir in ngram_dirs:
    ngram_files = os.listdir(ngram_dir)
    for ngram_file in ngram_files:
        with open(ngram_dir + ngram_file, 'r') as file:
            write_list = []
            linecount = 0
            header_dict = collections.OrderedDict.fromkeys(header_set, 0)
            while linecount < meta_fields:  # meta_fields = 3
                line = file.readline()
                write_list.append(line.rstrip('\n'))
                linecount += 1

            file_counter = collections.Counter(line.rstrip('\n') for line in file)
            header_dict.update(file_counter)
            for value in header_dict.itervalues():
                write_list.append(value)
            csvwriter.writerow(write_list)

outfile.close()

Just don't extend the list then. Use itertools.chain to chain the list and the set together instead.

Instead of this:

header_list.extend(header_set)  # memory error occurred here

do this (csv.writer.writerow accepts any iterable):

headers = itertools.chain(header_list, header_set)
...
csvwriter.writerow(headers)

That should at least avoid the memory issue you're seeing.
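For illustration, here is a minimal, self-contained sketch of that header-writing step; the meta_headers and ngram_headers names (and their contents) are made-up stand-ins for your header_list and header_set:

import csv
import itertools

# Hypothetical stand-ins: an ordered list of metadata columns and a set of distinct ngrams
meta_headers = ['file_name', 'label', 'source']
ngram_headers = {'abcd', 'efgh', 'ijkl'}

with open('model.csv', 'w') as outfile:
    csvwriter = csv.writer(outfile)
    # chain() yields items lazily, so no combined header list is ever built in memory;
    # writerow() consumes the chain directly because it accepts any iterable
    csvwriter.writerow(itertools.chain(meta_headers, ngram_headers))

The frequency rows can then still be written one at a time inside your per-file loop, so nothing beyond header_set itself needs to stay resident.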

