Python - Scraping a site to move data into multiple CSV columns


I am scraping a page with multiple categories into a CSV file. I am succeeding in getting the first category into a column, but the second column's data is not being written to the CSV. This is the code I am using:

    import urllib2
    import csv
    from bs4 import BeautifulSoup

    url = "http://digitalstorage.journalism.cuny.edu/sandeepjunnarkar/tests/jazz.html"
    page = urllib2.urlopen(url)
    soup_jazz = BeautifulSoup(page)
    all_years = soup_jazz.find_all("td", class_="views-field views-field-year")
    all_category = soup_jazz.find_all("td", class_="views-field views-field-category-code")

    with open("jazz.csv", 'w') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([u'year won', u'category'])
        for years in all_years:
            year_won = years.string
            if year_won:
                csv_writer.writerow([year_won.encode('utf-8')])
        for categories in all_category:
            category_won = categories.string
            if category_won:
                csv_writer.writerow([category_won.encode('utf-8')])

It's writing the column headers, but not category_won in the second column.

Based on the suggestion, I have changed the code to read:

    with open("jazz.csv", 'w') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([u'year won', u'category'])
    for years, categories in zip(all_years, all_category):
        year_won = years.string
        category_won = categories.string
        if year_won and category_won:
            csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])

But now I am getting the following error:

        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
    ValueError: I/O operation on closed file
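That traceback is consistent with the loop sitting outside the with block: once the block exits, the file is closed, and any later writerow() call fails. Here is a minimal sketch of the effect and of the fix. It is written for Python 3 and uses a temporary file; the function names, the path, and the row values are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows_after_close(path):
    """Reproduce the error: writerow() is called after the with block ends."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["year won", "category"])
    # Dedented: the with block has already closed f at this point.
    try:
        writer.writerow(["2012", "example category"])
    except ValueError as exc:
        return str(exc)
    return None


def write_rows_inside_with(path):
    """The fix: keep every writerow() call indented under the with block."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["year won", "category"])
        writer.writerow(["2012", "example category"])
    # Read the file back to confirm both rows were written.
    with open(path, newline="") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "demo.csv")
error_message = write_rows_after_close(path)
rows = write_rows_inside_with(path)
print(error_message)
print(rows)
```

Indenting the for loop so it sits under the with statement should be enough to make the error go away in the code from the question.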

You could zip() the two lists together:

    for years, categories in zip(all_years, all_category):
        year_won = years.string
        category_won = categories.string
        if year_won and category_won:
            csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
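To see what zip() does here without hitting the site, the sketch below pairs two plain lists and writes them as two-column rows into an in-memory buffer. It is written for Python 3 (so there are no .encode() calls), and the helper name pair_columns and the sample values are made up for illustration. Note that zip() stops at the shorter input, so if the two lists differ in length the extra items are dropped.

```python
import csv
import io


def pair_columns(years, categories):
    """Write (year, category) pairs as CSV rows and return the CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["year won", "category"])
    # zip() yields one pair per position and stops at the shorter list.
    for year, category in zip(years, categories):
        if year and category:
            writer.writerow([year, category])
    return buf.getvalue()


# Illustrative data only; the fourth category has no matching year.
years = ["2012", "2012", "1958"]
categories = ["category a", "category b", "category, c", "unpaired extra"]

output = pair_columns(years, categories)
print(output)
```

The pairing is purely positional, so it only gives correct rows when both lists line up cell for cell.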

Unfortunately, the HTML page is broken, and you cannot search for table rows the way you'd expect to be able to.

The next best thing is to search for the years, then find the sibling cells:

    soup_jazz = BeautifulSoup(page)
    with open("jazz.csv", 'w') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([u'year won', u'category'])
        for year_cell in soup_jazz.find_all('td', class_='views-field-year'):
            year = year_cell and year_cell.text.strip().encode('utf8')
            if not year:
                continue
            # First following sibling that is a <td> with the category class;
            # None if there is no such cell.
            category = next((e for e in year_cell.next_siblings
                             if getattr(e, 'name', None) == 'td' and
                                 'views-field-category-code' in e.attrs.get('class', [])),
                            None)
            category = category and category.text.strip().encode('utf8')
            # Only write rows where both cells were found and are non-empty.
            if year and category:
                csv_writer.writerow([year, category])
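The next(generator, None) idiom in the code above returns the first matching sibling, or None when nothing matches. The sketch below shows the same idiom with plain Python stand-ins for the sibling nodes. The Node tuple, the first_category helper, and the views-field-title class are hypothetical and not part of BeautifulSoup or the page.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed node. Real BeautifulSoup siblings are a
# mix of Tag objects and bare strings (e.g. whitespace between cells).
Node = namedtuple("Node", ["name", "classes", "text"])


def first_category(siblings):
    """Return the first <td> stand-in carrying the category class, else None."""
    return next(
        (node for node in siblings
         if getattr(node, "name", None) == "td"
         and "views-field-category-code" in node.classes),
        None,  # default returned when the generator yields nothing
    )


siblings = [
    "\n",  # bare string: it has no .name, so getattr() falls back to None
    Node("td", ["views-field", "views-field-title"], "some title"),
    Node("td", ["views-field", "views-field-category-code"], "some category"),
]

match = first_category(siblings)
no_match = first_category(["\n"])
print(match.text)
print(no_match)
```

Passing a default to next() avoids a StopIteration when a year cell has no category sibling, which is why the loop can simply skip that row.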

This produces:

    year won,category
    2012,best improvised jazz solo
    2012,best jazz vocal album
    2012,best jazz instrumental album
    2012,best large jazz ensemble album
    ....
    1960,best jazz composition of more 5 minutes duration
    1959,best jazz performance - soloist
    1959,best jazz performance - group
    1958,"best jazz performance, individual"
    1958,"best jazz performance, group"
