Python - Scraping a site to move data to multiple CSV columns
I am scraping a page for multiple categories and writing them to a CSV. I am succeeding in getting the first category into a column, but the second column's data is not being written to the CSV. The code I am using:
    import urllib2
    import csv
    from bs4 import BeautifulSoup

    url = "http://digitalstorage.journalism.cuny.edu/sandeepjunnarkar/tests/jazz.html"
    page = urllib2.urlopen(url)
    soup_jazz = BeautifulSoup(page)

    all_years = soup_jazz.find_all("td", class_="views-field views-field-year")
    all_category = soup_jazz.find_all("td", class_="views-field views-field-category-code")

    with open("jazz.csv", 'w') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([u'year won', u'category'])
        for years in all_years:
            year_won = years.string
            if year_won:
                csv_writer.writerow([year_won.encode('utf-8')])
        for categories in all_category:
            category_won = categories.string
            if category_won:
                csv_writer.writerow([category_won.encode('utf-8')])
It's writing the column headers, but not category_won to the second column.
Based on a suggestion, I have compiled it to read:
    with open("jazz.csv", 'w') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([u'year won', u'category'])
    for years, categories in zip(all_years, all_category):
        year_won = years.string
        category_won = categories.string
        if year_won and category_won:
            csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
But I am getting the following error:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
    ValueError: I/O operation on closed file
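That traceback is what Python raises when a write happens after the with block has already closed the file. A minimal sketch of both cases follows, assuming the loop ended up outside the with block. It is written for Python 3 (the code in this thread is Python 2), and the file name, the helper names and the sample values are made up for illustration.

```python
import csv
import os
import tempfile


def write_pairs(path, years, categories):
    # Every writerow() call sits inside the with block, so the file is
    # still open while the rows are written.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["year won", "category"])
        for year, category in zip(years, categories):
            writer.writerow([year, category])


def write_after_close(path):
    # Reproduces the error: writerow() runs after the with block has ended,
    # so the underlying file is already closed.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
    try:
        writer.writerow(["2012", "too late"])
    except ValueError as exc:
        return str(exc)
    return None


# Hypothetical sample data standing in for the scraped cells.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_pairs(path, ["2012", "2011"], ["solo", "vocal album"])

with open(path, newline="") as f:
    rows = list(csv.reader(f))

error_message = write_after_close(path)
```

Here rows holds the header plus one row per zipped pair, while error_message carries the "closed file" text from the failed write.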
You can zip() the 2 lists together:
    for years, categories in zip(all_years, all_category):
        year_won = years.string
        category_won = categories.string
        if year_won and category_won:
            csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
Unfortunately, the HTML page is broken, and you cannot search for the table rows the way you'd expect to be able to.
The next best thing is to search for the years, then find the sibling cells:
    soup_jazz = BeautifulSoup(page)

    with open("jazz.csv", 'w') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([u'year won', u'category'])

        for year_cell in soup_jazz.find_all('td', class_='views-field-year'):
            year = year_cell and year_cell.text.strip().encode('utf8')
            if not year:
                continue

            # first following sibling that is a <td> with the category class
            category = next((e for e in year_cell.next_siblings
                             if getattr(e, 'name', None) == 'td' and
                                'views-field-category-code' in e.attrs.get('class', [])),
                            None)
            category = category and category.text.strip().encode('utf8')
            if year and category:
                csv_writer.writerow([year, category])
This produces:
    year won,category
    2012,best improvised jazz solo
    2012,best jazz vocal album
    2012,best jazz instrumental album
    2012,best large jazz ensemble album
    ....
    1960,best jazz composition of more than 5 minutes duration
    1959,best jazz performance - soloist
    1959,best jazz performance - group
    1958,"best jazz performance, individual"
    1958,"best jazz performance, group"
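The category lookup in the code above relies on next() with a generator expression and a default value, which returns the first matching sibling, or None when there is no match. A minimal sketch of that idiom on plain Python data follows. The first_match and is_category helpers and the (tag name, class list) tuples are hypothetical stand-ins for BeautifulSoup elements, not part of any library.

```python
def first_match(items, predicate):
    # next() pulls the first item out of the generator; the second argument
    # is returned instead of raising StopIteration when nothing matches.
    return next((item for item in items if predicate(item)), None)


def is_category(sibling):
    # A sibling qualifies when it is a td carrying the category class.
    name, classes = sibling
    return name == "td" and "views-field-category-code" in classes


# Hypothetical siblings of a year cell, as (tag name, class list) pairs.
siblings = [
    (None, []),  # stands in for a whitespace text node, which has no tag name
    ("td", ["views-field", "views-field-title"]),
    ("td", ["views-field", "views-field-category-code"]),
]

found = first_match(siblings, is_category)
missing = first_match(siblings[:2], is_category)
```

Because a missing match comes back as None, the follow-up `category = category and ...` line in the code above can skip the text extraction without raising an AttributeError.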