Python - Parsing links and strings from unstructured HTML data
I have an HTML string that looks like this:
<p> type: <a href="wee.html">tough</a><br /> main type: <a href='abnormal.html'>abnormal</a> <br /> wheel: <a href='none.html'>none</a>,<a href='squared.html'>squared</a>,<a href='triangle.html'>triangled</a> <br /> movement type: <a href=forward.html">forward</a><br /> level: <a href="beginner.html">beginner</a><br /> sport: <a href="no.html">no</a><br/>force: <a href="pull.html">pull</a><br/> <span style="float:left;">your rating: </span> <div id="headersmallrating" style="float:left; line-height:20px;"><a href="rate.html">login rate</a></div><br /> </p>
In other words, it is unstructured. I want to be able to first detect the strings "type", "main type", and so on, plus the links (and the link text). I have tried detecting the words with a regular expression, but that does not serve the purpose. How does one handle this kind of dodgy data?
If you know the categories beforehand, such as "type", "force", etc., it's perhaps easier to prepare a list in advance.
Code:

    from bs4 import BeautifulSoup as bsoup
    import re

    # Parse the saved HTML file.
    with open("test.html", "rb") as ofile:
        soup = bsoup(ofile, "html.parser")

    categories = ["type:", "main type:", "wheel:", "movement type:", "level:", "sport:", "force:"]

    for category in categories:
        # Find the text node containing the label; the link is its next sibling.
        f = soup.find(text=re.compile(category)).next_sibling
        string = f.get_text()
        ref = f.get("href")
        print("%s %s (%s)" % (category, string, ref))
Result:

    type: tough (wee.html)
    main type: abnormal (abnormal.html)
    wheel: none (none.html)
    movement type: forward (forward.html)
    level: beginner (beginner.html)
    sport: no (no.html)
    force: pull (pull.html)
    [Finished in 0.2s]
Let me know if this helps.
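To see why `.next_sibling` lands on the link: each label is a bare text node inside the `<p>`, and the `<a>` that follows it is the next node at the same level. Here is a minimal sketch on a cut-down fragment of the question's markup. The fragment and the variable names are only for illustration, and `string=` is the newer spelling of the `text=` argument (Beautiful Soup 4.4.0 and later).

```python
import re
from bs4 import BeautifulSoup

# Cut-down fragment of the question's markup, for illustration only.
html = (
    '<p> type: <a href="wee.html">tough</a><br />'
    ' level: <a href="beginner.html">beginner</a><br /></p>'
)
soup = BeautifulSoup(html, "html.parser")

# find() with a regex returns the first text node the pattern matches.
label = soup.find(string=re.compile("level:"))
# The node right after that text node, at the same level, is the <a> tag.
link = label.next_sibling

label_text = label.strip()
link_text = link.get_text()
link_href = link.get("href")
print(label_text, link_text, link_href)
```

If a label were followed by whitespace or other text before its link, `.next_sibling` would return that text instead of the tag, so this only holds while the link comes immediately after the label.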
Edit:
This will handle "wheel" if it has multiple elements after it.
Code:

    from bs4 import BeautifulSoup as bsoup
    import re

    # Parse the saved HTML file.
    with open("unstructured.html", "rb") as ofile:
        soup = bsoup(ofile, "html.parser")

    categories = ["type:", "main type:", "wheel:", "movement type:", "level:", "sport:", "force:"]

    for category in categories:
        f = soup.find(text=re.compile(category)).next_sibling
        if category != "wheel:":
            string = f.get_text()
            ref = f.get("href")
            print("%s %s (%s)" % (category, string, ref))
        else:
            # "wheel:" can be followed by several links; collect them all.
            wheel_list = []
            while f is not None and f.name == "a":
                content = f.contents[0]
                res = f.get("href")
                wheel_list.append("%s (%s)" % (content, res))
                f = f.find_next()  # next tag in the document, skipping the "," text
            ref = ", ".join(wheel_list)
            print("%s %s" % (category, ref))
Result:

    type: tough (wee.html)
    main type: abnormal (abnormal.html)
    wheel: none (none.html), squared (squared.html), triangled (triangle.html)
    movement type: forward (forward.html)
    level: beginner (beginner.html)
    sport: no (no.html)
    force: pull (pull.html)
    [Finished in 0.3s]
Let me know if this helps.
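If more categories than "wheel" might carry several links, the special case can probably be dropped by collecting every `<a>` that follows a label until some other tag (such as `<br>`) appears. The sketch below is one way to do that, not the only one. The function name `links_by_category` and the shortened sample string are made up for illustration, and it assumes each label sits in its own text node, as in the question's HTML.

```python
import re
from bs4 import BeautifulSoup, Tag


def links_by_category(html, categories):
    """Map each category label to a list of (link text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for category in categories:
        # Match the whole text node so "type:" does not also hit "main type:".
        pattern = re.compile(r"^\s*" + re.escape(category) + r"\s*$")
        label = soup.find(string=pattern)
        if label is None:
            continue  # label not present in this document
        links = []
        for node in label.next_siblings:
            if not isinstance(node, Tag):
                continue  # skip text such as "," or whitespace
            if node.name != "a":
                break  # any other tag (e.g. <br>) ends this category
            links.append((node.get_text(), node.get("href")))
        result[category] = links
    return result


# Shortened version of the question's markup, for illustration only.
html = (
    "<p> type: <a href=\"wee.html\">tough</a><br />"
    " main type: <a href='abnormal.html'>abnormal</a> <br />"
    " wheel: <a href='none.html'>none</a>,<a href='squared.html'>squared</a>,"
    "<a href='triangle.html'>triangled</a> <br />"
    " force: <a href=\"pull.html\">pull</a><br/></p>"
)
result = links_by_category(html, ["type:", "main type:", "wheel:", "force:"])
for category, links in result.items():
    print(category, ", ".join("%s (%s)" % pair for pair in links))
```

Returning a dictionary of lists rather than printing inside the loop keeps the single-link and multi-link cases on the same code path, so the caller can decide how to format them.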