python - Parsing links and strings from unstructured HTML data -

I have an HTML string that looks like this:

    <p>
        type: <a href="wee.html">tough</a><br />
        main type: <a href='abnormal.html'>abnormal</a><br />
        wheel: <a href='none.html'>none</a>,<a href='squared.html'>squared</a>,<a href='triangle.html'>triangled</a><br />
        movement type: <a href=forward.html">forward</a><br />
        level: <a href="beginner.html">beginner</a><br />
        sport: <a href="no.html">no</a><br/>force: <a href="pull.html">pull</a><br/>
        <span style="float:left;">your rating:&nbsp;</span> <div id="headersmallrating" style="float:left; line-height:20px;"><a href="rate.html">login rate</a></div><br />
    </p>

In other words, it is unstructured. I want to be able to first detect the label strings such as type and main type, plus the links that follow them (and the link text). I have tried detecting the words with a regular expression, but that does not serve the purpose. How does one handle this kind of dodgy data?

If you know the categories beforehand (type, force, etc.), it's perhaps easier to prepare a list of them in advance.

Code:

    from bs4 import BeautifulSoup as bsoup
    import re

    # Parse the saved HTML; the file is closed when the block ends.
    with open("test.html", "rb") as ofile:
        soup = bsoup(ofile)

    categories = ["type:", "main type:", "wheel:", "movement type:", "level:", "sport:", "force:"]
    for category in categories:
        # Find the first text node containing the label; its next sibling is the <a> tag.
        f = soup.find(text=re.compile(category)).next_sibling
        string = f.get_text()
        ref = f.get("href")
        print("%s %s (%s)" % (category, string, ref))

Result:

    type: tough (wee.html)
    main type: abnormal (abnormal.html)
    wheel: none (none.html)
    movement type: forward (forward.html)
    level: beginner (beginner.html)
    sport: no (no.html)
    force: pull (pull.html)
    [finished in 0.2s]
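One caveat about the lookup above: `re.compile("type:")` also matches the text nodes for "main type:" and "movement type:". It gives the right answer here only because `find` returns the first match in document order and the plain "type:" label happens to come first. Here is a minimal sketch of a stricter lookup, assuming each label sits at the start of its text node (apart from whitespace). The helper name `find_label`, the shortened reordered sample and the explicit `"html.parser"` argument are illustrative assumptions, not part of the original code.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample with "main type:" placed BEFORE "type:" (an assumption
# made here to show the difference between the two lookups).
html = ('<p> main type: <a href="abnormal.html">abnormal</a><br />'
        ' type: <a href="wee.html">tough</a><br /></p>')
soup = BeautifulSoup(html, "html.parser")


def find_label(soup, label):
    """Return the first text node that starts with `label`, ignoring leading whitespace."""
    pattern = re.compile(r"^\s*" + re.escape(label))
    return soup.find(string=pattern)


# Unanchored pattern: the "main type:" text node matches first.
loose = soup.find(string=re.compile("type:")).next_sibling
# Anchored pattern: only the plain "type:" text node matches.
strict = find_label(soup, "type:").next_sibling

print("loose:  %s (%s)" % (loose.get_text(), loose.get("href")))
print("strict: %s (%s)" % (strict.get_text(), strict.get("href")))
```

With the anchored pattern, the order of the labels in the page no longer matters.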

Let me know if this helps.

Edit:

This version handles wheel when it has multiple links after it.

Code:

    from bs4 import BeautifulSoup as bsoup, Tag
    import re

    # Parse the saved HTML; the file is closed when the block ends.
    with open("unstructured.html", "rb") as ofile:
        soup = bsoup(ofile)

    categories = ["type:", "main type:", "wheel:", "movement type:", "level:", "sport:", "force:"]
    for category in categories:
        wheel_list = []
        f = soup.find(text=re.compile(category)).next_sibling
        if category != "wheel:":
            # Single link: print its text and href.
            string = f.get_text()
            ref = f.get("href")
            print("%s %s (%s)" % (category, string, ref))
        else:
            # Several links in a row: keep stepping to the next tag while it is an <a>.
            while f is not None and f.name == "a":
                content = f.contents[0]
                res = f.get("href")
                wheel_list.append("%s (%s)" % (content, res))
                f = f.find_next()
            ref = ", ".join(wheel_list)
            print("%s %s" % (category, ref))

Result:

    type: tough (wee.html)
    main type: abnormal (abnormal.html)
    wheel: none (none.html), squared (squared.html), triangled (triangle.html)
    movement type: forward (forward.html)
    level: beginner (beginner.html)
    sport: no (no.html)
    force: pull (pull.html)
    [finished in 0.3s]
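The special case for "wheel:" could probably be avoided by treating every category the same way: collect all the links that follow the label until the next `<br>`. The following is a rough sketch of that idea, not a drop-in replacement. It assumes every category is terminated by a `<br>` tag, as in the sample. The function name `collect_links`, the shortened inline sample and the explicit `"html.parser"` argument are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened inline sample based on the HTML in the question.
html = ('<p> type: <a href="wee.html">tough</a><br />'
        ' wheel: <a href="none.html">none</a>,'
        '<a href="squared.html">squared</a>,'
        '<a href="triangle.html">triangled</a><br />'
        ' force: <a href="pull.html">pull</a><br/></p>')
soup = BeautifulSoup(html, "html.parser")


def collect_links(soup, categories):
    """Map each category label found to a list of (link text, href) pairs."""
    found = {}
    for category in categories:
        # Anchor the pattern so "type:" does not match "main type:".
        label = soup.find(string=re.compile(r"^\s*" + re.escape(category)))
        if label is None:
            continue  # label not present in this document
        links = []
        for node in label.next_siblings:
            # Text nodes such as "," have no tag name; treat them as None.
            name = getattr(node, "name", None)
            if name == "br":
                break  # a <br> ends the current category
            if name == "a":
                links.append((node.get_text(strip=True), node.get("href")))
        found[category] = links
    return found


result = collect_links(soup, ["type:", "wheel:", "force:", "level:"])
for category, links in result.items():
    print(category, ", ".join("%s (%s)" % pair for pair in links))
```

Returning a dictionary rather than printing directly keeps the parsing separate from the output, and a label that is missing from the page (here "level:") is skipped instead of raising an `AttributeError`.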

Let me know if this helps.
