python - Finding patterns in two files
I have to analyse 10 years' worth of data, with 50+ files for each year. I extracted the data from the internet and have extracted the text using regular expressions. The format of the files differs each year, and I'm not sure the pattern is consistent even within the files of an individual year. The format for 2003 seems to be:
title (.*)
[header(.*)
color number number number string (\w+\s\d+\s\d+\s\d+\s.+)
color number number number string
color number number number string
color number number number string]<==== 1 block
header
color number number number string
color number number number string
color number number number string
color number number number string
........
My question is: is there a way to program Python to identify the patterns within the text files of a given year?
A kind of pattern recognition, where the program outputs a regular expression that matches one block of data, perhaps.
I'll be using the data for linear algebra, but I also want the data to be accessible and organized for other uses.
If it's possible, maybe you should go simpler and check whether each line of each block of data has the same length when split on a space (or tab, or whatever token divides each column). From there you can create a tree of the data, like:
{title: {block: [[color, number, number, number, string], [color, number, number, number, string]]}, title: ...}
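A minimal sketch of that split-and-check idea, assuming the 2003 layout shown in the question: the first non-empty line is the title, a data row is a colour followed by three non-negative integers and free text, and any other line starts a new block. The names `split_row`, `is_data_row`, and `build_tree` are made up for illustration, and the header text is used as the block key.

```python
def split_row(line, sep=None):
    """Split into at most 5 fields so a trailing string with spaces stays whole."""
    return line.split(sep, 4)


def is_data_row(fields):
    """Assumed row shape: colour, three non-negative integers, then free text."""
    return len(fields) == 5 and all(f.isdecimal() for f in fields[1:4])


def build_tree(text, sep=None):
    """Return {title: {header: [[colour, int, int, int, string], ...]}}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {}
    title = lines[0]          # assumption: the first non-empty line is the title
    blocks = {}
    header = None
    for line in lines[1:]:
        fields = split_row(line, sep)
        if is_data_row(fields):
            if header is None:
                raise ValueError("data row before any header: %r" % line)
            color, a, b, c, label = fields
            blocks[header].append([color, int(a), int(b), int(c), label])
        else:
            # Anything that does not look like a data row opens a new block.
            header = line
            blocks.setdefault(header, [])
    return {title: blocks}


sample = """Results 2003
Group A
red 1 2 3 first item
blue 4 5 6 second
Group B
green 7 8 9 third item here
"""
tree = build_tree(sample)
```

A file whose rows do not fit the assumed shape ends up with many one-line "headers" and empty blocks, which is a cheap signal that the format for that year differs.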
If the data is too irregular for that, you could try third-party libraries to either (1) clean up the HTML you are scraping the data from or (2) use natural language processing to tokenize/parse the data, but that seems like overkill.
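Short of those heavier tools, the idea of having the program output a regular expression can be approximated with a simple heuristic: reduce each token in a row to a character class and check that every row in a block shares the same shape. This is only a sketch, not real pattern recognition. The helper names `token_pattern` and `infer_row_regex` are hypothetical, and the number of fixed columns before the free-text field has to be supplied by hand.

```python
import re


def token_pattern(token):
    """Map one token to the narrowest of three assumed classes."""
    if token.isdecimal():
        return r"\d+"
    if re.fullmatch(r"\w+", token):
        return r"\w+"
    return r"\S+"


def infer_row_regex(rows, fixed_columns):
    """Build one regex for rows that share the same leading token shape.

    The first `fixed_columns` tokens are generalised; the rest of the line
    is treated as a single free-text field.
    """
    signatures = set()
    for row in rows:
        tokens = row.split(None, fixed_columns)
        if len(tokens) <= fixed_columns:
            raise ValueError("row has too few fields: %r" % row)
        signatures.add(tuple(token_pattern(t) for t in tokens[:fixed_columns]))
    if len(signatures) != 1:
        raise ValueError("rows do not share one shape: %r" % sorted(signatures))
    (signature,) = signatures
    fixed = r"\s+".join("(%s)" % p for p in signature)
    return "^" + fixed + r"\s+(.+)$"


pattern = infer_row_regex(
    ["red 1 2 3 first item", "blue 10 20 30 x"],
    fixed_columns=4,
)
match = re.match(pattern, "green 7 8 9 third item")
```

Running this on one block per file and comparing the resulting patterns across files gives a quick way to see whether a given year's files really share a format.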