Python split() not working as expected for first line in file


I have a large text file of data-mined opinions, each classified as positive, negative, neutral, or mixed. Every line begins with "+ ", "- ", "= ", or "* ", corresponding to these classifiers. Additionally, lines that begin with "!! " represent a comment and should be ignored.

Below is a simple Python script that is supposed to count each of the classifiers and ignore comment lines:

    classes = [0, 0, 0, 0]  # "+", "-", "=", "*"

    f = open("all_classified.txt")
    for i, line in enumerate(f):
        line = line.strip()
        classifier = line.split(" ")[0]

        if classifier == "+": classes[0] += 1
        elif classifier == "-": classes[1] += 1
        elif classifier == "=": classes[2] += 1
        elif classifier == "*": classes[3] += 1
        elif classifier == "!!": continue
        else: print "line "+str(i+1)+": \""+line+"\""
    f.close()

    print classes

Here is a sample of the first 5 lines of "all_classified.txt":

    !! group 1, 1001 - 1512
    = 1001//cd titletitle//nnp how//wrb many//jj conditioners/conditioner/nns do//vbp you//prp have//vbp ?//.
    = 1002//cd i//prp have//vbp two//cd different//jj kinds/kind/nns ,//, garnier//nnp fructis//nnp triple//nnp nutrition//nnp conditioner//nn ,//, and//cc suave//nnp coconut//nn .//.
    = 1003//cd but//cc i//prp think//vbp i//prp have//vbp about//in 8//cd bottles/bottle/nns of//in the//dt suave//nnp coconut//nn my//prp$ mom//nn gave/give/vbd me//prp a//dt bunch//nn for//in christmas//nnp because//in she//prp was/be/vbd getting/get/vbg tired/tire/vbn of//in me//prp saying/say/vbg i//prp was/be/vbd out//in
    = 1004//cd titletitle//nnp need//vb a//dt gel//nn that//in works/work/nns ,//, 8//cd mo//nn ,//, post//nn ,//, ready//jj to//to relax//vb edges/edge/nns ,//, help//nnp ,//,

For whatever reason, the output shows the else branch being triggered during the first iteration, as if "!!" were not recognized, and I'm not sure why. I'm getting this output:

    line 1: "!! group 1, 1001 - 1512"
    [2986, 1034, 16278, 535]

Additionally, if I delete the first line of "all_classified.txt", the "=" at the start of the new first line is not recognized either. I'm not sure what has to be done for the first line to be recognized as expected.

Edit (again): As Peter asked, I replaced

    else: print "line "+str(i+1)+": \""+line+"\""

with

    else: print "classifier "+classifier+" line "+str(i+1)+": \""+line+"\""

and here is the output:

    classifier !! line 1: "!! group 1, 1001 - 1512"
    [2986, 1034, 16278, 535]

Edit: Here is the first section of the output of xxd all_classified.txt:

    0000000: efbb bf21 2120 4752 4f55 5020 312c 2031  ...!! GROUP 1, 1
    0000010: 3030 3120 2d20 3135 3132 0d0a 3d20 3130  001 - 1512..= 10
    0000020: 3031 2f2f 4344 2054 4954 4c45 5449 544c  01//CD TITLETITL
    0000030: 452f 2f4e 4e50 2048 6f77 2f2f 5752 4220  E//NNP How//WRB 

I suspect the input file isn't what it seems. For example, the classifier could contain control characters that are not shown when you print it (but that do affect the comparison):

    >>> classifier = '!\0!'
    >>> print classifier
    !!
    >>> classifier == '!!'
    False
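
One way to make such hidden characters visible is to print repr() of the value instead of the value itself. A minimal Python 3 sketch (the two sample strings are made up for illustration and are not taken from the file):

```python
# Hypothetical values: one contains a NUL byte, the other a leading U+FEFF.
suspects = ["!\0!", "\ufeff!!"]

for classifier in suspects:
    # repr() escapes non-printable characters instead of sending them raw
    # to the terminal, so the difference from "!!" becomes visible.
    print(repr(classifier), classifier == "!!")
```

Both values look like !! when printed directly on many terminals, yet neither compares equal to "!!".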

Edit: There you have it:

    0000000: efbb bf21 2120
             ^^^^ ^^

It's a UTF-8 BOM, which becomes part of the classifier.
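
The bytes ef bb bf are the UTF-8 encoding of U+FEFF. A small Python 3 sketch of how they end up glued to the first token (the byte string is retyped from the start of the hex dump; the variable names are illustrative):

```python
import codecs

# First bytes of the file, retyped from the hex dump.
raw = b"\xef\xbb\xbf!! GROUP 1, 1001 - 1512\r\n"

# The file starts with the UTF-8 byte order mark.
print(raw.startswith(codecs.BOM_UTF8))

# Plain "utf-8" keeps the BOM as the character U+FEFF, and strip() does not
# remove it because U+FEFF is not whitespace.
line = raw.decode("utf-8").strip()
classifier = line.split(" ")[0]
print(repr(classifier), classifier == "!!")

# "utf-8-sig" drops a leading BOM while decoding.
clean = raw.decode("utf-8-sig").strip().split(" ")[0]
print(repr(clean), clean == "!!")
```

This would also explain why deleting the first line does not help if the editor keeps the BOM: it is then attached to the "=" of the new first line instead.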

Try opening the file using codecs.open() with the "utf-8-sig" encoding (see, for example, https://stackoverflow.com/a/13156715/367273).
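
In Python 3 the built-in open() accepts the same encoding name, so codecs.open() is not needed there. A sketch of the counting script under that assumption (the function name count_classes and the small temporary demo file are illustrative and not part of the original script):

```python
import os
import tempfile

def count_classes(path):
    """Count lines by leading classifier; '!!' comment lines are skipped."""
    counts = {"+": 0, "-": 0, "=": 0, "*": 0}
    unknown = []  # (line number, text) for lines with an unrecognized prefix
    # "utf-8-sig" strips a leading BOM if present, otherwise acts like "utf-8".
    with open(path, encoding="utf-8-sig") as f:
        for number, line in enumerate(f, start=1):
            line = line.strip()
            classifier = line.split(" ")[0]
            if classifier == "!!":
                continue
            if classifier in counts:
                counts[classifier] += 1
            else:
                unknown.append((number, line))
    return counts, unknown

# Demo on a made-up file that starts with a BOM and uses CRLF line endings.
data = b"\xef\xbb\xbf!! GROUP 1\r\n= 1001 neutral\r\n+ 1002 positive\r\n- 1003 negative\r\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(data)
try:
    counts, unknown = count_classes(tmp.name)
    print(counts, unknown)
finally:
    os.remove(tmp.name)
```

With the BOM removed at decode time, the first line is recognized as a comment and nothing lands in the unknown list.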

