Python split() not working as expected for first line in file
I have a large text file of data-mined opinions, each classified as positive, negative, neutral, or mixed. Every line begins with "+ ", "- ", "= ", or "* ", which correspond to these classifiers. Additionally, lines that begin with "!! " represent a comment and should be ignored.
Below is a simple Python script that is supposed to count each of the classifiers and ignore comment lines:
    classes = [0, 0, 0, 0]  # counts for "+", "-", "=", "*"
    f = open("all_classified.txt")
    for i, line in enumerate(f):
        line = line.strip()
        classifier = line.split(" ")[0]
        if classifier == "+":
            classes[0] += 1
        elif classifier == "-":
            classes[1] += 1
        elif classifier == "=":
            classes[2] += 1
        elif classifier == "*":
            classes[3] += 1
        elif classifier == "!!":
            continue  # comment line, ignore it
        else:
            print "line "+str(i+1)+": \""+line+"\""
    f.close()
    print classes
Here is a sample of the first 5 lines of "all_classified.txt":
    !! group 1, 1001 - 1512
    = 1001//cd titletitle//nnp how//wrb many//jj conditioners/conditioner/nns do//vbp you//prp have//vbp ?//.
    = 1002//cd i//prp have//vbp two//cd different//jj kinds/kind/nns ,//, garnier//nnp fructis//nnp triple//nnp nutrition//nnp conditioner//nn ,//, and//cc suave//nnp coconut//nn .//.
    = 1003//cd but//cc i//prp think//vbp i//prp have//vbp about//in 8//cd bottles/bottle/nns of//in the//dt suave//nnp coconut//nn my//prp$ mom//nn gave/give/vbd me//prp a//dt bunch//nn for//in christmas//nnp because//in she//prp was/be/vbd getting/get/vbg tired/tire/vbn of//in me//prp saying/say/vbg i//prp was/be/vbd out//in
    = 1004//cd titletitle//nnp need//vb a//dt gel//nn that//in works/work/nns ,//, 8//cd mo//nn ,//, post//nn ,//, ready//jj to//to relax//vb edges/edge/nns ,//, help//nnp ,//,
For whatever reason, the script hits the else branch during the first iteration, as if it does not recognize the "!!", and I am not sure why. I am getting this output:
    line 1: "!! group 1, 1001 - 1512"
    [2986, 1034, 16278, 535]
Additionally, if I delete the first line from "all_classified.txt", the script does not recognize the "=" of the new first line. I am not sure what has to be done to get the first line recognized as expected.
Edit (again): As Peter asked, here is the output if I replace

    else:
        print "line "+str(i+1)+": \""+line+"\""

with

    else:
        print "classifier "+classifier+ " line "+str(i+1)+": \""+line+"\""

:

    classifier !! line 1: "!! group 1, 1001 - 1512"
    [2986, 1034, 16278, 535]
Edit: Here is the first section of the file using xxd all_classified.txt:

    0000000: efbb bf21 2120 4752 4f55 5020 312c 2031  ...!! GROUP 1, 1
    0000010: 3030 3120 2d20 3135 3132 0d0a 3d20 3130  001 - 1512..= 10
    0000020: 3031 2f2f 4344 2054 4954 4c45 5449 544c  01//CD TITLETITL
    0000030: 452f 2f4e 4e50 2048 6f77 2f2f 5752 4220  E//NNP How//WRB 
I suspect that your input file isn't what it seems. For example, classifier could contain control characters that are not shown when you print it (but which do affect the comparison):

    >>> classifier = '!\0!'
    >>> print classifier
    !!
    >>> classifier == '!!'
    False
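One way to make such hidden characters visible is to print the repr() of the string rather than the string itself. The snippet below is a minimal sketch that reuses the made-up value from the session above; it is not taken from the actual file.

```python
# Hypothetical value standing in for what split() returned on the first line.
classifier = "!\0!"

# repr() escapes non-printable characters instead of writing them to the
# terminal, so the embedded NUL shows up as \x00.
shown = repr(classifier)
print(shown)  # '!\x00!'

# The comparison fails even though a plain print looks like "!!".
looks_clean = (classifier == "!!")
print(looks_clean)  # False
```

Printing repr(classifier) in the else branch of the script would expose any stray bytes in the same way.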
Edit: There you have it:

    0000000: efbb bf21 2120
             ^^^^ ^^

It's the UTF-8 BOM, which becomes part of your classifier.
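To see why the comparison fails, here is a small sketch at the byte level. The byte string is just the first six bytes of the hex dump (ef bb bf 21 21 20); the variable names are made up for illustration.

```python
import codecs

# First six bytes of the file as shown by xxd: BOM, "!!", and a space.
first_bytes = b"\xef\xbb\xbf!! "

# codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
has_bom = first_bytes.startswith(codecs.BOM_UTF8)

# Splitting on a space keeps the BOM glued to the front of the first token,
# so the token is five bytes long rather than two.
classifier = first_bytes.split(b" ")[0]

print(has_bom)               # True
print(classifier == b"!!")   # False
print(len(classifier))       # 5
```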
Try opening the file using codecs.open() with the "utf-8-sig" encoding (see, for example, https://stackoverflow.com/a/13156715/367273).
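As a rough sketch of the difference the encoding makes, the snippet below writes a tiny made-up sample with a BOM to a temporary file and reads it back twice. It uses io.open() rather than codecs.open(), which is an assumption on my part: it accepts the same encoding argument and behaves the same on Python 2 and 3. The helper name first_classifiers and the sample contents are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical two-line sample; the first three bytes are the UTF-8 BOM.
sample = b"\xef\xbb\xbf!! group 1, 1001 - 1512\r\n= 1001//cd titletitle//nnp\r\n"

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(sample)


def first_classifiers(path, encoding):
    """Return the first space-separated token of every line in the file."""
    with io.open(path, encoding=encoding) as f:
        return [line.strip().split(" ")[0] for line in f]


# With plain "utf-8" the BOM is decoded as U+FEFF and stays in the first token.
plain = first_classifiers(path, "utf-8")
# With "utf-8-sig" the BOM is dropped while decoding.
stripped = first_classifiers(path, "utf-8-sig")
os.remove(path)

print(plain[0] == u"!!")     # False
print(stripped[0] == u"!!")  # True
```

Note that strip() does not help here, because U+FEFF is not treated as whitespace.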