Output blank from Python Hadoop mapper


My input is text like this, repeated a kabillion times:

value1 | foo="bar" value2 | value3 

I wrote a basic mapper in Python for a basic streaming job:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.replace('foo=', '')
    line = line.replace('"', '')  # kills double-quotes
    print line
    # alternatively, I have tried: print >>sys.stdout, line

I run the job like this; it runs without error, but the output file is empty:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar -file ~/mapper1.py -mapper mapper1.py -input hdfs:///rawdata/0208head.txt -output hdfs:///rawdata/clean0208.txt 

I assumed that, without a reducer, whatever the mapper prints would end up in the output file. I suspect the print command is writing its output to the memory of each Java VM and, without an explicit way to write it back, the output dies in that VM.

I also wrote a basic reducer that read from sys.stdin and printed to sys.stdout, as in the "# alternatively" comment above. That didn't work either.

Any guidance is welcome. Thanks.

I followed the steps below to execute the Hadoop streaming job:

1) First, I created a text file called head.txt that contains the line you mentioned:

value1 | foo="bar" value2 | value3 

2) I saved the file and put it into HDFS using:

hadoop fs -put /head.txt /head.txt 

3) I copy-pasted your Python code into a mapper.py file and, after saving it, copied it to HDFS:

hadoop fs -put /mapper.py /mapper.py 

4) I executed the following Hadoop streaming command:

hadoop jar /opt/hadoop/lib/hadoop-streaming-1.0.3.jar -D mapred.reduce.tasks=0 -file /mapper.py -mapper mapper.py -input /head.txt -output /out.txt

/opt/hadoop/lib/ is my Hadoop library path; you can substitute your own path here. If you have set HADOOP_HOME in your .bashrc file, there is no need to write bin/hadoop; otherwise, you can use bin/hadoop to execute the job.

This should work for you. I got the following output in the out.txt file:

value1 | bar value2 | value3 
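The same transformation can be checked locally, without Hadoop, before submitting the job. The following is only a minimal sketch in Python 3 (the mapper in the question uses Python 2 print syntax); the names clean_line and run_mapper are hypothetical helpers introduced here for illustration and are not part of the original mapper. It also uses write() instead of print, because each input line already ends with a newline and print would add a second one.

```python
import io
import sys


def clean_line(line):
    """Drop the literal 'foo=' and every double quote, as the mapper does."""
    return line.replace('foo=', '').replace('"', '')


def run_mapper(stream_in, stream_out):
    """Read lines from stream_in and write the cleaned lines to stream_out."""
    for line in stream_in:
        # Each line already ends with '\n', so write() avoids the extra
        # blank line that print would add after every record.
        stream_out.write(clean_line(line))


# Local check with in-memory streams (hypothetical test harness); in a
# streaming job the call would be run_mapper(sys.stdin, sys.stdout).
sample = io.StringIO('value1 | foo="bar" value2 | value3 \n')
result = io.StringIO()
run_mapper(sample, result)
output = result.getvalue()
sys.stdout.write(output)
```

If the local check prints the cleaned line but the job output is still empty, the problem is more likely in how the job is launched than in the mapper logic itself.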
