Blank output from Python Hadoop mapper
My input text looks like this, repeated a kabillion times:
value1 | foo="bar" value2 | value3
I wrote a basic mapper in Python for a basic streaming job:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.replace('foo=', '')
    line = line.replace('"', '')  # kills double-quotes
    print line  # alternatively, have tried print >>sys.stdout, line
I run the job like this; it runs without error, but the output file is empty:
bin/hadoop jar contrib/streaming/hadoop-streaming.jar -file ~/mapper1.py -mapper mapper1.py -input hdfs:///rawdata/0208head.txt -output hdfs:///rawdata/clean0208.txt
I assumed that, without a reducer, whatever the mapper prints would go to the output file. I suspect the print command is writing its output to the memory of each Java VM and that, without an explicit way to write it back, it dies in the VM.
I also wrote a basic reducer that took sys.stdin and printed to sys.stdout, as in the "# alternatively" comment above. That didn't work either.
Any guidance is welcome. Thanks.
I followed the steps below to execute the Hadoop streaming job:
1) First I created a text file called head.txt that contains the line you mentioned:
value1 | foo="bar" value2 | value3
2) I saved the file and put it into HDFS using:
hadoop fs -put /head.txt /head.txt
3) I copy-pasted your Python code into a mapper.py file and, after saving it, copied it to HDFS:
hadoop fs -put /mapper.py /mapper.py
4) I executed the Hadoop streaming command below:
hadoop jar /opt/hadoop/lib/hadoop-streaming-1.0.3.jar -D mapred.reduce.tasks=0 -file /mapper.py -mapper mapper.py -input /head.txt -output /out.txt
/opt/hadoop/lib/ is my Hadoop library path; you can put your own path here. If you have set HADOOP_HOME in your .bashrc file, there is no need to write bin/hadoop. Otherwise you can write bin/hadoop when executing the job.
This should work for you. I got the following output in the out.txt file:
value1 | bar value2 | value3
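Before submitting the job, it can help to check the mapper logic locally. The following is only a minimal sketch, assuming Python 3 (the mapper in the question uses Python 2 print syntax); the names clean_line and run_mapper are hypothetical helpers made up for illustration. It feeds the sample line through the same two replacements, using an in-memory stream in place of the stdin that Hadoop streaming would provide:

```python
import io


def clean_line(line):
    """Apply the same two replacements as the mapper to one input line."""
    line = line.rstrip("\n")       # drop the newline so exactly one is added back
    line = line.replace('foo=', '')
    return line.replace('"', '')   # remove double quotes


def run_mapper(stream_in, stream_out):
    """Read lines from stream_in and write the cleaned lines to stream_out."""
    for line in stream_in:
        stream_out.write(clean_line(line) + "\n")


# Stand-in for the lines Hadoop streaming would pipe to the mapper on stdin.
sample = io.StringIO('value1 | foo="bar" value2 | value3\n')
result = io.StringIO()
run_mapper(sample, result)
print(result.getvalue(), end="")
```

In an actual mapper script the last four lines would be replaced by a call such as run_mapper(sys.stdin, sys.stdout). Stripping the trailing newline before writing avoids the blank line that print adds after each record when the input line still ends in a newline.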