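Hadoop is written in Java, so data analysis applications can be built as Java jar packages, but it also supports developing them in C++ or Python. Using the wordcount application as an example, this article walks through developing a Hadoop MapReduce big data analysis application with Python scripts.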
The code of mapper.py is shown below:

    #!/usr/bin/env python3
    import sys

    # Read lines from standard input and emit "<word> 1" for every word.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print("%s %s" % (word, 1))
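To check the mapper logic without a Hadoop cluster, the same tokenizing step can be wrapped in a function and run on an in-memory list. This is only a minimal sketch: the function name `map_lines` and the sample input are hypothetical and not part of the original script.

```python
def map_lines(lines):
    """Yield one (word, 1) pair per whitespace-separated word, as mapper.py does."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


if __name__ == "__main__":
    sample = ["hello hadoop", "hello python"]  # made-up input for illustration
    for word, count in map_lines(sample):
        print("%s %s" % (word, count))
```

Note that the mapper does no aggregation: a word that occurs twice is emitted twice, and summing is left entirely to the reducer.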
The code of reducer.py is as follows:

    #!/usr/bin/env python3
    import sys

    current_word = None
    current_count = 0

    # Input arrives sorted by word, so identical words are adjacent.
    for line in sys.stdin:
        line = line.strip()
        try:
            word, count = line.split(' ', 1)
            count = int(count)
        except ValueError:
            # Ignore blank lines and lines whose count is not a number.
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print("%s %s" % (current_word, current_count))
            current_count = count
            current_word = word

    # Do not forget to output the last word.
    if current_word is not None:
        print("%s %s" % (current_word, current_count))
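The reducer only compares each word with the previous one, which works because Hadoop Streaming sorts the mapper output before handing it to the reducer. The sketch below imitates that step locally with `sorted` and `itertools.groupby`. The function name `reduce_sorted_lines` and the sample data are assumptions made for illustration; this is not how Hadoop itself performs the shuffle.

```python
from itertools import groupby


def reduce_sorted_lines(lines):
    """Sum counts for runs of identical words in "word count" lines.

    Assumes the lines are already sorted by word.
    """
    parsed = []
    for line in lines:
        parts = line.strip().split(" ", 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        try:
            parsed.append((word, int(count)))
        except ValueError:
            continue  # skip lines whose count is not a number
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    mapper_output = ["hello 1", "python 1", "hello 1"]  # made-up sample
    # sorted() stands in for the sort that Hadoop performs between map and reduce.
    for word, total in reduce_sorted_lines(sorted(mapper_output)):
        print("%s %s" % (word, total))
```

If the input were not sorted, the same word could appear in several separate runs and would be reported more than once, which is the case the sort step prevents.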
mapper.py and reducer.py are placed in the directory /home/hadoop/.
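The files to be counted are in the folder /user/hadoop/input/, and the results are written to the output directory /test. The /user/hadoop/input/ directory contains the following files:
The contents of myLocalFile.txt are as follows:
Change to the Hadoop installation directory and enter the following command in a terminal: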
    ./bin/hadoop jar ./share/hadoop/tools/lib/hadoop-*streaming*.jar \
        -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py \
        -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py \
        -input /user/hadoop/input/* \
        -output /test
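When the job has to be launched from a script rather than typed by hand, the same command can be assembled as an argument list. The following is a sketch under stated assumptions: the helper names `build_streaming_command` and `run_job` are hypothetical, and only the options that appear in the command above are used.

```python
import glob
import subprocess


def build_streaming_command(hadoop_bin, streaming_jar, script_dir,
                            input_path, output_path):
    """Return the argument list for the streaming job described in the text."""
    mapper = script_dir + "/mapper.py"
    reducer = script_dir + "/reducer.py"
    return [
        hadoop_bin, "jar", streaming_jar,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


def run_job(cmd):
    """Run the job; raises CalledProcessError if Hadoop exits with an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # The shell expands the jar wildcard; in Python it is resolved explicitly.
    jars = glob.glob("./share/hadoop/tools/lib/hadoop-*streaming*.jar")
    if jars:
        cmd = build_streaming_command(
            "./bin/hadoop", jars[0], "/home/hadoop",
            "/user/hadoop/input/*", "/test")
        print(" ".join(cmd))
    else:
        print("No streaming jar found; run this from the Hadoop installation directory.")
```

Calling `run_job(cmd)` on a machine with Hadoop installed would submit the job. Hadoop refuses to start a job whose output directory already exists, so /test has to be removed before a rerun.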
Part of the job output is shown below:
    File System Counters
        FILE: Number of bytes read=3478
        FILE: Number of bytes written=1063577
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=46
        HDFS: Number of bytes written=35
        HDFS: Number of read operations=17
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
        HDFS: Number of bytes read erasure-coded=0
    Map-Reduce Framework
        Map input records=3
        Map output records=4
        Map output bytes=35
        Map output materialized bytes=49
        Input split bytes=107
        Combine input records=0
        Combine output records=0
        Reduce input groups=4
        Reduce shuffle bytes=49
        Reduce input records=4
        Reduce output records=4
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=10
        Total committed heap usage (bytes)=381681664
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=23
    File Output Format Counters
        Bytes Written=35
    2020-08-09 17:04:12,925 INFO streaming.StreamJob: Output directory: /test
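The counters offer a quick sanity check: in this run the mapper emitted 4 records and the reducer consumed 4, so nothing was lost between the two phases. A small parser makes such checks scriptable. The sketch below assumes the log has one `name=value` counter per line; the function name `parse_counters` is hypothetical, and the sample text is copied from the output above.

```python
import re

# A counter line is a name, "=", and an integer, with optional surrounding spaces.
COUNTER_RE = re.compile(r"^\s*(?P<name>[^=]+?)\s*=\s*(?P<value>\d+)\s*$")


def parse_counters(log_text):
    """Return a dict mapping counter names to integer values."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


if __name__ == "__main__":
    sample = """Map-Reduce Framework
        Map input records=3
        Map output records=4
        Reduce input records=4
        Shuffled Maps =1
    """
    counters = parse_counters(sample)
    # Every record emitted by the mapper should reach the reducer.
    print(counters["Map output records"] == counters["Reduce input records"])
```

Group headings such as `Map-Reduce Framework` and log lines without an `=` are skipped, so the full job output can be passed in unchanged.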
Contents of the /test directory:
The word count results are in part-00000, as shown below:
Reposted from: http://jvrti.baihongyu.com/