MapReduce Introductory Example: wordcount

Latest recommended article published 2025-05-13 10:11:00

.道不虚行

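License: CC 4.0 BY-SA

Category: hadoop  Tags: mapreduce

Article link: https://blog.csdn.net/weixin_44387652/article/details/106451213

This article presents an introductory MapReduce example. It simulates the MapReduce process with a standalone wordcount implementation and walks through the steps wordcount goes through in MapReduce. It covers the Map phase, the Reduce phase, and the complete flow of a hand-written MapReduce wordcount, and closes with the basics needed to learn MapReduce.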


1. Standalone wordcount

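1. There are 5 files, each containing words, and the words on each line are separated by \t. Find the total number of occurrences of each word across the 5 files.
(1) Define 5 streams, one to read each file (a single method, called once per file).
(2) Define 5 containers; each is a map that stores the data of one file.
(3) Read from the stream and put the words into the map:
key ======> word;
value =====> count;
If the word is already in the map, increase its value by 1; if it is not, add it with a value of 1.
(4) Aggregate the 5 maps to get the total count of each word.
Define one more map ====> to store the result;
Read each of the 5 maps ====> accumulate the counts ====> store them in result_map;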

package com.zc.hadoop.mapreduce.wc.standalone;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class WordCount {

	public static void main(String[] args) throws IOException {
		// Count the words in each file separately
		Map<String, Integer> map1 = readOneFile("./src/com/zc/hadoop/mapreduce/wc/standalone/wc1.txt");
		Map<String, Integer> map2 = readOneFile("./src/com/zc/hadoop/mapreduce/wc/standalone/wc2.txt");
		Map<String, Integer> map3 = readOneFile("./src/com/zc/hadoop/mapreduce/wc/standalone/wc3.txt");
		Map<String, Integer> map4 = readOneFile("./src/com/zc/hadoop/mapreduce/wc/standalone/wc4.txt");
		Map<String, Integer> map5 = readOneFile("./src/com/zc/hadoop/mapreduce/wc/standalone/wc5.txt");

		System.out.println(map1);
		System.out.println(map2);
		System.out.println(map3);
		System.out.println(map4);
		System.out.println(map5);

		// Merge the per-file counts into the overall totals
		Map<String, Integer> result_map = mergeAllFiles(map1, map2, map3, map4, map5);
		System.out.println(result_map);
	}

	/*
	 * Divide and conquer: read one file and count its words
	 */
	public static Map<String, Integer> readOneFile(String path) throws IOException {
		// Map that stores the result: word -> count
		Map<String, Integer> map = new HashMap<>();
		// try-with-resources closes the stream even if reading throws
		try (BufferedReader br = new BufferedReader(new FileReader(path))) {
			String line;
			while ((line = br.readLine()) != null) {
				// line holds one line of data; its words are separated by tabs
				String[] words = line.split("\t");
				// Put every word of the line into the map
				for (String w : words) {
					if (w.isEmpty()) {
						// A blank line or doubled tab yields an empty token; skip it
						continue;
					}
					if (!map.containsKey(w)) {
						map.put(w, 1);
					} else {
						map.put(w, map.get(w) + 1);
					}
				}
			}
		}
		return map;
	}

	/*
	 * Aggregate the per-file counts
	 */
	@SafeVarargs
	public static Map<String, Integer> mergeAllFiles(Map<String, Integer>... maps) {
		// Map that stores the final result
		Map<String, Integer> result_map = new HashMap<String, Integer>();
		// Loop over all per-file maps
		for (Map<String, Integer> map : maps) {
			// Loop over every entry of the current map
			Set<String> keys = map.keySet();
			for (String key : keys) {
				if (!result_map.containsKey(key)) {
					result_map.put(key, map.get(key));
				} else {
					result_map.put(key, result_map.get(key) + map.get(key));
				}
			}
		}
		return result_map;
	}
}
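
The same count-then-merge idea can be tried without any input files. The following is only a minimal sketch, not the author's code: the class name WordCountSketch and the methods countWords and mergeCounts are made up for illustration, and the "files" are assumed to be small in-memory lists of tab-separated lines. It uses the standard Map.merge method in place of the containsKey/put pair.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the original article's code.
class WordCountSketch {

    // Per-file step: count the words in one chunk of tab-separated lines.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\t")) {
                if (word.isEmpty()) {
                    continue; // ignore empty tokens from blank lines
                }
                // Insert 1 for a new word, otherwise add 1 to the stored count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Aggregation step: add up the partial counts of every chunk.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two in-memory "files" stand in for wc1.txt and wc2.txt (assumed data).
        Map<String, Integer> part1 = countWords(Arrays.asList("hello\tworld", "hello\thadoop"));
        Map<String, Integer> part2 = countWords(Arrays.asList("hadoop\tmapreduce"));
        Map<String, Integer> total = mergeCounts(Arrays.asList(part1, part2));
        System.out.println(total.get("hello"));  // prints 2
        System.out.println(total.get("hadoop")); // prints 2
    }
}
```

Each call to countWords depends only on its own chunk, so the per-file work could run independently; only mergeCounts needs to see all partial results.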

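2. There are 5 files, each containing numbers, and the numbers on each line are separated by \t. Find the largest number across the 5 files.
(1) Find the largest number in each file separately (divide and conquer);
(2) Find the maximum over the 5 per-file results (aggregation);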
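
The two steps above can be sketched on in-memory data as follows. This is a hedged illustration rather than the author's implementation: the class name MaxNumberSketch and the methods maxOfLines and maxOfAll are hypothetical, the numbers are assumed to be integers that fit in an int, and an empty chunk is assumed to yield Integer.MIN_VALUE.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the original article's code.
class MaxNumberSketch {

    // Step (1): largest number in one chunk of tab-separated lines.
    // Assumption: every non-empty token parses as an int.
    static int maxOfLines(List<String> lines) {
        int max = Integer.MIN_VALUE;
        for (String line : lines) {
            for (String token : line.split("\t")) {
                String t = token.trim();
                if (t.isEmpty()) {
                    continue; // ignore empty tokens from blank lines
                }
                max = Math.max(max, Integer.parseInt(t));
            }
        }
        return max;
    }

    // Step (2): largest of the per-chunk maxima.
    static int maxOfAll(int... partialMaxima) {
        int max = Integer.MIN_VALUE;
        for (int m : partialMaxima) {
            max = Math.max(max, m);
        }
        return max;
    }

    public static void main(String[] args) {
        // Two in-memory "files" (assumed data).
        int m1 = maxOfLines(Arrays.asList("3\t17\t5", "9")); // 17
        int m2 = maxOfLines(Arrays.asList("42\t-1"));        // 42
        System.out.println(maxOfAll(m1, m2));                // prints 42
    }
}
```

Only one int per file has to reach the aggregation step, which is why the per-file maxima are enough to find the overall maximum.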

package com.zc.hadoop.mapreduce.wc.standalone;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class MaxNumber {

	public static void main(String[] args) throws IOException {
		// Process each file separately (divide and conquer)
		Map<String, Integer> map1 =