81 Website Clickstream Data Analysis Case Study (Data Preprocessing)

杨林伟

License: CC 4.0 BY-SA

Category: Big Data

Article link: https://blog.csdn.net/qq_20042935/article/details/98975002

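This article describes the design and implementation of WeblogPreProcess, a MapReduce program that preprocesses web log data: it filters out non-compliant records, converts and normalizes the format, and generates clickstream model data. Further MapReduce jobs then work out the start and end time and the page information of each visit, which gives the later statistics their base data.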

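Main objectives

Filter out "non-compliant" records
Convert and normalize the format
Based on the later statistics requirements, filter and separate out base data for the different topics

Implementation

Develop a MapReduce program, WeblogPreProcess: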

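import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WebLogBean and WebLogParser are project classes that are not shown in this article.
public class WeblogPreProcess {
	static class WeblogPreProcessMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
		// Reused across map() calls so that no new object is allocated per record.
		Text k = new Text();
		NullWritable v = NullWritable.get();
		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String line = value.toString();
			// Parse one raw log line into a bean.
			WebLogBean webLogBean = WebLogParser.parser(line);
			// Disabled in the original: parsers that appear to target other topic-specific outputs.
//			WebLogBean productWebLog = WebLogParser.parser2(line);
//			WebLogBean bbsWebLog = WebLogParser.parser3(line);
//			WebLogBean cuxiaoBean = WebLogParser.parser4(line);
			// Skip records that the parser flagged as non-compliant.
			if (!webLogBean.isValid())
				return;
			// The normalized record is the key; the value is empty.
			k.set(webLogBean.toString());
			context.write(k, v);
//			k.set(productWebLog);
//			context.write(k, v);
		}
	}
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJarByClass(WeblogPreProcess.class);
		job.setMapperClass(WeblogPreProcessMapper.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		// args[0] is the input path; args[1] is the output path, which must not exist yet.
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Return a non-zero exit code if the job fails.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}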
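
The article does not show WebLogParser or WebLogBean, so the following is only a rough sketch of what the "parse, then check isValid()" step might look like. Everything in it is an assumption made for illustration: the class name ParsedLogLine, the combined log format of the input, the rule that responses with status 400 and above count as non-compliant, and the \001 separator in toString(). It uses plain Java with no Hadoop dependency.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for WebLogBean plus WebLogParser; not the article's classes.
class ParsedLogLine {
    // Assumed input layout (combined log format):
    // ip - user [time] "METHOD url PROTOCOL" status bytes "referer" "user agent"
    private static final Pattern COMBINED = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"\\S+ (\\S+)[^\"]*\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    final boolean valid;
    // Order: remoteAddr, time, url, status, bytes, referer, userAgent
    final String[] fields;

    private ParsedLogLine(boolean valid, String[] fields) {
        this.valid = valid;
        this.fields = fields;
    }

    static ParsedLogLine parse(String line) {
        Matcher m = COMBINED.matcher(line == null ? "" : line.trim());
        if (!m.matches()) {
            // Malformed lines become an invalid record instead of raising an exception.
            return new ParsedLogLine(false, new String[0]);
        }
        String[] f = new String[7];
        for (int i = 0; i < f.length; i++) {
            f[i] = m.group(i + 1);
        }
        // Assumed validity rule: status codes of 400 and above are non-compliant.
        boolean ok = Integer.parseInt(f[3]) < 400;
        return new ParsedLogLine(ok, f);
    }

    boolean isValid() {
        return valid;
    }

    @Override
    public String toString() {
        // Fields joined with \001, Hive's default field delimiter (an assumption here).
        return String.join("\u0001", fields);
    }
}
```

In a mapper like the one above, a call such as ParsedLogLine.parse(line) would take the place of WebLogParser.parser(line), and toString() would supply the text written as the output key.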

Run the MapReduce job to preprocess the data

hadoop jar weblog.jar  cn.itcast.bigdata.hive.mr.WeblogPreProcess /weblog/input /weblog/preout

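Building the clickstream model data

Many of the metrics are much easier to compute from a clickstream model, so MapReduce programs can be used in the preprocessing stage to generate the clickstream model data.

1. Clickstream model: pageviews table

Generating the pageviews model data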

hadoop jar weblogpreprocess.jar  \
cn.itcast.bigdata.hive.mr.ClickStreamThree   \
/user/hive/warehouse/dw_click.db/test_ods_weblog_origin/datestr=2013-09-20/ /test-click/pageviews/

Table structure:
[image: pageviews table structure; not preserved in this copy]
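
The source of ClickStreamThree and the pageviews table structure are not available here, so the following sketch only illustrates one common way to turn raw hits into pageview rows: group hits by visitor, sort them by time, and start a new session whenever the gap between two consecutive hits exceeds a timeout. The 30-minute timeout, the "address#number" session id, the step counter and all class and field names are assumptions for illustration, not the article's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of sessionizing hits into pageview rows.
class PageViewsSketch {
    // Assumed session timeout: 30 minutes.
    static final long SESSION_GAP_SECONDS = 30 * 60;

    static class Hit {
        final String remoteAddr;
        final long epochSeconds;
        final String url;

        Hit(String remoteAddr, long epochSeconds, String url) {
            this.remoteAddr = remoteAddr;
            this.epochSeconds = epochSeconds;
            this.url = url;
        }
    }

    static class PageView {
        final String session;
        final String remoteAddr;
        final long epochSeconds;
        final String url;
        final int step; // position of the page inside its session, starting at 1

        PageView(String session, String remoteAddr, long epochSeconds, String url, int step) {
            this.session = session;
            this.remoteAddr = remoteAddr;
            this.epochSeconds = epochSeconds;
            this.url = url;
            this.step = step;
        }
    }

    static List<PageView> toPageViews(List<Hit> hits) {
        // Group by visitor address, similar to a shuffle keyed on remoteAddr.
        Map<String, List<Hit>> byVisitor = new LinkedHashMap<>();
        for (Hit h : hits) {
            byVisitor.computeIfAbsent(h.remoteAddr, k -> new ArrayList<>()).add(h);
        }
        List<PageView> out = new ArrayList<>();
        for (Map.Entry<String, List<Hit>> e : byVisitor.entrySet()) {
            List<Hit> sorted = new ArrayList<>(e.getValue());
            sorted.sort(Comparator.comparingLong((Hit h) -> h.epochSeconds));
            int sessionNo = 0;
            int step = 0;
            long previous = 0;
            for (Hit h : sorted) {
                // The first hit, or a hit after a long pause, opens a new session.
                if (step == 0 || h.epochSeconds - previous > SESSION_GAP_SECONDS) {
                    sessionNo++;
                    step = 0;
                }
                step++;
                out.add(new PageView(e.getKey() + "#" + sessionNo, h.remoteAddr, h.epochSeconds, h.url, step));
                previous = h.epochSeconds;
            }
        }
        return out;
    }
}
```

In a real MapReduce job the grouping would be done by the shuffle and the per-visitor loop would live in the reducer; the sketch keeps both in one method so the logic can be read in one place.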

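2. Clickstream model: visit table

Note: one "visit" = N consecutive requests

It is difficult to derive each user's per-visit information directly from the raw data with HQL. Instead, first analyze the raw data with a MapReduce program to produce the per-visit data, then use HQL for statistics along more dimensions.

Use a MapReduce program to work out, from the pageviews data, the start and end time and the page information of each visit.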

hadoop jar weblogpreprocess.jar cn.itcast.bigdata.hive.mr.ClickStreamVisit /weblog/sessionout /weblog/visitout
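
ClickStreamVisit itself is not listed in the article. As a rough sketch of the aggregation it describes, the code below groups pageview rows by session and takes the earliest and latest row of each session as the entry and exit. The output fields mirror the columns of the click_stream_visit table used in this article; the PageView input fields, the class names, and the assumption that times are strings in a lexicographically sortable form such as yyyy-MM-dd HH:mm:ss are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of collapsing pageview rows into one row per visit.
class VisitSketch {
    static class PageView {
        final String session;
        final String remoteAddr;
        final String time; // assumed sortable as text, e.g. "2013-09-18 06:49:18"
        final String page;
        final String referal;

        PageView(String session, String remoteAddr, String time, String page, String referal) {
            this.session = session;
            this.remoteAddr = remoteAddr;
            this.time = time;
            this.page = page;
            this.referal = referal;
        }
    }

    static class Visit {
        final String session;
        final String remoteAddr;
        final String inTime;
        final String outTime;
        final String inPage;
        final String outPage;
        final String referal;
        final int pageVisits;

        Visit(String session, String remoteAddr, String inTime, String outTime,
              String inPage, String outPage, String referal, int pageVisits) {
            this.session = session;
            this.remoteAddr = remoteAddr;
            this.inTime = inTime;
            this.outTime = outTime;
            this.inPage = inPage;
            this.outPage = outPage;
            this.referal = referal;
            this.pageVisits = pageVisits;
        }
    }

    static List<Visit> toVisits(List<PageView> pageViews) {
        // Group by session id, similar to a shuffle keyed on the session.
        Map<String, List<PageView>> bySession = new LinkedHashMap<>();
        for (PageView pv : pageViews) {
            bySession.computeIfAbsent(pv.session, s -> new ArrayList<>()).add(pv);
        }
        List<Visit> visits = new ArrayList<>();
        for (List<PageView> views : bySession.values()) {
            views.sort(Comparator.comparing((PageView pv) -> pv.time));
            PageView first = views.get(0);
            PageView last = views.get(views.size() - 1);
            // The entry referer is taken from the first page of the visit.
            visits.add(new Visit(first.session, first.remoteAddr, first.time, last.time,
                    first.page, last.page, first.referal, views.size()));
        }
        return visits;
    }
}
```

Taking the referer from the first pageview is a design choice of this sketch: it records where the visitor came from when the visit began, rather than any in-site navigation that followed.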

Then create the clickstream visit model table in the Hive warehouse:

drop table if exists click_stream_visit;
create table click_stream_visit(
session     string,
remote_addr string,
inTime      string,
outTime     string,
inPage      string,
outPage     string,
referal     string,
pageVisits  int)
partitioned by (datestr string);

Then load the visit data produced by the MapReduce job into the visit model table:

load data inpath '/weblog/visitout' into table click_stream_visit partition(datestr='2013-09-18');