Flume 1.7 New Features
1.taildir
Before version 1.7
- Before Flume 1.7, the usual way to monitor content appended to a file was an exec source running tail. The drawback: if the server crashes and restarts, reading starts over from the beginning of the file, which is clearly not what we want. Before 1.7 came out, the common workaround was: after reading a record, write its line number to a bookkeeping file; on restart, first read the last consumed line number from that file and continue tailing from there, so that no data is lost or read twice.
- Configuration change to avoid duplicate consumption
a1.sources.r3.command = tail -n +$(tail -n1 /root/nnn) -F /root/data/web.log | awk 'ARGIND==1{i=$0;next}{i++;if($0~/^tail/){i=0};print $0;print i >> "/root/nnn";fflush("")}' /root/nnn -
Here /root/data/web.log is the file being monitored and /root/nnn is the file that stores the read position; the trailing - makes awk read the saved position from /root/nnn first (ARGIND==1) and then process the piped tail output from stdin.
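A minimal sketch of how the resume works, assuming /root/nnn keeps a running count with the latest value on its last line (1342 is a made-up number):
tail -n1 /root/nnn                     # prints the last saved position, e.g. 1342
tail -n +1342 -F /root/data/web.log    # resume tailing from roughly that line and keep following the file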
Flume 1.7 adds a new source type, taildir (Taildir Source). It can monitor multiple files under a directory and persists the read position in real time.
- Official description
This source is provided as a preview feature. It does not work on Windows. Watch the specified files, and tail them in nearly real-time once detected new lines appended to the each files. If the new lines are being written, this source will retry reading them in wait for the completion of the write. This source is reliable and will not miss data even when the tailing files rotate. It periodically writes the last read position of each files on the given position file in JSON format. If Flume is stopped or down for some reason, it can restart tailing from the position written on the existing position file. In other use case, this source can also start tailing from the arbitrary position for each files using the given position file. When there is no position file on the specified path, it will start tailing from the first line of each files by default. Files will be consumed in order of their modification time. File with the oldest modification time will be consumed first. This source does not rename or delete or do any modifications to the file being tailed. Currently this source does not support tailing binary files. It reads text files line by line.
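As a small illustration of the position file mentioned above: it is a JSON array with one entry per tailed file, recording inode, offset, and path. The path below matches the positionFile used in the configuration example later in this section; the inode/pos values are made up:
cat /var/local/apache-flume-1.7.0-bin/taildir_position.json
# [{"inode":2496989,"pos":1024,"file":"/root/data/access.log"},{"inode":2496990,"pos":0,"file":"/root/data/nginx.log"},{"inode":2496991,"pos":2048,"file":"/root/data/web.log"}]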
Usage example
- Requirement: use Flume to monitor the contents of multiple files under a directory and collect them into the Hadoop cluster in real time.
- Configuration file
a1.channels = ch1
a1.sources = s1
a1.sinks = hdfs-sink1

# channel
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 100000
a1.channels.ch1.transactionCapacity = 50000

# source
a1.sources.s1.channels = ch1
# Monitor new content appended to multiple files under a directory
a1.sources.s1.type = taildir
# Persist each file's consumed offset in JSON format so consumption does not restart from the beginning
a1.sources.s1.positionFile = /var/local/apache-flume-1.7.0-bin/taildir_position.json
a1.sources.s1.filegroups = f1 f2 f3
a1.sources.s1.filegroups.f1 = /root/data/access.log
a1.sources.s1.filegroups.f2 = /root/data/nginx.log
a1.sources.s1.filegroups.f3 = /root/data/web.log
a1.sources.s1.headers.f1.headerKey = access
a1.sources.s1.headers.f2.headerKey = nginx
a1.sources.s1.headers.f3.headerKey = web
a1.sources.s1.fileHeader = true

# sink
a1.sinks.hdfs-sink1.channel = ch1
a1.sinks.hdfs-sink1.type = hdfs
a1.sinks.hdfs-sink1.hdfs.path = hdfs://master:9000/demo/data
a1.sinks.hdfs-sink1.hdfs.filePrefix = event_data
a1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink1.hdfs.rollSize = 10485760
a1.sinks.hdfs-sink1.hdfs.rollInterval = 20
a1.sinks.hdfs-sink1.hdfs.rollCount = 0
a1.sinks.hdfs-sink1.hdfs.batchSize = 1500
a1.sinks.hdfs-sink1.hdfs.round = true
a1.sinks.hdfs-sink1.hdfs.roundUnit = minute
a1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
a1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
a1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
a1.sinks.hdfs-sink1.hdfs.fileType = DataStream
a1.sinks.hdfs-sink1.hdfs.writeFormat = Text
a1.sinks.hdfs-sink1.hdfs.callTimeout = 60000
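With this file saved, the agent can be started in the usual way (the configuration file name taildir-hdfs.conf and the install path below are assumptions for illustration):
# Start agent a1 and log to the console (file name and install path are illustrative)
/var/local/apache-flume-1.7.0-bin/bin/flume-ng agent -c /var/local/apache-flume-1.7.0-bin/conf -f /var/local/apache-flume-1.7.0-bin/conf/taildir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console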
Installing Flume 1.7
- Download the package and extract it
- Go into the Flume directory and verify the installation by running:
apache-flume-1.7.0-bin/bin/flume-ng version
- Output shown when the installation succeeds
Flume 1.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 511d868555dd4d16e6ce4fedc72c2d1454546707
Compiled by bessbd on Wed Oct 12 20:51:10 CEST 2016
From source with checksum 0d21b3ffdc55a07e1d08875872c00523
- Edit the configuration file
mv flume-env.sh.template flume-env.sh
vi flume-env.sh
# Set the JAVA_HOME variable in flume-env.sh
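For example, add the following line to flume-env.sh (the JDK path is only an illustration; point it at the local installation):
export JAVA_HOME=/usr/java/jdk1.8.0_121   # illustrative path; adjust to your JDK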
- A few small examples
- Create the Flume job configuration file taildir_behavior.conf under the conf directory
vi taildir_behavior.conf

# Name the agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/ebohailife/logs/rest/.*behavior.*
a1.sources.r1.positionFile = /tmp/flume/taildir_behavior_position.json
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
a1.sinks.k1.kafka.topic = behaviorlog_r1p3
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.flumeBatchSize = 100
# Old-style Kafka sink properties (topic, broker list, acks, batch size) deprecated since 1.7:
# a1.sinks.k1.topic = behaviorlog_r1p3
# a1.sinks.k1.brokerList = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
# a1.sinks.k1.requiredAcks = 1
# a1.sinks.k1.batchSize = 100

# Use a channel which buffers events in file
a1.channels.c1.type = file
# Checkpoint file directory
a1.channels.c1.checkpointDir = /opt/ebohailife/apache-flume-1.7.0-bin/checkpoint
# Event data directory
a1.channels.c1.dataDirs = /opt/ebohailife/apache-flume-1.7.0-bin/data

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
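Once this agent is running, a quick sanity check (the log file name below is hypothetical; it only needs to match the .*behavior.* pattern of filegroup f1):
# Append a test line to a file matched by filegroup f1
echo "taildir smoke test" >> /opt/ebohailife/logs/rest/20170701_behavior.log
# Watch it arrive on the Kafka topic (run from the Kafka bin directory)
./kafka-console-consumer.sh --topic behaviorlog_r1p3 --bootstrap-server 10.104.0.226:9092 --from-beginning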
- Create the Flume job configuration file taildir_phoneinfo.conf
vi taildir_phoneinfo.conf

# Name the agent a2
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = TAILDIR
a2.sources.r2.filegroups = f1
a2.sources.r2.filegroups.f1 = /opt/ebohailife/logs/rest/.*phoneinfo.*
a2.sources.r2.positionFile = /tmp/flume/taildir_phoneinfo_position.json
a2.sources.r2.fileHeader = false

# Describe the sink
a2.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a2.sinks.k2.kafka.bootstrap.servers = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
a2.sinks.k2.kafka.topic = phoneinfolog_r1p3
a2.sinks.k2.kafka.producer.acks = 1
a2.sinks.k2.kafka.producer.linger.ms = 1
a2.sinks.k2.flumeBatchSize = 100
# Old-style Kafka sink properties (topic, broker list, acks, batch size) deprecated since 1.7:
# a2.sinks.k2.topic = behaviorlog_r1p3
# a2.sinks.k2.brokerList = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
# a2.sinks.k2.requiredAcks = 1
# a2.sinks.k2.batchSize = 100

# Use a channel which buffers events in file
a2.channels.c2.type = file
# Checkpoint file directory (note: if a1 and a2 run on the same host, each file channel needs its own checkpointDir and dataDirs)
a2.channels.c2.checkpointDir = /opt/ebohailife/apache-flume-1.7.0-bin/checkpoint
# Event data directory
a2.channels.c2.dataDirs = /opt/ebohailife/apache-flume-1.7.0-bin/data

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
- Create the Kafka topics
# Create topic behaviorlog_r1p3
./kafka-topics.sh --zookeeper 10.104.0.227:2181 --create --topic behaviorlog_r1p3 --partitions 3 --replication-factor 1
# Create topic phoneinfolog_r1p3
./kafka-topics.sh --zookeeper 10.104.0.227:2181 --create --topic phoneinfolog_r1p3 --partitions 3 --replication-factor 1
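Optionally, verify the partition and replication settings right after creation:
./kafka-topics.sh --describe --topic behaviorlog_r1p3 --zookeeper 10.104.0.227:2181
./kafka-topics.sh --describe --topic phoneinfolog_r1p3 --zookeeper 10.104.0.227:2181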
- List the topics
./kafka-topics.sh --list --zookeeper 10.104.0.227:2181
- Start the Flume NG agents in the background
./flume-ng agent -c /opt/ebohailife/apache-flume-1.7.0-bin/conf -f /opt/ebohailife/apache-flume-1.7.0-bin/conf/taildir_behavior.conf -n a1 >/dev/null 2>&1 &
./flume-ng agent -c /opt/ebohailife/apache-flume-1.7.0-bin/conf -f /opt/ebohailife/apache-flume-1.7.0-bin/conf/taildir_phoneinfo.conf -n a2 >/dev/null 2>&1 &
# Add -Dflume.root.logger=INFO,console to the command to log to the console instead
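To confirm both agents came up (the flume.log location is an assumption based on Flume's default log4j.properties; adjust if logging was reconfigured):
ps -ef | grep flume-ng | grep -v grep
tail -f /opt/ebohailife/apache-flume-1.7.0-bin/logs/flume.log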
- Start the Kafka consumers in the background
# Consume behaviorlog_r1p3
./kafka-console-consumer.sh --topic behaviorlog_r1p3 --bootstrap-server 10.104.0.226:9092 >/dev/null 2>&1 &
# Consume phoneinfolog_r1p3
./kafka-console-consumer.sh --topic phoneinfolog_r1p3 --bootstrap-server 10.104.0.226:9092 >/dev/null 2>&1 &
- Create the log collection streams
# Create the phoneinfo_log stream
CREATE STREAM phoneinfo_log_stream(phoneinfo STRING, tmp1 STRING, ip STRING, tmp2 STRING, phone_model STRING, tmp3 STRING, phone_version STRING, tmp4 STRING, area STRING, tmp5 STRING, start_time TIMESTAMP, tmp6 STRING, KDDI STRING, tmp7 STRING, app_source STRING, tmp8 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
TBLPROPERTIES("topic"="phoneinfolog_r1p3", "kafka.zookeeper"="10.104.0.227:2181", "kafka.broker.list"="10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092");

# Create the behavior_log stream
CREATE STREAM behavior_log_stream(eventid STRING, tmp1 STRING, ip STRING, tmp2 STRING, user_id STRING, tmp3 STRING, user_name STRING, tmp4 STRING, in_time TIMESTAMP, tmp5 STRING, operate_time TIMESTAMP, tmp6 STRING, phone_unicode STRING, tmp7 STRING, trigger_count STRING, tmp8 STRING, diff_in_oper INT, tmp9 STRING, tel_no STRING, tmp10 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
TBLPROPERTIES("topic"="behaviorlog_r1p3", "kafka.zookeeper"="10.104.0.227:2181", "kafka.broker.list"="10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092");
- Create the log tables
# Create the phoneinfo_log table
CREATE TABLE phoneinfo_log_tab(phone STRING, ip STRING, phone_model STRING, phone_version STRING, area STRING, start_time TIMESTAMP, KDDI STRING, app_source STRING);
# Create the behavior_log table
CREATE TABLE behavior_log_tab(eventid STRING, ip STRING, user_id STRING, user_name STRING, in_time TIMESTAMP, operate_time TIMESTAMP, phone_unicode STRING, trigger_count STRING, diff_in_oper INT, tel_no STRING);
To avoid producing too many small files, apply the following settings:
set streamsql.enable.hdfs.batchflush = true          # enable batched flushing
set streamsql.hdfs.batchflush.size = <num>           # flush once this many messages have accumulated
set streamsql.hdfs.batchflush.interval.ms = <num>    # flush after this many milliseconds
# A flush is triggered as soon as either batchflush.size or batchflush.interval.ms is satisfied
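For instance, with purely illustrative values (tune them to the actual throughput):
set streamsql.enable.hdfs.batchflush = true
set streamsql.hdfs.batchflush.size = 1000          # illustrative: flush every 1000 messages
set streamsql.hdfs.batchflush.interval.ms = 5000   # illustrative: or at least every 5 seconds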
- Start the log streams
# Trigger the phoneinfo_log_stream computation
INSERT INTO phoneinfo_log_tab SELECT phoneinfo, ip, phone_model, phone_version, area, start_time, KDDI, app_source FROM phoneinfo_log_stream;
# Trigger the behavior_log_stream computation
INSERT INTO behavior_log_tab SELECT eventid, ip, user_id, user_name, in_time, operate_time, phone_unicode, trigger_count, diff_in_oper, tel_no FROM behavior_log_stream;
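Once the INSERT jobs are running, a quick spot check that rows are landing in the tables (assuming the engine accepts Hive-style queries):
SELECT * FROM behavior_log_tab LIMIT 10;
SELECT * FROM phoneinfo_log_tab LIMIT 10;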