Flume 1.7 New Features
1.taildir
Before version 1.7
- Before Flume 1.7, the usual way to monitor content appended to a file was an exec source running tail. The drawback: if the server crashes and restarts, reading starts over from the beginning of the file, which is clearly not what we want. Before 1.7 came out, the common workaround was: after reading a record, write its line number to a bookkeeping file; on restart, first read the last consumed line number from that file and continue tailing from there, so that no data is lost or read twice.
- Configuration change to avoid duplicate consumption
a1.sources.r3.command = tail -n +$(tail -n1 /root/nnn) -F /root/data/web.log | awk 'ARGIND==1{i=$0;next}{i++;if($0~/^tail/){i=0};print $0;print i >> "/root/nnn";fflush("")}' /root/nnn -
Here /root/data/web.log is the file being monitored and /root/nnn is the file that stores the read position; the trailing - makes awk read the saved position from /root/nnn first (ARGIND==1) and then process the piped tail output from stdin.
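A minimal sketch of how the resume works, assuming /root/nnn keeps a running count with the latest value on its last line (1342 is a made-up number):
tail -n1 /root/nnn                     # prints the last saved position, e.g. 1342
tail -n +1342 -F /root/data/web.log    # resume tailing from roughly that line and keep following the file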
Flume 1.7 adds a new source type, taildir (Taildir Source). It can monitor multiple files under a directory and persists the read position in real time.
- Official description
This source is provided as a preview feature. It does not work on Windows. Watch the specified files, and tail them in nearly real-time once detected new lines appended to the each files. If the new lines are being written, this source will retry reading them in wait for the completion of the write. This source is reliable and will not miss data even when the tailing files rotate. It periodically writes the last read position of each files on the given position file in JSON format. If Flume is stopped or down for some reason, it can restart tailing from the position written on the existing position file. In other use case, this source can also start tailing from the arbitrary position for each files using the given position file. When there is no position file on the specified path, it will start tailing from the first line of each files by default. Files will be consumed in order of their modification time. File with the oldest modification time will be consumed first. This source does not rename or delete or do any modifications to the file being tailed. Currently this source does not support tailing binary files. It reads text files line by line.
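As a small illustration of the position file mentioned above: it is a JSON array with one entry per tailed file, recording inode, offset, and path. The path below matches the positionFile used in the configuration example later in this section; the inode/pos values are made up:
cat /var/local/apache-flume-1.7.0-bin/taildir_position.json
# [{"inode":2496989,"pos":1024,"file":"/root/data/access.log"},{"inode":2496990,"pos":0,"file":"/root/data/nginx.log"},{"inode":2496991,"pos":2048,"file":"/root/data/web.log"}]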
Usage example
- Requirement: use Flume to monitor the contents of multiple files under a directory and collect them into the Hadoop cluster in real time.
- Configuration file
a1.channels = ch1
a1.sources = s1
a1.sinks = hdfs-sink1

# channel
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 100000
a1.channels.ch1.transactionCapacity = 50000

# source
a1.sources.s1.channels = ch1
# Monitor new content appended to multiple files under a directory
a1.sources.s1.type = taildir
# Persist each file's consumed offset in JSON format so consumption does not restart from the beginning
a1.sources.s1.positionFile = /var/local/apache-flume-1.7.0-bin/taildir_position.json
a1.sources.s1.filegroups = f1 f2 f3
a1.sources.s1.filegroups.f1 = /root/data/access.log
a1.sources.s1.filegroups.f2 = /root/data/nginx.log
a1.sources.s1.filegroups.f3 = /root/data/web.log
a1.sources.s1.headers.f1.headerKey = access
a1.sources.s1.headers.f2.headerKey = nginx
a1.sources.s1.headers.f3.headerKey = web
a1.sources.s1.fileHeader = true

# sink
a1.sinks.hdfs-sink1.channel = ch1
a1.sinks.hdfs-sink1.type = hdfs
a1.sinks.hdfs-sink1.hdfs.path = hdfs://master:9000/demo/data
a1.sinks.hdfs-sink1.hdfs.filePrefix = event_data
a1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink1.hdfs.rollSize = 10485760
a1.sinks.hdfs-sink1.hdfs.rollInterval = 20
a1.sinks.hdfs-sink1.hdfs.rollCount = 0
a1.sinks.hdfs-sink1.hdfs.batchSize = 1500
a1.sinks.hdfs-sink1.hdfs.round = true
a1.sinks.hdfs-sink1.hdfs.roundUnit = minute
a1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
a1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
a1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
a1.sinks.hdfs-sink1.hdfs.fileType = DataStream
a1.sinks.hdfs-sink1.hdfs.writeFormat = Text
a1.sinks.hdfs-sink1.hdfs.callTimeout = 60000
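With this file saved, the agent can be started in the usual way (the configuration file name taildir-hdfs.conf and the install path below are assumptions for illustration):
# Start agent a1 and log to the console (file name and install path are illustrative)
/var/local/apache-flume-1.7.0-bin/bin/flume-ng agent -c /var/local/apache-flume-1.7.0-bin/conf -f /var/local/apache-flume-1.7.0-bin/conf/taildir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console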
Installing Flume 1.7
- Download the package and extract it
- Go into the Flume directory and verify the installation by running:
apache-flume-1.7.0-bin/bin/flume-ng version
- Output shown when the installation succeeds
Flume 1.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 511d868555dd4d16e6ce4fedc72c2d1454546707
Compiled by bessbd on Wed Oct 12 20:51:10 CEST 2016
From source with checksum 0d21b3ffdc55a07e1d08875872c00523
- Edit the configuration file
mv flume-env.sh.template flume-env.sh
vi flume-env.sh
# Set the JAVA_HOME variable in flume-env.sh
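For example, add the following line to flume-env.sh (the JDK path is only an illustration; point it at the local installation):
export JAVA_HOME=/usr/java/jdk1.8.0_121   # illustrative path; adjust to your JDK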
- A few small examples
- Create the Flume job configuration file taildir_behavior.conf under the conf directory
vi taildir_behavior.conf

# Name the agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/ebohailife/logs/rest/.*behavior.*
a1.sources.r1.positionFile = /tmp/flume/taildir_behavior_position.json
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
a1.sinks.k1.kafka.topic = behaviorlog_r1p3
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.flumeBatchSize = 100
# Old-style Kafka sink properties (topic, broker list, acks, batch size) deprecated since 1.7:
# a1.sinks.k1.topic = behaviorlog_r1p3
# a1.sinks.k1.brokerList = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
# a1.sinks.k1.requiredAcks = 1
# a1.sinks.k1.batchSize = 100

# Use a channel which buffers events in file
a1.channels.c1.type = file
# Checkpoint file directory
a1.channels.c1.checkpointDir = /opt/ebohailife/apache-flume-1.7.0-bin/checkpoint
# Event data directory
a1.channels.c1.dataDirs = /opt/ebohailife/apache-flume-1.7.0-bin/data

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
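Once this agent is running, a quick sanity check (the log file name below is hypothetical; it only needs to match the .*behavior.* pattern of filegroup f1):
# Append a test line to a file matched by filegroup f1
echo "taildir smoke test" >> /opt/ebohailife/logs/rest/20170701_behavior.log
# Watch it arrive on the Kafka topic (run from the Kafka bin directory)
./kafka-console-consumer.sh --topic behaviorlog_r1p3 --bootstrap-server 10.104.0.226:9092 --from-beginning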
- Create the Flume job configuration file taildir_phoneinfo.conf
vi taildir_phoneinfo.conf

# Name the agent a2
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = TAILDIR
a2.sources.r2.filegroups = f1
a2.sources.r2.filegroups.f1 = /opt/ebohailife/logs/rest/.*phoneinfo.*
a2.sources.r2.positionFile = /tmp/flume/taildir_phoneinfo_position.json
a2.sources.r2.fileHeader = false

# Describe the sink
a2.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a2.sinks.k2.kafka.bootstrap.servers = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
a2.sinks.k2.kafka.topic = phoneinfolog_r1p3
a2.sinks.k2.kafka.producer.acks = 1
a2.sinks.k2.kafka.producer.linger.ms = 1
a2.sinks.k2.flumeBatchSize = 100
# Old-style Kafka sink properties (topic, broker list, acks, batch size) deprecated since 1.7:
# a2.sinks.k2.topic = behaviorlog_r1p3
# a2.sinks.k2.brokerList = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
# a2.sinks.k2.requiredAcks = 1
# a2.sinks.k2.batchSize = 100

# Use a channel which buffers events in file
a2.channels.c2.type = file
# Checkpoint file directory (note: if a1 and a2 run on the same host, each file channel needs its own checkpointDir and dataDirs)
a2.channels.c2.checkpointDir = /opt/ebohailife/apache-flume-1.7.0-bin/checkpoint
# Event data directory
a2.channels.c2.dataDirs = /opt/ebohailife/apache-flume-1.7.0-bin/data

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
- Create the Kafka topics
# Create topic behaviorlog_r1p3
./kafka-topics.sh --zookeeper 10.104.0.227:2181 --create --topic behaviorlog_r1p3 --partitions 3 --replication-factor 1
# Create topic phoneinfolog_r1p3
./kafka-topics.sh --zookeeper 10.104.0.227:2181 --create --topic phoneinfolog_r1p3 --partitions 3 --replication-factor 1
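Optionally, verify the partition and replication settings right after creation:
./kafka-topics.sh --describe --topic behaviorlog_r1p3 --zookeeper 10.104.0.227:2181
./kafka-topics.sh --describe --topic phoneinfolog_r1p3 --zookeeper 10.104.0.227:2181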
- List the topics
./kafka-topics.sh --list --zookeeper 10.104.0.227:2181
- Start the Flume NG agents in the background
./flume-ng agent -c /opt/ebohailife/apache-flume-1.7.0-bin/conf -f /opt/ebohailife/apache-flume-1.7.0-bin/conf/taildir_behavior.conf -n a1 >/dev/null 2>&1 &
./flume-ng agent -c /opt/ebohailife/apache-flume-1.7.0-bin/conf -f /opt/ebohailife/apache-flume-1.7.0-bin/conf/taildir_phoneinfo.conf -n a2 >/dev/null 2>&1 &
# Add -Dflume.root.logger=INFO,console to the command to log to the console instead
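To confirm both agents came up (the flume.log location is an assumption based on Flume's default log4j.properties; adjust if logging was reconfigured):
ps -ef | grep flume-ng | grep -v grep
tail -f /opt/ebohailife/apache-flume-1.7.0-bin/logs/flume.log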
- Start the Kafka consumers in the background
# Consume behaviorlog_r1p3
./kafka-console-consumer.sh --topic behaviorlog_r1p3 --bootstrap-server 10.104.0.226:9092 >/dev/null 2>&1 &
# Consume phoneinfolog_r1p3
./kafka-console-consumer.sh --topic phoneinfolog_r1p3 --bootstrap-server 10.104.0.226:9092 >/dev/null 2>&1 &
- Create the log collection streams
# Create the phoneinfo_log stream
CREATE STREAM phoneinfo_log_stream(phoneinfo STRING, tmp1 STRING, ip STRING, tmp2 STRING, phone_model STRING, tmp3 STRING, phone_version STRING, tmp4 STRING, area STRING, tmp5 STRING, start_time TIMESTAMP, tmp6 STRING, KDDI STRING, tmp7 STRING, app_source STRING, tmp8 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
TBLPROPERTIES("topic"="phoneinfolog_r1p3", "kafka.zookeeper"="10.104.0.227:2181", "kafka.broker.list"="10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092");

# Create the behavior_log stream
CREATE STREAM behavior_log_stream(eventid STRING, tmp1 STRING, ip STRING, tmp2 STRING, user_id STRING, tmp3 STRING, user_name STRING, tmp4 STRING, in_time TIMESTAMP, tmp5 STRING, operate_time TIMESTAMP, tmp6 STRING, phone_unicode STRING, tmp7 STRING, trigger_count STRING, tmp8 STRING, diff_in_oper INT, tmp9 STRING, tel_no STRING, tmp10 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
TBLPROPERTIES("topic"="behaviorlog_r1p3", "kafka.zookeeper"="10.104.0.227:2181", "kafka.broker.list"="10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092");
- Create the log tables
# Create the phoneinfo_log table
CREATE TABLE phoneinfo_log_tab(phone STRING, ip STRING, phone_model STRING, phone_version STRING, area STRING, start_time TIMESTAMP, KDDI STRING, app_source STRING);
# Create the behavior_log table
CREATE TABLE behavior_log_tab(eventid STRING, ip STRING, user_id STRING, user_name STRING, in_time TIMESTAMP, operate_time TIMESTAMP, phone_unicode STRING, trigger_count STRING, diff_in_oper INT, tel_no STRING);
To avoid producing too many small files, apply the following settings:
set streamsql.enable.hdfs.batchflush = true          # enable batched flushing
set streamsql.hdfs.batchflush.size = <num>           # flush once this many messages have accumulated
set streamsql.hdfs.batchflush.interval.ms = <num>    # flush after this many milliseconds
# A flush is triggered as soon as either batchflush.size or batchflush.interval.ms is satisfied
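For instance, with purely illustrative values (tune them to the actual throughput):
set streamsql.enable.hdfs.batchflush = true
set streamsql.hdfs.batchflush.size = 1000          # illustrative: flush every 1000 messages
set streamsql.hdfs.batchflush.interval.ms = 5000   # illustrative: or at least every 5 seconds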
- Start the log streams
# Trigger the phoneinfo_log_stream computation
INSERT INTO phoneinfo_log_tab SELECT phoneinfo, ip, phone_model, phone_version, area, start_time, KDDI, app_source FROM phoneinfo_log_stream;
# Trigger the behavior_log_stream computation
INSERT INTO behavior_log_tab SELECT eventid, ip, user_id, user_name, in_time, operate_time, phone_unicode, trigger_count, diff_in_oper, tel_no FROM behavior_log_stream;
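Once the INSERT jobs are running, a quick spot check that rows are landing in the tables (assuming the engine accepts Hive-style queries):
SELECT * FROM behavior_log_tab LIMIT 10;
SELECT * FROM phoneinfo_log_tab LIMIT 10;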