Flume 1.7 New Features

Flume 1.7 introduces a new source type, Taildir, which addresses the duplicate data that used to appear when a server crashed while files were being monitored with exec tail. Taildir can watch multiple files under a directory and persists read positions in real time, so data is neither lost nor duplicated. A configuration example shows how to use the Taildir Source to collect the contents of files in a directory into a Hadoop cluster in real time. To keep the number of small files down, a few extra settings are also required.


1. Taildir

  • Before version 1.7

    • Before Flume 1.7, if we wanted to monitor content appended to a file, the source we usually chose was exec tail. The drawback is that after the server crashes and restarts, reading starts again from the beginning of the file, which is clearly not what we want. Before Flume 1.7 came out, the usual workaround was: each time a record is read, write the current line number to a file; after a crash and restart, first read the last consumed line number from that file and then continue tailing from there. This keeps the data from being lost or duplicated.
    • Configuration change to avoid duplicate consumption
    a1.sources.r3.command = tail -n +$(tail -n1 /root/nnn) -F /root/data/web.log | awk 'ARGIND==1{i=$0;next}{i++;if($0~/^tail/){i=0};print $0;print i >> "/root/nnn";fflush("")}' /root/nnn -
    
    Here /root/data/web.log is the file being monitored and /root/nnn is the file that stores the read position.
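    Read as a pipeline, the command above works roughly as follows. This is only a commented sketch of the same logic; seeding /root/nnn with an initial value before the first run is an extra step that the original config does not show.
    
    # Seed the position file so the first run starts from line 1 (assumption, not in the original config)
    echo 1 > /root/nnn
    # Resume tailing from the last saved line number
    tail -n +$(tail -n1 /root/nnn) -F /root/data/web.log | awk '
        ARGIND==1 { i = $0; next }         # first read the saved counter from /root/nnn (last line wins)
        {
            i++                            # count each log line read from the pipe
            if ($0 ~ /^tail/) { i = 0 }    # reset if tail emits one of its own status lines
            print $0                       # forward the log line to Flume
            print i >> "/root/nnn"         # append the current position
            fflush("")                     # flush so the position survives a crash
        }' /root/nnn -                     # awk reads the position file first, then stdin ("-")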
    
  • Flume 1.7 adds a new source type, taildir, which can monitor multiple files under a directory and persists the read position in real time: Taildir Source

    • Official description
    This source is provided as a preview feature. It does not work on Windows.
    Watch the specified files, and tail them in nearly real-time once detected new lines appended to the each files. If the new lines are being written, this source will retry reading them in wait for the completion of the write.
    This source is reliable and will not miss data even when the tailing files rotate. It periodically writes the last read position of each files on the given position file in JSON format. If Flume is stopped or down for some reason, it can restart tailing from the position written on the existing position file.
    In other use case, this source can also start tailing from the arbitrary position for each files using the given position file. When there is no position file on the specified path, it will start tailing from the first line of each files by default.
    Files will be consumed in order of their modification time. File with the oldest modification time will be consumed first.
    This source does not rename or delete or do any modifications to the file being tailed. Currently this source does not support tailing binary files. It reads text files line by line.
    

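    For reference, the position file mentioned in the description above (and configured below via positionFile) holds a JSON array recording, for each tailed file, its inode, the byte offset read so far, and its path. The values here are purely illustrative:
    
    [{"inode":2496001,"pos":1024,"file":"/root/data/access.log"},
     {"inode":2496002,"pos":0,"file":"/root/data/web.log"}]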

  • Usage example

    • Requirement: have Flume monitor the contents of multiple files under a directory and collect them into the Hadoop cluster for storage in real time.
    • Configuration file
    a1.channels = ch1
    a1.sources = s1
    a1.sinks = hdfs-sink1
    
    # channel
    a1.channels.ch1.type = memory
    a1.channels.ch1.capacity = 100000
    a1.channels.ch1.transactionCapacity = 50000
    
    # source
    a1.sources.s1.channels = ch1
    # Monitor new content appended to multiple files under a directory
    a1.sources.s1.type = taildir
    # Persist each file's consumed offset in JSON format to avoid re-consuming from the beginning
    a1.sources.s1.positionFile = /var/local/apache-flume-1.7.0-bin/taildir_position.json
    a1.sources.s1.filegroups = f1 f2 f3
    a1.sources.s1.filegroups.f1 = /root/data/access.log
    a1.sources.s1.filegroups.f2 = /root/data/nginx.log
    a1.sources.s1.filegroups.f3 = /root/data/web.log
    a1.sources.s1.headers.f1.headerKey = access
    a1.sources.s1.headers.f2.headerKey = nginx
    a1.sources.s1.headers.f3.headerKey = web
    a1.sources.s1.fileHeader = true
    
    ## sink
    a1.sinks.hdfs-sink1.channel = ch1
    a1.sinks.hdfs-sink1.type = hdfs
    a1.sinks.hdfs-sink1.hdfs.path = hdfs://master:9000/demo/data
    a1.sinks.hdfs-sink1.hdfs.filePrefix = event_data
    a1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
    a1.sinks.hdfs-sink1.hdfs.rollSize = 10485760
    a1.sinks.hdfs-sink1.hdfs.rollInterval = 20
    a1.sinks.hdfs-sink1.hdfs.rollCount = 0
    a1.sinks.hdfs-sink1.hdfs.batchSize = 1500
    a1.sinks.hdfs-sink1.hdfs.round = true
    a1.sinks.hdfs-sink1.hdfs.roundUnit = minute
    a1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
    a1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
    a1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
    a1.sinks.hdfs-sink1.hdfs.fileType = DataStream
    a1.sinks.hdfs-sink1.hdfs.writeFormat = Text
    a1.sinks.hdfs-sink1.hdfs.callTimeout = 60000
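    Assuming the configuration above is saved as, say, taildir-hdfs.conf under the Flume conf directory (the file name is hypothetical; the document does not name it, and the install path is taken from the positionFile above), the agent could be started like this:
    
    # Run agent a1 in the foreground with console logging (conf file name is an assumption)
    /var/local/apache-flume-1.7.0-bin/bin/flume-ng agent \
      -c /var/local/apache-flume-1.7.0-bin/conf \
      -f /var/local/apache-flume-1.7.0-bin/conf/taildir-hdfs.conf \
      -n a1 -Dflume.root.logger=INFO,console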
    
  • Installing Flume 1.7

    • Download the archive and unpack it
    • Go into the Flume directory and verify the installation by running:
    apache-flume-1.7.0-bin/bin/flume-ng version
    
    • Output when the installation succeeded:
    Flume 1.7.0
    Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
    Revision: 511d868555dd4d16e6ce4fedc72c2d1454546707
    Compiled by bessbd on Wed Oct 12 20:51:10 CEST 2016
    From source with checksum 0d21b3ffdc55a07e1d08875872c00523
    
    • Modify the configuration file
    mv flume-env.sh.template flume-env.sh
    vi flume-env.sh   # set the JAVA_HOME variable in flume-env.sh
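    After editing, flume-env.sh should contain a line along these lines (the JDK path is an assumption; use the one on your machine):
    
    export JAVA_HOME=/usr/local/jdk1.8.0_144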
    
    • A few small examples
    • Create the Flume job configuration file taildir_behavior.conf under the conf directory
    vi taildir_behavior.conf 
    
    # Name the agent a1
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    # Describe/configure the source
    a1.sources.r1.type = TAILDIR
    a1.sources.r1.filegroups = f1
    a1.sources.r1.filegroups.f1 = /opt/ebohailife/logs/rest/.*behavior.*
    a1.sources.r1.positionFile = /tmp/flume/taildir_behavior_position.json
    a1.sources.r1.fileHeader = false
    
    # Describe the sink
    a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.k1.kafka.bootstrap.servers = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
    a1.sinks.k1.kafka.topic = behaviorlog_r1p3
    a1.sinks.k1.kafka.producer.acks = 1
    a1.sinks.k1.kafka.producer.linger.ms = 1
    a1.sinks.k1.flumeBatchSize = 100
    # a1.sinks.k1.topic = behaviorlog_r1p3
    # Kafka broker list; the properties below are deprecated as of 1.7
    # a1.sinks.k1.brokerList = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
    # a1.sinks.k1.requiredAcks = 1
    # a1.sinks.k1.batchSize = 100
    
    # Use a channel which buffers events in file
    a1.channels.c1.type = file
    # Checkpoint directory
    a1.channels.c1.checkpointDir = /opt/ebohailife/apache-flume-1.7.0-bin/checkpoint
    # Directory for event data
    a1.channels.c1.dataDirs = /opt/ebohailife/apache-flume-1.7.0-bin/data
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    
    • Create the Flume job configuration file taildir_phoneinfo.conf
    vi taildir_phoneinfo.conf
    
    # Name the agent a2
    a2.sources = r2
    a2.sinks = k2
    a2.channels = c2
    
    # Describe/configure the source
    a2.sources.r2.type = TAILDIR
    a2.sources.r2.filegroups = f1
    a2.sources.r2.filegroups.f1 = /opt/ebohailife/logs/rest/.*phoneinfo.*
    a2.sources.r2.positionFile = /tmp/flume/taildir_phoneinfo_position.json
    a2.sources.r2.fileHeader = false
    
    # Describe the sink
    a2.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
    a2.sinks.k2.kafka.bootstrap.servers = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
    a2.sinks.k2.kafka.topic = phoneinfolog_r1p3
    a2.sinks.k2.kafka.producer.acks = 1
    a2.sinks.k2.kafka.producer.linger.ms = 1
    a2.sinks.k2.flumeBatchSize = 100
    # a2.sinks.k2.topic = behaviorlog_r1p3
    # Kafka broker list; the properties below are deprecated as of 1.7
    # a2.sinks.k2.brokerList = 10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092
    # a2.sinks.k2.requiredAcks = 1
    # a2.sinks.k2.batchSize = 100
    
    # Use a channel which buffers events in file
    a2.channels.c2.type = file
    # Checkpoint directory
    a2.channels.c2.checkpointDir = /opt/ebohailife/apache-flume-1.7.0-bin/checkpoint
    # Directory for event data
    a2.channels.c2.dataDirs = /opt/ebohailife/apache-flume-1.7.0-bin/data
    
    # Bind the source and sink to the channel
    a2.sources.r2.channels = c2
    a2.sinks.k2.channel = c2
    
    
    • Create the Kafka topics
    # Create topic behaviorlog_r1p3
    ./kafka-topics.sh --zookeeper 10.104.0.227:2181 --create --topic behaviorlog_r1p3 --partitions 3 --replication-factor 1
    
    # Create topic phoneinfolog_r1p3
    ./kafka-topics.sh --zookeeper 10.104.0.227:2181 --create --topic phoneinfolog_r1p3 --partitions 3 --replication-factor 1
    
    • List the topics
    ./kafka-topics.sh  --list --zookeeper 10.104.0.227:2181
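    Besides --list, kafka-topics.sh also supports --describe, which shows a topic's partition and replica assignment (same ZooKeeper address as above):
    
    ./kafka-topics.sh --describe --topic behaviorlog_r1p3 --zookeeper 10.104.0.227:2181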
    
    • Start the Flume NG agents in the background
    ./flume-ng agent -c /opt/ebohailife/apache-flume-1.7.0-bin/conf -f /opt/ebohailife/apache-flume-1.7.0-bin/conf/taildir_behavior.conf -n a1 >/dev/null 2>&1 &
    
    ./flume-ng agent -c /opt/ebohailife/apache-flume-1.7.0-bin/conf -f /opt/ebohailife/apache-flume-1.7.0-bin/conf/taildir_phoneinfo.conf -n a2 >/dev/null 2>&1 &
    
    # To debug in the foreground, append: -Dflume.root.logger=INFO,console
    
    • Start the Kafka console consumers in the background
    # Consume topic behaviorlog_r1p3
    ./kafka-console-consumer.sh --topic behaviorlog_r1p3 --bootstrap-server 10.104.0.226:9092 >/dev/null 2>&1 &
    
    # Consume topic phoneinfolog_r1p3
    ./kafka-console-consumer.sh --topic phoneinfolog_r1p3 --bootstrap-server 10.104.0.226:9092 >/dev/null 2>&1 &
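    Once the agents and consumers are up, a quick end-to-end check is to append a test line to one of the watched files and read the topic in the foreground. The log file name below is hypothetical; any file matching .*behavior.* in the watched directory works:
    
    # Append a test event to a monitored log file (file name is an assumption)
    echo "test|$(date +%s)" >> /opt/ebohailife/logs/rest/app_behavior.log
    # Read the topic in the foreground to confirm the event arrives
    ./kafka-console-consumer.sh --topic behaviorlog_r1p3 --bootstrap-server 10.104.0.226:9092 --from-beginning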
    
    • Create the log-collection streams
    # Create the phoneinfo_log stream
    CREATE STREAM phoneinfo_log_stream(phoneinfo STRING, tmp1 STRING, ip STRING, tmp2 STRING, phone_model STRING, tmp3 STRING,  phone_version STRING, tmp4 STRING,  area STRING, tmp5 STRING, start_time TIMESTAMP, tmp6 STRING, KDDI STRING, tmp7 STRING, app_source STRING, tmp8 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' TBLPROPERTIES("topic"="phoneinfolog_r1p3" ,"kafka.zookeeper"="10.104.0.227:2181","kafka.broker.list"="10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092");
    
    # Create the behavior_log stream
    CREATE STREAM behavior_log_stream(eventid STRING, tmp1 STRING, ip STRING, tmp2 STRING, user_id STRING, tmp3 STRING,  user_name STRING, tmp4 STRING,  in_time TIMESTAMP, tmp5 STRING, operate_time TIMESTAMP, tmp6 STRING, phone_unicode STRING, tmp7 STRING, trigger_count STRING, tmp8 STRING, diff_in_oper INT, tmp9 STRING, tel_no STRING, tmp10 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' TBLPROPERTIES("topic"="behaviorlog_r1p3" ,"kafka.zookeeper"="10.104.0.227:2181","kafka.broker.list"="10.104.0.226:9092,10.104.0.227:9092,10.104.0.228:9092");
    
    • Create the log tables
    # Create the phoneinfo_log table
    CREATE TABLE phoneinfo_log_tab(phone STRING, ip STRING, phone_model STRING, phone_version STRING, area STRING, start_time TIMESTAMP, KDDI STRING, app_source STRING);
    
    # Create the behavior_log table
    CREATE TABLE behavior_log_tab(eventid STRING, ip STRING, user_id STRING, user_name STRING, in_time TIMESTAMP, operate_time TIMESTAMP, phone_unicode STRING, trigger_count STRING, diff_in_oper INT, tel_no STRING);
    

    To avoid producing too many small files, apply the following settings:
    
    set streamsql.enable.hdfs.batchflush = true          # enable batched flushing
    set streamsql.hdfs.batchflush.size = <num>           # number of messages per flush; flush once this many messages accumulate
    set streamsql.hdfs.batchflush.interval.ms = <num>    # flush interval in milliseconds
    
    # A flush is triggered as soon as either batchflush.size or batchflush.interval.ms is satisfied
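    As an illustration only, the three settings might be combined as below; the numbers are arbitrary placeholders and should be tuned to the actual message rate:
    
    set streamsql.enable.hdfs.batchflush = true
    set streamsql.hdfs.batchflush.size = 1000            # illustrative value
    set streamsql.hdfs.batchflush.interval.ms = 5000     # illustrative value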
    
    • Start the log streams
    # Trigger the phoneinfo_log_stream computation
    INSERT INTO phoneinfo_log_tab SELECT phoneinfo, ip, phone_model, phone_version, area, start_time, KDDI, app_source FROM phoneinfo_log_stream;
    
    # Trigger the behavior_log_stream computation
    INSERT INTO behavior_log_tab SELECT eventid, ip, user_id, user_name, in_time, operate_time, phone_unicode, trigger_count, diff_in_oper, tel_no FROM behavior_log_stream;
    