Environment
CDH 6.2.1
spark-thrift version 2.4.5
Symptom
Connecting with DBeaver is delayed by about 10 seconds
The relevant log output is:
20/10/21 11:41:16 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V1
20/10/21 11:41:16 DEBUG ThriftCLIService: Client's IP Address: 10.103.117.243
20/10/21 11:41:16 DEBUG ThriftCLIService: Client's username: anonymous
20/10/21 11:41:16 DEBUG ThriftCLIService: Client's IP Address: 10.103.117.243
20/10/21 11:41:16 DEBUG UserGroupInformation: PrivilegedAction as:anonymous (auth:SIMPLE) from:org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
20/10/21 11:41:16 DEBUG SessionState: SessionState user: anonymous
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.client.read.shortcircuit = false
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.client.domain.socket.data.traffic = false
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.domain.socket.path = /var/run/hdfs-sockets/dn
20/10/21 11:41:16 DEBUG HAUtil: No HA service delegation token found for logical URI hdfs://newbig
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.client.read.shortcircuit = false
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.client.domain.socket.data.traffic = false
20/10/21 11:41:16 DEBUG BlockReaderLocal: dfs.domain.socket.path = /var/run/hdfs-sockets/dn
20/10/21 11:41:16 DEBUG RetryUtils: multipleLinearRandomRetry = null
20/10/21 11:41:16 DEBUG Client: getting client out of cache: org.apache.hadoop.ipc.Client@134c370e
20/10/21 11:41:17 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9565
20/10/21 11:41:17 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9565
20/10/21 11:41:17 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:18 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9566
20/10/21 11:41:18 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9566
20/10/21 11:41:18 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:19 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9567
20/10/21 11:41:19 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9567
20/10/21 11:41:19 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:20 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9568
20/10/21 11:41:20 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9568
20/10/21 11:41:20 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 2ms
20/10/21 11:41:21 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9569
20/10/21 11:41:21 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9569
20/10/21 11:41:21 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:22 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9570
20/10/21 11:41:22 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9570
20/10/21 11:41:22 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:23 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9571
20/10/21 11:41:23 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9571
20/10/21 11:41:23 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 2ms
20/10/21 11:41:24 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9572
20/10/21 11:41:24 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9572
20/10/21 11:41:24 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:25 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9573
20/10/21 11:41:25 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9573
20/10/21 11:41:25 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:26 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop sending #9574
20/10/21 11:41:26 DEBUG Client: IPC Client (1607020784) connection to cdhnode01.localdomain/172.27.10.70:8032 from hadoop got value #9574
20/10/21 11:41:26 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
20/10/21 11:41:27 DEBUG DataTransferSaslUtil: DataTransferProtocol not using SaslPropertiesResolver, no QOP found in configuration for dfs.data.transfer.protection
20/10/21 11:41:27 DEBUG Client: The ping interval is 60000 ms.
20/10/21 11:41:27 DEBUG Client: Connecting to newbigma02.localdo
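In the log above, the session setup sits between 11:41:16 and 11:41:27 while getApplicationReport is called roughly once per second. A minimal sketch for measuring that window in a saved log file follows; the function name polling_window and the choice of getApplicationReport as the marker are assumptions made for illustration, not part of spark-thrift.

```python
import re
from datetime import datetime

# Timestamp prefix used by the log lines above, e.g. "20/10/21 11:41:16".
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def polling_window(lines, marker="getApplicationReport"):
    """Return (count, first, last) for the timestamped lines containing marker.

    `polling_window` and `marker` are hypothetical names for this sketch.
    """
    stamps = []
    for line in lines:
        match = STAMP.match(line)
        if match and marker in line:
            stamps.append(datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S"))
    if not stamps:
        return 0, None, None
    return len(stamps), stamps[0], stamps[-1]


# Two lines copied from the log above, used as an in-memory sample.
sample = [
    "20/10/21 11:41:17 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms",
    "20/10/21 11:41:26 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms",
]
count, first, last = polling_window(sample)
print(count, "calls spanning", (last - first).total_seconds(), "seconds")
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines.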
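Troubleshooting approach
1. Check disk I/O and network bandwidth
Disk I/O is normal
Network bandwidth is 10 Gbps
2. Check the logs on all hosts
No obviously related latency errors were found in the host logs
3. Look at how spark-thrift is implemented
After reading part of the source code, the delay turned out to come mainly from network name resolution
4. Focus on the network configuration
Root cause: once NetworkManager manages the network, DNS is consulted before the hosts file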
Main references: https://bugzilla.redhat.com/show_bug.cgi?id=1093777
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-starting_networkmanager
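To check step 4 on a given host, one option is to time a hostname lookup and to read the lookup order the resolver is configured with. The sketch below assumes a glibc-based system where that order comes from the `hosts:` line of /etc/nsswitch.conf; the note itself does not mention that file, and the helper names time_resolution and hosts_lookup_order are hypothetical.

```python
import socket
import sys
import time


def time_resolution(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, addresses) for one lookup of hostname."""
    start = time.perf_counter()
    try:
        infos = resolver(hostname, None)
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []  # the lookup failed; the elapsed time is still useful
    return time.perf_counter() - start, addresses


def hosts_lookup_order(nsswitch_text):
    """Return the sources on the `hosts:` line of nsswitch.conf, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            tokens = line[len("hosts:"):].split()
            # Skip action items such as [NOTFOUND=return].
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


if __name__ == "__main__":
    # Assumption: /etc/nsswitch.conf exists, as on glibc-based Linux hosts.
    try:
        with open("/etc/nsswitch.conf", encoding="utf-8") as fh:
            print("hosts lookup order:", hosts_lookup_order(fh.read()))
    except OSError as exc:
        print("could not read /etc/nsswitch.conf:", exc)
    # Hostnames to time are taken from the command line.
    for host in sys.argv[1:]:
        elapsed, addresses = time_resolution(host)
        print(f"{host}: {elapsed:.3f}s -> {addresses}")
```

Run it with the cluster hostnames as arguments, for example cdhnode01.localdomain from the log. A lookup that takes seconds, or an order that lists dns before files, would be consistent with the root cause described above.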
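5. Stop /etc/resolv.conf from being automatically reverted on Linux
Edit /etc/sysconfig/network-scripts/ifcfg-team0 and add the line PEERDNS=no, so that DNS servers handed out by DHCP no longer overwrite /etc/resolv.conf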
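After editing the interface file it can help to read the setting back and confirm what is actually in place. The following is a small sketch that parses the KEY=value lines of an ifcfg file and returns them as a dictionary; read_ifcfg is a hypothetical helper, and the file path is passed in by the caller rather than assumed.

```python
import sys


def read_ifcfg(text):
    """Parse KEY=value lines of an ifcfg file into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


if __name__ == "__main__":
    # Usage: pass the path of the interface file as the first argument.
    for path in sys.argv[1:2]:
        with open(path, encoding="utf-8") as fh:
            settings = read_ifcfg(fh.read())
        print("PEERDNS =", settings.get("PEERDNS", "<not set>"))
```

A missing PEERDNS key is reported as not set, which distinguishes "never configured" from an explicit value.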