Offline benchmarking of ES clusters with the Rally tool, plus optimization recommendations

This article covers Rally, the official benchmarking tool for ES clusters: how to install it offline and prepare the data, and how to run a cluster benchmark, including tool usage, load generation, and result analysis. It also gives ES optimization recommendations such as memory allocation, disabling swap, and shard settings, plus capacity-model guidance covering public-cloud best practices, community recommendations, and monitoring metrics.


This article discusses Rally, the official benchmarking tool for ES clusters, and how to use it. The benchmark workloads come from the official repository: https://github.com/elastic/rally-tracks.

Because installing Rally in an offline (intranet) environment is difficult and involves resolving complex dependencies, this article focuses on running Rally in a container to perform the benchmarks.

1. Offline Rally installation and data preparation

To install Rally offline, download the following:

1. Download the Rally Docker image and save it as rally2.2.1.tar.bz2.
2. Download the Rally track (benchmark workload) repository, either as an archive or with git:

git clone git@github.com:elastic/rally-tracks.git

Save it as rally-tracks-7.12.

3. Download the data sets for the tracks you need. The data sets are large, so download only what you actually plan to run; this article downloads only the http_logs data set and stores it under rally/benchmarks/data, preserving the directory structure.

The other data sets can be downloaded manually in the same way, as follows.

The esrally data sets are described in the rally-tracks repository on GitHub: https://github.com/elastic/rally-tracks
Looking at rally-tracks/download.sh, we can see that all of the data files are downloaded from an Amazon S3 host.

From the code, the base path for the corpora is http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora

From the rally-tracks/geonames/files.txt file, we can see that the geonames data files are named documents-2.json.bz2 and documents-2-1k.json.bz2.

Combining the two, we can fetch the file directly in a browser:
http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geonames/documents-2.json.bz2
The file downloads successfully. The other data sets can be fetched manually in the same way.
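As a sketch of scripting that manual download (TRACK and FILE are example values taken from the geonames file names above; for other tracks, use the file names listed in that track's files.txt):

# Download one corpus file and place it under the Rally data directory
TRACK=geonames
FILE=documents-2.json.bz2
mkdir -p rally/benchmarks/data/${TRACK}
wget -O rally/benchmarks/data/${TRACK}/${FILE} \
  http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/${TRACK}/${FILE}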

4. Configure the es.sh benchmark script:

#!/bin/bash

WORKSPACE=$(pwd)
echo ${WORKSPACE}

ESADDRESS=$1

######### Check that the Rally docker image has been loaded
DOCKER=$(docker images | grep elastic/rally | awk '{print $3}')
if [ -z "${DOCKER}" ]; then
    echo "Please load the docker image first, e.g.: docker load -i <image file>"
    exit 1
fi

if [ -z "${ESADDRESS}" ]; then
    echo "Please provide the ES cluster address"
    exit 1
fi

TESTTYPE=$2
if [ -z "${TESTTYPE}" ]; then
    echo "Please provide the test type; currently supported: geonames http_logs"
    exit 1
fi

REPORTFILE=$3
if [ -z "${REPORTFILE}" ]; then
    REPORTFILE=result.csv
    echo "No report name given, using the default result.csv at ${WORKSPACE}/rally/benchmarks/result.csv"
fi
echo "The report will be written to ${WORKSPACE}/rally/benchmarks/${REPORTFILE}"

docker="docker"

EXECCMD="${docker} run --network host -v ${WORKSPACE}/rally:/rally/.rally elastic/rally:2.2.1 race --pipeline=benchmark-only --target-hosts=${ESADDRESS} --track-path=/rally/.rally/benchmarks/data/${TESTTYPE} --track-params=bulk_indexing_clients:20 --offline --report-format=csv --report-file=/rally/.rally/benchmarks/${REPORTFILE}"
echo ${EXECCMD}
${EXECCMD}

# Clean up finished Rally containers
docker rm -f $(docker ps -a | grep elastic/rally | awk '{print $1}')

echo "Done"

rally2.2.1.tar.bz2 is the Docker image file.
rally is the directory mounted into the container; the benchmark data lives in ./rally/benchmarks/data and the results are written to ./rally/benchmarks/*.csv.

After completing these four steps, the working directory is organized as sketched below.
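A minimal sketch of the expected layout (the exact files under http_logs depend on which corpus files you downloaded and on the track definition copied from rally-tracks-7.12, so treat the names as assumptions):

es.sh
rally2.2.1.tar.bz2
rally-tracks-7.12/
rally/
    benchmarks/
        data/
            http_logs/      # track definition (track.json etc.) plus the downloaded data files
        result.csv          # benchmark report, written after the run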

2. Running the cluster benchmark

2.1 How to use the benchmark tool

Running es.sh requires the ES cluster address and the data set to test, for example:

./es.sh xx.xx.xx.xx:9200,xx.xx.xx.xx:9200,xx.xx.xx.xx:9200 http_logs

xx.xx.xx.xx:9200,xx.xx.xx.xx:9200,xx.xx.xx.xx:9200 is the list of ES cluster addresses
http_logs is the data set to test
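The script also accepts an optional third argument, the report file name (REPORTFILE in es.sh). For example, with the hypothetical name my_result.csv:

./es.sh xx.xx.xx.xx:9200,xx.xx.xx.xx:9200,xx.xx.xx.xx:9200 http_logs my_result.csv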

2.2 Running the ES cluster benchmark

1. Benchmark results vary greatly from cluster to cluster depending on the underlying hardware, so disk I/O should be benchmarked at the same time.

Install fio first (skip this if it is already installed):

yum install -y fio

Disk I/O write performance benchmark:

fio -filename=/data1/fio.txt -ioengine=libaio -direct=1 -iodepth 1 -thread -rw=randwrite -bs=4k -size=100G -numjobs=48 -runtime=300 -group_reporting -name=mytest

Disk I/O read performance benchmark:

fio -filename=/data1/fio.txt -ioengine=libaio -direct=1 -iodepth 1 -thread -rw=randread -bs=4k -size=100G -numjobs=48 -runtime=300 -group_reporting -name=mytest

2. Pick a server and install docker (yum install -y docker).
3. Upload the tool package, extract it, and import the image with: docker load -i rally2.2.1.tar.bz2
4. Run es.sh:

./es.sh xx.xx.xx.xx:9200,xx.xx.xx.xx:9200,xx.xx.xx.xx:9200 http_logs

Different data sets differ greatly in structure and characteristics, so choose and test with one that matches your own workload. This article uses http_logs.

According to the official documentation, the currently available data sets include:
1. Geonames: for evaluating the performance of structured data.
2. Geopoint: for evaluating the performance of geo queries.
3. Percolator: for evaluating the performance of percolation queries.
4. PMC: for evaluating the performance of full text search.
5. NYC taxis: for evaluating the performance for highly structured data.
6. Nested: for evaluating the performance for nested documents.
7. http_logs: for evaluating the performance of (Web) server logs.
8. noaa: for evaluating the performance of range fields.

2.3 Analyzing the benchmark results

The benchmark produces a large number of performance metrics. The ones usually worth watching are:

  • throughput: the throughput of each operation, e.g. index, search
  • latency: the response-time distribution of each operation
  • Heap used for x: heap memory used by the various segment data structures
Metric | Task | Value | Unit | Note
Cumulative indexing time of primary shards |  | 13.36938333 | min | lower is better
Min cumulative indexing time across primary shards |  | 0 | min | lower is better
Median cumulative indexing time across primary shards |  |  | min | lower is better
Max cumulative indexing time across primary shards |  |  | min | lower is better
Cumulative indexing throttle time of primary shards |  | 0 | min | lower is better
Min cumulative indexing throttle time across primary shards |  | 0 | min | lower is better
Median cumulative indexing throttle time across primary shards |  | 0 | min | lower is better
Max cumulative indexing throttle time across primary shards |  | 0 | min | lower is better
Cumulative merge time of primary shards |  | 4.2677 | min | lower is better
Cumulative merge count of primary shards |  | 57 |  | lower is better
Min cumulative merge time across primary shards |  | 0 | min | lower is better
Median cumulative merge time across primary shards |  | 1.348033333 | min | lower is better
Max cumulative merge time across primary shards |  | 1.464033333 | min | lower is better
Cumulative merge throttle time of primary shards |  | 1.065866667 | min | lower is better
Min cumulative merge throttle time across primary shards |  | 0 | min | lower is better
Median cumulative merge throttle time across primary shards |  | 0.328816667 | min | lower is better
Max cumulative merge throttle time across primary shards |  | 0.3759 | min | lower is better
Cumulative refresh time of primary shards |  | 0.798716667 | min | lower is better
Cumulative refresh count of primary shards |  | 320 |  | lower is better
Min cumulative refresh time across primary shards |  | 1.67E-05 | min | lower is better
Median cumulative refresh time across primary shards |  | 0.250933333 | min | lower is better
Max cumulative refresh time across primary shards |  | 0.266416667 | min | lower is better
Cumulative flush time of primary shards |  | 0.584383333 | min | lower is better
Cumulative flush count of primary shards |  | 46 |  | lower is better
Min cumulative flush time across primary shards |  | 0 | min | lower is better
Median cumulative flush time across primary shards |  | 0.159566667 | min | lower is better
Max cumulative flush time across primary shards |  | 0.1632 | min | lower is better
Total Young Gen GC time |  | 4.179 | s | lower is better
Total Young Gen GC count |  | 961 |  | lower is better
Total Old Gen GC time |  | 0.221 | s | lower is better
Total Old Gen GC count |  | 4 |  | lower is better
Store size |  | 3.018052787 | GB | lower is better
Translog size |  | 4.10E-07 | GB | lower is better
Heap used for segments |  | 0.461437225 | MB | lower is better
Heap used for doc values |  | 0.021503448 | MB | lower is better
Heap used for terms |  | 0.356811523 | MB | lower is better
Heap used for norms |  | 0.048034668 | MB | lower is better
Heap used for points |  | 0 | MB | lower is better
Heap used for stored fields |  | 0.035087585 | MB | lower is better
Segment count |  | 71 |  | lower is better
error rate | index-append | 0 | % |
Min Throughput | index-stats | 90.02 | ops/s | higher is better
Mean Throughput | index-stats | 90.03 | ops/s | higher is better
Median Throughput | index-stats | 90.03 | ops/s | higher is better
Max Throughput | index-stats | 90.06 | ops/s | higher is better
50th percentile latency | index-stats | 2.688714827 | ms | time between submitting the request and receiving the full response (50% of requests within this value); lower is better
90th percentile latency | index-stats | 3.594806814 | ms |
99th percentile latency | index-stats | 6.877146151 | ms |
99.9th percentile latency | index-stats | 12.57476813 | ms |
100th percentile latency | index-stats | 19.47905542 | ms |
50th percentile service time | index-stats | 1.454657991 | ms |
90th percentile service time | index-stats | 1.97627194 | ms |
99th percentile service time | index-stats | 5.543909213 | ms |
99.9th percentile service time | index-stats | 10.26782569 | ms |
100th percentile service time | index-stats | 18.59820995 | ms |
error rate | index-stats | 0 | % |
Min Throughput | node-stats | 90.02 | ops/s |
Mean Throughput | node-stats | 90.05 | ops/s |
Median Throughput | node-stats | 90.04 | ops/s |
Max Throughput | node-stats | 90.14 | ops/s |
50th percentile latency | node-stats | 2.815647516 | ms |
90th percentile latency | node-stats | 4.044909403 | ms |
99th percentile latency | node-stats | 5.212370545 | ms |
99.9th percentile latency | node-stats | 6.852936187 | ms |
100th percentile latency | node-stats | 6.934299599 | ms |
50th percentile service time | node-stats | 1.92963396 | ms |
90th percentile service time | node-stats | 2.280614187 | ms |
99th percentile service time | node-stats | 4.373069127 | ms |
99.9th percentile service time | node-stats | 5.121724201 | ms |
100th percentile service time | node-stats | 5.12892101 | ms |
error rate | node-stats | 0 | % |
Min Throughput | default | 50.02 | ops/s |
Mean Throughput | default | 50.04 | ops/s |
Median Throughput | default | 50.04 | ops/s |
Max Throughput | default | 50.07 | ops/s |
50th percentile latency | default | 3.442207992 | ms |
90th percentile latency | default | 4.541033355 | ms |
99th percentile latency | default | 5.171663366 | ms |
99.9th percentile latency | default | 9.028199148 | ms |
100th percentile latency | default | 9.637624957 | ms |
50th percentile service time | default | 2.594712481 | ms |
90th percentile service time | default | 3.050701669 | ms |
99th percentile service time | default | 3.448219185 | ms |
99.9th percentile service time | default | 8.483097347 | ms |
100th percentile service time | default | 9.405504912 | ms |
error rate | default | 0 | % |
Min Throughput | term | 100.01 | ops/s |
Mean Throughput | term | 100.02 | ops/s |
Median Throughput | term | 100.02 | ops/s |
Max Throughput | term | 100.04 | ops/s |
50th percentile latency | term | 3.199955565 | ms |
90th percentile latency | term | 4.159100866 | ms |
99th percentile latency | term | 9.006197074 | ms |
99.9th percentile latency | term | 20.99158259 | ms |
100th percentile latency | term | 21.52055805 | ms |
50th percentile service time | term | 2.48551101 | ms |
90th percentile service time | term | 3.239720117 | ms |
99th percentile service time | term | 7.17226712 | ms |
99.9th percentile service time | term | 15.9544915 | ms |
100th percentile service time | term | 19.73530301 | ms |
error rate | term | 0 | % |
Min Throughput | phrase | 109.99 | ops/s |
Mean Throughput | phrase | 110 | ops/s |
Median Throughput | phrase | 110 | ops/s |
Max Throughput | phrase | 110.01 | ops/s |
50th percentile latency | phrase | 3.169040603 | ms |
90th percentile latency | phrase | 3.634604893 | ms |
99th percentile latency | phrase | 4.35058805 | ms |
99.9th percentile latency | phrase | 16.27933249 | ms |
100th percentile latency | phrase | 17.08333869 | ms |
50th percentile service time | phrase | 2.451517503 | ms |
90th percentile service time | phrase | 2.724279161 | ms |
99th percentile service time | phrase | 3.216251438 | ms |
99.9th percentile service time | phrase | 9.749228635 | ms |
100th percentile service time | phrase | 15.46012098 | ms |
error rate | phrase | 0 | % |
Min Throughput | country_agg_uncached | 3 | ops/s |
Mean Throughput | country_agg_uncached | 3 | ops/s |
Median Throughput | country_agg_uncached | 3 | ops/s |
Max Throughput | country_agg_uncached | 3 | ops/s |
50th percentile latency | country_agg_uncached | 265.1378055 | ms |
90th percentile latency | country_agg_uncached | 268.3491967 | ms |
99th percentile latency | country_agg_uncached | 282.9874858 | ms |
100th percentile latency | country_agg_uncached | 299.8582891 | ms |
50th percentile service time | country_agg_uncached | 264.1177385 | ms |
90th percentile service time | country_agg_uncached | 267.2917891 | ms |
99th percentile service time | country_agg_uncached | 282.0132841 | ms |
100th percentile service time | country_agg_uncached | 298.699945 | ms |
error rate | country_agg_uncached | 0 | % |
Min Throughput | country_agg_cached | 97.64 | ops/s |
Mean Throughput | country_agg_cached | 98.26 | ops/s |
Median Throughput | country_agg_cached | 98.32 | ops/s |
Max Throughput | country_agg_cached | 98.7 | ops/s |
50th percentile latency | country_agg_cached | 2.175618487 | ms |
90th percentile latency | country_agg_cached | 3.358712979 | ms |
99th percentile latency | country_agg_cached | 3.663528312 | ms |
99.9th percentile latency | country_agg_cached | 4.533531366 | ms |
100th percentile latency | country_agg_cached | 9.735687054 | ms |
50th percentile service time | country_agg_cached | 1.210322545 | ms |
90th percentile service time | country_agg_cached | 1.381615282 | ms |
99th percentile service time | country_agg_cached | 1.652208896 | ms |
99.9th percentile service time | country_agg_cached | 3.39570541 | ms |
100th percentile service time | country_agg_cached | 9.514000965 | ms |
error rate | country_agg_cached | 0 | % |
Min Throughput | scroll | 20.05 | pages/s |
Mean Throughput | scroll | 20.06 | pages/s |
Median Throughput | scroll | 20.06 | pages/s |
Max Throughput | scroll | 20.08 | pages/s |
50th percentile latency | scroll | 273.2520165 | ms |
90th percentile latency | scroll | 301.6026772 | ms |
99th percentile latency | scroll | 347.1331405 | ms |
100th percentile latency | scroll | 349.3009 | ms |
50th percentile service time | scroll | 271.233834 | ms |
90th percentile service time | scroll | 298.9778046 | ms |
99th percentile service time | scroll | 345.1081409 | ms |
100th percentile service time | scroll | 346.241483 | ms |
error rate | scroll | 0 | % |
Min Throughput | expression | 1.5 | ops/s |
Mean Throughput | expression | 1.5 | ops/s |
Median Throughput | expression | 1.5 | ops/s |
Max Throughput | expression | 1.5 | ops/s |
50th percentile latency | expression | 464.535454 | ms |
90th percentile latency | expression | 470.8226439 | ms |
99th percentile latency | expression | 485.6872773 | ms |
100th percentile latency | expression | 487.582457 | ms |
50th percentile service time | expression | 463.644907 | ms |
90th percentile service time | expression | 469.5449809 | ms |
99th percentile service time | expression | 484.4586398 | ms |
100th percentile service time | expression | 486.768786 | ms |
error rate | expression | 0 | % |
Min Throughput | painless_static | 1.4 | ops/s |
Mean Throughput | painless_static | 1.4 | ops/s |
Median Throughput | painless_static | 1.4 | ops/s |
Max Throughput | painless_static | 1.4 | ops/s |
50th percentile latency | painless_static | 581.6272671 | ms |
90th percentile latency | painless_static | 588.2054265 | ms |
99th percentile latency | painless_static | 597.229797 | ms |
100th percentile latency | painless_static | 601.7254018 | ms |
50th percentile service time | painless_static | 580.774506 | ms |
90th percentile service time | painless_static | 587.0630695 | ms |
99th percentile service time | painless_static | 595.7945851 | ms |
100th percentile service time | painless_static | 600.6218339 | ms |
error rate | painless_static | 0 | % |
Min Throughput | painless_dynamic | 1.4 | ops/s |
Mean Throughput | painless_dynamic | 1.4 | ops/s |
Median Throughput | painless_dynamic | 1.4 | ops/s |
Max Throughput | painless_dynamic | 1.4 | ops/s |
50th percentile latency | painless_dynamic | 598.3268638 | ms |
90th percentile latency | painless_dynamic | 604.6501834 | ms |
99th percentile latency | painless_dynamic | 618.8403735 | ms |
100th percentile latency | painless_dynamic | 619.2588332 | ms |
50th percentile service time | painless_dynamic | 597.337956 | ms |
90th percentile service time | painless_dynamic | 603.6431402 | ms |
99th percentile service time | painless_dynamic | 617.5273529 | ms |
100th percentile service time | painless_dynamic | 618.3759769 | ms |
error rate | painless_dynamic | 0 | % |
Min Throughput | decay_geo_gauss_function_score | 1 | ops/s |
Mean Throughput | decay_geo_gauss_function_score | 1 | ops/s |
Median Throughput | decay_geo_gauss_function_score | 1 | ops/s |
Max Throughput | decay_geo_gauss_function_score | 1 | ops/s |
50th percentile latency | decay_geo_gauss_function_score | 558.662883 | ms |
90th percentile latency | decay_geo_gauss_function_score | 566.1635245 | ms |
99th percentile latency | decay_geo_gauss_function_score | 576.7578347 | ms |
100th percentile latency | decay_geo_gauss_function_score | 577.7786931 | ms |
50th percentile service time | decay_geo_gauss_function_score | 557.0170344 | ms |
90th percentile service time | decay_geo_gauss_function_score | 565.1927938 | ms |
99th percentile service time | decay_geo_gauss_function_score | 575.6546767 | ms |
100th percentile service time | decay_geo_gauss_function_score | 576.90977 | ms |
error rate | decay_geo_gauss_function_score | 0 | % |
Min Throughput | decay_geo_gauss_script_score | 1 | ops/s |
Mean Throughput | decay_geo_gauss_script_score | 1 | ops/s |
Median Throughput | decay_geo_gauss_script_score | 1 | ops/s |
Max Throughput | decay_geo_gauss_script_score | 1 | ops/s |
50th percentile latency | decay_geo_gauss_script_score | 575.896866 | ms |
90th percentile latency | decay_geo_gauss_script_score | 584.6959502 | ms |
99th percentile latency | decay_geo_gauss_script_score | 595.1810607 | ms |
100th percentile latency | decay_geo_gauss_script_score | 610.31794 | ms |
50th percentile service time | decay_geo_gauss_script_score | 574.895048 | ms |
90th percentile service time | decay_geo_gauss_script_score | 583.542251 | ms |
99th percentile service time | decay_geo_gauss_script_score | 594.0682872 | ms |
100th percentile service time | decay_geo_gauss_script_score | 608.403309 | ms |
error rate | decay_geo_gauss_script_score | 0 | % |
Min Throughput | field_value_function_score | 1.5 | ops/s |
Mean Throughput | field_value_function_score | 1.5 | ops/s |
Median Throughput | field_value_function_score | 1.5 | ops/s |
Max Throughput | field_value_function_score | 1.5 | ops/s |
50th percentile latency | field_value_function_score | 217.4870086 | ms |
90th percentile latency | field_value_function_score | 221.4966101 | ms |
99th percentile latency | field_value_function_score | 256.0486869 | ms |
100th percentile latency | field_value_function_score | 263.0984769 | ms |
50th percentile service time | field_value_function_score | 216.1670045 | ms |
90th percentile service time | field_value_function_score | 220.499306 | ms |
99th percentile service time | field_value_function_score | 254.5428219 | ms |
100th percentile service time | field_value_function_score | 261.8149639 | ms |
error rate | field_value_function_score | 0 | % |
Min Throughput | field_value_script_score | 1.5 | ops/s |
Mean Throughput | field_value_script_score | 1.5 | ops/s |
Median Throughput | field_value_script_score | 1.5 | ops/s |
Max Throughput | field_value_script_score | 1.5 | ops/s |
50th percentile latency | field_value_script_score | 287.0456218 | ms |
90th percentile latency | field_value_script_score | 290.0809773 | ms |
99th percentile latency | field_value_script_score | 298.1395952 | ms |
100th percentile latency | field_value_script_score | 312.1123726 | ms |
50th percentile service time | field_value_script_score | 285.789164 | ms |
90th percentile service time | field_value_script_score | 288.8581588 | ms |
99th percentile service time | field_value_script_score | 296.5342737 | ms |
100th percentile service time | field_value_script_score | 311.1719809 | ms |
error rate | field_value_script_score | 0 | % |
Min Throughput | large_terms | 1.1 | ops/s |
Mean Throughput | large_terms | 1.1 | ops/s |
Median Throughput | large_terms | 1.1 | ops/s |
Max Throughput | large_terms | 1.1 | ops/s |
50th percentile latency | large_terms | 572.2508298 | ms |
90th percentile latency | large_terms | 580.3001306 | ms |
99th percentile latency | large_terms | 620.8813236 | ms |
100th percentile latency | large_terms | 626.353689 | ms |
50th percentile service time | large_terms | 563.7678955 | ms |
90th percentile service time | large_terms | 572.1782421 | ms |
99th percentile service time | large_terms | 613.3370135 | ms |
100th percentile service time | large_terms | 617.420621 | ms |
error rate | large_terms | 0 | % |
Min Throughput | large_filtered_terms | 1.1 | ops/s |
Mean Throughput | large_filtered_terms | 1.1 | ops/s |
Median Throughput | large_filtered_terms | 1.1 | ops/s |
Max Throughput | large_filtered_terms | 1.1 | ops/s |
50th percentile latency | large_filtered_terms | 589.2866509 | ms |
90th percentile latency | large_filtered_terms | 593.4173963 | ms |
99th percentile latency | large_filtered_terms | 598.5252649 | ms |
100th percentile latency | large_filtered_terms | 602.3230727 | ms |
50th percentile service time | large_filtered_terms | 581.2035115 | ms |
90th percentile service time | large_filtered_terms | 585.5575252 | ms |
99th percentile service time | large_filtered_terms | 590.5933169 | ms |
100th percentile service time | large_filtered_terms | 594.4011461 | ms |
error rate | large_filtered_terms | 0 | % |
Min Throughput | large_prohibited_terms | 1.1 | ops/s |
Mean Throughput | large_prohibited_terms | 1.1 | ops/s |
Median Throughput | large_prohibited_terms | 1.1 | ops/s |
Max Throughput | large_prohibited_terms | 1.1 | ops/s |
50th percentile latency | large_prohibited_terms | 589.4530075 | ms |
90th percentile latency | large_prohibited_terms | 596.0567744 | ms |
99th percentile latency | large_prohibited_terms | 624.6372295 | ms |
100th percentile latency | large_prohibited_terms | 636.1257123 | ms |
50th percentile service time | large_prohibited_terms | 581.6967285 | ms |
90th percentile service time | large_prohibited_terms | 587.9331864 | ms |
99th percentile service time | large_prohibited_terms | 616.5220673 | ms |
100th percentile service time | large_prohibited_terms | 628.309642 | ms |
error rate | large_prohibited_terms | 0 | % |
Min Throughput | desc_sort_population | 1.5 | ops/s |
Mean Throughput | desc_sort_population | 1.51 | ops/s |
Median Throughput | desc_sort_population | 1.51 | ops/s |
Max Throughput | desc_sort_population | 1.51 | ops/s |
50th percentile latency | desc_sort_population | 103.1405666 | ms |
90th percentile latency | desc_sort_population | 105.2754088 | ms |
99th percentile latency | desc_sort_population | 131.8258836 | ms |
100th percentile latency | desc_sort_population | 152.3099904 | ms |
50th percentile service time | desc_sort_population | 101.670836 | ms |
90th percentile service time | desc_sort_population | 104.0073033 | ms |
99th percentile service time | desc_sort_population | 130.6022178 | ms |
100th percentile service time | desc_sort_population | 150.8698669 | ms |
error rate | desc_sort_population | 0 | % |
Min Throughput | asc_sort_population | 1.5 | ops/s |
Mean Throughput | asc_sort_population | 1.51 | ops/s |
Median Throughput | asc_sort_population | 1.51 | ops/s |
Max Throughput | asc_sort_population | 1.51 | ops/s |
50th percentile latency | asc_sort_population | 107.5372407 | ms |
90th percentile latency | asc_sort_population | 110.8386073 | ms |
99th percentile latency | asc_sort_population | 116.6895737 | ms |
100th percentile latency | asc_sort_population | 119.4045231 | ms |
50th percentile service time | asc_sort_population | 106.1783125 | ms |
90th percentile service time | asc_sort_population | 109.3649962 | ms |
99th percentile service time | asc_sort_population | 115.3436784 | ms |
100th percentile service time | asc_sort_population | 118.1872 | ms |
error rate | asc_sort_population | 0 | % |
Min Throughput | asc_sort_with_after_population | 1.5 | ops/s |
Mean Throughput | asc_sort_with_after_population | 1.5 | ops/s |
Median Throughput | asc_sort_with_after_population | 1.5 | ops/s |
Max Throughput | asc_sort_with_after_population | 1.51 | ops/s |
50th percentile latency | asc_sort_with_after_population | 129.1767997 | ms |
90th percentile latency | asc_sort_with_after_population | 133.4439944 | ms |
99th percentile latency | asc_sort_with_after_population | 140.8711791 | ms |
100th percentile latency | asc_sort_with_after_population | 144.9907233 | ms |
50th percentile service time | asc_sort_with_after_population | 127.6733635 | ms |
90th percentile service time | asc_sort_with_after_population | 131.2300396 | ms |
99th percentile service time | asc_sort_with_after_population | 140.3493805 | ms |
100th percentile service time | asc_sort_with_after_population | 143.983128 | ms |
error rate | asc_sort_with_after_population | 0 | % |
Min Throughput | desc_sort_geonameid | 6.01 | ops/s |
Mean Throughput | desc_sort_geonameid | 6.01 | ops/s |
Median Throughput | desc_sort_geonameid | 6.01 | ops/s |
Max Throughput | desc_sort_geonameid | 6.02 | ops/s |
50th percentile latency | desc_sort_geonameid | 6.548634556 | ms |
90th percentile latency | desc_sort_geonameid | 7.124439673 | ms |
99th percentile latency | desc_sort_geonameid | 8.067587848 | ms |
100th percentile latency | desc_sort_geonameid | 8.096768637 | ms |
50th percentile service time | desc_sort_geonameid | 5.541916529 | ms |
90th percentile service time | desc_sort_geonameid | 5.901245272 | ms |
99th percentile service time | desc_sort_geonameid | 6.820803307 | ms |
100th percentile service time | desc_sort_geonameid | 6.879838067 | ms |
error rate | desc_sort_geonameid | 0 | % |
Min Throughput | desc_sort_with_after_geonameid | 5.99 | ops/s |
Mean Throughput | desc_sort_with_after_geonameid | 6 | ops/s |
Median Throughput | desc_sort_with_after_geonameid | 6 | ops/s |
Max Throughput | desc_sort_with_after_geonameid | 6 | ops/s |
50th percentile latency | desc_sort_with_after_geonameid | 142.7790278 | ms |
90th percentile latency | desc_sort_with_after_geonameid | 151.9306856 | ms |
99th percentile latency | desc_sort_with_after_geonameid | 208.632983 | ms |
100th percentile latency | desc_sort_with_after_geonameid | 211.4377066 | ms |
50th percentile service time | desc_sort_with_after_geonameid | 141.9006125 | ms |
90th percentile service time | desc_sort_with_after_geonameid | 149.5498388 | ms |
99th percentile service time | desc_sort_with_after_geonameid | 178.1799831 | ms |
100th percentile service time | desc_sort_with_after_geonameid | 210.2229249 | ms |
error rate | desc_sort_with_after_geonameid | 0 | % |
Min Throughput | asc_sort_geonameid | 6.02 | ops/s |
Mean Throughput | asc_sort_geonameid | 6.02 | ops/s |
Median Throughput | asc_sort_geonameid | 6.02 | ops/s |
Max Throughput | asc_sort_geonameid | 6.02 | ops/s |
50th percentile latency | asc_sort_geonameid | 6.162967999 | ms |
90th percentile latency | asc_sort_geonameid | 6.680636853 | ms |
99th percentile latency | asc_sort_geonameid | 7.167303486 | ms |
100th percentile latency | asc_sort_geonameid | 7.649931009 | ms |
50th percentile service time | asc_sort_geonameid | 5.219853483 | ms |
90th percentile service time | asc_sort_geonameid | 5.514943344 | ms |
99th percentile service time | asc_sort_geonameid | 5.816583 | ms |
100th percentile service time | asc_sort_geonameid | 6.203371915 | ms |
error rate | asc_sort_geonameid | 0 | % |
Min Throughput | asc_sort_with_after_geonameid | 6 | ops/s |
Mean Throughput | asc_sort_with_after_geonameid | 6 | ops/s |
Median Throughput | asc_sort_with_after_geonameid | 6 | ops/s |
Max Throughput | asc_sort_with_after_geonameid | 6.01 | ops/s |
50th percentile latency | asc_sort_with_after_geonameid | 130.5534603 | ms |
90th percentile latency | asc_sort_with_after_geonameid | 131.7300497 | ms |
99th percentile latency | asc_sort_with_after_geonameid | 135.3648191 | ms |
100th percentile latency | asc_sort_with_after_geonameid | 139.0438636 | ms |
50th percentile service time | asc_sort_with_after_geonameid | 129.4173571 | ms |
90th percentile service time | asc_sort_with_after_geonameid | 130.443844 | ms |
99th percentile service time | asc_sort_with_after_geonameid | 133.3877408 | ms |
100th percentile service time | asc_sort_with_after_geonameid | 137.657303 | ms |
error rate | asc_sort_with_after_geonameid | 0 | % |

3. ES optimization recommendations

3.1 ES memory allocation

When the machine has less than 64 GB of memory, follow the general rule: give 50% to the ES heap and leave 50% for Lucene.
When the machine has more than 64 GB of memory, follow these rules:
If the main use case is full-text search, allocating 4-32 GB to the ES heap is enough; leave the rest to the operating system for Lucene (the segment cache), which provides faster query performance.
If the main use case is aggregation or sorting, mostly on numerics, dates, geo_points and not_analyzed string fields, likewise allocate 4-32 GB to the ES heap and leave the rest to the OS for Lucene, which provides fast document-based aggregation and sorting.

If the use case is aggregation or sorting on analyzed string fields, more heap is needed. In that case run multiple ES instances on the machine, each with a heap of no more than 50% of its share of memory (and never more than 32 GB: below roughly 32 GB the JVM can use compressed object pointers to save space), leaving at least 50% for Lucene.
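As a minimal sketch of the heap setting (assuming ES is configured through config/jvm.options; 31g is only an example value chosen to stay below the ~32 GB compressed-pointer threshold):

# config/jvm.options — set min and max heap to the same value, below ~32 GB
-Xms31g
-Xmx31g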

3.2 Disable swap

Disable swap: once memory starts being swapped to disk, performance degrades catastrophically. Set bootstrap.memory_lock: true in elasticsearch.yml so the JVM keeps its memory locked and ES performance is preserved.
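A minimal sketch of the usual combination (standard Linux/Elasticsearch practice rather than steps from this article; file paths and user names vary by installation):

# turn swap off immediately; also remove swap entries from /etc/fstab to make it permanent
swapoff -a

# elasticsearch.yml
#   bootstrap.memory_lock: true

# allow the elasticsearch user to lock memory, e.g. in /etc/security/limits.conf
#   elasticsearch soft memlock unlimited
#   elasticsearch hard memlock unlimited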

3.3 ES shard recommendations

Too few shards is not necessarily good: with a large data volume, each shard becomes too big and query performance suffers.
Too many shards is not necessarily good either: every query is fanned out to all shards and the results are then merged, so an excessive shard count also hurts query performance. The right shard count therefore has to be measured for your own workload. The official recommendation is to keep a single shard under 50 GB, which can be checked as shown below.
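For example, the current size of every shard can be listed with the standard _cat API (the host address is a placeholder):

curl -s "http://<es-host>:9200/_cat/shards?v&h=index,shard,prirep,store&s=store:desc"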

The official documentation has a section on capacity planning; the steps it gives are as follows (a sketch of step 2 follows the list):
1. Create a single-node cluster using production-grade hardware.
2. Create an index with one primary shard and no replicas, with the relevant mappings configured.
3. Load real documents into the index from step 2.
4. Run the queries you will actually use in production.
5. While testing, watch the relevant metrics, such as indexing and query performance. The point at which some metric exceeds your expectation tells you the single-shard size that fits your requirements.
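A minimal sketch of step 2 (benchmark_index and the message field are placeholders; substitute your real index name and mappings):

curl -X PUT "http://<es-host>:9200/benchmark_index" -H 'Content-Type: application/json' -d '
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 },
  "mappings": { "properties": { "message": { "type": "text" } } }
}'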

4. ES capacity model recommendations

1. [Public cloud ES best practices]
1.1 Total shards in the cluster < 30,000; a single index shard should be no larger than 50 GB; shards per node < 4,000.
1.2 Add dedicated master nodes once the cluster exceeds 20 nodes, i.e. data : master ≤ 20 : 1.
1.3 Recommended cpu/mem/disk ratios:
Search scenario: 1 : 2 : 32
Logging scenario: 1 : 4 : 192 to 1 : 4 : 384
1.4 Single-node performance reference:
Write performance: a 16c/64g node with a 32 GB JVM heap can sustain roughly 20,000 docs/s of indexing.
Storage capacity = source data * (1 + number of replicas) * 1.45 * (1 + 0.5) ≈ source data * (1 + number of replicas) * 2.2
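For example, under this formula 1 TB of source data with one replica needs roughly 1 TB * (1 + 1) * 2.2 ≈ 4.4 TB of disk.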

  2. [Community recommendations]
    2.1
    Data acceleration, query/aggregation scenarios: max disk capacity per node = node memory (GB) * 10.
    Log ingestion, offline analysis scenarios: max disk capacity per node = node memory (GB) * 50.
    Typical case: max disk capacity per node = node memory (GB) * 30.
    2.2
    Shards per data node = node memory (GB) * 30 (reference for small instances)
    Shards per data node = node memory (GB) * 50 (reference for large instances)

  3. Other metrics [recommended monitoring thresholds]
    CPU usage < 60%
    JVM memory usage < 80%
    Disk util < 60%
    Disk usage < 70%
    Every index in the cluster must have at least 1 primary + 1 replica shard
    Cluster read/write rejection rate < 0.1%
    No node in the cluster experiencing old GC
    Max data volume per node < 1 TB
    ES version >= 6.8

