Java-Based Distributed Caching Strategy: A 100,000-Scale Switch Fan Inspection System at a University in Practice

Article link: https://blog.csdn.net/m0_75076438/article/details/151176525

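1. Business Background

A "Double First-Class" university has 3 campuses, 10,000+ access switches, and 200,000+ terminals.
June through September is the peak high-temperature season each year, and machine-room overheating alerts caused by fan failures account for 27.4% of alerts.
Traditional manual inspection takes 4 people per week and covers <5% of devices.
The information center therefore launched the project
"SNMP-Based Distributed Real-Time Inspection System for Switch Fan Status".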

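2. Technical Challenges

Dimension	Value	Pain point
Device scale	10,000+	A full polling pass on a single node takes ≥ 8 h
Concurrent SNMP Get	20,000 req/s	Single-threaded blocking I/O is overwhelmed
Alert latency	<30 s	Traditional MySQL writes are the bottleneck
Dual-active data centers	Off-site disaster recovery	Cache consistency is required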

3. Overall Architecture

4. Distributed Caching Strategy Design

4.1 Selection: Redis Cluster vs Hazelcast

Metric	Redis Cluster	Hazelcast
Horizontal scaling	16384 slots	Partition
Language affinity	Java client Lettuce	Native Java
Persistence	RDB/AOF	MapStore
Operations	2 replicas	3 replicas

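Redis Cluster was chosen in the end because:
• the information center already has 6 years of Redis operations experience;
• fan status is write-heavy and read-light, and Redis persistence alone is sufficient;
• it is cross-language (the Python collection scripts can write to it as well).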
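
As background for the "16384 slots" entry in the comparison table, the sketch below shows how Redis Cluster maps a key to a slot: CRC16 (XMODEM variant) of the key, modulo 16384, with hash-tag handling. The class name `KeySlot` and the sample IP are illustrative assumptions, not part of the original system. Note that literal braces in a key act as a hash tag, so the `{ip}` placeholders used in key patterns should be substituted rather than written verbatim.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of Redis Cluster key-to-slot mapping: CRC16 (XMODEM) modulo 16384. */
final class KeySlot {
    private static final int SLOT_COUNT = 16384;

    private KeySlot() {}

    /** CRC16 XMODEM: polynomial 0x1021, initial value 0x0000, MSB first. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** Returns the slot for a key, hashing only a non-empty {hash tag} if one is present. */
    static int slot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) { // an empty "{}" does not count as a tag
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    public static void main(String[] args) {
        // Hypothetical device IP, used only for illustration.
        System.out.println(slot("switch:10.10.1.11:fan:1"));
    }
}
```

Because every fan gets its own key without a hash tag, the fans of one device may land on different slots, which spreads the write load across shards.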

4.2 Cache Model

Key design	Example	TTL
Real-time device status	`switch:{ip}:fan:{index}`	60 s
Alert deduplication	`alert:{ip}:{fanIndex}:{status}`	300 s
Statistics window	`agg:{5m|30m|1h}`	Sliding window
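
The key patterns in the table can be centralized in a small helper so that producers and consumers never drift apart. This is a minimal sketch: the class name `FanKeys`, its method names, and the sample values are assumptions for illustration, while the key layouts and TTLs come from the table.

```java
import java.time.Duration;

/** Sketch of key builders for the cache model; names are illustrative. */
final class FanKeys {
    static final Duration STATUS_TTL = Duration.ofSeconds(60);  // real-time status
    static final Duration ALERT_TTL = Duration.ofSeconds(300);  // alert deduplication

    private FanKeys() {}

    /** Builds switch:{ip}:fan:{index} with the placeholders substituted. */
    static String status(String ip, int fanIndex) {
        return "switch:" + ip + ":fan:" + fanIndex;
    }

    /** Builds alert:{ip}:{fanIndex}:{status} with the placeholders substituted. */
    static String alert(String ip, int fanIndex, String status) {
        return "alert:" + ip + ":" + fanIndex + ":" + status;
    }

    /** Builds agg:5m, agg:30m or agg:1h; any other window is rejected. */
    static String aggregate(String window) {
        if (!window.equals("5m") && !window.equals("30m") && !window.equals("1h")) {
            throw new IllegalArgumentException("unsupported window: " + window);
        }
        return "agg:" + window;
    }

    public static void main(String[] args) {
        // Hypothetical device IP and status value.
        System.out.println(status("10.10.1.11", 1));
        System.out.println(alert("10.10.1.11", 1, "FAILED"));
        System.out.println(aggregate("5m"));
    }
}
```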

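4.3 Consistency Strategy

Write-through: after writing to Redis, the Storm Bolt immediately writes to MySQL to guarantee persistence;
Read-repair: whenever a management-console query misses in Redis, it falls back to MySQL and backfills the cache;
Eventual-consistency window: <5 s (clocks are synchronized via NTP).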
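
The two paths described above can be sketched with in-memory maps standing in for Redis and MySQL. This is only an illustration of the control flow under that assumption: the class `ConsistencySketch` and its methods are hypothetical, and the real system goes through the Storm Bolt and the management console.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of write-through plus read-repair using in-memory stand-ins. */
final class ConsistencySketch {
    // "cache" plays the role of Redis, "store" the role of MySQL.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> store = new ConcurrentHashMap<>();

    /** Write-through: update the cache, then immediately persist to the store. */
    void write(String key, String value) {
        cache.put(key, value);
        store.put(key, value);
    }

    /** Read-repair: on a cache miss, fall back to the store and backfill the cache. */
    Optional<String> read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        String persisted = store.get(key);
        if (persisted != null) {
            cache.put(key, persisted);
        }
        return Optional.ofNullable(persisted);
    }

    /** Simulates TTL expiry or eviction of a cache entry. */
    void evict(String key) {
        cache.remove(key);
    }

    boolean isCached(String key) {
        return cache.containsKey(key);
    }

    public static void main(String[] args) {
        ConsistencySketch s = new ConsistencySketch();
        s.write("switch:10.10.1.11:fan:1", "OK"); // hypothetical key and value
        s.evict("switch:10.10.1.11:fan:1");
        System.out.println(s.read("switch:10.10.1.11:fan:1").orElse("<missing>"));
    }
}
```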

5. Core Code (Spring Boot 3.2 + Lettuce)

5.1 Dynamic Sharding Configuration

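# yaml
# Spring Boot 3.x reads Redis settings from spring.data.redis.* (spring.redis.* was the 2.x prefix).
spring:
  data:
    redis:
      cluster:
        nodes:
          - 10.10.1.11:6379
          - 10.10.1.12:6379
          - ...
        max-redirects: 3
      timeout: 2000ms
      lettuce:
        pool:
          max-active: 1000
          max-idle: 100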

5.2 Cache Utility Class (with an Atomic Lua Script Operation)

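import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

import org.springframework.data.redis.core.RedisCallback;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.stereotype.Component;

@Component
public class FanStatusCache {
    // Set-if-absent with expiry: KEYS[1] = key, ARGV[1] = TTL in seconds, ARGV[2] = value.
    private static final String LUA_SET_NX =
        "if redis.call('exists', KEYS[1]) == 0 then " +
        "  redis.call('setex', KEYS[1], ARGV[1], ARGV[2]); " +
        "  return 1; " +
        "else return 0; end";

    // Build the script object once and reuse it across calls.
    private static final DefaultRedisScript<Boolean> SET_NX_SCRIPT =
        new DefaultRedisScript<>(LUA_SET_NX, Boolean.class);

    private final StringRedisTemplate redis;

    public FanStatusCache(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Atomically write the key only if it does not exist yet, so concurrent writers cannot overwrite each other.
    public boolean trySet(String key, String value, long ttl) {
        return Boolean.TRUE.equals(
            redis.execute(
                SET_NX_SCRIPT,
                Collections.singletonList(key),
                String.valueOf(ttl),
                value));
    }

    // Batch write through a pipeline; ttl is in seconds.
    public void batchSet(Map<String, String> map, long ttl) {
        redis.executePipelined((RedisCallback<Object>) conn -> {
            // Encode explicitly as UTF-8 instead of relying on the platform default charset.
            map.forEach((k, v) -> conn.stringCommands().setEx(
                k.getBytes(StandardCharsets.UTF_8), ttl, v.getBytes(StandardCharsets.UTF_8)));
            return null;
        });
    }
}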

5.3 Alert Deduplication Logic

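import org.springframework.stereotype.Service;

@Service
public class AlertService {
    private final FanStatusCache cache;

    // Constructor injection, matching FanStatusCache.
    public AlertService(FanStatusCache cache) {
        this.cache = cache;
    }

    public void onStatusChange(String ip, int fanIndex, String status) {
        String key = "alert:" + ip + ":" + fanIndex + ":" + status;
        // The key lives for 300 s, so the same (ip, fan, status) alerts at most once per window.
        if (cache.trySet(key, "1", 300)) {
            // First occurrence: notify the DingTalk bot (DingTalkClient is not shown in this article).
            DingTalkClient.send(ip + " fan " + fanIndex + " status=" + status);
        }
    }
}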
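
To make the deduplication window easier to reason about without a Redis instance, here is a minimal in-memory sketch of the same set-if-absent-with-TTL semantics. The class `DedupWindow` and the explicit `nowMillis` parameter are assumptions introduced for testability; the article's implementation is the Lua script in `FanStatusCache`.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory sketch of set-if-absent with expiry, mirroring trySet's contract. */
final class DedupWindow {
    private final Map<String, Long> expiresAt = new HashMap<>();

    /**
     * Returns true only if the key is absent or already expired at nowMillis,
     * in which case the key is (re)armed for ttlSeconds.
     */
    synchronized boolean trySet(String key, long ttlSeconds, long nowMillis) {
        Long expiry = expiresAt.get(key);
        if (expiry != null && expiry > nowMillis) {
            return false; // still inside the window: suppress the duplicate
        }
        expiresAt.put(key, nowMillis + ttlSeconds * 1000);
        return true;
    }

    public static void main(String[] args) {
        DedupWindow window = new DedupWindow();
        String key = "alert:10.10.1.11:1:FAILED"; // hypothetical device and status
        System.out.println(window.trySet(key, 300, 0L));       // first alert
        System.out.println(window.trySet(key, 300, 10_000L));  // repeat after 10 s
        System.out.println(window.trySet(key, 300, 300_000L)); // after the window
    }
}
```

Because the status is part of the key, a fan that changes to a different status inside the window still raises a new alert.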

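6. Performance Stress Test

Tool: JMeter 5.6 + a custom SNMP plugin
Scenario: simulated reports from 20,000 switches × 2 fans each
Results:

Metric	Single-node MySQL	After introducing Redis
QPS	1,200	22,000
Average latency	430 ms	18 ms
Alert latency (p99)	110 s	9 s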
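
As a rough reading of these numbers, the sketch below estimates how long one full round of reports would take at each throughput. It assumes that every fan report is a single request and that the measured QPS is sustained, neither of which the test description states explicitly; the class name `SweepTime` is illustrative.

```java
import java.util.Locale;

/** Back-of-the-envelope estimate: time to ingest one full round of fan reports. */
final class SweepTime {
    private SweepTime() {}

    /** Seconds needed to process `points` reports at a sustained rate of `qps`. */
    static double seconds(long points, double qps) {
        return points / qps;
    }

    public static void main(String[] args) {
        long points = 20_000L * 2; // 20,000 switches x 2 fans, as in the test scenario
        System.out.println(String.format(Locale.ROOT, "MySQL only: %.1f s", seconds(points, 1_200)));
        System.out.println(String.format(Locale.ROOT, "With Redis: %.1f s", seconds(points, 22_000)));
    }
}
```

Under those assumptions, one round of 40,000 reports takes about 33 s at 1,200 QPS, which already exceeds the 30 s alert-latency target, and about 1.8 s at 22,000 QPS.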

7. Visual Monitoring

Grafana dashboard: Redis connection count, keyspace hit rate, MySQL sync lag
Prometheus metrics:

redis_cluster_connected_clients{cluster="fan-monitor"}  987
redis_keyspace_hits_total{cluster="fan-monitor"}        1.2e+07

Flowchart (inspection → alert):

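8. Pitfalls and Optimizations

BigKey problem
Early on, the status of all 12 fans on a device was packed into one JSON array → a single 15 KB key, which made Redis shard migration take 4 min.
Fix: split it into 12 small keys; migration time dropped to 8 s.
Clock drift
The clocks of the two IDCs differed by 200 ms, causing jitter in alert latency.
Fix: connect every node to NTP and record the server timestamp on local writes, using System.currentTimeMillis() throughout.
Lettuce reconnect storm
After a brief network outage, the client instantly created >10,000 connections.
Fix:
- enable netty.ioRatio=70;
- set maxRedirects=3 to limit retries;
- add Sentinel for failover.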
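
The BigKey fix above can be sketched as a small transformation from one device's list of fan statuses into per-fan entries. The class `FanKeySplitter`, the 1-based fan index, and the sample values are assumptions for illustration; the key layout follows the `switch:{ip}:fan:{index}` pattern from section 4.2.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of splitting one per-device value into one small key per fan. */
final class FanKeySplitter {
    private FanKeySplitter() {}

    /** Expands a device's fan statuses into per-fan entries (fan index assumed 1-based). */
    static Map<String, String> split(String ip, List<String> statuses) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < statuses.size(); i++) {
            entries.put("switch:" + ip + ":fan:" + (i + 1), statuses.get(i));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical device with three fans.
        split("10.10.1.11", List.of("OK", "OK", "FAILED"))
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The resulting map has the shape that `FanStatusCache.batchSet(map, 60)` accepts, so the per-fan entries can still be written in one pipelined round trip.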