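Viterbi Algorithm (Named Entity Recognition)
The Viterbi algorithm is a dynamic programming algorithm for finding the most probable hidden state sequence in a hidden Markov model (HMM). It is widely used in sequence labeling tasks such as part-of-speech tagging and named entity recognition.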
1. What Is the Viterbi Algorithm
Suppose we have:
- An observation sequence $\mathbf{O} = (o_1, o_2, \dots, o_T)$
- A state set $\mathbf{S} = \{s_1, s_2, \dots, s_N\}$
- Transition probabilities $P_{ij} = P(s_j \mid s_i)$: the probability of moving from state $s_i$ to state $s_j$
- Emission probabilities $O_j(o_t) = P(o_t \mid s_j)$: the probability of generating observation $o_t$ in state $s_j$
- Initial state probabilities $\pi_j = P(s_j \text{ at } t=1)$
Goal:
Given the observation sequence $o_1, o_2, \dots, o_T$, find the most probable hidden state sequence $s_1, s_2, \dots, s_T$ (one state per time step).
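Written as a formula, the goal can be sketched as follows. This assumes the standard first-order HMM factorization, which the text implies but does not spell out; here $s_t$ denotes the state chosen at time $t$, and the parameters are indexed by that state:

$$
\hat{s}_{1:T} = \arg\max_{s_1, \dots, s_T} P(s_1, \dots, s_T, o_1, \dots, o_T) = \arg\max_{s_1, \dots, s_T} \; \pi_{s_1} \, O_{s_1}(o_1) \prod_{t=2}^{T} P_{s_{t-1} s_t} \, O_{s_t}(o_t)
$$

Maximizing the joint probability gives the same sequence as maximizing the conditional probability of the states given the observations, because the probability of the observations does not depend on the choice of states.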
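2. BIO Sequence Labeling
BIO is a commonly used sequence labeling scheme. The tags mean:
- B-: the beginning of an entity (Begin)
- I-: the inside of an entity (Inside)
- O: not part of any entity (Outside)
For example, in the sentence "我 爱 北京 天安门" ("I love Beijing Tiananmen"), the entity is the location "北京天安门", so the sentence is labeled as: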
我 O
爱 O
北京 B-LOC
天安门 I-LOC
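To make the scheme concrete, here is a minimal sketch of how a BIO tag sequence can be turned back into entities. The function name `extract_entities` is hypothetical, and the choice to silently drop an `I-` tag that has no matching open entity is an assumption of this sketch, not something the text specifies.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; flush any open one first.
            if current_tokens:
                entities.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity: close the open one.
            if current_tokens:
                entities.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append(("".join(current_tokens), current_type))
    return entities


tokens = ["我", "爱", "北京", "天安门"]
tags = ["O", "O", "B-LOC", "I-LOC"]
print(extract_entities(tokens, tags))  # [('北京天安门', 'LOC')]
```

Tokens are joined without a separator because the example is Chinese; for space-delimited languages a `" ".join(...)` would be the more natural choice.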
3. Implementation Steps of the Viterbi Algorithm
Definitions:
- $V_t(j)$: the probability of the most probable path that accounts for the first $t$ observations and ends in state $s_j$
- $\text{path}_t(j)$: the state sequence that achieves $V_t(j)$
Initialization ($t = 1$):
$$V_1(j) = \pi_j \cdot O_j(o_1)$$
$$\text{path}_1(j) = [j]$$
Recursion ($t = 2$ to $T$):
$$V_t(j) = \max_i \left( V_{t-1}(i) \cdot P_{ij} \cdot O_j(o_t) \right)$$
$$\text{path}_t(j) = \text{path}_{t-1}(i^*) + [j], \quad i^* = \arg\max_i \left( V_{t-1}(i) \cdot P_{ij} \right)$$
The emission term $O_j(o_t)$ does not depend on $i$, so it can be left out of the $\arg\max$.
Termination:
$$j^* = \arg\max_j V_T(j), \quad \text{final path} = \text{path}_T(j^*)$$
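The three steps above can be sketched in Python as follows. The function name `viterbi` and the dict-based parameter layout (`start_p[s]`, `trans_p[prev][s]`, `emit_p[s][obs]`) are assumptions made for illustration. Instead of copying a full path per state as in the definition of $\text{path}_t(j)$, this sketch stores one backpointer per cell and rebuilds the path at the end, which yields the same result with less copying.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_probability, best_state_sequence) for an HMM.

    start_p[s], trans_p[prev][s], and emit_p[s][obs] are plain dicts.
    """
    # Initialization: V_1(j) = pi_j * O_j(o_1)
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    backpointer = [{}]

    # Recursion: V_t(j) = max_i V_{t-1}(i) * P_ij * O_j(o_t)
    for t in range(1, len(observations)):
        V.append({})
        backpointer.append({})
        for s in states:
            # The emission term is constant in i, so it is applied after the argmax.
            best_prev = max(states, key=lambda i: V[t - 1][i] * trans_p[i][s])
            V[t][s] = (
                V[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][observations[t]]
            )
            backpointer[t][s] = best_prev

    # Termination: pick the best final state, then follow the backpointers.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return V[-1][last], path
```

Ties are broken in favor of the state listed first in `states`, because `max` returns the first maximal element. The sketch expects at least one observation.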
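4. A Concrete Example
Suppose we are given the following:
- State set (BIO tags): B, I, O
- Observation sequence (length 3): ["李白", "是", "诗人"] ("Li Bai", "is", "poet")
- Transition probabilities $P_{ij}$ (from row to column):
| From \ To | B | I | O |
|---|---|---|---|
| B | 0.1 | 0.6 | 0.3 |
| I | 0.0 | 0.7 | 0.3 |
| O | 0.5 | 0.2 | 0.3 |
- Emission probabilities $O_j(o_t)$ (one row per token, one column per state):
| Token | B | I | O |
|---|---|---|---|
| 李白 | 0.8 | 0.1 | 0.1 |
| 是 | 0.1 | 0.1 | 0.8 |
| 诗人 | 0.4 | 0.5 | 0.1 |
- Initial probabilities $\pi = [0.5, 0.0, 0.5]$ (in the order B, I, O)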
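For readers who want to follow along in code, the example parameters can be written down as plain Python dicts. The variable names (`start_p`, `trans_p`, `emit_by_token`, `emit_p`) are assumptions chosen for this sketch; the numbers are copied from the tables above.

```python
states = ["B", "I", "O"]
observations = ["李白", "是", "诗人"]

# Initial probabilities, in the order B, I, O.
start_p = {"B": 0.5, "I": 0.0, "O": 0.5}

# trans_p[prev][next]: one entry per row of the transition table.
trans_p = {
    "B": {"B": 0.1, "I": 0.6, "O": 0.3},
    "I": {"B": 0.0, "I": 0.7, "O": 0.3},
    "O": {"B": 0.5, "I": 0.2, "O": 0.3},
}

# The emission table is listed per token, so it is transposed into emit_p[state][token].
emit_by_token = {
    "李白": {"B": 0.8, "I": 0.1, "O": 0.1},
    "是": {"B": 0.1, "I": 0.1, "O": 0.8},
    "诗人": {"B": 0.4, "I": 0.5, "O": 0.1},
}
emit_p = {s: {tok: row[s] for tok, row in emit_by_token.items()} for s in states}

# Sanity checks: the initial vector and every transition row sum to 1.
assert abs(sum(start_p.values()) - 1.0) < 1e-9
for prev, row in trans_p.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, prev
```

Note that the emission values are used exactly as given. Each token's row sums to 1 across the three tags, whereas a normalized emission distribution would sum to 1 across tokens for each tag, so these numbers are best read as illustrative scores for the worked example.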
5. Computing the BIO Sequence with the Viterbi Algorithm
We fill in the dynamic programming table $V_t(j)$ (the maximum probability for each state at each step) and record the paths.
Step 1: t = 1, observation "李白"
$$
\begin{aligned}
V_1(B) &= 0.5 \cdot 0.8 = 0.4 \\
V_1(I) &= 0.0 \cdot 0.1 = 0 \\
V_1(O) &= 0.5 \cdot 0.1 = 0.05
\end{aligned}
$$
Step 2: t = 2, observation "是"
$$V_2(B) = \max(0.4 \cdot 0.1,\; 0 \cdot 0.0,\; 0.05 \cdot 0.5) \cdot 0.1 = \max(0.04,\; 0,\; 0.025) \cdot 0.1 = 0.004$$
$$V_2(I) = \max(0.4 \cdot 0.6,\; 0 \cdot 0.7,\; 0.05 \cdot 0.2) \cdot 0.1 = \max(0.24,\; 0,\; 0.01) \cdot 0.1 = 0.024$$
$$V_2(O) = \max(0.4 \cdot 0.3,\; 0 \cdot 0.3,\; 0.05 \cdot 0.3) \cdot 0.8 = \max(0.12,\; 0,\; 0.015) \cdot 0.8 = 0.096$$
Step 3: t = 3, observation "诗人"
$$V_3(B) = \max(0.004 \cdot 0.1,\; 0.024 \cdot 0.0,\; 0.096 \cdot 0.5) \cdot 0.4 = \max(0.0004,\; 0,\; 0.048) \cdot 0.4 = 0.0192$$
$$V_3(I) = \max(0.004 \cdot 0.6,\; 0.024 \cdot 0.7,\; 0.096 \cdot 0.2) \cdot 0.5 = \max(0.0024,\; 0.0168,\; 0.0192) \cdot 0.5 = 0.0096$$
$$V_3(O) = \max(0.004 \cdot 0.3,\; 0.024 \cdot 0.3,\; 0.096 \cdot 0.3) \cdot 0.1 = \max(0.0012,\; 0.0072,\; 0.0288) \cdot 0.1 = 0.00288$$
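The hand calculation can be checked with a few lines of Python. This is a minimal sketch that only fills the table $V_t(j)$ and does not track paths; the names `pi`, `P`, `E`, and `V` are assumptions mirroring the symbols in the text, and the numbers come from the example tables.

```python
states = ["B", "I", "O"]
tokens = ["李白", "是", "诗人"]
pi = {"B": 0.5, "I": 0.0, "O": 0.5}
# P[i][j]: probability of moving from state i to state j.
P = {
    "B": {"B": 0.1, "I": 0.6, "O": 0.3},
    "I": {"B": 0.0, "I": 0.7, "O": 0.3},
    "O": {"B": 0.5, "I": 0.2, "O": 0.3},
}
# E[token][j]: emission value of the token in state j.
E = {
    "李白": {"B": 0.8, "I": 0.1, "O": 0.1},
    "是": {"B": 0.1, "I": 0.1, "O": 0.8},
    "诗人": {"B": 0.4, "I": 0.5, "O": 0.1},
}

# Step 1: V_1(j) = pi_j * O_j(o_1)
V = [{j: pi[j] * E[tokens[0]][j] for j in states}]

# Steps 2..T: V_t(j) = max_i(V_{t-1}(i) * P_ij) * O_j(o_t)
for t in range(1, len(tokens)):
    V.append({
        j: max(V[t - 1][i] * P[i][j] for i in states) * E[tokens[t]][j]
        for j in states
    })

# Round only for display, to hide floating-point noise.
for token, row in zip(tokens, V):
    print(token, {j: round(p, 5) for j, p in row.items()})
# 李白 {'B': 0.4, 'I': 0.0, 'O': 0.05}
# 是 {'B': 0.004, 'I': 0.024, 'O': 0.096}
# 诗人 {'B': 0.0192, 'I': 0.0096, 'O': 0.00288}
```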
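6. Final Result
$$\max(V_3) = \max(0.0192,\; 0.0096,\; 0.00288) = 0.0192$$
The best final state is B. Tracing back through the maximizing predecessors gives the path:
- t=1: B ($V_1(B) = 0.4$)
- t=2: O ($V_2(O) = 0.096$, reached from B at t=1 via $0.4 \cdot 0.3 \cdot 0.8$)
- t=3: B ($V_3(B) = 0.0192$, reached from O at t=2 via $0.096 \cdot 0.5 \cdot 0.4$)
So the optimal path is:
[B, O, B]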
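Because the example is tiny, the result can also be cross-checked by brute force: score all $3^3 = 27$ tag sequences and keep the best one. This is only a verification sketch, not part of the Viterbi algorithm; the names `pi`, `P`, `E`, and `path_probability` are assumptions for illustration.

```python
from itertools import product

states = ["B", "I", "O"]
tokens = ["李白", "是", "诗人"]
pi = {"B": 0.5, "I": 0.0, "O": 0.5}
P = {
    "B": {"B": 0.1, "I": 0.6, "O": 0.3},
    "I": {"B": 0.0, "I": 0.7, "O": 0.3},
    "O": {"B": 0.5, "I": 0.2, "O": 0.3},
}
E = {
    "李白": {"B": 0.8, "I": 0.1, "O": 0.1},
    "是": {"B": 0.1, "I": 0.1, "O": 0.8},
    "诗人": {"B": 0.4, "I": 0.5, "O": 0.1},
}


def path_probability(path):
    """Joint probability of one full tag sequence and the observed tokens."""
    p = pi[path[0]] * E[tokens[0]][path[0]]
    for t in range(1, len(tokens)):
        p *= P[path[t - 1]][path[t]] * E[tokens[t]][path[t]]
    return p


# Enumerate every possible tag sequence and keep the most probable one.
best_path = max(product(states, repeat=len(tokens)), key=path_probability)
print(best_path, round(path_probability(best_path), 5))  # ('B', 'O', 'B') 0.0192
```

The exhaustive search agrees with the table, but its cost grows as $N^T$, which is exactly what the dynamic programming formulation avoids.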
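7. Summary
- The Viterbi algorithm uses dynamic programming to find the optimal state path in $O(T \cdot N^2)$ time, instead of enumerating all $N^T$ possible paths
- BIO tagging is a sequence labeling scheme used in entity recognition
- Using the **transition probabilities $P_{ij}$ and emission probabilities $O_j(o_t)$**, the table can be filled in step by step
- The final output is the state sequence with the highest probability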