Search Autocomplete (Part 2): Classic Applications of the Trie (Prefix Tree)

Original article · Last modified 2025-01-10 23:30:01


License: CC 4.0 BY-SA

Tags: #Trie #SearchAutocomplete #Retrieval #Search

First published 2025-01-09 01:15:56


`The Trie: The Unsung Hero of Efficient String Retrieval`

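Related articles in the search series (pinned)

1. Reprocessing Raw Information: The Inverted Index Explained
2. A Keen Eye for Words: How TF-IDF Works
3. Beyond TF-IDF: BM25 for Information Retrieval
4. Beam Search Made Simple: An Efficient Search Tool for Natural Language Processing
5. Search Autocomplete (Part 1): The Magic of Inverted Indexes and Tries
6. Search Autocomplete (Part 2): Classic Applications of the Trie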

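Introduction

In information retrieval, autocomplete, and spell checking, the trie (usually pronounced "try"; the name comes from "retrieval") is an efficient prefix-tree data structure that plays a central role. This article looks at the trie's basic concepts, how to build and use one, its application scenarios, and optimization strategies.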


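I. What Is a Trie?

A trie, also called a prefix tree or word-lookup tree, is a tree-shaped data structure specialized for storing a set of strings with fast lookup, insertion, and deletion. Every node other than the root corresponds to one character, and the path from the root to any node marked as a word end spells out a complete stored string. A word can therefore end at an internal node as well as at a leaf. The main characteristics of a trie are:

Multi-way tree structure: each node can have several children, typically one per distinct character of the alphabet.
Paths represent strings: the sequence of characters along the path from the root to any node forms a string, which is a prefix of at least one stored word.
Efficient retrieval: lookup proceeds character by character, so the time complexity is O(m), where m is the length of the query string, assuming constant-time child lookup. It does not depend on how many strings are stored.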
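
To make the "paths represent strings" idea concrete, here is a minimal sketch that stores a trie as nested dictionaries and counts its nodes. The nested-dict layout, the "$" end marker, and the helper names build and count_nodes are illustrative assumptions, not the class-based implementation used in the rest of this article.

```python
def build(words):
    """Build a trie as nested dicts; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker (an illustrative convention)
    return root


def count_nodes(node):
    """Count character nodes below `node`, ignoring end-of-word markers."""
    return sum(1 + count_nodes(child) for key, child in node.items() if key != "$")


words = ["apple", "app", "application"]
trie = build(words)
total_chars = sum(len(w) for w in words)
print(total_chars)        # 19 characters across the three words
print(count_nodes(trie))  # 12 nodes, because the prefix "appl" is stored only once
```

The three words contain 19 characters in total but need only 12 nodes, since every shared prefix is stored once.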

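II. Building a Trie

1. Initialization

First, create an empty trie by initializing a root node. The root holds no character; it serves as the starting point of every string.

2. Inserting a string

To insert a new string into the trie, proceed as follows:

Start at the root: iterate over each character of the string.
Check the children: for the current character, check whether the current node already has a matching child.
Create a new node: if there is no matching child, create one.
Move to the next node: make the newly created or already existing child the current node.
Mark the end: once all characters have been processed, set a flag on the last node to indicate that a complete word ends there.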

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True

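III. Usage

1. Searching for a string

To check whether a string exists in the trie, match it character by character:

Start at the root: iterate over each character of the string.
Check the children: for the current character, check whether the current node has a matching child.
Move to the next node: if a matching child exists, continue downward; otherwise report that the string was not found.
Check the end flag: after the loop, check whether the node reached is marked as the end of a word.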

def search(self, word: str) -> bool:
    node = self.root
    for char in word:
        if char not in node.children:
            return False
        node = node.children[char]
    return node.is_end_of_word

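2. Prefix matching

Beyond exact matching, a trie is well suited to prefix matching. In search autocomplete, for example, as the user types a few characters the system can find every complete string that starts with that prefix. The method below covers the first step: it only reports whether any stored string starts with the prefix.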

def starts_with(self, prefix: str) -> bool:
    node = self.root
    for char in prefix:
        if char not in node.children:
            return False
        node = node.children[char]
    return True
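
For autocomplete, a yes/no answer is usually not enough; the completions themselves are needed. The following is a minimal sketch of one way to collect them: walk to the node for the prefix, then traverse the subtree below it. The method name complete is a hypothetical addition, and returning the results sorted alphabetically is an assumption made here for readability (a real system would more likely rank by popularity).

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True

    def complete(self, prefix: str) -> list:
        """Hypothetical helper: return every stored word that starts with `prefix`."""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []  # no stored word has this prefix
            node = node.children[char]
        results = []
        stack = [(node, prefix)]  # depth-first traversal of the prefix's subtree
        while stack:
            current, text = stack.pop()
            if current.is_end_of_word:
                results.append(text)
            for char, child in current.children.items():
                stack.append((child, text + char))
        return sorted(results)


trie = Trie()
for word in ["apple", "app", "application", "band"]:
    trie.insert(word)

print(trie.complete("app"))  # ['app', 'apple', 'application']
print(trie.complete("b"))    # ['band']
print(trie.complete("x"))    # []
```

The cost is O(m) to reach the prefix node plus time proportional to the size of the subtree below it, so very short prefixes on a large dictionary may need a result limit.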

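3. Deleting a string

Deleting a string from a trie needs care, because naively removing nodes can break other strings that share them. A common approach is to walk back up from the word's last node and remove only the nodes that are no longer used, that is, nodes that have no children and do not themselves mark the end of another word.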

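def delete(self, word: str) -> None:
    def _delete(node, word, index):
        # Returns True if `node` is no longer needed and can be pruned by its parent.
        if index == len(word):
            if not node.is_end_of_word:
                return False  # the word is not stored; nothing to delete
            node.is_end_of_word = False
            return len(node.children) == 0
        char = word[index]
        if char not in node.children or not _delete(node.children[char], word, index + 1):
            return False
        del node.children[char]
        # Keep this node if it still has children or marks the end of another word.
        return len(node.children) == 0 and not node.is_end_of_word
    
    _delete(self.root, word, 0)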
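
The pruning rule is easy to get wrong, so here is a self-contained sketch of an iterative variant together with a small check. It records the path while walking down, then walks back up and stops at the first node that is still in use. Returning a bool from delete is an assumption made for this sketch; the recursive version above returns nothing.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end_of_word

    def delete(self, word):
        """Iterative variant: return True if the word was stored and has been removed."""
        path = []  # (parent, char) pairs along the word's path
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            path.append((node, char))
            node = node.children[char]
        if not node.is_end_of_word:
            return False
        node.is_end_of_word = False
        # Walk back up and prune nodes that are no longer used.
        for parent, char in reversed(path):
            child = parent.children[char]
            if child.children or child.is_end_of_word:
                break  # still needed by another word
            del parent.children[char]
        return True


trie = Trie()
for word in ["app", "apple"]:
    trie.insert(word)

print(trie.delete("apple"))  # True
print(trie.search("apple"))  # False
print(trie.search("app"))    # True: the shared "app" nodes were kept
print(trie.delete("appl"))   # False: "appl" was never stored as a word
```

Deleting "apple" removes only the 'l' and 'e' nodes. The second 'p' survives because it still marks the end of "app".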

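IV. An Example 🌰: Single Branch

To make the trie's behavior concrete, this section walks through building and using one, covering the four main operations: insertion, search, prefix matching, and deletion.

Example: building and using a trie

1. Initializing the trie

First, create an empty trie: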

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

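2. Inserting strings

Suppose we want to insert the following words into the trie:

"apple"
"app"
"application"

Insertion process:

Inserting "apple":
- Starting from the root, add the characters 'a', 'p', 'p', 'l', 'e' in order.
- Set is_end_of_word = True on the last node to mark a complete word.
Inserting "app":
- Starting from the root, follow the existing path 'a' -> 'p' -> 'p'.
- Set is_end_of_word = True on the last node, so "app" is now a complete word too.
Inserting "application":
- Follow the existing path 'a' -> 'p' -> 'p' -> 'l' (the 'l' node was created for "apple"), then add the characters 'i', 'c', 'a', 't', 'i', 'o', 'n' below it.
- Set is_end_of_word = True on the last node to mark "application" as a complete word.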

def insert(self, word: str) -> None:
    node = self.root
    for char in word:
        if char not in node.children:
            node.children[char] = TrieNode()
        node = node.children[char]
    node.is_end_of_word = True

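3. Searching for strings

Now check whether certain words exist in the trie:

Searching for "apple": should return True.
Searching for "app": should return True.
Searching for "appl": should return False (the path exists, but "appl" is not marked as a complete word).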

def search(self, word: str) -> bool:
    node = self.root
    for char in word:
        if char not in node.children:
            return False
        node = node.children[char]
    return node.is_end_of_word

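4. Prefix matching

We can use the trie for prefix matching, checking whether any stored word starts with a given prefix:

Prefix match "app": should return True, because "app", "apple", and "application" exist.
Prefix match "appl": should return True, because "apple" and "application" exist.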

def starts_with(self, prefix: str) -> bool:
    node = self.root
    for char in prefix:
        if char not in node.children:
            return False
        node = node.children[char]
    return True

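5. Deleting a string

To delete a word from the trie, we must make sure the other words stay intact:

Deleting "app":
- Walk the path for "app" and clear the is_end_of_word flag on the last node.
- A node may be removed only if it has no children and does not end another word. Here the last 'p' node still has the child 'l' (used by "apple" and "application"), so no node is removed.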

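def delete(self, word: str) -> None:
    def _delete(node, word, index):
        # Returns True if `node` is no longer needed and can be pruned by its parent.
        if index == len(word):
            if not node.is_end_of_word:
                return False  # the word is not stored; nothing to delete
            node.is_end_of_word = False
            return len(node.children) == 0
        char = word[index]
        if char not in node.children or not _delete(node.children[char], word, index + 1):
            return False
        del node.children[char]
        # Keep this node if it still has children or marks the end of another word.
        return len(node.children) == 0 and not node.is_end_of_word
    
    _delete(self.root, word, 0)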

Complete example code

The following is a complete Python implementation of the operations above:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True

    def search(self, word: str) -> bool:
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end_of_word

    def starts_with(self, prefix: str) -> bool:
        node = self.root
        for char in prefix:
            if char not in node.children:
                return False
            node = node.children[char]
        return True

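    def delete(self, word: str) -> None:
        def _delete(node, word, index):
            # Returns True if `node` is no longer needed and can be pruned by its parent.
            if index == len(word):
                if not node.is_end_of_word:
                    return False  # the word is not stored; nothing to delete
                node.is_end_of_word = False
                return len(node.children) == 0
            char = word[index]
            if char not in node.children or not _delete(node.children[char], word, index + 1):
                return False
            del node.children[char]
            # Keep this node if it still has children or marks the end of another word.
            return len(node.children) == 0 and not node.is_end_of_word
        
        _delete(self.root, word, 0)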

# Usage example
trie = Trie()
words = ["apple", "app", "application"]

# Insert the words
for word in words:
    trie.insert(word)

# Search for words
print(trie.search("apple"))       # Output: True
print(trie.search("app"))         # Output: True
print(trie.search("appl"))        # Output: False

# Prefix matching
print(trie.starts_with("app"))    # Output: True
print(trie.starts_with("appl"))   # Output: True

# Delete a word
trie.delete("app")
print(trie.search("app"))         # Output: False
print(trie.search("apple"))       # Output: True

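V. An Example 🌰: Multiple Branches

To make the example richer, we now insert more words with different prefixes and look at how the trie behaves when it has several branches.

Example: building and using a more complex trie

Suppose we are building a dictionary for a spell-checking application, containing the following words:

"apple"
"app"
"application"
"banana"
"band"
"bank"
"orange"
"oranges"

1. Initializing the trie

First, create an empty trie by initializing a root node. The root holds no character; it serves as the starting point of every string.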

       root

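2. Inserting words

Inserting "apple" and "app"

Starting from the root, add the characters 'a', 'p', 'p', 'l', 'e' in order and set is_end_of_word = True on the last node. Inserting "app" creates no new nodes; it only sets is_end_of_word = True on the second 'p' node.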

       root
        |
        a
        |
        p
        |
        p (is_end_of_word=True)
        |
        l
        |
        e (is_end_of_word=True)

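Inserting "application"

The word shares the prefix "appl" with "apple", so the insertion follows the existing path down to the 'l' node and then adds the characters 'i', 'c', 'a', 't', 'i', 'o', 'n' as a new branch beside 'e'. Set is_end_of_word = True on the last node.

       root
        |
        a
        |
        p
        |
        p (is_end_of_word=True)
        |
        l
       / \
      i   e (is_end_of_word=True)
      |
      c
      |
      a
      |
      t
      |
      i
      |
      o
      |
      n (is_end_of_word=True)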

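Inserting "banana", "band", and "bank"

Starting from the root, insert the words that begin with 'b'. They share the prefix "ban" and split into three branches below the 'n' node.

root
├── a
│   └── p
│       └── p (is_end_of_word=True)
│           └── l
│               ├── e (is_end_of_word=True)
│               └── i
│                   └── c
│                       └── a
│                           └── t
│                               └── i
│                                   └── o
│                                       └── n (is_end_of_word=True)
└── b
    └── a
        └── n
            ├── a
            │   └── n
            │       └── a (is_end_of_word=True)
            ├── d (is_end_of_word=True)
            └── k (is_end_of_word=True)

Inserting "banana":
- Follow the path 'b' -> 'a' -> 'n' -> 'a' -> 'n' -> 'a'.
- Set is_end_of_word = True on the last node.
Inserting "band":
- Follow the path 'b' -> 'a' -> 'n' -> 'd'.
- Set is_end_of_word = True on the last node.
Inserting "bank":
- Follow the path 'b' -> 'a' -> 'n' -> 'k'.
- Set is_end_of_word = True on the last node.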

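Inserting "orange" and "oranges"

Starting from the root, insert the words that begin with 'o'. "oranges" extends "orange" by a single 's' node.

root
├── a
│   └── p
│       └── p (is_end_of_word=True)
│           └── l
│               ├── e (is_end_of_word=True)
│               └── i
│                   └── c
│                       └── a
│                           └── t
│                               └── i
│                                   └── o
│                                       └── n (is_end_of_word=True)
├── b
│   └── a
│       └── n
│           ├── a
│           │   └── n
│           │       └── a (is_end_of_word=True)
│           ├── d (is_end_of_word=True)
│           └── k (is_end_of_word=True)
└── o
    └── r
        └── a
            └── n
                └── g
                    └── e (is_end_of_word=True)
                        └── s (is_end_of_word=True)

Inserting "orange":
- Follow the path 'o' -> 'r' -> 'a' -> 'n' -> 'g' -> 'e'.
- Set is_end_of_word = True on the last node.
Inserting "oranges":
- Follow the path 'o' -> 'r' -> 'a' -> 'n' -> 'g' -> 'e' -> 's'.
- Set is_end_of_word = True on the last node.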
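
Hand-drawn diagrams of a multi-branch trie are error-prone, so it can help to let the code print the structure. Below is a minimal sketch; the render helper and its indentation-based output format are assumptions made for illustration, and insert is written as a plain function to keep the block short.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


def insert(root, word):
    node = root
    for char in word:
        if char not in node.children:
            node.children[char] = TrieNode()
        node = node.children[char]
    node.is_end_of_word = True


def render(node, indent=0):
    """Hypothetical helper: one line per node, children in alphabetical order."""
    lines = []
    for char in sorted(node.children):
        child = node.children[char]
        mark = " (is_end_of_word=True)" if child.is_end_of_word else ""
        lines.append(" " * indent + char + mark)
        lines.extend(render(child, indent + 2))
    return lines


root = TrieNode()
for word in ["apple", "app", "application", "banana",
             "band", "bank", "orange", "oranges"]:
    insert(root, word)

lines = render(root)
print("\n".join(lines))
print(len(lines))  # 27 nodes below the root
```

The eight words contain 47 characters in total but occupy 27 nodes, and exactly eight of the printed lines carry the end-of-word mark.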

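3. Searching for words

Now check whether certain words exist in the trie:

Searching for "apple": returns True
Searching for "app": returns True
Searching for "appl": returns False
Searching for "banana": returns True
Searching for "band": returns True
Searching for "bank": returns True
Searching for "orange": returns True
Searching for "oranges": returns True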
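
Since this dictionary is meant for spell checking, exact search is the operation that does the work: a word counts as misspelled when search fails. Here is a minimal sketch of that idea using a nested-dict trie. The END marker, the helper names, and the whitespace-based tokenization are all simplifying assumptions for illustration.

```python
END = "$"  # end-of-word marker key (an illustrative convention)


def build(words):
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[END] = True
    return root


def contains(root, word):
    node = root
    for char in word:
        if char not in node:
            return False
        node = node[char]
    return END in node  # the path must end on a complete word


dictionary = build(["apple", "app", "application", "banana",
                    "band", "bank", "orange", "oranges"])


def misspelled(text):
    """Return the words of `text` that are not in the dictionary."""
    return [w for w in text.lower().split() if not contains(dictionary, w)]


print(misspelled("Apple bannana orange appl"))  # ['bannana', 'appl']
```

"appl" is flagged even though its whole path exists in the trie, because the 'l' node does not mark the end of a word.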

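4. Prefix matching

We can use the trie for prefix matching, checking whether any stored word starts with a given prefix:

Prefix match "app": returns True ("app", "apple", and "application" exist)
Prefix match "ban": returns True ("banana", "band", and "bank" exist)
Prefix match "ora": returns True ("orange" and "oranges" exist)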
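
If the application also needs to know how many words share a prefix, one possible extension is to keep a counter on every node. The sketch below is such a variant; CountingNode, CountingTrie, pass_count, and count_prefix are hypothetical names, and the counts are only correct under the assumption that each word is inserted at most once.

```python
class CountingNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False
        self.pass_count = 0  # number of stored words whose path reaches this node


class CountingTrie:
    def __init__(self):
        self.root = CountingNode()

    def insert(self, word):
        # Assumes each word is inserted at most once; duplicates would be double-counted.
        node = self.root
        node.pass_count += 1  # every word passes through the root
        for char in word:
            if char not in node.children:
                node.children[char] = CountingNode()
            node = node.children[char]
            node.pass_count += 1
        node.is_end_of_word = True

    def count_prefix(self, prefix):
        """Return how many stored words start with `prefix`."""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return 0
            node = node.children[char]
        return node.pass_count


trie = CountingTrie()
for word in ["apple", "app", "application", "banana",
             "band", "bank", "orange", "oranges"]:
    trie.insert(word)

print(trie.count_prefix("app"))  # 3
print(trie.count_prefix("ban"))  # 3
print(trie.count_prefix("ora"))  # 2
print(trie.count_prefix("x"))    # 0
```

The count is read in O(m) time without traversing the subtree, at the cost of one extra integer per node.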

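5. Deleting a word

To delete a word from the trie, we must make sure the other words stay intact:

Deleting "app":
- Clear the is_end_of_word flag on the second 'p' node. Other words ("apple" and "application") still pass through that node, so the node itself must not be removed.

After the deletion, only the flag on the second 'p' has changed; the 'b' and 'o' subtrees are untouched:

root
├── a
│   └── p
│       └── p
│           └── l
│               ├── e (is_end_of_word=True)
│               └── i
│                   └── c
│                       └── a
│                           └── t
│                               └── i
│                                   └── o
│                                       └── n (is_end_of_word=True)
├── b (subtree unchanged)
└── o (subtree unchanged)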
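
One way to check which deletions actually free nodes is to count the nodes before and after. The following sketch does this with standalone functions; the function names and the choice to compare node counts are assumptions made for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


def insert(root, word):
    node = root
    for char in word:
        if char not in node.children:
            node.children[char] = TrieNode()
        node = node.children[char]
    node.is_end_of_word = True


def delete(node, word, index=0):
    """Return True if `node` became unused and may be pruned by its parent."""
    if index == len(word):
        if not node.is_end_of_word:
            return False  # the word is not stored
        node.is_end_of_word = False
    else:
        char = word[index]
        if char not in node.children or not delete(node.children[char], word, index + 1):
            return False
        del node.children[char]
    return not node.children and not node.is_end_of_word


def count_nodes(node):
    return sum(1 + count_nodes(child) for child in node.children.values())


root = TrieNode()
for word in ["apple", "app", "application", "banana",
             "band", "bank", "orange", "oranges"]:
    insert(root, word)

before = count_nodes(root)
delete(root, "app")
after_app = count_nodes(root)
delete(root, "banana")
after_banana = count_nodes(root)

print(before)        # 27
print(after_app)     # 27: only the flag on the second 'p' was cleared
print(after_banana)  # 24: the 'a' -> 'n' -> 'a' tail below "ban" was pruned
```

Deleting "app" frees nothing because longer words run through its last node, while deleting "banana" frees the three nodes that no other word uses and stops at the shared "ban" prefix.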
