A Detailed Guide to Data Transformation Components in LlamaIndex

Article link: https://blog.csdn.net/gitblog_00532/article/details/148325482


llama_index: LlamaIndex (formerly GPT Index) is a data framework for LLM applications. Project address: https://gitcode.com/gh_mirrors/ll/llama_index

What Are Transformations

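In LlamaIndex, a transformation is a module that takes a list of nodes (Node) as input and returns a list of processed nodes. Every transformation implements the TransformComponent base class and can be invoked either synchronously (__call__) or asynchronously (acall).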

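Transformations play a central role in the data processing flow: they preprocess text and extract features from it, preparing the data for index construction and query handling.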
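To make the calling convention concrete, here is a minimal sketch in plain Python. It does not use LlamaIndex: SketchNode, SketchTransform and Lowercase are hypothetical stand-ins, and the fallback from acall to __call__ is an assumption made for this illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SketchNode:
    """Hypothetical stand-in for a node; it only carries text."""
    text: str


class SketchTransform:
    """Hypothetical stand-in for the transformation interface."""

    def __call__(self, nodes, **kwargs):
        raise NotImplementedError

    async def acall(self, nodes, **kwargs):
        # Assumption for this sketch: the async path reuses the sync logic.
        return self.__call__(nodes, **kwargs)


class Lowercase(SketchTransform):
    """Example transformation: list of nodes in, list of nodes out."""

    def __call__(self, nodes, **kwargs):
        return [SketchNode(text=node.text.lower()) for node in nodes]


nodes = [SketchNode("Hello"), SketchNode("WORLD")]
sync_result = Lowercase()(nodes)                      # synchronous call
async_result = asyncio.run(Lowercase().acall(nodes))  # asynchronous call
```

Both calls return the same list here; a separate acall matters mainly for components that do I/O, such as LLM-backed extractors.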

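Built-in Transformation Types

LlamaIndex ships with several ready-to-use transformations:

Text splitters (TextSplitter): split long text into chunks small enough to process
Node parsers (NodeParser): parse documents into structured nodes
Metadata extractors (MetadataExtractor): extract key metadata from documents
Embedding models (Embeddings): convert text into vector representations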

Basic Usage Patterns

Transformations can be used on their own or chained together in a data processing pipeline.
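Chaining can be pictured as a loop that feeds each transformation's output into the next one. The sketch below is plain Python under simplifying assumptions: nodes are plain strings, and run_transformations, split_on_periods and uppercase are hypothetical helpers, not LlamaIndex functions.

```python
def run_transformations(nodes, transformations):
    """Apply each transformation in order, passing its output to the next."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


def split_on_periods(nodes):
    """Toy splitter: one output item per non-empty sentence."""
    return [
        part.strip()
        for text in nodes
        for part in text.split(".")
        if part.strip()
    ]


def uppercase(nodes):
    """Toy post-processing step."""
    return [text.upper() for text in nodes]


result = run_transformations(
    ["First sentence. Second sentence."],
    [split_on_periods, uppercase],
)
# result == ["FIRST SENTENCE", "SECOND SENTENCE"]
```

Order matters: a step that changes the number of items, such as a splitter, usually comes before steps that annotate or embed each item.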

Standalone Usage Example

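from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor

# Sample input; replace with your own documents
documents = [Document.example()]

# Initialize the components
node_parser = SentenceSplitter(chunk_size=512)  # split on sentence boundaries, up to 512 tokens per chunk
extractor = TitleExtractor()  # title extractor (requires an LLM)

# Synchronous call
nodes = node_parser(documents)

# Asynchronous call (must run inside an async function or a notebook cell)
nodes = await extractor.acall(nodes)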

Using Transformations with an Index

Transformations can be set globally or for a specific index:

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.extractors import TitleExtractor, QuestionsAnsweredExtractor
from llama_index.core.node_parser import TokenTextSplitter

# Sample input; replace with your own documents
documents = [Document.example()]

# Define the list of transformations (applied in order)
transformations = [
    TokenTextSplitter(chunk_size=512, chunk_overlap=128),  # token-based splitting
    TitleExtractor(nodes=5),  # extract a title from the first 5 nodes
    QuestionsAnsweredExtractor(questions=3),  # generate 3 questions each node can answer
]

# Global setting (the default for every index)
Settings.transformations = transformations

# Per-index setting
index = VectorStoreIndex.from_documents(
    documents,
    transformations=transformations,
)
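The chunk_size and chunk_overlap arguments are easier to reason about with a toy version of the idea. The following sketch is a simplified sliding window over a pre-tokenized list. It is not how TokenTextSplitter is implemented: the real splitter uses a tokenizer and separators, and sliding_window_chunks is a hypothetical helper.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


chunks = sliding_window_chunks(list(range(10)), chunk_size=4, chunk_overlap=1)
# chunks == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

In this simplified model, chunk_size=512 with chunk_overlap=128 means each window advances by 384 tokens, so adjacent chunks share a quarter of their content.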

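Custom Transformations

LlamaIndex lets developers write their own transformations: subclass the TransformComponent base class and implement its __call__ method.

Implementation Example: A Text Cleaner

The following custom transformation removes special characters and punctuation from node text: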

import re
from llama_index.core.schema import TransformComponent

class TextCleaner(TransformComponent):
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)  # keep only ASCII letters, digits and spaces
        return nodes
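It helps to check what this pattern actually removes before using it on real data. The snippet below applies the same regular expression to plain strings with the standard library only; the clean helper is a hypothetical name introduced for this illustration.

```python
import re

# Same pattern as in TextCleaner: everything except ASCII letters, digits and spaces
PATTERN = re.compile(r"[^0-9A-Za-z ]")


def clean(text):
    """Remove every character the pattern does not allow."""
    return PATTERN.sub("", text)


punctuation_removed = clean("Hello, world! 123")  # "Hello world 123"
newline_removed = clean("line one\nline two")     # "line oneline two"
non_ascii_removed = clean("你好 LlamaIndex")       # " LlamaIndex"
```

Note that newlines and all non-ASCII characters are dropped too, so this cleaner is only suitable for plain English text. Multilingual content would need a wider character class.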

Using the Custom Component in a Pipeline

A custom transformation plugs into an ingestion pipeline like any built-in one:

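from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TextCleaner(),  # the custom cleaner defined above
        OpenAIEmbedding(),  # embed the text (requires an OpenAI API key)
    ],
)

# Run the full pipeline
nodes = pipeline.run(documents=[Document.example()])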