元数据提取

介绍

在许多情况下，尤其是长文档中，一个片段 chunk 文本可能缺少必要的上下文，将该片段 chunk 文本与其他类似的文本片段 chunk 区分开来。

为了解决这个问题，我们使用 LLM 来提取与文档相关的某些上下文信息，以便更好地提升检索效果，帮助语言模型区分哪些看起来相似的段落 chunk。

我们在示例笔记本中展示了这一点，并证明了其在处理长文档方面的效果。

用法

首先，我们定义一个元数据提取器，这个提取器接收一系列的小的特征元数据提取器，通过这些特征处理器顺序处理传入的节点。

然后我们将其提供给节点解析器，它将按照顺序地向每个节点添加额外的元数据。

Python

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
)

from llama_index.extractors.entity import EntityExtractor

# 第一个变换是句子拆分
# 从第二个开始都是元数据提取器
transformations = [
    SentenceSplitter(),
    TitleExtractor(nodes=5),
    QuestionsAnsweredExtractor(questions=3),
    SummaryExtractor(summaries=["prev", "self"]),
    KeywordExtractor(keywords=10),
    EntityExtractor(prediction_threshold=0.5),
]

然后，我们可以在输入文档或节点上运行元数据提取操作：

Python

from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=transformations)

nodes = pipeline.run(documents=documents)

以下是提取的元数据的示例：

Python

{'page_label': '2',
 'file_name': '10k-132.pdf',
 'document_title': 'Uber Technologies, Inc. 2019 Annual Report: Revolutionizing Mobility and Logistics Across 69 Countries and 111 Million MAPCs with $65 Billion in Gross Bookings',
 'questions_this_excerpt_can_answer': '\n\n1. How many countries does Uber Technologies, Inc. operate in?\n2. What is the total number of MAPCs served by Uber Technologies, Inc.?\n3. How much gross bookings did Uber Technologies, Inc. generate in 2019?',
 'prev_section_summary': "\n\nThe 2019 Annual Report provides an overview of the key topics and entities that have been important to the organization over the past year. These include financial performance, operational highlights, customer satisfaction, employee engagement, and sustainability initiatives. It also provides an overview of the organization's strategic objectives and goals for the upcoming year.",
 'section_summary': '\nThis section discusses a global tech platform that serves multiple multi-trillion dollar markets with products leveraging core technology and infrastructure. It enables consumers and drivers to tap a button and get a ride or work. The platform has revolutionized personal mobility with ridesharing and is now leveraging its platform to redefine the massive meal delivery and logistics industries. The foundation of the platform is its massive network, leading technology, operational excellence, and product expertise.',
 'excerpt_keywords': '\nRidesharing, Mobility, Meal Delivery, Logistics, Network, Technology, Operational Excellence, Product Expertise, Point A, Point B'}

自定义提取器

如果提供的提取器不能满足您的需求，您还可以定义自定义提取器，如下所示：

Python

from llama_index.core.extractors import BaseExtractor


class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes) -> List[Dict]:
        metadata_list = [
            {
                "custom": node.metadata["document_title"]
                + "\n"
                + node.metadata["excerpt_keywords"]
            }
            for node in nodes
        ]
        return metadata_list

extractor.extract() 将自动 aextract() 在后台调用，以提供同步和异步入口点。

在更高级的示例中，它还可以利用llm从节点内容和现有元数据中提取特征。有关更多详细信息，请参阅提供的元数据提取器的源代码。

模块

下面您将找到各种元数据提取器的指南和教程。

SEC 文件元数据提取
LLM 调查提取
实体提取
Marvin 元数据提取
Pydantic 元数据提取