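Introduction to LlamaIndex #

What Is RAG? #

Before looking at LlamaIndex, we need to understand RAG (Retrieval-Augmented Generation). RAG combines information retrieval with text generation: the large language model draws on an external knowledge base, so its answers are more accurate and better grounded.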

text

┌─────────────────────────────────────────────────────────────┐
│                        How RAG Works                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   User question: "What were the company's sales last year?" │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  1. Retrieval                                       │   │
│   │     Search the knowledge base for relevant docs     │   │
│   │     Found: annual_report.pdf, financial_data.xlsx   │   │
│   └─────────────────────────────────────────────────────┘   │
│                              │                              │
│                              ▼                              │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  2. Augmentation                                    │   │
│   │     Use the retrieved content as context            │   │
│   │     Build the prompt: question + relevant docs      │   │
│   └─────────────────────────────────────────────────────┘   │
│                              │                              │
│                              ▼                              │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  3. Generation                                      │   │
│   │     The LLM answers based on the context            │   │
│   │     "According to the annual report, last year's    │   │
│   │      sales were 120 million yuan"                   │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
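
The three stages in the diagram can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the in-memory `KNOWLEDGE_BASE`, the word-overlap scoring, and `fake_llm` are hypothetical stand-ins, not LlamaIndex APIs, and a real system would use embeddings and an actual model call.

```python
import re

# Hypothetical in-memory knowledge base: file name -> text
KNOWLEDGE_BASE = {
    "annual_report.pdf": "Sales last year reached 120 million yuan.",
    "handbook.md": "Employees receive fifteen days of paid leave.",
}

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, top_k=1):
    """Stage 1: rank documents by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: len(q & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, hits):
    """Stage 2: put the retrieved text in front of the question."""
    context = "\n".join(f"[{name}] {text}" for name, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def generate(prompt, llm):
    """Stage 3: hand the augmented prompt to the model."""
    return llm(prompt)

def fake_llm(prompt):
    # Stand-in for a real model call; it only reports the prompt size
    return f"(answer based on {len(prompt)} prompt characters)"

question = "What were the company's sales last year?"
hits = retrieve(question)
prompt = build_prompt(question, hits)
print(prompt)
print(generate(prompt, fake_llm))
```

The model never sees the whole knowledge base, only the passages that retrieval selected, which is why retrieval quality largely determines answer quality.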

Why Do We Need RAG? #

text

┌─────────────────────────────────────────────────────────────┐
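│                     Limitations of LLMs                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ❌ Knowledge cutoff                                        │
│     No knowledge of events after the training cutoff        │
│                                                             │
│  ❌ No private data                                         │
│     No access to internal documents, databases, etc.        │
│                                                             │
│  ❌ Hallucination                                           │
│     May produce plausible-sounding but wrong information    │
│                                                             │
│  ❌ No citations                                            │
│     Cannot name its sources, so answers are hard to verify  │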
│                                                             │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
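│                   How RAG Addresses Them                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ✅ Up-to-date data                                         │
│     Retrieves the latest documents and data                 │
│                                                             │
│  ✅ Private knowledge                                       │
│     Connects to internal knowledge bases and databases      │
│                                                             │
│  ✅ Fewer hallucinations                                    │
│     Answers are grounded in real, checkable documents       │
│                                                             │
│  ✅ Traceable                                               │
│     Cites its sources so answers can be verified            │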
│                                                             │
└─────────────────────────────────────────────────────────────┘

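What Is LlamaIndex? #

LlamaIndex is a data framework for building LLM applications. It lets developers connect private data to large language models with little effort. Whether you are building a document question-answering system, a chatbot, or an intelligent agent, LlamaIndex provides a complete toolchain.

Core Positioning #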

text

┌─────────────────────────────────────────────────────────────┐
│                         LlamaIndex                          │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Data connectors │ │ Index building  │ │ Query engines   │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Agents          │ │ Evaluation      │ │ Observability   │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘

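History of LlamaIndex #

Development Timeline #

text

2022 ───── LlamaIndex project starts
    │
    │      Founded by Jerry Liu
    │      Focused on connecting data to LLMs
    │      Released as open source
    │
2023 ───── Rapid growth
    │
    │      Fast-growing community
    │      Support for more data sources
    │      Agent capabilities introduced
    │
2024 ───── Maturity
    │
    │      LlamaCloud enterprise offering
    │      LlamaParse document parsing
    │      More advanced features
    │
Today ──── Ongoing innovation
    │
    │      35k+ GitHub stars
    │      Active community
    │      Enterprise adoption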

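Version History #

Version	Date	Key features
0.1	2022-11	Basic indexing and querying
0.5	2023-03	More data connectors
0.7	2023-06	Agent framework introduced
0.8	2023-08	Core architecture refactored
0.9	2023-10	LlamaParse preview
0.10	2024-02	Modular architecture (separate core and integration packages)
0.11	2024-08	Improved multimodal support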

Core Features of LlamaIndex #

1. Rich Data Connectors #

Many data sources are supported out of the box:

python

from llama_index.core import SimpleDirectoryReader

# Load every supported file (PDF, Markdown, plain text, ...) under ./data
documents = SimpleDirectoryReader("./data").load_data()

# Load web pages (needs the llama-index-readers-web package)
from llama_index.readers.web import SimpleWebPageReader
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://example.com/doc"]
)

# Load rows from a SQL database (needs the llama-index-readers-database package)
from llama_index.readers.database import DatabaseReader
reader = DatabaseReader(uri="sqlite:///mydb.db")  # a connection string is passed as `uri`
documents = reader.load_data(query="SELECT * FROM articles")

2. Flexible Index Types #

Choose the index that fits your scenario:

text

┌─────────────────────────────────────────────────────────────┐
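│                   LlamaIndex Index Types                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  VectorStoreIndex (vector index)                            │
│  ├── The most commonly used index type                      │
│  ├── Semantic similarity search                             │
│  └── Best for: Q&A, semantic search                         │
│                                                             │
│  SummaryIndex (summary index)                               │
│  ├── Iterates over all nodes                                │
│  ├── Produces a combined summary                            │
│  └── Best for: summarization, broad analysis                │
│                                                             │
│  TreeIndex (tree index)                                     │
│  ├── Hierarchical structure                                 │
│  ├── Queries run from root to leaves                        │
│  └── Best for: large document collections                   │
│                                                             │
│  KeywordTableIndex (keyword index)                          │
│  ├── Keyword extraction                                     │
│  ├── Exact matching                                         │
│  └── Best for: keyword search                               │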
│                                                             │
└─────────────────────────────────────────────────────────────┘
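
To make the difference between index types more concrete, here is a rough plain-Python sketch contrasting a keyword-table lookup with a summary-style lookup. The `nodes` dictionary, the `keywords` heuristic, and both lookup functions are hypothetical illustrations and do not reflect how LlamaIndex implements these indexes internally.

```python
import re
from collections import defaultdict

nodes = {
    "n1": "LlamaIndex connects private data to language models.",
    "n2": "A vector index ranks nodes by embedding similarity.",
    "n3": "A keyword table maps keywords to the nodes containing them.",
}

def keywords(text):
    # Crude keyword extraction: lowercase words longer than three letters
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

# Keyword-table style: an inverted index from keyword to node ids
keyword_table = defaultdict(set)
for node_id, text in nodes.items():
    for word in keywords(text):
        keyword_table[word].add(node_id)

def keyword_lookup(query):
    """Return ids of nodes sharing at least one keyword with the query."""
    matched = set()
    for word in keywords(query):
        matched |= keyword_table.get(word, set())
    return sorted(matched)

def summary_lookup(query):
    """Summary-index style: every node is handed to the synthesizer."""
    return sorted(nodes)

print(keyword_lookup("embedding similarity"))
print(summary_lookup("Summarize everything"))
```

The keyword lookup touches only nodes with an exact term match, while the summary-style lookup always reads everything, which is thorough but grows more expensive with the size of the collection.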

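3. Powerful Query Engines #

python

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents, then embed and index them
# (uses the default embedding model and LLM unless configured otherwise)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# A query engine retrieves relevant nodes and has the LLM write the answer
query_engine = index.as_query_engine()

response = query_engine.query("What is LlamaIndex?")
print(response)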

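4. Agent Support #

python

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

# Any LlamaIndex LLM works here; the OpenAI model name is only an example
llm = OpenAI(model="gpt-4o-mini")

# Expose the query engine from the previous section as a tool the agent can call
query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="knowledge_base",
    description="Query the knowledge base",
)

# The agent API has changed between releases; check the documentation
# for your installed version if from_tools is unavailable
agent = ReActAgent.from_tools([query_engine_tool], llm=llm)
response = agent.chat("Find material about Python and summarize it")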
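
For intuition about what a ReAct-style agent does with its tools, the loop below is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: `scripted_llm` replaces a real model with fixed replies, the `Action: ... | Input: ...` text format is invented here, and the `knowledge_base` function stands in for a query-engine tool.

```python
def knowledge_base(query):
    # Hypothetical tool: a real one would call query_engine.query(query)
    return "Python is a general-purpose programming language."

TOOLS = {"knowledge_base": knowledge_base}

def scripted_llm(transcript):
    """Stand-in for the LLM: act once, then answer from the observation."""
    if "Observation:" not in transcript:
        return "Action: knowledge_base | Input: Python"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Final Answer: {observation}"

def react_loop(question, llm, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        if step.startswith("Final Answer:"):
            return step[len("Final Answer:"):].strip()
        # Parse "Action: <tool> | Input: <text>" and run the named tool
        action, tool_input = step.split(" | Input: ")
        tool_name = action[len("Action: "):]
        result = TOOLS[tool_name](tool_input)
        # Feed the tool result back so the next step can reason over it
        transcript += f"{step}\nObservation: {result}\n"
    return "No answer within the step limit"

answer = react_loop("Find material about Python and summarize it", scripted_llm)
print(answer)
```

The step limit matters in practice: without it, a model that never emits a final answer would keep calling tools indefinitely.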

5. Easy to Extend #

python

from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever

class MyCustomRetriever(BaseRetriever):
    def _retrieve(self, query_bundle):
        # Put custom retrieval logic here; return a list of NodeWithScore
        return []

class MyCustomQueryEngine(CustomQueryEngine):
    def custom_query(self, query_str):
        # Put custom query logic here; return a string or a Response
        return ""
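
The pattern of overriding an underscore-prefixed hook such as `_retrieve` instead of the public method can be illustrated with a small standalone sketch. `ToyBaseRetriever` and `PrefixRetriever` are hypothetical classes written for this example, not part of LlamaIndex; the point is only that the base class owns the public entry point and shared behavior while subclasses supply the core logic.

```python
from abc import ABC, abstractmethod

class ToyBaseRetriever(ABC):
    def retrieve(self, query):
        # Public entry point: shared preprocessing, then the subclass hook
        return self._retrieve(query.strip().lower())

    @abstractmethod
    def _retrieve(self, query):
        """Subclasses implement the actual lookup here."""

class PrefixRetriever(ToyBaseRetriever):
    def __init__(self, texts):
        self.texts = texts

    def _retrieve(self, query):
        # Match texts that start with the (already normalized) query
        return [t for t in self.texts if t.lower().startswith(query)]

retriever = PrefixRetriever(["Vector index", "Tree index", "Vector store"])
print(retriever.retrieve("  VEC "))
```

Because callers only use `retrieve`, shared concerns such as input normalization or logging can be added in one place without touching any subclass.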

LlamaIndex Architecture #

Core Architecture Diagram #

text

┌─────────────────────────────────────────────────────────────┐
│                   LlamaIndex Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Application layer                   │  │
│  │     Q&A systems, chatbots, agents, data analysis      │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                      Query layer                      │  │
│  │ Query engines, retrievers, response synthesis, agents │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                      Index layer                      │  │
│  │      Vector, tree, keyword, and summary indexes       │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                      Data layer                       │  │
│  │        Documents, nodes, embeddings, metadata         │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     Storage layer                     │  │
│  │   Vector store, document store, index store, cache    │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Data Flow #

text

┌─────────────────────────────────────────────────────────────┐
│                  Data Processing Pipeline                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Load data                                               │
│     ┌─────────────┐                                         │
│     │ Data source │ ──→ SimpleDirectoryReader ──→ Documents │
│     │ (PDF, etc.) │                                         │
│     └─────────────┘                                         │
│                                                             │
│  2. Split documents                                         │
│     Documents ──→ NodeParser ──→ Nodes                      │
│                                                             │
│  3. Embed                                                   │
│     Nodes ──→ Embedding Model ──→ Embeddings                │
│                                                             │
│  4. Build the index                                         │
│     Nodes + Embeddings ──→ VectorStoreIndex                 │
│                                                             │
│  5. Process queries                                         │
│     Query ──→ Retriever ──→ Relevant Nodes                  │
│            ──→ Response Synthesizer ──→ Answer              │
│                                                             │
└─────────────────────────────────────────────────────────────┘
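
The five steps of the pipeline can be imitated end to end with the standard library. This is a toy sketch under clear assumptions: the "embedding" is just a bag-of-words count vector, the index is a plain list, and none of the function names below come from LlamaIndex.

```python
import math
import re
from collections import Counter

def split_into_nodes(document, chunk_size=8):
    """Step 2: cut a document into chunks of `chunk_size` words."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(text):
    """Step 3: toy embedding, a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: "load" a document (a real reader would parse files here)
document = (
    "LlamaIndex loads documents from many sources. "
    "Each document is split into nodes before indexing. "
    "Queries retrieve the most similar nodes as context."
)

# Steps 2-4: split, embed, and keep (node, vector) pairs as the index
index = [(node, embed(node)) for node in split_into_nodes(document)]

def query(question, top_k=1):
    """Step 5: return the top_k nodes closest to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [node for node, _ in ranked[:top_k]]

print(query("How are documents split into nodes?"))
```

Note that the fixed-size split cuts through sentence boundaries; real node parsers usually split on sentences and add overlap to limit that damage.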

LlamaIndex Use Cases #

1. Document Question Answering #

text

┌─────────────────────────────────────────────────────────────┐
│                     Document Q&A System                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   User question                                             │
│           │                                                 │
│           ▼                                                 │
│   ┌────────────────┐                                        │
│   │ Vector search  │ ←── Knowledge base (PDF, Word, web)    │
│   └────────────────┘                                        │
│           │                                                 │
│           ▼                                                 │
│   ┌────────────────┐                                        │
│   │ LLM generation │                                        │
│   └────────────────┘                                        │
│           │                                                 │
│           ▼                                                 │
│   Answer + cited sources                                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
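
The last box, an answer returned together with its sources, depends on metadata travelling with each retrieved passage. A minimal sketch of that idea follows; the `Node` and `Answer` dataclasses, the `source`/`page` metadata keys, and the stub model are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Answer:
    text: str
    sources: list

def answer_with_sources(question, retrieved, llm):
    """Generate an answer and attach the retrieved nodes' sources."""
    context = "\n".join(node.text for node in retrieved)
    text = llm(f"{context}\n\nQuestion: {question}")
    # De-duplicate sources while preserving retrieval order
    sources = list(dict.fromkeys(
        f"{n.metadata['source']} p.{n.metadata['page']}" for n in retrieved
    ))
    return Answer(text=text, sources=sources)

retrieved = [
    Node("Refunds are issued within 14 days.", {"source": "policy.pdf", "page": 3}),
    Node("Refund requests need an order number.", {"source": "policy.pdf", "page": 3}),
    Node("Contact support by email.", {"source": "faq.html", "page": 1}),
]
result = answer_with_sources("How do refunds work?", retrieved, lambda prompt: "stub answer")
print(result.text, result.sources)
```

The citations here list what was retrieved, not what the model actually relied on, so they support verification without guaranteeing it.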

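2. Intelligent Customer Service #

text

Use cases:
- Product inquiries
- After-sales support
- Frequently asked questions

Advantages:
✅ Available 24/7
✅ Consistent answer quality
✅ Traceable conversation records
✅ Continuous improvement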

3. Knowledge Management #

text

┌─────────────────────────────────────────────────────────────┐
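│               Enterprise Knowledge Management               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Data sources:                                              │
│  - Internal docs (Wiki, Confluence)                         │
│  - Code repositories (GitHub, GitLab)                       │
│  - Databases (MySQL, PostgreSQL)                            │
│  - API documentation                                        │
│                                                             │
│  Capabilities:                                              │
│  - Knowledge retrieval                                      │
│  - Smart recommendations                                    │
│  - Automatic summarization                                  │
│  - Knowledge graphs                                         │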
│                                                             │
└─────────────────────────────────────────────────────────────┘

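4. Code Assistant #

python

import os

from llama_index.core import VectorStoreIndex
from llama_index.readers.github import GithubClient, GithubRepositoryReader

# The reader needs a GitHub API client; the token is read from the environment
github_client = GithubClient(github_token=os.environ["GITHUB_TOKEN"])

reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
)

# Fetch the repository contents from the main branch and index them
documents = reader.load_data(branch="main")
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("How is VectorStoreIndex implemented?")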

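LlamaIndex vs. Other Frameworks #

Comparison #

Feature	LlamaIndex	LangChain	Haystack
Positioning	Data framework	Application framework	NLP framework
RAG support	✅ Core feature	✅ Supported	✅ Supported
Data connectors	Extensive	Extensive	Moderate
Index types	Multiple	Single	Single
Agent support	✅	✅ Core feature	Limited
Learning curve	Moderate	Steep	Gentle
Production-ready	✅	✅	✅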

Recommendations #

text

┌─────────────────────────────────────────────────────────────┐
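│                  Framework Selection Guide                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Choose LlamaIndex when:                                    │
│  ✅ RAG is the core of the application                      │
│  ✅ You need multiple index types                           │
│  ✅ The data processing pipeline is complex                 │
│  ✅ You are building an enterprise knowledge base           │
│                                                             │
│  Choose LangChain when:                                     │
│  ✅ You need complex agent workflows                        │
│  ✅ You orchestrate multi-step tasks                        │
│  ✅ You need many tool integrations                         │
│                                                             │
│  Choose Haystack when:                                      │
│  ✅ You are building NLP pipelines                          │
│  ✅ You need a production-grade search system               │
│  ✅ You need pretrained models                              │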
│                                                             │
└─────────────────────────────────────────────────────────────┘

Core Concepts in LlamaIndex #

Document #

python

from llama_index.core import Document

# A Document wraps raw text together with metadata about its origin
doc = Document(
    text="This is some text content...",
    metadata={"source": "example.pdf", "page": 1}
)

Node #

python

from llama_index.core.schema import TextNode

# A node is a chunk of a document and the unit that gets indexed
node = TextNode(
    text="This is a text node",
    metadata={"source": "example.pdf"}
)
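
The relationship between a document and its nodes can be sketched without the library. The example below assumes a simple word-window splitter with overlap; `ToyDocument`, `ToyNode`, `split_document`, and the `start_word` metadata key are hypothetical names for illustration, and the function expects `chunk_size` to be larger than `overlap`.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDocument:
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class ToyNode:
    text: str
    metadata: dict = field(default_factory=dict)

def split_document(doc, chunk_size=5, overlap=2):
    """Split a document into word windows that share `overlap` words."""
    words = doc.text.split()
    step = chunk_size - overlap  # assumes chunk_size > overlap
    nodes = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        # Each node copies the parent's metadata and records its position
        nodes.append(ToyNode(" ".join(chunk), {**doc.metadata, "start_word": start}))
        if start + chunk_size >= len(words):
            break
    return nodes

doc = ToyDocument(
    text="one two three four five six seven eight",
    metadata={"source": "example.pdf", "page": 1},
)
for node in split_document(doc):
    print(node.text, node.metadata)
```

Copying the parent's metadata onto every node is what later allows an answer to point back to the file and page a passage came from.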

Index #

python

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

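Query Engine #

python

# Retrieves relevant nodes and returns an LLM-written answer
query_engine = index.as_query_engine()
response = query_engine.query("Your question")

Retriever #

python

# Returns the matching nodes only, without generating an answer
retriever = index.as_retriever()
nodes = retriever.retrieve("Your query")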
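
The difference between the two concepts is easiest to see when one is built from the other. In this plain-Python sketch a query engine is simply a retriever followed by a synthesis step; `make_retriever`, `make_query_engine`, the word-overlap score, and the stub synthesizer are all hypothetical and only illustrate the composition.

```python
def make_retriever(corpus, top_k=2):
    """Return a function that ranks corpus entries by shared words."""
    def retrieve(query):
        q = set(query.lower().split())
        scored = [(len(q & set(text.lower().split())), text) for text in corpus]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(score, text) for score, text in scored[:top_k] if score > 0]
    return retrieve

def make_query_engine(retrieve, synthesize):
    """A query engine is retrieval followed by response synthesis."""
    def query(question):
        hits = retrieve(question)
        return synthesize(question, [text for _, text in hits])
    return query

corpus = [
    "a retriever returns scored nodes",
    "a query engine returns a synthesized answer",
    "embeddings are vectors",
]
retrieve = make_retriever(corpus)
# Stub synthesizer: a real one would prompt an LLM with the retrieved text
engine = make_query_engine(retrieve, lambda q, texts: f"{len(texts)} passages used")

print(retrieve("what does a retriever return"))
print(engine("what does a retriever return"))
```

Using the retriever on its own is handy for debugging, since it shows exactly which passages the answer would be based on before any model is involved.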

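Learning Path #

text

Beginner
├── Introduction to LlamaIndex (this article)
├── Installation and Configuration
├── Quick Start
└── Core Concepts

Intermediate
├── Data Connectors
├── Documents and Nodes
├── Index Types
└── Query Engines

Advanced
├── Advanced RAG Techniques
├── Agents
├── Evaluation and Optimization
└── Storage and Persistence

Hands-on Projects
├── Document Q&A System
├── Chatbot
└── Multimodal RAG

Next Steps #

Now that you know the basic concepts of LlamaIndex, continue with Installation and Configuration to start using it in practice.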