Estimated duration: 4-6 weeks
Goal: master model deployment and optimization, complete an end-to-end project, and build habits for continuous learning
Quantization: converting model weights from high precision to low precision
FP32 (32-bit) → FP16 (16-bit) → INT8 (8-bit) → INT4 (4-bit)
Benefits (the sketch below gives rough numbers):
- Smaller model size
- Faster inference
- Lower GPU memory usage
Tradeoff:
- Possible loss of accuracy
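For intuition, a minimal back-of-the-envelope sketch (the 7B parameter count is just an illustrative figure) of how weight memory scales with bit width:

```python
# Approximate weight-only memory for a 7B-parameter model at each precision
params = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name}: {gib:.1f} GiB")
# FP32: 26.1 GiB, FP16: 13.0 GiB, INT8: 6.5 GiB, INT4: 3.3 GiB
# (weights only; activations and the KV cache add more on top)
```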
| Method | Precision | Characteristics | Typical use |
|---|---|---|---|
| FP16 | 16-bit | Nearly lossless | Training and inference |
| INT8 | 8-bit | Slight accuracy loss | Deployment inference |
| GPTQ | 4-bit | Post-training quantization | Large-model inference |
| AWQ | 4-bit | Activation-aware | Large-model inference |
| GGUF | Various | llama.cpp format | CPU inference |
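To illustrate the GGUF row, a minimal CPU-inference sketch using the llama-cpp-python bindings (the GGUF file name here is hypothetical; use any checkpoint converted with llama.cpp's tools):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Hypothetical 4-bit GGUF file produced by llama.cpp's convert/quantize tools
llm = Llama(model_path="./qwen2-7b-instruct-q4_k_m.gguf", n_ctx=4096)
out = llm("Explain quantization in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```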
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

# GPTQ configuration: 4-bit weights, calibrated on the C4 dataset
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
)
# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",
    quantization_config=gptq_config,
    device_map="auto"
)
# Save the quantized model
model.save_pretrained("Qwen2-7B-GPTQ-4bit")
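The saved checkpoint stores its quantization config alongside the weights, so it can be reloaded without re-quantizing (assuming the GPTQ kernels, e.g. the auto-gptq package, are installed):

```python
# Reload the 4-bit checkpoint; no GPTQConfig is needed at load time
model = AutoModelForCausalLM.from_pretrained("Qwen2-7B-GPTQ-4bit", device_map="auto")
```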
# vLLM: a high-performance LLM inference engine
from vllm import LLM, SamplingParams
# Load the model
llm = LLM(
    model="Qwen/Qwen2-7B",
    tensor_parallel_size=2,        # tensor parallelism across 2 GPUs
    gpu_memory_utilization=0.9,
)
# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
# Batched inference
prompts = ["Question 1", "Question 2", "Question 3"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Key techniques behind vLLM's performance:

1. PagedAttention
   - Paged memory management, analogous to OS virtual-memory paging
   - Reduces GPU memory fragmentation
   - Improves batching efficiency
2. Continuous Batching
   - Dynamic batching
   - As soon as one request finishes, a new request takes its slot
   - Increases throughput (see the timing sketch after this list)
3. Tensor Parallelism
   - Multi-GPU parallel inference
   - Distributed inference for models too large for a single GPU
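The throughput gain from batching is easiest to see by timing a large prompt batch through `llm.generate` (a minimal sketch reusing the `llm` and `sampling_params` objects from above; absolute numbers depend entirely on your hardware):

```python
import time

prompts = [f"Write one sentence about topic {i}." for i in range(256)]
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to get an aggregate rate
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens / elapsed:.0f} generated tokens/s over {len(prompts)} requests")
```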
# TensorRT-LLM: NVIDIA's official acceleration stack
# 1. Convert the checkpoint
# python convert_checkpoint.py --model_dir ./model --output_dir ./trt_ckpt
# 2. Build the engine
# trtllm-build --checkpoint_dir ./trt_ckpt --output_dir ./trt_engine
# 3. Run inference
from tensorrt_llm import LLM
llm = LLM(model="./trt_engine")
output = llm.generate("Hello")

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
app = FastAPI()
# Load the model and tokenizer
model_name = "Qwen/Qwen2-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7

class ChatResponse(BaseModel):
    response: str
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
try:
inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=request.max_tokens,
temperature=request.temperature,
do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
return ChatResponse(response=response)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
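A quick client-side check of the endpoint (a sketch using the `requests` library; adjust host and port to your deployment):

```python
import requests

resp = requests.post(
    "http://localhost:8000/chat",
    json={"prompt": "Hello, please introduce yourself.", "max_tokens": 128, "temperature": 0.7},
)
resp.raise_for_status()
print(resp.json()["response"])
```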
from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer
from threading import Thread
import asyncio
@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
async def generate():
inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
# 使用 TextIteratorStreamer
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
**inputs,
max_new_tokens=request.max_tokens,
streamer=streamer,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
yield f"data: {text}\n\n"
await asyncio.sleep(0)
    return StreamingResponse(generate(), media_type="text/event-stream")
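The stream can be consumed line by line as server-sent events (a minimal sketch with `requests`; a production client would use a dedicated SSE library):

```python
import requests

with requests.post(
    "http://localhost:8000/chat/stream",
    json={"prompt": "Tell me a short story.", "max_tokens": 256, "temperature": 0.7},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)
```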
# Launch an OpenAI-compatible service with vLLM:
# python -m vllm.entrypoints.openai.api_server \
# --model Qwen/Qwen2-7B \
# --host 0.0.0.0 \
# --port 8000
# Client-side call
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
response = client.chat.completions.create(
    model="Qwen/Qwen2-7B",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
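The same endpoint also supports token-by-token streaming through the standard OpenAI client by passing `stream=True` (continuing with the `client` object above):

```python
stream = client.chat.completions.create(
    model="Qwen/Qwen2-7B",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```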
# Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
WORKDIR /app
# Install Python
RUN apt-get update && apt-get install -y python3 python3-pip
# Install dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# Copy the application code
COPY . .
# Download the model here (or mount it at runtime)
# RUN python3 download_model.py
EXPOSE 8000
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
"--model", "/models/Qwen2-7B", \
"--host", "0.0.0.0", \
"--port", "8000"]# docker-compose.yml
version: '3.8'
services:
  llm-server:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

┌─────────────────────────────────────────────────────────────┐
│                    Frontend (Streamlit)                     │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                       FastAPI backend                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │Doc ingestion │  │RAG retrieval │  │LLM generation│       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                         Data layer                           │
│        ┌──────────────┐        ┌──────────────┐             │
│        │    Milvus    │        │  PostgreSQL  │             │
│        │(vector store)│        │  (metadata)  │             │
│        └──────────────┘        └──────────────┘             │
└─────────────────────────────────────────────────────────────┘
# rag_system.py
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
class RAGSystem:
    def __init__(self):
        self.embeddings = HuggingFaceEmbeddings(
            model_name="BAAI/bge-base-zh-v1.5"
        )
        self.llm = ChatOpenAI(model="gpt-4", temperature=0)
        self.vectorstore = None
        self.chain = None
        # output_key="answer" is required because the chain also returns source_documents
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True,
            output_key="answer"
        )
    def load_documents(self, file_paths: list):
        documents = []
        for path in file_paths:
            if path.endswith('.pdf'):
                loader = PyPDFLoader(path)
            elif path.endswith('.docx'):
                loader = Docx2txtLoader(path)
            else:
                continue
            documents.extend(loader.load())
        # Split into chunks
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50
        )
        chunks = splitter.split_documents(documents)
        # Store in the vector database
        self.vectorstore = Milvus.from_documents(
            chunks,
            self.embeddings,
            connection_args={"host": "localhost", "port": "19530"}
        )
        # Build the conversational retrieval chain
        self.chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 3}),
            memory=self.memory,
            return_source_documents=True
        )

    def query(self, question: str):
        if not self.chain:
            return {"answer": "Please upload documents first.", "sources": []}
        result = self.chain({"question": question})
        return {
            "answer": result["answer"],
            "sources": [doc.page_content[:200] for doc in result["source_documents"]]
        }
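A minimal usage sketch of the class above (the file paths are hypothetical, and Milvus must already be running on localhost:19530):

```python
rag = RAGSystem()
rag.load_documents(["./docs/manual.pdf", "./docs/faq.docx"])  # hypothetical paths
result = rag.query("How do I reset my password?")
print(result["answer"])
print(result["sources"])
```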
Core features of an intelligent customer-service bot:
1. Intent recognition
2. Multi-turn dialogue management
3. Knowledge-base Q&A
4. Ticket creation
5. Handoff to a human agent
# chatbot.py
from enum import Enum
import openai

class Intent(Enum):
    QUERY = "query"          # general questions
    COMPLAINT = "complaint"  # complaints
    ORDER = "order"          # order-related
    TRANSFER = "transfer"    # transfer to a human
    OTHER = "other"
class ChatBot:
    def __init__(self):
        self.client = openai.OpenAI()
        self.conversation_history = []
        self.rag_system = RAGSystem()

    def classify_intent(self, message: str) -> Intent:
        """Intent classification."""
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": """
Classify the user's intent as exactly one of the following categories:
- query: general question
- complaint: complaint
- order: order-related
- transfer: asks for a human agent
- other: anything else
Return only the category name.
"""},
                {"role": "user", "content": message}
            ],
            temperature=0
        )
        intent_str = response.choices[0].message.content.strip().lower()
        return Intent(intent_str) if intent_str in [i.value for i in Intent] else Intent.OTHER
    def handle_query(self, message: str) -> str:
        """Answer a general question via RAG."""
        result = self.rag_system.query(message)
        return result["answer"]

    def handle_complaint(self, message: str) -> str:
        """Handle a complaint."""
        # Create a ticket
        ticket_id = self.create_ticket(message, "complaint")
        return (f"We are very sorry for the inconvenience. A complaint ticket has been "
                f"created (ID: {ticket_id}); we will contact you within 24 hours.")

    def handle_order(self, message: str) -> str:
        """Handle an order inquiry."""
        # Call the order API here
        return "Please provide your order number and I will look up its status."
    def chat(self, message: str) -> str:
        """Main conversation entry point."""
        self.conversation_history.append({"role": "user", "content": message})
        # Intent classification
        intent = self.classify_intent(message)
        # Route by intent
        if intent == Intent.TRANSFER:
            response = "Certainly, transferring you to a human agent. One moment please..."
        elif intent == Intent.QUERY:
            response = self.handle_query(message)
        elif intent == Intent.COMPLAINT:
            response = self.handle_complaint(message)
        elif intent == Intent.ORDER:
            response = self.handle_order(message)
        else:
            response = self.handle_query(message)  # default to Q&A
        self.conversation_history.append({"role": "assistant", "content": response})
        return response
    def create_ticket(self, content: str, ticket_type: str) -> str:
        """Create a ticket (stub implementation)."""
        import uuid
        return str(uuid.uuid4())[:8].upper()
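A quick interaction sketch (assumes `OPENAI_API_KEY` is set and the underlying RAG system has documents loaded):

```python
bot = ChatBot()
print(bot.chat("My order arrived damaged and nobody has responded!"))  # routed to handle_complaint
print(bot.chat("Where is my order?"))                                  # routed to handle_order
```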
# code_assistant.py
from openai import OpenAI
class CodeAssistant:
    def __init__(self):
        self.client = OpenAI()
        self.system_prompt = """
You are a professional coding assistant fluent in multiple programming languages.
Your capabilities:
1. Code generation: write code from a requirement
2. Code explanation: explain what code does and how
3. Code review: spot potential problems and suggest fixes
4. Bug fixing: analyze and repair broken code
5. Optimization: suggest performance improvements
Output requirements:
- Wrap code in markdown code blocks
- Add necessary comments
- Explain the key logic
"""
    def generate_code(self, requirement: str, language: str = "python") -> str:
        """Generate code."""
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": f"Implement in {language}: {requirement}"}
            ]
        )
        return response.choices[0].message.content

    def explain_code(self, code: str) -> str:
        """Explain code."""
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": f"Please explain the following code:\n```\n{code}\n```"}
            ]
        )
        return response.choices[0].message.content
    def review_code(self, code: str) -> str:
        """Review code."""
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": f"""
Please review the following code and comment on:
1. Code style
2. Potential bugs
3. Performance issues
4. Security risks
5. Suggested improvements
Code:
```
{code}
```
"""}
            ]
        )
        return response.choices[0].message.content
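Example invocations (assumes an OpenAI API key in the environment):

```python
assistant = CodeAssistant()
print(assistant.generate_code("binary search over a sorted list", language="python"))
print(assistant.review_code("def add(a, b): return a+b"))
```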
| Paper | Topic | Importance |
|---|---|---|
| Attention Is All You Need | The Transformer | ⭐⭐⭐⭐⭐ |
| BERT | Bidirectional pre-training | ⭐⭐⭐⭐⭐ |
| GPT-3 (Language Models are Few-Shot Learners) | In-Context Learning | ⭐⭐⭐⭐⭐ |
| InstructGPT | RLHF | ⭐⭐⭐⭐⭐ |
| LoRA | Parameter-efficient fine-tuning | ⭐⭐⭐⭐⭐ |
| RAG | Retrieval-augmented generation | ⭐⭐⭐⭐ |
| Chain-of-Thought | Chain-of-thought reasoning | ⭐⭐⭐⭐ |
| LLaMA | Open-source LLMs | ⭐⭐⭐⭐ |
| DPO | Direct Preference Optimization | ⭐⭐⭐⭐ |
- arXiv: https://arxiv.org/list/cs.CL/recent
- Papers With Code: https://paperswithcode.com/
- Top conferences: ACL, EMNLP, NAACL, NeurIPS, ICML, ICLR
| Project | Description | Link |
|---|---|---|
| Transformers | HuggingFace model library | github.com/huggingface/transformers |
| LangChain | LLM application framework | github.com/langchain-ai/langchain |
| LlamaIndex | Data framework for LLMs | github.com/run-llama/llama_index |
| vLLM | High-performance inference | github.com/vllm-project/vllm |
| OpenAI Cookbook | Best practices | github.com/openai/openai-cookbook |
| FastChat | Chat-model training | github.com/lm-sys/FastChat |
| text-generation-webui | Web UI for model deployment | github.com/oobabooga/text-generation-webui |
- Technical blog
  - Learning notes
  - Project write-ups
  - Source-code analysis
- GitHub
  - Open-source your own projects
  - Contribute to open-source communities
  - Maintain a repository of learning notes
- Technical sharing
  - Internal team talks
  - Talks in technical communities
  - Recording video tutorials
Once you have completed the tasks below, congratulations: you have finished the AI algorithms learning path!
- Model deployment
  - Master model quantization techniques
  - Accelerate inference with vLLM
  - Deploy an OpenAI-compatible API service
- Hands-on projects
  - Build the RAG document Q&A system
  - Implement the customer-service bot or the code assistant
  - Ship a project that is live and reachable
- Continuous learning
  - Establish a paper-reading habit
  - Follow activity in open-source projects
  - Start publishing a technical blog
Congratulations! You have completed the full learning path from zero to LLMs. You now have:
- Theoretical foundations: solid math, machine learning, and deep learning basics
- A tech stack: Python, PyTorch, Transformers, LangChain
- LLM skills: prompt engineering, fine-tuning, RAG, agents
- Engineering ability: model deployment, serving, hands-on projects
Where to go next:
- Specialize: pick one subfield and study it in depth
- Join open source: contribute to the community and build influence
- Keep learning: AI moves fast; stay current
- Build: use these techniques to solve real problems
Habits worth keeping:
- Periodically review and refresh your knowledge base
- Follow industry news and frontier research
- Keep exchanging ideas and sharing with the community
Wishing you even greater achievements in AI! 🎉