LLM Fine-tuning Documentation

Table of Contents

Environment Setup
Fine-tuning
DPO
ORPO
Merge LoRA
Quantization
Inference
Web-UI
Push Model

Environment Setup

GPU devices: A10, 3090, V100, and A100 are all supported.

# Set a global pip mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

# If you want to use deepspeed.
pip install deepspeed -U

# If you want to use auto_gptq-based qlora training. (Recommended; performs better than bnb)
# Models that support auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/Instruction/支持的模型和数据集.md#模型`
# auto_gptq versions are tied to CUDA versions; choose a version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
pip install auto_gptq -U

# If you want to use bnb-based qlora training.
pip install bitsandbytes -U

# Environment alignment (usually not needed. If you hit errors, run the commands below; the repository is tested against the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
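
Before launching a run, it can help to confirm which of the optional packages above are importable. The following is a minimal sketch using only the Python standard library. The helper name `check_optional_packages` is hypothetical and not part of ms-swift, and it only checks that a package can be found, not that it matches your CUDA version.

```python
import importlib.util

# Import names of the optional packages installed above.
OPTIONAL_PACKAGES = ('deepspeed', 'auto_gptq', 'bitsandbytes')


def check_optional_packages(names=OPTIONAL_PACKAGES):
    """Return {package_name: bool}, True when the package can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == '__main__':
    for name, found in check_optional_packages().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

A missing entry only matters if you plan to use the corresponding feature (deepspeed, or auto_gptq/bnb qlora training).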

Fine-tuning

If you prefer to fine-tune and run inference through a web interface, see the Web-UI training and inference documentation.

Using Python

# Experimental environment: A10, 3090, V100, ...
# 20GB GPU memory
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch

from swift.llm import (
    DatasetName, InferArguments, ModelType, SftArguments,
    infer_main, sft_main, app_ui_main
)

# Fine-tune Qwen-7B-Chat on 2000 samples from blossom-math-zh
model_type = ModelType.qwen_7b_chat
sft_args = SftArguments(
    model_type=model_type,
    dataset=[f'{DatasetName.blossom_math_zh}#2000'],
    output_dir='output')
result = sft_main(sft_args)
last_model_checkpoint = result['last_model_checkpoint']
print(f'last_model_checkpoint: {last_model_checkpoint}')
torch.cuda.empty_cache()  # release cached GPU memory before inference

# Run inference on the last checkpoint, reusing the dataset settings from training
infer_args = InferArguments(
    ckpt_dir=last_model_checkpoint,
    load_dataset_config=True)
# merge_lora(infer_args, device_map='cpu')
result = infer_main(infer_args)
torch.cuda.empty_cache()

# Launch the web UI with the same inference arguments
app_ui_main(infer_args)

Using the CLI

# Experimental environment: A10, 3090, V100, ...
# 20GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output

# Use your own dataset
# For the custom dataset format, see: https://github.com/modelscope/swift/blob/main/docs/source/Instruction/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset chatml.jsonl \
    --output_dir output

# Use DDP
# Experimental environment: 2 * 3090
# 2 * 23GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output

# Multi-node, multi-GPU
# If the nodes do not share a disk, also pass `--save_on_each_node true` in the sh script on every machine.
# node0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
NPROC_PER_NODE=4 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output

# node1
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxx.xxx.xxx.xxx \
NPROC_PER_NODE=4 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output
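
In the multi-node example above, the two nodes differ only in `NODE_RANK` and `MASTER_ADDR`. The sketch below generates the per-node launch line from a few parameters, which may reduce copy-paste mistakes when there are more nodes. The helper names `node_launch_env` and `render_shell` and the address `10.0.0.1` are hypothetical placeholders. The sketch also passes the same master address to every node, whereas the example above uses 127.0.0.1 on node0.

```python
import shlex

# Command shared by every node, copied from the multi-node example above.
SFT_COMMAND = [
    'swift', 'sft',
    '--model_id_or_path', 'qwen/Qwen-7B-Chat',
    '--dataset', 'AI-ModelScope/blossom-math-v2',
    '--output_dir', 'output',
]


def node_launch_env(node_rank, nnodes, master_addr, gpus_per_node):
    """Build the environment variables used in the example for one node."""
    if not 0 <= node_rank < nnodes:
        raise ValueError('node_rank must be in [0, nnodes)')
    return {
        'CUDA_VISIBLE_DEVICES': ','.join(str(i) for i in range(gpus_per_node)),
        'NNODES': str(nnodes),
        'NODE_RANK': str(node_rank),
        'MASTER_ADDR': master_addr,
        'NPROC_PER_NODE': str(gpus_per_node),
    }


def render_shell(env, command):
    """Render env assignments plus command as one shell line for inspection."""
    assignments = [f'{key}={shlex.quote(value)}' for key, value in env.items()]
    return ' '.join(assignments + [shlex.quote(part) for part in command])


if __name__ == '__main__':
    master = '10.0.0.1'  # placeholder: replace with node0's reachable address
    for rank in range(2):
        print(render_shell(node_launch_env(rank, 2, master, 4), SFT_COMMAND))
```

Each printed line can be pasted into the shell of the matching node. If the nodes do not share a disk, append `--save_on_each_node true` as noted above.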

More sh scripts

More sh scripts are available in the repository.

# The scripts must be run from this directory
cd examples/pytorch/llm

Tips:

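By default, training sets --gradient_checkpointing true to save GPU memory, which slightly reduces training speed.
To use the quantization parameter --quantization_bit 4, first install bnb: pip install bitsandbytes -U. This reduces GPU memory usage but usually slows training.
To use auto_gptq-based quantization, first install the auto_gptq build that matches your CUDA version: pip install auto_gptq -U.

Models that support auto_gptq are listed in the LLM supported models documentation. We recommend auto_gptq over bnb.
To use deepspeed, run pip install deepspeed -U. deepspeed saves GPU memory but may slightly reduce training speed.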
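If your training involves knowledge editing, such as self-cognition fine-tuning, you also need to apply LoRA to the MLP layers; otherwise results may be poor. You can simply pass --lora_target_modules ALL to apply LoRA to all linear layers (qkvo, mlp), which usually gives the best results.
If you use an older GPU such as the V100, set --dtype AUTO or --dtype fp16, because these GPUs do not support bf16.
If your machine has high-performance GPUs such as the A100 and the model supports flash-attn, we recommend installing flash-attn, which speeds up training and inference and reduces GPU memory usage (A10, 3090, V100, and similar GPUs do not support training with flash-attn). Models that support flash-attn are listed in the LLM supported models documentation.
For continued pre-training or multi-turn dialogue, see the customization and extension documentation.
To train offline, use --model_id_or_path <model_dir> and set --check_model_is_latest false. See the command-line arguments documentation for what each parameter means.
To push weights to ModelScope Hub during training, set --push_to_hub true.
To merge the LoRA weights and save them at inference time, set --merge_lora true. Merging is not recommended for models trained with qlora, because it loses precision. For this reason fine-tuning with qlora is not recommended: its deployment ecosystem is poor.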
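
Several of the tips above map a situation to a specific command-line flag. The sketch below collects those mappings in one function so the extra flags for a `swift sft` command can be derived consistently. It uses only flags named in the tips. The function name `suggest_sft_flags`, its parameters, and the `GPUS_WITHOUT_BF16` set are hypothetical and not part of ms-swift.

```python
# GPUs the tips name as lacking bf16 support; extend this for your own hardware.
GPUS_WITHOUT_BF16 = {'V100'}


def suggest_sft_flags(gpu_name, knowledge_editing=False,
                      offline_model_dir=None, push_to_hub=False):
    """Return extra `swift sft` flags implied by the tips above."""
    flags = []
    if gpu_name.upper() in GPUS_WITHOUT_BF16:
        # Older GPUs cannot use bf16, so fall back to fp16.
        flags += ['--dtype', 'fp16']
    if knowledge_editing:
        # Apply LoRA to all linear layers, including the MLP.
        flags += ['--lora_target_modules', 'ALL']
    if offline_model_dir is not None:
        # Offline training: load from a local directory and skip the update check.
        flags += ['--model_id_or_path', offline_model_dir,
                  '--check_model_is_latest', 'false']
    if push_to_hub:
        flags += ['--push_to_hub', 'true']
    return flags


if __name__ == '__main__':
    print(' '.join(suggest_sft_flags('V100', knowledge_editing=True)))
```

The returned list can be appended to the argument list of a `swift sft` invocation. When `offline_model_dir` is given, drop any other `--model_id_or_path` value from the base command.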

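Note:

For historical naming reasons, scripts ending in xxx_ds use deepspeed zero2 for training (e.g. full_ddp_ds).
Scripts other than those listed below are not guaranteed to be maintained.

To write a custom script, you can adapt one of the following (these are maintained regularly):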

full: qwen1half-7b-chat (A100), qwen-7b-chat (2*A100)
full+ddp+zero2: qwen-7b-chat (4*A100)
full+ddp+zero3: qwen-14b-chat (4*A100)
lora: chatglm3-6b (3090), baichuan2-13b-chat (2*3090), yi-34b-chat (A100), qwen-72b-chat (2*A100)
lora+ddp: chatglm3-6b (2*3090)
lora+ddp+zero3: qwen-14b-chat (4*3090), qwen-72b-chat (4*A100)
qlora(gptq-int4): qwen-14b-chat-int4 (3090), qwen1half-72b-chat-int4 (A100)
qlora(gptq-int8): qwen-14b-chat-int8 (3090)
qlora(bnb-int4): qwen-14b-chat (3090), llama2-70b-chat (2*3090)

DPO

To use DPO for human alignment, see the DPO training documentation.

ORPO

To use ORPO for human alignment, see the ORPO best practices.

Merge LoRA

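Tip: merging LoRA is not currently supported for bnb and auto_gptq quantized models, because it causes significant precision loss.

# If you need quantization, specify `--quant_bits 4`.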
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
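
To make the merge step concrete, the sketch below shows the standard LoRA merge formula, W_merged = W + (alpha / r) * B A, on tiny matrices in pure Python. It illustrates the general technique only and is not ms-swift's implementation. The names `matmul` and `merge_lora_weights` and all numeric values are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora_weights(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


if __name__ == '__main__':
    weight = [[1.0, 0.0], [0.0, 1.0]]   # base weight W (2x2)
    lora_a = [[3.0, 4.0]]               # A (rank x in_features) = 1x2
    lora_b = [[1.0], [2.0]]             # B (out_features x rank) = 2x1
    merged = merge_lora_weights(weight, lora_a, lora_b, alpha=2, rank=1)
    print(merged)  # [[7.0, 8.0], [12.0, 17.0]]
```

After merging, a single matrix multiply with the merged weight gives the same output as running the base weight and the adapter separately, which is why the merged checkpoint can be loaded like an ordinary model.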

Quantization

To quantize a fine-tuned model, see the LLM quantization and export documentation.

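Inference

To use VLLM for inference acceleration, see the VLLM inference acceleration and deployment documentation.

Original model

For single-sample inference, see the LLM inference documentation.

Evaluation on a dataset: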

CUDA_VISIBLE_DEVICES=0 swift infer --model_id_or_path qwen/Qwen-7B-Chat --dataset AI-ModelScope/blossom-math-v2

Fine-tuned model

Single-sample inference:

Inference with LoRA incremental weights:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)
from swift.tuners import Swift

ckpt_dir = 'vx-xxx/checkpoint-100'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
model_id_or_path = None
model, tokenizer = get_model_tokenizer(model_type, model_id_or_path=model_id_or_path, model_kwargs={'device_map': 'auto'})

model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)
template = get_template(template_type, tokenizer)
query = 'xxxxxx'
response, history = inference(model, template, query)
print(f'response: {response}')
print(f'history: {history}')

Inference with LoRA-merged weights:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)

ckpt_dir = 'vx-xxx/checkpoint-100-merged'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'},
                                       model_id_or_path=ckpt_dir)

template = get_template(template_type, tokenizer)
query = 'xxxxxx'
response, history = inference(model, template, query)
print(f'response: {response}')
print(f'history: {history}')

Evaluation on a dataset:

# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' \
    --load_dataset_config true

# To use a different val_dataset
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --val_dataset <your-val-dataset>

# Merge the LoRA incremental weights, then run inference
# If you need quantization, specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' --load_dataset_config true

Human evaluation:

# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx'

# Merge the LoRA incremental weights, then run inference
# If you need quantization, specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'
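
The merge-then-infer examples above follow a fixed pattern: `swift export --merge_lora true` is run on a checkpoint, and inference then points at the same path with a `-merged` suffix. The sketch below builds both commands from one checkpoint path so the two stay in sync. The helper names `merged_ckpt_dir` and `merge_then_infer_commands` are hypothetical, and the `-merged` suffix is taken from the examples above rather than from any API guarantee.

```python
import shlex


def merged_ckpt_dir(ckpt_dir):
    """Path of the merged checkpoint, following the naming used above."""
    return ckpt_dir.rstrip('/') + '-merged'


def merge_then_infer_commands(ckpt_dir, load_dataset_config=False):
    """Return (export_cmd, infer_cmd) as argument lists."""
    export_cmd = ['swift', 'export', '--ckpt_dir', ckpt_dir,
                  '--merge_lora', 'true']
    infer_cmd = ['swift', 'infer', '--ckpt_dir', merged_ckpt_dir(ckpt_dir)]
    if load_dataset_config:
        # Reuse the dataset settings saved during training.
        infer_cmd += ['--load_dataset_config', 'true']
    return export_cmd, infer_cmd


if __name__ == '__main__':
    for cmd in merge_then_infer_commands('xxx/vx-xxx/checkpoint-xxx',
                                         load_dataset_config=True):
        print(shlex.join(cmd))
```

The argument lists can be passed to `subprocess.run(cmd, check=True)` so that inference only starts if the export succeeded.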

Web-UI

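To deploy with VLLM and serve an API, see the VLLM inference acceleration and deployment documentation.

Original model

For the web-ui with the original model, see the LLM inference documentation.

Fine-tuned model

# Use app-ui directly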
CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx'

# Merge the LoRA incremental weights, then use app-ui
# If you need quantization, specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'

Push Model

To push a model to ModelScope, see the model push documentation.