diff --git a/01.LLM_Theory_Course/01.Industry_Model_Introduction/01.Classic_Model_Technical_Analysis/01.BERT/bert_emotect_finetune.ipynb b/01.LLM_Theory_Course/01.Industry_Model_Introduction/01.Classic_Model_Technical_Analysis/01.BERT/bert_emotect_finetune.ipynb index 6ceed46..e271533 100644 --- a/01.LLM_Theory_Course/01.Industry_Model_Introduction/01.Classic_Model_Technical_Analysis/01.BERT/bert_emotect_finetune.ipynb +++ b/01.LLM_Theory_Course/01.Industry_Model_Introduction/01.Classic_Model_Technical_Analysis/01.BERT/bert_emotect_finetune.ipynb @@ -4,159 +4,61 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 环境配置" + "# 基于 MindSpore 的 BERT 模型实现对话情绪识别\n", + "\n", + "## 案例介绍\n", + "\n", + "该案例以一个文本情感分类任务为例,说明BERT模型微调的完整应用过程。\n", + "\n", + "## 模型简介\n", + "\n", + "BERT全称是来自变换器的双向编码器表征量(Bidirectional Encoder Representations from Transformers),它是Google于2018年末开发并发布的一种新型语言模型。以BERT为代表的预训练语言模型在问答、命名实体识别、自然语言推理、文本分类等许多自然语言处理任务中发挥着重要作用。BERT模型基于Transformer中的Encoder并采用双向结构,因此学习BERT之前需要熟练掌握Transformer Encoder的结构。\n", + "\n", + "BERT模型的主要创新点在pre-train方法上,即用了Masked Language Model和Next Sentence Prediction两种方法分别捕捉词语和句子级别的representation。\n", + "\n", + "在用Masked Language Model方法训练BERT的时候,随机选取语料库中15%的单词做Mask操作,具体分为三种情况:80%的单词直接用[MASK]替换、10%的单词替换成另一个随机单词、10%的单词保持不变。\n", + "\n", + "因为涉及到Question Answering (QA) 和 Natural Language Inference (NLI)之类的任务,BERT增加了Next Sentence Prediction预训练任务,目的是让模型理解两个句子之间的联系。与Masked Language Model任务相比,Next Sentence Prediction更简单些:训练的输入是句子A和B,B有一半的几率是A的下一句,BERT模型根据这两个句子预测B是不是A的下一句。\n", + "\n", + "BERT预训练之后,会保存它的Embedding table和12层Transformer权重(BERT-BASE)或24层Transformer权重(BERT-LARGE)。使用预训练好的BERT模型可以对下游任务进行Fine-tuning,比如:文本分类、相似度判断、阅读理解等。\n", + "\n", + "对话情绪识别(Emotion Detection,简称EmoTect)专注于识别智能对话场景中用户的情绪:针对智能对话场景中的用户文本,自动判断该文本的情绪类别并给出相应的置信度,情绪类型分为积极、消极、中性。对话情绪识别适用于聊天、客服等多个场景,能够帮助企业更好地把握对话质量、改善产品的用户交互体验,也能分析客服服务质量、降低人工质检成本。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "安装MindSpore框架和MindNLP套件" - ] - }, - { - "cell_type": 
"code", - "execution_count": 1, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple\n", - "Collecting mindspore==2.5.0\n", - " Using cached https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.5.0/MindSpore/unified/aarch64/mindspore-2.5.0-cp39-cp39-linux_aarch64.whl (345.0 MB)\n", - "Requirement already satisfied: pip in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (25.1)\n", - "\u001b[31mERROR: Could not find a version that satisfies the requirement install (from versions: none)\u001b[0m\u001b[31m\n", - "\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n", - "\u001b[31mERROR: No matching distribution found for install\u001b[0m\u001b[31m\n", - "\u001b[0m" - ] - } - ], - "source": [ - "!pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.5.0/MindSpore/unified/aarch64/mindspore-2.5.0-cp39-cp39-linux_aarch64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple" + "## 环境配置\n", + "\n", + "本案例的运行环境为:\n", + "\n", + "| Python | MindSpore | MindSpore NLP |\n", + "| :----- | :-------- | :------------ |\n", + "| 3.10 | 2.7.0 | 0.5.1 |\n", + "\n", + "如果你在如[昇思大模型平台](https://xihe.mindspore.cn/training-projects)、[华为云ModelArts](https://www.huaweicloud.com/product/modelarts.html)、[启智社区](https://openi.pcl.ac.cn/)等算力平台的Jupyter在线编程环境中运行本案例,可取消如下代码的注释,进行依赖库安装:" ] }, { "cell_type": "code", - "execution_count": 2, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"Looking in indexes: https://repo.huaweicloud.com/repository/pypi/simple/\n", - "Requirement already satisfied: mindnlp==0.4.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (0.4.0)\n", - "Requirement already satisfied: mindspore>=2.2.14 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.5.0)\n", - "Requirement already satisfied: tqdm in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (4.67.1)\n", - "Requirement already satisfied: requests in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.32.3)\n", - "Requirement already satisfied: datasets in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (3.6.0)\n", - "Requirement already satisfied: evaluate in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.4.3)\n", - "Requirement already satisfied: tokenizers==0.19.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.19.1)\n", - "Requirement already satisfied: safetensors in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.3)\n", - "Requirement already satisfied: sentencepiece in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.2.0)\n", - "Requirement already satisfied: regex in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2024.11.6)\n", - "Requirement already satisfied: addict in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: ml-dtypes in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.1)\n", - "Requirement already satisfied: pyctcdecode in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) 
(0.5.0)\n", - "Requirement already satisfied: jieba in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.42.1)\n", - "Requirement already satisfied: pytest==7.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (7.2.0)\n", - "Requirement already satisfied: pillow>=10.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (11.2.1)\n", - "Requirement already satisfied: attrs>=19.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (25.3.0)\n", - "Requirement already satisfied: iniconfig in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.1.0)\n", - "Requirement already satisfied: packaging in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (24.2)\n", - "Requirement already satisfied: pluggy<2.0,>=0.12 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.5.0)\n", - "Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.2.0)\n", - "Requirement already satisfied: tomli>=1.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.2.1)\n", - "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from tokenizers==0.19.1->mindnlp==0.4.0) (0.32.3)\n", - "Requirement already satisfied: filelock in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (3.18.0)\n", - "Requirement already satisfied: fsspec>=2023.5.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from 
huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (2025.3.0)\n", - "Requirement already satisfied: pyyaml>=5.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (6.0.2)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (4.12.2)\n", - "Requirement already satisfied: hf-xet<2.0.0,>=1.1.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (1.1.2)\n", - "Requirement already satisfied: numpy<2.0.0,>=1.20.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.26.4)\n", - "Requirement already satisfied: protobuf>=3.13.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (6.30.2)\n", - "Requirement already satisfied: asttokens>=2.0.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (3.0.0)\n", - "Requirement already satisfied: scipy>=1.5.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.13.1)\n", - "Requirement already satisfied: psutil>=5.6.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (5.9.0)\n", - "Requirement already satisfied: astunparse>=1.6.3 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.6.3)\n", - "Requirement already satisfied: dill>=0.3.7 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (0.3.8)\n", - "Requirement already satisfied: wheel<1.0,>=0.23.0 in 
/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore>=2.2.14->mindnlp==0.4.0) (0.45.1)\n", - "Requirement already satisfied: six<2.0,>=1.6.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore>=2.2.14->mindnlp==0.4.0) (1.17.0)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (20.0.0)\n", - "Requirement already satisfied: pandas in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (2.2.3)\n", - "Requirement already satisfied: xxhash in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (3.5.0)\n", - "Requirement already satisfied: multiprocess<0.70.17 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (0.70.16)\n", - "Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (3.12.7)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (2.6.1)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.3.2)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (5.0.1)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in 
/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.6.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (6.4.4)\n", - "Requirement already satisfied: propcache>=0.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (0.3.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.20.0)\n", - "Requirement already satisfied: idna>=2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from yarl<2.0,>=1.17.0->aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (3.10)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (3.4.1)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (2025.4.26)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2.9.0.post0)\n", - "Requirement already satisfied: pytz>=2020.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Requirement already 
satisfied: tzdata>=2022.7 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Requirement already satisfied: pygtrie<3.0,>=2.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (2.5.0)\n", - "Requirement already satisfied: hypothesis<7,>=6.14 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (6.133.2)\n", - "Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from hypothesis<7,>=6.14->pyctcdecode->mindnlp==0.4.0) (2.4.0)\n", - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n" - ] - } - ], - "source": [ - "!pip install mindnlp==0.4.0" - ] - }, - { - "cell_type": "markdown", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "#### 注:MindNLP whl包下载链接为:[MindNLP](https://repo.mindspore.cn/mindspore-lab/mindnlp/newest/any/)" + "# !pip install mindspore==2.7.0 mindnlp==0.5.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 基于 MindSpore 实现 BERT 对话情绪识别\n", - "\n", - "\n", - "## 模型简介\n", - "\n", - "BERT全称是来自变换器的双向编码器表征量(Bidirectional Encoder Representations from Transformers),它是Google于2018年末开发并发布的一种新型语言模型。与BERT模型相似的预训练语言模型例如问答、命名实体识别、自然语言推理、文本分类等在许多自然语言处理任务中发挥着重要作用。模型是基于Transformer中的Encoder并加上双向的结构,因此一定要熟练掌握Transformer的Encoder的结构。\n", - "\n", - "BERT模型的主要创新点都在pre-train方法上,即用了Masked Language Model和Next Sentence Prediction两种方法分别捕捉词语和句子级别的representation。\n", - "\n", - "在用Masked Language 
Model方法训练BERT的时候,随机把语料库中15%的单词做Mask操作。对于这15%的单词做Mask操作分为三种情况:80%的单词直接用[Mask]替换、10%的单词直接替换成另一个新的单词、10%的单词保持不变。\n", - "\n", - "因为涉及到Question Answering (QA) 和 Natural Language Inference (NLI)之类的任务,增加了Next Sentence Prediction预训练任务,目的是让模型理解两个句子之间的联系。与Masked Language Model任务相比,Next Sentence Prediction更简单些,训练的输入是句子A和B,B有一半的几率是A的下一句,输入这两个句子,BERT模型预测B是不是A的下一句。\n", - "\n", - "BERT预训练之后,会保存它的Embedding table和12层Transformer权重(BERT-BASE)或24层Transformer权重(BERT-LARGE)。使用预训练好的BERT模型可以对下游任务进行Fine-tuning,比如:文本分类、相似度判断、阅读理解等。\n", - "\n", - "对话情绪识别(Emotion Detection,简称EmoTect),专注于识别智能对话场景中用户的情绪,针对智能对话场景中的用户文本,自动判断该文本的情绪类别并给出相应的置信度,情绪类型分为积极、消极、中性。 对话情绪识别适用于聊天、客服等多个场景,能够帮助企业更好地把握对话质量、改善产品的用户交互体验,也能分析客服服务质量、降低人工质检成本。\n", - "\n", - "下面以一个文本情感分类任务为例子来说明BERT模型的整个应用过程。" + "其他场景可参考[MindSpore安装指南](https://www.mindspore.cn/install)与[MindSpore NLP安装指南](https://github.com/mindspore-lab/mindnlp?tab=readme-ov-file#installation)进行环境搭建。" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": { "tags": [] }, @@ -165,62 +67,29 @@ "name": "stderr", "output_type": "stream", "text": [ - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", " return self._float_to_str(self.smallest_subnormal)\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest 
subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", " return self._float_to_str(self.smallest_subnormal)\n", - "Building prefix dict from the default dictionary ...\n", - "Loading model from cache /tmp/jieba.cache\n", - "Loading model cost 0.908 seconds.\n", - "Prefix dict has been built successfully.\n" + "/usr/local/python3.10.14/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "Modular Diffusers is currently an experimental feature under active development. The API is subject to breaking changes in future releases.\n", + "[WARNING] ME(3859:281473418565184,MainProcess):2025-11-25-13:56:29.950.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'pynative_synchronize' will be deprecated and removed in a future version. 
Please use the api mindspore.runtime.launch_blocking() instead.\n" ] } ], "source": [ "import os\n", - "\n", + "import mindnlp\n", "import mindspore\n", - "from mindspore.dataset import text, GeneratorDataset, transforms\n", + "from datasets import Dataset\n", "from mindspore import nn, context\n", "\n", - "from mindnlp.engine.trainer import Trainer" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# prepare dataset\n", - "class SentimentDataset:\n", - " \"\"\"Sentiment Dataset\"\"\"\n", - "\n", - " def __init__(self, path):\n", - " self.path = path\n", - " self._labels, self._text_a = [], []\n", - " self._load()\n", - "\n", - " def _load(self):\n", - " with open(self.path, \"r\", encoding=\"utf-8\") as f:\n", - " dataset = f.read()\n", - " lines = dataset.split(\"\\n\")\n", - " for line in lines[1:-1]:\n", - " label, text_a = line.split(\"\\t\")\n", - " self._labels.append(int(label))\n", - " self._text_a.append(text_a)\n", - "\n", - " def __getitem__(self, index):\n", - " return self._labels[index], self._text_a[index]\n", - "\n", - " def __len__(self):\n", - " return len(self._labels)" + "mindspore.set_context(pynative_synchronize=True)" ] }, { @@ -244,7 +113,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": { "tags": [] }, @@ -253,16 +122,16 @@ "name": "stdout", "output_type": "stream", "text": [ - "--2025-06-03 16:26:40-- https://baidu-nlp.bj.bcebos.com/emotion_detection-dataset-1.0.0.tar.gz\n", - "正在解析主机 baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 36.110.192.178, 2409:8c04:1001:1203:0:ff:b0bb:4f27\n", - "正在连接 baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|36.110.192.178|:443... 已连接。\n", - "已发出 HTTP 请求,正在等待回应... 
200 OK\n", - "长度:1710581 (1.6M) [application/x-gzip]\n", - "正在保存至: “emotion_detection.tar.gz”\n", + "--2025-11-25 13:56:41-- https://baidu-nlp.bj.bcebos.com/emotion_detection-dataset-1.0.0.tar.gz\n", + "Resolving proxy-notebook.modelarts.com (proxy-notebook.modelarts.com)... 192.168.0.33\n", + "Connecting to proxy-notebook.modelarts.com (proxy-notebook.modelarts.com)|192.168.0.33|:8083... connected.\n", + "Proxy request sent, awaiting response... 200 OK\n", + "Length: 1710581 (1.6M) [application/x-gzip]\n", + "Saving to: ‘emotion_detection.tar.gz’\n", "\n", - "emotion_detection.t 100%[===================>] 1.63M 7.02MB/s 用时 0.2s \n", + "emotion_detection.t 100%[===================>] 1.63M 986KB/s in 1.7s \n", "\n", - "2025-06-03 16:26:41 (7.02 MB/s) - 已保存 “emotion_detection.tar.gz” [1710581/1710581])\n", + "2025-11-25 13:56:43 (986 KB/s) - ‘emotion_detection.tar.gz’ saved [1710581/1710581]\n", "\n", "data/\n", "data/test.tsv\n", @@ -285,168 +154,77 @@ "source": [ "### 数据加载和数据预处理\n", "\n", - "新建 process_dataset 函数用于数据加载和数据预处理,具体内容可见下面代码注释。" + "具体内容可见下面代码注释。" ] }, { "cell_type": "code", - "execution_count": 6, - "metadata": { - "tags": [] - }, + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ - "import numpy as np\n", - "\n", - "def process_dataset(source, tokenizer, max_seq_len=64, batch_size=32, shuffle=True):\n", - " is_ascend = mindspore.get_context('device_target') == 'Ascend'\n", - "\n", - " column_names = [\"label\", \"text_a\"]\n", - " \n", - " dataset = GeneratorDataset(source, column_names=column_names, shuffle=shuffle)\n", - " # transforms\n", - " type_cast_op = transforms.TypeCast(mindspore.int32)\n", - " def tokenize_and_pad(text):\n", - " if is_ascend:\n", - " tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=max_seq_len)\n", - " else:\n", - " tokenized = tokenizer(text)\n", - " return tokenized['input_ids'], tokenized['attention_mask']\n", - " # map dataset\n", - " dataset = 
dataset.map(operations=tokenize_and_pad, input_columns=\"text_a\", output_columns=['input_ids', 'attention_mask'])\n", - " dataset = dataset.map(operations=[type_cast_op], input_columns=\"label\", output_columns='labels')\n", - " # batch dataset\n", - " if is_ascend:\n", - " dataset = dataset.batch(batch_size)\n", - " else:\n", - " dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),\n", - " 'attention_mask': (None, 0)})\n", - "\n", - " return dataset" + "# 准备数据集(假设数据格式:标签\\t文本)\n", + "def load_dataset(path):\n", + " labels, texts = [], []\n", + " with open(path, 'r', encoding='utf-8') as f:\n", + " for line in f.readlines()[1:]: # 跳过标题行\n", + " if line.strip():\n", + " label, text = line.strip().split('\\t')\n", + " labels.append(int(label))\n", + " texts.append(text)\n", + " return {'label': labels, 'text': texts}\n" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "昇腾NPU环境下暂不支持动态Shape,数据预处理部分采用静态Shape处理:" + "# 加载数据\n", + "train_data = load_dataset(\"data/train.tsv\")\n", + "val_data = load_dataset(\"data/dev.tsv\")\n", + "test_data = load_dataset(\"data/test.tsv\")\n", + "\n", + "# 转换为Hugging Face数据集格式\n", + "train_dataset = Dataset.from_dict(train_data)\n", + "val_dataset = Dataset.from_dict(val_data)\n", + "test_dataset = Dataset.from_dict(test_data)" ] }, { "cell_type": "code", - "execution_count": 7, - "metadata": { - "tags": [] - }, + "execution_count": null, + "metadata": {}, "outputs": [ { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "70d7820ca2334d3ba52d2b57e7a23918", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0.00/49.0 [00:00\n", + "数据集列名: ['input_ids', 'attention_mask', 'labels']\n", + "数据集特征: {'input_ids': List(Value('int32')), 'attention_mask': List(Value('int8')), 'labels': Value('int32')}\n", + "数据集长度: 9655\n", + "\n", + "前3个样本:\n", + "样本 0: {'input_ids': [101, 
872, 4638, 4495, 3189, 1762, 8043, 3299, 8043, 3189, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}\n", + "样本 1: {'input_ids': [101, 1420, 6432, 872, 3221, 671, 702, 3265, 4511, 8024, 3221, 1408, 8043, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}\n", + "样本 2: {'input_ids': [101, 872, 2823, 5756, 4638, 1416, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}\n" ] } ], "source": [ - "%env HF_ENDPOINT=https://hf-mirror.com" + "# 查看数据集信息\n", + "print(\"数据集信息:\")\n", + "print(f\"数据集类型: {type(tokenized_train)}\")\n", + "print(f\"数据集列名: {tokenized_train.column_names}\")\n", + "print(f\"数据集特征: {tokenized_train.features}\")\n", + "print(f\"数据集长度: {len(tokenized_train)}\")\n", + "\n", + "# 查看前3个样本\n", + "print(\"\\n前3个样本:\")\n", + "for i in range(min(3, len(tokenized_train))):\n", + " print(f\"样本 {i}: {tokenized_train[i]}\")" ] }, { @@ -530,38 +356,22 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": { "tags": [] }, 
"outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "15fc8f5f71ab4c5ea26e9e6b9e5f0743", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0.00/392M [00:00\n", + " \n", + " \n", + " [151/151 02:51, Epoch 1/1]\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
EpochTraining LossValidation LossAccuracy
1No log0.2168330.917593

" + ], "text/plain": [ - " 0%| | 0/34 [00:00" ] }, "metadata": {}, "output_type": "display_data" }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 0.056950367987155914, 'eval_accuracy': 0.9861111111111112, 'eval_runtime': 1.5702, 'eval_samples_per_second': 21.653, 'eval_steps_per_second': 21.653, 'epoch': 4.0}\n", - "{'loss': 0.0568, 'learning_rate': 1.456953642384106e-06, 'epoch': 4.64}\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0/34 [00:00=2.2.14 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.4.1)\n", - "Requirement already satisfied: tqdm in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (4.67.1)\n", - "Requirement already satisfied: requests in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.32.3)\n", - "Requirement already satisfied: datasets in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (3.6.0)\n", - "Requirement already satisfied: evaluate in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.4.3)\n", - "Requirement already satisfied: tokenizers==0.19.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.19.1)\n", - "Requirement already satisfied: safetensors in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.3)\n", - "Requirement already satisfied: sentencepiece in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.2.0)\n", - "Requirement already satisfied: regex in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2024.11.6)\n", - "Requirement already satisfied: addict in 
/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: ml-dtypes in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.1)\n", - "Requirement already satisfied: pyctcdecode in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.0)\n", - "Requirement already satisfied: jieba in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.42.1)\n", - "Requirement already satisfied: pytest==7.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (7.2.0)\n", - "Requirement already satisfied: pillow>=10.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (11.2.1)\n", - "Requirement already satisfied: attrs>=19.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (25.3.0)\n", - "Requirement already satisfied: iniconfig in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.1.0)\n", - "Requirement already satisfied: packaging in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (24.2)\n", - "Requirement already satisfied: pluggy<2.0,>=0.12 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.5.0)\n", - "Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.2.0)\n", - "Requirement already satisfied: tomli>=1.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.2.1)\n", - "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from 
tokenizers==0.19.1->mindnlp==0.4.0) (0.31.1)\n", - "Requirement already satisfied: filelock in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (3.18.0)\n", - "Requirement already satisfied: fsspec>=2023.5.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (2025.3.0)\n", - "Requirement already satisfied: pyyaml>=5.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (6.0.2)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (4.12.2)\n", - "Requirement already satisfied: hf-xet<2.0.0,>=1.1.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (1.1.0)\n", - "Requirement already satisfied: numpy<2.0.0,>=1.20.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.26.4)\n", - "Requirement already satisfied: protobuf>=3.13.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (6.30.2)\n", - "Requirement already satisfied: asttokens>=2.0.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (3.0.0)\n", - "Requirement already satisfied: scipy>=1.5.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.13.1)\n", - "Requirement already satisfied: psutil>=5.6.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (5.9.0)\n", - "Requirement already satisfied: 
astunparse>=1.6.3 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.6.3)\n", - "Requirement already satisfied: wheel<1.0,>=0.23.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore>=2.2.14->mindnlp==0.4.0) (0.45.1)\n", - "Requirement already satisfied: six<2.0,>=1.6.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore>=2.2.14->mindnlp==0.4.0) (1.17.0)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (20.0.0)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (0.3.8)\n", - "Requirement already satisfied: pandas in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (2.2.3)\n", - "Requirement already satisfied: xxhash in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (3.5.0)\n", - "Requirement already satisfied: multiprocess<0.70.17 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (0.70.16)\n", - "Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (3.11.18)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (2.6.1)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from 
aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.3.2)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (5.0.1)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.6.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (6.4.3)\n", - "Requirement already satisfied: propcache>=0.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (0.3.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.20.0)\n", - "Requirement already satisfied: idna>=2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from yarl<2.0,>=1.17.0->aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (3.10)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (3.4.1)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) 
(2025.4.26)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2.9.0.post0)\n", - "Requirement already satisfied: pytz>=2020.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Requirement already satisfied: tzdata>=2022.7 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Requirement already satisfied: pygtrie<3.0,>=2.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (2.5.0)\n", - "Requirement already satisfied: hypothesis<7,>=6.14 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (6.131.15)\n", - "Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from hypothesis<7,>=6.14->pyctcdecode->mindnlp==0.4.0) (2.4.0)\n", - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n", - "Looking in indexes: https://repo.huaweicloud.com/repository/pypi/simple/\n", - "Requirement already satisfied: pytesseract in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (0.3.13)\n", - "Requirement already satisfied: packaging>=21.3 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytesseract) (24.2)\n", - "Requirement already satisfied: Pillow>=8.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytesseract) (11.2.1)\n", 
- "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n" + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", + " setattr(self, word, getattr(machar, word).flat[0])\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + " return self._float_to_str(self.smallest_subnormal)\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", + " setattr(self, word, getattr(machar, word).flat[0])\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + " return self._float_to_str(self.smallest_subnormal)\n", + "[WARNING] ME(64329:281473217238592,MainProcess):2025-11-25-16:57:21.881.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'pynative_synchronize' will be deprecated and removed in a future version. 
Please use the api mindspore.runtime.launch_blocking() instead.\n" ] } ], "source": [ - "# install mindnlp\n", - "!pip install mindnlp==0.4.0\n", - "!pip install pytesseract" + "import mindspore\n", + "mindspore.set_context(pynative_synchronize=True) # 开启同步设置,方便后续问题定位" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "0ad0c6f0", "metadata": {}, "outputs": [ @@ -295,69 +267,11 @@ "name": "stderr", "output_type": "stream", "text": [ - "[WARNING] GE_ADPT(6108,ffffab724020,python):2025-05-13-12:28:08.673.160 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclmdlBundleGetModelId failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclmdlBundleGetModelId\n", - "[WARNING] GE_ADPT(6108,ffffab724020,python):2025-05-13-12:28:08.673.218 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclmdlBundleLoadFromMem failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclmdlBundleLoadFromMem\n", - "[WARNING] GE_ADPT(6108,ffffab724020,python):2025-05-13-12:28:08.673.244 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclmdlBundleUnload failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclmdlBundleUnload\n", - "[WARNING] GE_ADPT(6108,ffffab724020,python):2025-05-13-12:28:08.673.383 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclrtGetMemUceInfo failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclrtGetMemUceInfo\n", - "[WARNING] GE_ADPT(6108,ffffab724020,python):2025-05-13-12:28:08.673.408 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclrtDeviceTaskAbort failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclrtDeviceTaskAbort\n", - "[WARNING] 
GE_ADPT(6108,ffffab724020,python):2025-05-13-12:28:08.673.431 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclrtMemUceRepair failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclrtMemUceRepair\n", - "[WARNING] GE_ADPT(6108,ffffab724020,python):2025-05-13-12:28:08.674.937 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol acltdtCleanChannel failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libacl_tdt_channel.so: undefined symbol: acltdtCleanChannel\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", - " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", - " return self._float_to_str(self.smallest_subnormal)\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", - " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", - " return self._float_to_str(self.smallest_subnormal)\n", - "Building prefix dict from the default dictionary ...\n", - "Loading model from cache /tmp/jieba.cache\n", - "Loading model cost 1.042 seconds.\n", - "Prefix dict has been built successfully.\n" + "/usr/local/python3.10.14/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "Modular Diffusers is currently an experimental feature under active development. The API is subject to breaking changes in future releases.\n" ] }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "6421e22f124345fe924aa445c22403a2", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "0.00B [00:00, ?B/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "30735811eedd4f2ea525d38c702c177f", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "0.00B [00:00, ?B/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "2e931b32674045c599aeb89a2088f180", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0.00/334 [00:00 \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n", - "\u001b[31mERROR: No matching distribution found for install\u001b[0m\u001b[31m\n", - "\u001b[0m" - ] - } - ], + "cell_type": "markdown", + "metadata": {}, "source": [ - "!pip install pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.5.0/MindSpore/unified/aarch64/mindspore-2.5.0-cp39-cp39-linux_aarch64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple" + "## 环境配置\n", + "\n", + "本案例的运行环境为:\n", + "\n", + "| Python | MindSpore | MindSpore NLP |\n", + "| :----- | :-------- | :------------ |\n", + "| 3.10 | 2.7.0 | 0.5.1 |\n", + "\n", + 
"如果你在如[昇思大模型平台](https://xihe.mindspore.cn/training-projects)、[华为云ModelArts](https://www.huaweicloud.com/product/modelarts.html)、[启智社区](https://openi.pcl.ac.cn/)等算力平台的Jupyter在线编程环境中运行本案例,可取消如下代码的注释,进行依赖库安装:" ] }, { "cell_type": "code", - "execution_count": 2, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looking in indexes: https://repo.huaweicloud.com/repository/pypi/simple/\n", - "Requirement already satisfied: mindnlp==0.4.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (0.4.0)\n", - "Requirement already satisfied: mindspore>=2.2.14 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.4.1)\n", - "Requirement already satisfied: tqdm in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (4.67.1)\n", - "Requirement already satisfied: requests in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.32.3)\n", - "Requirement already satisfied: datasets in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (3.6.0)\n", - "Requirement already satisfied: evaluate in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.4.3)\n", - "Requirement already satisfied: tokenizers==0.19.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.19.1)\n", - "Requirement already satisfied: safetensors in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.3)\n", - "Requirement already satisfied: sentencepiece in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.2.0)\n", - "Requirement already satisfied: regex in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2024.11.6)\n", - "Requirement already satisfied: addict in 
/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: ml-dtypes in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.1)\n", - "Requirement already satisfied: pyctcdecode in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.0)\n", - "Requirement already satisfied: jieba in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.42.1)\n", - "Requirement already satisfied: pytest==7.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (7.2.0)\n", - "Requirement already satisfied: pillow>=10.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindnlp==0.4.0) (11.1.0)\n", - "Requirement already satisfied: attrs>=19.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (25.3.0)\n", - "Requirement already satisfied: iniconfig in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.1.0)\n", - "Requirement already satisfied: packaging in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (24.2)\n", - "Requirement already satisfied: pluggy<2.0,>=0.12 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.5.0)\n", - "Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.2.0)\n", - "Requirement already satisfied: tomli>=1.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.2.1)\n", - "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from 
tokenizers==0.19.1->mindnlp==0.4.0) (0.32.3)\n", - "Requirement already satisfied: numpy<2.0.0,>=1.20.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.26.4)\n", - "Requirement already satisfied: protobuf>=3.13.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (6.30.2)\n", - "Requirement already satisfied: asttokens>=2.0.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (2.0.5)\n", - "Requirement already satisfied: scipy>=1.5.4 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.13.1)\n", - "Requirement already satisfied: psutil>=5.6.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (5.9.0)\n", - "Requirement already satisfied: astunparse>=1.6.3 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.6.3)\n", - "Requirement already satisfied: filelock in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (3.18.0)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (20.0.0)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (0.3.8)\n", - "Requirement already satisfied: pandas in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (2.2.3)\n", - "Requirement already satisfied: xxhash in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (3.5.0)\n", - "Requirement already satisfied: multiprocess<0.70.17 in 
/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (0.70.16)\n", - "Requirement already satisfied: fsspec<=2025.3.0,>=2023.1.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (2025.3.0)\n", - "Requirement already satisfied: pyyaml>=5.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (6.0.2)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (3.4.1)\n", - "Requirement already satisfied: idna<4,>=2.5 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (3.10)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (2.3.0)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (2025.1.31)\n", - "Requirement already satisfied: pygtrie<3.0,>=2.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (2.5.0)\n", - "Requirement already satisfied: hypothesis<7,>=6.14 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (6.133.2)\n", - "Requirement already satisfied: six in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from asttokens>=2.0.4->mindspore>=2.2.14->mindnlp==0.4.0) (1.16.0)\n", - "Requirement already satisfied: wheel<1.0,>=0.23.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore>=2.2.14->mindnlp==0.4.0) (0.45.1)\n", - "Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in 
/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (3.12.7)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (4.12.2)\n", - "Requirement already satisfied: hf-xet<2.0.0,>=1.1.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (1.1.2)\n", - "Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from hypothesis<7,>=6.14->pyctcdecode->mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2.9.0.post0)\n", - "Requirement already satisfied: pytz>=2020.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Requirement already satisfied: tzdata>=2022.7 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (2.6.1)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.3.2)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (5.0.1)\n", 
- "Requirement already satisfied: frozenlist>=1.1.1 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.6.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (6.4.4)\n", - "Requirement already satisfied: propcache>=0.2.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (0.3.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets->mindnlp==0.4.0) (1.20.0)\n", - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ - "!pip install mindnlp==0.4.0" + "# !pip install mindspore==2.7.0 mindnlp==0.5.1" ] }, { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looking in indexes: https://repo.huaweicloud.com/repository/pypi/simple/\n", - "Requirement already satisfied: jieba in /home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages (0.42.1)\n", - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: 
\u001b[0m\u001b[31;49m25.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n" - ] - } - ], + "cell_type": "markdown", + "metadata": {}, "source": [ - "!pip install jieba" + "其他场景可参考[MindSpore安装指南](https://www.mindspore.cn/install)与[MindSpore NLP安装指南](https://github.com/mindspore-lab/mindnlp?tab=readme-ov-file#installation)进行环境搭建。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 安装依赖" ] }, { "cell_type": "code", - "execution_count": 4, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "env: HF_ENDPOINT=https://hf-mirror.com\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ - "%env HF_ENDPOINT=https://hf-mirror.com" + "!pip install jieba" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": { "tags": [] }, @@ -170,128 +79,165 @@ "name": "stderr", "output_type": "stream", "text": [ - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:02.546.745 [mindspore/run_check/_check_version.py:329] MindSpore version 2.4.1 and Ascend AI software package (Ascend Data Center Solution)version 7.6 does not match, the version of software package expect one of ['7.3', '7.5']. 
Please refer to the match info on: https://www.mindspore.cn/install\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", " return self._float_to_str(self.smallest_subnormal)\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", " return self._float_to_str(self.smallest_subnormal)\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:04.810.368 [mindspore/run_check/_check_version.py:347] MindSpore version 2.4.1 and \"te\" wheel package version 7.6 does not match. 
For details, refer to the installation guidelines: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:04.812.087 [mindspore/run_check/_check_version.py:354] MindSpore version 2.4.1 and \"hccl\" wheel package version 7.6 does not match. For details, refer to the installation guidelines: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:04.812.731 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 3\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:05.814.363 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 2\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:06.816.061 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 1\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:09.473.235 [mindspore/run_check/_check_version.py:329] MindSpore version 2.4.1 and Ascend AI software package (Ascend Data Center Solution)version 7.6 does not match, the version of software package expect one of ['7.3', '7.5']. Please refer to the match info on: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:09.474.964 [mindspore/run_check/_check_version.py:347] MindSpore version 2.4.1 and \"te\" wheel package version 7.6 does not match. For details, refer to the installation guidelines: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:09.475.567 [mindspore/run_check/_check_version.py:354] MindSpore version 2.4.1 and \"hccl\" wheel package version 7.6 does not match. 
For details, refer to the installation guidelines: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:09.476.234 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 3\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:10.477.850 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 2\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:11.478.869 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 1\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:12.480.761 [mindspore/run_check/_check_version.py:329] MindSpore version 2.4.1 and Ascend AI software package (Ascend Data Center Solution)version 7.6 does not match, the version of software package expect one of ['7.3', '7.5']. Please refer to the match info on: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:12.482.175 [mindspore/run_check/_check_version.py:347] MindSpore version 2.4.1 and \"te\" wheel package version 7.6 does not match. For details, refer to the installation guidelines: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:12.482.802 [mindspore/run_check/_check_version.py:354] MindSpore version 2.4.1 and \"hccl\" wheel package version 7.6 does not match. 
For details, refer to the installation guidelines: https://www.mindspore.cn/install\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:12.483.400 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 3\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:13.485.045 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 2\n", - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:14.486.640 [mindspore/run_check/_check_version.py:368] Please pay attention to the above warning, countdown: 1\n", - "Building prefix dict from the default dictionary ...\n", - "Loading model from cache /tmp/jieba.cache\n", - "Loading model cost 1.075 seconds.\n", - "Prefix dict has been built successfully.\n" + "/usr/local/python3.10.14/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "Modular Diffusers is currently an experimental feature under active development. The API is subject to breaking changes in future releases.\n", + "[WARNING] ME(13257:281473667298880,MainProcess):2025-11-25-14:35:28.604.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.\n", + "[WARNING] ME(13257:281473667298880,MainProcess):2025-11-25-14:35:28.606.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_id' will be deprecated and removed in a future version. 
Please use the api mindspore.set_device() instead.\n" ] } ], "source": [ "import os\n", - "\n", + "import mindnlp\n", "import mindspore\n", "from mindspore.dataset import text, GeneratorDataset, transforms\n", "from mindspore import nn\n", - "mindspore.set_context(device_target='Ascend', device_id=0)\n", "\n", - "from mindnlp.dataset import load_dataset\n", + "from datasets import load_dataset\n", "\n", - "from mindnlp.engine.trainer import Trainer" + "from transformers import Trainer" ] }, { "cell_type": "code", - "execution_count": 6, - "metadata": { - "tags": [] - }, - "outputs": [], + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[WARNING] ME(13257:281473667298880,MainProcess):2025-11-25-14:35:30.889.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'pynative_synchronize' will be deprecated and removed in a future version. Please use the api mindspore.runtime.launch_blocking() instead.\n" + ] + } + ], + "source": [ + "mindspore.set_context(pynative_synchronize=True) #开启同步设置,方便后续定位" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, "source": [ - "imdb_ds = load_dataset('imdb', split=['train', 'test'])\n", - "imdb_train = imdb_ds['train']\n", - "imdb_test = imdb_ds['test']\n", - "# 为加快运行速度只选取一部分训练\n", - "imdb_train, _ = imdb_train.split([0.1, 0.9])\n", - "imdb_test, _ = imdb_test.split([0.1, 0.9])" + "## 数据加载与预处理\n", + "\n", + "### 数据集加载" ] }, { "cell_type": "code", - "execution_count": 7, - "metadata": { - "tags": [] - }, + "execution_count": 5, + "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "2500" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 141645.48 examples/s]\n", + "Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 143740.58 examples/s]\n", + 
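The `load_dataset('imdb', split=['train[:10%]', 'test[:10%]'])` call in this notebook uses the `datasets` slice syntax to sample the splits at load time. A minimal pure-Python sketch of how a percent slice maps to example counts (the `slice_count` helper is illustrative only, not part of the `datasets` API):

```python
def slice_count(total: int, pct: int) -> int:
    # A percent slice such as "train[:10%]" keeps roughly pct percent of the
    # split; with IMDB's 25,000 train/test examples, 10% is exactly 2,500.
    return total * pct // 100

train_size = slice_count(25_000, 10)
test_size = slice_count(25_000, 10)
print(train_size, test_size)  # 2500 2500
```

This matches the sizes printed by the notebook (`采样后训练集大小: 2500`, `采样后测试集大小: 2500`).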
"Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 153439.56 examples/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "采样后训练集大小: 2500\n", + "采样后测试集大小: 2500\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] } ], "source": [ - "imdb_train.get_dataset_size()" + "# 在加载时直接采样10%的数据\n", + "imdb_ds = load_dataset('imdb', split=['train[:10%]', 'test[:10%]'])\n", + "imdb_train = imdb_ds[0] # 10%的训练数据\n", + "imdb_test = imdb_ds[1] # 10%的测试数据\n", + "\n", + "print(f\"采样后训练集大小: {len(imdb_train)}\")\n", + "print(f\"采样后测试集大小: {len(imdb_test)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 数据预处理" ] }, { "cell_type": "code", - "execution_count": 8, - "metadata": { - "tags": [] - }, + "execution_count": 6, + "metadata": {}, "outputs": [], "source": [ - "import numpy as np\n", + "from datasets import Dataset\n", + "from torch.utils.data import DataLoader\n", + "from transformers import AutoTokenizer, DataCollatorWithPadding\n", "\n", "def process_dataset(dataset, tokenizer, max_seq_len=512, batch_size=4, shuffle=False):\n", - " is_ascend = mindspore.get_context('device_target') == 'Ascend'\n", - " def tokenize(text):\n", - " if is_ascend:\n", - " tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=max_seq_len)\n", - " else:\n", - " tokenized = tokenizer(text, truncation=True, max_length=max_seq_len)\n", - " return tokenized['input_ids'], tokenized['attention_mask']\n", - "\n", + " \n", + " def tokenize_function(examples):\n", + " return tokenizer(\n", + " examples['text'],\n", + " truncation=True,\n", + " max_length=max_seq_len,\n", + " return_tensors=None\n", + " )\n", + " \n", + " # 如果 shuffle 为 True,先打乱数据集\n", " if shuffle:\n", - " dataset = dataset.shuffle(batch_size)\n", - "\n", - " # map dataset\n", - " dataset = dataset.map(operations=[tokenize], input_columns=\"text\", output_columns=['input_ids', 'attention_mask'])\n", - " 
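Both the old non-Ascend path (`padded_batch` with a `pad_info` of `(None, tokenizer.pad_token_id)`) and the new `DataCollatorWithPadding(padding='longest')` collator implement the same idea: pad each batch only up to that batch's longest sequence, rather than to a fixed `max_seq_len`. A self-contained sketch of that dynamic-padding step, using the 0/1 attention-mask convention from the surrounding code (`pad_batch` is an illustrative helper, not a library function):

```python
def pad_batch(sequences, pad_id):
    """Pad a batch of token-id lists to the batch's longest sequence,
    returning padded ids plus a 0/1 attention mask (1 = real token)."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
# ids  -> [[5, 6, 7], [8, 0, 0]]
# mask -> [[1, 1, 1], [1, 0, 0]]
```

Dynamic padding is why the batch shapes printed later vary (e.g. `[4, 268]` vs `[4, 512]`): each batch is only as wide as its longest example, capped by `max_seq_len`.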
dataset = dataset.map(operations=transforms.TypeCast(mindspore.int32), input_columns=\"label\", output_columns=\"labels\")\n", - " # batch dataset\n", - " if is_ascend:\n", - " dataset = dataset.batch(batch_size)\n", - " else:\n", - " dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),\n", - " 'attention_mask': (None, 0)})\n", - "\n", - " return dataset" + " dataset = dataset.shuffle(seed=42)\n", + " \n", + " # 应用 tokenize 函数\n", + " dataset = dataset.map(\n", + " tokenize_function,\n", + " batched=True,\n", + " remove_columns=['text']\n", + " )\n", + " \n", + " # 重命名标签列\n", + " if 'label' in dataset.column_names:\n", + " dataset = dataset.rename_column('label', 'labels')\n", + " \n", + " # 设置格式\n", + " dataset.set_format(type='torch')\n", + " \n", + " # 使用 DataCollatorWithPadding 处理所有 padding\n", + " data_collator = DataCollatorWithPadding(\n", + " tokenizer=tokenizer,\n", + " padding='longest' # 动态 padding\n", + " )\n", + " \n", + " dataloader = DataLoader(\n", + " dataset,\n", + " batch_size=batch_size,\n", + " shuffle=shuffle,\n", + " collate_fn=data_collator,\n", + " drop_last=False\n", + " )\n", + " \n", + " return dataloader" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 7, "metadata": { "tags": [] }, @@ -300,14 +246,12 @@ "name": "stderr", "output_type": "stream", "text": [ - "ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindnlp/transformers/tokenization_utils_base.py:1526: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted, and will be then set to `False` by default. 
\n", - " warnings.warn(\n" + "ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.\n" ] } ], "source": [ - "from mindnlp.transformers import OpenAIGPTTokenizer\n", + "from transformers import OpenAIGPTTokenizer\n", "# tokenizer\n", "gpt_tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n", "\n", @@ -322,67 +266,147 @@ }, { "cell_type": "code", - "execution_count": 10, - "metadata": { - "tags": [] - }, + "execution_count": 8, + "metadata": {}, "outputs": [ { - "name": "stderr", + "name": "stdout", "output_type": "stream", "text": [ - "[WARNING] ME(250:281472944873504,MainProcess):2025-06-03-14:44:51.758.363 [mindspore/dataset/engine/datasets.py:2534] Dataset is shuffled before split.\n" + "训练集大小: 1750\n", + "验证集大小: 750\n" ] } ], "source": [ "# split train dataset into train and valid datasets\n", - "imdb_train, imdb_val = imdb_train.split([0.7, 0.3])" + "dataset_split = imdb_train.train_test_split(test_size=0.3, seed=42)\n", + "imdb_train = dataset_split['train']\n", + "imdb_val = dataset_split['test']\n", + "\n", + "print(f\"训练集大小: {len(imdb_train)}\")\n", + "print(f\"验证集大小: {len(imdb_val)}\")" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 9, "metadata": { "tags": [] }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Map: 100%|██████████| 1750/1750 [00:15<00:00, 111.91 examples/s]\n", + "Map: 100%|██████████| 750/750 [00:07<00:00, 104.21 examples/s]\n", + "Map: 100%|██████████| 2500/2500 [00:21<00:00, 113.66 examples/s]\n" + ] + } + ], "source": [ "dataset_train = process_dataset(imdb_train, gpt_tokenizer, shuffle=True)\n", "dataset_val = process_dataset(imdb_val, gpt_tokenizer)\n", "dataset_test = process_dataset(imdb_test, gpt_tokenizer)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 查看数据信息" + ] + }, { "cell_type": "code", - "execution_count": 12, - "metadata": { - "tags": [] - }, + "execution_count": 10, + "metadata": 
{}, "outputs": [ { - "data": { - "text/plain": [ - "[Tensor(shape=[4, 512], dtype=Int64, value=\n", - " [[ 500, 246, 1322 ... 40480, 40480, 40480],\n", - " [ 1473, 980, 246 ... 40480, 40480, 40480],\n", - " [39516, 498, 481 ... 40480, 40480, 40480],\n", - " [ 616, 544, 808 ... 40480, 40480, 40480]]),\n", - " Tensor(shape=[4, 512], dtype=Int64, value=\n", - " [[1, 1, 1 ... 0, 0, 0],\n", - " [1, 1, 1 ... 0, 0, 0],\n", - " [1, 1, 1 ... 0, 0, 0],\n", - " [1, 1, 1 ... 0, 0, 0]]),\n", - " Tensor(shape=[4], dtype=Int32, value= [1, 1, 0, 0])]" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "检查训练数据:\n", + "[MS_ALLOC_CONF]Runtime config: enable_vmm:True vmm_align_size:2MB\n", + "Batch 0:\n", + " Input IDs shape: mindtorch.Size([4, 268])\n", + " Attention mask shape: mindtorch.Size([4, 268])\n", + " Labels: [0 0 0 0]\n", + "Batch 1:\n", + " Input IDs shape: mindtorch.Size([4, 512])\n", + " Attention mask shape: mindtorch.Size([4, 512])\n", + " Labels: [0 0 0 0]\n", + "Batch 2:\n", + " Input IDs shape: mindtorch.Size([4, 512])\n", + " Attention mask shape: mindtorch.Size([4, 512])\n", + " Labels: [0 0 0 0]\n" + ] } ], "source": [ - "next(dataset_train.create_tuple_iterator())" + "# 检查数据\n", + "print(\"检查训练数据:\")\n", + "for i, batch in enumerate(dataset_train):\n", + " print(f\"Batch {i}:\")\n", + " print(f\" Input IDs shape: {batch['input_ids'].shape}\")\n", + " print(f\" Attention mask shape: {batch['attention_mask'].shape}\")\n", + " print(f\" Labels: {batch['labels']}\")\n", + " if i >= 2: # 只看前3个batch\n", + " break" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 现有的 process_dataset 函数返回 DataLoader,可以这样提取 Dataset\n", + "def extract_dataset_from_dataloader(dataloader):\n", + " \"\"\"从 DataLoader 中提取原始的 Dataset\"\"\"\n", + " return dataloader.dataset\n", + "\n", + "# 使用示例\n", + "dataset_train = 
extract_dataset_from_dataloader(dataset_train)\n", + "dataset_val = extract_dataset_from_dataloader(dataset_val)\n", + "dataset_test = extract_dataset_from_dataloader(dataset_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 加载评估指标" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Looking in indexes: http://pip.modelarts.private.com:8888/repository/pypi/simple\n", + "Collecting scikit-learn\n", + " Downloading http://pip.modelarts.private.com:8888/repository/pypi/packages/scikit-learn/1.7.2/scikit_learn-1.7.2-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl (9.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m9.5/9.5 MB\u001b[0m \u001b[31m133.8 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: numpy>=1.22.0 in /usr/local/python3.10.14/lib/python3.10/site-packages (from scikit-learn) (1.26.4)\n", + "Requirement already satisfied: scipy>=1.8.0 in /usr/local/python3.10.14/lib/python3.10/site-packages (from scikit-learn) (1.10.1)\n", + "Collecting joblib>=1.2.0 (from scikit-learn)\n", + " Downloading http://pip.modelarts.private.com:8888/repository/pypi/packages/joblib/1.5.2/joblib-1.5.2-py3-none-any.whl (308 kB)\n", + "Collecting threadpoolctl>=3.1.0 (from scikit-learn)\n", + " Downloading http://pip.modelarts.private.com:8888/repository/pypi/packages/threadpoolctl/3.6.0/threadpoolctl-3.6.0-py3-none-any.whl (18 kB)\n", + "Installing collected packages: threadpoolctl, joblib, scikit-learn\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3/3\u001b[0m [scikit-learn][0m [scikit-learn]\n", + "\u001b[1A\u001b[2KSuccessfully installed joblib-1.5.2 scikit-learn-1.7.2 threadpoolctl-3.6.0\n" + ] + } + ], + "source": [ + "!pip install scikit-learn # 安装依赖" ] }, { @@ -395,7 +419,7 @@ "source": [ "import 
evaluate\n", "import numpy as np\n", - "from mindnlp.engine.utils import EvalPrediction\n", + "from transformers import EvalPrediction\n", "\n", "metric = evaluate.load(\"accuracy\")\n", "\n", @@ -405,59 +429,54 @@ " return metric.compute(predictions=predictions, references=labels)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 模型加载" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import OpenAIGPTForSequenceClassification, TrainingArguments\n", + "\n", + "\n", + "model = OpenAIGPTForSequenceClassification.from_pretrained('openai-gpt', num_labels=2)\n", + "model.config.pad_token_id = gpt_tokenizer.pad_token_id\n", + "model.resize_token_embeddings(model.config.vocab_size + 3)" + ] + }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, "metadata": { "tags": [] }, "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "94d1b9d8276a4040a27030d34c8d44e2", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0.00/457M [00:00 type is zero.\n", - " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", - " return self._float_to_str(self.smallest_subnormal)\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", - " setattr(self, word, getattr(machar, word).flat[0])\n", - "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", - " return self._float_to_str(self.smallest_subnormal)\n", - "Building prefix dict from the default dictionary ...\n", - "Dumping model to file cache 
/tmp/jieba.cache\n", - "Loading model cost 1.022 seconds.\n", - "Prefix dict has been built successfully.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "ef6c147f60104856b73d7926b515b038", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0.00/6.01k [00:00 此为在线运行平台配置python3.9的指南,如在其他环境平台运行案例,请根据实际情况修改如下代码" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "第一步:设置python版本为3.9.0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%capture captured_output\n", - "!/home/ma-user/anaconda3/bin/conda create -n python-3.9.0 python=3.9.0 -y --override-channels --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main\n", - "!/home/ma-user/anaconda3/envs/python-3.9.0/bin/pip install ipykernel" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "import os\n", - "\n", - "data = {\n", - " \"display_name\": \"python-3.9.0\",\n", - " \"env\": {\n", - " \"PATH\": \"/home/ma-user/anaconda3/envs/python-3.9.0/bin:/home/ma-user/anaconda3/envs/python-3.7.10/bin:/modelarts/authoring/notebook-conda/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ma-user/modelarts/ma-cli/bin:/home/ma-user/modelarts/ma-cli/bin\"\n", - " },\n", - " \"language\": \"python\",\n", - " \"argv\": [\n", - " \"/home/ma-user/anaconda3/envs/python-3.9.0/bin/python\",\n", - " \"-m\",\n", - " \"ipykernel\",\n", - " \"-f\",\n", - " \"{connection_file}\"\n", - " ]\n", - "}\n", - "\n", - "if not os.path.exists(\"/home/ma-user/anaconda3/share/jupyter/kernels/python-3.9.0/\"):\n", - " os.mkdir(\"/home/ma-user/anaconda3/share/jupyter/kernels/python-3.9.0/\")\n", - "\n", - "with open('/home/ma-user/anaconda3/share/jupyter/kernels/python-3.9.0/kernel.json', 'w') as f:\n", - " 
json.dump(data, f, indent=4)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 注:以上代码运行完成后,需要重新设置kernel为python-3.9.0" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "第二步:安装MindSpore框架和MindNLP套件" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "mindspore官网提供了不同的mindspore版本,可以根据自己的操作系统和Python版本,安装不同版本的mindspore\n", - "\n", - "\n", - "https://www.mindspore.cn/install" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looking in indexes: https://mirrors.aliyun.com/pypi/simple/\n", - "Collecting mindspore==2.5.0\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/23/22/dff0f1bef6c0846a97271ae5d39ca187914f39562f9e3f6787041dea1a97/mindspore-2.5.0-cp39-cp39-manylinux1_x86_64.whl (958.4 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m958.4/958.4 MB\u001b[0m \u001b[31m9.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:03\u001b[0m\n", - "\u001b[?25hCollecting numpy<2.0.0,>=1.20.0 (from mindspore==2.5.0)\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/54/30/c2a907b9443cf42b90c17ad10c1e8fa801975f01cb9764f3f8eb8aea638b/numpy-1.26.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m18.2/18.2 MB\u001b[0m \u001b[31m16.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", - "\u001b[?25hCollecting protobuf>=3.13.0 (from mindspore==2.5.0)\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/28/50/1925de813499546bc8ab3ae857e3ec84efe7d2f19b34529d0c7c3d02d11d/protobuf-6.30.2-cp39-abi3-manylinux2014_x86_64.whl (316 kB)\n", - "Requirement already satisfied: asttokens>=2.0.4 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore==2.5.0) (3.0.0)\n", - "Collecting pillow>=6.2.0 (from mindspore==2.5.0)\n", - " Downloading 
https://mirrors.aliyun.com/pypi/packages/f6/46/0bd0ca03d9d1164a7fa33d285ef6d1c438e963d0c8770e4c5b3737ef5abe/pillow-11.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.4/4.4 MB\u001b[0m \u001b[31m14.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", - "\u001b[?25hCollecting scipy>=1.5.4 (from mindspore==2.5.0)\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/35/f5/d0ad1a96f80962ba65e2ce1de6a1e59edecd1f0a7b55990ed208848012e0/scipy-1.13.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m38.6/38.6 MB\u001b[0m \u001b[31m16.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", - "\u001b[?25hRequirement already satisfied: packaging>=20.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore==2.5.0) (24.2)\n", - "Requirement already satisfied: psutil>=5.6.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore==2.5.0) (5.9.1)\n", - "Collecting astunparse>=1.6.3 (from mindspore==2.5.0)\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/2b/03/13dde6512ad7b4557eb792fbcf0c653af6076b81e5941d36ec61f7ce6028/astunparse-1.6.3-py2.py3-none-any.whl (12 kB)\n", - "Collecting safetensors>=0.4.0 (from mindspore==2.5.0)\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/a6/f8/dae3421624fcc87a89d42e1898a798bc7ff72c61f38973a65d60df8f124c/safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)\n", - "Collecting dill>=0.3.7 (from mindspore==2.5.0)\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/46/d1/e73b6ad76f0b1fb7f23c35c6d95dbc506a9c8804f43dda8cb5b0fa6331fd/dill-0.3.9-py3-none-any.whl (119 kB)\n", - "Requirement already satisfied: wheel<1.0,>=0.23.0 in 
/home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore==2.5.0) (0.45.1)\n", - "Requirement already satisfied: six<2.0,>=1.6.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore==2.5.0) (1.17.0)\n", - "Installing collected packages: safetensors, protobuf, pillow, numpy, dill, astunparse, scipy, mindspore\n", - "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", - "auto-tune 0.1.0 requires te, which is not installed.\n", - "schedule-search 0.0.1 requires absl-py, which is not installed.\u001b[0m\u001b[31m\n", - "\u001b[0mSuccessfully installed astunparse-1.6.3 dill-0.3.9 mindspore-2.5.0 numpy-1.26.4 pillow-11.1.0 protobuf-6.30.2 safetensors-0.5.3 scipy-1.13.1\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.5.0/MindSpore/unified/x86_64/mindspore-2.5.0-cp39-cp39-linux_x86_64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "注意,这里如果通过pip安装mindnlp的话,需要参考链接对mindnlp/core/nn/modules/module.py进行Files changed所示的修改,确保loss正确下降,链接教程:https://github.com/mindspore-lab/mindnlp/pull/2007/files\n", - "可以通过git clone下载最新的mindnlp,然后将下载的最新的mindnlp版本替换到你环境中安装mindnlp的位置,一般是/home/user/miniconda3/envs/yourenv/lib/python3.9/site-packages" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n", - "Collecting mindnlp==0.4.0\n", - " Downloading 
https://mirrors.aliyun.com/pypi/packages/0f/a8/5a072852d28a51417b5e330b32e6ae5f26b491ef01a15ba968e77f785e69/mindnlp-0.4.0-py3-none-any.whl (8.4 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.4/8.4 MB\u001b[0m \u001b[31m4.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m0m\n", - "\u001b[?25hRequirement already satisfied: mindspore>=2.2.14 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.5.0)\n", - "Requirement already satisfied: tqdm in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (4.67.1)\n", - "Requirement already satisfied: requests in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.32.3)\n", - "Requirement already satisfied: datasets in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (3.5.0)\n", - "Requirement already satisfied: evaluate in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.4.3)\n", - "Requirement already satisfied: tokenizers==0.19.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.19.1)\n", - "Requirement already satisfied: safetensors in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.3)\n", - "Requirement already satisfied: sentencepiece in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.2.0)\n", - "Requirement already satisfied: regex in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (2024.11.6)\n", - "Requirement already satisfied: addict in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: ml-dtypes in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) 
(0.5.1)\n", - "Requirement already satisfied: pyctcdecode in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (0.5.0)\n", - "Collecting jieba (from mindnlp==0.4.0)\n", - " Downloading https://mirrors.aliyun.com/pypi/packages/c6/cb/18eeb235f833b726522d7ebed54f2278ce28ba9438e3135ab0278d9792a2/jieba-0.42.1.tar.gz (19.2 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m19.2/19.2 MB\u001b[0m \u001b[31m15.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", - "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25ldone\n", - "\u001b[?25hRequirement already satisfied: pytest==7.2.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (7.2.0)\n", - "Requirement already satisfied: pillow>=10.0.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindnlp==0.4.0) (11.1.0)\n", - "Requirement already satisfied: attrs>=19.2.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (24.3.0)\n", - "Requirement already satisfied: iniconfig in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.1.0)\n", - "Requirement already satisfied: packaging in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (24.2)\n", - "Requirement already satisfied: pluggy<2.0,>=0.12 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.5.0)\n", - "Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (1.2.2)\n", - "Requirement already satisfied: tomli>=1.0.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pytest==7.2.0->mindnlp==0.4.0) (2.0.1)\n", - "Requirement already satisfied: 
huggingface-hub<1.0,>=0.16.4 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from tokenizers==0.19.1->mindnlp==0.4.0) (0.30.2)\n", - "Requirement already satisfied: numpy<2.0.0,>=1.20.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.26.4)\n", - "Requirement already satisfied: protobuf>=3.13.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (6.30.2)\n", - "Requirement already satisfied: asttokens>=2.0.4 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (3.0.0)\n", - "Requirement already satisfied: scipy>=1.5.4 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.13.1)\n", - "Requirement already satisfied: psutil>=5.6.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (5.9.1)\n", - "Requirement already satisfied: astunparse>=1.6.3 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (1.6.3)\n", - "Requirement already satisfied: dill>=0.3.7 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from mindspore>=2.2.14->mindnlp==0.4.0) (0.3.8)\n", - "Requirement already satisfied: filelock in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (3.18.0)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (19.0.1)\n", - "Requirement already satisfied: pandas in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (2.2.3)\n", - "Requirement already satisfied: xxhash in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) 
(3.5.0)\n", - "Requirement already satisfied: multiprocess<0.70.17 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (0.70.16)\n", - "Requirement already satisfied: fsspec<=2024.12.0,>=2023.1.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets->mindnlp==0.4.0) (2024.12.0)\n", - "Requirement already satisfied: aiohttp in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (3.11.16)\n", - "Requirement already satisfied: pyyaml>=5.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from datasets->mindnlp==0.4.0) (6.0.2)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (3.7)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (2.3.0)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from requests->mindnlp==0.4.0) (2025.1.31)\n", - "Requirement already satisfied: pygtrie<3.0,>=2.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (2.5.0)\n", - "Requirement already satisfied: hypothesis<7,>=6.14 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pyctcdecode->mindnlp==0.4.0) (6.130.13)\n", - "Requirement already satisfied: wheel<1.0,>=0.23.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore>=2.2.14->mindnlp==0.4.0) (0.45.1)\n", - "Requirement already satisfied: six<2.0,>=1.6.1 in 
/home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore>=2.2.14->mindnlp==0.4.0) (1.17.0)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from aiohttp->datasets->mindnlp==0.4.0) (2.6.1)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from aiohttp->datasets->mindnlp==0.4.0) (1.3.2)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from aiohttp->datasets->mindnlp==0.4.0) (5.0.1)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from aiohttp->datasets->mindnlp==0.4.0) (1.5.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from aiohttp->datasets->mindnlp==0.4.0) (6.4.2)\n", - "Requirement already satisfied: propcache>=0.2.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from aiohttp->datasets->mindnlp==0.4.0) (0.3.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from aiohttp->datasets->mindnlp==0.4.0) (1.19.0)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers==0.19.1->mindnlp==0.4.0) (4.13.1)\n", - "Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from hypothesis<7,>=6.14->pyctcdecode->mindnlp==0.4.0) (2.4.0)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2.9.0.post0)\n", - "Requirement 
already satisfied: pytz>=2020.1 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Requirement already satisfied: tzdata>=2022.7 in /home/jiangna1/miniconda3/envs/llama39/lib/python3.9/site-packages (from pandas->datasets->mindnlp==0.4.0) (2025.2)\n", - "Building wheels for collected packages: jieba\n", - " Building wheel for jieba (setup.py) ... \u001b[?25ldone\n", - "\u001b[?25h Created wheel for jieba: filename=jieba-0.42.1-py3-none-any.whl size=19314508 sha256=30064bba508d12a9c2c545bdec7e271f61d5a83e9fdd53298a82e74659e1fd26\n", - " Stored in directory: /home/jiangna1/.cache/pip/wheels/95/ef/7c/d8b3108835edfa15487417c5bddff166482b195d8090117ac5\n", - "Successfully built jieba\n", - "Installing collected packages: jieba, mindnlp\n", - "Successfully installed jieba-0.42.1 mindnlp-0.4.0\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install mindnlp==0.4.0 -i https://mirrors.aliyun.com/pypi/simple\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## LLaMA微调推理全流程" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# LLaMA介绍" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "LLaMA(Large Language Model Meta AI)是由Meta(前Facebook)推出的一系列开源大规模语言模型,意味着大模型应用进入了“免费时代”,初创公司也能够以低廉的价格来创建类似ChatGPT这样的聊天机器人。 LLaMA 专注于自然语言处理任务,其结构基于Transformer架构,但是进行了改进,比如引入更高效的注意力机制和更紧凑的模型设计,例如使用SwiGLU激活函数 使用RoPE位置编码等。\n", - "\n", - "LLaMA的模型参数规模设计更为精细,以更少的参数实现了与更大模型相当的性能。例如,LLaMA-7B模型的性能可以与175B参数规模的GPT-3媲美。模型在推理过程中采用了多查询注意力(Multi-Query Attention, MQA)机制,改进了传统多头注意力的查询方式,将多个注意力头的查询统一为单个查询头,从而显著减少了推理时间和显存需求,提升了效率。适用于文本生成、问答、翻译等任务。例如,在文本生成任务中,LLaMA可以生成高质量的文章段落,其轻量化设计降低了硬件需求,使研究者和开发者更容易使用高性能的语言模型。" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "导入必要的包" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - 
"import json\n", - "import numpy as np\n", - "import mindspore as ms\n", - "import mindspore.dataset as ds" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[WARNING] ME(52617:140460371482432,MainProcess):2025-04-11-09:21:14.572.566 [mindspore/context.py:1335] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.\n" - ] - } - ], - "source": [ - "#将模式设置为动态图模式(PYNATIVE_MODE),并指定设备目标为Ascend芯片\n", - "ms.set_context(mode=ms.PYNATIVE_MODE, device_target=\"Ascend\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "#指定模型路径\n", - "base_model_path = \"/home/jiangna1/.mindnlp/model/hfl/chinese-llama-2-1.3b\" #中文模型,这里我提前下载到了本地减少下载时间\n", - "# base_model_path = \"NousResearch/Hermes-3-Llama-3.2-3B\" #英文模型\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 数据集\n", - "\n", - "这里提供两份用于微调的数据集,分别用于中文模型和英文模型,其中中文模型为hfl/chinese-llama-2-1.3b,参数量为1.3B,英文模型为NousResearch/Hermes-3-Llama-3.2-3B,参数量为3.2B,用户可以根据自己的配置或期望训练时长自行选择。\n", - "\n", - "数据来源皆为hugging face公开用于微调的数据集,其中中文数据集来源为弱智吧,数据格式为:\n", - "\n", - " {\"instruction\": \"只剩一个心脏了还能活吗?\",\n", - "\n", - " \"output\": \"能,人本来就只有一个心脏。\"},\n", - "\n", - " {\"instruction\": \"爸爸再婚,我是不是就有了个新娘?\",\n", - "\n", - " \"output\": \"不是的,你有了一个继母。\\\"新娘\\\"是指新婚的女方,而你爸爸再婚,他的新婚妻子对你来说是继母。\"}\n", - "\n", - "\n", - "英文数据来源为Alpaca,数据格式为:\n", - "\n", - " {\"instruction\": \"Give three tips for staying healthy.\",\n", - "\n", - " \"input\": \"\",\n", - "\n", - " \"output\": \"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. 
Get enough sleep and maintain a consistent sleep schedule.\"},\n", - "\n", - " {\"instruction\": \"What are the three primary colors?\",\n", - "\n", - " \"input\": \"\",\n", - "\n", - " \"output\": \"The three primary colors are red, blue, and yellow.\"}\n", - "\n", - "\n", - "以下教程同时包括包括中文和英文模型的微调教程为例,其中英文模型微调效果更好,但因为时间关系,本模型展示主要以小规模的中文为例,可以自行根据自己的需求修改数据来源和模型。\n", - " " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 数据加载和数据预处理" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "新建 tokenize_function 函数用于数据预处理,具体内容可见下面代码注释。" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "def tokenize_function(example, tokenizer):\n", - " instruction = example.get(\"instruction\", \"\")\n", - " input_text = example.get(\"input\", \"\")\n", - " output = example.get(\"output\", \"\")\n", - " # prompt\n", - " if input_text:\n", - " prompt = f\"User: {instruction} {input_text}\\nAssistant: {output}\"\n", - " else:\n", - " prompt = f\"User: {instruction}\\nAssistant: {output}\"\n", - " \n", - " # Tokenize\n", - " tokenized = tokenizer(prompt, padding=\"max_length\", truncation=True, max_length=512)\n", - " input_ids = np.array(tokenized[\"input_ids\"], dtype=np.int32)\n", - "\n", - " # Handle label\n", - " pad_token_id = tokenizer.pad_token_id\n", - " labels = np.array(\n", - " [-100 if token_id == pad_token_id else token_id for token_id in input_ids], dtype=np.int32\n", - " )\n", - " return input_ids, labels\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "数据来源如下,为了避免网络问题,建议先下载到本地\n", - "\n", - "https://huggingface.co/datasets/LooksJuicy/ruozhiba" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# data_path = \"/home/jiangna1/mindnlp_llama_all/alpaca_data.json\" #英文数据集\n", - "data_path = \"/home/jiangna1/mindnlp_llama_all/chinese_data.json\"" - ] - }, - { 
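The `tokenize_function` cell above copies `input_ids` into `labels` and replaces every pad position with `-100`, the ignore index of the cross-entropy loss, so padding contributes no gradient. One side effect worth noting: since the tokenizer's `pad_token` is set to `eos_token`, a genuine end-of-sequence token is masked out too. A minimal sketch of just the masking rule, in plain NumPy with an illustrative pad id of 2 (no MindSpore needed):

```python
import numpy as np

def mask_pad_labels(input_ids, pad_token_id):
    # Same rule as tokenize_function: keep real token ids as labels,
    # turn pad positions into -100 so the loss ignores them.
    return np.array(
        [-100 if t == pad_token_id else t for t in input_ids], dtype=np.int32
    )

# Toy sequence: BOS + real tokens, then padding (id 2, standing in for eos-as-pad)
ids = np.array([1, 4911, 29901, 30672, 2, 2, 2], dtype=np.int32)
print(mask_pad_labels(ids, pad_token_id=2))  # pads (id 2) become -100
```

The function name and the concrete ids here are illustrative; only the `-100` convention comes from the notebook.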
- "cell_type": "markdown", - "metadata": {}, - "source": [ - "查看数据具体内容,该数据只包括instruction和output两列" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[{'instruction': '只剩一个心脏了还能活吗?', 'output': '能,人本来就只有一个心脏。'}, {'instruction': '爸爸再婚,我是不是就有了个新娘?', 'output': '不是的,你有了一个继母。\"新娘\"是指新婚的女方,而你爸爸再婚,他的新婚妻子对你来说是继母。'}, {'instruction': '樟脑丸是我吃过最难吃的硬糖有奇怪的味道怎么还有人买', 'output': '樟脑丸并不是硬糖,而是一种常见的驱虫药,不能食用。虽然它的味道可能不太好,但是由于其有效的驱虫效果,所以仍然有很多人会购买。'}]\n" - ] - } - ], - "source": [ - "with open(data_path, 'r', encoding='utf-8') as f:\n", - " data = json.load(f)\n", - "print(data[:3])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "从指定路径加载预训练的分词器,该分词器能将输入文本分割成模型可处理的词元。接着,将填充标记设置为结束标记,这样在处理不同长度的文本序列时,用结束标记来填充额外位置,避免引入额外特殊标记,减少模型学习负担。最后,设置填充方向为右侧,使文本在右侧添加填充标记达到统一长度,维持文本原始顺序。" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[1, 29871, 2056]" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from mindnlp.transformers import AutoTokenizer\n", - "\n", - "tokenizer = AutoTokenizer.from_pretrained(base_model_path)\n", - "tokenizer.pad_token = tokenizer.eos_token\n", - "tokenizer.pad_token_id = tokenizer.eos_token_id\n", - "tokenizer.padding_side = \"right\"\n", - "tokenizer.encode(' ;')\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "将数据分为训练集和验证集" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "def data_generator(dataset, tokenizer):\n", - " for item in dataset:\n", - " yield tokenize_function(item, tokenizer)\n", - " \n", - "split_ratio = 0.9\n", - "split_index = int(len(data) * split_ratio)\n", - "train_data, val_data = data[:split_index], data[split_index:]\n", - "\n", - "train_dataset = ds.GeneratorDataset(\n", - " 
source=lambda: data_generator(train_data, tokenizer), \n", - " column_names=[\"input_ids\", \"labels\"]\n", - ")\n", - "\n", - "eval_dataset = ds.GeneratorDataset(\n", - " source=lambda: data_generator(val_data, tokenizer), \n", - " column_names=[\"input_ids\", \"labels\"]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "查看处理后的数据,tokenizer 将输入的文本(prompt)拆分为词片段(tokens),然后将每个词片段映射为对应的 token ID。" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[MS_ALLOC_CONF]Runtime config: enable_vmm:True vmm_align_size:2MB\n", - "Sample 0: Input IDs: [ 1 4911 29901 29871 47133 32002 37755 30743 33302 31704]\n", - "Sample 0: Labels: [ 1 4911 29901 29871 47133 32002 37755 30743 33302 31704]\n", - "\n", - "Sample 1: Input IDs: [ 1 4911 29901 29871 33594 31733 33364 30214 30672 32308]\n", - "Sample 1: Labels: [ 1 4911 29901 29871 33594 31733 33364 30214 30672 32308]\n", - "\n", - "Sample 2: Input IDs: [ 1 4911 29901 29871 47019 33027 31818 34030 39950 44345]\n", - "Sample 2: Labels: [ 1 4911 29901 29871 47019 33027 31818 34030 39950 44345]\n", - "\n", - "Sample 3: Input IDs: [ 1 4911 29901 29871 34214 30698 30429 36310 32658 30743]\n", - "Sample 3: Labels: [ 1 4911 29901 29871 34214 30698 30429 36310 32658 30743]\n", - "\n", - "Sample 4: Input IDs: [ 1 4911 29901 32581 34822 31639 2882 6530 30883 30210]\n", - "Sample 4: Labels: [ 1 4911 29901 32581 34822 31639 2882 6530 30883 30210]\n", - "\n" - ] - } - ], - "source": [ - "for i, sample in enumerate(train_dataset.create_dict_iterator()):\n", - " if i >= 5:\n", - " break\n", - " print(f\"Sample {i}: Input IDs: {sample['input_ids'][:10]}\") \n", - " print(f\"Sample {i}: Labels: {sample['labels'][:10]}\\n\") " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## lora指令微调" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "指定微调结果输出路径" - ] - }, - { 
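Note the `source=lambda: data_generator(...)` wrapping in the cell above: a Python generator can be consumed only once, while `GeneratorDataset` needs to re-read its source on every epoch, so the notebook passes a zero-argument callable that builds a fresh generator each time. The difference, sketched without MindSpore (the simplified generator here drops the tokenizer argument):

```python
def data_generator(dataset):
    # Simplified stand-in for the notebook's generator (no tokenization).
    for item in dataset:
        yield item

gen = data_generator([1, 2, 3])
first, second = list(gen), list(gen)        # a bare generator is exhausted after one pass
factory = lambda: data_generator([1, 2, 3])
again = (list(factory()), list(factory()))  # the callable produces a fresh pass each call
print(first, second, again)
```

Passing the exhausted generator directly would make the second epoch see an empty dataset, which is exactly what the lambda avoids.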
- "cell_type": "code", - "execution_count": 12, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# 指定输出路径\n", - "peft_output_dir = \"/home/jiangna1/mindnlp_llama_all/pert_model_Chinese\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "加载基座模型\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "LlamaForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`.`PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.\n", - " - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).\n", - " - If you are not the owner of the model architecture class, please contact the model code owner to update it.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/jiangna1/miniconda3/envs/ms39/lib/python3.9/site-packages/mindnlp/transformers/generation/configuration_utils.py:557: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.\n", - " warnings.warn(\n", - "/home/jiangna1/miniconda3/envs/ms39/lib/python3.9/site-packages/mindnlp/transformers/generation/configuration_utils.py:562: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. 
This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.\n", - " warnings.warn(\n", - "/home/jiangna1/miniconda3/envs/ms39/lib/python3.9/site-packages/mindnlp/transformers/generation/configuration_utils.py:557: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n", - " warnings.warn(\n", - "/home/jiangna1/miniconda3/envs/ms39/lib/python3.9/site-packages/mindnlp/transformers/generation/configuration_utils.py:562: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.\n", - " warnings.warn(\n" - ] - } - ], - "source": [ - "from mindnlp.transformers import AutoModelForCausalLM, GenerationConfig\n", - "\n", - "ms_base_model = AutoModelForCausalLM.from_pretrained(base_model_path, ms_dtype=ms.float16)\n", - "ms_base_model.generation_config = GenerationConfig.from_pretrained(base_model_path)\n", - "ms_base_model.generation_config.pad_token_id = ms_base_model.generation_config.eos_token_id\n" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "#修改精度,会使训练变慢,但是训练loss下降效果会变好\n", - "for name, param in ms_base_model.parameters_and_names():\n", - " param.set_dtype(ms.float32) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "这部分代码的主要作用是创建一个 LoRA的配置对象 ms_config,大语言模型进行微调时,LoRA 是一种高效的参数微调方法,通过在预训练模型的基础上添加低秩矩阵来减少需要训练的参数数量,从而提高微调效率。" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from mindnlp.peft import LoraConfig, TaskType, get_peft_model, PeftModel\n", - "\n", - "ms_config = LoraConfig(\n", - " task_type=TaskType.CAUSAL_LM,#微调任务的类型\n", - 
" #指定需要应用 LoRA 调整的目标模块\n", - " # target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n", - " target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\"],\n", - " inference_mode=False, \n", - " r=8, \n", - " lora_alpha=32, \n", - " lora_dropout=0.1 \n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "基于给定的基础模型和 LoRA 配置创建一个可进行参数高效微调的模型。" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "ms_model = get_peft_model(ms_base_model, ms_config)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "训练参数的设置" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [], - "source": [ - "from mindnlp.engine import TrainingArguments, Trainer\n", - "\n", - "num_train_epochs = 70\n", - "fp16 = True\n", - "overwrite_output_dir = True\n", - "per_device_train_batch_size = 16\n", - "per_device_eval_batch_size = 32\n", - "gradient_accumulation_steps = 16\n", - "gradient_checkpointing = True\n", - "evaluation_strategy = \"steps\"\n", - "learning_rate = 1e-5\n", - "lr_scheduler_type = \"cosine\" \n", - "weight_decay = 0.01\n", - "warmup_ratio = 0.1\n", - "max_grad_norm = 0.3\n", - "group_by_length = False \n", - "auto_find_batch_size = False\n", - "save_steps = 50 \n", - "logging_strategy = \"steps\"\n", - "logging_steps = 150 \n", - "load_best_model_at_end = True \n", - "packing = False\n", - "save_total_limit = 3\n", - "neftune_noise_alpha = 5 \n", - "eval_steps = 10\n", - "\n", - "training_arguments = TrainingArguments(\n", - " output_dir=peft_output_dir,\n", - " overwrite_output_dir=overwrite_output_dir,\n", - " num_train_epochs=num_train_epochs,\n", - " load_best_model_at_end=load_best_model_at_end,\n", - " per_device_train_batch_size=per_device_train_batch_size,\n", - " per_device_eval_batch_size=per_device_eval_batch_size,\n", - " 
evaluation_strategy=evaluation_strategy,\n", - " eval_steps=eval_steps,\n", - " max_grad_norm=max_grad_norm,\n", - " auto_find_batch_size=auto_find_batch_size,\n", - " save_total_limit=save_total_limit,\n", - " gradient_accumulation_steps=gradient_accumulation_steps,\n", - " save_steps=save_steps,\n", - " logging_strategy=logging_strategy,\n", - " logging_steps=logging_steps,\n", - " learning_rate=learning_rate,\n", - " weight_decay=weight_decay,\n", - " fp16=fp16,\n", - " warmup_ratio=warmup_ratio,\n", - " group_by_length=group_by_length,\n", - " lr_scheduler_type=lr_scheduler_type\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "初始化训练器" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "trainer = Trainer(\n", - " model=ms_model,\n", - " train_dataset=train_dataset,\n", - " eval_dataset=eval_dataset,\n", - " args=training_arguments\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "开始训练" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 0%| | 1/350 [00:19<1:51:45, 19.21s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "." - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 3%|▎ | 10/350 [02:02<1:04:58, 11.47s/it]We detected that you are passing `past_key_values` as a tuple and this is deprecated. 
Please use an appropriate `Cache` class\n", - " \n", - " 3%|▎ | 10/350 [02:05<1:04:58, 11.47s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.853940486907959, 'eval_runtime': 2.8532, 'eval_samples_per_second': 1.752, 'eval_steps_per_second': 0.35, 'epoch': 1.88}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 6%|▌ | 20/350 [04:00<1:02:29, 11.36s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.840895891189575, 'eval_runtime': 2.3863, 'eval_samples_per_second': 2.095, 'eval_steps_per_second': 0.419, 'epoch': 3.76}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 9%|▊ | 30/350 [05:54<1:00:27, 11.33s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.8152430057525635, 'eval_runtime': 2.3786, 'eval_samples_per_second': 2.102, 'eval_steps_per_second': 0.42, 'epoch': 5.65}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 11%|█▏ | 40/350 [07:49<58:26, 11.31s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.772057294845581, 'eval_runtime': 2.3891, 'eval_samples_per_second': 2.093, 'eval_steps_per_second': 0.419, 'epoch': 7.53}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 14%|█▍ | 50/350 [09:44<56:29, 11.30s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.7203030586242676, 'eval_runtime': 2.3927, 'eval_samples_per_second': 2.09, 'eval_steps_per_second': 0.418, 'epoch': 9.41}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 17%|█▋ | 60/350 [11:48<54:53, 11.36s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.6663565635681152, 'eval_runtime': 2.3972, 'eval_samples_per_second': 2.086, 
'eval_steps_per_second': 0.417, 'epoch': 11.29}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 20%|██ | 70/350 [13:43<52:18, 11.21s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.6159634590148926, 'eval_runtime': 2.3981, 'eval_samples_per_second': 2.085, 'eval_steps_per_second': 0.417, 'epoch': 13.18}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 23%|██▎ | 80/350 [15:38<50:19, 11.18s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.565094470977783, 'eval_runtime': 2.3789, 'eval_samples_per_second': 2.102, 'eval_steps_per_second': 0.42, 'epoch': 15.06}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 26%|██▌ | 90/350 [17:33<49:11, 11.35s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.5170516967773438, 'eval_runtime': 2.3956, 'eval_samples_per_second': 2.087, 'eval_steps_per_second': 0.417, 'epoch': 16.94}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 29%|██▊ | 100/350 [19:28<47:17, 11.35s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.4649040699005127, 'eval_runtime': 2.3707, 'eval_samples_per_second': 2.109, 'eval_steps_per_second': 0.422, 'epoch': 18.82}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 31%|███▏ | 110/350 [21:30<45:42, 11.43s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.407757520675659, 'eval_runtime': 2.3942, 'eval_samples_per_second': 2.088, 'eval_steps_per_second': 0.418, 'epoch': 20.71}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 34%|███▍ | 120/350 [23:25<43:28, 11.34s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 
3.343446969985962, 'eval_runtime': 2.3932, 'eval_samples_per_second': 2.089, 'eval_steps_per_second': 0.418, 'epoch': 22.59}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 37%|███▋ | 130/350 [25:20<41:29, 11.31s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.267294406890869, 'eval_runtime': 2.3965, 'eval_samples_per_second': 2.086, 'eval_steps_per_second': 0.417, 'epoch': 24.47}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 40%|████ | 140/350 [27:15<39:26, 11.27s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.219864845275879, 'eval_runtime': 2.3872, 'eval_samples_per_second': 2.094, 'eval_steps_per_second': 0.419, 'epoch': 26.35}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 43%|████▎ | 150/350 [29:08<37:32, 11.26s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'loss': 3.5504, 'learning_rate': 7.056435515653059e-06, 'epoch': 28.24}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "." 
- ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 43%|████▎ | 150/350 [29:10<37:32, 11.26s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.1902430057525635, 'eval_runtime': 2.5486, 'eval_samples_per_second': 1.962, 'eval_steps_per_second': 0.392, 'epoch': 28.24}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 46%|████▌ | 160/350 [31:12<35:42, 11.28s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.161935329437256, 'eval_runtime': 2.391, 'eval_samples_per_second': 2.091, 'eval_steps_per_second': 0.418, 'epoch': 30.12}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 49%|████▊ | 170/350 [33:07<33:32, 11.18s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.138967752456665, 'eval_runtime': 2.3927, 'eval_samples_per_second': 2.09, 'eval_steps_per_second': 0.418, 'epoch': 32.0}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 51%|█████▏ | 180/350 [35:03<32:11, 11.36s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.122406482696533, 'eval_runtime': 2.3994, 'eval_samples_per_second': 2.084, 'eval_steps_per_second': 0.417, 'epoch': 33.88}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 54%|█████▍ | 190/350 [36:58<30:17, 11.36s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.106480836868286, 'eval_runtime': 2.3919, 'eval_samples_per_second': 2.09, 'eval_steps_per_second': 0.418, 'epoch': 35.76}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 57%|█████▋ | 200/350 [38:53<28:21, 11.35s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0940558910369873, 'eval_runtime': 
2.397, 'eval_samples_per_second': 2.086, 'eval_steps_per_second': 0.417, 'epoch': 37.65}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 60%|██████ | 210/350 [40:56<26:40, 11.43s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.082423686981201, 'eval_runtime': 2.3797, 'eval_samples_per_second': 2.101, 'eval_steps_per_second': 0.42, 'epoch': 39.53}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 63%|██████▎ | 220/350 [42:51<24:30, 11.31s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0746374130249023, 'eval_runtime': 2.405, 'eval_samples_per_second': 2.079, 'eval_steps_per_second': 0.416, 'epoch': 41.41}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 66%|██████▌ | 230/350 [44:46<22:31, 11.26s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0679070949554443, 'eval_runtime': 2.3952, 'eval_samples_per_second': 2.088, 'eval_steps_per_second': 0.418, 'epoch': 43.29}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 69%|██████▊ | 240/350 [46:41<20:33, 11.21s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.060976266860962, 'eval_runtime': 2.3947, 'eval_samples_per_second': 2.088, 'eval_steps_per_second': 0.418, 'epoch': 45.18}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 71%|███████▏ | 250/350 [48:36<18:39, 11.19s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0542078018188477, 'eval_runtime': 2.3973, 'eval_samples_per_second': 2.086, 'eval_steps_per_second': 0.417, 'epoch': 47.06}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 74%|███████▍ | 260/350 [50:39<17:09, 11.44s/it]" - ] - }, - { - "name": 
"stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.049464702606201, 'eval_runtime': 2.3926, 'eval_samples_per_second': 2.09, 'eval_steps_per_second': 0.418, 'epoch': 48.94}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 77%|███████▋ | 270/350 [52:34<15:09, 11.37s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.047043561935425, 'eval_runtime': 2.3877, 'eval_samples_per_second': 2.094, 'eval_steps_per_second': 0.419, 'epoch': 50.82}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 80%|████████ | 280/350 [54:29<13:14, 11.34s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0444722175598145, 'eval_runtime': 2.3917, 'eval_samples_per_second': 2.091, 'eval_steps_per_second': 0.418, 'epoch': 52.71}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 83%|████████▎ | 290/350 [56:24<11:20, 11.34s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0422093868255615, 'eval_runtime': 2.3938, 'eval_samples_per_second': 2.089, 'eval_steps_per_second': 0.418, 'epoch': 54.59}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 86%|████████▌ | 300/350 [58:17<09:25, 11.31s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'loss': 3.0383, 'learning_rate': 6.088921331488568e-07, 'epoch': 56.47}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 86%|████████▌ | 300/350 [58:19<09:25, 11.31s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0407938957214355, 'eval_runtime': 2.3837, 'eval_samples_per_second': 2.098, 'eval_steps_per_second': 0.42, 'epoch': 56.47}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 89%|████████▊ | 310/350 
[1:00:22<07:34, 11.37s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0404062271118164, 'eval_runtime': 2.3948, 'eval_samples_per_second': 2.088, 'eval_steps_per_second': 0.418, 'epoch': 58.35}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 91%|█████████▏| 320/350 [1:02:17<05:38, 11.27s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0399832725524902, 'eval_runtime': 2.3929, 'eval_samples_per_second': 2.089, 'eval_steps_per_second': 0.418, 'epoch': 60.24}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 94%|█████████▍| 330/350 [1:04:12<03:43, 11.20s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.039731740951538, 'eval_runtime': 2.3855, 'eval_samples_per_second': 2.096, 'eval_steps_per_second': 0.419, 'epoch': 62.12}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - " 97%|█████████▋| 340/350 [1:06:07<01:51, 11.19s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.0396060943603516, 'eval_runtime': 2.3904, 'eval_samples_per_second': 2.092, 'eval_steps_per_second': 0.418, 'epoch': 64.0}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \n", - "100%|██████████| 350/350 [1:08:03<00:00, 11.35s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'eval_loss': 3.039623260498047, 'eval_runtime': 2.3877, 'eval_samples_per_second': 2.094, 'eval_steps_per_second': 0.419, 'epoch': 65.88}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "The intermediate checkpoints of PEFT may not be saved correctly, consider using a custom callback to save adapter_model.bin in corresponding saving folders. 
Check some examples here: https://github.com/huggingface/peft/issues/96\n", - "100%|██████████| 350/350 [1:08:12<00:00, 11.69s/it]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'train_runtime': 4092.112, 'train_samples_per_second': 23.264, 'train_steps_per_second': 0.086, 'train_loss': 3.2515819876534597, 'epoch': 65.88}\n" - ] - } - ], - "source": [ - "trainer.train()\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for name, param in trainer.model.parameters_and_names():\n", - " param.set_dtype(ms.float16)\n", - "trainer.model.save_pretrained(peft_output_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "最后存储还是存储为 16 位浮点数,保存训练后的 LoRA 模型保存到指定的输出目录" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 推理" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "注意,lora微调后保存的并不是完整的参数,在推理时,需要将保存的 LoRA 参数加载到原预训练模型中,合并后得到完整的模型,然后进行推理。" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "使用PeftModel进行配置和参数合并,最后将模型设置为评估模式,进行推理任务。" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "model merge succeeded\n" - ] - }, - { - "data": { - "text/plain": [ - "LlamaForCausalLM(\n", - " (model): LlamaModel(\n", - " (embed_tokens): Embedding(55296, 4096, padding_idx=0)\n", - " (layers): ModuleList(\n", - " (0-3): 4 x LlamaDecoderLayer(\n", - " (self_attn): LlamaAttention(\n", - " (q_proj): lora.Linear(\n", - " (base_layer): Linear (4096 -> 4096)\n", - " (lora_dropout): ModuleDict(\n", - " (default): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (lora_A): ModuleDict(\n", - " (default): Linear (4096 -> 8)\n", - " )\n", - " (lora_B): ModuleDict(\n", - " (default): Linear (8 -> 4096)\n", - " )\n", - " (lora_embedding_A): ParameterDict()\n", - " (lora_embedding_B): 
ParameterDict()\n", - " (lora_magnitude_vector): ModuleDict()\n", - " )\n", - " (k_proj): lora.Linear(\n", - " (base_layer): Linear (4096 -> 4096)\n", - " (lora_dropout): ModuleDict(\n", - " (default): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (lora_A): ModuleDict(\n", - " (default): Linear (4096 -> 8)\n", - " )\n", - " (lora_B): ModuleDict(\n", - " (default): Linear (8 -> 4096)\n", - " )\n", - " (lora_embedding_A): ParameterDict()\n", - " (lora_embedding_B): ParameterDict()\n", - " (lora_magnitude_vector): ModuleDict()\n", - " )\n", - " (v_proj): lora.Linear(\n", - " (base_layer): Linear (4096 -> 4096)\n", - " (lora_dropout): ModuleDict(\n", - " (default): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (lora_A): ModuleDict(\n", - " (default): Linear (4096 -> 8)\n", - " )\n", - " (lora_B): ModuleDict(\n", - " (default): Linear (8 -> 4096)\n", - " )\n", - " (lora_embedding_A): ParameterDict()\n", - " (lora_embedding_B): ParameterDict()\n", - " (lora_magnitude_vector): ModuleDict()\n", - " )\n", - " (o_proj): lora.Linear(\n", - " (base_layer): Linear (4096 -> 4096)\n", - " (lora_dropout): ModuleDict(\n", - " (default): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (lora_A): ModuleDict(\n", - " (default): Linear (4096 -> 8)\n", - " )\n", - " (lora_B): ModuleDict(\n", - " (default): Linear (8 -> 4096)\n", - " )\n", - " (lora_embedding_A): ParameterDict()\n", - " (lora_embedding_B): ParameterDict()\n", - " (lora_magnitude_vector): ModuleDict()\n", - " )\n", - " (rotary_emb): LlamaRotaryEmbedding()\n", - " )\n", - " (mlp): LlamaMLP(\n", - " (gate_proj): Linear (4096 -> 11008)\n", - " (up_proj): Linear (4096 -> 11008)\n", - " (down_proj): Linear (11008 -> 4096)\n", - " (act_fn): SiLU()\n", - " )\n", - " (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)\n", - " (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)\n", - " )\n", - " )\n", - " (norm): LlamaRMSNorm((4096,), eps=1e-05)\n", - " (rotary_emb): LlamaRotaryEmbedding()\n", - " )\n", - " (lm_head): 
Linear (4096 -> 55296)\n", - ")" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#将 LoRA微调后的参数加载到预训练模型中\n", - "from mindnlp.peft import PeftModel\n", - "model = PeftModel.from_pretrained(ms_base_model, peft_output_dir)\n", - "model = model.merge_and_unload()\n", - "print('model merge succeeded')\n", - "model.eval()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "定义一个函数,用于根据用户输入的问题生成相应的回答" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [], - "source": [ - "import mindspore as ms\n", - "\n", - "def generate_response(question, model, tokenizer, max_length=256):\n", - " prompt = f\"以下是用户和助手之间的问答。\\n问:{question}\\n答:\"\n", - " inputs = tokenizer(prompt, return_tensors=\"ms\", padding=True, truncation=True, max_length=512)\n", - " output_ids = model.generate(\n", - " **inputs,\n", - " do_sample=False,\n", - " # temperature=0.7,\n", - " # top_p=0.9,\n", - " repetition_penalty=1.2,\n", - " no_repeat_ngram_size=3,\n", - " max_length=max_length,\n", - " eos_token_id=tokenizer.eos_token_id\n", - " )\n", - "\n", - " response = tokenizer.decode(output_ids[0], skip_special_tokens=True)\n", - " response = response.split(\"Answer:\")[-1].strip()\n", - " return response\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "一个实例" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/jiangna1/miniconda3/envs/ms39/lib/python3.9/site-packages/mindnlp/transformers/generation/configuration_utils.py:557: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. 
You should set `do_sample=True` or unset `temperature`.\n", - " warnings.warn(\n", - "/home/jiangna1/miniconda3/envs/ms39/lib/python3.9/site-packages/mindnlp/transformers/generation/configuration_utils.py:562: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.\n", - " warnings.warn(\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "User: 如何保持清醒?\n", - "LLAMA: 以下是用户和助手之间的问答。\n", - "问:如何保持清醒?\n", - "答:在你睡觉的时候,你的大脑会一直处于兴奋状态中;当你醒来时,它就会继续工作了。所以如果你的睡眠时间很短的话,你就不会感到太疲劳或昏沉。你可以通过使用一些药物来帮助恢复精力、提高警觉度以及降低血压等方法使自己进入深度睡眠的状态。此外,你还可以通过服用维生素B6片剂或者吃富含蛋白质的食物等方式让自己重新振作起来。\n" - ] - } - ], - "source": [ - "question = \"如何保持清醒?\"\n", - "response = generate_response(question, model, tokenizer)\n", - "\n", - "print(f\"User: {question}\")\n", - "print(f\"LLAMA: {response}\")" - ] - } - ], - "metadata": { - "AIGalleryInfo": { - "item_id": "5443b528-0dd5-4909-ac4f-1c9cf839e2aa" - }, - "flavorInfo": { - "architecture": "X86_64", - "category": "GPU" - }, - "imageInfo": { - "id": "e1a07296-22a8-4f05-8bc8-e936c8e54202", - "name": "mindspore1.7.0-cuda10.1-py3.7-ubuntu18.04" - }, - "kernelspec": { - "display_name": "ms39", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.11" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/01.LLM_Theory_Course/01.Industry_Model_Introduction/01.Classic_Model_Technical_Analysis/07.LLaMA2/llama_inference_debug.py b/01.LLM_Theory_Course/01.Industry_Model_Introduction/01.Classic_Model_Technical_Analysis/07.LLaMA2/llama_inference_debug.py deleted file mode 100644 index 4b7d81a..0000000 --- 
a/01.LLM_Theory_Course/01.Industry_Model_Introduction/01.Classic_Model_Technical_Analysis/07.LLaMA2/llama_inference_debug.py +++ /dev/null @@ -1,42 +0,0 @@ -import sys -import os -# Add current directory to Python path for local module imports -sys.path.insert(0, os.path.abspath(".")) -from ms_mindnlp.transformers.models.llama.modeling_llama import LlamaModel -from ms_mindnlp.transformers.models.llama.configuration_llama import LlamaConfig -import mindspore as ms -from mindspore import dtype, ops -import debugpy - -debugpy.listen(("0.0.0.0", 5678)) -print("Waiting for debugger to attach...") - -debugpy.wait_for_client() -print("Debugger is attached.") - -# import inspect -# llama_config_file_path = inspect.getfile(LlamaConfig) -# print(f"{llama_config_file_path}") - -ms.set_context(mode=ms.PYNATIVE_MODE) - -def run(): - """Main execution function for LLaMA model inference demo""" - config = LlamaConfig( - vocab_size=32000, # Tokenizer vocabulary size - hidden_size=4096, # Hidden layer dimension - intermediate_size=11008, # FFN layer inner dimension - num_hidden_layers=2, # Number of transformer blocks - num_attention_heads=32, # Parallel attention heads - num_key_value_heads=2, # KV heads for grouped-query attention - max_position_embeddings=2048, # Maximum sequence length - ) - model = LlamaModel(config=config) - # Generate random input tensor: (batch_size=2, seq_length=16) - input_ids = ops.randint(0, config.vocab_size, (2, 16), dtype=dtype.int32) - output = model(input_ids=input_ids) - print("inference") - print(output) - -if __name__ == "__main__": - run() \ No newline at end of file diff --git a/01.LLM_Theory_Course/02.Technical_Topic_Introduction/01.Prompt_Tuning/roberta_sequence_classification.ipynb b/01.LLM_Theory_Course/02.Technical_Topic_Introduction/01.Prompt_Tuning/roberta_sequence_classification.ipynb index 75c4c9e..5ccd5c5 100644 --- a/01.LLM_Theory_Course/02.Technical_Topic_Introduction/01.Prompt_Tuning/roberta_sequence_classification.ipynb +++ 
b/01.LLM_Theory_Course/02.Technical_Topic_Introduction/01.Prompt_Tuning/roberta_sequence_classification.ipynb @@ -5,74 +5,51 @@ "id": "7a2ac91c", "metadata": {}, "source": [ - "# 基于MindNLP的Roberta模型Prompt Tuning" + "# 基于MindNLP的RoBERTa模型Prompt Tuning\n", + "\n", + "## 案例介绍\n", + "\n", + "本案例对roberta-large模型基于GLUE基准数据集进行prompt tuning。\n", + "\n", + "## 模型介绍\n", + "\n", + "RoBERTa 的全称是 Robustly optimized BERT approach,可以理解为“经过更精细优化的 BERT 模型”。它由 Facebook AI(现 Meta AI)在 2019 年发布,是对 Google 的 BERT 模型的一次重大改进和重新设计。\n", + "\n", + "其核心思想是:BERT 的原始设计很好,但训练不充分、配置可以优化。通过一系列改进,RoBERTa 在多个自然语言理解基准测试上超越了 BERT,成为了当时最强大的预训练模型之一。" ] }, { "cell_type": "markdown", - "id": "324424c6", + "id": "de06584a", "metadata": {}, "source": [ - "安装mindspore, mindnlp及其他依赖" + "## 环境配置\n", + "\n", + "本案例的运行环境为:\n", + "\n", + "| Python | MindSpore | MindSpore NLP |\n", + "| :----- | :-------- | :------------ |\n", + "| 3.10 | 2.7.0 | 0.5.1 |\n", + "\n", + "如果你在如[昇思大模型平台](https://xihe.mindspore.cn/training-projects)、[华为云ModelArts](https://www.huaweicloud.com/product/modelarts.html)、[启智社区](https://openi.pcl.ac.cn/)等算力平台的Jupyter在线编程环境中运行本案例,可取消如下代码的注释,进行依赖库安装:" ] }, { "cell_type": "code", - "execution_count": 1, - "id": "cd3f2df1-da30-4009-8b33-80df52be80c7", + "execution_count": null, + "id": "7e3693f7", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple\n", - "Collecting mindspore==2.4.1\n", - " Downloading https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.4.1/MindSpore/unified/aarch64/mindspore-2.4.1-cp39-cp39-linux_aarch64.whl (335.5 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m335.5/335.5 MB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", - "\u001b[?25hRequirement already satisfied: numpy<2.0.0,>=1.20.0 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) 
(1.26.1)\n", - "Requirement already satisfied: protobuf>=3.13.0 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) (3.20.3)\n", - "Requirement already satisfied: asttokens>=2.0.4 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) (2.4.1)\n", - "Requirement already satisfied: pillow>=6.2.0 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) (9.0.1)\n", - "Requirement already satisfied: scipy>=1.5.4 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) (1.11.3)\n", - "Requirement already satisfied: packaging>=20.0 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) (23.2)\n", - "Requirement already satisfied: psutil>=5.6.1 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) (5.9.5)\n", - "Requirement already satisfied: astunparse>=1.6.3 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from mindspore==2.4.1) (1.6.3)\n", - "Collecting safetensors>=0.4.0 (from mindspore==2.4.1)\n", - " Downloading https://pypi.tuna.tsinghua.edu.cn/packages/08/94/7760694760f1e5001bd62c93155b8b7ccb652d1f4d0161d1e72b5bf9581a/safetensors-0.4.5-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (442 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m442.4/442.4 kB\u001b[0m \u001b[31m39.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hRequirement already satisfied: six>=1.12.0 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from asttokens>=2.0.4->mindspore==2.4.1) (1.16.0)\n", - "Requirement already satisfied: wheel<1.0,>=0.23.0 in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages (from astunparse>=1.6.3->mindspore==2.4.1) (0.41.2)\n", - "\u001b[33mDEPRECATION: moxing-framework 2.1.16.2ae09d45 has a non-standard version 
number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of moxing-framework or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063\u001b[0m\u001b[33m\n", - "\u001b[0mInstalling collected packages: safetensors, mindspore\n", - " Attempting uninstall: mindspore\n", - " Found existing installation: mindspore 2.3.0\n", - " Uninstalling mindspore-2.3.0:\n", - " Successfully uninstalled mindspore-2.3.0\n", - "Successfully installed mindspore-2.4.1 safetensors-0.4.5\n" - ] - } - ], + "outputs": [], "source": [ - "!pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.4.1/MindSpore/unified/aarch64/mindspore-2.4.1-cp39-cp39-linux_aarch64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple" + "# !pip install mindspore==2.7.0 mindnlp==0.5.1" ] }, { - "cell_type": "code", - "execution_count": 14, - "id": "d8b0ba09", + "cell_type": "markdown", + "id": "82f37bbc", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "env: HF_ENDPOINT=https://hf-mirror.com\n" - ] - } - ], "source": [ - "%env HF_ENDPOINT=https://hf-mirror.com" + "其他场景可参考[MindSpore安装指南](https://www.mindspore.cn/install)与[MindSpore NLP安装指南](https://github.com/mindspore-lab/mindnlp?tab=readme-ov-file#installation)进行环境搭建。" ] }, { @@ -80,30 +57,49 @@ "id": "5b0e977f", "metadata": {}, "source": [ - "## 模型与数据集加载\n", + "## 数据加载与预处理\n", "\n", - "本案例对roberta-large模型基于GLUE基准数据集进行prompt tuning。" + "### 数据集加载" ] }, { "cell_type": "code", - "execution_count": 15, - "id": "ef577ba3", + "execution_count": 2, + "id": "01011a29-c30f-49f0-9daf-86d1150d1115", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: 
The value of the smallest subnormal for type is zero.\n", + " setattr(self, word, getattr(machar, word).flat[0])\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + " return self._float_to_str(self.smallest_subnormal)\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.\n", + " setattr(self, word, getattr(machar, word).flat[0])\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.\n", + " return self._float_to_str(self.smallest_subnormal)\n", + "/usr/local/python3.10.14/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "Modular Diffusers is currently an experimental feature under active development. 
The API is subject to breaking changes in future releases.\n" + ] + } + ], "source": [ "import argparse\n", "import os\n", "\n", + "import mindnlp\n", "import mindspore\n", - "from mindnlp.core.optim import AdamW\n", "from tqdm import tqdm\n", "import evaluate\n", - "from mindnlp.dataset import load_dataset\n", - "from mindnlp.engine import set_seed\n", - "from mindnlp.transformers import AutoModelForSequenceClassification, AutoTokenizer\n", - "from mindnlp.common.optimization import get_linear_schedule_with_warmup\n", - "from mindnlp.peft import (\n", + "from datasets import load_dataset\n", + "\n", + "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n", + "from transformers import get_linear_schedule_with_warmup\n", + "from peft import (\n", " get_peft_config,\n", " get_peft_model,\n", " get_peft_model_state_dict,\n", @@ -115,17 +111,34 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, + "id": "5c1b6b37-9f55-42fb-b076-715dc1c72f8a", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[WARNING] ME(67332:281473403209280,MainProcess):2025-11-25-17:06:03.983.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'pynative_synchronize' will be deprecated and removed in a future version. 
Please use the api mindspore.runtime.launch_blocking() instead.\n" + ] + } + ], + "source": [ + "mindspore.set_context(pynative_synchronize=True) #开启同步,方便定位" + ] + }, + { + "cell_type": "code", + "execution_count": null, "id": "af061f0b", "metadata": {}, "outputs": [], "source": [ "batch_size = 32\n", - "model_name_or_path = \"AI-ModelScope/roberta-large\"\n", + "model_name_or_path = \"FacebookAI/roberta-large\"\n", "task = \"mrpc\"\n", "peft_type = PeftType.PROMPT_TUNING\n", - "# num_epochs = 20\n", - "num_epochs = 5" + "num_epochs = 1" ] }, { @@ -138,7 +151,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 5, "id": "4e9663be", "metadata": {}, "outputs": [], @@ -159,19 +172,10 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "id": "871ebbae", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/transformers/tokenization_utils_base.py:1526: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted, and will be then set to `False` by default. 
\n", - " warnings.warn(\n" - ] - } - ], + "outputs": [], "source": [ "# load tokenizer\n", "if any(k in model_name_or_path for k in (\"gpt\", \"opt\", \"bloom\")):\n", @@ -179,14 +183,35 @@ "else:\n", " padding_side = \"right\"\n", "\n", - "tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side, mirror=\"modelscope\")\n", + "tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side)\n", "if getattr(tokenizer, \"pad_token_id\") is None:\n", " tokenizer.pad_token_id = tokenizer.eos_token_id" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 7, + "id": "f969c17b-48a9-45d6-90c3-852b105e9f86", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('right', 1)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tokenizer.padding_side, tokenizer.pad_token_id" + ] + }, + { + "cell_type": "code", + "execution_count": 8, "id": "79ef5257", "metadata": {}, "outputs": [ @@ -194,101 +219,230 @@ "name": "stdout", "output_type": "stream", "text": [ - "{'sentence1': Tensor(shape=[], dtype=String, value= 'Amrozi accused his brother , whom he called \" the witness \" , of deliberately distorting his evidence .'), 'sentence2': Tensor(shape=[], dtype=String, value= 'Referring to him as only \" the witness \" , Amrozi accused his brother of deliberately distorting his evidence .'), 'label': Tensor(shape=[], dtype=Int64, value= 1), 'idx': Tensor(shape=[], dtype=Int64, value= 0)}\n" + "DatasetDict({\n", + " train: Dataset({\n", + " features: ['sentence1', 'sentence2', 'label', 'idx'],\n", + " num_rows: 3668\n", + " })\n", + " validation: Dataset({\n", + " features: ['sentence1', 'sentence2', 'label', 'idx'],\n", + " num_rows: 408\n", + " })\n", + " test: Dataset({\n", + " features: ['sentence1', 'sentence2', 'label', 'idx'],\n", + " num_rows: 1725\n", + " })\n", + "})\n", + "{'sentence1': 'Amrozi accused his brother , whom he 
called \" the witness \" , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only \" the witness \" , Amrozi accused his brother of deliberately distorting his evidence .', 'label': Tensor(shape=[], dtype=Int64, value= 1), 'idx': Tensor(shape=[], dtype=Int64, value= 0)}\n" ] } ], "source": [ "datasets = load_dataset(\"glue\", task)\n", - "print(next(datasets['train'].create_dict_iterator()))" + "\n", + "# 查看数据集的划分\n", + "print(datasets) \n", + "train_dataset = datasets['train']\n", + "\n", + "# Set the dataset format for PyTorch\n", + "train_dataset.set_format(type='torch')\n", + "\n", + "# Simply iterate directly\n", + "print(train_dataset[0])" + ] + }, + { + "cell_type": "markdown", + "id": "19163e58", + "metadata": {}, + "source": [ + "### 数据集处理" ] }, { "cell_type": "code", - "execution_count": 20, - "id": "151943cb", + "execution_count": 12, + "id": "c27365a4-a79b-42ee-a666-3a46cf066f38", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Map: 100%|██████████| 3668/3668 [00:00<00:00, 11107.72 examples/s]\n", + "Map: 100%|██████████| 408/408 [00:00<00:00, 10199.22 examples/s]\n" + ] + } + ], "source": [ - "from mindnlp.dataset import BaseMapFunction\n", - "\n", - "class MapFunc(BaseMapFunction):\n", - " def __call__(self, sentence1, sentence2, label, idx):\n", - " outputs = tokenizer(sentence1, sentence2, truncation=True, max_length=None)\n", - " return outputs['input_ids'], outputs['attention_mask'], label\n", - "\n", + "from torch.utils.data import DataLoader\n", + "from transformers import DataCollatorWithPadding\n", "\n", "def get_dataset(dataset, tokenizer):\n", - " input_colums=['sentence1', 'sentence2', 'label', 'idx']\n", - " output_columns=['input_ids', 'attention_mask', 'labels']\n", - " dataset = dataset.map(MapFunc(input_colums, output_columns),\n", - " input_colums, output_columns)\n", - " dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': 
(None, tokenizer.pad_token_id),\n", - " 'attention_mask': (None, 0)})\n", + " def tokenize_function(examples):\n", + " return tokenizer(\n", + " examples['sentence1'],\n", + " examples['sentence2'],\n", + " truncation=True,\n", + " max_length=None,\n", + " )\n", + " \n", + " # 应用tokenize函数\n", + " dataset = dataset.map(\n", + " tokenize_function,\n", + " batched=True,\n", + " remove_columns=['sentence1', 'sentence2', 'idx']\n", + " )\n", + " \n", " return dataset\n", "\n", + "# 处理数据集\n", "train_dataset = get_dataset(datasets['train'], tokenizer)\n", - "eval_dataset = get_dataset(datasets['validation'], tokenizer)" + "eval_dataset = get_dataset(datasets['validation'], tokenizer)\n", + "\n", + "# 创建数据整理器\n", + "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n", + "\n", + "# 创建DataLoader\n", + "train_dataloader = DataLoader(\n", + " train_dataset, \n", + " batch_size=batch_size, \n", + " collate_fn=data_collator,\n", + " shuffle=True\n", + ")\n", + "\n", + "eval_dataloader = DataLoader(\n", + " eval_dataset, \n", + " batch_size=batch_size, \n", + " collate_fn=data_collator\n", + ")" ] }, { "cell_type": "code", - "execution_count": 21, - "id": "a99c4ab6", + "execution_count": 13, + "id": "2ac57266-cd84-440b-b38a-2d6eacb287c3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_dataloader" + ] + }, + { + "cell_type": "markdown", + "id": "dd8a28a5", + "metadata": {}, + "source": [ + "### 查看数据集信息" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "79340ca2-a167-4987-ad37-d079ac1ea69e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[MS_ALLOC_CONF]Runtime config: enable_vmm:True vmm_align_size:2MB\n", + "输入ID: mindtorch.Size([32, 75])\n", + "注意力掩码: mindtorch.Size([32, 75])\n", + "标签: mindtorch.Size([32])\n" + ] + } + ], + "source": [ + "# 获取一个批次\n", + 
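上面新增的 DataLoader 代码依赖 `DataCollatorWithPadding` 做动态填充:tokenize 阶段不做 padding,每个 batch 只填充到该 batch 内最长的序列(所以打印出的形状是 `[32, 75]` 这种随 batch 变化的长度)。下面用纯 Python 粗略演示这一行为——`pad_batch` 及示例 token id 均为本文虚构的说明性写法,并非 `transformers` 的 API:

```python
def pad_batch(sequences, pad_token_id=1):
    # 动态填充:只补齐到“本 batch 内”最长序列的长度,
    # 并生成对应的 attention mask(1 = 真实 token,0 = padding)。
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_token_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 两条长度不同的玩具序列(token id 为虚构;pad_token_id=1 对应 RoBERTa 的 <pad>):
batch = pad_batch([[0, 713, 16, 2], [0, 713, 2]])
print(batch["input_ids"])       # → [[0, 713, 16, 2], [0, 713, 2, 1]]
print(batch["attention_mask"])  # → [[1, 1, 1, 1], [1, 1, 1, 0]]
```

相比统一填充到全局 `max_length`,这种按 batch 填充能让短句 batch 保持较短的序列长度,节省计算。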
"batch = next(iter(train_dataloader))\n", + "\n", + "print(\"输入ID:\", batch['input_ids'].shape)\n", + "print(\"注意力掩码:\", batch['attention_mask'].shape)\n", + "if 'labels' in batch:\n", + " print(\"标签:\", batch['labels'].shape)" + ] + }, + { + "cell_type": "markdown", + "id": "f6ef16ca", + "metadata": {}, + "source": [ + "## 加载评估指标" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9ccecda-1637-420c-aaac-c9c25ef69c81", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n", - "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n", - "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n", - "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n", - "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n", - "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n", - "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n" + "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. 
Disabling parallelism to avoid deadlocks...\n", + "To disable this warning, you can either:\n", + "\t- Avoid using `tokenizers` before the fork if possible\n", + "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "{'input_ids': Tensor(shape=[32, 70], dtype=Int64, value=\n", - "[[ 0, 10127, 1001 ... 1, 1, 1],\n", - " [ 0, 975, 26802 ... 1, 1, 1],\n", - " [ 0, 1213, 56 ... 1, 1, 1],\n", - " ...\n", - " [ 0, 133, 1154 ... 1, 1, 1],\n", - " [ 0, 12667, 8423 ... 1, 1, 1],\n", - " [ 0, 32478, 1033 ... 1, 1, 1]]), 'attention_mask': Tensor(shape=[32, 70], dtype=Int64, value=\n", - "[[1, 1, 1 ... 0, 0, 0],\n", - " [1, 1, 1 ... 0, 0, 0],\n", - " [1, 1, 1 ... 0, 0, 0],\n", - " ...\n", - " [1, 1, 1 ... 0, 0, 0],\n", - " [1, 1, 1 ... 0, 0, 0],\n", - " [1, 1, 1 ... 0, 0, 0]]), 'labels': Tensor(shape=[32], dtype=Int64, value= [1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, \n", - " 1, 1, 0, 0, 1, 1, 1, 0])}\n" + "Looking in indexes: http://pip.modelarts.private.com:8888/repository/pypi/simple\n", + "Requirement already satisfied: scikit-learn in /usr/local/python3.10.14/lib/python3.10/site-packages (1.7.2)\n", + "Requirement already satisfied: numpy>=1.22.0 in /usr/local/python3.10.14/lib/python3.10/site-packages (from scikit-learn) (1.26.4)\n", + "Requirement already satisfied: scipy>=1.8.0 in /usr/local/python3.10.14/lib/python3.10/site-packages (from scikit-learn) (1.10.1)\n", + "Requirement already satisfied: joblib>=1.2.0 in /usr/local/python3.10.14/lib/python3.10/site-packages (from scikit-learn) (1.5.2)\n", + "Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/python3.10.14/lib/python3.10/site-packages (from scikit-learn) (3.6.0)\n" ] } ], "source": [ - "print(next(train_dataset.create_dict_iterator()))" + "!pip install scikit-learn #安装依赖" ] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 16, "id": "9dc17398", 
"metadata": { "scrolled": true }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading builder script: 5.75kB [00:00, 6.29MB/s]\n" + ] + } + ], "source": [ "metric = evaluate.load(\"glue\", task)" ] }, + { + "cell_type": "markdown", + "id": "e0d29ec3", + "metadata": {}, + "source": [ + "## 模型加载" + ] + }, { "cell_type": "markdown", "id": "9034b5b2", @@ -308,7 +462,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 17, "id": "f929a616", "metadata": {}, "outputs": [ @@ -316,7 +470,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at AI-ModelScope/roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']\n", + "Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] }, @@ -324,13 +478,13 @@ "name": "stdout", "output_type": "stream", "text": [ - "trainable params: 1,061,890 || all params: 356,423,684 || trainable%: 0.2979291353713745\n" + "trainable params: 1,061,890 || all params: 356,423,684 || trainable%: 0.2979\n" ] } ], "source": [ "# load model\n", - "model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, return_dict=True, mirror=\"modelscope\")\n", + "model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, return_dict=True)\n", "model = get_peft_model(model, peft_config)\n", "# print number of trainable parameters\n", "model.print_trainable_parameters()" @@ -354,12 +508,13 @@ }, { "cell_type": "code", - 
"execution_count": 24, + "execution_count": 20, "id": "3c7ee704", "metadata": {}, "outputs": [], "source": [ - "optimizer = AdamW(params=model.trainable_params(), lr=lr)\n", + "import torch\n", + "optimizer = torch.optim.Adam(params=model.parameters(), lr=lr)\n", "\n", "# Instantiate scheduler\n", "lr_scheduler = get_linear_schedule_with_warmup(\n", @@ -379,44 +534,49 @@ }, { "cell_type": "code", - "execution_count": 25, - "id": "a0d2bff6", + "execution_count": 21, + "id": "222dece5-672c-4301-a26c-2c4e5fd3bf40", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "base_model.classifier.modules_to_save.default.dense.weight: mindtorch.Size([1024, 1024])\n", + "base_model.classifier.modules_to_save.default.dense.bias: mindtorch.Size([1024])\n", + "base_model.classifier.modules_to_save.default.out_proj.weight: mindtorch.Size([2, 1024])\n", + "base_model.classifier.modules_to_save.default.out_proj.bias: mindtorch.Size([2])\n", + "prompt_encoder.default.embedding.weight: mindtorch.Size([10, 1024])\n" + ] + } + ], + "source": [ + "for name, param in model.named_parameters():\n", + " if param.requires_grad:\n", + " print(f\"{name}: {param.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5db89ce-d3da-4f34-a9dd-0d805858ec1e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "(Tensor(shape=[1024, 1024], dtype=Float32, value=\n", - " [[-1.36615150e-02, 4.08777148e-02, 2.55590724e-03 ... 3.47721018e-02, 9.83245391e-03, 3.02866008e-02],\n", - " [-1.82124749e-02, -1.49800153e-02, -7.02886097e-03 ... 2.07055025e-02, 3.45048914e-03, -3.01328991e-02],\n", - " [-6.06489694e-03, 6.34483900e-03, 1.55880465e-03 ... 3.41698825e-02, -7.40761030e-03, 3.69770750e-02],\n", - " ...\n", - " [-4.91964221e-02, 1.94903351e-02, 2.51724524e-03 ... 3.08064763e-02, -7.55657675e-04, -8.02899338e-03],\n", - " [-2.02472787e-03, -2.46642623e-02, -7.02362158e-04 ... 
2.86021479e-03, 8.27849377e-03, 9.28967725e-03],\n", - " [-2.06481982e-02, 2.20393538e-02, 3.17191752e-03 ... -2.68367468e-03, -4.67487238e-02, 9.09192720e-04]]),\n", - " Tensor(shape=[1024], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00 ... 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]),\n", - " Tensor(shape=[2, 1024], dtype=Float32, value=\n", - " [[ 8.87530856e-03, 2.81313114e-04, 3.74777764e-02 ... -2.02168617e-02, 4.23110556e-03, -3.84111144e-02],\n", - " [ 3.84113006e-03, -1.38288038e-02, 1.98907983e-02 ... -3.23316827e-02, -3.48059200e-02, 7.11114611e-04]]),\n", - " Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]),\n", - " Tensor(shape=[10, 1024], dtype=Float32, value=\n", - " [[-1.75136819e-01, 6.45715892e-02, 1.14947283e+00 ... 8.42640877e-01, 6.34459913e-01, 9.26455021e-01],\n", - " [ 7.65107423e-02, 5.32130003e-01, -2.12189722e+00 ... 1.34316778e+00, 4.83163930e-02, -2.11086214e-01],\n", - " [-7.30758488e-01, -8.77783835e-01, -5.94429135e-01 ... -2.58468151e-01, -2.85294857e-02, -2.18536639e+00],\n", - " ...\n", - " [ 4.13678169e-01, -1.15315497e+00, 8.49422574e-01 ... 2.54201055e-01, -1.30300558e+00, 2.13208008e+00],\n", - " [ 5.60092032e-01, -8.55898261e-01, -7.30682373e-01 ... -1.04416716e+00, -1.10600793e+00, 4.29843873e-01],\n", - " [-1.94377673e+00, 4.45314497e-02, -4.56895113e-01 ... 
1.88079858e+00, -6.05825901e-01, -3.19380850e-01]]))" + "device(type=npu, index=0)" ] }, - "execution_count": 25, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# print name of trainable parameters\n", - "model.trainable_params()" + "model.npu() #模型移动到npu侧\n", + "device = model.device # 获取模型所在的设备\n", + "device" ] }, { @@ -434,117 +594,70 @@ }, { "cell_type": "code", - "execution_count": 26, - "id": "0667ebea", + "execution_count": 32, + "id": "28c9818a-5225-4b05-a69c-3b490fd70d47", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "100%|██████████| 115/115 [00:26<00:00, 4.38it/s]\n", - "100%|██████████| 13/13 [00:01<00:00, 7.83it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch 0: {'accuracy': 0.6985294117647058, 'f1': 0.8183161004431314}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 115/115 [00:26<00:00, 4.42it/s]\n", - "100%|██████████| 13/13 [00:01<00:00, 7.78it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch 1: {'accuracy': 0.7009803921568627, 'f1': 0.8195266272189349}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 115/115 [00:26<00:00, 4.38it/s]\n", - "100%|██████████| 13/13 [00:01<00:00, 7.76it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch 2: {'accuracy': 0.7083333333333334, 'f1': 0.8231797919762258}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 115/115 [00:26<00:00, 4.39it/s]\n", - "100%|██████████| 13/13 [00:01<00:00, 8.15it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch 3: {'accuracy': 0.7009803921568627, 'f1': 0.8195266272189349}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 115/115 [00:27<00:00, 4.21it/s]\n", - 
"100%|██████████| 13/13 [00:01<00:00, 8.02it/s]"
+      "100%|██████████| 115/115 [00:26<00:00, 4.31it/s]\n",
+      "100%|██████████| 13/13 [00:02<00:00, 5.26it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "epoch 4: {'accuracy': 0.7009803921568627, 'f1': 0.8195266272189349}\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "\n"
+      "epoch 0: {'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}\n"
     ]
    }
   ],
   "source": [
-    "from mindnlp.core import value_and_grad\n",
-    "def forward_fn(**batch):\n",
-    "    outputs = model(**batch)\n",
-    "    loss = outputs.loss\n",
-    "    return loss\n",
-    "\n",
-    "grad_fn = value_and_grad(forward_fn, tuple(model.parameters()))\n",
+    "import mindtorch\n",
     "\n",
     "for epoch in range(num_epochs):\n",
-    "    model.set_train()\n",
-    "    train_total_size = train_dataset.get_dataset_size()\n",
-    "    for step, batch in enumerate(tqdm(train_dataset.create_dict_iterator(), total=train_total_size)):\n",
-    "        optimizer.zero_grad()\n",
-    "        loss = grad_fn(**batch)\n",
+    "    # training phase\n",
+    "    model.train()\n",
+    "    train_total_size = len(train_dataloader)\n",
+    "    for step, batch in enumerate(tqdm(train_dataloader, total=train_total_size)):\n",
+    "\n",
+    "        # move every tensor in the batch to the device the model is on\n",
+    "        batch_on_device = {}\n",
+    "        for key, value in batch.items():\n",
+    "            if mindtorch.is_tensor(value):\n",
+    "                batch_on_device[key] = value.to(device)\n",
+    "            else:\n",
+    "                batch_on_device[key] = value\n",
+    "\n",
+    "        # zero the gradients manually instead of calling optimizer.zero_grad()\n",
+    "        for param in model.parameters():\n",
+    "            if param.grad is not None:\n",
+    "                param.grad = None\n",
+    "\n",
+    "        outputs = model(**batch_on_device)\n",
+    "        loss = outputs.loss\n",
+    "        loss.backward()\n",
     "        optimizer.step()\n",
     "        lr_scheduler.step()\n",
     "\n",
-    "    model.set_train(False)\n",
-    "    eval_total_size = eval_dataset.get_dataset_size()\n",
-    "    for step, batch in enumerate(tqdm(eval_dataset.create_dict_iterator(), total=eval_total_size)):\n",
-    "        outputs = model(**batch)\n",
-    "
predictions = outputs.logits.argmax(axis=-1)\n",
-    "        predictions, references = predictions, batch[\"labels\"]\n",
+    "    # evaluation phase\n",
+    "    model.eval()\n",
+    "    eval_total_size = len(eval_dataloader)\n",
+    "    for step, batch in enumerate(tqdm(eval_dataloader, total=eval_total_size)):\n",
+    "        # move every tensor in the batch to the device the model is on\n",
+    "        batch_on_device = {}\n",
+    "        for key, value in batch.items():\n",
+    "            if mindtorch.is_tensor(value):\n",
+    "                batch_on_device[key] = value.to(device)\n",
+    "            else:\n",
+    "                batch_on_device[key] = value\n",
+    "        with mindtorch.no_grad():\n",
+    "            outputs = model(**batch_on_device)\n",
+    "\n",
+    "        predictions = outputs.logits.argmax(dim=-1)\n",
+    "        predictions, references = predictions.cpu().numpy(), batch[\"labels\"].cpu().numpy()\n",
     "        metric.add_batch(\n",
     "            predictions=predictions,\n",
     "            references=references,\n",
@@ -553,22 +666,6 @@
     "    eval_metric = metric.compute()\n",
     "    print(f\"epoch {epoch}:\", eval_metric)"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4de28f75",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7cb41077-b027-4c0f-87ed-380cd816d2f4",
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {
@@ -584,9 +681,9 @@
    "name": "mindspore1.7.0-cuda10.1-py3.7-ubuntu18.04"
   },
   "kernelspec": {
-   "display_name": "MindSpore",
+   "display_name": "Python 3.10",
    "language": "python",
-   "name": "mindspore"
+   "name": "py310"
   },
   "language_info": {
    "codemirror_mode": {
@@ -598,7 +695,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.10.14"
   },
   "vscode": {
    "interpreter": {
diff --git a/03.MindSpore_Compatible_Training_Course/Chapter2/pretrain.md b/03.MindSpore_Compatible_Training_Course/Chapter2/pretrain.md
index 405fd26..f50e06d 100644
--- a/03.MindSpore_Compatible_Training_Course/Chapter2/pretrain.md
+++ b/03.MindSpore_Compatible_Training_Course/Chapter2/pretrain.md
@@ -117,7 +117,7 @@
 INFO:root:Done!
 mkdir dataset
 cd dataset/
-wget https://hf-mirror.com/datasets/tatsu-lab/alpaca/blob/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
+wget https://modelscope.cn/datasets/angelala00/tatsu-lab-alpaca/resolve/master/train-00000-of-00001-a09b74b3ef9c3b56.parquet
 cd ..
 ```