首页速度优化深入解析k-means系列算法：从基础到进阶在load_iris数据集上的实战对比

网站优化

小程序计算机毕设之基于springboot+小程序的高校生活互助平台小程序基于SpringBoot校园生活服务小程序（完整前后端代码+说明文档+LW，调试定制等）

SeqGPT-560M效果验证：在无标注测试集上达到92.4% Exact Match准确率

2026-06-08 14:57:51

阅读时长:5分钟

562次阅读

核心内容摘要

卫报新闻文章数据集-2016-2022年14万+篇多领域英文新闻全文数据-适用于自然语言处理模型训练与内容分析研究-自然语言处理研究、媒体分析、社会趋势研究以及人工智能模型训练

Magma升级指南：从基础版到专业版的平滑过渡

1 代码# 注不建议再重复训练tokenizer“词典”MiniMind已自带此脚本仅供学习和参考。

基于不同词典训练的模型将导致输出完全不统一降低社区的模型复用性# Note: It is not recommended to re-train the tokenizer. MiniMind already includes one. This script is for learning and reference only. Training models with different tokenizers will lead to inconsistent outputs and reduce model reusability in the community.importosimportjsonfromtokenizersimportdecoders,models,pre_tokenizers,trainers,Tokenizer DATA_PATH../dataset/pretrain_hq.jsonlTOKENIZER_DIR../model_learn_tokenizer/VOCAB_SIZE6400defget_texts(data_path):withopen(data_path,r,encodingutf-

asf:fori,lineinenumerate(f):ifi10000:break# 实验性可只用前10000行测试datajson.loads(line)yielddata[text]deftrain_tokenizer(data_path,tokenizer_dir,vocab_size):tokenizerTokenizer(models.BPE())tokenizer.pre_tokenizerpre_tokenizers.ByteLevel(add_prefix_spaceFalse)trainertrainers.BpeTrainer(vocab_sizevocab_size,special_tokens[|endoftext|,|im_start|,|im_end|],show_progressTrue,initial_alphabetpre_tokenizers.ByteLevel.alphabet())textsget_texts(data_path)tokenizer.train_from_iterator(texts,trainertrainer)tokenizer.decoderdecoders.ByteLevel()asserttokenizer.token_to_id(|endoftext|)0asserttokenizer.token_to_id(|im_start|)1asserttokenizer.token_to_id(|im_end|)2os.makedirs(tokenizer_dir,exist_okTrue)tokenizer.save(os.path.join(tokenizer_dir,tokenizer.json))tokenizer.model.save(tokenizer_dir)config{add_bos_token:False,add_eos_token:False,add_prefix_space:False,added_tokens_decoder:{0:{content:|endoftext|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True},1:{content:|im_start|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True},2:{content:|im_end|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True}},additional_special_tokens:[],bos_token:|im_start|,clean_up_tokenization_spaces:False,eos_token:|im_end|,legacy:True,model_max_length:32768,pad_token:|endoftext|,sp_model_kwargs:{},spaces_between_special_tokens:False,tokenizer_class:PreTrainedTokenizerFast,unk_token:|endoftext|,chat_template:{%- if tools %}\n \n {%- if messages[0].role system %}\n \n {%- endif %}\n \n {%- for tool in tools %}\n \n \n {%- endfor %}\n \n{%- else %}\n {%- if messages[0][role] system -%}\n \n {%- else -%}\n \n {%- endif %}\n{%- endif %}\n{%- set ns namespace(multi_step_tooltrue, last_query_indexmessages|length -

%}\n{%- for message in messages[::-1] %}\n {%- set index (messages|length -

- loop.index0 %}\n {%- if ns.multi_step_tool and message.role \user\ and message.content is string and not(message.content.startswith(tool_response) and message.content.endswith(/tool_response)) %}\n {%- set ns.multi_step_tool false %}\n {%- set ns.last_query_index index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content message.content %}\n {%- else %}\n {%- set content %}\n {%- endif %}\n {%- if (message.role \user\) or (message.role \system\ and not loop.first) %}\n \n {%- elif message.role \assistant\ %}\n \n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n \n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call tool_call.function %}\n {%- endif %}\n \n \n \n {%- if tool_call.arguments is string %}\n \n {%- else %}\n \n {%- endif %}\n \n {%- endfor %}\n {%- endif %}\n \n {%- elif message.role \tool\ %}\n {%- if loop.first or (messages[loop.index0 - 1].role ! \tool\) %}\n \n {%- endif %}\n \n \n \n {%- if loop.last or (messages[loop.index0 1].role ! \tool\) %}\n \n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n \n {%- if enable_thinking is defined and enable_thinking is false %}\n \n {%- endif %}\n{%- endif %}}withopen(os.path.join(tokenizer_dir,tokenizer_config.json),w,encodingutf-

asf:json.dump(config,f,ensure_asciiFalse,indent

print(Tokenizer training completed.)defeval_tokenizer(tokenizer_dir):fromtransformersimportAutoTokenizer tokenizerAutoTokenizer.from_pretrained(tokenizer_dir)messages[{role:system,content:你是一个优秀的聊天机器人总是给我正确的回应},{role:user,content:你来自哪里},{role:assistant,content:我来自地球}]new_prompttokenizer.apply_chat_template(messages,tokenizeFalse)print(-*

print(new_prompt)print(-*

print(tokenizer词表长度,len(tokenizer))model_inputstokenizer(new_prompt)print(encoder长度,len(model_inputs[input_ids]))responsetokenizer.decode(model_inputs[input_ids],skip_special_tokensFalse)print(decoder一致性,responsenew_prompt,\n)print(-*

print(流式解码字节缓冲测试)input_idsmodel_inputs[input_ids]token_cache[]fortidininput_ids:token_cache.append(tid)current_decodetokenizer.decode(token_cache)ifcurrent_decodeand\ufffdnotincurrent_decode:display_idstoken_cache[0]iflen(token_cache)1elsetoken_cache raw_tokens[tokenizer.convert_ids_to_tokens(int(t))fortin(token_cacheifisinstance(token_cache,list)else[token_cache])]print(fToken ID:{str(display_ids):15}- Raw:{str(raw_tokens):20}- Decode Str:{current_decode})token_cache[]ifnamemain:train_tokenizer(DATA_PATH,TOKENIZER_DIR,VOCAB_SIZE)eval_tokenizer(TOKENIZER_DIR)

小程序计算机毕设之基于springboot+小程序的高校生活互助平台小程序基于SpringBoot校园生活服务小程序（完整前后端代码+说明文档+LW，调试定制等）

核心内容摘要

Magma升级指南：从基础版到专业版的平滑过渡

%}\n{%- for message in messages[::-1] %}\n {%- set index (messages|length -

asf:json.dump(config,f,ensure_asciiFalse,indent

print(new_prompt)print(-*

print(tokenizer词表长度,len(tokenizer))model_inputstokenizer(new_prompt)print(encoder长度,len(model_inputs[input_ids]))responsetokenizer.decode(model_inputs[input_ids],skip_special_tokensFalse)print(decoder一致性,responsenew_prompt,\n)print(-*

坤坤寒进桃子里在线看歌词免费-坤坤寒进桃子里在线看歌词免费应用

📑 文章目录

🔥 热门优化文章

🛠️ 实用工具推荐

百度百家号客服电话人工服务

小程序计算机毕设之基于springboot+小程序的高校生活互助平台小程序基于SpringBoot校园生活服务小程序（完整前后端代码+说明文档+LW，调试定制等）

核心内容摘要

Magma升级指南：从基础版到专业版的平滑过渡

%}\n{%- for message in messages[::-1] %}\n {%- set index (messages|length -

asf:json.dump(config,f,ensure_asciiFalse,indent

print(new_prompt)print(-*

print(tokenizer词表长度,len(tokenizer))model_inputstokenizer(new_prompt)print(encoder长度,len(model_inputs[input_ids]))responsetokenizer.decode(model_inputs[input_ids],skip_special_tokensFalse)print(decoder一致性,responsenew_prompt,\n)print(-*

坤坤寒进桃子里在线看歌词免费-坤坤寒进桃子里在线看歌词免费应用

📑 文章目录

🔥 热门优化文章

🛠️ 实用工具推荐

相关优化文章 推荐

百度百家号客服电话人工服务

相关优化文章推荐