www.444:解锁数字世界的无限可能

核心内容摘要

桃子移植系列:解锁生命密码,重塑味蕾传奇
《嘘声中的十月:光影与身体的私语》

草莓视频官网:解锁无限精彩,点亮你的数字生活

1 代码# 注不建议再重复训练tokenizer“词典”MiniMind已自带此脚本仅供学习和参考。

基于不同词典训练的模型将导致输出完全不统一降低社区的模型复用性# Note: It is not recommended to re-train the tokenizer. MiniMind already includes one. This script is for learning and reference only. Training models with different tokenizers will lead to inconsistent outputs and reduce model reusability in the community.importosimportjsonfromtokenizersimportdecoders,models,pre_tokenizers,trainers,Tokenizer DATA_PATH../dataset/pretrain_hq.jsonlTOKENIZER_DIR../model_learn_tokenizer/VOCAB_SIZE6400defget_texts(data_path):withopen(data_path,r,encodingutf-

asf:fori,lineinenumerate(f):ifi10000:break# 实验性可只用前10000行测试datajson.loads(line)yielddata[text]deftrain_tokenizer(data_path,tokenizer_dir,vocab_size):tokenizerTokenizer(models.BPE())tokenizer.pre_tokenizerpre_tokenizers.ByteLevel(add_prefix_spaceFalse)trainertrainers.BpeTrainer(vocab_sizevocab_size,special_tokens[|endoftext|,|im_start|,|im_end|],show_progressTrue,initial_alphabetpre_tokenizers.ByteLevel.alphabet())textsget_texts(data_path)tokenizer.train_from_iterator(texts,trainertrainer)tokenizer.decoderdecoders.ByteLevel()asserttokenizer.token_to_id(|endoftext|)0asserttokenizer.token_to_id(|im_start|)1asserttokenizer.token_to_id(|im_end|)2os.makedirs(tokenizer_dir,exist_okTrue)tokenizer.save(os.path.join(tokenizer_dir,tokenizer.json))tokenizer.model.save(tokenizer_dir)config{add_bos_token:False,add_eos_token:False,add_prefix_space:False,added_tokens_decoder:{0:{content:|endoftext|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True},1:{content:|im_start|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True},2:{content:|im_end|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True}},additional_special_tokens:[],bos_token:|im_start|,clean_up_tokenization_spaces:False,eos_token:|im_end|,legacy:True,model_max_length:32768,pad_token:|endoftext|,sp_model_kwargs:{},spaces_between_special_tokens:False,tokenizer_class:PreTrainedTokenizerFast,unk_token:|endoftext|,chat_template:{%- if tools %}\n \n {%- if messages[0].role system %}\n \n {%- endif %}\n \n {%- for tool in tools %}\n \n \n {%- endfor %}\n \n{%- else %}\n {%- if messages[0][role] system -%}\n \n {%- else -%}\n \n {%- endif %}\n{%- endif %}\n{%- set ns namespace(multi_step_tooltrue, last_query_indexmessages|length -

%}\n{%- for message in messages[::-1] %}\n {%- set index (messages|length -

- loop.index0 %}\n {%- if ns.multi_step_tool and message.role \user\ and message.content is string and not(message.content.startswith(tool_response) and message.content.endswith(/tool_response)) %}\n {%- set ns.multi_step_tool false %}\n {%- set ns.last_query_index index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content message.content %}\n {%- else %}\n {%- set content %}\n {%- endif %}\n {%- if (message.role \user\) or (message.role \system\ and not loop.first) %}\n \n {%- elif message.role \assistant\ %}\n \n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n \n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call tool_call.function %}\n {%- endif %}\n \n \n \n {%- if tool_call.arguments is string %}\n \n {%- else %}\n \n {%- endif %}\n \n {%- endfor %}\n {%- endif %}\n \n {%- elif message.role \tool\ %}\n {%- if loop.first or (messages[loop.index0 - 1].role ! \tool\) %}\n \n {%- endif %}\n \n \n \n {%- if loop.last or (messages[loop.index0 1].role ! \tool\) %}\n \n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n \n {%- if enable_thinking is defined and enable_thinking is false %}\n \n {%- endif %}\n{%- endif %}}withopen(os.path.join(tokenizer_dir,tokenizer_config.json),w,encodingutf-

asf:json.dump(config,f,ensure_asciiFalse,indent

print(Tokenizer training completed.)defeval_tokenizer(tokenizer_dir):fromtransformersimportAutoTokenizer tokenizerAutoTokenizer.from_pretrained(tokenizer_dir)messages[{role:system,content:你是一个优秀的聊天机器人总是给我正确的回应},{role:user,content:你来自哪里},{role:assistant,content:我来自地球}]new_prompttokenizer.apply_chat_template(messages,tokenizeFalse)print(-*

print(new_prompt)print(-*

print(tokenizer词表长度,len(tokenizer))model_inputstokenizer(new_prompt)print(encoder长度,len(model_inputs[input_ids]))responsetokenizer.decode(model_inputs[input_ids],skip_special_tokensFalse)print(decoder一致性,responsenew_prompt,\n)print(-*

print(流式解码字节缓冲测试)input_idsmodel_inputs[input_ids]token_cache[]fortidininput_ids:token_cache.append(tid)current_decodetokenizer.decode(token_cache)ifcurrent_decodeand\ufffdnotincurrent_decode:display_idstoken_cache[0]iflen(token_cache)1elsetoken_cache raw_tokens[tokenizer.convert_ids_to_tokens(int(t))fortin(token_cacheifisinstance(token_cache,list)else[token_cache])]print(fToken ID:{str(display_ids):15}- Raw:{str(raw_tokens):20}- Decode Str:{current_decode})token_cache[]if__name____main__:train_tokenizer(DATA_PATH,TOKENIZER_DIR,VOCAB_SIZE)eval_tokenizer(TOKENIZER_DIR)

爱情调色大片1000部-爱情调色大片1000部应用

百度百家号客服电话人工服务

123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123