丁香花开,五月心潮:一段关于爱与美好的温柔絮语

核心内容摘要

丝瓜的诱惑:一场舌尖上的感官盛宴
福建奶牛导航app官网版:智慧养殖,指尖上的高效牧场

御梦子心糖.logo

1 代码# 注不建议再重复训练tokenizer“词典”MiniMind已自带此脚本仅供学习和参考。

基于不同词典训练的模型将导致输出完全不统一降低社区的模型复用性# Note: It is not recommended to re-train the tokenizer. MiniMind already includes one. This script is for learning and reference only. Training models with different tokenizers will lead to inconsistent outputs and reduce model reusability in the community.importosimportjsonfromtokenizersimportdecoders,models,pre_tokenizers,trainers,Tokenizer DATA_PATH../dataset/pretrain_hq.jsonlTOKENIZER_DIR../model_learn_tokenizer/VOCAB_SIZE6400defget_texts(data_path):withopen(data_path,r,encodingutf-

asf:fori,lineinenumerate(f):ifi10000:break# 实验性可只用前10000行测试datajson.loads(line)yielddata[text]deftrain_tokenizer(data_path,tokenizer_dir,vocab_size):tokenizerTokenizer(models.BPE())tokenizer.pre_tokenizerpre_tokenizers.ByteLevel(add_prefix_spaceFalse)trainertrainers.BpeTrainer(vocab_sizevocab_size,special_tokens[|endoftext|,|im_start|,|im_end|],show_progressTrue,initial_alphabetpre_tokenizers.ByteLevel.alphabet())textsget_texts(data_path)tokenizer.train_from_iterator(texts,trainertrainer)tokenizer.decoderdecoders.ByteLevel()asserttokenizer.token_to_id(|endoftext|)0asserttokenizer.token_to_id(|im_start|)1asserttokenizer.token_to_id(|im_end|)2os.makedirs(tokenizer_dir,exist_okTrue)tokenizer.save(os.path.join(tokenizer_dir,tokenizer.json))tokenizer.model.save(tokenizer_dir)config{add_bos_token:False,add_eos_token:False,add_prefix_space:False,added_tokens_decoder:{0:{content:|endoftext|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True},1:{content:|im_start|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True},2:{content:|im_end|,lstrip:False,normalized:False,rstrip:False,single_word:False,special:True}},additional_special_tokens:[],bos_token:|im_start|,clean_up_tokenization_spaces:False,eos_token:|im_end|,legacy:True,model_max_length:32768,pad_token:|endoftext|,sp_model_kwargs:{},spaces_between_special_tokens:False,tokenizer_class:PreTrainedTokenizerFast,unk_token:|endoftext|,chat_template:{%- if tools %}\n \n {%- if messages[0].role system %}\n \n {%- endif %}\n \n {%- for tool in tools %}\n \n \n {%- endfor %}\n \n{%- else %}\n {%- if messages[0][role] system -%}\n \n {%- else -%}\n \n {%- endif %}\n{%- endif %}\n{%- set ns namespace(multi_step_tooltrue, last_query_indexmessages|length -

%}\n{%- for message in messages[::-1] %}\n {%- set index (messages|length -

- loop.index0 %}\n {%- if ns.multi_step_tool and message.role \user\ and message.content is string and not(message.content.startswith(tool_response) and message.content.endswith(/tool_response)) %}\n {%- set ns.multi_step_tool false %}\n {%- set ns.last_query_index index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content message.content %}\n {%- else %}\n {%- set content %}\n {%- endif %}\n {%- if (message.role \user\) or (message.role \system\ and not loop.first) %}\n \n {%- elif message.role \assistant\ %}\n \n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n \n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call tool_call.function %}\n {%- endif %}\n \n \n \n {%- if tool_call.arguments is string %}\n \n {%- else %}\n \n {%- endif %}\n \n {%- endfor %}\n {%- endif %}\n \n {%- elif message.role \tool\ %}\n {%- if loop.first or (messages[loop.index0 - 1].role ! \tool\) %}\n \n {%- endif %}\n \n \n \n {%- if loop.last or (messages[loop.index0 1].role ! \tool\) %}\n \n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n \n {%- if enable_thinking is defined and enable_thinking is false %}\n \n {%- endif %}\n{%- endif %}}withopen(os.path.join(tokenizer_dir,tokenizer_config.json),w,encodingutf-

asf:json.dump(config,f,ensure_asciiFalse,indent

print(Tokenizer training completed.)defeval_tokenizer(tokenizer_dir):fromtransformersimportAutoTokenizer tokenizerAutoTokenizer.from_pretrained(tokenizer_dir)messages[{role:system,content:你是一个优秀的聊天机器人总是给我正确的回应},{role:user,content:你来自哪里},{role:assistant,content:我来自地球}]new_prompttokenizer.apply_chat_template(messages,tokenizeFalse)print(-*

print(new_prompt)print(-*

print(tokenizer词表长度,len(tokenizer))model_inputstokenizer(new_prompt)print(encoder长度,len(model_inputs[input_ids]))responsetokenizer.decode(model_inputs[input_ids],skip_special_tokensFalse)print(decoder一致性,responsenew_prompt,\n)print(-*

print(流式解码字节缓冲测试)input_idsmodel_inputs[input_ids]token_cache[]fortidininput_ids:token_cache.append(tid)current_decodetokenizer.decode(token_cache)ifcurrent_decodeand\ufffdnotincurrent_decode:display_idstoken_cache[0]iflen(token_cache)1elsetoken_cache raw_tokens[tokenizer.convert_ids_to_tokens(int(t))fortin(token_cacheifisinstance(token_cache,list)else[token_cache])]print(fToken ID:{str(display_ids):15}- Raw:{str(raw_tokens):20}- Decode Str:{current_decode})token_cache[]if__name____main__:train_tokenizer(DATA_PATH,TOKENIZER_DIR,VOCAB_SIZE)eval_tokenizer(TOKENIZER_DIR)

91社-91社应用

百度百家号客服电话人工服务

123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123 123