
2026-06-12 23:15:52


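Summary

This article explains the principles and implementation of the BGE-M3 embedding model.

The model is multilingual, multi-functional, and multi-granular. It uses knowledge distillation to unify three retrieval methods (dense retrieval, sparse retrieval, and multi-vector retrieval) in a single model.

The article walks through the model architecture, including the embedding layer, the Transformer blocks, and the multi-head attention mechanism, and provides a complete TensorFlow implementation to help developers understand and apply this semantic search technique.

Introduction

Embedding models came up earlier in the articles on semantic chunking and data retrieval, and the semantic chunking article also introduced them briefly.

[Document parsing: structured document parsing, semantic chunking]

Embedding model evaluation

Evaluations of embedding models are available online. They are shared here first so that you can choose a model based on your business requirements.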

https://blog.csdn.net/m0_52307083/article/details/147811974 https://zhuanlan.zhihu.com/p/1922035099731469235

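How the embedding model works

The figure usually shown when introducing bge-m3 (image not included here) summarizes the model's strengths: it is multilingual, multi-functional, and multi-granular, and it supports more than 70 languages. It has recently become one of the most commonly used models for extracting embeddings, for example in retrieval-augmented generation (RAG).

What sets bge-m3 apart is that it optimizes the loss functions of the following three retrieval methods at the same time:

Dense retrieval: semantic search with the sentence-level CLS vector, which compresses the meaning of the whole sentence into a single vector.

Sparse retrieval: search with token-level importance weights. The model learns how important each token is, which improves keyword search.

Multi-vector retrieval: search with token-level vectors. Each token gets its own vector, which enables fine-grained semantic matching.

Knowledge distillation trains these three retrieval methods into one unified model.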
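The three retrieval modes described above can be illustrated with a small scoring sketch. This is a minimal pure-Python illustration, not the model's code: the function names and all toy vectors and weights are made up, and it assumes the usual formulation in which the dense score is an inner product, the sparse score sums weight products over shared tokens, and the multi-vector score averages each query token's best match.

```python
def dense_score(q_vec, p_vec):
    """Dense retrieval: inner product of two sentence-level (CLS) vectors."""
    return sum(a * b for a, b in zip(q_vec, p_vec))


def sparse_score(q_weights, p_weights):
    """Sparse retrieval: sum of weight products over tokens shared by query and passage."""
    return sum(w * p_weights[tok] for tok, w in q_weights.items() if tok in p_weights)


def multi_vector_score(q_vecs, p_vecs):
    """Multi-vector retrieval: each query token picks its best-matching
    passage token, and the best scores are averaged over query tokens."""
    best = [max(dense_score(q, p) for p in p_vecs) for q in q_vecs]
    return sum(best) / len(best)


# Toy inputs (hypothetical values, chosen only to make the arithmetic easy to follow).
q_cls, p_cls = [0.6, 0.8], [0.8, 0.6]
q_lex = {"bge": 0.9, "model": 0.4}
p_lex = {"bge": 0.7, "retrieval": 0.5}
q_tok = [[1.0, 0.0], [0.0, 1.0]]
p_tok = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]

dense = dense_score(q_cls, p_cls)
sparse = sparse_score(q_lex, p_lex)
multi = multi_vector_score(q_tok, p_tok)
print(dense, sparse, multi)
```

Only the token "bge" is shared in the sparse example, so the sparse score reduces to a single product. In the multi-vector example each query token is matched independently against all passage tokens.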

The distillation loss is computed as follows.

Note: for an introduction to KL divergence, see https://zhuanlan.zhihu.com/p/714024458

    # Teacher distribution: turn the scores into a probability distribution.
    # ensemble_scores.detach(): prediction scores combined from the individual retrieval modes
    self_teacher_targets = torch.softmax(ensemble_scores.detach(), dim=-1)

    # Distillation loss of each retrieval mode, using KL divergence as the loss function.
    # dense_scores: dense retrieval
    dense_self_distill_loss = self.distill_loss("kl_div", self_teacher_targets, dense_scores)
    # Sparse retrieval
    sparse_self_distill_loss = self.distill_loss("kl_div", self_teacher_targets, sparse_scores)
    # Multi-vector (contextual) retrieval
    colbert_self_distill_loss = self.distill_loss("kl_div", self_teacher_targets, colbert_scores)
    # The sparse weight is truncated in the extracted source ("...1 *"); 0.1 is assumed here.
    loss = (dense_self_distill_loss + 0.1 * sparse_self_distill_loss + colbert_self_distill_loss) / 3

Loading the model and printing its state dict shows its main components.

    # Load the model and list its parameters
    from transformers import AutoModel
    import torch

    model = AutoModel.from_pretrained("BAAI/bge-m3", trust_remote_code=True)
    for name, param in model.state_dict().items():
        print(f"{name:30} | shape: {param.shape}")

    # Main components of the model
    # ({i} stands for the layer index, which is missing from the extracted listing)

    # Embedding
    embeddings.word_embeddings.weight | shape: torch.Size([250002, 1024])
    embeddings.position_embeddings.weight | shape: torch.Size([8194, 1024])
    embeddings.token_type_embeddings.weight | shape: torch.Size([1, 1024])
    embeddings.LayerNorm.weight | shape: torch.Size([1024])
    embeddings.LayerNorm.bias | shape: torch.Size([1024])

    # Transformer block * 24
    encoder.layer.{i}.attention.self.query.weight | shape: torch.Size([1024, 1024])
    encoder.layer.{i}.attention.self.query.bias | shape: torch.Size([1024])
    encoder.layer.{i}.attention.self.key.weight | shape: torch.Size([1024, 1024])
    encoder.layer.{i}.attention.self.key.bias | shape: torch.Size([1024])
    encoder.layer.{i}.attention.self.value.weight | shape: torch.Size([1024, 1024])
    encoder.layer.{i}.attention.self.value.bias | shape: torch.Size([1024])
    encoder.layer.{i}.attention.output.dense.weight | shape: torch.Size([1024, 1024])
    encoder.layer.{i}.attention.output.dense.bias | shape: torch.Size([1024])
    encoder.layer.{i}.attention.output.LayerNorm.weight | shape: torch.Size([1024])
    encoder.layer.{i}.attention.output.LayerNorm.bias | shape: torch.Size([1024])
    encoder.layer.{i}.intermediate.dense.weight | shape: torch.Size([4096, 1024])
    encoder.layer.{i}.intermediate.dense.bias | shape: torch.Size([4096])
    encoder.layer.{i}.output.dense.weight | shape: torch.Size([1024, 4096])
    encoder.layer.{i}.output.dense.bias | shape: torch.Size([1024])
    encoder.layer.{i}.output.LayerNorm.weight | shape: torch.Size([1024])
    encoder.layer.{i}.output.LayerNorm.bias | shape: torch.Size([1024])

    # Final Pooling Layer
    pooler.dense.weight | shape: torch.Size([1024, 1024])
    pooler.dense.bias | shape: torch.Size([1024])

The printout shows that the model has three main parts: the embedding layer (with its layer normalization), the Transformer encoder, and the final pooling layer. The Transformer block is repeated 24 times.
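As a cross-check of the shapes printed above, the parameter count can be tallied by hand. The sketch below is plain arithmetic over those shapes only; the helper names are made up for illustration.

```python
def dense_params(in_dim, out_dim):
    """Weight matrix plus bias of one dense layer."""
    return in_dim * out_dim + out_dim


def layer_norm_params(dim):
    """Scale (weight) and shift (bias) of one LayerNorm."""
    return 2 * dim


D_MODEL, FFN_DIM, NUM_LAYERS = 1024, 4096, 24

# word + position + token-type tables, plus embeddings.LayerNorm
embeddings = 250002 * D_MODEL + 8194 * D_MODEL + 1 * D_MODEL + layer_norm_params(D_MODEL)

block = (
    3 * dense_params(D_MODEL, D_MODEL)   # query, key, value
    + dense_params(D_MODEL, D_MODEL)     # attention.output.dense
    + layer_norm_params(D_MODEL)         # attention.output.LayerNorm
    + dense_params(D_MODEL, FFN_DIM)     # intermediate.dense (expand)
    + dense_params(FFN_DIM, D_MODEL)     # output.dense (reduce)
    + layer_norm_params(D_MODEL)         # output.LayerNorm
)

pooler = dense_params(D_MODEL, D_MODEL)
total = embeddings + NUM_LAYERS * block + pooler

print(f"embeddings: {embeddings:,}")
print(f"one block:  {block:,}")
print(f"total:      {total:,}")
```

With these shapes the embedding tables account for a little under half of all parameters, mostly because of the 250002-entry vocabulary.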

Note: for an introduction to the Transformer block, see https://blog.csdn.net/qq_36803941/article/details/138795224

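Model implementation

Embedding layer

The embedding layer is the core component that converts natural language into numeric vectors the model can work with.

bge-m3 has three kinds of embeddings: word embeddings, position embeddings, and token type embeddings.

    def __init__(self, ...):
        # Word embeddings
        self.word_embedding = tf.keras.layers.Embedding(
            input_dim=250002,
            output_dim=1024,
        )
        # Position embeddings
        self.position_embedding = tf.keras.layers.Embedding(
            input_dim=8194,
            output_dim=1024,
        )
        # Token type embeddings
        self.token_type_embedding = tf.keras.layers.Embedding(
            input_dim=1,
            output_dim=1024,
        )
        # The epsilon exponent is truncated in the extracted source ("1e-"); 1e-5 is assumed here.
        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5)
        # Dropout is commented out in the source, and its rate is truncated ("0.").
        # self.dropout = layers.Dropout(rate=...)

tf.gather converts the sequence of token ids into embedding tensors. The three embeddings are summed and then normalized by a LayerNorm layer.

    def call(self, ...):
        # self.weight, self.position_embeddings and self.token_type_embeddings are the
        # weight matrices created with add_weight in the complete BGEM3TensorFlow class.
        inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
        position_embeds = tf.gather(params=self.position_embeddings, indices=position_ids)
        token_type_embeds = tf.gather(params=self.token_type_embeddings, indices=token_type_ids)
        embedding_output = inputs_embeds + position_embeds + token_type_embeds
        embedding_output = self.layerNorm(embedding_output)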
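The lookup, sum, and normalization steps above can be traced on toy data. The following is a minimal pure-Python sketch, not the TensorFlow code: the tables are tiny made-up lists, and the learnable scale and shift of LayerNorm are left out (an assumption made to keep the example short).

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def embed(input_ids, position_ids, token_type_ids, word_table, pos_table, type_table):
    """Gather one row per id from each table, sum the three rows, then normalize."""
    output = []
    for w, p, t in zip(input_ids, position_ids, token_type_ids):
        summed = [a + b + c for a, b, c in zip(word_table[w], pos_table[p], type_table[t])]
        output.append(layer_norm(summed))
    return output


# Hypothetical tables with 4-dimensional embeddings.
word_table = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5], [0.3, 0.3, 0.9, -0.2]]
pos_table = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.1], [0.2, 0.0, 0.2, 0.0], [0.1, 0.1, 0.3, 0.0]]
type_table = [[0.05, 0.05, 0.05, 0.05]]

embedded = embed(
    input_ids=[1, 2],
    position_ids=[2, 3],
    token_type_ids=[0, 0],
    word_table=word_table,
    pos_table=pos_table,
    type_table=type_table,
)
for row in embedded:
    print([round(v, 3) for v in row])
```

Each output row has (numerically) zero mean, which is what the normalization step contributes before the vectors enter the first Transformer block.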

Transformer

Each Transformer block consists of 6 dense layers, 2 layer-normalization layers, and 2 residual connections.

Based on the weight structure shown earlier, a block is made up of the following.

    # ({i} stands for the layer index, which is missing from the extracted listing)
    # Query, Key, Value weights
    encoder.layer.{i}.attention.self.query | shape: torch.Size([1024, 1024])
    encoder.layer.{i}.attention.self.key | shape: torch.Size([1024, 1024])
    encoder.layer.{i}.attention.self.value | shape: torch.Size([1024, 1024])
    # Attention output processing
    encoder.layer.{i}.attention.output.dense | shape: torch.Size([1024, 1024])
    encoder.layer.{i}.attention.output.LayerNorm | shape: torch.Size([1024])
    # Intermediate layer
    encoder.layer.{i}.intermediate.dense | shape: torch.Size([4096, 1024])  # expand
    encoder.layer.{i}.output.dense | shape: torch.Size([1024, 4096])  # reduce
    # Normalization
    encoder.layer.{i}.output.LayerNorm | shape: torch.Size([1024])

Start by defining a custom multi-head attention mechanism.

    def __init__(self, ...):
        # The layer sizes are truncated in the extracted source; they are filled in
        # here from the weight shapes listed above.
        self.wq = tf.keras.layers.Dense(1024)
        self.wk = tf.keras.layers.Dense(1024)
        self.wv = tf.keras.layers.Dense(1024)
        self.dense = tf.keras.layers.Dense(1024)
        # The epsilon exponent is truncated in the source ("1e-"); 1e-5 is assumed here.
        self.attlayerNorm = tf.keras.layers.LayerNormalization(epsilon=1e-5)
        self.intermediate = tf.keras.layers.Dense(4096)
        self.output_dense = tf.keras.layers.Dense(1024)
        self.output_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5)

    def call(self, ...):
        input = embedding_output
        batch_size = tf.shape(input)[0]
        # Query, Key, Value: three independent linear (fully connected) layers
        q = self.wq(input)  # (batch_size, seq_len, d_model)
        k = self.wk(input)  # (batch_size, seq_len, d_model)
        v = self.wv(input)  # (batch_size, seq_len, d_model)
        # 16 heads of depth 1024 // 16 = 64
        q = self.split_heads(q, batch_size, 16, 64)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size, 16, 64)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size, 16, 64)  # (batch_size, num_heads, seq_len_v, depth)

    def split_heads(self, x, batch_size, num_heads, depth):
        x = tf.reshape(x, (batch_size, -1, num_heads, depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])  # (batch_size, num_heads, seq_len, depth)

Q, K, and V are computed by the three dense layers defined above, and the split_heads operation then splits them into multiple heads.
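To make the head-splitting step concrete, here is a small pure-Python sketch for a single sequence (no batch dimension). It mirrors the reshape-then-transpose idea on nested lists; `merge_heads` is an illustrative helper added here to show that the operation is reversible, and the 8-dimensional toy input is made up.

```python
def split_heads(x, num_heads):
    """Split a (seq_len, d_model) matrix into num_heads matrices of shape (seq_len, depth)."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    depth = d_model // num_heads
    return [[row[h * depth:(h + 1) * depth] for row in x] for h in range(num_heads)]


def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head slices of every position."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[t]] for t in range(seq_len)]


# Two positions, d_model = 8, split into 2 heads of depth 4.
x = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [10, 11, 12, 13, 14, 15, 16, 17],
]
heads = split_heads(x, num_heads=2)
print(heads[0])  # first half of every position
print(heads[1])  # second half of every position
restored = merge_heads(heads)
```

Each head sees every position but only its own slice of the feature dimension, which is why attention can later be computed independently per head.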

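Scaled dot-product attention is then applied. The query (Q), key (K), and value (V) are used to compute how strongly the positions in the sequence relate to each other and to generate context vectors, which lets the model capture long-range dependencies.

    # Scaled dot-product attention scores (depth = 1024 // 16 = 64)
    dk = tf.cast(math.sqrt(1024 // 16), tf.float32)
    attention_scores = tf.matmul(q, k, transpose_b=True)  # (batch_size, num_heads, seq_len_q, seq_len_k)
    attention_scores = tf.divide(attention_scores, dk)
    attention_probs = tf.nn.softmax(attention_scores + 1e-9, axis=-1)

    # Attention output
    attention_output = tf.matmul(attention_probs, v)  # (batch_size, num_heads, seq_len_q, depth)
    attention_output = tf.transpose(attention_output, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
    attention_output = tf.reshape(attention_output, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

    # Final projection through a fully connected layer
    attention_output = self.dense(attention_output)  # (batch_size, seq_len_q, d_model)
    # attention_output = self.dropout(inputs=attention_output, training=training)

    # First residual connection
    attention_output = self.attlayerNorm(inputs=attention_output + input)

The final output value is obtained through the dense layer.

The first residual connection adds the initial input embeddings (or the hidden states of the previous block) to the attention output and then applies normalization.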
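The attention computation for a single head can be followed step by step in plain Python. This is a minimal sketch on made-up 2-dimensional vectors, assuming an additive mask (0 for visible keys, a large negative number for padded keys); it uses the usual max-subtraction form of softmax for numerical stability.

```python
import math


def softmax(row):
    """Numerically stable softmax: subtract the row maximum before exponentiating."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head attention. q: (len_q, depth), k: (len_k, depth), v: (len_k, depth_v).
    mask is an optional list of additive penalties, one per key position."""
    depth = len(q[0])
    outputs, all_probs = [], []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(depth) for k_row in k]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask)]
        probs = softmax(scores)
        all_probs.append(probs)
        outputs.append([sum(p * v_row[c] for p, v_row in zip(probs, v)) for c in range(len(v[0]))])
    return outputs, all_probs


# Toy data: 2 queries, 3 keys/values; the third key is treated as padding.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [0.0, 0.0, -10000.0]

context, probs = scaled_dot_product_attention(q, k, v, mask)
for row in probs:
    print([round(p, 4) for p in row])
```

Adding the same constant to every logit in a row does not change the softmax result, so the masking term only matters because it differs between visible and padded positions.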

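Feed-forward network (FFN)

The second major component of the Transformer block is the feed-forward network. Each neuron's output is passed only to neurons in the next layer, so information flows in one direction, from the input layer to the output layer. This kind of network is well suited to problems such as classification and regression.

    # The input passes through an intermediate layer four times the size of the output layer.
    intermediate_output = self.intermediate(attention_output)
    intermediate_output = self.gelu_approx(intermediate_output)
    layer_output = self.output_dense(intermediate_output)
    layer_output = self.output_dropout(layer_output, training=training)

    # Second residual connection, followed by normalization
    output = layer_output + attention_output
    output = self.output_norm(output)

    # GELU activation. The source calls this an approximation; the formula is the erf-based form.
    def gelu_approx(self, x):
        x = tf.convert_to_tensor(x)
        cdf = 0.5 * (1.0 + tf.math.erf(x / tf.cast(tf.sqrt(2.0), x.dtype)))
        return x * cdf

bge-m3 repeats this structure 24 times.

    encoder_layers = []
    for i in range(24):
        layer = TransformerBlock(
            d_model=1024,
            num_heads=16,
            intermediate_size=4096,
            dropout_rate=self.dropout_rate,
            name=f"encoder.layer.{i}",
        )
        encoder_layers.append(layer)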
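For reference, the erf-based GELU used in the feed-forward network can be compared with the widely used tanh approximation. The sketch below uses only Python's math module; `gelu_tanh` is included purely for comparison and is not part of the article's code.

```python
import math


def gelu_erf(x):
    """GELU written with the Gaussian CDF: x * 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Common tanh-based approximation of GELU (shown for comparison only)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


samples = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
max_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in samples)

for x in samples:
    print(f"x={x:5.1f}  erf form={gelu_erf(x):9.6f}  tanh form={gelu_tanh(x):9.6f}")
print(f"largest difference on these samples: {max_gap:.6f}")
```

The two forms agree closely on these sample points, so either is a reasonable choice when porting weights, as long as the same form is used consistently.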

Complete code

Transformer class

    import math
    from typing import List, Union

    import numpy as np
    import tensorflow as tf
    from transformers import AutoConfig, AutoTokenizer

    # The epsilon exponent is truncated in the extracted source ("1e-"); 1e-5 is assumed here.
    LAYER_NORM_EPS = 1e-5


    class TransformerBlock(tf.keras.layers.Layer):
        """Transformer encoder block.

        Args:
            d_model: model dimension
            num_heads: number of attention heads
            intermediate_size: size of the intermediate layer
            dropout_rate: dropout rate
            **kwargs: other arguments passed to the parent class
        """

        def __init__(self, d_model, num_heads, intermediate_size, dropout_rate=0.1, **kwargs):
            super().__init__(**kwargs)
            self.attention = MultiHeadAttention(d_model, num_heads, dropout_rate)
            # Defined in the source but not used in call(): the first residual
            # connection and LayerNorm are applied inside MultiHeadAttention.
            self.attention_norm = tf.keras.layers.LayerNormalization(epsilon=LAYER_NORM_EPS)
            self.attention_dropout = tf.keras.layers.Dropout(dropout_rate)
            self.intermediate = tf.keras.layers.Dense(intermediate_size, name="intermediate.dense")
            self.output_dense = tf.keras.layers.Dense(d_model, name="output.dense")
            self.output_dropout = tf.keras.layers.Dropout(dropout_rate)
            self.output_norm = tf.keras.layers.LayerNormalization(epsilon=LAYER_NORM_EPS)

        def gelu_approx(self, x):
            """GELU activation (erf-based form)."""
            x = tf.convert_to_tensor(x)
            cdf = 0.5 * (1.0 + tf.math.erf(x / tf.cast(tf.sqrt(2.0), x.dtype)))
            return x * cdf

        def call(self, x, attention_mask=None, training=False):
            """Forward pass; returns the output tensor of the block."""
            # Self-attention (includes the first residual connection and LayerNorm)
            attention_output, attention_weights = self.attention(
                x, mask=attention_mask, training=training
            )
            # Feed-forward network with GELU activation
            intermediate_output = self.intermediate(attention_output)
            intermediate_output = self.gelu_approx(intermediate_output)
            layer_output = self.output_dense(intermediate_output)
            if training:
                layer_output = self.output_dropout(layer_output, training=training)
            # Second residual connection and normalization
            output = layer_output + attention_output
            output = self.output_norm(output)
            return output

Multi-head attention

    class MultiHeadAttention(tf.keras.layers.Layer):
        """Multi-head attention layer.

        Args:
            d_model: model dimension
            num_heads: number of attention heads
            dropout_rate: dropout rate, default 0.1
            **kwargs: other arguments passed to the parent class

        Raises:
            ValueError: if d_model is not divisible by num_heads
        """

        def __init__(self, d_model, num_heads, dropout_rate=0.1, **kwargs):
            super().__init__(**kwargs)
            if d_model % num_heads != 0:
                raise ValueError(
                    f"d_model ({d_model}) must be divisible by num_heads ({num_heads})"
                )
            self.num_heads = num_heads
            self.d_model = d_model
            self.depth = d_model // num_heads  # size of each head
            # Dense layers for Query, Key, Value
            self.wq = tf.keras.layers.Dense(d_model)
            self.wk = tf.keras.layers.Dense(d_model)
            self.wv = tf.keras.layers.Dense(d_model)
            # Output projection
            self.dense = tf.keras.layers.Dense(d_model)
            # LayerNorm applied after the attention residual connection
            self.attlayerNorm = tf.keras.layers.LayerNormalization(epsilon=LAYER_NORM_EPS)
            self.dropout = tf.keras.layers.Dropout(dropout_rate)

        def stable_softmax(self, logits, axis=None, name=None):
            """Softmax with a small constant added to the logits, as in the source."""
            return tf.nn.softmax(logits=logits + 1e-9, axis=axis, name=name)

        def split_heads(self, x, batch_size):
            """Split the input into heads: (batch_size, num_heads, seq_len, depth)."""
            x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
            return tf.transpose(x, perm=[0, 2, 1, 3])

        def call(self, inputs, mask=None, training=False):
            """Forward pass; returns the output tensor and the attention probabilities."""
            batch_size = tf.shape(inputs)[0]
            # Query, Key, Value
            q = self.wq(inputs)  # (batch_size, seq_len, d_model)
            k = self.wk(inputs)  # (batch_size, seq_len, d_model)
            v = self.wv(inputs)  # (batch_size, seq_len, d_model)
            q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
            k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
            v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

            # Scaled dot-product attention
            sqrt_att_head_size = math.sqrt(self.depth)
            attention_scores = tf.matmul(q, k, transpose_b=True)  # (batch_size, num_heads, seq_len_q, seq_len_k)
            dk = tf.cast(sqrt_att_head_size, tf.float32)
            attention_scores = tf.divide(attention_scores, dk)
            if mask is not None:
                attention_scores = tf.add(attention_scores, mask)
            attention_probs = self.stable_softmax(attention_scores, axis=-1)
            attention_probs = self.dropout(attention_probs, training=training)

            # Attention result
            attention_output = tf.matmul(attention_probs, v)  # (batch_size, num_heads, seq_len_q, depth)
            attention_output = tf.transpose(attention_output, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
            attention_output = tf.reshape(attention_output, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

            # Dense layer
            output = self.dense(attention_output)  # (batch_size, seq_len_q, d_model)
            if training:
                output = self.dropout(output, training=training)
            # Residual connection and normalization
            output = self.attlayerNorm(output + inputs)
            return output, attention_probs

bge-m3 class

    class BGEM3TensorFlow(tf.keras.Model):
        """TensorFlow implementation of the BGE-M3 model.

        Args:
            model_name: name of the pretrained model
            normalize_embeddings: whether to L2-normalize the embeddings
            use_fp16: whether to use half precision
            query_instruction_for_retrieval: instruction for retrieval queries
            query_instruction_format: format of the query instruction
            pooling_method: pooling method
            trust_remote_code: whether to trust remote code
            cache_dir: cache directory
            colbert_dim: ColBERT dimension
            batch_size: batch size
            query_max_length: maximum query length
            passage_max_length: maximum passage length
            return_dense: whether to return dense vectors
            return_sparse: whether to return sparse vectors
            return_colbert_vecs: whether to return ColBERT vectors
            dropout_rate: dropout rate
        """

        def __init__(
            self,
            model_name,
            normalize_embeddings=False,
            use_fp16=True,
            query_instruction_for_retrieval=None,
            query_instruction_format="{}{}",
            pooling_method="cls",
            trust_remote_code=False,
            cache_dir=None,
            colbert_dim=-1,
            batch_size=256,
            query_max_length=512,
            passage_max_length=512,
            return_dense=True,
            return_sparse=False,
            return_colbert_vecs=False,
            # Truncated in the extracted source ("0."); 0.1 is assumed to match the other classes.
            dropout_rate=0.1,
        ):
            super().__init__(name="bge-m3-tensorflow")
            self.model_name = model_name
            self.normalize_embeddings = normalize_embeddings
            self.use_fp16 = use_fp16
            self.query_instruction_for_retrieval = query_instruction_for_retrieval
            self.query_instruction_format = query_instruction_format
            self.pooling_method = pooling_method
            self.batch_size = batch_size
            self.query_max_length = query_max_length
            self.passage_max_length = passage_max_length
            self.return_dense = return_dense
            self.return_sparse = return_sparse
            self.return_colbert_vecs = return_colbert_vecs
            self.dropout_rate = dropout_rate
            self.padding_idx = 1

            # Load the configuration
            self.config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)

            # Model parameters
            self.d_model = self.config.hidden_size
            self.num_heads = self.config.num_attention_heads
            self.num_layers = self.config.num_hidden_layers
            self.vocab_size = self.config.vocab_size

            # Build the components
            self._build_embeddings()
            self._build_encoder_layers()
            self._build_pooler()
            self._build_colbert()

            # Tokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(
                model_name, trust_remote_code=trust_remote_code, cache_dir=cache_dir
            )

        def shape_list(self, tensor: Union[tf.Tensor, np.ndarray]) -> List[int]:
            """Return the shape of a tensor or array as a list, using dynamic sizes where needed."""
            if isinstance(tensor, np.ndarray):
                return list(tensor.shape)
            dynamic = tf.shape(tensor)
            if tensor.shape == tf.TensorShape(None):
                return dynamic
            static = tensor.shape.as_list()
            return [dynamic[i] if s is None else s for i, s in enumerate(static)]

        def create_position_ids_from_input_ids(self, input_ids, past_key_values_length=0, padding_idx=1):
            """Create position ids from input ids; padded tokens keep the padding position.

            The default padding_idx is truncated in the source; 1 matches self.padding_idx.
            """
            mask = tf.cast(tf.math.not_equal(input_ids, padding_idx), dtype=input_ids.dtype)
            incremental_indices = (tf.math.cumsum(mask, axis=1) + past_key_values_length) * mask
            return incremental_indices + padding_idx

        def _build_embeddings(self):
            """Build the embedding layer following the XLMRoberta structure."""
            # The stddev literal is truncated in the source ("0."); the config's
            # initializer_range is used here instead (assumption).
            init_std = self.config.initializer_range
            with tf.name_scope("word_embeddings"):
                self.weight = self.add_weight(
                    name="embeddings",
                    shape=[self.vocab_size, self.d_model],
                    initializer=tf.keras.initializers.TruncatedNormal(stddev=init_std),
                )
            with tf.name_scope("position_embeddings"):
                self.position_embeddings = self.add_weight(
                    name="embeddings",
                    shape=[self.config.max_position_embeddings, self.d_model],
                    initializer=tf.keras.initializers.TruncatedNormal(stddev=init_std),
                )
            with tf.name_scope("token_type_embeddings"):
                self.token_type_embeddings = self.add_weight(
                    name="embeddings",
                    shape=[self.config.type_vocab_size, self.d_model],
                    initializer=tf.keras.initializers.TruncatedNormal(stddev=init_std),
                )
            # Normalization
            self.layerNorm = tf.keras.layers.LayerNormalization(
                epsilon=self.config.layer_norm_eps, name="LayerNorm"
            )
            # Dropout
            self.dropout = tf.keras.layers.Dropout(rate=self.dropout_rate)

        def _build_encoder_layers(self):
            """Build the Transformer encoder layers."""
            self.encoder_layers = []
            for i in range(self.num_layers):
                layer = TransformerBlock(
                    d_model=self.d_model,
                    num_heads=self.num_heads,
                    intermediate_size=self.config.intermediate_size,
                    dropout_rate=self.dropout_rate,
                    name=f"encoder.layer.{i}",
                )
                self.encoder_layers.append(layer)

        def _build_pooler(self):
            """Build the pooling layer."""
            self.pooler = tf.keras.layers.Dense(
                self.d_model,
                activation="tanh",
                kernel_initializer=tf.keras.initializers.TruncatedNormal(
                    stddev=self.config.initializer_range
                ),
                name="pooler.dense",
            )

        def _build_colbert(self):
            """Build the ColBERT projection layer."""
            self.colbert_linear = tf.keras.layers.Dense(units=self.d_model)

        def call(self, inputs, training=False, output_hidden_states=False):
            """Forward pass.

            Args:
                inputs: dict containing input_ids and optionally attention_mask,
                    token_type_ids, position_ids
                training: whether the model is in training mode
                output_hidden_states: whether to return all hidden states

            Returns:
                dict with the dense vectors, ColBERT vectors and last hidden state
            """
            input_ids = tf.cast(inputs["input_ids"], tf.int32)
            if input_ids is not None:
                inputs_embeds = tf.gather(params=self.weight, indices=input_ids)

            attention_mask = inputs.get("attention_mask", None)
            token_type_ids = inputs.get("token_type_ids", None)
            position_ids = inputs.get("position_ids", None)

            input_shape = self.shape_list(inputs_embeds)[:-1]
            if attention_mask is None:
                # Not in the source: attend to every position when no mask is given.
                attention_mask = tf.ones(input_shape, dtype=tf.int32)
            if token_type_ids is None:
                token_type_ids = tf.fill(dims=input_shape, value=0)
            if position_ids is None:
                if input_ids is not None:
                    # Create position ids from the input token ids.
                    # Padded tokens stay padded.
                    position_ids = self.create_position_ids_from_input_ids(
                        input_ids=input_ids, padding_idx=self.padding_idx
                    )
                else:
                    position_ids = tf.expand_dims(
                        tf.range(
                            start=self.padding_idx + 1,
                            limit=input_shape[-1] + self.padding_idx + 1,
                        ),
                        axis=0,
                    )

            position_embeds = tf.gather(params=self.position_embeddings, indices=position_ids)
            token_type_embeds = tf.gather(params=self.token_type_embeddings, indices=token_type_ids)

            # Sum the embeddings
            embedding_output = inputs_embeds + position_embeds + token_type_embeds
            # Normalization and dropout
            embedding_output = self.layerNorm(embedding_output)
            if training:
                embedding_output = self.dropout(embedding_output, training=training)

            # Turn the 0/1 attention mask into an additive mask: 0 for real tokens,
            # -10000 for padding.
            attention_mask_origin = attention_mask
            attention_mask_shape = self.shape_list(attention_mask)
            extended_attention_mask = tf.reshape(
                attention_mask, (attention_mask_shape[0], 1, 1, attention_mask_shape[1])
            )
            extended_attention_mask = tf.cast(extended_attention_mask, dtype=embedding_output.dtype)
            one_cst = tf.constant(1.0, dtype=embedding_output.dtype)
            ten_thousand_cst = tf.constant(-10000.0, dtype=embedding_output.dtype)
            extended_attention_mask = tf.multiply(
                tf.subtract(one_cst, extended_attention_mask), ten_thousand_cst
            )
            attention_mask = extended_attention_mask

            all_hidden_states = [embedding_output] if output_hidden_states else []
            hidden_states = embedding_output

            # Encoder layers
            for layer in self.encoder_layers:
                hidden_states = layer(
                    hidden_states, attention_mask=attention_mask, training=training
                )
                if output_hidden_states:
                    all_hidden_states.append(hidden_states)

            # Pooling
            if self.pooling_method == "mean":
                pooled_output = tf.reduce_mean(hidden_states, axis=1)
            else:
                # Default: cls
                pooled_output = hidden_states[:, 0, :]

            # In the source the branch for return_dense leaves pooled_output unchanged:
            # self.pooler is built but not applied to the dense vector.

            # Normalize the embeddings if requested
            if self.normalize_embeddings:
                pooled_output = tf.nn.l2_normalize(pooled_output, axis=-1)

            # ColBERT vectors: project every token except the first and zero out padding
            colbert_vecs = self.colbert_linear(hidden_states[:, 1:])
            colbert_vecs = colbert_vecs * tf.cast(
                attention_mask_origin[:, 1:][:, :, None], dtype=tf.float32
            )

            outputs = {
                "dense_vecs": pooled_output,
                "colbert_vecs": colbert_vecs,
                "last_hidden_state": hidden_states,
            }
            if output_hidden_states:
                outputs["hidden_states"] = all_hidden_states
            return outputs
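Two preprocessing steps in the forward pass are easy to misread: building position ids that skip padding, and turning the 0/1 attention mask into an additive mask. The sketch below reproduces both on one toy sequence in plain Python. The token ids are made up, and `padding_idx=1` and the `-10000.0` penalty follow the values used in the class above.

```python
def create_position_ids(input_ids, padding_idx=1):
    """Non-padding tokens are numbered padding_idx + 1, padding_idx + 2, ...
    Padding tokens all keep the position padding_idx."""
    position_ids, seen = [], 0
    for token in input_ids:
        if token == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


def additive_mask(attention_mask, penalty=-10000.0):
    """Map 1 (real token) to 0.0 and 0 (padding) to a large negative penalty."""
    return [(1.0 - m) * penalty for m in attention_mask]


# Hypothetical sequence of four real tokens followed by two padding tokens (id 1).
input_ids = [0, 581, 3437, 2, 1, 1]
attention_mask = [1, 1, 1, 1, 0, 0]

position_ids = create_position_ids(input_ids)
mask_values = additive_mask(attention_mask)
print(position_ids)
print(mask_values)
```

Because real tokens start at position padding_idx + 1 = 2, the position table needs two more rows than the longest supported sequence. The additive mask is added to the attention scores before softmax, so padded keys receive almost zero attention weight.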

