centos7: NVIDIA driver installation

Item                    Value
Server model            Rack Mount Chassis NF5280M6
CPU                     Intel® Xeon® Silver 4310 CPU @ 2.10GHz * 2
OS version              CentOS 7
Kernel version          3.10.0-….el7.x86_64
GPU model               NVIDIA A100 40G * 4
NVIDIA driver version   525.85.05
CUDA version            12.0
Docker version          20.10.9
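The inventory above can be collected from a running host. The following is a minimal sketch, not the author's method: `show_inventory` is a hypothetical helper name, and it assumes `nvidia-smi` and `docker` are only queried when they are already on the PATH.

```shell
#!/usr/bin/env bash
# Sketch: print the facts listed in the table above.
# show_inventory is a hypothetical helper name, not part of any tool.

show_inventory() {
    echo "kernel: $(uname -r)"
    # /etc/centos-release only exists on CentOS hosts
    if [ -r /etc/centos-release ]; then
        echo "os: $(cat /etc/centos-release)"
    fi
    if command -v docker >/dev/null 2>&1; then
        echo "docker: $(docker --version)"
    fi
    # One line per GPU: name, total memory, driver version
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total,driver_version \
            --format=csv,noheader | sed 's/^/gpu: /'
    fi
}

show_inventory
```

Before the driver is installed the `gpu:` lines are simply absent, which makes the script usable both before and after the steps in this article.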
Base system (skip this part if it is already installed)
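Install the base software

    yum update
    yum -y install openssh-server openssh-client apt-utils freeipmi ipmitool sshpass \
        ethtool zip unzip nano less git netplan.io iputils-ping mtr ipvsadm smartmontools \
        python3-pip socat conntrack libvirt-clients libnuma-dev ctorrent nvme-cli gcc-12 g++-12 \
        vim wget apt git unzip zip ntp ntpdate lrzsz lftp tree bash-completion elinks dos2unix tmux jq
    yum -y install nmap net-tools mtr traceroute tcptraceroute aptitude htop iftop hping3 \
        fping nethogs sshuttle tcpdump figlet stress iperf iperf3 dnsutils curl \
        linux-tools-generic linux-cloud-tools-generic
    yum groupinstall -y "Development Tools"
    # RPM variant of the packagecloud setup script, which is what yum needs
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
    yum install git-lfs
    git lfs install

Note: several names in the lists above (for example apt-utils, netplan.io, libnuma-dev, aptitude, linux-tools-generic) are Debian/Ubuntu package names. On CentOS 7, yum typically reports them as unavailable and installs the remaining packages.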
Raise the file descriptor limits

    echo "ulimit -SHn 655350" >> /etc/profile
    echo "fs.file-max = 655350" >> /etc/sysctl.conf
    echo "root soft nofile 655350" >> /etc/security/limits.conf
    echo "root hard nofile 655350" >> /etc/security/limits.conf
    echo "* soft nofile 655350" >> /etc/security/limits.conf
    echo "* hard nofile 655350" >> /etc/security/limits.conf
    source /etc/profile

Improve the shell history

    cat >> /etc/profile << 'EOF'
    export HISTTIMEFORMAT="%Y-%m-%d %H:%M:%S `whoami` "
    export HISTFILESIZE=50000
    export HISTSIZE=50000
    EOF
    source /etc/profile
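To confirm that the new limits are active, the values can be read back in a fresh login shell. This is a minimal sketch under stated assumptions: `check_limit` and `report_limits` are hypothetical helper names, and 655350 is the target value used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: check that a limit is at least the expected value.
# check_limit and report_limits are hypothetical helper names.

check_limit() {
    # $1 = label, $2 = actual value, $3 = expected minimum
    local label=$1 actual=$2 expected=$3
    if [ "$actual" = "unlimited" ] || [ "$actual" -ge "$expected" ]; then
        echo "OK   $label=$actual"
    else
        echo "LOW  $label=$actual (expected >= $expected)"
        return 1
    fi
}

report_limits() {
    # Soft and hard open-file limits of the current shell
    check_limit "soft nofile" "$(ulimit -Sn)" 655350
    check_limit "hard nofile" "$(ulimit -Hn)" 655350
    # System-wide maximum number of open file handles
    check_limit "fs.file-max" "$(cat /proc/sys/fs/file-max)" 655350
}
```

Run `report_limits` after logging in again. The `limits.conf` entries apply to new sessions only, and the `fs.file-max` line written to `/etc/sysctl.conf` takes effect once `sysctl -p` has been run or the host has rebooted.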
Tune the kernel parameters

    cp /etc/sysctl.conf /etc/sysctl.conf.bak
    vi /etc/sysctl.conf

Add the following settings to /etc/sysctl.conf:

    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_abort_on_overflow = 1
    net.ipv4.tcp_max_tw_buckets = 6000
    net.ipv4.tcp_sack = 1
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.tcp_rmem = 4096 87380 4194304
    net.ipv4.tcp_wmem = 4096 66384 4194304
    net.ipv4.tcp_mem = 94500000 915000000 927000000
    net.core.optmem_max = 81920
    net.core.wmem_default = 8388608
    net.core.wmem_max = 16777216
    net.core.rmem_default = 8388608
    net.core.rmem_max = 16777216
    net.ipv4.tcp_max_syn_backlog = 1020000
    net.core.netdev_max_backlog = 862144
    net.core.somaxconn = 262144
    net.ipv4.tcp_max_orphans = 327680
    net.ipv4.tcp_timestamps = 0
    net.ipv4.tcp_synack_retries = 1
    net.ipv4.tcp_syn_retries = 1
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_fin_timeout = 15
    net.ipv4.tcp_keepalive_time = 30
    net.ipv4.ip_local_port_range = 1024 65535
    net.netfilter.nf_conntrack_tcp_timeout_established = 180
    net.netfilter.nf_conntrack_max = 1048576
    net.nf_conntrack_max = 1048576
    fs.file-max = 655350

Load the conntrack module (the nf_conntrack keys do not exist until it is loaded), then apply the settings:

    modprobe nf_conntrack
    sysctl -p /etc/sysctl.conf
    sysctl -w net.ipv4.route.flush=1
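After `sysctl -p`, it can be useful to confirm that the running kernel actually holds the values from the file. The sketch below is one way to do that; `sysctl_expected` and `sysctl_diff` are hypothetical helper names, and the parser assumes simple `key = value` lines with `#` comments, as in the list above.

```shell
#!/usr/bin/env bash
# Sketch: compare a sysctl.conf-style file against the running kernel.
# sysctl_expected and sysctl_diff are hypothetical helper names.

# Print "key value..." for every setting in the file given as $1,
# skipping comments and blank lines and collapsing whitespace.
sysctl_expected() {
    sed -e 's/#.*//' -e '/^[[:space:]]*$/d' "$1" |
    awk -F= '{
        key = $1; val = $2
        gsub(/^[ \t]+|[ \t]+$/, "", key)
        gsub(/[ \t]+/, " ", val)
        gsub(/^ | $/, "", val)
        print key, val
    }'
}

# Print one MISMATCH line for every key whose live value differs.
sysctl_diff() {
    sysctl_expected "$1" | while read -r key want; do
        have=$(sysctl -n "$key" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')
        if [ "$have" != "$want" ]; then
            echo "MISMATCH $key: want '$want', have '${have:-unset}'"
        fi
    done
}
```

Calling `sysctl_diff /etc/sysctl.conf` prints nothing when every key matches. Keys reported as unset usually mean the owning module (for example `nf_conntrack`) is not loaded.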
Deploying the GPU driver, CUDA, and related components

Disable nouveau by creating the blacklist configuration manually

    bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
    bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
    echo "options nouveau modeset=0" | tee -a /etc/modprobe.d/nouveau-kms.conf
    # Back up /boot
    cp -r /boot/ /root/
    dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
    # Reboot, then verify that nouveau is disabled
    reboot
    lsmod | grep nouveau

After the reboot, open a terminal and run the lsmod command above. If it prints nothing, nouveau has been disabled correctly.

Install the NVIDIA driver

Browse https://download.nvidia.com/XFree86/Linux-x86_64 to find the recommended driver version (you do not have to choose the recommended one).

    # Import the ELRepo public key
    sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    # Install the ELRepo repository
    sudo yum install -y https://www.elrepo.org/elrepo-release-7.0-….el7.elrepo.noarch.rpm
    sudo yum makecache
    lspci | grep -i nvidia

Install the kernel development packages that match the running kernel, so that the driver build does not fail.

    # Install yum-utils (provides yum-config-manager) so that older CentOS 7 kernel packages can be found
    yum install -y yum-utils
    # Enable the vault repository
    yum-config-manager --enable vault
    yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
    wget https://download.nvidia.com/XFree86/Linux-x86_64/525.85.05/NVIDIA-Linux-x86_64-525.85.05.run
    chmod +x NVIDIA-Linux-x86_64-525.85.05.run
    bash NVIDIA-Linux-x86_64-525.85.05.run --no-opengl-files --ui=none --no-questions --accept-license

When the installation finishes, run nvidia-smi to check the result:

    [root@gnode196 ~]# nvidia-smi
    Tue Jan 27 16:48:41 2026
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-PCI...  Off  | 00000000:4B:00.0 Off |                    0 |
    | N/A   32C    P0    36W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-PCI...  Off  | 00000000:65:00.0 Off |                    0 |
    | N/A   33C    P0    36W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-PCI...  Off  | 00000000:CA:00.0 Off |                    0 |
    | N/A   31C    P0    38W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-PCI...  Off  | 00000000:E3:00.0 Off |                    0 |
    | N/A   32C    P0    39W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

Install CUDA

The nvidia-smi output above shows that the CUDA version supported by this driver is
12.0. Go to https://developer.nvidia.com/cuda-toolkit-archive and download CUDA 12.0.

    wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda_12.0.0_525.60.13_linux.run
    bash cuda_12.0.0_525.60.13_linux.run --toolkit --silent --override

Add the environment variables and verify

Append the CUDA environment variables to /etc/profile:

    cat >> /etc/profile << 'EOF'
    export PATH=/usr/local/cuda-12.0/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
    EOF
    source /etc/profile
    nvcc -V

Install nvidia-docker

    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
        sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    yum install -y nvidia-container-toolkit

Verify the installation

    nvidia-container-cli --version
    nvidia-ctk --version

Configure Docker to use the NVIDIA runtime

    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker

Pin the kernel

    # yum versionlock is provided by the yum-plugin-versionlock package
    yum versionlock add kernel-3.10.0-….el7.x86_64
    yum versionlock add kernel-core-3.10.0-….el7.x86_64
    yum versionlock add kernel-modules-3.10.0-….el7.x86_64
    echo "exclude=kernel*" >> /etc/yum.conf

CPU/GPU performance settings

    # Enable Persistence Mode
    nvidia-smi -pm 1
    # Enable ECC memory mode (a changed ECC setting takes effect after a reboot)
    nvidia-smi -e ENABLED
    # Pin the CPU to the performance governor
    yum install -y kernel-tools
    cpupower idle-set -D 0
    cpupower frequency-set -g performance
    echo "cpupower frequency-set -g performance" >> /etc/rc.local
    chmod +x /etc/rc.d/rc.local
    # GPU tuning: lock the graphics clock to its maximum (1410 MHz)
    nvidia-smi -lgc 1410,1410
    # Disable PCIe ASPM power saving
    grubby --update-kernel=ALL --args="pcie_aspm=off"

Deploy HPC-X (choose the version to download at the bottom of https://developer.nvidia.com/networking/hpc-x)

    wget "http://www.mellanox.com/page/hpcx_eula?mrequest=downloads&mtype=hpc&mver=hpc-x&mname=v2.1….1/hpcx-v2.1….1-gcc-inbox-redhat7-cuda12-x86_64.tbz"
    tar -xf hpcx-v2.1….1-gcc-inbox-redhat7-cuda12-x86_64.tbz -C /opt/
    ln -s /opt/hpcx-v2.1….1-gcc-inbox-redhat7-cuda12-x86_64 /opt/hpcx
    export HPCX_HOME=/opt/hpcx
    . $HPCX_HOME/hpcx-init.sh
    hpcx_load

NCCL and gpu-burn tests

Install NCCL (static build)

    mkdir -p /root/nccl/
    cd /root/nccl
    git clone https://github.com/NVIDIA/nccl.git
    cd nccl
    # -j sets the number of parallel build jobs
    make -j24 src.build CUDA_HOME=/usr/local/cuda PATH=$PATH:/usr/local/cuda/bin LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Install nccl-tests (static build)

    mkdir -p /root/nccl/
    cd /root/nccl
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    which mpirun
    # Prints /opt/hpcx/ompi/bin/mpirun; MPI_HOME is its prefix, /opt/hpcx/ompi
    cd /root/nccl/nccl-tests
    PATH=$PATH:/usr/local/cuda/bin \
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 \
    LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda/lib64 \
    make -j30 CUDA_HOME=/usr/local/cuda NCCL_HOME=/root/nccl/nccl/build \
        NCCL_LIBDIR=/root/nccl/nccl/build/lib NCCL_STATIC=1 \
        NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"

NCCL test

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/nccl/nccl/build/lib
    ./build/all_reduce_perf -b 8 -e 35G -f 2 -g 4 -n 50

Test parameters

    -b <size>     starting size, e.g. -b 8 or -b 1M
    -e <size>     ending size, e.g. -e 10G
    -f <factor>   factor by which the size is multiplied at each step, e.g. -f 2 doubles it
    -g <count>    number of GPUs to use, e.g. -g 1 or -g 4
    -n <count>    number of test iterations, e.g. -n 100 (default 20)
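To see which message sizes a given -b/-e/-f combination covers, the sweep can be reproduced with plain shell arithmetic. This is a sketch under two assumptions that are not stated in the article: that the K/M/G suffixes are powers of 1024, and that the size is multiplied by the factor until it exceeds the end size. `to_bytes` and `sweep_sizes` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: list the message sizes implied by -b, -e and -f.
# to_bytes and sweep_sizes are hypothetical helper names.

# Convert a size such as 8, 1M or 10G to bytes (assumes powers of 1024).
to_bytes() {
    local s=$1 num unit
    num=${s%[KkMmGg]}      # strip an optional unit suffix
    unit=${s#"$num"}       # whatever was stripped, possibly empty
    case $unit in
        [Kk]) echo $(( num * 1024 )) ;;
        [Mm]) echo $(( num * 1024 * 1024 )) ;;
        [Gg]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        "")   echo "$num" ;;
    esac
}

# Print every size from begin ($1) to end ($2), multiplying by factor ($3).
sweep_sizes() {
    local size end factor=$3
    size=$(to_bytes "$1")
    end=$(to_bytes "$2")
    while [ "$size" -le "$end" ]; do
        echo "$size"
        size=$(( size * factor ))
    done
}

sweep_sizes 8 35G 2 | tail -n 1   # prints 34359738368
```

Under these assumptions, `-b 8 -e 35G -f 2` yields 33 sizes and stops at 32 GiB (34359738368 bytes), because the next doubling would exceed 35G.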
Example runs

    # Single-GPU test, from 8 bytes to 10 GB, doubling at each step
    ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1
    # 4-GPU test
    ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 4
    # Larger data volume (35 GB), 4 GPUs
    ./build/all_reduce_perf -b 8 -e 35G -f 2 -g 4
    # More iterations, for more stable results
    ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 4 -n 100
    # Quick test over a small size range
    ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 4

gpu-burn

    git clone https://github.com/wilicc/gpu-burn.git

Edit the Makefile

    cd gpu-burn
    vi Makefile

Change the rule (the recipe line starts with a tab)

    gpu_burn: gpu_burn-drv.o compare.ptx
            g++ -o $@ $< -O3 ${LDFLAGS}

to

    gpu_burn: gpu_burn-drv.o compare.ptx
            g++ -o $@ $< -O3 ${LDFLAGS} -static-libgcc -static-libstdc++

Build and test

Build after making the change. Because libgcc and libstdc++ are linked statically, the resulting binary can be copied to other machines and run there directly.

    yum install -y libstdc++-static
    make clean
    make
    ./gpu_burn 3600    # test duration in seconds

Model deployment

Downloading from Hugging Face

The commands in this part use apt-get, so they target a Debian/Ubuntu environment.

    apt-get -y install git-lfs
    git lfs install
    apt-get install python3 python-is-python3
    python3 -m pip install --upgrade pip==20.….4 -i https://mirrors.aliyun.com/pypi/simple/
    pip3.12 config set global.index-url https://pypi.org/simple/
    pip3.12 install -U huggingface_hub --break-system-packages

Log in to Hugging Face

    huggingface-cli login
    # hf auth login

The latest huggingface_hub release at the time of writing (1.….3) has renamed the CLI command from huggingface-cli to hf.
The old huggingface-cli command is deprecated in new versions, as the warning in the session below shows:

    ⚠️ Warning: 'huggingface-cli login' is deprecated. Use 'hf auth login' instead.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
    Enter your token (input will not be visible):
    Add token as git credential? (Y/n) y
    Token is valid (permission: fineGrained).
    The token `deploy` has been saved to /root/.cache/huggingface/stored_tokens
    [root@gnode196 ~]# git config --global credential.helper store
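Because the login command depends on which huggingface_hub version is installed, a script can pick whichever CLI is present. The following is a minimal sketch; `hf_login_cmd` is a hypothetical helper name, and it assumes that the new CLI is installed as `hf` and the old one as `huggingface-cli`, as described above.

```shell
#!/usr/bin/env bash
# Sketch: choose the login command that matches the installed CLI.
# hf_login_cmd is a hypothetical helper name.

hf_login_cmd() {
    if command -v hf >/dev/null 2>&1; then
        # Newer huggingface_hub releases ship the "hf" entry point
        echo "hf auth login"
    elif command -v huggingface-cli >/dev/null 2>&1; then
        # Older releases only provide huggingface-cli
        echo "huggingface-cli login"
    else
        echo "no Hugging Face CLI found on PATH" >&2
        return 1
    fi
}
```

A typical use is `$(hf_login_cmd)`, which runs `hf auth login` when the new CLI exists and falls back to the old command otherwise.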