深海游弋的鱼 – 默默的点滴

`CUDA-9.1.85` no longer supports the `Fermi` architecture.

As a result, the following parameter:

arch=compute_20,code=sm_20

makes every `.cu` file fail to compile, so the build can only be done with `CUDA-8.x`.
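As an illustration of how that parameter is formed, here is a minimal sketch that derives the `arch=...,code=...` pair from a compute capability. The helper name `gencode_pair` is hypothetical and not part of any toolkit; it only reproduces the naming pattern of the pair quoted above.

```shell
# Hypothetical helper: turn a compute capability such as "2.0"
# into the arch/code pair passed to nvcc's -gencode option.
gencode_pair() {
  # Drop the dot: "2.0" becomes "20"
  _digits=$(printf '%s' "$1" | tr -d '.')
  printf 'arch=compute_%s,code=sm_%s\n' "$_digits" "$_digits"
}

gencode_pair 2.0   # Fermi, the pair quoted above
gencode_pair 3.0   # the GeForce GTX 760 discussed in this post
```

The first call prints the pair that `CUDA-9.1.85` rejects; the second prints the pair a compute capability 3.0 card would need.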

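The straightforward route is to install `ubuntu 16.04` and build there, either on bare metal or with `nvidia-docker`.

On `ubuntu 18.04`, `sudo apt-get install nvidia-cuda-toolkit` currently installs version `9.1.85` of `nvidia cuda`. That version is fairly old, but it is stable and works on a wide range of hardware.

When a project needs a specific version of `pytorch`, the officially provided prebuilt `nvidia cuda` packages are not compatible with all hardware, which is troublesome in practice.

For now, the more reliable approach is to compile directly from source.

If the graphics card is several years old (GeForce GTX 760, Compute Capability = 3.0 / GeForce GT 720M in a Lenovo Thinkpad T440, Compute Capability = 2.1), the following warning is printed at run time: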

Found GPU0 GeForce GTX 760 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.

and execution then fails with:

RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

To look up the compute capability of your hardware, see NVIDIA's "Recommended GPU for Developers" page.
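Once the compute capability is known, the check behind the warning above can be sketched as a simple comparison against the 3.5 minimum. The helper `cc_at_least` is hypothetical and assumes a "major.minor" string with a single-digit minor, which holds for the cards named in this post.

```shell
# Hypothetical helper: succeed when capability $1 is >= minimum $2.
# Both arguments are "major.minor" strings with a single-digit minor.
cc_at_least() {
  _have=$(printf '%s' "$1" | tr -d '.')
  _need=$(printf '%s' "$2" | tr -d '.')
  [ "$_have" -ge "$_need" ]
}

# 2.1 = GeForce GT 720M, 3.0 = GeForce GTX 760 (values from this post)
for cc in 2.1 3.0 3.5; do
  if cc_at_least "$cc" 3.5; then
    echo "$cc: meets the 3.5 minimum of the prebuilt packages"
  else
    echo "$cc: below 3.5, a source build is needed"
  fi
done
```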

------------------------------------------------------------------------------------

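Install `cuda-9.1.85` from the official package repository (newer graphics driver versions are not supported):

# Remove the nvidia-340 driver and fall back to the open-source Nouveau driver; otherwise installing nvidia-cuda-toolkit later will conflict
$ sudo apt-get remove nvidia-340

# Install the CUDA version packaged by the distribution
$ sudo apt-get install nvidia-cuda-toolkit

# Install the 390 driver
$ sudo apt-get install nvidia-driver-390

# Always reboot after updating the driver; otherwise all kinds of odd failures may appear
$ sudo reboot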

If the installation fails with an error like the following:

$ sudo apt-get install nvidia-cuda-toolkit 
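Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-cuda-toolkit is already the newest version (9.1.85-3ubuntu1).
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 libcuinj64-9.1 : Depends: libcuda1 (>= 387.26) or
                           libcuda-9.1-1
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).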

and `sudo apt --fix-broken install` does not help, force-purge the package:

$ sudo dpkg -P nvidia-340

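The Lenovo T440 (Compute Capability = 2.1) does not support `cuDNN`, so there is no need to install it. In fact, even the latest `CUDA-10.1` cannot be installed: the driver for the `NVIDIA GT 720M` is only supported up to version `390`, while `CUDA-10.1` requires driver `418` or later. The symptom is that the graphics driver is not loaded after boot, and `dmesg` shows the following: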

[   72.533870] NVRM: The NVIDIA GeForce GT 720M GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 430.50 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[   72.533875] NVRM: No NVIDIA graphics adapter found!
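The driver constraint described above can be sketched as a comparison of the driver's major branch against the `418` minimum. The helper `driver_at_least` is hypothetical, and `390.116` is only an example version string from the 390 branch; `430.50` is the version shown in the `dmesg` output.

```shell
# Hypothetical helper: succeed when driver version $1 (e.g. "390.116")
# belongs to a branch at least as new as the required major $2.
driver_at_least() {
  _major=${1%%.*}   # keep the part before the first dot
  [ "$_major" -ge "$2" ]
}

for drv in 390.116 430.50; do
  if driver_at_least "$drv" 418; then
    echo "$drv: new enough for CUDA-10.1"
  else
    echo "$drv: too old for CUDA-10.1"
  fi
done
```

On a machine where the driver is loaded, the version string can usually be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. Note that a driver passing this check still has to support the GPU itself, which is exactly what fails for the GT 720M.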

------------------------------------------------------------------------------------

Switch `GCC` to `GCC-5`

$ sudo apt install gcc-5

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 70

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 60

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50

$ sudo apt install g++-5

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 70 

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-6 60 

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50

$ sudo update-alternatives --config gcc

$ sudo update-alternatives --config g++

# Be sure to exit the current shell; otherwise the environment may not be refreshed
$ exit
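After opening a new shell, it is worth confirming which compiler versions are actually selected. The following is a minimal sketch; the helper name `major_version` is hypothetical, and it relies only on the `-dumpversion` option of `gcc` and `g++`.

```shell
# Hypothetical helper: print the major part of a version string,
# e.g. "5.4.0" gives "5" and "7" stays "7".
major_version() {
  printf '%s\n' "${1%%.*}"
}

# Report the major version of each compiler found on PATH.
for tool in gcc g++; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool major version: $(major_version "$("$tool" -dumpversion)")"
  else
    echo "$tool not found on PATH"
  fi
done
```

Both lines should report 5 before continuing with the build.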

------------------------------------------------------------------------------------

As before, the recommended approach is to create a dedicated build environment in Anaconda and compile there:

$ sudo apt-get install git

# conda remove -n pytorch --all

$ conda create -n pytorch -y python=3.6.8 pip

$ source activate pytorch

$ conda install numpy pyyaml mkl=2019.1 mkl-include=2019.1 setuptools cmake cffi typing pybind11

$ conda install ninja
 
$ conda install -c soumith magma-cuda80 cudatoolkit=8.0

$ git clone https://github.com/pytorch/pytorch

$ cd pytorch

# pytorch 1.0.1 supports hardware with a "Compute Capability" below 3.0; pytorch 1.2.0 needs at least 3.5 to run properly
# https://github.com/pytorch/pytorch/blob/v1.3.0/torch/utils/cpp_extension.py
$ git checkout v1.0.1 -b v1.0.1

$ git submodule sync

$ git submodule update --init --recursive

$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

# If CUDA is not needed, also add: export NO_CUDA=1

$ python setup.py clean

# Uninstall any previously installed pytorch
$ conda uninstall pytorch

$ export CUDA_HOST_COMPILER=/usr/bin/gcc-5

$ export CUDAHOSTCXX=/usr/bin/gcc-5

$ export CMAKE_CXX_COMPILER=/usr/bin/gcc-5

# Patch the source to work around a known build problem: the code requires GCC 6.0 or later and fails otherwise, so lower that requirement to 5.0
$ sed -i "s/6\.0\.0/5.0.0/g" cmake/MiscCheck.cmake

# Look up the "Compute Capability" of your hardware on the Nvidia developer site
# For example, "GeForce GTX 760" has compute capability "3.0"; a wrong value causes a runtime failure:
# RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

$ python setup.py install

# For developer mode, use instead:
# python setup.py build develop

# Be sure to leave the pytorch build directory; running commands inside the pytorch source tree causes errors
$ cd ..
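The comments in the listing stress that the compute capability must be correct, but the listing itself does not show where that value is passed to the build. One way to do it, assuming the build honors the `TORCH_CUDA_ARCH_LIST` environment variable read by PyTorch's CMake scripts (this step is an assumption and not part of the original command sequence), is to export it before running `python setup.py install`:

```shell
# Assumption: 3.0 is the value for a GeForce GTX 760; replace it with
# the compute capability looked up for your own card.
export TORCH_CUDA_ARCH_LIST="3.0"

echo "building for compute capability: $TORCH_CUDA_ARCH_LIST"
```

If the variable is left unset, the architectures are chosen by the build scripts, which may or may not include the card in the machine.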

If the following error appears:

[ 68%] Building NVCC (Device) object caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/ATen/native/sparse/cuda/caffe2_gpu_generated_SparseCUDABlas.cu.o
~/pytorch/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(58): error: more than one instance of function "at::native::sparse::cuda::cusparseGetErrorString" matches the argument list:
            function "cusparseGetErrorString(cusparseStatus_t)"
            function "at::native::sparse::cuda::cusparseGetErrorString(cusparseStatus_t)"
            argument types are: (cusparseStatus_t)

then edit `aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu` and wrap the `cusparseGetErrorString` function in `#if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))`

as follows:

#if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))
const char* cusparseGetErrorString(cusparseStatus_t status) {
  switch(status)
  {
    case CUSPARSE_STATUS_SUCCESS:
      return "success";

    case CUSPARSE_STATUS_NOT_INITIALIZED:
      return "library not initialized";

    case CUSPARSE_STATUS_ALLOC_FAILED:
      return "resource allocation failed";

    case CUSPARSE_STATUS_INVALID_VALUE:
      return "an invalid numeric value was used as an argument";

    case CUSPARSE_STATUS_ARCH_MISMATCH:
      return "an absent device architectural feature is required";

    case CUSPARSE_STATUS_MAPPING_ERROR:
      return "an access to GPU memory space failed";

    case CUSPARSE_STATUS_EXECUTION_FAILED:
      return "the GPU program failed to execute";

    case CUSPARSE_STATUS_INTERNAL_ERROR:
      return "an internal operation failed";

    case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
      return "the matrix type is not supported by this function";

    case CUSPARSE_STATUS_ZERO_PIVOT:
      return "an entry of the matrix is either structural zero or numerical zero (singular block)";

    default:
      return "unknown error";
  }
}
#endif

This resolves the conflict with the function that ships with `CUDA-10.1`.
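To make the effect of the guard concrete, here is a small sketch that mirrors the `#if` condition in shell arithmetic. The function name `local_definition_compiled` and the version pairs passed to it are illustrative assumptions; the real values come from the `CUSPARSE_VER_MAJOR` and `CUSPARSE_VER_MINOR` macros of the installed cuSPARSE headers.

```shell
# Mirror of the #if condition: succeed when the local
# cusparseGetErrorString would be compiled for cuSPARSE
# version major=$1, minor=$2.
local_definition_compiled() {
  if [ "$1" -ge 10 ] && [ "$2" -ge 2 ]; then
    return 1   # guard is false: the local definition is skipped
  fi
  return 0     # guard is true: the local definition is compiled
}

report() {
  if local_definition_compiled "$1" "$2"; then
    echo "cuSPARSE $1.$2: local definition is compiled"
  else
    echo "cuSPARSE $1.$2: local definition is skipped"
  fi
}

report 9 1    # illustrative older version
report 10 2   # illustrative version that satisfies both comparisons
```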

For details, see: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu

To uninstall a Pytorch that was installed from source, run:

# conda uninstall pytorch

$ pip uninstall torch

$ python setup.py clean

Downloading the Pytorch source is very slow; a synced copy of the pytorch source can be downloaded from this site.


Compiling pytorch 1.0.1 on ubuntu 18.04 (Lenovo Thinkpad T440) with CUDA-9.1.85

Posted by 默默
