深海游弋的鱼 (Fish Roaming the Deep Sea) – Page 2

Building PyTorch 1.0.1 on Ubuntu 18.04 (Lenovo ThinkPad T440) with CUDA-9.1.85

CUDA-9.1.85 no longer supports the Fermi architecture.

As a result, the following nvcc architecture option:

arch=compute_20,code=sm_20

causes every .cu file to fail to compile, so the build has to be done with CUDA-8.x instead.
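To make the failing option concrete, here is a minimal Python sketch that derives the `arch=compute_XX,code=sm_XX` value from a compute capability and flags Fermi (2.x) parts. The helper names `gencode_flag` and `is_fermi` are hypothetical and exist only for this illustration; they are not part of any NVIDIA or PyTorch tooling.

```python
# Sketch only: gencode_flag and is_fermi are illustrative helper names.

def gencode_flag(capability: str) -> str:
    """Turn a compute capability such as "2.0" into the nvcc -gencode value."""
    major, minor = capability.split(".")
    arch = f"{int(major)}{int(minor)}"
    return f"arch=compute_{arch},code=sm_{arch}"


def is_fermi(capability: str) -> bool:
    """Fermi GPUs report compute capability 2.x."""
    return capability.split(".")[0] == "2"


if __name__ == "__main__":
    for cap in ("2.0", "3.0"):
        note = "Fermi, rejected by CUDA 9.x" if is_fermi(cap) else "not Fermi"
        print(gencode_flag(cap), "->", note)
```

A Fermi capability is a sign that the build has to fall back to a CUDA 8.x toolchain, as described above.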

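The straightforward route is to install Ubuntu 16.04 and build there; either a physical machine or nvidia-docker is worth trying.

On Ubuntu 18.04, sudo apt-get install nvidia-cuda-toolkit currently installs NVIDIA CUDA 9.1.85. The version is fairly old, but it is stable and works across a wide range of setups.

When a project needs a specific version of PyTorch, the officially provided prebuilt CUDA packages do not support every piece of hardware, which is a nuisance in real environments.

For now, the safest approach is to build directly from source.

If the GPU is several years old (GeForce GTX 760, Compute Capability 3.0; or the GeForce GT 720M in the Lenovo ThinkPad T440, Compute Capability 2.1), PyTorch prints the following warning at runtime: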

Found GPU0 GeForce GTX 760 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.

Execution then fails with:

RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

To look up the compute capability of your hardware, see NVIDIA's "Recommended GPU for Developers" page.
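The check behind that warning can be sketched in a few lines of Python. The minimum of 3.5 is taken from the message quoted above; the function names are hypothetical. On a machine with a working PyTorch install, `torch.cuda.get_device_capability(0)` returns a comparable `(major, minor)` tuple, but the sketch below deliberately avoids importing torch.

```python
from typing import Tuple

# Minimum quoted in the PyTorch warning above.
MIN_CAPABILITY = (3, 5)


def parse_capability(text: str) -> Tuple[int, int]:
    """Convert "3.0" into (3, 0)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def meets_minimum(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Tuples compare element by element, so (3, 0) < (3, 5) < (6, 1)."""
    return capability >= minimum


if __name__ == "__main__":
    for name, cap in (("GeForce GTX 760", "3.0"), ("GeForce GT 720M", "2.1")):
        ok = meets_minimum(parse_capability(cap))
        print(name, cap, "supported by prebuilt packages" if ok else "needs a source build")
```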

------------------------------------------------------------------------------------

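Install cuda-9.1.85 from the official Ubuntu repository (newer GPU driver versions do not support this card):

# Remove the nvidia-340 driver and fall back to the open-source Nouveau driver; otherwise installing nvidia-cuda-toolkit later will conflict
$ sudo apt-get remove nvidia-340

# Install the CUDA version packaged by the distribution
$ sudo apt-get install nvidia-cuda-toolkit

# Install the 390 driver
$ sudo apt-get install nvidia-driver-390

# Always reboot after updating the driver; otherwise all sorts of odd failures may appear
$ sudo reboot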

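If the installation fails with an error like this:

$ sudo apt-get install nvidia-cuda-toolkit 
Reading package lists... Done
Building dependency tree       
Reading state information... Done       
nvidia-cuda-toolkit is already the newest version (9.1.85-3ubuntu1).
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 libcuinj64-9.1 : Depends: libcuda1 (>= 387.26) or
                           libcuda-9.1-1
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).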

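and sudo apt --fix-broken install does not help, force-purge the package:

$ sudo dpkg -P nvidia-340

The GeForce GT 720M in the Lenovo T440 (Compute Capability 2.1) is not supported by cuDNN, so there is no point in installing it. In fact, even the latest CUDA-10.1 cannot be installed: the driver for the NVIDIA GT 720M is only supported up to the 390 series, while CUDA-10.1 requires driver 418 or later. The symptom is that the GPU driver is not loaded after boot, and dmesg shows the following: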

[   72.533870] NVRM: The NVIDIA GeForce GT 720M GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 430.50 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[   72.533875] NVRM: No NVIDIA graphics adapter found!
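A short Python sketch of how the NVRM message above could be parsed to detect this situation automatically. The function name `find_legacy_gpu` and the dictionary keys are hypothetical; the sample text is the dmesg excerpt quoted above, and the regular expression assumes the message keeps that wording.

```python
import re
from typing import Dict, Optional

SAMPLE = """\
[   72.533870] NVRM: The NVIDIA GeForce GT 720M GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 430.50 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[   72.533875] NVRM: No NVIDIA graphics adapter found!
"""

PATTERN = re.compile(
    r"The NVIDIA (?P<gpu>.+?) GPU installed in this system is supported "
    r"through the NVIDIA (?P<legacy>\S+) Legacy drivers\."
    r".*?The (?P<loaded>\S+) NVIDIA driver will ignore this GPU"
)


def flatten(dmesg_text: str) -> str:
    """Drop timestamps and NVRM: prefixes, then collapse all whitespace."""
    no_prefix = re.sub(r"(\[\s*\d+\.\d+\]\s*)?NVRM:", " ", dmesg_text)
    return " ".join(no_prefix.split())


def find_legacy_gpu(dmesg_text: str) -> Optional[Dict[str, str]]:
    """Return GPU name, required legacy branch and loaded driver, or None."""
    match = PATTERN.search(flatten(dmesg_text))
    if match is None:
        return None
    return {
        "gpu": match.group("gpu"),
        "legacy_branch": match.group("legacy"),
        "loaded_driver": match.group("loaded"),
    }


if __name__ == "__main__":
    print(find_legacy_gpu(SAMPLE))
```

If the loaded driver is newer than the legacy branch named in the message, the card is skipped, which matches the "No NVIDIA graphics adapter found" line.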

------------------------------------------------------------------------------------

Switch GCC to GCC-5

$ sudo apt install gcc-5

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 70

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 60

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50

$ sudo apt install g++-5

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 70 

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-6 60 

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50

$ sudo update-alternatives --config g++

# Be sure to exit the current shell; otherwise the environment may not be refreshed
$ exit
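After reopening the shell, it can help to confirm which compiler version is actually on the PATH. The following Python sketch does that with `-dumpversion`; the function names are hypothetical, and the script simply reports `None` when a compiler is not found.

```python
import shutil
import subprocess
from typing import Optional


def parse_major(version_text: str) -> int:
    """Extract the major version from output such as "5.5.0" or "7"."""
    return int(version_text.strip().split(".")[0])


def compiler_major(binary: str) -> Optional[int]:
    """Return the major version of the given compiler, or None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-dumpversion"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        return parse_major(result.stdout)
    except (subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    for binary in ("gcc", "g++"):
        print(binary, "major version:", compiler_major(binary))
```

For the build described here, both lines would be expected to report 5 once the alternatives have been switched.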

------------------------------------------------------------------------------------

As before, the recommended approach is to create an isolated build environment in Anaconda and build inside it:

$ sudo apt-get install git

# conda remove -n pytorch --all

$ conda create -n pytorch -y python=3.6.8 pip

$ source activate pytorch

$ conda install numpy pyyaml mkl=2019.1 mkl-include=2019.1 setuptools cmake cffi typing pybind11

$ conda install ninja
 
$ conda install -c soumith magma-cuda80 cudatoolkit=8.0

$ git clone https://github.com/pytorch/pytorch

$ cd pytorch

# PyTorch 1.0.1 supports older hardware (compute capability 3.0 and below); PyTorch 1.2.0 needs at least compute capability 3.5 to run properly
# https://github.com/pytorch/pytorch/blob/v1.3.0/torch/utils/cpp_extension.py
$ git checkout v1.0.1 -b v1.0.1

$ git submodule sync

$ git submodule update --init --recursive

$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

# If CUDA is not needed, also add: export NO_CUDA=1

$ python setup.py clean

# Uninstall any previously installed PyTorch
$ conda uninstall pytorch

$ export CUDA_HOST_COMPILER=/usr/bin/gcc-5

$ export CUDAHOSTCXX=/usr/bin/gcc-5

$ export CMAKE_CXX_COMPILER=/usr/bin/gcc-5

# Patch the source to work around a known build issue: the check requires GCC 6.0 or newer and fails otherwise, so lower the requirement to 5.0
$ sed -i "s/6.0.0/5.0.0/g" cmake/MiscCheck.cmake

# Look up the compute capability of your hardware on the NVIDIA developer site
# For example, the GeForce GTX 760 has compute capability 3.0; a wrong value leads to runtime failures such as
# RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

$ python setup.py install

# For development mode, use
# python setup.py build develop

# Be sure to leave the PyTorch source directory; running commands from inside it causes errors
$ cd ..
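Once the install finishes, a quick sanity check from outside the source tree can confirm what was built. The sketch below is an assumption about how one might do that, not part of the original procedure; `describe_torch` is a hypothetical helper, while the torch attributes it reads (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_name`, `torch.cuda.get_device_capability`) are standard PyTorch APIs.

```python
import importlib


def describe_torch() -> dict:
    """Summarize the installed PyTorch build; works even if torch is missing."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"installed": False}

    info = {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": torch.version.cuda,          # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

If `cuda_available` is true but tensor operations still raise error 48, the architecture the build targeted most likely does not match the card's compute capability.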

If the following error appears:

[ 68%] Building NVCC (Device) object caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/ATen/native/sparse/cuda/caffe2_gpu_generated_SparseCUDABlas.cu.o
~/pytorch/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(58): error: more than one instance of function "at::native::sparse::cuda::cusparseGetErrorString" matches the argument list:
            function "cusparseGetErrorString(cusparseStatus_t)"
            function "at::native::sparse::cuda::cusparseGetErrorString(cusparseStatus_t)"
            argument types are: (cusparseStatus_t)

then edit aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu and wrap its cusparseGetErrorString function in the guard #if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))

as follows:

#if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))
const char* cusparseGetErrorString(cusparseStatus_t status) {
  switch(status)
  {
    case CUSPARSE_STATUS_SUCCESS:
      return "success";

    case CUSPARSE_STATUS_NOT_INITIALIZED:
      return "library not initialized";

    case CUSPARSE_STATUS_ALLOC_FAILED:
      return "resource allocation failed";

    case CUSPARSE_STATUS_INVALID_VALUE:
      return "an invalid numeric value was used as an argument";

    case CUSPARSE_STATUS_ARCH_MISMATCH:
      return "an absent device architectural feature is required";

    case CUSPARSE_STATUS_MAPPING_ERROR:
      return "an access to GPU memory space failed";

    case CUSPARSE_STATUS_EXECUTION_FAILED:
      return "the GPU program failed to execute";

    case CUSPARSE_STATUS_INTERNAL_ERROR:
      return "an internal operation failed";

    case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
      return "the matrix type is not supported by this function";

    case CUSPARSE_STATUS_ZERO_PIVOT:
      return "an entry of the matrix is either structural zero or numerical zero (singular block)";

    default:
      return "unknown error";
  }
}
#endif

This resolves the conflict with the function of the same name that ships with CUDA-10.1.

For details, see: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu
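To see when the guard keeps or drops the local definition, the preprocessor condition can be mirrored in Python. This is only an illustration of the boolean logic; the version pairs are example inputs, and `needed_by_version_order` is a hypothetical stricter variant added for comparison, not something taken from the PyTorch source.

```python
def local_definition_compiled(major: int, minor: int) -> bool:
    """Mirror of: #if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))"""
    return not (major >= 10 and minor >= 2)


def needed_by_version_order(major: int, minor: int) -> bool:
    """Hypothetical variant: compare (major, minor) as an ordered version."""
    return (major, minor) < (10, 2)


if __name__ == "__main__":
    for version in ((9, 1), (10, 1), (10, 2), (11, 0)):
        print(
            version,
            "guard compiles local copy:", local_definition_compiled(*version),
            "| ordered comparison:", needed_by_version_order(*version),
        )
```

The two functions agree for the versions relevant to this post. They differ only for a major version above 10 with a minor below 2, where the guard as written would compile the local copy again.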

To uninstall a PyTorch that was installed from source, run:

# conda uninstall pytorch

$ pip uninstall torch

$ python setup.py clean

Cloning the PyTorch source is very slow; a synced copy of the PyTorch source can be downloaded from this site instead.

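Building and testing MaskTextSpotter on Ubuntu 18.04 (GeForce GTX 760, 4 GB VRAM) with CUDA-10.1

Running MaskTextSpotter needs at least 4 GB of GPU memory; with less than that it will not run.

Install the latest cuda-10.1; building against older versions runs into problems: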

# Remove any previously installed CUDA
$ sudo apt-get remove nvidia-cuda-toolkit

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600

$ wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub

$ sudo apt-get update

$ sudo apt-get -y install cuda

# Some driver packages may have updates; apply them, otherwise things may still misbehave
$ sudo apt-get dist-upgrade

$ sudo apt-get autoremove

# You may need to delete the X Window authority file; otherwise the driver may fail to load
$ sudo rm -rf ~/.Xauthority 

# If you hit the following error:
# ubuntu 18.04 "nvidia-340 causes /usr/lib/x86_64-linux-gnu/libGL.so.1
# to be diverted to /usr/lib/x86_64-linux-gnu/libGL.so.1.distrib"
# see http://www.mobibrw.com/?p=21739

# Remove the local installer repository to reclaim several GB of disk space; it is no longer needed after installation
$ sudo apt-get remove --purge cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00 

$ sudo apt-get update
 
# Some driver packages may have updates; apply them, otherwise things may still misbehave
$ sudo apt-get dist-upgrade
 
$ sudo apt-get autoremove
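After the install it is worth confirming which toolkit release `nvcc` reports. The Python sketch below parses the release line of `nvcc --version`; the sample string is an assumption about the usual shape of that output (it is not quoted from the original post), and `parse_nvcc_release` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Assumed shape of the last line printed by `nvcc --version`.
SAMPLE_LINE = "Cuda compilation tools, release 10.1, V10.1.243"


def parse_nvcc_release(text: str) -> Optional[Dict[str, str]]:
    """Extract the release ("10.1") and full version ("10.1.243") if present."""
    match = re.search(r"release (\d+\.\d+), V(\d+\.\d+\.\d+)", text)
    if match is None:
        return None
    return {"release": match.group(1), "version": match.group(2)}


if __name__ == "__main__":
    print(parse_nvcc_release(SAMPLE_LINE))
```

In practice the text would come from running `nvcc --version` in the same shell that will later run the PyTorch build, so that a stale toolkit earlier on the PATH is noticed before compiling.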

Set up an isolated environment

# first, make sure that your conda is setup properly with the right environment
# for that, check that `which conda`, `which pip` and `which python` points to the
# right path. From a clean conda env, this is what you need to do

# conda remove -n MaskTextSpotter --all

$ conda create -n MaskTextSpotter -y python=3.6.8 pip
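The "check that `which pip` and `which python` point to the right path" advice can be automated with a small Python sketch. It assumes the environment has been activated, so that conda has set `CONDA_PREFIX`; the helper names are hypothetical. `conda` itself is left out of the check because it normally lives in the base installation rather than inside the environment.

```python
import os
import shutil
from typing import Dict, Optional, Tuple


def inside_prefix(tool_path: Optional[str], env_prefix: str) -> bool:
    """True when tool_path is located somewhere under env_prefix."""
    if tool_path is None:
        return False
    tool = os.path.abspath(tool_path)
    prefix = os.path.abspath(env_prefix)
    return os.path.commonpath([tool, prefix]) == prefix


def report(tools=("python", "pip")) -> Dict[str, Tuple[Optional[str], bool]]:
    """Map each tool to (resolved path, whether it belongs to the active env)."""
    prefix = os.environ.get("CONDA_PREFIX")
    result = {}
    for tool in tools:
        path = shutil.which(tool)
        result[tool] = (path, bool(prefix) and inside_prefix(path, prefix))
    return result


if __name__ == "__main__":
    for tool, (path, ok) in report().items():
        print(f"{tool}: {path} ({'in active env' if ok else 'NOT in active env'})")
```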

Build and install PyTorch

$ sudo apt-get install git

# Activate the environment
$ source activate MaskTextSpotter

$ conda install numpy pyyaml mkl=2019.1 mkl-include=2019.1 setuptools cmake cffi typing pybind11

$ conda install ninja

# magma-cuda90, magma-cuda91 and magma-cuda92 fail to build
$ conda install -c pytorch magma-cuda101

$ git clone https://github.com/pytorch/pytorch

# Alternatively, download a synced copy from this site: wget https://www.mobibrw.com/wp-content/uploads/2019/11/pytorch.zip

$ cd pytorch

# PyTorch 1.0.1 supports older hardware (compute capability 3.0 and below); PyTorch 1.2.0 needs at least compute capability 3.5 to run properly
# https://github.com/pytorch/pytorch/blob/v1.3.0/torch/utils/cpp_extension.py
$ git checkout v1.0.1 -b v1.0.1

$ git submodule sync

$ git submodule update --init --recursive

$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

$ python setup.py clean

# Uninstall any previously installed PyTorch (the pip package is named torch)
$ conda uninstall pytorch

$ pip uninstall torch

# Look up the compute capability of your hardware on the NVIDIA developer site
# For example, the GeForce GTX 760 has compute capability 3.0; a wrong value leads to runtime failures such as
# RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

$ TORCH_CUDA_ARCH_LIST="3.0" python setup.py install

# Be sure to leave the PyTorch source directory; running commands from inside it causes errors
$ cd ..

# Leave the environment
$ conda deactivate
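When the build is driven from a script rather than typed by hand, the `TORCH_CUDA_ARCH_LIST` value can be assembled and validated first. The sketch below is a hypothetical helper, not part of PyTorch; it joins capabilities with semicolons, which is an assumption about what the build accepts for more than one entry (the single-value case `"3.0"` is the one used above).

```python
import os
import re
from typing import Dict, Iterable, Optional


def arch_list(capabilities: Iterable[str]) -> str:
    """Validate entries such as "3.0" or "6.1+PTX" and join them."""
    caps = list(capabilities)
    for cap in caps:
        if re.fullmatch(r"\d+\.\d+(\+PTX)?", cap) is None:
            raise ValueError(f"not a compute capability: {cap!r}")
    return ";".join(caps)


def build_env(capabilities: Iterable[str],
              base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set TORCH_CUDA_ARCH_LIST for the build."""
    env = dict(os.environ if base is None else base)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list(capabilities)
    return env


if __name__ == "__main__":
    env = build_env(["3.0"])
    print("TORCH_CUDA_ARCH_LIST =", env["TORCH_CUDA_ARCH_LIST"])
    # The build itself could then be started with, for example:
    # subprocess.run([sys.executable, "setup.py", "install"], env=env, check=True)
```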

If the following error appears:

[ 68%] Building NVCC (Device) object caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/ATen/native/sparse/cuda/caffe2_gpu_generated_SparseCUDABlas.cu.o
~/pytorch/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(58): error: more than one instance of function "at::native::sparse::cuda::cusparseGetErrorString" matches the argument list:
            function "cusparseGetErrorString(cusparseStatus_t)"
            function "at::native::sparse::cuda::cusparseGetErrorString(cusparseStatus_t)"
            argument types are: (cusparseStatus_t)

then edit aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu and wrap its cusparseGetErrorString function in the guard #if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))

as follows:

#if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))
const char* cusparseGetErrorString(cusparseStatus_t status) {
  switch(status)
  {
    case CUSPARSE_STATUS_SUCCESS:
      return "success";

    case CUSPARSE_STATUS_NOT_INITIALIZED:
      return "library not initialized";

    case CUSPARSE_STATUS_ALLOC_FAILED:
      return "resource allocation failed";

    case CUSPARSE_STATUS_INVALID_VALUE:
      return "an invalid numeric value was used as an argument";

    case CUSPARSE_STATUS_ARCH_MISMATCH:
      return "an absent device architectural feature is required";

    case CUSPARSE_STATUS_MAPPING_ERROR:
      return "an access to GPU memory space failed";

    case CUSPARSE_STATUS_EXECUTION_FAILED:
      return "the GPU program failed to execute";

    case CUSPARSE_STATUS_INTERNAL_ERROR:
      return "an internal operation failed";

    case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
      return "the matrix type is not supported by this function";

    case CUSPARSE_STATUS_ZERO_PIVOT:
      return "an entry of the matrix is either structural zero or numerical zero (singular block)";

    default:
      return "unknown error";
  }
}
#endif

This resolves the conflict with the function of the same name that ships with CUDA-10.1.

For details, see: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu

Build and install TorchVision

$ sudo apt-get install git

# Activate the environment
$ source activate MaskTextSpotter

$ git clone https://github.com/pytorch/vision.git

# Alternatively, download a copy from this site: wget https://www.mobibrw.com/wp-content/uploads/2019/11/vision.zip

$ cd vision

$ git checkout v0.2.1 -b v0.2.1

$ python setup.py install

# Leave the environment
$ conda deactivate

Build MaskTextSpotter from source

$ source activate MaskTextSpotter

# this installs the right pip and dependencies for the fresh python
$ conda install ipython pip

# python dependencies
$ pip install ninja yacs cython matplotlib tqdm opencv-python shapely scipy tensorboardX

$ export INSTALL_DIR=$PWD

# install pycocotools
$ cd $INSTALL_DIR
$ git clone https://github.com/cocodataset/cocoapi.git
$ cd cocoapi/PythonAPI
$ python setup.py build_ext install

# Mirror on this site: https://www.mobibrw.com/wp-content/uploads/2019/11/cocoapi.zip

# install apex (optional)
$ cd $INSTALL_DIR
$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ python setup.py install --cuda_ext --cpp_ext

# Mirror on this site: wget https://www.mobibrw.com/wp-content/uploads/2019/11/apex.zip

# clone repo
$ cd $INSTALL_DIR
$ git clone https://github.com/MhLiao/MaskTextSpotter.git
$ cd MaskTextSpotter

# Mirror on this site: wget https://www.mobibrw.com/wp-content/uploads/2019/11/MaskTextSpotter.zip

# build
$ python setup.py build develop

$ unset INSTALL_DIR

Prepare the test data

# Create the directories (in the source root)
$ mkdir outputs

$ cd outputs

$ mkdir finetune

$ cd finetune

# Download the pretrained model: https://drive.google.com/open?id=1pPRS7qS_K1keXjSye0kksqhvoyD0SARz

# Or download it from this site
$ wget https://www.mobibrw.com/wp-content/uploads/2019/11/model_finetune.zip

$ unzip model_finetune.zip

$ cd ../../

$ mkdir datasets

$ cd datasets

# Download the icdar2013 dataset
$ wget https://www.mobibrw.com/wp-content/uploads/2019/11/icdar2013.zip

$ unzip icdar2013.zip

$ cd icdar2013

# Download the test-set ground-truth files
$ git clone https://github.com/zazaliu/ICDAR2PASCAL_VOC.git

# Mirror on this site: wget https://www.mobibrw.com/wp-content/uploads/2019/11/ICDAR2PASCAL_VOC.zip

$ cp -r ICDAR2PASCAL_VOC/ICDAR2015/ch4_training_localization_transcription_gt/ test_gts

# Run the test

$ cd ../../

# Delete previously generated files first; otherwise the run may crash right after starting
$ rm -rf outputs/finetune/inference/

$ bash test.sh
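Before running `test.sh`, the directory layout produced by the commands above can be checked with a short Python sketch. The list of expected directories is read off those commands; `missing_paths` is a hypothetical helper, and the check covers directories only, not the files inside them.

```python
from pathlib import Path
from typing import List

# Directories created by the preparation steps, relative to the source root.
EXPECTED = [
    "outputs/finetune",
    "datasets/icdar2013",
    "datasets/icdar2013/test_gts",
]


def missing_paths(root: str) -> List[str]:
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("missing directories:", ", ".join(missing))
    else:
        print("layout looks complete")
```

Run it from the MaskTextSpotter source root; an empty result means the model and ground-truth directories are where the preparation steps put them.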

If the test run fails with the following error:

  File "tools/test_net.py", line 95, in <module>
    main()
  File "tools/test_net.py", line 89, in main
    cfg=cfg,
  File "~/MaskTextSpotter/maskrcnn_benchmark/engine/text_inference.py", line 380, in inference
    predictions = compute_on_dataset(model, data_loader, device)
  File "~/MaskTextSpotter/maskrcnn_benchmark/engine/text_inference.py", line 55, in compute_on_dataset
    for i, batch in tqdm(enumerate(data_loader)):
  File "~.conda/envs/MaskTextSpotter/lib/python3.6/site-packages/tqdm/std.py", line 1091, in __iter__
    for obj in iterable:
  File "~.conda/envs/MaskTextSpotter/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "~.conda/envs/MaskTextSpotter/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "~.conda/envs/MaskTextSpotter/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "~.conda/envs/MaskTextSpotter/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "~/MaskTextSpotter/maskrcnn_benchmark/data/datasets/icdar.py", line 32, in __getitem__
    words,boxes,charsbbs,segmentations=self.load_gt_from_txt(gt_path,height,width)
  File "~/MaskTextSpotter/maskrcnn_benchmark/data/datasets/icdar.py", line 94, in load_gt_from_txt
    strs, loc = self.line2boxes(line)
  File "~/MaskTextSpotter/maskrcnn_benchmark/data/datasets/icdar.py", line 153, in line2boxes
    loc = np.vstack(v).transpose()
  File "<__array_function__ internals>", line 6, in vstack
  File "~.conda/envs/MaskTextSpotter/lib/python3.6/site-packages/numpy/core/shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2 and the array at index 1 has size 1

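the cause is that maskrcnn_benchmark/data/datasets/icdar.py, while parsing a ground-truth file, hits a line such as 478,239,511,241,511,255,478,253,$5,000, where the transcription itself contains a comma. The following test code reproduces the problem: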

import numpy as np

line = '478,239,511,241,511,255,478,253,$5,000'
def line2boxes(line):
    parts = line.strip().split(',')
    if '\xef\xbb\xbf' in parts[0]:
        parts[0] = parts[0][3:]
    if '\ufeff' in parts[0]:
        parts[0] = parts[0].replace('\ufeff', '')
    x1 = np.array([int(float(x)) for x in parts[::9]])
    y1 = np.array([int(float(x)) for x in parts[1::9]])
    x2 = np.array([int(float(x)) for x in parts[2::9]])
    y2 = np.array([int(float(x)) for x in parts[3::9]])
    x3 = np.array([int(float(x)) for x in parts[4::9]])
    y3 = np.array([int(float(x)) for x in parts[5::9]])
    x4 = np.array([int(float(x)) for x in parts[6::9]])
    y4 = np.array([int(float(x)) for x in parts[7::9]])
    strs = parts[8::9]
    print(x1)
    loc = np.vstack((x1, y1, x2, y2, x3, y3, x4, y4)).transpose()
    print(loc)
    return strs, loc

line2boxes(line)

The corrected code is as follows:

import numpy as np

line = '478,239,511,241,511,255,478,253,$5,000'
def line2boxes(line):
    parts = line.strip().split(',', 8)
    if '\xef\xbb\xbf' in parts[0]:
        parts[0] = parts[0][3:]
    if '\ufeff' in parts[0]:
        parts[0] = parts[0].replace('\ufeff', '')
    x1 = np.array([int(float(x)) for x in parts[::9]])
    y1 = np.array([int(float(x)) for x in parts[1::9]])
    x2 = np.array([int(float(x)) for x in parts[2::9]])
    y2 = np.array([int(float(x)) for x in parts[3::9]])
    x3 = np.array([int(float(x)) for x in parts[4::9]])
    y3 = np.array([int(float(x)) for x in parts[5::9]])
    x4 = np.array([int(float(x)) for x in parts[6::9]])
    y4 = np.array([int(float(x)) for x in parts[7::9]])
    strs = parts[8::9]
    print(x1)
    loc = np.vstack((x1, y1, x2, y2, x3, y3, x4, y4)).transpose()
    print(loc)
    return strs, loc

line2boxes(line)
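
The effect of the second argument to split can be checked in isolation. The sketch below is not the original function: parse_box_line is a hypothetical helper that uses plain Python lists instead of NumPy and assumes exactly one box per line. With an unlimited split the sample line yields ten fields, because the text "$5,000" is cut at its own comma; limiting the split to eight cuts keeps the text intact.

```python
def parse_box_line(line):
    """Split one 'x1,y1,...,x4,y4,text' line into (text, list of four corners)."""
    # Drop a leading byte order mark, then cut at the first eight commas only,
    # so any comma inside the trailing text survives.
    fields = line.strip().lstrip('\ufeff').split(',', 8)
    values = [int(float(v)) for v in fields[:8]]
    corners = list(zip(values[0::2], values[1::2]))
    return fields[8], corners


line = '478,239,511,241,511,255,478,253,$5,000'
print(len(line.split(',')))   # 10: the unlimited split also cuts the text
text, corners = parse_box_line(line)
print(text)                   # $5,000
print(corners)
```

With ten fields, parts[::9] in the unpatched version picks up both '478' and '000', so x1 ends up longer than the other arrays and they can no longer be stacked.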

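For other errors, which may stem from version conflicts left behind by installing and uninstalling packages along the way, simply delete the environment, create a clean one, and rebuild.

Reference links

Ubuntu 18.04: Android Studio reports "/dev/kvm device: permission denied" when running the emulator

After upgrading Ubuntu from 16.04.5 to 18.04.1, I started configuring the various software environments again.

With the Android development environment set up, I created an emulator to check that everything worked, and that is where the problem appeared.

Both creating and running the emulator reported "/dev/kvm device: permission denied" or "/dev/kvm device: open failed", and the emulator would not start.

Check the device with: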

$ ls -al /dev/kvm
crw------- 1 root root 10, 232 11月 17 22:37 /dev/kvm

The fix is to install qemu-kvm and add the current user to the kvm group:

$ sudo apt install qemu-kvm

$ sudo adduser `whoami` kvm

$ ls -al /dev/kvm
crw-rw---- 1 root kvm 10, 232 11月 18 14:40 /dev/kvm

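Then log out and back in so that the new group membership takes effect, and run the emulator.

If the error persists, change the owner of /dev/kvm to the current user: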

$ sudo chown `whoami` /dev/kvm
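
Before and after these steps it can help to confirm what the current process is actually allowed to do with the device. The following is a minimal sketch using only the Python standard library on Linux; describe_device and group_name are hypothetical helper names, not part of any Android or KVM tooling.

```python
import grp
import os
import stat


def group_name(gid):
    """Resolve a numeric group id to its name, falling back to the number."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def describe_device(path):
    """Report owning group, permission string and read/write access for path."""
    info = os.stat(path)
    return {
        "group": group_name(info.st_gid),
        "mode": stat.filemode(info.st_mode),
        "usable": os.access(path, os.R_OK | os.W_OK),
    }


if os.path.exists("/dev/kvm"):
    print(describe_device("/dev/kvm"))
else:
    print("/dev/kvm does not exist on this machine")
```

A "usable" value of False together with group "kvm" suggests the current session has not picked up the kvm group yet.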

Reference links

Android Emulator supports Vulkan

Starting with Android Emulator 29.0.6 (May 1, 2019), Vulkan support is available with Android Q Beta 3, so Vulkan code can now be debugged on the emulator.

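Building PyTorch 1.0.1 on Ubuntu 18.04 (GeForce GTX 760) with CUDA 10.1

On Ubuntu 18.04, sudo apt-get install nvidia-cuda-toolkit currently installs NVIDIA CUDA 9.1.85. The version is fairly old, but it is stable and works on a wide range of hardware.

When a project needs a specific PyTorch version, the officially provided prebuilt packages with NVIDIA CUDA support are not compatible with all hardware, which is troublesome in practice.

For now, the safest approach is to build directly from source.

If the graphics card is several years old (GeForce GTX 760, Compute Capability 3.0; GeForce GT 720M in a Lenovo ThinkPad T440, Compute Capability 2.1), the following message is printed at runtime: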

Found GPU0 GeForce GTX 760 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.

Execution then fails with:

RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

To look up the compute capability of your hardware, see Recommended GPU for Developers.
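
The check behind the warning above amounts to a numeric comparison of (major, minor) pairs. The sketch below illustrates it in plain Python; parse_capability and is_supported are hypothetical names, and the 3.5 minimum is simply the value quoted in the warning. On a machine with a working PyTorch build, torch.cuda.get_device_capability() returns the same kind of pair for the installed GPU.

```python
MIN_SUPPORTED = (3, 5)  # minimum capability quoted in the warning above


def parse_capability(text):
    """Turn a string such as '3.0' into the tuple (3, 0)."""
    major, minor = text.strip().split('.')
    return int(major), int(minor)


def is_supported(capability, minimum=MIN_SUPPORTED):
    """Compare numerically, so that '10.0' is not sorted before '3.5'."""
    return parse_capability(capability) >= minimum


for name, cap in [("GeForce GTX 760", "3.0"), ("GeForce GT 720M", "2.1")]:
    print(name, cap, "supported" if is_supported(cap) else "too old for the prebuilt package")
```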

------------------------------------------------------------------------------------

Install the latest CUDA 10.1; building against older versions runs into problems:

# Remove the previously installed CUDA
$ sudo apt-get remove nvidia-cuda-toolkit

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600

$ wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub

$ sudo apt-get update

$ sudo apt-get -y install cuda

# Some driver packages may have updates; apply them, otherwise things may still misbehave
$ sudo apt-get dist-upgrade

$ sudo apt-get autoremove

# You may need to delete the X Window configuration file, otherwise the driver may not load correctly
$ sudo rm -rf ~/.Xauthority 

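# If the following error occurs
# ubuntu 18.04 "nvidia-340 导致 /usr/lib/x86_64-linux-gnu/libGL.so.1 
# 转移到 /usr/lib/x86_64-linux-gnu/libGL.so.1.distrib"
# (a dpkg diversion of libGL.so.1 held by nvidia-340)
# see http://www.mobibrw.com/?p=21739 

# Remove the local install repository to free several GB of disk space; it is no longer needed after installation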
$ sudo apt-get remove --purge cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00 

$ sudo apt-get update

# Some driver packages may have updates; apply them, otherwise things may still misbehave
$ sudo apt-get dist-upgrade

$ sudo apt-get autoremove

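Install cuDNN: download the cuDNN release that matches your CUDA version from the official site. It comes as three packages; install them one after another.

Note that on a GeForce GT 720M a driver this new ignores the GPU, as the kernel log shows: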

[   72.533870] NVRM: The NVIDIA GeForce GT 720M GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 430.50 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[   72.533875] NVRM: No NVIDIA graphics adapter found!

------------------------------------------------------------------------------------

As before, the recommendation is to create a dedicated build environment in Anaconda and run the build there:

$ sudo apt-get install git

# conda remove -n pytorch --all

$ conda create -n pytorch -y python=3.6.8 pip

$ source activate pytorch

$ conda install numpy pyyaml mkl=2019.1 mkl-include=2019.1 setuptools cmake cffi typing pybind11

$ conda install ninja

# magma-cuda90, magma-cuda91 and magma-cuda92 cause the build to fail
$ conda install -c pytorch magma-cuda101

$ git clone https://github.com/pytorch/pytorch

$ cd pytorch

# PyTorch 1.0.1 still supports hardware with a low Compute Capability such as 3.0; PyTorch 1.2.0 needs at least 3.5 to run properly
# https://github.com/pytorch/pytorch/blob/v1.3.0/torch/utils/cpp_extension.py
$ git checkout v1.0.1 -b v1.0.1

$ git submodule sync

$ git submodule update --init --recursive

$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

# If CUDA is not needed, also add: export NO_CUDA=1

$ python setup.py clean

# Uninstall any previously installed PyTorch
$ conda uninstall pytorch

# Look up the Compute Capability of your hardware on the NVIDIA developer site
# For example, the GeForce GTX 760 has compute capability 3.0; a wrong value causes runtime failures such as
# RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

$ python setup.py install

# For development mode, use
# python setup.py build develop

# Be sure to leave the PyTorch source directory; running commands from inside it causes errors
$ cd ..
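
The environment variables used by the build can also be assembled in one place. The sketch below is an illustration, not part of the original steps: build_env is a hypothetical helper, the conda path in the example is made up, and TORCH_CUDA_ARCH_LIST is an addition of mine. It is the variable PyTorch's build scripts read to choose the target GPU architectures, which is one way to make the compute capability mentioned in the comments above explicit.

```python
import os


def build_env(arch_list="3.0", use_cuda=True, base=None):
    """Return a copy of the environment prepared for `python setup.py install`."""
    env = dict(os.environ if base is None else base)
    # Mirrors the CMAKE_PREFIX_PATH export above for the case where
    # CONDA_PREFIX is set; the `which conda` fallback is not reproduced here.
    if "CONDA_PREFIX" in env:
        env["CMAKE_PREFIX_PATH"] = env["CONDA_PREFIX"]
    if use_cuda:
        # Assumption: not in the original steps. Target architectures, e.g. "3.0".
        env["TORCH_CUDA_ARCH_LIST"] = arch_list
    else:
        # CPU-only build, as described in the comment about NO_CUDA.
        env["NO_CUDA"] = "1"
    return env


# Hypothetical conda environment path, for illustration only.
env = build_env("3.0", base={"CONDA_PREFIX": "/opt/conda/envs/pytorch"})
print(env["TORCH_CUDA_ARCH_LIST"], env["CMAKE_PREFIX_PATH"])
```

From the PyTorch checkout, the resulting dictionary could then be passed as the env argument of subprocess.check_call when running setup.py.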

If the following error appears:

[ 68%] Building NVCC (Device) object caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/ATen/native/sparse/cuda/caffe2_gpu_generated_SparseCUDABlas.cu.o
~/pytorch/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(58): error: more than one instance of function "at::native::sparse::cuda::cusparseGetErrorString" matches the argument list:
            function "cusparseGetErrorString(cusparseStatus_t)"
            function "at::native::sparse::cuda::cusparseGetErrorString(cusparseStatus_t)"
            argument types are: (cusparseStatus_t)

In that case, edit aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu and wrap the cusparseGetErrorString function in #if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2))) and a matching #endif, as follows:

#if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))
const char* cusparseGetErrorString(cusparseStatus_t status) {
  switch(status)
  {
    case CUSPARSE_STATUS_SUCCESS:
      return "success";

    case CUSPARSE_STATUS_NOT_INITIALIZED:
      return "library not initialized";

    case CUSPARSE_STATUS_ALLOC_FAILED:
      return "resource allocation failed";

    case CUSPARSE_STATUS_INVALID_VALUE:
      return "an invalid numeric value was used as an argument";

    case CUSPARSE_STATUS_ARCH_MISMATCH:
      return "an absent device architectural feature is required";

    case CUSPARSE_STATUS_MAPPING_ERROR:
      return "an access to GPU memory space failed";

    case CUSPARSE_STATUS_EXECUTION_FAILED:
      return "the GPU program failed to execute";

    case CUSPARSE_STATUS_INTERNAL_ERROR:
      return "an internal operation failed";

    case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
      return "the matrix type is not supported by this function";

    case CUSPARSE_STATUS_ZERO_PIVOT:
      return "an entry of the matrix is either structural zero or numerical zero (singular block)";

    default:
      return "unknown error";
  }
}
#endif
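
To see for which cuSPARSE versions the guard keeps the local fallback, its condition can be mirrored in a few lines of Python. This is only an illustration of the boolean logic: original_guard restates the macro expression above, and tuple_guard is a hypothetical alternative that compares (major, minor) as a pair.

```python
def original_guard(major, minor):
    """Mirror of !((MAJOR >= 10) && (MINOR >= 2)); True means the fallback is compiled."""
    return not (major >= 10 and minor >= 2)


def tuple_guard(major, minor):
    """Alternative: compile the fallback only for versions older than 10.2."""
    return (major, minor) < (10, 2)


for version in [(9, 1), (10, 1), (10, 2), (11, 0)]:
    print(version, original_guard(*version), tuple_guard(*version))
```

The two agree for the versions relevant here; they differ only for a later major version with a minor number below 2, where the macro tests the two numbers independently. Whether that matters depends on the cuSPARSE versions being targeted.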

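This resolves the conflict with the function of the same name that ships with CUDA 10.1.

For details, see: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu

To uninstall a PyTorch that was installed from source, run: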

# conda uninstall pytorch

$ pip uninstall torch

$ python setup.py clean

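Cloning the PyTorch source is very slow; an already synchronized copy of the PyTorch source code can be downloaded from this site.

Reference links

Switching Anaconda conda to mirrors in China

On Windows

1 Add the Tsinghua mirror

Run the following commands directly on the command line: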

$ conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/

$ conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/

# pytorch
$ conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/

# Show channel URLs when searching
$ conda config --set show_channel_urls yes

2 Add the USTC mirror

$ conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/

$ conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/

$ conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/

$ conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/

$ conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/

$ conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/

$ conda config --set show_channel_urls yes

On Linux

Put the configuration into ~/.condarc:

$ vim ~/.condarc

channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
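
The file above follows a regular pattern, so it can be generated rather than typed by hand. A minimal sketch, assuming the mirror exposes the same pkgs/ and cloud/ layout shown above; render_condarc is a hypothetical helper name, and for a mirror other than Tsinghua the channel lists would need to be checked against what that mirror actually hosts.

```python
def render_condarc(mirror="https://mirrors.tuna.tsinghua.edu.cn/anaconda",
                   pkgs=("main", "free", "r"),
                   cloud=("conda-forge", "msys2", "bioconda",
                          "menpo", "pytorch", "simpleitk")):
    """Build the text of a .condarc that routes default and cloud channels to a mirror."""
    lines = ["channels:", "  - defaults", "show_channel_urls: true", "default_channels:"]
    lines += ["  - {}/pkgs/{}".format(mirror, name) for name in pkgs]
    lines.append("custom_channels:")
    lines += ["  {}: {}/cloud".format(name, mirror) for name in cloud]
    return "\n".join(lines) + "\n"


print(render_condarc(), end="")
```

Called with its defaults, the function reproduces the Tsinghua configuration listed above line for line; the text can then be written to ~/.condarc.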

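Important

After making the changes, be sure to start a new shell; otherwise the settings do not take effect.

Reference links

PyTorch reports 'ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.' at runtime

While test-building FOTS, the following error appeared: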

(FOTS) $~/Source/FOTS.PyTorch$ bash build.sh 
Compiling crop_and_resize kernels by nvcc...
Traceback (most recent call last):
  File "build.py", line 3, in <module>
    from torch.utils.ffi import create_extension
  File "~/.conda/envs/FOTS/lib/python2.7/site-packages/torch/utils/ffi/__init__.py", line 1, in <module>
    raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.

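The least-effort fix is to downgrade PyTorch to 0.4. It is strongly recommended to create a separate Python environment with Anaconda and run in that clean environment.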
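
A build script that still depends on torch.utils.ffi could test the installed version up front and fail with a clearer message. The sketch below shows only the version comparison; parse_version and legacy_ffi_available are hypothetical helper names, and the string passed in would normally be torch.__version__. It assumes the cut-off is the 1.0 release, from which the import raises the error shown above.

```python
def parse_version(version):
    """Reduce a version string such as '1.0.1.post2' to the pair (1, 0)."""
    major, minor = version.split('.')[:2]
    return int(major), int(minor)


def legacy_ffi_available(version):
    """True for releases before 1.0, which still ship a working torch.utils.ffi."""
    return parse_version(version) < (1, 0)


for v in ["0.4.1", "1.0.1", "1.2.0"]:
    print(v, "ffi usable" if legacy_ffi_available(v) else "needs cpp extensions")
```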

Reference links

Ubuntu 18.04: "nvidia-340 导致 /usr/lib/x86_64-linux-gnu/libGL.so.1 转移到 /usr/lib/x86_64-linux-gnu/libGL.so.1.distrib" (a dpkg diversion of libGL.so.1 held by nvidia-340)

$ sudo apt-get install cuda
正在读取软件包列表... 完成
正在分析软件包的依赖关系树
正在读取状态信息... 完成
cuda 已经是最新版 (10.1.243-1)。
您也许需要运行“apt --fix-broken install”来修正上面的错误。
下列软件包有未满足的依赖关系：
cuda-drivers : 依赖: libnvidia-gl-418 (>= 418.87.00) 但是它将不会被安装
libnvidia-ifr1-418 : 依赖: libnvidia-gl-418 但是它将不会被安装
nvidia-driver-418 : 依赖: libnvidia-gl-418 (= 418.87.00-0ubuntu1) 但是它将不会被安装
推荐: libnvidia-compute-418:i386 (= 418.87.00-0ubuntu1)
推荐: libnvidia-decode-418:i386 (= 418.87.00-0ubuntu1)
推荐: libnvidia-encode-418:i386 (= 418.87.00-0ubuntu1)
推荐: libnvidia-ifr1-418:i386 (= 418.87.00-0ubuntu1)
推荐: libnvidia-fbc1-418:i386 (= 418.87.00-0ubuntu1)
推荐: libnvidia-gl-418:i386 (= 418.87.00-0ubuntu1)
E: 有未能满足的依赖关系。请尝试不指明软件包的名字来运行“apt --fix-broken install”(也可以指定一个解决办法)。

$ sudo apt --fix-broken install
正在读取软件包列表... 完成
正在分析软件包的依赖关系树
正在读取状态信息... 完成
正在修复依赖关系... 完成
下列软件包是自动安装的并且现在不需要了：
lib32gcc1 libc6-i386
使用'sudo apt autoremove'来卸载它(它们)。
将会同时安装下列软件：
libnvidia-gl-418
下列【新】软件包将被安装：
libnvidia-gl-418
升级了 0 个软件包，新安装了 1 个软件包，要卸载 0 个软件包，有 0 个软件包未被升级。
有 68 个软件包没有被完全安装或卸载。
需要下载 0 B/32.2 MB 的归档。
解压缩后会消耗 164 MB 的额外空间。
您希望继续执行吗？ [Y/n]
获取:1 file:/var/cuda-repo-10-1-local-10.1.243-418.87.00 libnvidia-gl-418 418.87.00-0ubuntu1 [32.2 MB]
(正在读取数据库 ... 系统当前共安装有 293566 个文件和目录。)
正准备解包 .../libnvidia-gl-418_418.87.00-0ubuntu1_amd64.deb ...
dpkg-query: 没有找到与 libnvidia-gl-410 相匹配的软件包
nvidia-340 导致 /usr/lib/x86_64-linux-gnu/libGL.so.1 转移到 /usr/lib/x86_64-linux-gnu/libGL.so.1.distrib
dpkg-divert: 错误: 删除 被 libnvidia-gl-418 转移的 /usr/lib/x86_64-linux-gnu/libGL.so.1 时
软件包名不匹配
发现了 nvidia-340 导致 /usr/lib/x86_64-linux-gnu/libGL.so.1 转移到 /usr/lib/x86_64-linux-gnu/libGL.so.1.distrib
dpkg: 处理归档 /var/cuda-repo-10-1-local-10.1.243-418.87.00/./libnvidia-gl-418_418.87.00-0ubuntu1_amd64.deb (--unpack)时出错：
new libnvidia-gl-418:amd64 package pre-installation script subprocess returned error exit status 2
在处理时有错误发生：
/var/cuda-repo-10-1-local-10.1.243-418.87.00/./libnvidia-gl-418_418.87.00-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

The fix is as follows:

# Remove every diversion held by nvidia-340

$ LC_MESSAGES=C dpkg-divert --list '*nvidia-340*' | sed -nre 's/^diversion of (.*) to .*/\1/p' | xargs -rd'\n' -n1 -- sudo dpkg-divert --remove

$ sudo dpkg-divert --package nvidia-340 --remove /usr/lib/i386-linux-gnu/libGL.so.1
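
The sed expression in the first command extracts the diverted path from each line of the dpkg-divert listing. The same extraction can be sketched in Python, which makes it easy to inspect what would be removed before touching the system. diverted_paths is a hypothetical helper, and the sample listing is hand-written for illustration in the C-locale format that the pipeline above relies on.

```python
import re

# One listing line looks like: diversion of <path> to <target> by <package>
DIVERSION = re.compile(r"^diversion of (?P<path>.*) to (?P<target>.*) by (?P<package>\S+)$")


def diverted_paths(listing, package):
    """Return the paths that `package` has diverted, in listing order."""
    paths = []
    for line in listing.splitlines():
        match = DIVERSION.match(line.strip())
        if match and match.group("package") == package:
            paths.append(match.group("path"))
    return paths


# Illustrative listing, not captured from a real system.
sample = """\
diversion of /usr/lib/x86_64-linux-gnu/libGL.so.1 to /usr/lib/x86_64-linux-gnu/libGL.so.1.distrib by nvidia-340
diversion of /usr/lib/i386-linux-gnu/libGL.so.1 to /usr/lib/i386-linux-gnu/libGL.so.1.distrib by nvidia-340
local diversion of /bin/sh to /bin/sh.distrib
"""
print(diverted_paths(sample, "nvidia-340"))
```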

Reference links

发现了 nvidia-340 导致 /usr/lib/i386-linux-gnu/libGL.so.1 /usr/lib/i386-linux-gnu/libGL.so.1.distrib...

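Ubuntu 18.04: very high CPU usage by systemd-udevd after installing CUDA Toolkit 10.1 Update 2

After recently installing the latest CUDA Toolkit 10.1 Update 2 on Ubuntu 18.04 on a T440 laptop, I found the systemd-udevd process using a lot of CPU. Running sudo /lib/systemd/systemd-udevd -D shows the following output repeating continuously: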

RUN '/bin/systemctl start --no-block nvidia-persistenced.service' /lib/udev/rules.d/71-nvidia.rules:12
RUN '/sbin/modprobe nvidia-modeset' /lib/udev/rules.d/71-nvidia.rules:16
RUN '/sbin/modprobe nvidia-drm' /lib/udev/rules.d/71-nvidia.rules:20
RUN '/sbin/modprobe nvidia-uvm' /lib/udev/rules.d/71-nvidia.rules:24
RUN '/usr/bin/nvidia-smi' /lib/udev/rules.d/71-nvidia.rules:28
starting '/bin/systemctl start --no-block nvidia-persistenced.service'
Process '/bin/systemctl start --no-block nvidia-persistenced.service' succeeded.
starting '/sbin/modprobe nvidia-modeset'
seq 115679 queued, 'remove' 'module'
seq 115680 queued, 'add' 'module'
seq 115681 queued, 'add' 'slab'
seq 115682 queued, 'add' 'drivers'
seq 115681 running
seq 115682 running
seq 115681 processed
seq 115683 queued, 'remove' 'slab'
seq 115684 queued, 'remove' 'drivers'
seq 115683 running
seq 115683 processed
seq 115682 processed
seq 115684 running
seq 115684 processed
seq 115685 queued, 'remove' 'module'
'/sbin/modprobe nvidia-modeset'(err) 'modprobe: ERROR: could not insert 'nvidia_modeset': No such device'
Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.

The fix is as follows:

$ sudo rm -rf /lib/udev/rules.d/71-nvidia.rules

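Reference links

CUDA can no longer be installed or used on macOS Catalina (10.15.1)

On macOS Catalina (10.15.1), CUDA can no longer be installed or used.

CUDA appears to support macOS only up to High Sierra (10.13).

The reason is that Apple tightly controls graphics driver permissions, so even when NVIDIA wants to ship a driver update it has to wait for Apple's approval.

Apple also appears to have customized the graphics driver heavily, and the team doing that customization seems to have been disbanded. As a result the driver cannot be updated, which is a real pity.

The reference links offer many opinions, but they agree that downgrading the system is the only way. Docker does not help either: the host has no support, so it cannot work.

That said, a dual-boot setup seems to get around the problem: one macOS High Sierra (10.13) installation and another with a newer release. Unfortunately this only works by installing the newer system from the older one; installing an older system from a newer one is refused. See: Installing macOS on a separate APFS volume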
