Install the driver:
# Remove every other installed version of the NVIDIA driver
$ sudo apt-get purge 'nvidia-*'
$ sudo reboot
# Provides nvidia-smi
$ sudo apt install nvidia-utils-470
# Driver
$ sudo apt install nvidia-driver-470
# CUDA 11.3
$ sudo apt install nvidia-cuda-toolkit
$ sudo apt-get update
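# Some driver packages may have pending updates; apply them, or the driver may still misbehave
$ sudo apt-get dist-upgrade
$ sudo apt-get autoremove
# Reboot, otherwise parts of the driver may not work properly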
$ sudo reboot
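After the reboot it is worth confirming that the tools installed above are actually on the PATH before moving on. The snippet below is a minimal sketch: `have_cmd` is a helper name made up for this post, and it only reports whether each command is found, it does not validate the driver itself.

```shell
# Return 0 if the given command is on PATH, non-zero otherwise.
# (have_cmd is a hypothetical helper name, not part of any package.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# nvidia-smi comes from nvidia-utils-470, nvcc from nvidia-cuda-toolkit.
for tool in nvidia-smi nvcc; do
    if have_cmd "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

If both are reported as found, running `nvidia-smi` by hand should list the GPU and the driver version.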
Create an isolated build environment with Anaconda, then build inside it:
# wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
# Download from a mirror in China
$ wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2021.11-Linux-x86_64.sh
$ bash Anaconda3-*-Linux-x86_64.sh
# Update conda to the latest version
$ conda update -n base -c defaults conda
To speed up downloads, see "Anaconda: switching conda to a mirror in China".
Build and configure StyleGAN3
$ sudo apt-get install git
$ git clone https://github.com/NVlabs/stylegan3.git
$ cd stylegan3
$ conda env create -f environment.yml
$ conda activate stylegan3
$ pip install psutil
# cuDNN acceleration
$ conda install cudnn
# Tested on an RTX 3060 12GB: a batch size of 2 is recommended; larger values run out of memory (OOM)
# When batch is below 4, --mbstd-group=2 must be specified as well
$ python train.py --outdir="$HOME/training-runs" --cfg=stylegan3-t --data="$HOME/datasets/metfaces-1024x1024.zip" --gpus=1 --batch=2 --mbstd-group=2 --gamma=8.2 --mirror=1 --metrics=none
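The rule about --mbstd-group can be turned into a small helper. This is only a sketch under two assumptions of mine: that the default group size is 4, and that the group size has to divide the per-GPU batch (batch divided by the number of GPUs). `pick_mbstd_group` is a made-up name, not something shipped with StyleGAN3.

```shell
# Pick the largest group size <= 4 that divides the per-GPU batch.
# Usage: pick_mbstd_group BATCH GPUS
# (hypothetical helper; the "4" and the divisibility rule are assumptions)
pick_mbstd_group() {
    per_gpu=$(( $1 / $2 ))
    g=4
    while [ "$g" -gt 1 ]; do
        if [ "$g" -le "$per_gpu" ] && [ $(( per_gpu % g )) -eq 0 ]; then
            break
        fi
        g=$(( g - 1 ))
    done
    echo "$g"
}

batch=2
gpus=1
# Prints the two flags to paste into the train.py command line.
echo "--batch=$batch --mbstd-group=$(pick_mbstd_group "$batch" "$gpus")"
```

For the single-GPU, batch=2 setup used here this prints `--batch=2 --mbstd-group=2`, which matches the flags in the command above.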
If the following error appears:
Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Traceback (most recent call last):
File "~/source/stylegan3/train.py", line 286, in <module>
main() # pylint: disable=no-value-for-parameter
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "~/source/stylegan3/train.py", line 281, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "~/source/stylegan3/train.py", line 96, in launch_training
subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
File "~/source/stylegan3/train.py", line 47, in subprocess_fn
training_loop.training_loop(rank=rank, **c)
File "~/source/stylegan3/training/training_loop.py", line 168, in training_loop
img = misc.print_module_summary(G, [z, c])
File "~/source/stylegan3/torch_utils/misc.py", line 216, in print_module_summary
outputs = module(*inputs)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "~/source/stylegan3/training/networks_stylegan3.py", line 511, in forward
ws = self.mapping(z, c, truncation_psi=truncation_psi, truncation_cutoff=truncation_cutoff, update_emas=update_emas)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "~/source/stylegan3/training/networks_stylegan3.py", line 151, in forward
x = getattr(self, f'fc{idx}')(x)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "~/source/stylegan3/training/networks_stylegan3.py", line 100, in forward
x = bias_act.bias_act(x, b, act=self.activation)
File "~/source/stylegan3/torch_utils/ops/bias_act.py", line 84, in bias_act
if impl == 'cuda' and x.device.type == 'cuda' and _init():
File "~/source/stylegan3/torch_utils/ops/bias_act.py", line 41, in _init
_plugin = custom_ops.get_plugin(
File "~/source/stylegan3/torch_utils/custom_ops.py", line 136, in get_plugin
torch.utils.cpp_extension.load(name=module_name, build_directory=cached_build_dir,
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1080, in load
return _jit_compile(
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1318, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1701, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 565, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1173, in create_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
ImportError: ~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by ~/.cache/torch_extensions/bias_act_plugin/3cb576a0039689487cfba59279dd6d46-nvidia-geforce-rtx-3060/bias_act_plugin.so)
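The error occurs because the StyleGAN3 custom-op plugin (bias_act_plugin) is compiled at run time against a newer libstdc++.so, while the Python process loads the older libstdc++.so.6 bundled in the Anaconda environment, which does not provide the GLIBCXX_3.4.29 symbol version.
With the cause known, the fix is straightforward: manually upgrade the libstdc++.so shared library inside the Anaconda environment.
As follows: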
$ conda activate stylegan3
$ conda install cmake
$ conda install make
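# The key upgrade: update libstdc++.so inside the current environment
$ conda install -c conda-forge libstdcxx-ng
# Delete the build cache left over from the failed attempt
$ rm -rf ~/.cache/torch_extensions
# Tested on an RTX 3060 12GB: a batch size of 2 is recommended; larger values run out of memory (OOM)
# With batch=4, OOM was reported on day 11 of training
# When batch is below 4, --mbstd-group=2 must be specified as well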
$ python train.py --outdir="$HOME/training-runs" --cfg=stylegan3-t --data="$HOME/datasets/metfaces-1024x1024.zip" --gpus=1 --batch=2 --mbstd-group=2 --gamma=8.2 --mirror=1 --metrics=none
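To check the diagnosis, or to confirm that the upgrade took effect, it helps to compare the highest GLIBCXX version tag in each copy of libstdc++.so.6. The sketch below scans the library file for `GLIBCXX_x.y.z` strings; `max_glibcxx` is a made-up helper name, the system library path is an assumption for x86_64 Ubuntu, and `$CONDA_PREFIX` is the variable conda sets once the environment is activated.

```shell
# Print the highest GLIBCXX_x.y.z version tag found in a libstdc++ file.
# (max_glibcxx is a hypothetical helper name.)
max_glibcxx() {
    LC_ALL=C grep -ao 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -uV | tail -n 1
}

# Assumed locations; adjust to your machine:
#   system copy: /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#   conda copy:  $CONDA_PREFIX/lib/libstdc++.so.6
for lib in /usr/lib/x86_64-linux-gnu/libstdc++.so.6 \
           "${CONDA_PREFIX:-/nonexistent}/lib/libstdc++.so.6"; do
    if [ -r "$lib" ]; then
        echo "$lib: $(max_glibcxx "$lib")"
    fi
done
```

If the conda copy reports a version below GLIBCXX_3.4.29 while the system copy reports 3.4.29 or higher, that is the mismatch described above.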
In testing so far, batch=4 ran out of memory on day 11 of training:
tick 444 kimg 1776.0 time 11d 17h 14m sec/tick 2292.6 sec/kimg 573.16 maintenance 0.2 cpumem 5.40 gpumem 7.69 reserved 10.03 augment 0.344
Traceback (most recent call last):
File "~/source/stylegan3/train.py", line 286, in <module>
main() # pylint: disable=no-value-for-parameter
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "~/source/stylegan3/train.py", line 281, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "~/source/stylegan3/train.py", line 96, in launch_training
subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
File "~/source/stylegan3/train.py", line 47, in subprocess_fn
training_loop.training_loop(rank=rank, **c)
File "~/source/stylegan3/training/training_loop.py", line 278, in training_loop
loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
File "~/source/stylegan3/training/loss.py", line 81, in accumulate_gradients
loss_Gmain.mean().mul(gain).backward()
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/autograd/function.py", line 87, in apply
return self._forward_cls.backward(self, *args) # type: ignore[attr-defined]
File "~/source/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 50, in backward
grad_input, grad_grid = _GridSample2dBackward.apply(grad_output, input, grid)
File "~/source/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 59, in forward
grad_input, grad_grid = op(grad_output, input, grid, 0, 0, False)
RuntimeError: CUDA out of memory. Tried to allocate 1.39 GiB (GPU 0; 11.76 GiB total capacity; 7.06 GiB already allocated; 443.88 MiB free; 10.02 GiB reserved in total by PyTorch)
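The last progress line before the traceback already shows gpumem 7.69 and reserved 10.03 on a card with 11.76 GiB in total, so it can be useful to watch these two columns over the course of a run. Below is a minimal sketch that pulls them out of progress lines in the format shown above; `gpu_mem_per_tick` is a made-up helper name, and the log file location mentioned afterwards is an assumption.

```shell
# Print "tick gpumem reserved" for every progress line read from the
# given files (or from stdin when no file is given).
# (gpu_mem_per_tick is a hypothetical helper name.)
gpu_mem_per_tick() {
    awk '$1 == "tick" {
        gpumem = ""; reserved = ""
        for (i = 1; i < NF; i++) {
            if ($i == "gpumem")   gpumem   = $(i + 1)
            if ($i == "reserved") reserved = $(i + 1)
        }
        print $2, gpumem, reserved
    }' "$@"
}

# Demo on the progress line quoted in this post.
printf '%s\n' 'tick 444 kimg 1776.0 time 11d 17h 14m sec/tick 2292.6 sec/kimg 573.16 maintenance 0.2 cpumem 5.40 gpumem 7.69 reserved 10.03 augment 0.344' \
    | gpu_mem_per_tick
# prints: 444 7.69 10.03
```

Assuming the run directory keeps its console output in a log.txt file, something like `gpu_mem_per_tick "$HOME"/training-runs/*/log.txt | tail -n 5` would show the most recent ticks.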