RuntimeError: CUDA environment is not correctly set up · Issue #2516 · chainer/chainer

I'm trying to run 'lda2vec' which calls 'chainer' on my AWS g2.2xlarge Ubuntu 14.04 LTS instance.
I'm getting this error:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-bccd002b2788> in <module>()
     21 gpu_id = int(os.getenv('CUDA_GPU', 0))
---> 22 cuda.get_device(gpu_id).use()
     23 print "Using GPU " + str(gpu_id)
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/chainer/cuda.pyc in get_device(*args)
    220     for arg in args:
    221         if type(arg) in _integer_types:
--> 222             check_cuda_available()
    223             return Device(arg)
    224         if isinstance(arg, ndarray):
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/chainer/cuda.pyc in check_cuda_available()
     85                '(see https://github.com/pfnet/chainer#installation).')
     86         msg += str(_resolution_error)
---> 87         raise RuntimeError(msg)
     88     if (not cudnn_enabled and
     89             not _cudnn_disabled_by_user and
RuntimeError: CUDA environment is not correctly set up
(see https://github.com/pfnet/chainer#installation).CuPy is not correctly installed. Please check your environment, uninstall Chainer and reinstall it with `pip install chainer --no-cache-dir -vvvv`.
original error: libcublas.so.7.0: cannot open shared object file: No such file or directory
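The last line of that message is the underlying cause: the dynamic loader could not open `libcublas.so.7.0`. A minimal sketch for checking this outside Chainer, assuming Python 3 and only the standard library (the helper name `can_load` is hypothetical), asks the loader directly which library names it can resolve:

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the shared library `soname`."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


if __name__ == "__main__":
    # Library names taken from the error message and logs in this thread.
    for name in ("libcublas.so.7.0", "libcublas.so.7.5", "libcuda.so.1"):
        print(name, "->", "found" if can_load(name) else "NOT found")
```

If a library file exists on disk but is reported as not found, its directory is probably missing from `LD_LIBRARY_PATH` or from the `ldconfig` cache.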
I installed 'chainer' from source:
In [9]: import chainer
In [10]: chainer.__version__
Out[10]: '1.22.0'
TensorFlow runs and uses the GPU (so we know CUDA 7.5 is working):
In [1]: import tensorflow
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
NVIDIA graphics driver: nvidia-graphics-drivers-367_367.57.orig.tar

cuda 7.5: cudnn-7.5-linux-x64-v5.1.tgz
ubuntu@ip-10-0-1-127:~$ ls -l /usr/local/cuda-7.0/lib64/libcublas.so.7.0*
lrwxrwxrwx 1 root root       19 Mar 27 16:01 /usr/local/cuda-7.0/lib64/libcublas.so.7.0 -> libcublas.so.7.0.28
-rwxr-xr-x 1 root root 31160168 Mar 27 16:02 /usr/local/cuda-7.0/lib64/libcublas.so.7.0.28
ubuntu@ip-10-0-1-127:~$ ls -l /usr/local/cuda-7.0/lib64/
total 723244
-rw-r--r-- 1 root root 26032916 Mar 27 16:01 libcublas_device.a
lrwxrwxrwx 1 root root       16 Mar 27 16:01 libcublas.so -> libcublas.so.7.0
lrwxrwxrwx 1 root root       19 Mar 27 16:01 libcublas.so.7.0 -> libcublas.so.7.0.28
-rwxr-xr-x 1 root root 31160168 Mar 27 16:02 libcublas.so.7.0.28
-rw-r--r-- 1 root root 35269768 Mar 27 16:02 libcublas_static.a
-rw-r--r-- 1 root root   310328 Mar 27 16:01 libcudadevrt.a
lrwxrwxrwx 1 root root       16 Mar 27 16:02 libcudart.so -> libcudart.so.7.0
lrwxrwxrwx 1 root root       19 Mar 27 16:01 libcudart.so.7.0 -> libcudart.so.7.0.28
-rwxr-xr-x 1 root root   377896 Mar 27 16:01 libcudart.so.7.0.28
-rw-r--r-- 1 root root   708938 Mar 27 16:02 libcudart_static.a
lrwxrwxrwx 1 root root       15 Mar 27 16:01 libcufft.so -> libcufft.so.7.0
lrwxrwxrwx 1 root root       18 Mar 27 16:01 libcufft.so.7.0 -> libcufft.so.7.0.28
-rwxr-xr-x 1 root root 62554648 Mar 27 16:01 libcufft.so.7.0.28
-rw-r--r-- 1 root root 91554528 Mar 27 16:02 libcufft_static.a
lrwxrwxrwx 1 root root       16 Mar 27 16:02 libcufftw.so -> libcufftw.so.7.0
lrwxrwxrwx 1 root root       19 Mar 27 16:02 libcufftw.so.7.0 -> libcufftw.so.7.0.28
-rwxr-xr-x 1 root root   443648 Mar 27 16:01 libcufftw.so.7.0.28
-rw-r--r-- 1 root root    44288 Mar 27 16:01 libcufftw_static.a
lrwxrwxrwx 1 root root       17 Mar 27 16:01 libcuinj64.so -> libcuinj64.so.7.0
lrwxrwxrwx 1 root root       20 Mar 27 16:01 libcuinj64.so.7.0 -> libcuinj64.so.7.0.28
-rwxr-xr-x 1 root root  5276064 Mar 27 16:01 libcuinj64.so.7.0.28
-rw-r--r-- 1 root root  1649726 Mar 27 16:01 libculibos.a
lrwxrwxrwx 1 root root       16 Mar 27 16:01 libcurand.so -> libcurand.so.7.0
lrwxrwxrwx 1 root root       19 Mar 27 16:02 libcurand.so.7.0 -> libcurand.so.7.0.28
-rwxr-xr-x 1 root root 51746464 Mar 27 16:01 libcurand.so.7.0.28
-rw-r--r-- 1 root root 51981092 Mar 27 16:02 libcurand_static.a
lrwxrwxrwx 1 root root       18 Mar 27 16:01 libcusolver.so -> libcusolver.so.7.0
lrwxrwxrwx 1 root root       21 Mar 27 16:01 libcusolver.so.7.0 -> libcusolver.so.7.0.28
-rwxr-xr-x 1 root root 41170544 Mar 27 16:01 libcusolver.so.7.0.28
-rw-r--r-- 1 root root 17900430 Mar 27 16:01 libcusolver_static.a
lrwxrwxrwx 1 root root       18 Mar 27 16:01 libcusparse.so -> libcusparse.so.7.0
lrwxrwxrwx 1 root root       21 Mar 27 16:01 libcusparse.so.7.0 -> libcusparse.so.7.0.28
-rwxr-xr-x 1 root root 43486576 Mar 27 16:01 libcusparse.so.7.0.28
-rw-r--r-- 1 root root 51094980 Mar 27 16:01 libcusparse_static.a
lrwxrwxrwx 1 root root       14 Mar 27 16:01 libnppc.so -> libnppc.so.7.0
lrwxrwxrwx 1 root root       17 Mar 27 16:02 libnppc.so.7.0 -> libnppc.so.7.0.28
-rwxr-xr-x 1 root root   418944 Mar 27 16:02 libnppc.so.7.0.28
-rw-r--r-- 1 root root    20568 Mar 27 16:01 libnppc_static.a
lrwxrwxrwx 1 root root       14 Mar 27 16:02 libnppi.so -> libnppi.so.7.0
lrwxrwxrwx 1 root root       17 Mar 27 16:02 libnppi.so.7.0 -> libnppi.so.7.0.28
-rwxr-xr-x 1 root root 70595696 Mar 27 16:01 libnppi.so.7.0.28
-rw-r--r-- 1 root root 99608960 Mar 27 16:01 libnppi_static.a
lrwxrwxrwx 1 root root       14 Mar 27 16:02 libnpps.so -> libnpps.so.7.0
lrwxrwxrwx 1 root root       17 Mar 27 16:01 libnpps.so.7.0 -> libnpps.so.7.0.28
-rwxr-xr-x 1 root root  5851016 Mar 27 16:01 libnpps.so.7.0.28
-rw-r--r-- 1 root root  8461132 Mar 27 16:02 libnpps_static.a
lrwxrwxrwx 1 root root       16 Mar 27 16:02 libnvblas.so -> libnvblas.so.7.0
lrwxrwxrwx 1 root root       19 Mar 27 16:02 libnvblas.so.7.0 -> libnvblas.so.7.0.28
-rwxr-xr-x 1 root root   451984 Mar 27 16:01 libnvblas.so.7.0.28
lrwxrwxrwx 1 root root       24 Mar 27 16:01 libnvrtc-builtins.so -> libnvrtc-builtins.so.7.0
lrwxrwxrwx 1 root root       27 Mar 27 16:01 libnvrtc-builtins.so.7.0 -> libnvrtc-builtins.so.7.0.28
-rwxr-xr-x 1 root root 24079328 Mar 27 16:01 libnvrtc-builtins.so.7.0.28
lrwxrwxrwx 1 root root       15 Mar 27 16:01 libnvrtc.so -> libnvrtc.so.7.0
lrwxrwxrwx 1 root root       18 Mar 27 16:01 libnvrtc.so.7.0 -> libnvrtc.so.7.0.27
-rwxr-xr-x 1 root root 18179472 Mar 27 16:01 libnvrtc.so.7.0.27
lrwxrwxrwx 1 root root       18 Mar 27 16:02 libnvToolsExt.so -> libnvToolsExt.so.1
lrwxrwxrwx 1 root root       22 Mar 27 16:01 libnvToolsExt.so.1 -> libnvToolsExt.so.1.0.0
-rwxr-xr-x 1 root root    37568 Mar 27 16:01 libnvToolsExt.so.1.0.0
-rw-r--r-- 1 root root    25840 Mar 27 16:02 libOpenCL.so
drwxr-xr-x 2 root root     4096 Mar 27 16:02 stubs
ubuntu@ip-10-0-1-127:~$ 
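The listing covers only `lib64` of one toolkit directory. A small sketch, assuming Python 3 and that toolkits live in `cuda-*` directories under `/usr/local` (the function name `list_cuda_toolkits` is hypothetical), summarises every such directory and shows which ones contain `bin/nvcc` and which `libcublas` versions:

```python
import glob
import os


def list_cuda_toolkits(root="/usr/local"):
    """Summarise each cuda-* directory found directly under `root`."""
    report = {}
    for path in sorted(glob.glob(os.path.join(root, "cuda-*"))):
        if not os.path.isdir(path):
            continue
        libs = glob.glob(os.path.join(path, "lib64", "libcublas.so.*"))
        report[os.path.basename(path)] = {
            # True only if the compiler binary is present in this toolkit.
            "nvcc": os.path.isfile(os.path.join(path, "bin", "nvcc")),
            "libcublas": sorted(os.path.basename(p) for p in libs),
        }
    return report


if __name__ == "__main__":
    for name, info in list_cuda_toolkits().items():
        print(name, info)
```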
Is this going to be a problem?
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
OSError                                   Traceback (most recent call last)
 in ()
    102         t0 = time.time()
    103         optimizer.zero_grads()
--> 104         l = model.fit_partial(s.copy(), a.copy(), f.copy())
    105         prior = model.prior()
    106         loss = prior * fraction
/hdfs/lda2vec/examples/hacker_news/lda2vec/lda2vec_model.pyc in fit_partial(self, rsty_ids, raut_ids, rwrd_ids, window)
     40         sty_ids, aut_ids, wrd_ids = move(self.xp, rsty_ids, raut_ids, rwrd_ids)
     41         pivot_idx = next(move(self.xp, rwrd_ids[window: -window]))
---> 42         pivot = F.embed_id(pivot_idx, self.sampler.W)
     43         sty_at_pivot = rsty_ids[window: -window]
     44         aut_at_pivot = raut_ids[window: -window]
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/chainer/functions/connection/embed_id.pyc in embed_id(x, W, ignore_label)
    110     """
--> 111     return EmbedIDFunction(ignore_label=ignore_label)(x, W)
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/chainer/function.pyc in __call__(self, *inputs)
    197         # Forward prop
    198         with cuda.get_device(*in_data):
--> 199             outputs = self.forward(in_data)
    200             assert type(outputs) == tuple
    201         for hook in six.itervalues(hooks):
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/chainer/functions/connection/embed_id.pyc in forward(self, inputs)
     47                 mask[..., None], 0, W.take(xp.where(mask, 0, x), axis=0)),
---> 49         return W.take(x, axis=0),
     51     def backward(self, inputs, grad_outputs):
cupy/core/core.pyx in cupy.core.core.ndarray.take (cupy/core/core.cpp:13176)()
cupy/core/core.pyx in cupy.core.core.ndarray.take (cupy/core/core.cpp:13065)()
cupy/core/core.pyx in cupy.core.core._take (cupy/core/core.cpp:67356)()
cupy/core/elementwise.pxi in cupy.core.core.ElementwiseKernel.__call__ (cupy/core/core.cpp:42995)()
cupy/util.pyx in cupy.util.memoize.decorator.ret (cupy/util.cpp:1471)()
cupy/core/elementwise.pxi in cupy.core.core._get_elementwise_kernel (cupy/core/core.cpp:41341)()
cupy/core/elementwise.pxi in cupy.core.core._get_simple_elementwise_kernel (cupy/core/core.cpp:33972)()
cupy/core/elementwise.pxi in cupy.core.core._get_simple_elementwise_kernel (cupy/core/core.cpp:33794)()
cupy/core/carray.pxi in cupy.core.core.compile_with_cache (cupy/core/core.cpp:33449)()
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/cupy/cuda/compiler.pyc in compile_with_cache(source, options, arch, cache_dir)
    123             options += '-m32',
--> 125     env = (arch, options, _get_nvcc_version())
    126     if '#include' in source:
    127         pp_src = '%s %s' % (env, preprocess(source, options))
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/cupy/cuda/compiler.pyc in _get_nvcc_version()
     20     if _nvcc_version is None:
     21         cmd = ['nvcc', '--version']
---> 22         _nvcc_version = _run_nvcc(cmd, '.')
     24     return _nvcc_version
/home/ubuntu/anaconda/lib/python2.7/site-packages/chainer-1.22.0-py2.7-linux-x86_64.egg/cupy/cuda/compiler.pyc in _run_nvcc(cmd, cwd)
     59               'Check PATH environment variable: '
     60               + str(e)
---> 61         raise OSError(msg)

OSError: Failed to run nvcc command. Check PATH environment variable: [Errno 2] No such file or directory
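The traceback shows CuPy running `nvcc --version` and failing because the executable cannot be found. The same probe can be reproduced on its own. This is a sketch under assumptions (Python 3; `nvcc_version` is a hypothetical helper, not CuPy's code):

```python
import subprocess


def nvcc_version(cmd=("nvcc", "--version")):
    """Return the text printed by `nvcc --version`, or None if it cannot be run."""
    try:
        out = subprocess.check_output(list(cmd), stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        # OSError covers "No such file or directory", the case seen in the traceback.
        return None
    return out.decode("utf-8", "replace")


if __name__ == "__main__":
    version = nvcc_version()
    print(version if version is not None else "nvcc could not be run; check PATH")
```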
In [5]:
I tried installing the CUDA toolkit to get nvcc.
$ sudo apt-get install nvidia-cuda-toolkit
It installed the 375 version of the driver and broke CUDA.  I need the 367 version.
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: Unable to get the number of gpus available: unknown error)
How did you install CUDA? Chainer uses the nvcc command, the CUDA compiler that ships with the CUDA toolkit, so nvcc must be on your PATH.

If you are using the Amazon Linux AMI provided by NVIDIA, nvcc is already on PATH, and all you need to do is pip install chainer.

If you are not using it and are installing CUDA manually, I recommend installing CUDA from the deb package. You also need to add /usr/local/cuda/bin to the PATH environment variable. My fabfile may help you: https://github.com/unnonouno/chainer-vagrant-ec2/blob/master/fabfile.py
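The PATH advice can be checked from Python before importing Chainer. The following is a minimal sketch, assuming Python 3; the helper name `ensure_on_path` and its parameters are hypothetical, and `/usr/local/cuda/bin` is the directory named in the reply above:

```python
import os
import shutil


def ensure_on_path(tool="nvcc", cuda_bin="/usr/local/cuda/bin", env=None):
    """Return the full path of `tool`, prepending `cuda_bin` to PATH if needed.

    Returns None when the tool is neither on PATH nor in `cuda_bin`.
    """
    env = os.environ if env is None else env
    found = shutil.which(tool, path=env.get("PATH", ""))
    if found:
        return found
    candidate = shutil.which(tool, path=cuda_bin)
    if candidate:
        # Only touch PATH when the tool really is in cuda_bin.
        env["PATH"] = cuda_bin + os.pathsep + env.get("PATH", "")
    return candidate


if __name__ == "__main__":
    print(ensure_on_path() or "nvcc not found; is the CUDA toolkit installed?")
```

Changing `os.environ` this way only affects the current process and the children it starts afterwards. For a permanent fix, the export belongs in the shell profile.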
          I’m using my own AMI.  I’ve been having trouble getting the GPU on the g2.2xlarge instance to work.
I had to use the 367 version of the NVIDIA driver to get TensorFlow to work.
nvidia-graphics-drivers-367_367.57.orig.tar
Unfortunately, this package did not install ‘nvcc’.  Other attempts to install CUDA installed the 375 version of the driver which doesn’t work on the g2.2xlarge instance.
I’m trying this now:
cuda_7.5.18_linux.run
from:
https://developer.nvidia.com/cuda-75-downloads-archive
Installation failed:
The NVIDIA CUDA Toolkit provides command-line and graphical
Do you accept the previously read EULA? (accept/decline/quit): accept
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 352.39? ((y)es/(n)o/(q)uit): y
Do you want to install the OpenGL libraries? ((y)es/(n)o/(q)uit) [ default is yes ]: n
Install the CUDA 7.5 Toolkit? ((y)es/(n)o/(q)uit): y
Enter Toolkit Location [ default is /usr/local/cuda-7.5 ]: y
Toolkit location must be an absolute path.
Enter Toolkit Location [ default is /usr/local/cuda-7.5 ]:
Do you want to install a symbolic link at /usr/local/cuda? ((y)es/(n)o/(q)uit): y
Install the CUDA 7.5 Samples? ((y)es/(n)o/(q)uit): y
Enter CUDA Samples Location [ default is /home/ubuntu ]:
Installing the NVIDIA display driver...
The driver installation has failed due to an unknown error. Please consult the driver installation log located at /var/log/nvidia-installer.log.
===========
= Summary =
===========
Driver:   Installation Failed
Toolkit:  Installation skipped
Samples:  Installation skipped
Logfile is /tmp/cuda_install_14884.log
ubuntu@ip-10-0-1-189:~$
ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
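The installer refuses to continue because an `nvidia` kernel module is already loaded. A small sketch (Python 3; `loaded_nvidia_modules` is a hypothetical helper) that parses `lsmod` output and reports the use count of each NVIDIA module. The sample text is copied from the `lsmod` output in this thread:

```python
def loaded_nvidia_modules(lsmod_text):
    """Parse `lsmod` output and return {module_name: use_count} for nvidia* modules."""
    modules = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Expected columns: name, size, use count, optional list of users.
        if len(fields) >= 3 and fields[0].startswith("nvidia") and fields[2].isdigit():
            modules[fields[0]] = int(fields[2])
    return modules


SAMPLE = """\
nvidia_drm             14412  0
nvidia_modeset        764387  1 nvidia_drm
nvidia              11488748  1 nvidia_modeset
drm                   303102  4 ttm,drm_kms_helper,cirrus,nvidia_drm
"""

print(loaded_nvidia_modules(SAMPLE))
```

A non-empty result means the existing driver is still loaded, which is the condition the installer error describes.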
ubuntu@ip-10-0-1-189:~$ lsmod | grep nvidia
nvidia_drm             14412  0
nvidia_modeset        764387  1 nvidia_drm
nvidia              11488748  1 nvidia_modeset
drm                   303102  4 ttm,drm_kms_helper,cirrus,nvidia_drm
ubuntu@ip-10-0-1-189:~$ sudo apt-get --purge remove nvidia*
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package nvidia-graphics-drivers
E: Unable to locate package nvidia-graphics-drivers-367_367.57.orig.tar
E: Couldn't find any package by regex 'nvidia-graphics-drivers-367_367.57.orig.tar'
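The `E:` lines name `nvidia-graphics-drivers` and the `.orig.tar` file rather than installed packages. That suggests (an inference, not something confirmed in this thread) that the shell expanded the unquoted `nvidia*` against files in the current directory before apt-get saw it. The sketch below, assuming Python 3 and a hypothetical helper `shell_like_expand`, mimics that expansion:

```python
import glob
import os


def shell_like_expand(pattern, cwd):
    """Mimic an unquoted shell glob: names matching in `cwd`, else the pattern itself."""
    full = os.path.join(glob.escape(cwd), pattern)
    matches = sorted(os.path.basename(p) for p in glob.glob(full))
    # A shell passes the pattern through unchanged when nothing matches.
    return matches or [pattern]


if __name__ == "__main__":
    # Shows what apt-get would receive if `nvidia*` were typed in the current directory.
    print(shell_like_expand("nvidia*", os.getcwd()))
```

Quoting the pattern, as in `'nvidia*'`, keeps the shell from expanding it, so apt-get receives the pattern itself.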




    

          This is the history of the issue:
http://stackoverflow.com/questions/42984743/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver
          Tensorflow still works with the GPU.  Any suggestions on how to get only the toolkit with ‘nvcc’?
ubuntu@ip-10-0-1-189:~$ nvidia-smi
Sun Apr  9 16:23:03 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   33C    P0    41W / 125W |   3742MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3222    C   python                                        3740MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-10-0-1-189:~$
ubuntu@ip-10-0-1-189:~/models/neural_gpu$ python neural_gpu_trainer.py --problem=bmul
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
Error appending to /tmp/neural_gpu/log
Creating checkpoint directory /tmp/neural_gpu.
Generating data for bmul.
cut 1.20 lr 0.100 iw 0.80 cr 0.30 nm 64 d0.1000 gn 4.00 layers 2 kw 3 h 4 kh 3 batch 32 noise 0.00
Creating model.
Creating backward pass for the model.
WARNING:tensorflow:Tried to colocate gpu0/gradients/gpu0/Gather_2_grad/Shape with an op target_embedding/read that had a different device: /device:GPU:0 vs /device:CPU:0. Ignoring colocation property.
Created model for gpu 0 in 6.29 s.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Created model. Checkpoint dir /tmp/neural_gpu
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2495 get requests, put_count=2461 evicted_count=1000 eviction_rate=0.406339 and unsatisfied allocation rate=0.454509
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 6023 get requests, put_count=4617 evicted_count=1000 eviction_rate=0.216591 and unsatisfied allocation rate=0.403287
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281




    

          I got further … by copying ‘nvcc’ from an NVIDIA AMI into /usr/local/cuda-7.0/bin.
Any ideas?
===============================
nvcc fatal   : Path to libdevice library not specified
['nvcc', '-shared', '-O3', '-use_fast_math', '-m64', '-Xcompiler', '-DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden', '-Xlinker', '-rpath,/home/ubuntu/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.11-64/cuda_ndarray', '-I/home/ubuntu/anaconda/lib/python2.7/site-packages/theano/sandbox/cuda', '-I/home/ubuntu/anaconda/lib/python2.7/site-packages/numpy/core/include', '-I/home/ubuntu/anaconda/include/python2.7', '-I/home/ubuntu/anaconda/lib/python2.7/site-packages/theano/gof', '-o', '/home/ubuntu/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.11-64/cuda_ndarray/cuda_ndarray.so', 'mod.cu', '-L/home/ubuntu/anaconda/lib', '-lcublas', '-lpython2.7', '-lcudart']
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: ('nvcc return status', 1, 'for cmd', 'nvcc -shared -O3 -use_fast_math -m64 -Xcompiler -DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,/home/ubuntu/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.11-64/cuda_ndarray -I/home/ubuntu/anaconda/lib/python2.7/site-packages/theano/sandbox/cuda -I/home/ubuntu/anaconda/lib/python2.7/site-packages/numpy/core/include -I/home/ubuntu/anaconda/include/python2.7 -I/home/ubuntu/anaconda/lib/python2.7/site-packages/theano/gof -o /home/ubuntu/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.11-64/cuda_ndarray/cuda_ndarray.so mod.cu -L/home/ubuntu/anaconda/lib -lcublas -lpython2.7 -lcudart')
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: cuda unavailable)
In [3]: quit
ubuntu@ip-10-0-1-189:/hdfs/lda2vec/examples/hacker_news/lda2vec$ nvcc -shared -O3 -use_fast_math -m64 -Xcompiler -DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,/home/ubuntu/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.11-64/cuda_ndarray -I/home/ubuntu/anaconda/lib/python2.7/site-packages/theano/sandbox/cuda -I/home/ubuntu/anaconda/lib/python2.7/site-packages/numpy/core/include -I/home/ubuntu/anaconda/include/python2.7 -I/home/ubuntu/anaconda/lib/python2.7/site-packages/theano/gof -o /home/ubuntu/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.11-64/cuda_ndarray/cuda_ndarray.so mod.cu -L/home/ubuntu/anaconda/lib -lcublas -lpython2.7 -lcudart
nvcc fatal   : Path to libdevice library not specified
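The "Path to libdevice library not specified" message usually means nvcc could not find the rest of the toolkit around it, in particular the nvvm/libdevice directory, so copying the nvcc binary alone is probably not enough. Below is a minimal sketch that lists which pieces look absent. The expected layout is an assumption based on a typical toolkit install, and `missing_toolkit_parts` is a hypothetical helper name.

```python
import os


def missing_toolkit_parts(cuda_root):
    """List the expected toolkit entries that are absent under cuda_root."""
    # Assumed layout of a full CUDA toolkit install; adjust for your version.
    expected = [
        os.path.join("bin", "nvcc"),
        os.path.join("nvvm", "libdevice"),
        "include",
        "lib64",
    ]
    return [rel for rel in expected
            if not os.path.exists(os.path.join(cuda_root, rel))]


if __name__ == "__main__":
    # /usr/local/cuda-7.0 is the directory mentioned in the comment above.
    print(missing_toolkit_parts("/usr/local/cuda-7.0"))
```

An empty list suggests the directory at least resembles a complete toolkit. Anything else points at the parts that would need to come from a real toolkit install rather than a copied binary.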
 On Apr 9, 2017, at 9:24 AM, David Laxer wrote:
 TensorFlow still works with the GPU. Any suggestions on how to get only the toolkit with ‘nvcc’?
 ~$ nvidia-smi
 Sun Apr  9 16:23:03 2017
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |===============================+======================+======================|
 |   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
 | N/A   33C    P0    41W / 125W |   3742MiB /  4036MiB |      0%      Default |
 +-------------------------------+----------------------+----------------------+
 +-----------------------------------------------------------------------------+
 | Processes:                                                       GPU Memory |
 |  GPU       PID  Type  Process name                               Usage      |
 |=============================================================================|
 |    0      3222    C   python                                        3740MiB |
 +-----------------------------------------------------------------------------+
 ~/models/neural_gpu$ python neural_gpu_trainer.py --problem=bmul
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
 Error appending to /tmp/neural_gpu/log
 Creating checkpoint directory /tmp/neural_gpu.
 Generating data for bmul.
 cut 1.20 lr 0.100 iw 0.80 cr 0.30 nm 64 d0.1000 gn 4.00 layers 2 kw 3 h 4 kh 3 batch 32 noise 0.00
 Creating model.
 Creating backward pass for the model.
 WARNING:tensorflow:Tried to colocate gpu0/gradients/gpu0/Gather_2_grad/Shape with an op target_embedding/read that had a different device: /device:GPU:0 vs /device:CPU:0. Ignoring colocation property.
 Created model for gpu 0 in 6.29 s.
 W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
 W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
 W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
 I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
 I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
 name: GRID K520
 major: 3 minor: 0 memoryClockRate (GHz) 0.797
 pciBusID 0000:00:03.0
 Total memory: 3.94GiB
 Free memory: 3.91GiB
 I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
 I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
 I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
 Created model. Checkpoint dir /tmp/neural_gpu
 I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2495 get requests, put_count=2461 evicted_count=1000 eviction_rate=0.406339 and unsatisfied allocation rate=0.454509
 I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
 I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 6023 get requests, put_count=4617 evicted_count=1000 eviction_rate=0.216591 and unsatisfied allocation rate=0.403287
 I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281
> On Apr 9, 2017, at 9:04 AM, David Laxer wrote:
> Installation failed:
> The NVIDIA CUDA Toolkit provides command-line and graphical
> Do you accept the previously read EULA? (accept/decline/quit): accept
> Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 352.39? ((y)es/(n)o/(q)uit): y
> Do you want to install the OpenGL libraries? ((y)es/(n)o/(q)uit) [ default is yes ]: n
> Install the CUDA 7.5 Toolkit? ((y)es/(n)o/(q)uit): y
> Enter Toolkit Location [ default is /usr/local/cuda-7.5 ]: y
> Toolkit location must be an absolute path.
> Enter Toolkit Location [ default is /usr/local/cuda-7.5 ]:
> Do you want to install a symbolic link at /usr/local/cuda? ((y)es/(n)o/(q)uit): y
> Install the CUDA 7.5 Samples? ((y)es/(n)o/(q)uit): y
> Enter CUDA Samples Location [ default is /home/ubuntu ]:
> Installing the NVIDIA display driver...
> The driver installation has failed due to an unknown error. Please consult the driver installation log located at /var/log/nvidia-installer.log.
> ===========
> = Summary =
> ===========
> Driver:   Installation Failed
> Toolkit:  Installation skipped
> Samples:  Installation skipped
> Logfile is /tmp/cuda_install_14884.log
> ***@***.***:~$
> ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
> ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
> ~$ lsmod | grep nvidia
> nvidia_drm             14412  0
> nvidia_modeset        764387  1 nvidia_drm
> nvidia              11488748  1 nvidia_modeset
> drm                   303102  4 ttm,drm_kms_helper,cirrus,nvidia_drm
> ~$ sudo apt-get --purge remove nvidia*
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> E: Unable to locate package nvidia-graphics-drivers
> E: Unable to locate package nvidia-graphics-drivers-367_367.57.orig.tar
> E: Couldn't find any package by regex 'nvidia-graphics-drivers-367_367.57.orig.tar'
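The "Unable to locate package" errors above name a file (nvidia-graphics-drivers-367_367.57.orig.tar), which suggests the shell expanded the unquoted `nvidia*` against entries in the current directory before apt-get ever saw it. Quoting the pattern (`'nvidia*'`) should avoid that. Here is a small Python 3 sketch of the effect. `shell_expand` is a hypothetical helper that only imitates bash's default globbing, and the temporary directory stands in for the home directory.

```python
import glob
import os
import shlex
import tempfile


def shell_expand(pattern, cwd):
    """Mimic how an unquoted shell pattern is expanded in directory cwd.

    If nothing matches, the pattern is passed through unchanged, which
    is the default behaviour of bash.
    """
    matches = sorted(glob.glob(os.path.join(glob.escape(cwd), pattern)))
    return [os.path.basename(m) for m in matches] or [pattern]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as home:
        # Stand-in for a home directory holding the driver tarball.
        tarball = "nvidia-graphics-drivers-367_367.57.orig.tar"
        open(os.path.join(home, tarball), "w").close()
        print(shell_expand("nvidia*", home))
    # Quoting keeps the pattern literal so apt-get receives it as written.
    print("sudo apt-get --purge remove " + shlex.quote("nvidia*"))
```

Running the purge from a directory with no matching files would have the same effect as quoting, but quoting is the safer habit.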
>> On Apr 9, 2017, at 8:39 AM, David Laxer wrote:
>> I’m using my own AMI. I’ve been having trouble getting the GPU on the g2.2xlarge instance to work.
>> I had to use version 367 of the NVIDIA driver to get TensorFlow to work:
>> nvidia-graphics-drivers-367_367.57.orig.tar
>> Unfortunately, this package did not install ‘nvcc’. Other attempts to install CUDA installed version 375 of the driver, which doesn’t work on the g2.2xlarge instance.
>> I’m trying this now:
>> cuda_7.5.18_linux.run
>> from:
>> https://developer.nvidia.com/cuda-75-downloads-archive
I was told CUDA 8.0 doesn’t work with the g2.2xlarge AWS machine.
I tried running ‘lda2vec’ and Chainer on the public AMI:
NVIDIA DIGITS 4.0 on Ubuntu 14.04-6250ff83-f573-4050-80fc-413ac84d044d-ami-94054383.3
ami-d5ce80b5
This works. Chainer can’t detect cuDNN 6.5, which I installed (cudnn-6.5-linux-x64-v2.tar),
but lda2vec appears to run properly.
In [3]: %load lda2vec_run.py
In [4]:     for s, a, f in utils.chunks(batchsize, story_id, author_id, flattened):
   ...:         t0 = time.time()
   ...:         optimizer.zero_grads()
   ...:         l = model.fit_partial(s.copy(), a.copy(), f.copy())
   ...:         prior = model.prior()
   ...:         loss = prior * fraction
   ...:         loss.backward()
   ...:         optimizer.update()
   ...:         msg = ("J:{j:05d} E:{epoch:05d} L:{loss:1.3e} "
   ...:                "P:{prior:1.3e} R:{rate:1.3e}")
   ...:         prior.to_cpu()
   ...:         loss.to_cpu()
   ...:         t1 = time.time()
   ...:         dt = t1 - t0
   ...:         rate = batchsize / dt
   ...:         logs = dict(loss=float(l), epoch=epoch, j=j,
   ...:                     prior=float(prior.data), rate=rate)
   ...:         print msg.format(**logs)
   ...:         j += 1
   ...:     serializers.save_hdf5("lda2vec.hdf5", model)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-baa2abd917df> in <module>()
     21 gpu_id = int(os.getenv('CUDA_GPU', 0))
---> 22 cuda.get_device(gpu_id).use()
     23 print "Using GPU " + str(gpu_id)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/chainer-2.0.0b1-py2.7.egg/chainer/cuda.pyc in get_device(*args)
    168     for arg in args:
    169         if type(arg) in _integer_types:
--> 170             check_cuda_available()
    171             return Device(arg)
    172         if isinstance(arg, ndarray):
/home/ubuntu/anaconda2/lib/python2.7/site-packages/chainer-2.0.0b1-py2.7.egg/chainer/cuda.pyc in check_cuda_available()
     73                '(see https://github.com/pfnet/chainer#installation).')
     74         msg += str(_resolution_error)
---> 75         raise RuntimeError(msg)
     76     if (not cudnn_enabled and
     77             not _cudnn_disabled_by_user and
RuntimeError: CUDA environment is not correctly set up
(see https://github.com/pfnet/chainer#installation).No module named cupy
Still not sure how to configure my AMI.
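The last line of the traceback, "No module named cupy", points at a missing Python package rather than at CUDA itself. As far as I know, CuPy became a separate package starting with Chainer v2, so it probably has to be installed on its own (pip install cupy) once nvcc is available. A minimal Python 3 sketch for checking what the interpreter can see follows. `missing_modules` is a hypothetical helper, and under Python 2.7 a try/except ImportError would serve the same purpose.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # chainer and cupy are the two packages involved in the traceback above.
    print(missing_modules(["chainer", "cupy"]))
```

Run it with the same interpreter that IPython uses (here the Anaconda one), since a package installed for another Python will still show up as missing.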
 On Apr 11, 2017, at 1:46 AM, Yuya Unno wrote:
 How about using the latest CUDA 8.0? I recommend using the deb file instead of the run file; the run installer seems unstable to me.
I’m running on AWS EC2 g2.2xlarge spot instances.
Does Chainer support checkpointing, so that I can resume training my models from the point where the spot instance was preempted?
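Chainer's serializers can write and restore model and optimizer state: the script in this thread already calls serializers.save_hdf5, and there is a matching serializers.load_hdf5. Resuming after a preemption therefore mostly comes down to saving often and writing atomically, so that an interruption mid-write does not corrupt the file. Below is a minimal sketch for the loop counters only, using just the standard library. The function names, the file name, and the state keys are my own choices, not part of Chainer.

```python
import json
import os
import tempfile


def save_progress(path, state):
    """Write state to path atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_progress(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)
```

At startup, something like `state = load_progress("progress.json", {"epoch": 0, "j": 0})` gives the counters to resume from. Calling `save_progress` right after each model save keeps the two in step. The checkpoint has to live on storage that survives the instance, such as an attached EBS volume, for any of this to help.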