Error trying to evaluate network after re-training and interrupting after snapshot (docker)

Hi all

I refined my labels and tried to retrain my network last week. I think I forgot to run docker with the gpus argument, so it took about 5 days just to reach 151,000 iterations, and I pulled the plug there. The model data is all there under iteration-1. The evaluation process starts successfully but then runs into the following error:

Analyzing data...
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/deeplabcut/gui/evaluate_network.py", line 270, in evaluate_network
    deeplabcut.evaluate_network(
  File "/usr/local/lib/python3.8/dist-packages/deeplabcut/pose_estimation_tensorflow/core/evaluate.py", line 777, in evaluate_network
    os.path.join(cfg["project_path"], imagename), mode="RGB"
  File "/usr/lib/python3.8/posixpath.py", line 90, in join
    genericpath._check_arg_types('join', a, *p)
  File "/usr/lib/python3.8/genericpath.py", line 152, in _check_arg_types
    raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'tuple'
I am running Ubuntu 20.04 with an RTX 3070 (CUDA version 11.7) and Python 3.8, using the docker container to run DeepLabCut. Cuda compilation tools are version 10.1.
I’m not sure how to go about fixing this. I don’t think I have access to anything beyond /dist-packages/, as that folder appears empty when I navigate to it in Files, so I can’t examine the line of code it is taking issue with. I’m fairly new to Linux, so I wonder whether the files are hidden, as can happen in Windows, or whether the files in that directory are contained within the docker container and I can’t access them either way. I’m a bit out of my depth here.
Any assistance would be appreciated.
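
The TypeError at the bottom of the traceback can be reproduced outside DeepLabCut. The following is a minimal sketch, assuming that imagename reaches the join call as a tuple of path components. The project path and tuple values are made up for illustration, and the thread does not confirm how the tuple arises.

```python
import posixpath  # the traceback shows posixpath.join, which is os.path.join on Linux

# Hypothetical values, for illustration only.
project_path = "/app/DeepLabCut/example-project"
imagename = ("labeled-data", "video1", "img0001.png")

# Passing the tuple as a single argument raises the same TypeError as in the traceback.
try:
    posixpath.join(project_path, imagename)
except TypeError as err:
    message = str(err)
    print(message)

# Unpacking the tuple passes each component as a separate str argument.
fixed = posixpath.join(project_path, *imagename)
print(fixed)
```

The sketch only shows what kind of value triggers this error message. It does not show where in the project data such a value would come from.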

Hi there
That would be the obvious solution, but unfortunately I don’t think that’s the exact problem: the config file is selected through the GUI, and its location matches the project_path in the config.yaml file. Unless there is something in the name of the folder that’s throwing it off?
/app/DeepLabCut/Limb_motion_Expt_test_train1_2-Jack Scott-2023-12-19/config.yaml
Cheers

              
Yup, it’s what @Tim_DLC said. When ruamel serializes the string, it must be treating the space as a delimiter, so the path is read as a tuple of 2 strings.
Perhaps the project name, experimenter name, etc. passed during project creation could be sanitized, with whitespace automatically replaced by underscores. @scoki211, do you want to open an issue with this feature request on GitHub?
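
The sanitizing idea suggested above could look roughly like this. It is only a sketch: sanitize_name is a hypothetical helper and not part of DeepLabCut, and the folder layout (project, experimenter, date joined by hyphens) is inferred from the paths quoted in this thread.

```python
import re


def sanitize_name(name: str) -> str:
    """Replace each run of whitespace with a single underscore (hypothetical helper)."""
    return re.sub(r"\s+", "_", name.strip())


# Names taken from the project path quoted in this thread.
project = sanitize_name("Limb_motion_Expt_test_train1_2")
experimenter = sanitize_name("Jack Scott")

# Assumed folder layout: <project>-<experimenter>-<date>
folder = f"{project}-{experimenter}-2023-12-19"
print(folder)
```

Names that contain no whitespace pass through unchanged, so applying the helper to every name should be harmless.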

Thanks for the suggestions and for taking the time to help me diagnose the issue, @Tim_DLC and @Konrad_Danielewski.
Unfortunately, after correcting the rudimentary error in my folder and file naming, I still get the same error when I go to evaluate the network.
It might be worth noting that when I first evaluated the network, it ran absolutely fine with that white space.
I wonder whether there is an issue with my CUDA driver that is messing the whole thing up, or whether my TensorFlow installation got bungled somewhere between when I first evaluated the network in December and my attempts to evaluate the new iteration now. I know there’s an issue with my CUDA keys when I run sudo apt update. Nvidia said the old keys were a security issue, but when I try to fetch the updated keys from the website address, Nvidia have taken them down and it 404s. Could that be a factor here? Relevant readouts from my terminal commands:
Err:1 file:/var/cudnn-local-repo-ubuntu2004-8.4.1.50  InRelease    
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 15198637E3EC4A60
W: GPG error: file:/var/cudnn-local-repo-ubuntu2004-8.4.1.50  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 15198637E3EC4A60
E: The repository 'file:/var/cudnn-local-repo-ubuntu2004-8.4.1.50  InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
(base) ---- $ wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.0-1_all.deb
--2024-01-30 10:55:28--  https://developer.download.nvidia.com/compute/cuda/repos///cuda-keyring_1.0-1_all.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)| ... |:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-01-30 10:55:29 ERROR 404: Not Found.
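
The request URL in the wget output above contains repos///, which suggests that the shell variables $distro and $arch were empty when the command ran. Below is a small sketch of that expansion. The values ubuntu2004 and x86_64 are assumptions based on the Ubuntu 20.04 repository name in the apt output, and whether the resulting URL still serves that file is not verified here.

```python
from string import Template

# URL pattern copied from the wget command in the post.
URL = Template(
    "https://developer.download.nvidia.com/compute/cuda/repos/"
    "$distro/$arch/cuda-keyring_1.0-1_all.deb"
)


def keyring_url(distro: str = "", arch: str = "") -> str:
    """Expand the pattern the way the shell would; unset variables become empty strings."""
    return URL.substitute(distro=distro, arch=arch)


# With both variables unset, the path collapses to 'repos///', as in the 404 request.
print(keyring_url())

# Assumed values for Ubuntu 20.04 on a 64-bit Intel/AMD machine.
print(keyring_url("ubuntu2004", "x86_64"))
```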
Running  DLC_resnet50_Limb_motion_Expt_test_train1_2Dec19shuffle1_150000  with # of trainingiterations: 150000
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py:1692: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
2024-01-29 21:30:04.104476: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2024-01-29 21:30:04.128082: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2024-01-29 21:30:04.128124: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 81c0358977ed
2024-01-29 21:30:04.128134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: [... removed ...]
2024-01-29 21:30:04.128353: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 515.105.1
2024-01-29 21:30:04.128388: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 515.105.1
2024-01-29 21:30:04.128400: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: [... removed ...]
2024-01-29 21:30:04.128762: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 21:30:04.187968: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2496000000 Hz
Also, I’m not even sure I’ve got the right attributes, as it’s telling me evaluation results already exist with shuffle 1, trainingset index 0. But when I increase the trainingset index from 0 to 1 it tells me:
Exception: ('Please check the trainingsetindex! ', 1, ' should be an integer from 0 .. ', 0)
And when I try to increase the shuffle to 2, it gives me a FileNotFoundError:
FileNotFoundError: [Errno 2] No such file or directory: '/app/DeepLabCut/Limb_motion_Expt_test_train1_2-Jack_Scott-2023-12-19/training-datasets/iteration-1/UnaugmentedDataSet_Limb_motion_Expt_test_train1_2Dec19/Documentation_data-Limb_motion_Expt_test_train1_2_95shuffle2.pickle'
So while I’d be happy to raise a feature request on the github, I think at the present time I’m still focused on diagnosing the issue.
Cheers
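
The trainingsetindex message quoted above is what a bounds check against the TrainingFraction list would produce if that list has a single entry. The following is a sketch under that assumption. The function name pick_training_fraction is hypothetical and this is not DeepLabCut's actual code; only the exception arguments mirror the message in the post.

```python
def pick_training_fraction(training_fractions, trainingsetindex):
    """Return the fraction at trainingsetindex, or raise like the message in the post."""
    if 0 <= trainingsetindex < len(training_fractions):
        return training_fractions[trainingsetindex]
    raise Exception(
        "Please check the trainingsetindex! ",
        trainingsetindex,
        " should be an integer from 0 .. ",
        len(training_fractions) - 1,
    )


# Assumed config: a single training fraction, so index 0 is the only valid choice.
fractions = [0.95]
print(pick_training_fraction(fractions, 0))

try:
    pick_training_fraction(fractions, 1)
except Exception as err:
    error_args = err.args
    print(error_args)
```

With one fraction in the list, raising the index to 1 can only fail. The separate FileNotFoundError for shuffle 2 would be consistent with no training dataset having been created for that shuffle, though the thread does not confirm this.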

              
While your CUDA installation most likely has issues (judging by the errors you posted), they are probably not related to the issue with indexing and config reading.
Could you confirm which training iteration and shuffle are in the dlc-models and have snapshots?
Both have ‘shuffle1’ in the folder name.
iteration-1, which I assume is the original (given the last modified date), has 10 snapshots.
Iteration-1, which I assume is the combined one with retraining, has 12 snapshots.

In the config.yaml?
As far as I know, unless I have accidentally been ham-handed, I have not adjusted anything under the training section, and the iteration is a whole number.
TrainingFraction:
- 0.95
iteration: 1
default_net_type: resnet_50
default_augmenter: default
snapshotindex: -1
batch_size: 8