Neuron Runtime Troubleshooting
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html
This document is relevant for: Inf1, Inf2, Trn1, Trn1n
This document aims to provide more information on how to fix issues you might encounter while using Neuron Runtime 2.x or above. For each issue, we provide an explanation of what happened and what can potentially correct it.
If your issue is not listed below or you have a more nuanced problem, contact us via issues posted to this repo, the AWS Neuron developer forum, or through AWS support.
Neuron Driver installation fails#
aws-neuron-dkms is a driver package which needs to be compiled during installation. The compilation requires kernel headers for the instance’s kernel. uname -r can be used to find the kernel version on the instance. In some cases, the installed kernel headers might be newer than the instance’s kernel itself.
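For example, you can compare the running kernel version against the installed header packages before reinstalling; the commands below are only a sketch and the exact package names depend on your distribution:
$ uname -r
$ # Debian/Ubuntu: list installed kernel header packages
$ apt list --installed 2>/dev/null | grep linux-headers
$ # Amazon Linux / RHEL: list installed kernel-devel and kernel-headers packages
$ yum list installed | grep -E 'kernel-(devel|headers)'
If the header package versions do not match the output of uname -r, install headers for the running kernel as described in the Solution below.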
Please look at the aws-neuron-dkms installation log for a message like the following:
Building for 4.14.193-149.317.amzn2.x86_64
Module build for kernel 4.14.193-149.317.amzn2.x86_64 was skipped since the
kernel headers for this kernel does not seem to be installed.
If the installation log is not available, check whether the module is loaded:
$ lsmod | grep neuron
If the above has no output, the aws-neuron-dkms installation has failed.
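You can also ask dkms directly whether the module was built and installed for the running kernel (a quick check, assuming the dkms tool is available; the module may be listed as aws-neuron or aws-neuronx depending on the package generation):
$ dkms status | grep -i neuron
If no entry for your current kernel version is reported as installed, proceed with the Solution below.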
Solution#
1. Stop all applications using the NeuronCores.
2. Uninstall aws-neuron-dkms: sudo apt remove aws-neuron-dkms or sudo yum remove aws-neuron-dkms
3. Install kernel headers for the current kernel: sudo apt install -y linux-headers-$(uname -r) or sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
4. Install aws-neuron-dkms: sudo apt install aws-neuron-dkms or sudo yum install aws-neuron-dkms
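On an Ubuntu-based instance, the full sequence looks roughly like the following sketch (use the yum equivalents above on Amazon Linux):
$ # stop all applications using the NeuronCores first, then:
$ sudo apt remove -y aws-neuron-dkms
$ sudo apt install -y linux-headers-$(uname -r)
$ sudo apt install -y aws-neuron-dkms
$ lsmod | grep neuron    # should now show the neuron module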
Application fails to start#
Neuron Runtime requires the Neuron Driver (aws-neuron-dkms package) to access Neuron devices. If the driver is not installed, Neuron Runtime won't be able to access the Neuron devices and will fail with an error message in the console and syslog.
If aws-neuron-dkms is not installed, the error message will look like the following:
2021-Aug-11 18:38:27.0917 13713:13713 ERROR NRT:nrt_init Unable to determine Neuron Driver version. Please check aws-neuron-dkms package is installed.
If aws-neuron-dkms is installed but does not support the latest runtime, the error message will look like the following:
2021-Aug-11 19:18:21.0661 24616:24616 ERROR NRT:nrt_init This runtime requires Neuron Driver version 2.0 or greater. Please upgrade aws-neuron-dkms package.
When using any supported framework from Neuron SDK version 2.5.0 and Neuron Driver (aws-neuron-dkms) versions 2.4 or older, Neuron Runtime will return the following error message:
2022-Dec-01 09:34:12.0559 138:138 ERROR HAL:aws_hal_tpb_pooling_write_profile failed programming the engine
Solution#
Please follow the installation steps in the Setup Guide to install aws-neuronx-dkms.
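After installation, a quick way to confirm that the driver is present (keeping in mind the modinfo caveat described in the next section) is:
$ modinfo neuron | grep ^version
$ lsmod | grep neuron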
This Neuron Runtime (compatibility id: X) is not compatible with the installed aws-neuron-dkms package#
This error is caused by incompatibility between the Neuron Driver (dkms package) and the Runtime Library (runtime-lib package). The driver remains backwards compatible with older versions of Neuron Runtime, but newer versions of the Runtime might rely on the functionality that is only provided by a newer driver. In that case, an update to the newer driver is required.
In some cases the compatibility error persists even after the driver has been updated. This happens when the update process fails to reload the driver at the end of the update. Note that $ modinfo neuron will misleadingly show the new version, because modinfo reads the version information from the neuron.ko file, which has been successfully replaced.
Reload failure happens because one of the processes is still using Neuron Devices and thus the driver cannot be reloaded.
Solution#
Check for any process that is still using the Neuron driver by running lsmod:
ubuntu@ip-10-1-200-50:~$ lsmod | grep neuron
neuron 237568 0
ubuntu@ip-10-1-200-50:~$
The “Used by” counter (the second number) should be 0. If it is not, there is still a running process that is using Neuron. Terminate that process and then either reload the driver manually:
$ sudo rmmod neuron
$ sudo modprobe neuron
or simply rerun the installation one more time. The driver logs its version in dmesg:
$ sudo dmesg
[21531.105295] Neuron Driver Started with Version:2.9.4.0-8a6fdf292607dccc3b7059ebbe2fb24c60dfc7c4
A common culprit is a Jupyter process. If you are using Jupyter on the instance, make sure to terminate Jupyter process before updating the driver.
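If it is not obvious which process is still holding the Neuron devices, one way to find it (assuming the devices are exposed as /dev/neuron* on your instance) is:
$ sudo lsof /dev/neuron* 2>/dev/null
$ neuron-ls    # if aws-neuronx-tools is installed, recent versions also list the owning PIDs
Once the offending process has been terminated, reload the driver as shown above or rerun the installation.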
Neuron Core is in use#
A NeuronCore can't be shared between two applications. If an application has started using a NeuronCore, all other applications trying to use that NeuronCore will fail during runtime initialization with the following message in the console and in syslog:
2021-Aug-27 23:22:12.0323 28078:28078 ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:nc1-nc1 Available:0
Solution#
Terminate any other processes that are using NeuronCore devices and then try launching the application again. If you are using Jupyter, ensure that you only have a single Jupyter kernel attempting to access the NeuronCores by restarting or shutting down any other kernels, which will release any NeuronCores that might be in use.
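If you intentionally want to run several applications side by side, one option is to give each process its own, non-overlapping set of NeuronCores through the NEURON_RT_VISIBLE_CORES environment variable; in this sketch app1.py and app2.py are placeholder applications:
$ NEURON_RT_VISIBLE_CORES=0-1 python app1.py &    # first process uses cores 0 and 1
$ NEURON_RT_VISIBLE_CORES=2-3 python app2.py &    # second process uses cores 2 and 3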
Unsupported NEFF Version#
While loading a model (NEFF), Neuron Runtime checks the version compatibility. If the NEFF version is incompatible with the Runtime, the model load fails with the following error message:
NEFF version mismatch supported: 1.1 received: 2.0
Solution#
Use compatible versions of Neuron Compiler and Runtime. Updating to the
latest version of both Neuron Compiler and Neuron Runtime is the
simplest solution. If updating one of the two is not an option, please refer to the neuron-runtime-release-notes of the Neuron Runtime to determine NEFF version support.
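To see which compiler and runtime versions are currently installed, something like the following can help (assuming a Python environment with the Neuron pip packages and a Debian-based OS; use yum on Amazon Linux):
$ pip list 2>/dev/null | grep -i neuron           # neuronx-cc / neuron-cc, torch-neuronx, ...
$ apt list --installed 2>/dev/null | grep neuron  # aws-neuronx-runtime-lib, aws-neuronx-dkms, ...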
Unsupported Hardware Operator Code#
While loading a model (NEFF), Neuron Runtime checks whether the hardware operators are supported. If an operator is unsupported, Neuron Runtime will display the following error messages:
2023-Jul-28 22:23:13.0357 101413:101422 ERROR TDRV:translate_one_pseudo_instr_v2 Unsupported hardware operator code 214 found in neff.
2023-Jul-28 22:23:13.0357 101413:101422 ERROR TDRV:translate_one_pseudo_instr_v2 Please make sure to upgrade to latest aws-neuronx-runtime-lib and aws-neuronx-collective; for detailed installation instructions visit Neuron documentation.
Solution#
Upgrade to the latest Neuron Runtime and Neuron Collectives packages.
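On a Debian-based instance the upgrade is roughly the following (a sketch; use yum on Amazon Linux, and note that exact package names can vary slightly between releases):
$ sudo apt-get update
$ sudo apt-get install -y --only-upgrade aws-neuronx-runtime-lib aws-neuronx-collectives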
Insufficient Memory#
While loading a model (NEFF), Neuron Runtime reserves both device and host memory for storing the weights, ifmaps, and ofmaps of the model. The memory consumption of each model is different. If Neuron Runtime is unable to allocate memory, the model load fails with the following message in syslog:
kernel: [XXXXX] neuron:mc_alloc: device mempool [0:0] total 1073741568 occupied 960539030 needed 1272 available 768
Solution#
As the error is contextual to what’s going on with your instance, the exact next step is unclear. Try unloading some of the loaded models, which will free up device DRAM space. If this is still a problem, moving to a larger Inf1 instance size with additional NeuronCores may help.
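To see how much device memory the currently loaded models occupy, the neuron-top tool from the aws-neuronx-tools package can be used (output format depends on your tools version):
$ neuron-top    # interactive view of per-process device memory and NeuronCore utilization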
Insufficient number of NeuronCores#
The NEFF requires more NeuronCores than are available on the instance.
Check for error messages in syslog similar to:
NRT: 26638:26638 ERROR TDRV:db_vtpb_get_mla_and_tpb Could not find VNC id n
NRT: 26638:26638 ERROR NMGR:dlr_kelf_stage Failed to create shared io
NRT: 26638:26638 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-a.json to NeuronCore
NRT: 26638:26638 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: xxxxxxx, err: 2
Solution#
The NeuronCores may be in use by models you are not actively using.
Ensure you’ve unloaded models you’re not using and terminated unused applications.
If this is still a problem, moving to a larger Inf1 instance
size with additional NeuronCores may help.
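neuron-ls from the aws-neuronx-tools package shows the Neuron devices on the instance and, in recent versions, the processes currently holding them, which makes it easier to spot forgotten applications:
$ neuron-ls    # lists Neuron devices/NeuronCores and the PIDs using them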
Numerical Error#
Neuron Devices will detect any NaN generated during execution and report it. If Neuron Runtime sees that NaNs were generated, it fails the execution request with a Numerical Error and the following message:
nrtd[nnnnn]: .... Error notifications found on NC .... INFER_ERROR_SUBTYPE_NUMERICAL
Solution#
This is usually an indication of either an error in the model or an error in the input.
Report the issue to Neuron by posting the relevant details on GitHub issues.
RuntimeError: module compiled against API version 0xf but this version of numpy is 0xe#
This usually means that the numpy version used during compilation is different from the one used when executing the model. As of Neuron SDK release 2.15, the numpy versions supported in the Neuron SDK are: numpy>=1.22.2,<=1.25.2. Check and confirm that the right numpy version is installed, then re-compile and execute the model.
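A minimal way to bring the environment into the supported range, assuming pip manages numpy in your environment:
$ pip install "numpy>=1.22.2,<=1.25.2"
$ python -c "import numpy; print(numpy.__version__)"    # confirm the installed version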
Failure to initialize Neuron#
nd0 nc0 Timestamp program stop timeout (1000 ms)
nd0 nc0 Error while waiting for timestamp program to end on TPB eng 0
nd0 nc0 Failed to stop neuron core
nd0 nc0 Failed to end timestamp sync programs
TDRV not initialized
Failed to initialize devices, error:5
A previously executed application left the Neuron devices in a running state. To recover, reset the Neuron devices by reloading the Neuron Driver. Note that this is a temporary workaround; future versions of Neuron will reset running devices automatically.
sudo rmmod neuron; sudo modprobe neuron
An application is trying to use more cores than are available on the instance#
Could not open the nd1
Use a properly sized instance: trn1.32xlarge has 32 NeuronCores, trn1.2xlarge has 2 NeuronCores.