My program ran for about a day and then failed with error code 700 and the error message below. Can I trust the stack trace, i.e. does the program really break at exactly the location it reports?

    step = 51, energy/site = -0.28279876708984375, fidelity = 0.3100615535536275
    step = 52, energy/site = -0.2982025146484375, fidelity = 0.31670590995681797
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
    ERROR: LoadError: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)
    Stacktrace:
     [1] macro expansion at /home/liujinguo/.julia/packages/CUDAdrv/y9e4P/src/base.jl:147 [inlined]
     [2] #download!#11(::Bool, ::Function, ::Ptr{Float32}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at /home/liujinguo/.julia/packages/CUDAdrv/y9e4P/src/memory.jl:254
     [3] download! at /home/liujinguo/.julia/packages/CUDAdrv/y9e4P/src/memory.jl:248 [inlined] (repeats 2 times)
     [4] unsafe_copyto! at /home/liujinguo/.julia/dev/CuArrays/src/array.jl:127 [inlined]
     [5] copyto!(::Array{Float32,2}, ::CuArray{Float32,2}) at /home/liujinguo/.julia/dev/GPUArrays/src/abstractarray.jl:110
     [6] #measure_reset!#16(::Int64, ::Function, ::DefaultRegister{4096,Complex{Float32},CuArray{Complex{Float32},2}}) at ./array.jl:497
     [7] #measure_reset! at ./none:0 [inlined]
     [8] #measure_reset!#29 at /home/liujinguo/.julia/dev/Yao/src/Registers/measure.jl:80 [inlined]
     [9] (::getfield(Yao.Registers, Symbol("#kw##measure_reset!")))(::NamedTuple{(:val,),Tuple{Int64}}, ::typeof(measure_reset!), ::DefaultRegister{4096,Complex{Float32},CuArray{Complex{Float32},2}}, ::Int64) at ./none:0
     [10] energy(::QuantumMPS{DefaultRegister{4096,Complex{Float32},CuArray{Complex{Float32},2}}}, ::Yao.Blocks.YGate{Complex{Float32}}, ::Heisenberg{2}) at /home/liujinguo/jcode/QuantumMPS/src/heisenberg.jl:114
     [11] energy at /home/liujinguo/jcode/QuantumMPS/src/heisenberg.jl:56 [inlined]
     [12] gradient(::QuantumMPS{DefaultRegister{4096,Complex{Float32},CuArray{Complex{Float32},2}}}, ::Yao.Blocks.QDiff{Yao.Blocks.RotationGate{1,Float32,Yao.Blocks.ZGate{Complex{Float32}}},1,Float32}, ::Heisenberg{2}) at /home/liujinguo/jcode/QuantumMPS/src/gradient.jl:16
     [13] _broadcast_getindex at ./broadcast.jl:582 [inlined]
     [14] getindex at ./broadcast.jl:515 [inlined]
     [15] macro expansion at ./broadcast.jl:846 [inlined]
     [16] macro expansion at ./simdloop.jl:73 [inlined]
     [17] copyto! at ./broadcast.jl:845 [inlined]
     [18] copyto! at ./broadcast.jl:800 [inlined]
     [19] copy at ./broadcast.jl:776 [inlined]

Here is the source code:

  function measure_reset!(reg::GPUReg{B, T}; val=0) where {B, T}
      regm = reg |> rank3
      # outcome probabilities per batch: |amplitude|^2 summed over the second dimension
      pl = dropdims(mapreduce(abs2, +, regm, dims=2), dims=2)
      pl_cpu = pl |> Matrix
      # sample one measurement outcome per batch on the CPU, then move the results back to the GPU
      res_cpu = map(ib->_measure(view(pl_cpu, :, ib), 1)[], 1:B)
      res = CuArray(res_cpu)
      @inline function kernel(regm, res, pl, val)
          state = (blockIdx().x-1) * blockDim().x + threadIdx().x
          if state <= length(regm)
              k, i, j = GPUArrays.gpu_ind2sub(regm, state)
              @inbounds rind = res[j] + 1
              # copy the measured branch into the `val` slot, divided by the square root of its probability
              @inbounds k==val+1 && (regm[k,i,j] = regm[rind,i,j]/CUDAnative.sqrt(pl[rind, j]))
              CuArrays.sync_threads()
              # zero out every other branch
              @inbounds k!=val+1 && (regm[k,i,j] = 0)
          end
          return
      end
      X, Y = cudiv(length(regm))
      @cuda threads=X blocks=Y kernel(regm, res, pl, val)
  end

This function is called millions of times in my program. It seems strange that it suddenly breaks after running for such a long time. Is it possible that my GPU card is unstable?

No, the download! call is synchronizing and catches whatever errors might have happened before it. The stack trace therefore points at where the error was detected, not necessarily at where it occurred.
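
To illustrate that point, here is a minimal sketch, written against the present-day CUDA.jl API rather than the older CUDAdrv/CUDAnative/CuArrays stack used in this thread (the kernel name and sizes are made up for the example). The kernel deliberately writes out of bounds behind @inbounds; the launch itself returns immediately, and the illegal access typically only surfaces at the next synchronizing operation, here the copy back to the host:

    using CUDA

    function bad_kernel!(x)
        i = threadIdx().x
        # @inbounds removes the device-side bounds check, so this out-of-bounds
        # write goes through and may touch an illegal address on the GPU
        @inbounds x[i + 10_000_000] = 1f0
        return
    end

    x = CUDA.zeros(Float32, 8)
    @cuda threads=8 bad_kernel!(x)   # asynchronous launch: no error is raised here
    Array(x)                         # synchronizes: code 700 (ERROR_ILLEGAL_ADDRESS)
                                     # tends to be reported here, far from the kernel

This mirrors the trace above, where the error surfaces inside download!/copyto! even though the fault happened earlier, in the kernel launched by measure_reset!.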

Try removing @inbounds and running with the CUDAnative/CuArrays master branches; they will report exceptions with accurate stack traces. Alternatively, run under cuda-memcheck.
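
For concreteness, a sketch of what the first suggestion buys you, again phrased in terms of the present-day CUDA.jl API (the exact output of the 2019-era master branches differs, but the idea is the same): without @inbounds the bounds check runs inside the kernel itself, so the failure is attributed to the kernel execution rather than surfacing later as a bare illegal-address error in an unrelated download!/copyto! frame.

    using CUDA

    function checked_kernel!(x)
        i = threadIdx().x
        x[i + 10_000_000] = 1f0   # no @inbounds: the device-side bounds check fires here
        return
    end

    x = CUDA.zeros(Float32, 8)
    @cuda threads=8 checked_kernel!(x)
    synchronize()   # the bounds violation is reported as a kernel exception at this
                    # synchronization, instead of as an illegal memory access elsewhere

cuda-memcheck works at a lower level: wrapping the whole process (for example cuda-memcheck julia yourscript.jl, with yourscript.jl standing in for your actual entry point) reports the offending kernel and the kind of invalid access directly, at the cost of a substantial slowdown.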