
How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

    Currently, when I use ray.init to connect a client to the head node of a two-machine cluster, I get this error message:

    ConnectionError: Request can’t be sent because the Ray client has already been disconnected due to an error. Last exception: <_MultiThreadedRendezvous of RPC that terminated with:

    status = StatusCode.NOT_FOUND
    details = “Attempted to reconnect a session that has already been cleaned up”
    debug_error_string = “{“created”:” @1668701483.254000000 ",“description”:“Error received from peer ipv4::”,“file”:“src/code/lib/surfaces/call.cc”,“file_line”:1075,“grpc_message”:“Attempted to reconnect a session that has already been cleaned up”,“grpc_status”:5}"

    When I run only the head node standalone (without any other cluster machines connected), I do not get this error. It only happens when I try to connect other machines to the head node. I am running on Windows. Happy to provide more info! I'm just frustrated because I’ve been stuck on this for a while.

    Hi @Albert
    I’m happy to help here. Do you mind showing some logs in your head node?
    The logs should be in /tmp/ray/session_latest/logs/ by default.

    Could you check the gcs_server, ray_client_server, and raylet logs to see whether there is anything abnormal?
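    One way to triage those three log families quickly is a small script like the sketch below. It is only an illustration: the helper name `scan_logs` and the marker strings are assumptions made for this example, not part of Ray, and the log directory is the default quoted above (it may differ on Windows).

```python
from pathlib import Path

# Substrings that often flag a problem in a log line.
# This list is an assumption for illustration, not an official Ray convention.
SUSPICIOUS = ("ERROR", "Traceback", "Failed")


def scan_logs(log_dir, names=("gcs_server", "ray_client_server", "raylet")):
    """Return {file name: [(line number, text), ...]} for suspicious lines.

    Only files whose name starts with one of `names` are read.
    """
    hits = {}
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or not path.name.startswith(tuple(names)):
            continue
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if any(marker in line for marker in SUSPICIOUS):
                    hits.setdefault(path.name, []).append((lineno, line.rstrip()))
    return hits


# Example use on the head node (path taken from the reply above):
# for name, lines in scan_logs("/tmp/ray/session_latest/logs").items():
#     for lineno, text in lines:
#         print(f"{name}:{lineno}: {text}")
```

    The script only narrows down where to look; the surrounding lines in each file still need to be read by hand.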

    In ray_client_server.err

    INFO proxier.py:670 – New data connection from client [xxxx]
    INFO proxier.py:340 – SpecificServer started on port: 23000 with PID: 8292 for client [xxxx]
    ERROR proxier.py:723 – Proxying Datapath failed!
    Traceback (most recent call last):
    File “… ray\util\client\server\proxier.py”, line 716, in Datapath
    for rep in rep_stream:
    File “…grpc_channel.py”, line 426, in next
    return self._next()
    File “…grpc_channel.py”, line 826, in _next
    raise self
    grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = “Stream removed”
    debug_error_string = “{“created”:” @16688818856.548000000 ",“description”:“Error received from peer ipv4:127.0.0.1:23000”,“file”:“src/core/lib/surface/call.cc”,“file_line”:1075,“grpc_message”:“Stream removed”,“grpc_status”:2}"

    ERROR proxier.py:797 – Proxying Logstream failed!
    Traceback (most recent call last):
    File “… ray\util\client\server\proxier.py”, line 794, in Logstream
    for rep in rep_stream:
    File “…grpc_channel.py”, line 426, in next
    return self._next()
    File “…grpc_channel.py”, line 826, in _next
    raise self
    grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = “Stream removed”
    debug_error_string = “{“created”:” @16688818856.548000000 ",“description”:“Error received from peer ipv4:127.0.0.1:23000”,“file”:“src/core/lib/surface/call.cc”,“file_line”:1075,“grpc_message”:“Stream removed”,“grpc_status”:2}"

    INFO proxier.py:390 – Specific server [xxxxx] is no longer running, freeing its port 23000
    INFO proxier.py:742 – [xxxxx] last started stream at 1668818843.641747. Current stream started at 1668818843.641747.

    In gcs_server.out… I get the following repeated error messages:

    [datetime] (gcs_server.exe) gcs_server.cc:285: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

    I did not see anything weird in raylet.out
    Let me know if I can provide any additional information.

    Hello, I am seeing the same error after upgrading our Ray image to versions later than 2.3.1 (I tried KubeRay 0.5.0 and 1.1.0, with Ray 2.9, 2.11, and 2.12). I know the docs say this error indicates that the Ray head recently restarted, but my head node has 0 restarts and I’m still seeing it. Any ideas?

    2024-04-30 11:51:36.557 | INFO | am_analytics.utils.ray_config:ray_init:13 - Using existing cluster: ray://ray-kuberay-head-svc..svc.cluster.local:10001

    2024-04-30 11:51:36,591 INFO client_builder.py:244 – Passing the following kwargs to ray.init() on the server: logging_level

    2024-04-30 11:51:36,629 DEBUG worker.py:378 – client gRPC channel state change: ChannelConnectivity.IDLE

    2024-04-30 11:51:36,831 DEBUG worker.py:378 – client gRPC channel state change: ChannelConnectivity.CONNECTING

    2024-04-30 11:51:36,836 DEBUG worker.py:378 – client gRPC channel state change: ChannelConnectivity.READY

    2024-04-30 11:51:36,837 DEBUG worker.py:818 – Pinging server.

    SIGTERM handler is not set because current thread is not the main thread.

    2024-04-30 11:52:19,358 DEBUG dataclient.py:333 – Recoverable error in data channel.

    2024-04-30 11:52:19,358 DEBUG dataclient.py:334 – <_MultiThreadedRendezvous of RPC that terminated with:

    status = StatusCode.UNAVAILABLE

    details = “Socket closed”

    debug_error_string = “UNKNOWN:Error received from peer {created_time:“2024-04-30T11:52:19.358331754+00:00”, grpc_status:14, grpc_message:“Socket closed”}”

    2024-04-30 11:52:19,359 DEBUG worker.py:818 – Pinging server.

    2024-04-30 11:52:19,361 ERROR dataclient.py:330 – Unrecoverable error in data channel.

    2024-04-30 11:52:19,361 DEBUG dataclient.py:331 – <_MultiThreadedRendezvous of RPC that terminated with:

    status = StatusCode.NOT_FOUND

    details = “Attempted to reconnect a session that has already been cleaned up”

    debug_error_string = “UNKNOWN:Error received from peer {created_time:“2024-04-30T11:52:19.360925106+00:00”, grpc_status:5, grpc_message:“Attempted to reconnect a session that has already been cleaned up”}”

    2024-04-30 11:52:19,361 DEBUG dataclient.py:285 – Shutting down data channel.
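    To read a client debug log like the one above, it can help to pull out each gRPC status together with the timestamp of the log line that precedes it. The sketch below is a hypothetical helper (`status_timeline` is not part of Ray) and assumes the `YYYY-MM-DD HH:MM:SS,mmm` timestamp format shown in these lines.

```python
import re
from datetime import datetime

# Timestamp at the start of a log line, e.g. "2024-04-30 11:52:19,358 ".
TS = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\s")
# gRPC status as printed inside the _MultiThreadedRendezvous repr.
STATUS = re.compile(r"status = StatusCode\.(\w+)")


def status_timeline(log_text):
    """Pair each gRPC status with the most recent timestamp seen before it."""
    timeline = []
    last_ts = None
    for line in log_text.splitlines():
        ts_match = TS.match(line)
        if ts_match:
            last_ts = datetime.strptime(
                ts_match.group(1), "%Y-%m-%d %H:%M:%S"
            ).replace(microsecond=int(ts_match.group(2)) * 1000)
        status_match = STATUS.search(line)
        if status_match:
            timeline.append((last_ts, status_match.group(1)))
    return timeline


# Lines copied from the log in this post.
SAMPLE = """\
2024-04-30 11:52:19,358 DEBUG dataclient.py:334 – <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
2024-04-30 11:52:19,361 DEBUG dataclient.py:331 – <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.NOT_FOUND
"""

events = status_timeline(SAMPLE)
for ts, status in events:
    print(ts.time(), status)
print("gap:", events[1][0] - events[0][0])
```

    On these lines, NOT_FOUND follows UNAVAILABLE by 3 ms, so the first reconnect attempt after the socket closed was rejected straight away.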