public void handleTransportError(StompSession session, Throwable exception) {
if (session.isConnected()) {
disconnect()
It would be great if cancelInactivityTasks
was called when handleTransportError
is called because right now in our case this just pollutes the log file. Closed sessions writing every 30 seconds that they didn't receive any heartbeats.
Yes, I will add the DEBUG logging for org.springframework.web, org.springframework.http, and org.springframework.messaging to the ticket. I already have for org.springframework.messaging
ws.messaging.snippet.log
Here you can see fdcb4147-2a7b-5960-40fd-9ad30f33d5b9 session is established and gets heartbeats.
After a while an exception comes (you can't see it in this log but it's the one from above Caused by: java.lang.IllegalStateException: The WebSocket session [3] has been closed and no method (apart from close()) may be called on a closed session
and the session gets disconnected
A new session is created 2826788c-5e28-ca54-b937-8c0d4abecd4d and after 30 seconds a DEBUG DefaultStompSession:127 - Server has gone quiet. Closing connection in session id=fdcb4147-2a7b-5960-40fd-9ad30f33d5b9.
comes from the old (already disconnected) session.
I will come back with logs from the org.springframework.web, org.springframework.http.
LATER EDIT
Before OOM I had these statistics
WebSocketMessageBrokerStats:128 - WebSocketSession[152 current WS(152)-HttpStream(0)-HttpPoll(0), 964 total, 0 closed abnormally (0 connect failure, 0 send limit, 170 transport error)], stompSubProtocol[processed CONNECT(649)-CONNECTED(649)-DISCONNECT(642)], stompBrokerRelay[null], inboundChannel[pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 12297], outboundChannel[pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 3776], sockJsScheduler[pool size = 1, active threads = 1, queued tasks = 169, completed tasks = 37864]
Can you shed some light into what they show ? Especially the sockJsScheduler and if you see something wrong here
One more mention here, I've seen on other systems X closed abnormally. (
2024-01-31 13:03:45 INFO WebSocketMessageBrokerStats:128 - WebSocketSession[406 current WS(406)-HttpStream(0)-HttpPoll(0), 1794 total, 732 closed abnormally (732 connect failure, 0 send limit, 738 transport error)], stompSubProtocol[processed CONNECT(658)-CONNECTED(658)-DISCONNECT(642)], stompBrokerRelay[null], inboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 16524], outboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6154], sockJsScheduler[pool size = 2, active threads = 1, queued tasks = 425, completed tasks = 130457]
Regarding the 'No messages received after' message. Most of them are closed after > 10 minutes. ...
If the websocket session lingers and if enough sessions are open like this, this could very well cause the OOM Error couldn't it ?
Why not have a scheduled job which clears these after 60 seconds (as the method comment says).
Right now I saw this checkSessions() method is only called when a new message comes in. (called from SubProtocolWebSocketHandler.handleMessage() method)
The idea is to use a natural trigger vs managing yet another scheduled task. Thinking about it again, it might make more sense to check when new connections are established instead of when messages arrive. I'll make a change along those lines.
Regarding the environment configuration, I was assured by our OPS team that nothing had changed in that regard.
Nevertheless, the "No server messages received..." indicates clearly WebSocket connectivity issues. WebSocket connections are established, but messages do not start flowing, and STOMP negotiation fails to complete. Maybe this issue happens for some clients only and the proxy issue is on their side if they are external, or perhaps there was some other less obvious or indirect environment or network change (some upgrade somewhere).
Maybe the WSSession is closed but not the StompSession.
That's what I expect, correct. At the level of DefaultStompSession
the isConnected
just means the internal connection
reference hasn't been reset, which would happen when afterConnectionClosed
is called and that's where cancelInactivityTasks
is also called. For a WebSocket connection, this callback should be triggered by the underlying WebSocket client library (Tomcat in this case it seems). For SockJS polling, it's us making sure it is called if polling fails. It doesn't seem to be happening.
WebSocketBrokerStats
should be reasonably straight forward to follow, if you look in the toString()
method and then follow usages of the individual stats objects. The first log message shows 964 sessions in total established over time; currently 152 WebSocket sessions are open, and no HTTP streaming/polling sessions; 170 transport errors where we were notified by the WebSocket session that writing failed (and presumably sessions were closed); 649 sessions successfully negotiated at the STOMP level, and 642 eventually disconnected at the STOMP level, so 7 remain connected at the STOMP level, while 163 remain not successfully negotiated at the STOMP level.
The SockJS protocol requires the server to send heartbeat messages, by default every 25 seconds but it's configurable. When STOMP is used on top, we disable those from StompSubProtocolHandler
by calling sockJsSession.disableHeartbeat()
but only after STOMP is successfully negotiates heartbeats at its level. That negotiation fails for a number of sessions and the queued tasks for SockJS scheduler indicate that that it's sending SockJS heartbeats from the server side, but we know a number of WebSocket sessions are not working well.
I've looked at the logs. I was hoping they would help me to see if there is any issue with SockJS transport fallbacks and the close notification, but I did not see any evidence of falling back such as requests to "/xhr_streaming" or "/xhr" as you had indicated earlier, and likewise no IllegalStateException
nor any exception or stacktrace. Not sure if there are some parts that are maybe missing.
Thank you for your answer and insight
The logs are only from a test system whilst the xhr_streaming and /xhr were from productive systems. It's going to take some time until we rollout a version to those customers but I'll come back with a feedback.
From the log that I've attached spring.20240210.log it is visible that the session is closed at 03:51:43
2024-02-10 03:51:43 TRACE DefaultStompSession:137 - Sending DISCONNECT session=839c0640-30cc-c1e5-f6da-fb46734b9388 2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding STOMP DISCONNECT, headers=null
Afterwards, a new websocket connection is established at 03.51.49 (6 seconds later)
2024-02-10 03:51:49 DEBUG LoggingWebSocketHandlerDecorator:46 - New WebSocketClientSockJsSession[id='ef2e1004f6dd400c84e7f9e3d29fd930, url=https://nhc-demo1.noventi-ora.de/jart/api2/local-jart-proxy] 2024-02-10 03:51:49 DEBUG DefaultStompSession:127 - Connection established in session id=67ee2be0-7bc8-1ada-0617-4be2ef61c85a
and afterwards the old session is sending exceptions
2024-02-10 03:52:04 DEBUG DefaultStompSession:127 - Server has gone quiet. Closing connection in session id=839c0640-30cc-c1e5-f6da-fb46734b9388. 2024-02-10 03:52:19 DEBUG DefaultStompSession:127 - Server has gone quiet. Closing connection in session id=839c0640-30cc-c1e5-f6da-fb46734b9388.
So maybe two things can be done here,
you already mentioned, calling the checkSessions() when a connection is established and not when a message comes in and
cancelling the inactivity tasks when the socket is closed / handleTransportError
is called
"Sending DISCONNECT" is not present in the log file you provided. I do see the following:
2024-02-10 03:51:43 DEBUG WebSocketClientSockJsSession:189 - Closing session with CloseStatus[code=1000, reason=null] in WebSocketClientSockJsSession[id='f3a1ea4c69854b9083c05932f2d99463, url=https://fake-url.com/api2/local-jart-proxy]
2024-02-10 03:51:43 DEBUG NativeWebSocketSession:142 - Closing StandardWebSocketSession[id=7946dcfa-2f4e-f949-dae0-acd4040a5f3a, uri=null]
This indicates something called the close()
method on the WebSocket session as a result of a call to resetConnection
in DefaultStompSession
, but that could have been your handleTransportError
, and doesn't help me to understand more about the scenario and see why there was no afterConnectionClosed
notification. There is no actual transport error visible either, the IllegalStateException
presumably, so not sure what led to the closing from the given log output.
It makes more sense to call this from afterConnectionEstablished as it
relates to the creation of new sessions.
See gh-32195
You are right. The logs I was mentioning are on level TRACE.
I've attached a snippet.
On the test system, every night at 3:50 (aprox) the router disconnects and gets a new IP address.
I've attached the org.springframework.messaging
logs that are on log level TRACE
ws.20240210.log
Okay, so for this:
2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding heartbeat
2024-02-10 03:51:43 TRACE DefaultStompSession:137 - Sending UNSUBSCRIBE {id=[0]} session=839c0640-30cc-c1e5-f6da-fb46734b9388
2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding STOMP UNSUBSCRIBE, headers={id=[0]}
2024-02-10 03:51:43 TRACE DefaultStompSession:137 - Sending DISCONNECT session=839c0640-30cc-c1e5-f6da-fb46734b9388
2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding STOMP DISCONNECT, headers=null
2024-02-10 03:51:43 DEBUG WebSocketClientSockJsSession:189 - Closing session with CloseStatus[code=1000, reason=null] in WebSocketClientSockJsSession[id='f3a1ea4c69854b9083c05932f2d99463, url=https://fake-url.com/api2/local-jart-proxy]
2024-02-10 03:51:43 DEBUG NativeWebSocketSession:142 - Closing StandardWebSocketSession[id=7946dcfa-2f4e-f949-dae0-acd4040a5f3a, uri=null]
Any idea what triggers the unubscribe and disconnect? If it is to match the scenarios described in #32195 (comment), then there should be an IllegalStateException
from the heartbeat failing and a "WriteInactivityTask failure" message, but I don't see that.
That scenario I could only reproduce in debug mode. (by manually setting the state to CLOSED
I saw it in the logs but that's it unfortunately.
I will leave the DEBUG logs in the application and if I see it again I'll check the logs.
Is DEBUG enough for this ?
ConnectionLostException was applies only after the WebSocket library
notified us of a session close. However, read inactivity and heartbeat
send failures are also cases of the connection being lost rather than
closed intentionally.
This commit also ensures resetConnection is called after a heartbeat
write failure, consistent with other places where a transport error
is handled that implies the connection is lost.
See gh-32195
changed the title
OOM error because of lots of “org.apache.tomcat.websocket.server.WsFrameServer” instances
Improve client and server handling of STOMP heartbeat failures
Feb 13, 2024
On the client side I'm using spring boot version 3.2.0. Can I get a test version for this ?
Otherwise I could just get the spring library and manually patch it :)