Description
There are a couple of cases where using WebSockets with WebFlux on Tomcat can leave connections in a CLOSE_WAIT state after the WebSocket session is closed. These connections stick around and will eventually cause Tomcat to reach its connection limit (if set). This prevents Tomcat from accepting new connections, so the server becomes unresponsive (except for previously established connections).
When running the same test cases with WebFlux on Netty or Undertow, the connections are closed properly.
I have provided an example project (ws-close-waiting.zip) that shows the cases where the connection gets stuck in CLOSE_WAIT on Tomcat after the WebSocket session is closed.
The project has three WebSocket endpoints, each showing a different case (two of the three fail). In each case, the server closes the WebSocket session (in a different way for each endpoint) after receiving a message from the client.
1. /closeZip - Calls session.close(...) while processing the input stream. The input and output streams are merged with the zip operator. This case leaves the connection in CLOSE_WAIT on Tomcat.
2. /closeZipDelayError - Calls session.close(...) while processing the input stream. The input and output streams are merged with the zipDelayError operator. This case properly closes the connection. I included this case for comparison with the first case. I'm not sure what the downsides of using zipDelayError would be, though. Advice appreciated.
3. /exceptionZipDelayError - Propagates an exception on the input stream, but handles that exception with onErrorResume by calling session.close(...). The input and output streams are merged with the zipDelayError operator. This case leaves the connection in CLOSE_WAIT on Tomcat. I included this case to show that the zipDelayError operator will "fix" some cases (2), but not every case.
I have enabled the following logging:
logging.level.org.springframework.http.server.reactive=debug
logging.level._org.springframework.http.server.reactive.AbstractListenerReadPublisher=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteProcessor=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor=trace
logging.level._org.springframework.http.server.reactive=trace
logging.level.reactor.netty=debug
logging.level.org.apache.tomcat.websocket=debug
In the failing cases (1 and 3), the read publisher logs a cancel message, and I see the following log lines:
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] cancel [READING]
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] READING -> COMPLETED
In the successful case (2), the read publisher does not log a cancel message. I think the cancelling is the underlying problem. It prevents the server from noticing that the client has closed the connection.
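To check a longer trace log for sessions that hit the cancel path, the cancel lines can be picked out mechanically. The following is only a minimal sketch: the class name CancelScan and the method cancelledIds are hypothetical, and the regular expression assumes the log layout matches the two sample lines above (a bracketed id followed by "cancel [STATE]").

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the example project.
public class CancelScan {
    // Assumed layout, taken from the sample lines in this report:
    // "... AbstractListenerReadPublisher : [37936546] cancel [READING]"
    private static final Pattern CANCEL = Pattern.compile(
            "AbstractListenerReadPublisher\\s*:\\s*\\[([^\\]]+)\\] cancel \\[");

    /** Returns the bracketed ids of read publishers that logged a cancel, in first-seen order. */
    static Set<String> cancelledIds(List<String> logLines) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = CANCEL.matcher(line);
            if (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, scan that log file; otherwise scan the sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Path.of(args[0]))
                : List.of(
                    "2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] cancel [READING]",
                    "2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] READING -> COMPLETED");
        System.out.println(cancelledIds(lines));
    }
}
```

Run without arguments, it prints the id from the sample lines; an empty result for a session would correspond to the successful case (2).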
To test each use case, I used netstat to observe connections, and websocat as the websocket client. Specifically...
I started netstat in a loop to observe connections every second...
while true ; do clear; date; sudo netstat -pn | grep 8080; sleep 1; done
Then I used websocat in another terminal as follows:
- connect to one of the three websocket endpoints, e.g. websocat -v -v ws://localhost:8080/closeZip (or closeZipDelayError or exceptionZipDelayError)
netstat will show something like...
Fri Apr 28 01:59:55 PM CEST 2023
tcp 0 0 127.0.0.1:57316 127.0.0.1:8080 ESTABLISHED 232014/./websocat
tcp6 0 0 127.0.0.1:8080 127.0.0.1:57316 ESTABLISHED 231835/java
- type something on the websocat console and press enter. websocat will send what you typed as a text websocket message and leave the connection open. The netstat output remains unchanged.
- press CTRL-D on the websocat console to end the input stream. websocat will exit.
For the successful cases, the connections will disappear from netstat.
For the failure cases, netstat will show something like...
Fri Apr 28 02:01:36 PM CEST 2023
tcp 0 0 127.0.0.1:57316 127.0.0.1:8080 FIN_WAIT2 -
tcp6 7 0 127.0.0.1:8080 127.0.0.1:57316 CLOSE_WAIT 231835/java
Eventually the old client-side connection (the one in FIN_WAIT2) will go away. But the server connection (the one in CLOSE_WAIT) will remain until the server is shut down.
Again, when running WebFlux on Netty or Undertow, the connections always go away in all three cases.
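For anyone who wants to watch for the leak without eyeballing netstat, the check can be scripted. This is a rough sketch, not part of the example project: the class name ConnectionStates and the method countServerStates are hypothetical, and it assumes the Linux netstat -pn column order shown above (proto, recv-q, send-q, local address, foreign address, state).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the example project.
public class ConnectionStates {
    /**
     * Counts TCP states for sockets whose LOCAL address ends with ":port",
     * i.e. the server side of each connection.
     */
    static Map<String, Integer> countServerStates(List<String> netstatLines, int port) {
        String suffix = ":" + port;
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            String[] f = line.trim().split("\\s+");
            // Assumed columns: proto recv-q send-q local foreign state [pid/program]
            if (f.length < 6 || !f[0].startsWith("tcp")) {
                continue; // skips headers, dates, and non-TCP lines
            }
            if (f[3].endsWith(suffix)) {
                counts.merge(f[5], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Reads netstat output from standard input; the port defaults to 8080.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8080;
        List<String> lines = new BufferedReader(new InputStreamReader(System.in))
                .lines().collect(Collectors.toList());
        System.out.println(countServerStates(lines, port));
    }
}
```

One way to use it, assuming Java 11 or later for single-file launch, is sudo netstat -pn | java ConnectionStates.java 8080. A CLOSE_WAIT count that keeps growing after clients disconnect would indicate the failing cases.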