Improve read publisher cancel handling to avoid connections in CLOSE_WAIT state with WebSocket on Tomcat  #30393

Closed
@philsttr


There are a couple of cases where using WebSockets with WebFlux on Tomcat can leave connections in a CLOSE_WAIT state after the websocket session is closed. These connections stick around and will eventually cause Tomcat to reach its connection limit (if set). This prevents Tomcat from accepting new connections, so the server becomes unresponsive (except for previously established connections).

When running the same test cases with WebFlux on Netty or Undertow, the connections are closed properly.

I have provided an example project (ws-close-waiting.zip) that shows the cases where the connection gets stuck in CLOSE_WAIT on Tomcat after the websocket session is closed.

The project has three websocket endpoints, each showing a different case (only 2 cases fail). In each case, the server will close the websocket session (but in different ways) after receiving a message from the client.

  1. /closeZip - Calls session.close(...) while processing the input stream. The input and output streams are merged with the zip operator. This case leaves the connection in CLOSE_WAIT on Tomcat.
  2. /closeZipDelayError - Calls session.close(...) while processing the input stream. The input and output streams are merged with the zipDelayError operator. This case properly closes the connection. I included this case for comparison with the first case. I'm not sure what the downsides of using zipDelayError would be though. Advice appreciated.
  3. /exceptionZipDelayError - Propagates an exception on the input stream, but handles that exception with onErrorResume by calling session.close(...). The input and output streams are merged with the zipDelayError operator. This case leaves the connection in CLOSE_WAIT on Tomcat. I included this case to show that the zipDelayError operator will "fix" some cases (2), but not every case.

I have enabled the following logging:

logging.level.org.springframework.http.server.reactive=debug
logging.level._org.springframework.http.server.reactive.AbstractListenerReadPublisher=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteProcessor=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor=trace
logging.level._org.springframework.http.server.reactive=trace
logging.level.reactor.netty=debug
logging.level.org.apache.tomcat.websocket=debug

In the failing cases (1 and 3), the read publisher logs a cancel message, and I see the following log lines:

2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [37936546] cancel [READING]
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [37936546] READING -> COMPLETED

In the successful case (2), the read publisher does not log a cancel message. I think the cancelling is the underlying problem. It prevents the server from noticing that the client has closed the connection.
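For readers less familiar with the TCP side, here is a minimal sketch of how a connection ends up in CLOSE_WAIT, using plain java.net sockets rather than Tomcat or WebFlux. The class name HalfCloseDemo and its method are made up for illustration and are not part of the attached project. The client closes first; the server side only learns about it when it reads and gets end-of-stream, and the connection stays in CLOSE_WAIT until the server closes its own socket.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; plain sockets stand in for the Tomcat connection.
class HalfCloseDemo {

    // Returns what the server-side read() reports after the client has closed its end.
    static int readAfterClientClose() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Socket client = new Socket(loopback, server.getLocalPort());
            try (Socket accepted = server.accept()) {
                // The client sends FIN; the server side of the connection enters CLOSE_WAIT.
                client.close();
                // read() blocks until the FIN arrives, then reports end-of-stream.
                // If the server never read again and never closed 'accepted',
                // the connection would remain in CLOSE_WAIT.
                return accepted.getInputStream().read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("server read() after client close: " + readAfterClientClose());
    }
}
```

If my guess above is right, a cancelled read publisher corresponds to the "never reads again" situation in the comment: nothing is left to observe the end-of-stream, so nothing triggers the close on the server side.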

To test each use case, I used netstat to observe connections, and websocat as the websocket client. Specifically...

I started netstat in a loop to observe connections every second...

while true ; do clear; date; sudo netstat -pn | grep 8080; sleep 1; done

Then I used websocat in another terminal as follows:

  1. connect to one of the three websocket endpoints...
    e.g. websocat -v -v ws://localhost:8080/closeZip (or closeZipDelayError or exceptionZipDelayError)
    netstat will show something like...
    Fri Apr 28 01:59:55 PM CEST 2023
    tcp        0      0 127.0.0.1:57316         127.0.0.1:8080          ESTABLISHED 232014/./websocat
    tcp6       0      0 127.0.0.1:8080          127.0.0.1:57316         ESTABLISHED 231835/java
    
  2. type something on the websocat console and press enter. websocat will send what you typed as a text websocket message, and leave the connection open. netstat output remains unchanged.
  3. press CTRL-D on the websocat console to end the input stream. websocat will exit.
    For the successful cases, the connections will disappear from netstat.
    For the failure cases, netstat will show something like...
    Fri Apr 28 02:01:36 PM CEST 2023
    tcp        0      0 127.0.0.1:57316         127.0.0.1:8080          FIN_WAIT2   -
    tcp6       7      0 127.0.0.1:8080          127.0.0.1:57316         CLOSE_WAIT  231835/java
    
    Eventually the old client side connection (the one in FIN_WAIT2) will go away. But the server connection (the one in CLOSE_WAIT) will remain until the server is shut down.
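To check for leaked connections without watching the netstat loop by eye, the output can be filtered programmatically. The following is a rough sketch only: CloseWaitCounter is a made-up helper name, and it assumes the Linux `netstat -pn` column layout shown above (proto, recv-q, send-q, local address, foreign address, state, pid/program).

```java
import java.util.List;

// Hypothetical helper; assumes the Linux `netstat -pn` column layout.
class CloseWaitCounter {

    // Counts TCP lines whose local address ends in ":<port>" and whose state is CLOSE_WAIT.
    static long count(List<String> netstatLines, int port) {
        String suffix = ":" + port;
        return netstatLines.stream()
                .map(line -> line.trim().split("\\s+"))
                // Need at least: proto, recv-q, send-q, local, foreign, state
                .filter(cols -> cols.length >= 6)
                // Skips the date line and any non-TCP rows
                .filter(cols -> cols[0].startsWith("tcp"))
                // Server side only: the local address is the listening port
                .filter(cols -> cols[3].endsWith(suffix))
                .filter(cols -> cols[5].equals("CLOSE_WAIT"))
                .count();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "Fri Apr 28 02:01:36 PM CEST 2023",
                "tcp        0      0 127.0.0.1:57316         127.0.0.1:8080          FIN_WAIT2   -",
                "tcp6       7      0 127.0.0.1:8080          127.0.0.1:57316         CLOSE_WAIT  231835/java");
        System.out.println("CLOSE_WAIT on 8080: " + count(sample, 8080));
    }
}
```

Run against the failure-case output above, this counts only the server-side tcp6 line; the client-side FIN_WAIT2 line is ignored because its local port is not 8080.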

Again, when running WebFlux on Netty or Undertow, the connections always go away in all three cases.

Labels: in: web, type: enhancement
