Topic: Websockets: Server restarts eventually freeze whole server |
patchy
Joined: 10 Dec 2020 Posts: 2 Location: Germany
|
Posted: Fri 18 Dec '20 18:35 Post subject: Websockets: Server restarts eventually freeze whole server |
|
|
Hi,
Short version:
I use httpd on Windows as a reverse proxy for a microservice system. Some services communicate over websockets (more precisely: SignalR). From time to time I have to restart the server so that it reads a new configuration. I observe an increasing number of threads blocked by the SignalR connections. It's only a matter of time until the server freezes completely because no threads are left for other requests.
Details:
I reduced my system as much as possible and ended up with two microservices, A and B. A has a SignalR hub. Both A and B subscribe to the events of this hub, so there should be two connections.
Now the experiment:
1. Start the two microservices: They repeatedly try to connect, but fail. This is expected, because they are configured to connect via the reverse proxy and httpd is not running yet.
2. Start httpd (Windows Service): As expected, both services establish their connection, confirmed by the service logs and mod_status showing 2 connections.
3. Restart httpd: In the real world, I call
httpd.exe -n "ServiceName" -k restart
programmatically. For this experiment, I call it from PowerShell. What happens?
3a. The parent starts a new child and hands over 2 sockets, see error.log on Pastebin (link below)
3b. The parent needs to stop the old child. The old child cannot stop because of the open connections. It waits for a grace period of 30 s, then terminates the 2 threads. My services log that they were disconnected and attempt to reconnect. At this moment, 2 more connections appear in mod_status. However, I don't see any socket handover in error.log.
4. Repeat httpd restart.
4a. The parent starts a new child and hands over 2 sockets, see error.log. It's still 2 sockets, although I saw 4 connections in mod_status in the previous step.
4b. The parent shuts down the old child. This time, there is no grace period, but 18(!) threads that failed to exit are terminated, see error.log. Both services log a disconnect and a reconnect. However, no additional connections appear in mod_status; the count remains 4.
When I keep restarting httpd, most of the time the behaviour matches step 4; the only difference is a changing number of "threads that failed to exit". But sometimes additional connections appear in mod_status. I can't reproduce this on purpose. I suspect a race condition between how fast the old child shuts down, how fast the new one starts, and when my services try to reconnect, but I don't know the httpd source code.
To get my job done, I need to know: What can I do to avoid eventually blocking the server?
Out of curiosity, I would also like to know what exactly happens, how the SignalR connections are handed over to the next child, why the first restart behaves differently from the later restarts, ...
I appreciate any hint, even if it only points to further investigation!
Additional information
Version: 2.4.41
Some config snippets:
Code: | # handy for debugging, not in production
ThreadsPerChild 20
|
Code: | RewriteEngine On
RewriteCond %{HTTP:Upgrade} websocket [NC]
RewriteCond %{HTTP:Connection} upgrade [NC]
RewriteRule "^/my/microservice" "wss://hostname:53728%{REQUEST_URI}" [P]
ProxyPass /my/microservice https://hostname:53728/my/microservice
ProxyPassReverse /my/microservice https://hostname:53728/my/microservice |
Link to error.log on Pastebin: https://pastebin.com/7a7B0bLb
Disclaimer: I posted this on the Apache users mailing list. Since there has been no answer for a week, I dare to cross-post here. |
|
tangent Moderator
Joined: 16 Aug 2020 Posts: 348 Location: UK
|
Posted: Sun 20 Dec '20 21:59 Post subject: |
|
|
A knotty problem indeed, caused no doubt by the fact that websockets are stateful.
As I understand it, a graceful restart of Apache reloads the configuration for future connections only; the updates are not applied to threads owned by existing child processes. Existing connections might carry over, if the sockets can be duplicated and handed off to threads in a successor child.
So even though a child has been instructed to exit, any free child threads will block until the timeout period has elapsed on any existing child connections.
Your log file initially shows this process. Equally, the later socket handover behaviour does rather suggest a race condition as you describe, with the unused child threads ending up orphaned.
However, I note your configuration snippet mixes mod_rewrite with specific mod_proxy directives, and if you read the RewriteRule documentation on the proxy [P] flag, https://httpd.apache.org/docs/current/rewrite/flags.html#flag_p, it does say:
Performance warning
Using this flag triggers the use of mod_proxy, without handling of persistent connections. This means the performance of your proxy will be better if you set it up with ProxyPass or ProxyPassMatch.
This is because this flag triggers the use of the default worker, which does not handle connection pooling/reuse.
Avoid using this flag and prefer those directives, whenever you can.
This states that persistent connections are not handled. So, despite the documentation and other web posts showing mod_rewrite being used to proxy, I'd try removing the mod_rewrite proxy logic and sticking with mod_proxy/mod_proxy_wstunnel directives for websockets, viz:
Code: | ProxyRequests Off
ProxyPreserveHost on
ProxyPass /my/microservice wss://hostname:53728/my/microservice
ProxyPassReverse /my/microservice wss://hostname:53728/my/microservice |
Not sure if this helps. |
|
patchy
Joined: 10 Dec 2020 Posts: 2 Location: Germany
|
Posted: Mon 21 Dec '20 20:02 Post subject: |
|
|
The hint about the [P] flag is interesting indeed.
I tried your config snippet, but I had to modify it a bit in order to connect at all. I guess this is because of connection negotiation: when establishing the connection, the client does not know whether the server speaks websockets and starts with plain HTTP. Using the Upgrade header, server and client agree on switching protocols to websockets (at least this is the case for SignalR). So I must proxy both HTTP and websockets.
I separated proxying for negotiation and actual websocket connection:
Code: | ProxyRequests Off
ProxyPreserveHost on
ProxyPass /ChatHub/negotiate http://localhost:53353/ChatHub/negotiate
ProxyPassReverse /ChatHub/negotiate http://localhost:53353/ChatHub/negotiate
ProxyPass /ChatHub ws://localhost:53353/ChatHub
ProxyPassReverse /ChatHub ws://localhost:53353/ChatHub
|
Still, the behaviour does not change - even without the [P] flag.
Note: In the meantime I tried it with the newest server version and switched from my microservices to the Microsoft chat example (https://github.com/dotnet/AspNetCore.Docs/tree/master/aspnetcore/signalr/dotnet-client/sample), so everyone can reproduce it. |
|
tangent Moderator
Joined: 16 Aug 2020 Posts: 348 Location: UK
|
Posted: Tue 22 Dec '20 23:25 Post subject: |
|
|
Since you're using the WinNT MPM, have you tried disabling accept filters to see if that has any effect on the socket handling?
Code: | AcceptFilter http none
AcceptFilter https none
AcceptFilter ws none
AcceptFilter wss none |
|
|
James Blond Moderator
Joined: 19 Jan 2006 Posts: 7371 Location: Germany, Next to Hamburg
|
|
patch2
Joined: 25 Jul 2023 Posts: 1
|
Posted: Tue 25 Jul '23 10:36 Post subject: |
|
|
tangent wrote: | Since you're using the WinNT MPM, have you tried disabling accept filters to see if that has any effect on the socket handling?
Code: | AcceptFilter http none
AcceptFilter https none
AcceptFilter ws none
AcceptFilter wss none |
|
Hello! I'm a colleague of patchy. Since then we have tried several things to mitigate this issue and have now come to a conclusion. I wanted to report back so that the findings are not lost.
We tested the AcceptFilter approach with a newer version of Apache:
Server Version: Apache/2.4.57 (Win64) OpenSSL/3.1.1
Server MPM: WinNT
Apache Lounge VS17 Server built: May 31 2023 10:48:22
With these settings active, the issue at first seems to be solved when restarting Apache several times using 'httpd.exe -n "ServiceName" -k restart'.
But we ran a repeated test and found that the issue still arises, just later: only after around 50-70 repeated calls of 'httpd.exe -n "ServiceName" -k restart'. We waited 10 seconds between calls.
After around 50-70 calls the threads started getting stuck again and were no longer closed afterwards.
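For anyone who wants to repeat this kind of test, the restart loop could be scripted roughly as follows. This is only a sketch, not the script used above: the function name `repeat_restart` and its parameters are made up for illustration, "ServiceName" is the placeholder from the command quoted in this thread, and the 10-second wait mirrors the test description.

```python
import subprocess
import time

# Restart command as quoted in this thread; "ServiceName" is a placeholder.
RESTART_CMD = ["httpd.exe", "-n", "ServiceName", "-k", "restart"]


def repeat_restart(count, wait_seconds=10, run=subprocess.run, sleep=time.sleep):
    """Issue the restart command `count` times, pausing between calls.

    Returns the list of exit codes so failed calls can be spotted afterwards.
    `run` and `sleep` are injectable (hypothetical convenience) so the loop
    can be dry-run on a machine without httpd installed.
    """
    exit_codes = []
    for i in range(count):
        result = run(RESTART_CMD)
        exit_codes.append(result.returncode)
        # No need to wait after the final restart.
        if i < count - 1:
            sleep(wait_seconds)
    return exit_codes
```

Calling `repeat_restart(70)` would cover the range in which the stuck threads showed up here. The exit codes only say whether the restart command itself succeeded; the thread state still has to be checked in mod_status and error.log.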
James Blond wrote: | Maybe it is a better idea to stop the service and start it instead of doing a graceful restart. |
We are considering this right now to solve the issue. We're introducing a check on the number of idle threads, and if they run low at some point, we will kill all Apache processes and simply start it again.
This will cause a connection hiccup, but the soft restart always causes a connection hiccup as well, so the situation does not get any worse.
We are also considering replacing Apache altogether, because so far we only use its reverse-proxy functionality and have no plans to use more of its features. |
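The idle-thread check described above could look roughly like this. It is a sketch under assumptions, not the check actually deployed: it assumes mod_status is enabled and reachable at the hypothetical URL below, reads the `IdleWorkers:` line from the machine-readable `?auto` output, and uses an arbitrary threshold. The function names are invented for illustration.

```python
import re
import urllib.request

# Assumed values: adjust the URL to wherever mod_status is exposed.
STATUS_URL = "http://localhost/server-status?auto"
MIN_IDLE = 5  # hypothetical threshold, to be tuned against ThreadsPerChild


def parse_idle_workers(status_text):
    """Return the IdleWorkers count from mod_status '?auto' output, or None."""
    match = re.search(r"^IdleWorkers:\s*(\d+)\s*$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_hard_restart(status_text, min_idle=MIN_IDLE):
    """True if the idle worker count is known and below the threshold."""
    idle = parse_idle_workers(status_text)
    # An unparseable status page counts as "unknown", not as a restart trigger.
    return idle is not None and idle < min_idle


def fetch_status(url=STATUS_URL, timeout=5):
    """Fetch the status page; raises on timeout or connection failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii", errors="replace")
```

A watchdog would call `needs_hard_restart(fetch_status())` periodically and trigger the stop/start when it returns True. One caveat: if every thread is already blocked, the status request itself may time out, so a timeout in `fetch_status` probably has to be treated as a warning sign as well.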
|