I'm having trouble starting HPC Cluster Manager and Job Manager locally on the Head Node, getting the following error almost every time (sometimes it works):
Unable to connect to the head node.
The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress)
at System.Net.Sockets.Socket.InternalBind(EndPoint localEP)
at System.Net.Sockets.Socket.BeginConnectEx(EndPoint remoteEP, Boolean flowContext, AsyncCallback callback, Object state)
at System.Net.Sockets.Socket.UnsafeBeginConnect(EndPoint remoteEP, AsyncCallback callback, Object state)
at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
--- End of inner exception stack trace ---
at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
at System.Net.Http.HttpClientHandler.GetResponseCallback(IAsyncResult ar)
--- End of inner exception stack trace ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Hpc.HttpClientExtension.<>c__DisplayClass5_0.<
Hi SvenssonOscar,
This issue could be caused by depletion of the dynamic (ephemeral) TCP port range, i.e. the maximum number of user ports for TCP connections. You can run the following commands to check and increase the range.
netsh int ipv4 show dynamicport tcp
netsh int ipv4 set dynamicport tcp start=10000 num=55536
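To confirm whether the dynamic range is actually being exhausted, a quick PowerShell check like the one below may help. This is only a sketch; adjust $start and $num to whatever the show command reports on your head node.

# count local TCP ports currently bound inside the configured dynamic range
$start = 10000      # first dynamic port (from the netsh output above)
$num   = 55536      # number of dynamic ports
$inUse = @(Get-NetTCPConnection |
    Where-Object { $_.LocalPort -ge $start -and $_.LocalPort -lt ($start + $num) }).Count
"$inUse of $num dynamic ports currently in use"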
Regards,
Yutong Sun
Hi Oscar,
Would you have a chance to run 'netstat -ano' and share the output so we can investigate this further? Could you also check which process is occupying the dynamic ports?
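If PowerShell is easier than working through the raw netstat output, a rough sketch like the following groups the current TCP connections by owning process, which usually makes the heaviest consumer obvious:

# group current TCP connections by owning process and show the top consumers
Get-NetTCPConnection |
    Group-Object -Property OwningProcess |
    Sort-Object -Property Count -Descending |
    Select-Object -First 10 -Property Count,
        @{ Name = 'PID';     Expression = { $_.Name } },
        @{ Name = 'Process'; Expression = { (Get-Process -Id $_.Name -ErrorAction SilentlyContinue).ProcessName } }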
Regards,
Yutong Sun
Hello everyone,
We are running into the very same problem described at the beginning.
Using Windows Server 2016 Standard (14393.4770) and HPC Pack 2016 (5.2.6291.0).
We already changed the port range; it now shows:
netsh int ipv4 show dynamicport tcp
Protocol tcp Dynamic Port Range
---------------------------------
Start Port : 1025
Number of Ports : 40000
With that change it ran longer before the same error behavior occurred again.
It seems to me that the HPC scheduler opens a lot of Winsock-based connections until it exhausts the configured number of ports.
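For example, a quick check like the following (assuming the scheduler runs as a process named HpcScheduler; that name is a guess on my side and may differ) shows how many TCP connections that process currently holds:

# count TCP connections owned by the scheduler process (process name is an assumption)
$ids = @(Get-Process -Name HpcScheduler -ErrorAction SilentlyContinue).Id
@(Get-NetTCPConnection | Where-Object { $ids -contains $_.OwningProcess }).Count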
Is there a way to resolve this without restarting the server?
Many thanks in advance and regards,
Michael
Hi Michael,
There is a known port leak issue in HPC Pack 2016 Update 2. It is fixed in HPC Pack 2016 Update 3 with the latest QFE, so please upgrade the cluster to version 5.3.6450 when possible.
Please check https://github.com/azure/hpcpack for version details.
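If you are not sure which version is currently installed, one rough way to check (just a sketch; it assumes the scheduler runs as a process named HpcScheduler on the head node, so run it from an elevated prompt) is to read the product version from the scheduler binary:

# read the product version from the running scheduler binary
(Get-Item (Get-Process -Name HpcScheduler).Path).VersionInfo.ProductVersion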
Regards,
Yutong Sun
Dear Yutong Sun,
Thank you for your reply. I updated the HPC cluster to Update 3 with the latest QFE, and it has been working fine since then.
Regards,
Michael
Hi SvenssonOscar,
I ran into the same problem with Windows Server 2022 and HPC Pack 2019 Update 1, and I think it was caused by TLS 1.0 being disabled.
I ran "IISCrypto.exe" on the head node, clicked "Best Practices", and rebooted; after that everything was OK.