Ran into something last week that I had never seen before, and it ended up taking 4 days, 9 ConfigMgr engineers, and 4 network engineers to figure out. It involved many error messages I had never encountered and could find no mention of when consulting the Google, so I knew I needed to document this for the future. Welcome, dear reader, to the tale of the broken MECM infrastructure.
The problems begin
First, some context: the site I manage has 10,000 clients, 180 pull distribution points (10 of which are PXE enabled), and the site database is on a SQL HA cluster. We have a production data center (1 SQL server and the site server live there) and a disaster recovery data center in a different city (the 2nd SQL server lives there). The site server and SQL servers are VMware VMs; the pull DPs are physical hardware.
Two weeks ago I started getting reports that PXE wasn’t working at any of my locations. Checking the SMSPXE.log on the servers I could see the requests coming in, with the standard messages about the machine being known/unknown, task sequences available, etc., but the clients would just hang on the starting IPv4 message and eventually time out, with no error messages. It turned out to be very intermittent… sometimes it would work all morning and then be down all afternoon; other times it was only out of commission for an hour and then worked the rest of the day.

We had run into a similar issue previously when switching our AV solution from McAfee to CrowdStrike. The security team had been told there was no need for exclusions because CrowdStrike uses machine learning. This is very much not true, at least as far as ConfigMgr is concerned, and it caused PXE issues, slow collection evaluation, and hardware inventory processing problems. When CrowdStrike was uninstalled from the site server, the problems went away; exclusions were put in place, and the issues resolved. I assumed this was the same thing, uninstalled CrowdStrike from the site server on a Wednesday, and the problem went away for the remainder of that week. I thought it was case closed and notified my security team that they needed to revisit the exclusions.
Not so fast
The following Monday I’m notified that PXE is not working again. I’m seeing the same behavior as the week before, but CrowdStrike is still not back on the server at this point. I get my network team involved and we start taking packet captures: we see DHCP offers coming from our DHCP servers and PXE offers coming from the DPs, but the client never sends a DHCP request to claim the IP lease being offered. The network team says there have been no changes to the network, we haven’t made any changes to MECM, and the server team advises no changes other than applying monthly patches. We roll back the patches, but it makes no difference. I start running through the usual suspects: remove and reinstall PXE, switch from WDS to SCCMPxe and back again, double-check IP helpers, look for any issues with the DHCP servers. I spend the rest of Monday troubleshooting with no resolution.
Tuesday, PXE is still down, but now I’m starting to see other issues… 90% of my clients are showing as offline in the console, CMPivot queries and scripts aren’t working, and software deployments are having problems. I start looking at other logs on the site server and see ************** Wait timed out! **************** being spammed across several of them.
I was also seeing CcmStatistics.log rolling over constantly, repeating this:
Inv: Discovery Data Manager (Trusted) maps to D:\Program Files\Microsoft Configuration Manager\inboxes\auth\ddm.box
Inv: Data Loader (Trusted) maps to D:\Program Files\Microsoft Configuration Manager\inboxes\auth\dataldr.box
Inv: Discovery Data Manager (Registration) maps to D:\Program Files\Microsoft Configuration Manager\inboxes\auth\ddm.box\regreq
Inv: Software Inventory Processor (Site Trusted) maps to D:\Program Files\Microsoft Configuration Manager\inboxes\auth\sinv.box
Inv: Status Manager maps to D:\Program Files\Microsoft Configuration Manager\inboxes\statmgr.box\statmsgs
Inv: Software Metering Processor Usage (Site) maps to D:\Program Files\Microsoft Configuration Manager\inboxes\swmproc.box\usage
Inv: Distribution Manager (Incoming) maps to D:\Program Files\Microsoft Configuration Manager\inboxes\distmgr.box\incoming
Inv: State System (Incoming) maps to D:\Program Files\Microsoft Configuration Manager\inboxes\auth\statesys.box\incoming
Inv: Site Server Inventory Collection maps to D:\Program Files\Microsoft Configuration Manager\inboxes\inventry.box
Inv: SMS_AMT_PROXY_COMPONENT maps to D:\Program Files\Microsoft Configuration Manager\inboxes\amtproxymgr.box
State message out-box 'D:\Program Files\Microsoft Configuration Manager\inboxes\auth\statesys.box\incoming'
Statistic initialization failed 0x87d00217
Data file for report is 'C:\WINDOWS\CCM\Statistic\ReportData'
0x87d00217 translates to ‘Data is corrupt’, but I couldn’t find any issues related to that.
I started thinking my management point was very sick, so I uninstalled/reinstalled the MP role, but I continued to see the Wait timed out! errors repeating in those logs. We’re on CB 2010 and there was a hotfix available, so I decided to install that, knowing it would do a site reset afterwards, but this didn’t help either. I was seeing files build up in CCM\Incoming; at one point it got up to 60,000+. I also saw a backlog in CCM\ServiceData\Messaging\EndpointQueues\MP_RelayEndpoint, which reached 20,000+. Looking in the IIS logs I was seeing lots of 500.0 and 503.2 errors. Checking open connections on the site server with Get-NetTCPConnection, the count was over 20,000, with nearly half of those in TimeWait status.
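In hindsight, the backlog and connection checks I was doing by hand are easy to script so you can watch the trend instead of eyeballing folders. Here’s a rough sketch in Python of how I might do it today; the paths and the netstat-style parsing are my own illustrative assumptions, not anything Microsoft provided:

```python
import os

def inbox_backlog(path):
    """Count files sitting in a ConfigMgr inbox or endpoint queue folder,
    e.g. an inboxes\auth\*.box folder or CCM\Incoming (example paths only)."""
    try:
        return sum(1 for entry in os.scandir(path) if entry.is_file())
    except FileNotFoundError:
        return 0  # folder doesn't exist on this box

def tally_states(netstat_lines):
    """Tally TCP connection states from `netstat -ano`-style output lines,
    so spikes in TIME_WAIT stand out at a glance."""
    counts = {}
    for line in netstat_lines:
        parts = line.split()
        # netstat -ano rows look like: TCP  local  remote  STATE  pid
        if len(parts) >= 4 and parts[0] == "TCP":
            counts[parts[3]] = counts.get(parts[3], 0) + 1
    return counts
```

Run on a schedule and logged, this would have shown the 60,000+ file climb and the TimeWait explosion as a curve over time rather than a nasty surprise.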
Time to reach out to Microsoft
I opened a case with Microsoft and got the first of many CM engineers. He seemed entirely unconcerned with the backlogs, IIS errors, etc., kept wanting to focus on the network and PXE problems, and insisted it had to be a network issue requiring a network engineer. I was on the phone with him for 3 hours awaiting a network engineer that never got assigned to the case. He had me doing things that I thought were entirely unrelated, like updating the ADK on the site server and rebuilding boot WIMs; the process wasn’t even making it far enough for a boot WIM to be involved.
Wednesday morning we got a network engineer assigned and ended up wasting the majority of that day with him insisting there was something on the local LAN preventing the client from getting the DHCP/PXE offers, even though we could clearly see them being broadcast in packet captures. When asked how to troubleshoot such an issue, he didn’t know. We started noticing that it was taking the DP a very long time to send a PXE offer out, sometimes 30-40 seconds… this progressively got worse, sometimes reaching 2-3 minutes later on. However, he kept pointing back to the client not sending a DHCP request to the DHCP server, saying that had to mean something was keeping it from getting the DHCP offers. I asked several times if the client not getting a PXE offer with option 60 in it soon enough could make it keep sending discovers instead of requesting an IP address, and he said no, that wasn’t the case. At 6pm shifts rolled over and we got new CM and network engineers assigned, but we ended up going through more of the same, with them insisting it was a network issue. I finally gave up at 11pm and said we would have to resume in the morning.
Thursday morning we got new CM and network engineers assigned, and they tried to go down the network route again. I insisted that they take into account the error messages in the logs and the inbox backlogs that were happening. This led to collecting and uploading logs several times, with requests for time to review. The CM engineer switched out 3 times and the network engineer twice over the course of the day, and no progress was made.
A possible solution is found
At 6pm shifts rolled over again and we got a new CM engineer, and finally a knowledgeable one who knew what she was doing. I briefed her on the other things I was seeing aside from PXE, and she agreed that it seemed like a site server or management point problem, not the network. We ran through a few things: she had me uninstall/reinstall the MP role again, turn on debug logging, and then run the diagnostics to collect and upload all of the relevant logs for her to review.

She explained that when a client PXE boots, the DP it’s booting from sends a request to the site server to run a stored procedure called LOOKUPDEVICE on the site database to get advertisements, policies, etc. for the client. With debug logging turned on, we should have been able to take the UUID of the client, search for it in the MP_Policy.log on the site server, and find where that happens, but it was never showing up in there.

By this time it was 9pm and I hadn’t eaten dinner, so I requested a break while she reviewed everything. When I came back at 10, she asked to take a look at the SQL servers. These were configured with 8 CPU and 32 GB of RAM each; we had been arguing with our DBA team for quite a while that this was low and we needed more, but they kept insisting everything was fine. She advised we were definitely undersized for the number of clients we had, and also suggested we might need another management point. I advised my server team that per the documentation for a remote SQL server, the recommendation was 16 CPU and 72 GB of RAM, and that the MS engineer believed this might be the cause of all of our issues. They ended up giving each server 22 CPU and 96 GB of RAM. This didn’t seem to make any difference as far as PXE was concerned, but the Wait timed out errors in the various logs on the site server went away. It was around midnight and I was exhausted, so I headed to bed to resume troubleshooting in the morning.
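As an aside, searching a CM log for a client UUID by hand gets old fast. CM logs use the CMTrace entry format, which is easy enough to parse; this is my own quick sketch, not anything the engineer provided, and the regex assumes the usual time/date/component attribute order:

```python
import re

# CMTrace-style entry: <![LOG[message]LOG]!><time="..." date="..." component="..." ...>
LOG_LINE = re.compile(
    r'<!\[LOG\[(?P<message>.*?)\]LOG\]!>'
    r'<time="(?P<time>[^"]*)"\s+date="(?P<date>[^"]*)"\s+component="(?P<component>[^"]*)"',
    re.DOTALL,
)

def find_uuid_entries(log_text, uuid):
    """Return (date, time, component, message) tuples for entries mentioning the UUID."""
    needle = uuid.lower()
    return [
        (m.group("date"), m.group("time"), m.group("component"), m.group("message"))
        for m in LOG_LINE.finditer(log_text)
        if needle in m.group("message").lower()
    ]
```

If nothing comes back while debug logging is on and a client is actively PXE booting, that absence is itself a data point, which is exactly what we were seeing.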
Friday morning we once again got a new CM engineer. PXE still wasn’t working; however, the 500.0 and 503.2 errors in the IIS logs had gone away and my clients started showing up as online again in the console. The CM engineer barely reviewed the details of the case and then said she needed to hand off less than 30 minutes later. During this time, after our DBA team had been notified of the increase in SQL server resources and that MS believed that may have been the root of the problem, they discovered what I think was the likely root cause of this whole mess. At some point, they thought during patching the previous weekend, the SQL cluster had failed over to the DR server. The configuration on this server was set to cap SQL at a percentage of memory that limited it to approximately 20 GB of RAM prior to the resource increase the night before; nobody knows why it was different from the other SQL server. That also meant that network latency across the link to the DR datacenter was affecting SQL reads/writes as well. So in essence, the infrastructure had been slowly strangling itself the whole time.

I also installed a second MP per the recommendation from the previous night, but it didn’t seem to make any difference as far as PXE was concerned, and we were still having some issues with software/patch deployment, scripts, and CMPivot. The rest of Friday was mostly spent awaiting assignment of engineers, as they seemed to be switching out more frequently than usual, and we were told a high influx of cases was delaying things. Towards the end of the day I found that the second MP was building up a backlog of state messages. However, the connection counts on the site server were dropping; I think it was around 10,000 connections with only 1,000 in TimeWait.
Know when to take a step back
After discussing with my boss and advising him I was pretty wiped out after all of the long days, he decided we could put things on hold for the weekend. This was to be a three-day weekend, as the following Monday was a bank holiday. I told him that on Sunday or Monday I would look at building some USB boot media as a stopgap to get new machines out the door in case this went on much longer, and I logged off for some much-needed rest.
On Saturday I went digging around and found a white paper from Intel on the PXE specification that confirmed what I had been thinking all along: if a PXE client doesn’t get both a DHCP lease offer and a PXE offer with option 60 in it, it will not send a DHCP request. Which makes sense: why would it need an IP address if it has nowhere to boot from? In addition, if it does not receive the DHCP and PXE offers within the same discover transaction, it will keep sending discovers. Looking back at the packet captures we had taken where it was getting DHCP and PXE offers, the PXE offer was always delayed and in a separate transaction. I found that there is an option to set a response delay on the DHCP scope and decided to give it a shot to see if I could get both offers to land within the same transaction as a workaround… however, this ended up not being needed.
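The behavior the white paper describes can be modeled in a few lines. This is my own simplified sketch of the client-side decision as I read the spec, not actual firmware code:

```python
def will_send_dhcp_request(offers):
    """Simplified PXE client decision: the client only proceeds to a
    DHCPREQUEST if a single discover transaction (matched by xid) yielded
    both a plain DHCP lease offer and a PXE/proxyDHCP offer carrying
    option 60 = "PXEClient". Otherwise it keeps broadcasting discovers.
    Offers are dicts like {"xid": 1, "kind": "dhcp"} or
    {"xid": 1, "kind": "pxe", "option60": "PXEClient"} (my own shape)."""
    seen = {}
    for offer in offers:
        kinds = seen.setdefault(offer["xid"], set())
        if offer["kind"] == "dhcp":
            kinds.add("dhcp")
        elif offer["kind"] == "pxe" and offer.get("option60") == "PXEClient":
            kinds.add("pxe")
    # Boot proceeds only if some transaction collected both offer kinds.
    return any({"dhcp", "pxe"} <= kinds for kinds in seen.values())
```

This matches what we saw in the captures: the DHCP offer and the delayed PXE offer landed in separate transactions, so the client never sent a request and just kept discovering.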
Light at the end of the tunnel
Sunday morning I logged in to test my DHCP theory, only to find that PXE was working. It was taking a little longer than normal, but it was in fact working. Looking at the site server I saw that the MP_RelayEndpoint queue had emptied out and open connections were down to 5,000 with only a few in TimeWait status. I checked the second MP and saw that it had built up quite a backlog of state messages, sitting at around 88,000 total, but it was slowly grinding through them. However, the CPU was pegged at 100%… I checked and saw that it only had 4 CPU and 8 GB of RAM, so I doubled it to 8 CPU and 16 GB of RAM. The CPU was still pegged at 100%, with ccmexec taking up the majority, but it did seem to start processing the state messages faster. I also checked the site server, which has 8 CPU and 32 GB of RAM, and saw ccmexec taking up around 50% of the CPU, with the total running around 80-90%.
Checking back on Monday morning, PXE was functioning normally and CPU usage on both the site server and the second MP had dropped to a normal level. The second MP still had around 55,000 state messages in the queue, but the count continued to drop.
Tuesday morning, with all the clients coming online and the rest of my team beginning operations, PXE kept functioning normally. However, we were still seeing some issues: when distributing content to DPs it was taking quite a long time… but after looking in the pulldp.log I could see the content was already 100% distributed; delayed state messages were making it seem as though it wasn’t done. Checking the second MP, it was still sitting around 55,000 in queue and not dropping. I fired up Performance Monitor, checked the outboxes counter for messages sent per minute, and it was only 200. On a hunch, I figured CrowdStrike probably didn’t have proper exclusions since this was a new server, and I uninstalled it. Messages per minute immediately jumped to 3,000.
All in all, it took until Thursday for the entire backlog of queued-up state messages, hardware inventory, etc. to process through and everything to fully return to normal. The moral of the story: when running a SQL HA cluster, make sure the configuration matches on every node and that the servers are sized appropriately, and remember that things are not always as they appear. Don’t get stuck on a red herring; always look at the big picture.