A few months back, I came across a peculiar issue in my production network. We had three data centers with WAN links between them. Each had a Domain Controller, with one location having two (a regular and a Read-Only Domain Controller). One of my colleagues reported an intermittent replication issue that had been discovered only after one of the DCs failed, of course, not long before my arrival. Setting up some form of monitoring had been on the overall plan, but my predecessor had never gotten around to it. So, like many other projects, it was added to my technical debt list.
Shortly after promoting the RODC, I decided to use a PowerShell script to monitor replication between DCs and shoot off an email, along with alerting our monitoring application, if something was amiss. After about a week of monitoring, we noticed almost nightly interruptions. At the time, our networking team was in the middle of some large overhaul projects, including a rather large reconfiguration of the WAN, so we chalked it up to that work. It wasn't until a few months had gone by and networking had completed their changes that we noticed the alerts were still firing.
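For reference, the monitor was along these lines. This is a stripped-down sketch rather than the exact script, and the forest lookup, SMTP relay, and email addresses here are placeholders:

```powershell
# Minimal sketch of a DC replication monitor -- SMTP server and addresses
# below are placeholders, not the real values from our environment.
Import-Module ActiveDirectory

$smtpServer = "smtp.example.com"     # placeholder SMTP relay
$recipient  = "admins@example.com"   # placeholder alert address

# Pull any replication failures reported across the forest
$failures = Get-ADReplicationFailure -Target (Get-ADForest).Name -Scope Forest

if ($failures) {
    # Build a readable summary of which DC/partner pairs are failing and why
    $body = $failures |
        Select-Object Server, Partner, FailureType, FailureCount, FirstFailureTime, LastError |
        Format-List | Out-String

    Send-MailMessage -To $recipient -From "ad-monitor@example.com" `
        -Subject "AD replication failures detected" -Body $body -SmtpServer $smtpServer
}
```

Schedule something like that to run every few minutes and you get a rough picture of when replication actually breaks.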
I started investigating, but during the initial troubleshooting we couldn't recreate the issue or capture live logs. Since it was an intermittent problem that didn't seem to be impacting production or performance (we saw regular replication with only sporadic interruptions), I casually ignored it. Of course, like most problems in life, this bit me pretty hard.

One day, while onboarding a new user, we noticed that after creating him on one of the Domain Controllers, his account didn't seem to replicate to the others; he couldn't log into servers in the other data centers. A little digging revealed that the Domain Controller we had created him on was throwing 1722 replication errors and wasn't replicating AT ALL with the other DCs. So that technical debt came back, and it was now a top priority.

I began with the basic troubleshooting for communication between DCs. Ping and tracert seemed fine and returned expected results, so we ruled out the network. I dug into DNS next, as that's almost always the second suspect, and found a few problems: stale metadata from old DCs that had never been cleaned up, incorrect entries for servers whose IP addresses had been changed for various reasons, and so on. One item stuck out, though. The IPv6 address of the default gateway on the DC I had created the user on, the one throwing the most errors, was incorrect and way out of scope. I corrected it and kept troubleshooting. The issue went from consistent back to intermittent.

Thinking I had triumphed, I let it run for 24 hours to make sure we were in the clear. Of course, the next morning, no such luck. I went back to troubleshooting and saw the traffic hitting the Windows host firewall about every third try. My immediate thought, besides the firewall, was DNS again. I tore back through the DNS records and, sure enough, found that the DC's DNS records were once again showing weird IPv6 addresses that were completely out of scope for our network.
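For anyone following along at home, this is roughly the set of checks I leaned on during that round of troubleshooting. The DC name below is just a placeholder, not one of our actual servers:

```powershell
# Placeholder DC name -- substitute the suspect Domain Controller
$dc = "DC01.corp.example.com"

repadmin /replsummary        # quick forest-wide health summary; 1722 = "The RPC server is unavailable"
repadmin /showrepl $dc       # per-partner replication status for the suspect DC
dcdiag /s:$dc /test:dns      # sanity-check DNS registration and resolution for that DC

# Look for stray AAAA (IPv6) records registered for the DC in DNS
Resolve-DnsName -Name $dc -Type AAAA

# Compare against what the DC actually has configured locally,
# including the IPv6 default gateway that turned out to be wrong
Get-NetIPConfiguration | Select-Object InterfaceAlias, IPv4Address, IPv6Address, IPv6DefaultGateway
```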