I ran into an odd issue with a friend's network yesterday, and have decided that feeding it to the LazyWeb to chew on is a good idea.
For semi-relevant informational background, the subject network in this diatribe is a small private religious K-12 school w/ roughly 120 Windows XP Professional SP2-based PCs. They have a Windows Server 2003 Standard Edition R2 32-bit file server computer acting as an AD domain controller, a Windows Server 2003 Standard Edition R2 32-bit server computer acting as a replica domain controller, and a Windows Server 2003 Standard Edition R2 64-bit server computer running Exchange 2007. Everything was a clean install migrating away from Novell NetWare last September, and all the client computers were reinstalled from fresh Windows XP installations at the time of the migration. The network infrastructure is all Cisco-based switched 10/100 Ethernet (w/ gigabit uplinks between switches) with no VLANs or QoS. I did most of the original setup, and things are sanely configured (clients pointed to internal DNS servers running on domain controllers, IP addresses handed out via DHCP, etc). In general, everything has been humming along since I did the initial setup, and walking in to look at the issue, I didn't expect it to be anything setup-related. (For bogus political reasons, I can't bill for work on this network anymore, but I became friends with the on-site "computer teacher" and I still stay in touch with him. It's frustrating, but I like the people and try to generally be helpful and nice... *smile*)
Okay, okay-- enough blathering on. The issue shakes out like this:
Last week, the client computers (most of them current on Microsoft updates as of April 20th or so) started hanging during common user activities-- mainly opening and closing Microsoft Office and Adobe CS3 applications and using Internet Explorer. Even Windows Explorer would hang from time to time. If you left a computer sitting in this "frozen" state, it would eventually "free up" and begin to work again. In cases where a hang occurred while closing a program (such as WINWORD.EXE), the program might linger in the process list for a while and eventually disappear. You could open more copies of the program, and as you closed them, you would build up more "hung" copies in the process list.
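For anyone who wants to watch that pile-up without eyeballing Task Manager, here is a rough sketch in Python. It is not something we ran on site. It assumes you have captured the output of `tasklist /FO CSV /NH` on the affected machine and just counts processes per image name; the function name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_processes(tasklist_csv):
    """Count processes per image name from `tasklist /FO CSV /NH` text."""
    counts = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row:  # skip blank lines
            # Image names are case-insensitive on Windows, so normalize.
            counts[row[0].upper()] += 1
    return counts


# Made-up sample in the shape tasklist prints (name, PID, session, session#, memory).
sample = '''"WINWORD.EXE","1234","Console","0","41,200 K"
"WINWORD.EXE","2468","Console","0","38,900 K"
"explorer.exe","1500","Console","0","22,104 K"
'''

counts = count_processes(sample)
print(counts["WINWORD.EXE"])
```

Run it a few times while opening and closing Word: if the WINWORD.EXE count keeps climbing after every window is closed, you are looking at the lingering copies described above.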
My friend and I found that we were able, just by fiddling around with Microsoft Word, Adobe Illustrator, etc, to reliably generate failures in about 3 to 5 minutes of work. Strangely, though, we could only get failures to occur when logged on as a user who did not have local "Administrator" rights.
Of all the users on the network, only one logs on with a non-limited "Administrator" account (for frustrating reasons I won't go into). We checked with this user and found that she has seen no issues. This seems to jibe with our inability to reproduce the issue except when logged on as a limited user.
Watching the hangs with Process Explorer, I was seeing several threads in the hanging programs stuck at kernel32.dll!GetModuleFileNameA+0x1b4. (Without debug symbols, that label is just the nearest exported name plus an offset, so the threads may not actually be inside GetModuleFileNameA itself.) I think this is related to the root cause of the issue, but I don't have the right source code to debug any further down into the stack. Anyway, I kept banging on Process Explorer for a bit, but then we moved on to think about other things that might've changed.
A major "changed" item that we discovered was related to the Trend Micro OfficeScan product. The OfficeScan "32-bit Virus Scan Engine" was updated on 4/22/2008 to version 8.700.1004. My friend recalled that users started reporting problems last Tuesday, and a quick review of the trouble ticketing system revealed that this was the case-- all the trouble reports started on Tuesday after the Trend Micro update.
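We did that ticket review by eye, but the same check is easy to script if your ticketing system can export the "opened" dates. A minimal sketch, assuming you already have those dates as Python `date` objects; the helper name and the sample dates are invented, and only the 4/22/2008 update date comes from our situation.

```python
from datetime import date

# Date the 8.700.1004 scan engine landed on the clients.
ENGINE_UPDATE = date(2008, 4, 22)


def split_around(opened_dates, cutoff):
    """Return (tickets opened before cutoff, tickets opened on or after it)."""
    before = sum(1 for d in opened_dates if d < cutoff)
    on_or_after = sum(1 for d in opened_dates if d >= cutoff)
    return before, on_or_after


# Made-up ticket dates, standing in for an export from the ticketing system.
sample = [
    date(2008, 4, 22),
    date(2008, 4, 22),
    date(2008, 4, 23),
    date(2008, 4, 24),
]

before, on_or_after = split_around(sample, ENGINE_UPDATE)
print(before, on_or_after)
```

A lopsided split like this doesn't prove anything by itself, but it tells you which change is worth testing first.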
I'd already gotten the feeling that the root cause was probably anti-virus related, simply because the issue was happening to such a variety of computers and in a variety of applications. The only commonality between the machines, aside from the operating system, was the anti-virus software. In an earlier test, we removed OfficeScan from a machine on which we had been able to reproduce the issues and tried for 30 minutes to reproduce the issues without success. We allowed Group Policy to reinstall OfficeScan and reproduced the issue again within 5 minutes.
I performed a "rollback" to OfficeScan virus scan engine version 8.550.1001 on our test client computer (via the OfficeScan console). We verified that the client reported the older scan engine, bounced the machine, and spent 30 minutes attempting to reproduce the issue. We could not reproduce the issue with scan engine 8.550.1001. We rolled the engine forward to 8.700.1004 again and were able to reproduce our issue.
For now, we've initiated "rollbacks" on all the client computers that are "online", and my friend will watch tomorrow and roll back any other clients that don't pick up the rollback request automatically. I don't like not being current on updates to things like anti-virus software, but I think it's a necessary evil in this case, and because it's only the scan engine and not the virus definitions, we are probably not opening ourselves up to undue risk.
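Chasing the stragglers boils down to comparing each client's reported engine version against 8.550.1001. Here's a small sketch of that comparison. It assumes you've gotten the client names and engine versions out of the console by some means and into a dict; the client names and helper functions below are hypothetical, and this doesn't talk to OfficeScan at all.

```python
def parse_version(text):
    """Turn '8.700.1004' into (8, 700, 1004) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


ROLLBACK_TARGET = parse_version("8.550.1001")


def needs_rollback(reported):
    """Return sorted client names still reporting an engine newer than the target.

    `reported` maps client name -> engine version string.
    """
    return sorted(
        name
        for name, version in reported.items()
        if parse_version(version) > ROLLBACK_TARGET
    )


# Hypothetical client names; the versions are the two engines discussed above.
sample = {
    "LAB-01": "8.550.1001",
    "LAB-02": "8.700.1004",
    "OFFICE-03": "8.700.1004",
}

print(needs_rollback(sample))
```

Comparing the parsed tuples rather than the raw strings keeps something like "8.1000.1" from sorting below "8.700.1004".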
The only thing I've found on the 'net so far is a single posting, and it's too vague to really get anything out of.
How about it, LazyWeb? Any similar situations happening out there?