Fixing the Timer Service when everything breaks down


This is a pure troubleshooting post (that will save you days!) if you experience any of the following problems:

  • You have trouble deploying solutions to a your farm (more than one node) where nothing happens the deployment never completes
    • You may be able to complete deployments by executing “stsadm –o execadmsvcjobs” on every server in the farm
  • You experience frequent CPU spikes of 100% every few minutes on your frontend servers and they are all but unresponsive at those times
  • You find one of the following lines in the ULS logs (the SharePoint logs)
    • OWSTimer.exe, w3wp.exe (Id: 888k) : “File system cache monitor encountered error, flushing in memory cache”
    • OWSTimer.exe (Id: 8xqx): “Exception in RefreshCache. Exception message :Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information.”
    • OWSTimer.exe (5utx): “The timer service could not initialize its configuration, please check the configuration database. Will retry later.”
  • “Some” of the administrative pages in the central administration occasionally fails with the error
    • “Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information”
  • The timer and topology cache is never updated (see next section…)
  • PSConfig (GUI wizard or not) fails to configure your server and the log file reveals the “LoaderExceptions” error above.

The above problems will probably affect all servers in the farm.

Note: The fusion log troubleshooting will be applicable to all .NET loader errors, not just SharePoint specific stuff.

First shot: Clear the Config Cache

A little known fact is that SharePoint maintains a disk cache (and memory) based configuration on every server that contains the topology information and timer job definitions. Go have a look at “C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\<guid>\”.

Sometimes old stuff can get stuck in there and you can kick start the timer by clearing it.

The procedure is simple (do it for every server):

  1. Stop the Administration and Timer service
  2. Delete all XML files in the directory (not the folder itself)
  3. Open “Cache.ini” and write the number 1 instead of the existing number (you might want to make a note of it)
  4. Start the services again
  5. Wait for a minute or two and see if the folder starts to fill up with xml files. It is likely that it will contain less than before clearing it.
  6. Check the cache.ini file. If it’s accessible and the number is considerable greater than 1 your cache has been properly initialized and chances are that your problems are now fixed. It didn’t fix my problem, so you may need to read on… (if you didn’t have the “888k” log entry mentioned above you probably have it now)

The above procedure is grabbed from tomblog’s excellent post (that will also help if you actually did delete the folder too)

[Updated] Or you can run this batch file. [/Updated]

If the procedure didn’t fix the problem you’ll notice that the xml files are updated (timestamp) every few minutes coinciding with the CPU spikes.

Second shot: Digging in (Using the fusion log)

The core problem in my case was that some .NET class/assembly could not be loaded as the message “Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information” strongly hints. It may be a little surprising how many times an assembly, completely unrelated to the task at hand (e.g. deployment), is actually loaded. To fix the problem all we have to do is identify the assembly and “make it available”.

Enabling Fusion Log

You need to debug the failed assembly bindings you need to look into the fusion log. Fusion is an ancient (.Net 1.0) codename for the GAC or assembly loader (I think).

To enable the log you need to add/change three keys in the registry:

HKLM\Software\Microsoft\Fusion\LogPath    REG_SZ    (path to local directory)

HKLM\Software\Microsoft\Fusion\LogFailues    DWORD     1

HKLM\Software\Microsoft\Fusion\EnableLog    DWORD    1

It is not strictly necessary to restart anything, but I recommend that you now restart your timer service in order for it to log any binding errors that occurs. Try starting it a couple of times with 10 minutes in between. That should reveal any binding error patterns.

(Refer to Brad Adams for more info)

Interpreting the Log

Finally we will use the “fuslogvw.exe” program that is part of the .NET SDK to view the actual logs. Your development machine will have this file; copy the executable to your server.

It is not a very good program. It gets the job done, but it’s hard to figure out, you can’t order errors by date you can’t resize the window, etc..

Hopefully your window will contain a lot less entries (in my case the offending entry is the highlighted one. It also failed with similar errors from the psconfig wizard and powershell).

Look for patterns using the timestamps. Did you get a group of binding errors a few minutes after you started the timer service? Or are there just some errors that look spurious? I’ll recommend skipping the core internal microsoft dll’s to begin with (msvcm80 is the c runtime library version 8 that .Net 2.0 uses).

So what goes into the log? Every assembly binding failure, which is (at least):

  • If an assembly cannot be located (it will show you where it searched and any assembly redirects)
  • If the dll load methods within the assembly throws an exception – e.g. a dependent assemly could not be found – it sadly looks exactly like the dll file could not be found (other exceptions are possible but unlikely for .NET people). The one way to distinguish is to look for the line “LOG: GAC Lookup was unsuccessful.” if it is not there then it was (probably) found in the GAC and a dependent assembly failed.

The last bullet means that if assembly A depends on B and B could not be found then both the binding for A and B fails (at the same time). To distinguish the two I’ll recommend that you look at the “Calling Assembly” in the bind log.

In my case:

  1. The Nintex.Workflow dll failed to load and the calling assembly was Microsoft.SharePoint
  2. The Nintex.Workflow.Charting.dll failed to load and the calling assembly was Nintex.Workflow. Aha, so the Nintex.Workflow.dll was actually found but failed to find the dependent Nintex.Workflow.Charting assembly.
  3. Found the assembly on a development machine and copied it to the server. Retried. Added another missing assembly
  4. And everything worked!

I should stress that neither the error type nor the troubleshooting is Nintex workflow specific.

The bind log failure from step 2 above was:

*** Assembly Binder Log Entry (1/13/2009 @ 4:34:15 PM) ***

The operation failed.

Bind result: hr = 0x80070002. The system cannot find the file specified.

Assembly manager loaded from: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll

Running under executable C:\Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN\STSADM.EXE

— A detailed error log follows.

=== Pre-bind state information ===

LOG: User = ….

LOG: DisplayName = Nintex.Charting, Version=1.0.0.0, Culture=neutral, PublicKeyToken=913f6bae0ca5ae12

(Fully-specified)

LOG: Appbase = file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/

LOG: Initial PrivatePath = NULL

LOG: Dynamic Base = NULL

LOG: Cache Base = NULL

LOG: AppName = NULL

Calling assembly : Nintex.Workflow, Version=1.0.0.0, Culture=neutral, PublicKeyToken=913f6bae0ca5ae12.

===

LOG: This bind starts in default load context.

LOG: No application configuration file found.

LOG: Using machine configuration file from C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\config\machine.config.

LOG: Post-policy reference: Nintex.Charting, Version=1.0.0.0, Culture=neutral, PublicKeyToken=913f6bae0ca5ae12

LOG: GAC Lookup was unsuccessful.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting.DLL.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting/Nintex.Charting.DLL.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting.EXE.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting/Nintex.Charting.EXE.

LOG: All probing URLs attempted and failed.

Note: To grab a dll from the GAC on one machine you need to use a shell (cmd, bash or powershell) to go into the c:\windows\assembly\gac_msil\… folder structure and copy it.

Why did it go wrong in the First Place?

It’s always a useful exercise to figure out what went wrong and why instead of just fixing the error at hand.

The root cause of the error was that the dll’s in questions were distributed through a wsp solution file which was referenced in other code. When that particular solution was undeployed the dll’s were removed (from the GAC) and the new updated wsp file only contained the new versions. Suddenly something went missing 😦

It can also happen if you use some auto wsp packaging procedures like vbs scripts or the ubiquitous WSPBuilder. What if one of the SharePoint system dll’s are marked as “copy local” on one (of the others’) local develop environment and automatically included in the wsp file? It will be deployed, no problems, but when undeployed everything stops as the SharePoint system dll just got removed from the GAC. Oops. Prevention is obviously to use a build server to make clean builds, to educate your people and to have the means to troubleshoot when it didn’t work out.

Advertisements