Fixing the Timer Service when everything breaks down

This is a pure troubleshooting post (that will save you days!) if you experience any of the following problems:

  • You have trouble deploying solutions to a your farm (more than one node) where nothing happens the deployment never completes
    • You may be able to complete deployments by executing “stsadm –o execadmsvcjobs” on every server in the farm
  • You experience frequent CPU spikes of 100% every few minutes on your frontend servers and they are all but unresponsive at those times
  • You find one of the following lines in the ULS logs (the SharePoint logs)
    • OWSTimer.exe, w3wp.exe (Id: 888k) : “File system cache monitor encountered error, flushing in memory cache”
    • OWSTimer.exe (Id: 8xqx): “Exception in RefreshCache. Exception message :Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information.”
    • OWSTimer.exe (5utx): “The timer service could not initialize its configuration, please check the configuration database. Will retry later.”
  • “Some” of the administrative pages in the central administration occasionally fails with the error
    • “Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information”
  • The timer and topology cache is never updated (see next section…)
  • PSConfig (GUI wizard or not) fails to configure your server and the log file reveals the “LoaderExceptions” error above.

The above problems will probably affect all servers in the farm.

Note: The fusion log troubleshooting will be applicable to all .NET loader errors, not just SharePoint specific stuff.

First shot: Clear the Config Cache

A little known fact is that SharePoint maintains a disk cache (and memory) based configuration on every server that contains the topology information and timer job definitions. Go have a look at “C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\<guid>\”.

Sometimes old stuff can get stuck in there and you can kick start the timer by clearing it.

The procedure is simple (do it for every server):

  1. Stop the Administration and Timer service
  2. Delete all XML files in the directory (not the folder itself)
  3. Open “Cache.ini” and write the number 1 instead of the existing number (you might want to make a note of it)
  4. Start the services again
  5. Wait for a minute or two and see if the folder starts to fill up with xml files. It is likely that it will contain less than before clearing it.
  6. Check the cache.ini file. If it’s accessible and the number is considerable greater than 1 your cache has been properly initialized and chances are that your problems are now fixed. It didn’t fix my problem, so you may need to read on… (if you didn’t have the “888k” log entry mentioned above you probably have it now)

The above procedure is grabbed from tomblog’s excellent post (that will also help if you actually did delete the folder too)

[Updated] Or you can run this batch file. [/Updated]

If the procedure didn’t fix the problem you’ll notice that the xml files are updated (timestamp) every few minutes coinciding with the CPU spikes.

Second shot: Digging in (Using the fusion log)

The core problem in my case was that some .NET class/assembly could not be loaded as the message “Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information” strongly hints. It may be a little surprising how many times an assembly, completely unrelated to the task at hand (e.g. deployment), is actually loaded. To fix the problem all we have to do is identify the assembly and “make it available”.

Enabling Fusion Log

You need to debug the failed assembly bindings you need to look into the fusion log. Fusion is an ancient (.Net 1.0) codename for the GAC or assembly loader (I think).

To enable the log you need to add/change three keys in the registry:

HKLM\Software\Microsoft\Fusion\LogPath    REG_SZ    (path to local directory)

HKLM\Software\Microsoft\Fusion\LogFailues    DWORD     1

HKLM\Software\Microsoft\Fusion\EnableLog    DWORD    1

It is not strictly necessary to restart anything, but I recommend that you now restart your timer service in order for it to log any binding errors that occurs. Try starting it a couple of times with 10 minutes in between. That should reveal any binding error patterns.

(Refer to Brad Adams for more info)

Interpreting the Log

Finally we will use the “fuslogvw.exe” program that is part of the .NET SDK to view the actual logs. Your development machine will have this file; copy the executable to your server.

It is not a very good program. It gets the job done, but it’s hard to figure out, you can’t order errors by date you can’t resize the window, etc..

Hopefully your window will contain a lot less entries (in my case the offending entry is the highlighted one. It also failed with similar errors from the psconfig wizard and powershell).

Look for patterns using the timestamps. Did you get a group of binding errors a few minutes after you started the timer service? Or are there just some errors that look spurious? I’ll recommend skipping the core internal microsoft dll’s to begin with (msvcm80 is the c runtime library version 8 that .Net 2.0 uses).

So what goes into the log? Every assembly binding failure, which is (at least):

  • If an assembly cannot be located (it will show you where it searched and any assembly redirects)
  • If the dll load methods within the assembly throws an exception – e.g. a dependent assemly could not be found – it sadly looks exactly like the dll file could not be found (other exceptions are possible but unlikely for .NET people). The one way to distinguish is to look for the line “LOG: GAC Lookup was unsuccessful.” if it is not there then it was (probably) found in the GAC and a dependent assembly failed.

The last bullet means that if assembly A depends on B and B could not be found then both the binding for A and B fails (at the same time). To distinguish the two I’ll recommend that you look at the “Calling Assembly” in the bind log.

In my case:

  1. The Nintex.Workflow dll failed to load and the calling assembly was Microsoft.SharePoint
  2. The Nintex.Workflow.Charting.dll failed to load and the calling assembly was Nintex.Workflow. Aha, so the Nintex.Workflow.dll was actually found but failed to find the dependent Nintex.Workflow.Charting assembly.
  3. Found the assembly on a development machine and copied it to the server. Retried. Added another missing assembly
  4. And everything worked!

I should stress that neither the error type nor the troubleshooting is Nintex workflow specific.

The bind log failure from step 2 above was:

*** Assembly Binder Log Entry (1/13/2009 @ 4:34:15 PM) ***

The operation failed.

Bind result: hr = 0×80070002. The system cannot find the file specified.

Assembly manager loaded from: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll

Running under executable C:\Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN\STSADM.EXE

— A detailed error log follows.

=== Pre-bind state information ===

LOG: User = ….

LOG: DisplayName = Nintex.Charting, Version=1.0.0.0, Culture=neutral, PublicKeyToken=913f6bae0ca5ae12

(Fully-specified)

LOG: Appbase = file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/

LOG: Initial PrivatePath = NULL

LOG: Dynamic Base = NULL

LOG: Cache Base = NULL

LOG: AppName = NULL

Calling assembly : Nintex.Workflow, Version=1.0.0.0, Culture=neutral, PublicKeyToken=913f6bae0ca5ae12.

===

LOG: This bind starts in default load context.

LOG: No application configuration file found.

LOG: Using machine configuration file from C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\config\machine.config.

LOG: Post-policy reference: Nintex.Charting, Version=1.0.0.0, Culture=neutral, PublicKeyToken=913f6bae0ca5ae12

LOG: GAC Lookup was unsuccessful.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting.DLL.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting/Nintex.Charting.DLL.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting.EXE.

LOG: Attempting download of new URL file:///C:/Program Files/Common Files/Microsoft Shared/web server extensions/12/BIN/Nintex.Charting/Nintex.Charting.EXE.

LOG: All probing URLs attempted and failed.

Note: To grab a dll from the GAC on one machine you need to use a shell (cmd, bash or powershell) to go into the c:\windows\assembly\gac_msil\… folder structure and copy it.

Why did it go wrong in the First Place?

It’s always a useful exercise to figure out what went wrong and why instead of just fixing the error at hand.

The root cause of the error was that the dll’s in questions were distributed through a wsp solution file which was referenced in other code. When that particular solution was undeployed the dll’s were removed (from the GAC) and the new updated wsp file only contained the new versions. Suddenly something went missing :-(

It can also happen if you use some auto wsp packaging procedures like vbs scripts or the ubiquitous WSPBuilder. What if one of the SharePoint system dll’s are marked as “copy local” on one (of the others’) local develop environment and automatically included in the wsp file? It will be deployed, no problems, but when undeployed everything stops as the SharePoint system dll just got removed from the GAC. Oops. Prevention is obviously to use a build server to make clean builds, to educate your people and to have the means to troubleshoot when it didn’t work out.

About these ads

About Søren Nielsen
Long time SharePoint Consultant.

17 Responses to Fixing the Timer Service when everything breaks down

  1. Pingback: Timer Job Monitoring Utility « Brandon.ToString()

  2. Gopi Ega says:

    Thanks you very much.
    In my case it is Microsoft.sharepoint.ApplicationPages.dll is missing from GAC and bcoz of which,I was unable to execute stsadm ActivateFeature command.

    Once again Thank you very much for this detailed explanation of how to locate missing dll

  3. Mike says:

    Just passing by.Btw, you website have great content!

    ______________________________
    Don’t pay for your electricity any longer…
    Instead, the power company will pay YOU!

  4. Rafael Audy Ganzner says:

    Serious! THANKS SO MUCH! That think was making me crazy! :)

  5. Pingback: Fixing the SharePoint Timer Service with PowerShell « Mickey Jervin’s SharePoint blog

  6. Hannh Alex Xavier says:

    Excellent. Thanks a lot.

  7. Sander says:

    Brilliant, we have a farm with 1 dedicated index/query machine. One day all the timerjobs stopped running on this machine only (so no more crawling for us). We tried deleteing the cache, but this gave us indeed the 888k messages and still the jobs did not run., Tried the second option (we use 64bit, make sure the regkeys are 32bit, so dword and no qword) and investigated the reports. FOund that one of the core MS sharepoint dlls (ms.spnt.application.somthing) was not in the GAC and my new timerjobs needed it.

    Appearently, crashing one timer job crashes ALL timer jobs on that server…. ohoh..

    But you solution helped a lot. Thanx!

    Sander

  8. Ink says:

    Just want to thank you for the Timerservice article and batch file. Solved my issues on first shot. Never new that the Timerservice had a cache location. Learn something new everyday with MOSS.

    Regards,
    Ink

  9. David Grayston says:

    Very helpful post. Tbshooting a flood of Sharepoint log entries re: “job-config-refresh OWSTIMER.EXE (0×1640)”. Following these steps to clear the cache seems to have cured the problems.

  10. Pingback: Fusion Logging « Sladescross's Blog

  11. Joe says:

    This was brilliant, thank you. It was spot on.

    I want to add that the solution deployment wizard from Bamboo included log4net, which very helpfully pointed out the rogue DLL.

  12. Chhotu says:

    Hello Nielsen

    Your post is great…

    We have three servers, 2 WFE and 1 App. I have tried above option on three server. Out of which, App server and 1 WFE server started to work but 1 WFE server still giving problem.

    I have tried all above steps which you have mentioned….. All dll are in place.

    Do you know any reason

  13. Thank you, thank you, thank you!
    I found this blog after hours of frustrating debugging and immediately my problem was solved.

  14. Amol Ghuge says:

    Excellent, Thank you so much for sharing this valuable information!!

  15. mike says:

    Brilliant, thanks for this, that did the trick,

  16. Corinna says:

    Thank you for sharing this information! I was able to fix our systems by following your instructions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: