Make the Search Work for You
08 Feb 2008 14 Comments
The SharePoint 2007 search engine (MOSS) is head and shoulders above the one found in SPS 2003, and it’s a breeze to set up. Right? Yes and maybe.
In this iteration of SharePoint I consider the search engine to be very good, and if you spend the time to configure it properly it will work great for your site. But nothing comes for free, and I’ve collected the few issues I ran into configuring the search on my farms, which include several local SharePoint collaboration and publishing sites, people search, and external websites with and without SSL.
I’m not too fond of the search webparts and their configuration options, but consider them out of scope for this entry.
Don’t consider this an exhaustive guide to search setup; there are plenty of areas that I don’t cover, e.g. crawl impact rules. I didn’t need them and chances are you won’t either.
Setting up the Indexer Role
I recommend setting up the index server as follows:
- Behind the firewall so no users can access it directly
- Hosting the “Windows SharePoint Services Web Application”, i.e. the front-end for all your sites
- Using a particular server to index all (local) content, namely itself (set this on the Central Administration / Operations / Services on Server / Office SharePoint Server Search Service page)
That way your indexing does not affect your front-end web servers significantly – only indirectly, as you’re still querying the same database.
It works great if you know and accept the following two caveats:
The timer service will execute a job that tries to modify your hosts file (%SystemDrive%\windows\system32\drivers\etc\hosts), which is a rare and alarming thing for any application to do. It will add the default access mapping for each of your local sites to the hosts file, pointing to one of the local IP addresses (so be sure that your web sites respond on all IP addresses in your IIS manager, or at least the one SharePoint chooses – it will not use 127.0.0.1).
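The entries the timer job adds look something like this (hostnames and IP address are made-up examples for illustration):

```
# %SystemDrive%\windows\system32\drivers\etc\hosts
# Example entries added by the SharePoint timer job - your hostnames and IPs will differ
192.168.1.10    intranet.company.com
192.168.1.10    mysites.company.com
```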
By default no web application is allowed to do this, so you’ll have to allow it explicitly (have a look in the SharePoint and/or Windows Application log for this error)! Grant your Central Administration service user write/modify access to the “%SystemDrive%\windows\system32\drivers\etc\hosts” file to fix the issue (note: surprisingly it’s not the service user running your timer service).
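The grant can be done from a command prompt on the index server, for example with cacls (DOMAIN\spCentralAdmin is a placeholder – substitute your Central Administration service account):

```
rem Grant the Central Administration service account change (write/modify) access
rem to the hosts file; /E edits the existing ACL instead of replacing it
cacls %SystemDrive%\windows\system32\drivers\etc\hosts /E /G DOMAIN\spCentralAdmin:C
```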
If you are just a little bit paranoid you can remove the access again afterwards. I chose not to, as subsequent changes to access mappings would otherwise require me to fiddle with this again and again.
One nice thing about this scheme is that any SSL certificates that you utilize on your site will be valid as the hostname will match the certificate hostname, provided that you had a valid one in the first place – just remember to install your certificates on the index server as well.
- The “Check services enabled in this farm” feature might now report that “Web Front End servers in the server farm are not consistent in the services they run”. Technically your index server is now also a front-end server (though users can’t access it) and therefore things look a bit fishy to SharePoint. Obviously this warning is a false positive and can safely be ignored.
Finally, remember to install iFilters for any file types that are not supported out of the box, e.g. PDF. I generally install these filters on all servers to ensure that they will work as expected the day I decide to shuffle the server roles a bit.
Note: You should also add icons for these extra file types in the docicon.xml file. I won’t dive into this as other people have done so (here).
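For reference, a PDF mapping in DOCICON.XML (found under the 12 hive in TEMPLATE\XML) looks something like this – assuming you have already copied an icon file named pdficon.gif to TEMPLATE\IMAGES:

```xml
<!-- Inside the <ByExtension> element of DOCICON.XML -->
<Mapping Key="pdf" Value="pdficon.gif" />
```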
Accessing Search Configuration: Fixing 403 forbidden (on /ssp/admin/_layouts/searchsspsettings.aspx)
This is a rare issue that you run into if you use the same farm topology as mine; if you haven’t seen this error then jump happily to the next section.
I had to open a premier support case for this issue and spent a number of hours looking at log files and having live sessions with the nice MS guys. At the end of the day it was a security issue that you’ll encounter in the following scenario:
- Your farm has at least two servers
- You access (or try to access) the SSP search configuration page on a server that does not hold the index role
- You have different service users for the SSP site and your Central Administration site (this is a best practice that should always be followed). Note that the service user for your Central Administration site is the same one that is used for DB access
The cause of the error is that the Search Settings page executes some web service calls to query the index server for indexing status – if the page is hosted on the same server as the index server it will just use the object model and you’ll have no problems at all. I’m talking about all the values that are listed as “Retrieving…” when you first enter the page and that rapidly change to something useful.
The page queries a web service hosted on the “Office Server Web Services” web application on your index server that is restricted to administrative users. As the SSP site is running as a different service user than the Central Administration site (and definitely not as any kind of domain or local admin) that call fails. The solution is simply to add your SSP service user to the local SharePoint admin groups, WSS_ADMIN_WPG and WSS_RESTRICTED_WPG, on the index server.
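On the index server this can be done from a command prompt (DOMAIN\spSSPService is a placeholder – use your SSP service account):

```
rem Add the SSP service account to the local SharePoint admin groups on the index server
net localgroup WSS_ADMIN_WPG DOMAIN\spSSPService /add
net localgroup WSS_RESTRICTED_WPG DOMAIN\spSSPService /add
```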
Setting up SharePoint Content Sources
Any new Shared Services Provider (SSP) that you associate with your farm will come with a default content source named “Local Office SharePoint Server sites” that you can happily use for most of your needs.
Whether you want several content sources for your various sites or just group all your sites into this one doesn’t matter much. It’s basically a question of how much granularity you need in controlling the crawl schedules and the ability to start/stop/reset each site. When you are troubleshooting or testing your search settings it is convenient to split it into several parts; other than that I don’t see a big need. It’s very easy to change later on when you change your mind one way or the other.
What you need to do here is to add the root address of each of your site collections. Be sure to use DNS names that are also part of the access mappings for the sites, and use http:// and https:// as appropriate. If you use SSL sites I recommend that you let the indexer crawl the sites through SSL instead of adding an “internal access mapping” without SSL, i.e. use the same DNS names as your users to simplify things as much as possible.
Remember that the crawl user must have at least read rights on your sites to make them searchable. Don’t worry about your crawl account having access to all areas of the sites; the search results will be security trimmed so users only get results from the subset of items/pages/sites that they can access.
Searching MySites and People Search
If you want to be able to search documents on the MySites you also need to add the MySite web application to the list of start addresses – just the root, not the managed path, e.g. use https://mysites.company.com not https://mysites.company.com/personal.
To make people search work you need to add a second entry for the MySites web application with the protocol sps3:// or sps3s:// for SSL, e.g.
- If you host your MySites on “https://mysites.company.com/personal/user1” using SSL you should add “sps3s://mysites.company.com/” as a start address
- If you host your MySites on “http://mysites.company.com/personal/user1” add “sps3://mysites.company.com/”
Finally, the very last step is to grant the crawl user read permissions for the MySites. On the SSP main administration site, go to “Personalization services permissions” and add your crawl user with the “Use personal features” right. You might already have granted this through other groups, e.g. if you enable all users to use and create MySites by granting the “Use personal features” and “Create personal site” rights to “NT Authority\Authenticated Users”.
Note: On the Search Settings page the “Default content access account” is what I call “the crawl user”.
Handling SSL and Certificate Errors
If you use SSL for some of your sites, chances are that you are using self-issued certificates that are not valid on some of your dev and test environments. Or perhaps you use the real certificates but with a different DNS name than the one specified in the cert.
Any of these errors will cause the indexer to stop crawling the site.
To ignore these errors go to Central Administration / Applications / Manage search service / Farm-level search settings and check “Ignore SSL certificate name warnings”.
Note: If you use a self-issued certificate I’m not sure whether or not you need to add it to the list of certificates that the server trusts, regardless of this switch.
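If it does turn out to be necessary, importing the certificate into the machine’s Trusted Root store can be done with certutil (cert.cer is a placeholder for your exported certificate file):

```
rem Add a self-issued certificate to the local machine's Trusted Root Certification Authorities store
certutil -addstore Root cert.cer
```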
Setting up External Content Sources
Enabling search of external, non-SharePoint sources is generally fairly easy; however, I found a lot of special cases that needed some tweaking.
First off, understand that the indexer is not a browser:
- It does not execute flash content
- It does not store cookies
- It follows some index/robot rules that your browser doesn’t – usually nothing to worry about; it indicates that the people creating the site actually gave some thought to search engines and made an effort to support them
So with that out of the way, happily create a new content source with type “Web Sites” and add all of your start addresses to it. Now you can start the indexing and afterwards have a look at the crawl log (see below) and then go back and forth and tweak the stuff you need.
Crawl Rules might be needed
If you crawl sites that use query parameters to distinguish pages, e.g. http://wiredispatch.com/news/?id=36938, you need to add a crawl rule to have SharePoint crawl all those pages. I guess about half the CMS systems out there use such a scheme, so you’ll very likely end up here.
The solution is easy:
- Go to Search Settings / Crawl Rules and click “New Crawl Rule”
- Enter the pattern for which this rule should be effective. You can just enter http://SiteToBeCrawled.com/* or, if you want a single rule for all those sites out there, use http://* (I do that). The generic approach works fairly well; just be careful with sites that automatically add dynamic session/caching query parameters. The indexer will be very confused by those and recreate the full index of that site every time it crawls. If that happens you can add another, higher-priority rule to limit the behavior for that site.
- Enable “Crawl complex URLs (URLs that contain a question mark (?))”
Additional File types might be needed
Some sites use non-default file extensions for their URLs that SharePoint won’t crawl. I’ve only found one example of this, where one of my external sites used xxx.exe at the end of the URL with some query parameters attached. I guess it’s some kind of CGI – who uses that these days anyway?
I suppose likely candidates for unsupported-out-of-the-box extensions would be: exe, dll, cgi, bin, etc.
Don’t go about adding file types to the crawler unless you have to. First add a crawl rule (if needed), perform a full indexing of the content source and use the log to verify that the pages are still not being crawled.
Then go to Search Settings / File types and add a new one. Just write the extension without the dot, e.g. “exe”.
Note: This is hardly a security risk in my mind as you are not enabling users to upload exe files to any of your SharePoint sites, you just let the crawler include them in the index.
Troubleshoot the Crawl Logs
After all the setup steps you need to take a long hard look at the crawl logs – or, more likely, you have been doing this all along and are trying to fix the problems spotted there (which is what turned you towards this blog).
It’s fairly easy. Go to Search Settings / Crawl logs, which will give you a view of the crawl log grouped by start address (regardless of content source).
Look for (and drill-down in):
- Any start address with none, or only a very limited number, of “successfully crawled” documents.
- Any start address with a number of errors.
- Warnings are to be expected. Whenever a document/page is no longer found at the start address (i.e. removed) it will be flagged with a warning. When you fiddle around with the search settings there’s bound to be some left-overs from your fumbling 😉 Look at the time stamps to verify that it’s nothing current.
Follow appropriate steps above to solve the problem or use a crawl rule to exclude parts of the content if needed.
One annoying error I had several times in my log was “The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly”. That is a rather generic error message, and I found out that it generally covers problems communicating with the server, i.e. the target server is responding with an HTTP 5xx “internal server error” code or not at all.
Quite often, if I hit that particular page myself I would see the error. For instance, on one site an email contact form was failing because it used the referrer header, which isn’t sent by the indexer – nor by a browser hitting the page directly. If you followed the links on the site it worked fine… Guess that one went through their tests 😉
If you’re having this problem for local SharePoint sites (and you verified that the page works) remember to test it on the index server, not just the front-end, as the index server uses itself for indexing. You might have forgotten to deploy some resources, or about a billion other things. Enable the stack trace on the index server (fiddle with web.config) and fix the actual problem afterwards.
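For reference, enabling the stack trace involves two well-known edits to the web application’s web.config (remember to revert them when you’re done):

```xml
<!-- 1) In the <SharePoint><SafeMode> element, set CallStack="true"
        (leave the other attributes as they are) -->
<SafeMode CallStack="true" ... >

<!-- 2) In <system.web>, turn off the friendly error page -->
<customErrors mode="Off" />
```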
Barring the issue with forbidden access to the Search Settings page, I must say that I’m pleasantly surprised at the versatility and ease of setup.
One small complaint would be the people search that requires you to use a strange custom protocol (yes, I know there are others too, e.g. “sts3://” and “sts3s://”, that I haven’t covered).
Other stuff that has been left out:
- Any mention of the search role. You can host it on the index server (and avoid propagation issues) or on the front-end servers. I generally put it on the index server as I expect that server to be less busy than the rest
- Any recommendation of crawl schedules; use whatever you find appropriate, but please ensure that you do a full crawl once in a while – don’t trust incremental updates with your life 😉 Keep an eye on CPU utilization to help decide this, and a dozen other things like crawl impact rules, indexing performance settings etc.
As with all things, once you know how, it’s easy to make it work 😉