Wednesday, February 01, 2006

 

Testing the Limits

We recently ran a test of the search software across thirty domains to probe the abilities and limits of the latest interface version. Although not all pages across the domains were crawled, the search software crawled and indexed approximately 4,500 pages. The initial crawl speed was extremely good, downloading about nine pages a second. This speed held for the first two thousand pages, but indexing began to slow down after that (crawling retained its speed, but the index rate dropped to about one page every one to two seconds). The main cause of the slowdown was the indexer's attempt to avoid duplicate content: as the number of pages in the index increases, this becomes a big task.
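One way to keep the duplicate check from growing with the index is to store a fingerprint of each page and look it up in a hash table. The sketch below is only an illustration of that idea, not how the indexer currently works: the `content_fingerprint` function and `DuplicateFilter` class are hypothetical names, and it only catches pages whose text is identical after normalizing case and whitespace.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """Hash the page text after normalizing case and whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Hypothetical helper: remembers which fingerprints were already indexed."""

    def __init__(self):
        self._seen = {}  # fingerprint -> URL of the first page with that content

    def check(self, url: str, text: str):
        """Return the URL of an earlier identical page, or None if this one is new."""
        fingerprint = content_fingerprint(text)
        first_url = self._seen.get(fingerprint)
        if first_url is None:
            self._seen[fingerprint] = url
        return first_url
```

Each lookup is a single dictionary access, so the cost per page should stay roughly flat as the index grows. Near-duplicates (same article, different navigation) would still need a separate, later pass.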

Statistics for the data dug up by the crawl:
Pages: 4,550
Lexicon: 37,344 words
Inverted Index: 876,135 entries
Links: 122,631

This part of the data took up 135MB in the MySQL database the crawler/indexer works with. After converting it to the MaxMo Search data format, which is optimized for fast searching, the data only took up 56MB (interestingly, the database format created for MaxMo Search, which far exceeds the speed of MySQL, is also smaller). Search times were usually 0.1-0.3 seconds, but for super common words (i.e., words that appeared on two thousand pages), the search time exceeded one second.

Results of this test

Indexing must be improved to better detect duplicate pages, or at least to stop worrying about duplicate pages while indexing (maybe deal with them in later processing). Also, though not really a problem in the test, a long-needed improvement is a better ability to track HTTP redirects. Although they are properly handled now, the redirect page can end up being crawled multiple times.
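A rough sketch of what better redirect tracking could look like: remember where each redirecting URL ended up, and decide whether to crawl based on the final address rather than the one that was requested. The `RedirectTracker` class and its method names are made up for illustration, and the sketch assumes the crawler calls `record_redirect` whenever a fetch is answered with a redirect.

```python
class RedirectTracker:
    """Hypothetical helper that maps redirecting URLs to their final targets."""

    def __init__(self):
        self._final = {}       # requested URL -> URL it redirected to
        self._crawled = set()  # final URLs that were already crawled

    def record_redirect(self, source: str, target: str) -> None:
        """Remember that fetching `source` redirected to `target`."""
        self._final[source] = target

    def resolve(self, url: str, max_hops: int = 10) -> str:
        """Follow known redirects, stopping on loops or after max_hops."""
        seen = set()
        while url in self._final and url not in seen and len(seen) < max_hops:
            seen.add(url)
            url = self._final[url]
        return url

    def should_crawl(self, url: str) -> bool:
        """True only the first time a given final URL is requested."""
        final_url = self.resolve(url)
        if final_url in self._crawled:
            return False
        self._crawled.add(final_url)
        return True
```

With this in place, a link to the old address and a link to the new address both resolve to the same entry, so the page is only queued once.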

Ranking must be easier to tweak, and must also factor in IDF (Inverse Document Frequency) scores to reward rare (and most likely more descriptive, or misspelled) words.
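For reference, the textbook form of IDF is the logarithm of the total page count divided by the number of pages containing the word. The sketch below uses that basic form; the function names and the simple count-times-IDF scoring are assumptions for illustration, not the ranking the search software uses.

```python
import math


def idf(total_pages: int, pages_with_word: int) -> float:
    """Basic inverse document frequency: log(N / df). Unknown words score 0."""
    if pages_with_word <= 0:
        return 0.0
    return math.log(total_pages / pages_with_word)


def score(term_counts: dict, doc_freqs: dict, total_pages: int) -> float:
    """Sum of (occurrences on the page) * IDF over the query words.

    term_counts: query word -> occurrences on the page being scored
    doc_freqs:   word -> number of pages in the index containing it
    """
    return sum(
        count * idf(total_pages, doc_freqs.get(word, 0))
        for word, count in term_counts.items()
    )
```

With the 4,550 pages from this test, a word found on 2,000 pages gets a much smaller weight than a word found on only a handful, which is exactly the reward for rare words described above.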

Results need to be sped up for super common words, by ignoring results after, say, the 1,000th page.
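The cutoff itself is simple to express. In this sketch, `candidate_pages` and `MAX_CANDIDATES` are hypothetical names, and `postings` stands for whatever iterable of page IDs the index returns for a word.

```python
from itertools import islice

MAX_CANDIDATES = 1000  # the "say, 1,000th page" cutoff


def candidate_pages(postings, limit: int = MAX_CANDIDATES) -> list:
    """Return at most `limit` page IDs, reading no further into the postings."""
    return list(islice(postings, limit))
```

One caveat: cutting off early only gives good results if the pages for each word are stored with the most promising ones first (for example, by a precomputed page score). Otherwise the pages past the cutoff could be better than the ones that were kept.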

The crawling and indexing system will be revamped to put less stress on the servers being crawled, and to allow for more continuous crawling.

In addition, the overall system should eventually work with only the MaxMo Search data format. Since testing has shown it to be consistently and significantly faster than MySQL (simply due to its very narrowly designed parameters), it seems only logical to take MySQL completely out of the picture. Doing so will also remove the time-consuming step of converting the MySQL data into MaxMo Search data.
