<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-21694297</id><updated>2010-04-26T04:25:01.523-07:00</updated><title type='text'>MaxMo Search Development</title><subtitle type='html'>A side project of MaxMo Technologies, the MaxMo Search engine is an attempt to create a compact and efficient search technology that can be implemented for both site search and cross domain vertical search.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://blog.maxmosearch.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default'/><link rel='alternate' type='text/html' href='http://blog.maxmosearch.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Nathan Perkins</name><uri>http://www.blogger.com/profile/05834947993769511759</uri><email>noreply@blogger.com</email></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>5</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-21694297.post-114929833624894486</id><published>2006-06-02T18:28:00.000-07:00</published><updated>2006-06-02T18:32:16.260-07:00</updated><title type='text'></title><content type='html'>As promised, there is now support for Microsoft Excel and Microsoft PowerPoint documents. 
And this list will continue to grow.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;How are non-HTML files indexed?&lt;/b&gt;&lt;br /&gt;To index non-HTML files well, MaxMo Search first calls a conversion tool that outputs an HTML version of the document. This lets the original formatting influence the indexing process, giving more accurate results. In the absence of an HTML converter, MaxMo Search can fall back on text conversion tools.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21694297-114929833624894486?l=blog.maxmosearch.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxmosearch.com/feeds/114929833624894486/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=21694297&amp;postID=114929833624894486' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/114929833624894486'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/114929833624894486'/><link rel='alternate' type='text/html' href='http://blog.maxmosearch.com/2006/06/as-promised-there-is-now-support-for.php' title=''/><author><name>Nathan Perkins</name><uri>http://www.blogger.com/profile/05834947993769511759</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='15333254638867892639'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21694297.post-114921486961211397</id><published>2006-06-01T19:18:00.000-07:00</published><updated>2006-06-01T19:21:09.623-07:00</updated><title type='text'>New File Types Supported</title><content type='html'>The crawler will now process Microsoft Word and WordPerfect documents, allowing them to be fully indexed and accessible 
through the search software. This complements our existing support for Adobe PDFs and RTF (Rich Text Format). Soon to be added: Microsoft Excel and Microsoft PowerPoint.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21694297-114921486961211397?l=blog.maxmosearch.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxmosearch.com/feeds/114921486961211397/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=21694297&amp;postID=114921486961211397' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/114921486961211397'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/114921486961211397'/><link rel='alternate' type='text/html' href='http://blog.maxmosearch.com/2006/06/new-file-types-supported.php' title='New File Types Supported'/><author><name>Nathan Perkins</name><uri>http://www.blogger.com/profile/05834947993769511759</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='15333254638867892639'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21694297.post-113886858040330544</id><published>2006-02-01T23:38:00.000-08:00</published><updated>2006-02-02T00:23:00.413-08:00</updated><title type='text'>Testing the Limits</title><content type='html'>We recently ran a test of the search software across thirty domains to probe the abilities and limits of the latest interface version. Although not all pages across the domains were crawled, the search software crawled and indexed approximately 4,500 pages. The initial crawl speed was extremely good, at about nine pages a second. 
This speed held for the first two thousand pages, but began to go downhill with later pages (crawling still retained its speed, but the index rate was down to about one page every one to two seconds). The main decrease in speed was related to the indexer's attempt to avoid duplicate content; as the number of pages in the index increases, this becomes a big task.&lt;br /&gt;&lt;br /&gt;Statistics for the data gathered by the crawl:&lt;br /&gt;&lt;span style='font-weight: bold; width: 10em; float: left;'&gt;Pages:&lt;/span&gt; 4,550&lt;br /&gt;&lt;span style='font-weight: bold; width: 10em; float: left;'&gt;Lexicon:&lt;/span&gt; 37,344 words&lt;br /&gt;&lt;span style='font-weight: bold; width: 10em; float: left;'&gt;Inverted Index:&lt;/span&gt; 876,135 entries&lt;br /&gt;&lt;span style='font-weight: bold; width: 10em; float: left;'&gt;Links:&lt;/span&gt; 122,631&lt;br /&gt;&lt;br /&gt;This part of the data took up 135MB in the MySQL database the crawler/indexer works with. After converting it to the MaxMo Search data format, which is optimized for fast searching, the data only took up 56MB (interesting that the faster database format created for MaxMo Search, which far exceeds the speed of MySQL, is also smaller). Search speeds were usually 0.1-0.3 seconds, but for super common words (i.e., those appearing on two thousand pages), the search time exceeded one second.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Results of this test&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Indexing&lt;/i&gt; must be improved to better detect duplicate pages, or at least to stop worrying about duplicate pages at index time (perhaps handling them in later processing). Also, though not really a problem in this test, a long-needed improvement is better tracking of HTTP redirects. 
Although they are properly handled now, the redirect page can end up being crawled multiple times.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Ranking&lt;/i&gt; must be easier to tweak, and should also factor in IDF (Inverse Document Frequency) scores to reward rare (and most likely more descriptive, or misspelled) words.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Results&lt;/i&gt; need to be sped up for super common words, by ignoring results after, say, the 1,000th page.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Crawling and indexing system&lt;/i&gt; will be revamped to put less stress on servers being crawled, and to allow for more continuous crawling.&lt;br /&gt;&lt;br /&gt;In addition, the overall system should eventually work with only the MaxMo Search data format. Since testing has shown it to be consistently and significantly faster than MySQL (simply due to its very narrowly designed parameters), it seems only logical to take MySQL completely out of the picture. Doing so will also remove the time-consuming step of converting the MySQL data into MaxMo Search data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21694297-113886858040330544?l=blog.maxmosearch.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxmosearch.com/feeds/113886858040330544/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=21694297&amp;postID=113886858040330544' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/113886858040330544'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/113886858040330544'/><link rel='alternate' type='text/html' href='http://blog.maxmosearch.com/2006/02/testing-limits.php' title='Testing the Limits'/><author><name>Nathan 
Perkins</name><uri>http://www.blogger.com/profile/05834947993769511759</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='15333254638867892639'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21694297.post-113861099154684698</id><published>2006-01-30T00:58:00.000-08:00</published><updated>2006-01-30T01:06:47.853-08:00</updated><title type='text'>Version 4</title><content type='html'>A new version of the interface has changed the layout of files to allow for much faster processing. In the new version of the interface, databases for separate sites are kept distinct, no longer sharing certain tables (such as the lexicon). This improvement allows for easier creation and management of sites and their configuration data, with faster database access (no longer having to include query conditions to separate sites).&lt;br /&gt;&lt;br /&gt;MaxMo Technologies can be &lt;a href="v4.maxmosearch.com/maxmo.msconfig/search.msscript"&gt;searched&lt;/a&gt; using the new interface. If you look at the URL, you will notice the URL management system has changed. Features for managing sites with the search software have been removed, as they were a waste of resources. The search software should only have to handle queries, not serve pages as well. Apache is better equipped for site management.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Note:&lt;/i&gt; Interface refers to the way all the information is stored and organized, as well as accessed (URL format, database layout, files, etc.). Essentially, all the programming other than the crawling and parsing algorithms. 
The version of the actual crawler/indexer and the interface are tracked independently.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21694297-113861099154684698?l=blog.maxmosearch.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxmosearch.com/feeds/113861099154684698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=21694297&amp;postID=113861099154684698' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/113861099154684698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/113861099154684698'/><link rel='alternate' type='text/html' href='http://blog.maxmosearch.com/2006/01/version-4.php' title='Version 4'/><author><name>Nathan Perkins</name><uri>http://www.blogger.com/profile/05834947993769511759</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='15333254638867892639'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21694297.post-113861105054934263</id><published>2006-01-30T00:49:00.000-08:00</published><updated>2006-01-30T00:50:50.556-08:00</updated><title type='text'>New Blog</title><content type='html'>This new blog will trace the development of the MaxMo Search product line, showing search statistics and linking to new samples we create. 
Note that this is a side product, and therefore will develop slowly.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21694297-113861105054934263?l=blog.maxmosearch.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxmosearch.com/feeds/113861105054934263/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=21694297&amp;postID=113861105054934263' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/113861105054934263'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21694297/posts/default/113861105054934263'/><link rel='alternate' type='text/html' href='http://blog.maxmosearch.com/2006/01/new-blog.php' title='New Blog'/><author><name>Nathan Perkins</name><uri>http://www.blogger.com/profile/05834947993769511759</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='15333254638867892639'/></author><thr:total>0</thr:total></entry></feed>