Wednesday, February 4, 2009

Lucene

Lucene is an excellent text search engine. Solr, a subproject of Lucene, provides a web-based interface that accepts XML requests and executes XML commands. "Solr is more of a general-purpose search server, and it assumes you already have structured data (like catalog data, music collections, etc)." Nutch, also a subproject of Lucene, is an excellent web crawler. "Nutch is more like an open-source google... it's for crawling, converting, indexing, and searching websites."

Assume you have an existing J2EE application built with Struts and Hibernate. The following major components/classes would be required if you decide to use Lucene or Solr:

1. Search index writer: This uses the existing Hibernate methods/APIs to read the records that need to be searchable and creates Lucene indexes from them. The index can be stored on disk or in a database via JDBC. Lucene calls these index records documents; each one holds the information needed to display a search result.
2. Search index reader: This reads the index stored by step 1 (on disk or in a database) and returns the search results. No Hibernate method calls are involved, because the Lucene index is separate from the database. To display the search results, Struts action classes need to be added or altered. However, when the user clicks a search result to see the details of a record, the existing Struts/Hibernate functionality (if available) is used to display them.

Basically, you can think of it as a search engine implementation (like Google) where you index the records (like websites) and the search results contain only the basic information. The detailed information is delivered when the user clicks a link in the search results.
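
To make the writer/reader split concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's API: ToySearchIndex, Doc, analyze, add and search are hypothetical names invented for illustration, and the "index" is just an in-memory map from terms to record IDs. It only shows the shape of the idea: the writer side is fed records that were already loaded from the database, each stored document keeps just a summary plus the record ID, and the reader side answers queries without touching the database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stand-in for a search index. NOT Lucene's API, only the same idea in JDK collections.
class ToySearchIndex {
    // Summary kept per record: enough to render a result row and link to the detail page.
    static final class Doc {
        final long id;       // primary key of the database record (hypothetical)
        final String title;  // text shown in the result list
        Doc(long id, String title) { this.id = id; this.title = title; }
    }

    private final Map<Long, Doc> stored = new HashMap<>();
    private final Map<String, Set<Long>> postings = new HashMap<>();

    // One analysis step shared by indexing and querying: lower-case, split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // "Writer" side: called with records already read from the database (e.g. via Hibernate).
    void add(long id, String title, String body) {
        stored.put(id, new Doc(id, title));
        for (String term : analyze(title + " " + body)) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(id);
        }
    }

    // "Reader" side: returns summaries of records matching every query term; no database call.
    List<Doc> search(String query) {
        Set<Long> hits = null;
        for (String term : analyze(query)) {
            Set<Long> ids = postings.getOrDefault(term, Collections.<Long>emptySet());
            if (hits == null) hits = new LinkedHashSet<>(ids);
            else hits.retainAll(ids);
        }
        List<Doc> results = new ArrayList<>();
        if (hits != null) {
            for (Long id : hits) results.add(stored.get(id));
        }
        return results;
    }
}
```

In a Struts application, a search action would call something like search(...) and render the returned titles as links carrying the record ID; the existing detail action would then load the full record through Hibernate as before.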

Coming to Solr: it is basically a web service wrapper on top of Lucene, and under the hood it also builds a Lucene index. Its advantage is that, being web service based, it makes it easy to keep the index in sync in a distributed environment. It also provides caching, index syncing, etc. out of the box.
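
As a rough illustration of the XML commands Solr accepts, the sketch below builds an add message for one document using only the JDK. The class and method names (SolrAddXml, escape, addDocument) are hypothetical, and the field names used in the usage note ("id", "title") are assumptions: the real ones depend on the fields defined in your Solr schema.

```java
import java.util.Map;

// Builds the body of a Solr XML <add> command for a single document.
class SolrAddXml {
    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Emits one <field name="...">value</field> element per map entry, in iteration order.
    static String addDocument(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(f.getKey())).append("\">")
               .append(escape(f.getValue())).append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }
}
```

For example, passing a LinkedHashMap with id=42 and title="Lucene & Solr" yields `<add><doc><field name="id">42</field><field name="title">Lucene &amp; Solr</field></doc></add>`. That string would be sent in an HTTP POST to Solr's update handler, followed by a `<commit/>` message so that the new document becomes searchable.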

Gotchas and Tips:

Sorting: Maintain a duplicate of the field that is indexed as NOT_ANALYZED and sort on that copy. Sorting is case sensitive. The sort field value can be capped at a fixed maximum length, say 20 characters; this improves performance.

Indexing: The analyzer used for indexing and the analyzer used for searching should be the same. StandardAnalyzer can be extended to have HTMLStripReader and ISOLatin1AccentFilter.

Links:
Syntax supported: http://lucene.apache.org/java/1_4_3/queryparsersyntax.html
Solr + jQuery sample: http://solrjs.solrstuff.org/test/reuters/
http://www.theserverside.com/news/thread.tss?thread_id=43617
http://www.xml.com/pub/a/2006/08/09/solr-indexing-xml-with-lucene-andrest.html
http://www.ibm.com/developerworks/java/library/j-solr-update/?S_TACT=105AGX01&S_CMP=HP
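
As a sketch of the sorting tip under Gotchas and Tips: one way to get case-insensitive, bounded sort values is to normalize the text before storing it in the duplicate NOT_ANALYZED field. The SortKey class below is a hypothetical helper, not part of Lucene; the limit of 20 is the example figure from the tip, and trimming surrounding whitespace is an extra assumption.

```java
import java.util.Locale;

// Produces the value to store in the duplicate, un-analyzed sort field.
class SortKey {
    static final int MAX_LENGTH = 20; // example cap from the tip; tune for your data

    static String of(String value) {
        // Lower-casing makes "Zebra" sort after "apple" despite case-sensitive comparison.
        String key = value.trim().toLowerCase(Locale.ROOT);
        // Truncating keeps the sort values short and bounded in size.
        return key.length() <= MAX_LENGTH ? key : key.substring(0, MAX_LENGTH);
    }
}
```

The trade-off of truncation is that two values sharing the same first 20 characters become indistinguishable for sorting, so the cap should be long enough for the data at hand.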
