LuceGene is an open-source document and object search and retrieval system tuned for bioinformatics text databases and documents. It is similar in concept to SRS (Sequence Retrieval System), the widely used, commercially successful bioinformatics program.
It is built with the open-source Lucene package.
It includes common text search features: Boolean operators, phrases, word stemming, fuzzy and field range searches, and relevance ranking. It supports many kinds of data field structure. Lucene is comparable to web-indexing systems such as Excite, AltaVista, and Google.
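To illustrate the idea behind this kind of indexed text search, here is a minimal sketch of an inverted index with a Boolean AND query, written in plain Java. It does not use Lucene's API and is not how LuceGene is implemented. The class name `TinyIndex`, its methods, and the record IDs `P1` to `P3` are hypothetical, chosen only for this example. A real Lucene index adds stemming, fuzzy and range queries, fielded documents, and relevance ranking on top of this basic structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index, for illustration only (not Lucene's API).
class TinyIndex {
    // Maps each term to the sorted set of record IDs that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lowercase the text and split it on runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index one record under every term that appears in its text.
    public void add(String docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Boolean AND: return the IDs of records that contain every query term.
    public Set<String> searchAll(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> hits = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // Made-up record IDs and descriptions.
        index.add("P1", "Heat shock protein 70, Drosophila melanogaster");
        index.add("P2", "Protein kinase C, Homo sapiens");
        index.add("P3", "Heat shock protein 90, Homo sapiens");
        System.out.println(index.searchAll("heat shock sapiens")); // prints [P3]
    }
}
```

Because each query term is looked up directly in the term map, search cost depends on the number of matching records rather than on the total size of the collection. This is the general reason an index of this kind can answer text queries faster than scanning records in a relational database.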
LuceGene adds these bio-data methods to Lucene:
LuceGene is fast with big data sets: searching the UniProt library of 1.7 million sequences with LuceGene is closely comparable to SRS in speed and content. Gene annotation object search and retrieval with LuceGene is 10 to 20 times faster than using a Postgres Chado database. LuceGene has been tested and works well with millions of documents from genome sequence, annotation, and literature databases.
Others are noticing that Chado database user searches, whether for genome maps, reports, or other complex data, can be quite slow. Chado is a good management database, but it is not efficient enough for web access by many users. Lucene can search genome reports, the range of bio-data (XML, sequence records, interaction data sets), GBrowse map data, and so on.
The LuceGene project software also includes a GBrowse-Lucene adaptor, which works like the MySQL adaptor.
The GMOD/Turnkey web interface now has a Lucene search to avoid slow ChadoDB queries.
Please note that EBI’s new search-everything EB-eye is based on Lucene, like LuceGene: it is fast, and works easily and well on huge, complex bio-data sets:
EMBL-EBI News, Dec 2006: "Better, faster, easier: EMBL-EBI launches its new website with powerful search engine"
Behind this new web interface lies the EB-eye, a powerful search engine allowing instant searches of all the EBI’s databases from a single query.
What is the EB-eye Search? The system is developed on top of the Apache Lucene project framework, which is an Open-source, high-performance, full-featured text search engine library written entirely in Java. It uses this technology to index EBI databases in various formats (e.g. flatfiles, XML dumps, OBO format, etc.) and provides very fast access to the EBI’s data resources. The system allows the user to search globally across all EBI databases or individually in selected resources by using an Advance search.
UniProt’s new version (2007) also uses Lucene as its search-all-proteins system. See it in action at http://www.uniprot.org/
Dongilbert 16:25, 4 September 2007 (EDT)
Jakarta Lucene software is included with this package, as are the other required Java libraries.
Current distribution files are at SourceForge and http://eugenes.org/gmod/lucegene/