LuceGene

From GMOD
Revision as of 18:43, 25 January 2007 by 165.124.152.78 (Talk)



Description

LuceGene is an open-source document/object search and retrieval system tuned for bioinformatics text databases and documents. It is similar in concept to SRS (Sequence Retrieval System), a widely used and commercially successful bioinformatics program.

It is built with the open-source Lucene package.

It includes common text search features: boolean and phrase queries, word stemming, fuzzy and field-range searches, and relevance ranking. It supports many kinds of data field structure. Lucene is comparable to web-indexing systems such as Excite, AltaVista, and Google.
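To make the boolean and phrase search features concrete, here is a minimal sketch of a positional inverted index in Python. It is a toy illustration of the general technique, not Lucene's or LuceGene's implementation; the class name `ToyIndex`, its methods, and the sample documents are all made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Toy positional inverted index (hypothetical): term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        """Record the position of every token of `text` under `doc_id`."""
        for pos, term in enumerate(tokenize(text)):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def search_and(self, *terms):
        """Boolean AND: ids of documents that contain every term."""
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.intersection(*sets) if sets else set()

    def search_or(self, *terms):
        """Boolean OR: ids of documents that contain at least one term."""
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.union(*sets) if sets else set()

    def search_phrase(self, phrase):
        """Phrase query: ids of documents where the words occur consecutively."""
        terms = tokenize(phrase)
        hits = set()
        for doc_id in self.search_and(*terms):
            starts = self.postings[terms[0]][doc_id]
            for start in starts:
                if all(start + i in self.postings[t][doc_id]
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits


# Made-up sample documents, for illustration only.
idx = ToyIndex()
idx.add("P1", "Alcohol dehydrogenase from Drosophila melanogaster")
idx.add("P2", "Heat shock protein from Drosophila virilis")
idx.add("P3", "Dehydrogenase alcohol-sensitive mutant")

print(sorted(idx.search_and("alcohol", "drosophila")))   # ['P1']
print(sorted(idx.search_phrase("alcohol dehydrogenase")))  # ['P1']
print(sorted(idx.search_or("virilis", "mutant")))          # ['P2', 'P3']
```

The phrase query needs the stored token positions: document P3 contains both words but in the wrong order, so it matches the boolean AND of the two terms without matching the phrase.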

LuceGene adds these bio-data methods to Lucene:

  • Indexing adaptors for formats such as XML, PDF Documents, Biosequences, Spreadsheets, HTML, and others.
  • Configurations for bio-data include UniProt/Swiss-Prot, FASTA and GenBank sequences, BIND protein interactions, NCBI Gene Expression Omnibus, BLAST output tables, and MEDLINE.
  • Support for batch-list look-ups and searches is included, useful for data miners.
  • Web applications offer paged search results, batch downloads, search refinement and search-linking among data libraries.
  • Web Services support for data mining is included with a SOAP interface.
  • Output support includes field selection and formats such as Spreadsheet, XML, HTML via XSLT, and others.
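As a rough illustration of how a sequence-format adaptor, a batch-list look-up, and field-selected spreadsheet output fit together, here is a small Python sketch. It is not LuceGene code: the function names `parse_fasta` and `batch_lookup`, the field names, and the sample sequences are assumptions made up for this example, and a plain dictionary stands in for the search index.

```python
import csv
import io


def parse_fasta(text):
    """Yield one dict per FASTA record with 'id', 'description' and 'sequence' fields."""
    record = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record is not None:
                yield record
            # Header line: first word is the id, the rest is the description.
            header = line[1:].split(None, 1)
            record = {
                "id": header[0] if header else "",
                "description": header[1] if len(header) > 1 else "",
                "sequence": "",
            }
        elif record is not None:
            # Sequence lines are concatenated into one string.
            record["sequence"] += line
    if record is not None:
        yield record


def batch_lookup(records_by_id, ids, fields):
    """Return tab-delimited text holding the chosen fields for each requested id.

    Ids that are not in the collection are skipped.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for rec_id in ids:
        rec = records_by_id.get(rec_id)
        if rec is not None:
            writer.writerow([rec[f] for f in fields])
    return out.getvalue()


# Made-up sample data, for illustration only.
FASTA = """>seq1 hypothetical protein A
MKTAYIAK
QRQISFVK
>seq2 hypothetical protein B
MSDNE
"""

records = {r["id"]: r for r in parse_fasta(FASTA)}
print(batch_lookup(records, ["seq2", "missing", "seq1"], ["id", "sequence"]))
```

The tab-delimited output can be opened directly in a spreadsheet; choosing a different `fields` list changes which columns are written.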

LuceGene is fast with large data sets: searching the UniProt library of 1.7 million sequences with LuceGene is closely equivalent to SRS in speed and content. Gene annotation object search and retrieval with LuceGene is 10x to 20x faster than with a Postgres Chado database. LuceGene has been tested and works well with millions of documents from genome sequence, annotation and literature databases.

Jakarta Lucene software is included with this package, as are other required Java libraries.

Current distribution files are at SourceForge and http://eugenes.org/gmod/lucegene/

  • lucegene.war: web application archive
  • lucegene-*-src.jar: sources, documents, configurations
  • lucegene_demo*.zip: sample data for lucegene.war

Contact

email: lucegene AT eugenes.org
Current developers: Don Gilbert, Paul Poole, and others



Please note that EBI's new search-everything "EB-eye" is based on Lucene, as is the GMOD LuceGene project (http://www.gmod.org/lucegene), and for the same reasons, I would guess: Lucene is fast, and works easily and well on huge, complex bio-data sets.

Others are noticing that user searches against a Chado database, whether
for genome maps, reports, or other complex data, can be quite slow. Chado
is a good management database, but it is not efficient enough for web
access serving many customers. Lucene can search genome reports, the
range of bio-data (XML, sequence records, interaction data sets),
GBrowse map data, etc.

There is also a GBrowse-Lucene adaptor in the LuceGene project
software. It works like the MySQL adaptor, and I use it all the
time in preference to MySQL.

The GMOD/Turnkey web interface now has a Lucene search to avoid slow Chado database queries, albeit via an older CLucene port. I find that Java Lucene runs well from Perl (GBrowse).

...........

EMBL-EBI News Dec 2006: Better, faster, easier – EMBL-EBI launches its
new website with powerful search engine

Behind this new web interface lies the ‘EB-eye’, a powerful
search engine allowing instant searches of all the
EBI’s databases from a single query.

What is the EB-eye Search?
The system is developed on top of the Apache Lucene project framework,
which is an open-source, high-performance, full-featured text search
engine library written entirely in Java. It uses this technology to
index EBI databases in various formats (e.g. flatfiles, XML dumps, OBO
format, etc.) and provides very fast access to the EBI's data
resources. The system allows the user to search globally across all
EBI databases or individually in selected resources by using an
Advanced search.
.......

