Textpresso

From GMOD
Revision as of 18:43, 25 January 2007 by 165.124.152.78 (Talk)



Description
Textpresso is a text mining system for scientific literature whose
capabilities go far beyond those of a simple keyword search engine. Its two
key elements are a collection of full-text scientific articles split into
individual sentences, and the implementation of semantic categories, for
which the database of articles and individual sentences can be searched. The
full text is extracted from PDFs, and additional bibliographic information
obtained from other citation databases can be processed as well.
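The two elements described above, sentence-level storage and semantic
categories, can be illustrated with a toy sketch. This is not Textpresso's
actual implementation (the package itself is written in Perl); the category
lexicon, the function names and the naive regex sentence splitter are all
assumptions made purely for illustration.

```python
import re

# Hypothetical category lexicon: category name -> terms belonging to it.
CATEGORIES = {
    "gene": {"lin-12", "glp-1"},
    "regulation": {"represses", "activates", "regulates"},
}

def split_sentences(text):
    """Naively split text at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def categories_of(sentence):
    """Return the categories that have at least one term in the sentence."""
    words = set(re.findall(r"[\w-]+", sentence.lower()))
    return {name for name, terms in CATEGORIES.items() if words & terms}

def search(sentences, keyword=None, category=None):
    """Return sentences matching an optional keyword and an optional category."""
    hits = []
    for s in sentences:
        if keyword and keyword.lower() not in s.lower():
            continue
        if category and category not in categories_of(s):
            continue
        hits.append(s)
    return hits

text = ("The gene lin-12 regulates cell fate. "
        "Worms were grown at 20 degrees. "
        "Loss of glp-1 activates a germline defect.")
sentences = split_sentences(text)
# Category search finds sentences that a single keyword would miss.
print(search(sentences, category="regulation"))
```

Searching by category returns both the "regulates" and the "activates"
sentence, which is the kind of query a plain keyword search cannot express
in one step.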
Demo & Screenshots
Please visit the live main site at www.textpresso.org for examples and
screenshots.
Requirements
The package is designed for Linux operating systems and has been tested on
Intel x86-based hardware. The minimum required disk space is around 6 GB per
1000 full-text papers: half of it is used by the publicly (via WWW)
accessible database, while the other half is needed for database preparation
and maintenance. (If necessary, the latter can be reduced.) Web server
software such as Apache needs to be installed, and an Internet connection
should be available.

Standard Perl 5.6.1 or higher should be present, along with the most common
Perl packages. The installation script requires a bash shell. The Textpresso
system requires the modules XML::Checker::Parser, XML::DOM::Parser and
XML::XQL::DOM, which usually come with a Linux distribution. If a standard
Perl package is missing, it can be downloaded and installed from
http://www.cpan.org. Two non-standard Perl modules are also required,
Mail::Mailer (in MailTools-1.58) and PDF::Create (in PDF-Create). They too
can be downloaded from http://www.cpan.org. If a model organism database
based on ACeDB (http://www.acedb.org) is used, the Perl module AcePerl is
required.

Textpresso uses two software packages. XPDF (http://www.foolabs.com/xpdf/)
is distributed under the GNU General Public License and provides the
pdftotext converter. The other package contains a part-of-speech tagger
developed by Eric Brill (http://research.microsoft.com/~brill/). It is
distributed free of charge under a license of the Massachusetts Institute of
Technology and the University of Pennsylvania. If you want to recompile
either of the packages, you additionally need a C compiler, such as gcc (GNU
project).
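
As a rough planning aid, the disk-space figure above can be turned into a
small calculation, together with a check that the perl and bash executables
are on the PATH. This is a sketch under stated assumptions: it assumes the
6 GB per 1000 papers figure scales linearly, and the function names are
hypothetical and not part of the Textpresso package.

```python
import shutil

GB_PER_1000_PAPERS = 6.0  # approximate figure from the requirements text
PUBLIC_FRACTION = 0.5     # half is used by the publicly accessible database

def estimate_disk_gb(num_papers):
    """Estimate disk usage in GB, assuming linear scaling (an assumption)."""
    total = GB_PER_1000_PAPERS * num_papers / 1000.0
    public = total * PUBLIC_FRACTION
    return {"total": total, "public": public, "maintenance": total - public}

def missing_tools(tools=("perl", "bash")):
    """Return the names of required executables not found on the PATH."""
    return [t for t in tools if shutil.which(t) is None]

print(estimate_disk_gb(5000))
print("missing executables:", missing_tools())
```

For 5000 papers the estimate is 30 GB in total, split evenly between the
public database and the preparation and maintenance area.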

This package has been tested with the Red Hat Linux 9 distribution
(http://www.redhat.com) and Debian Linux 3.1 (http://www.debian.org). Both
work with a 2.4.20 kernel or higher.
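
The kernel requirement can be checked before installation with a few lines
of code. The sketch below is an illustration only, not part of the
Textpresso installer; the helper names are hypothetical, and it assumes the
kernel release string starts with dot-separated numbers, as in
"2.4.20-8smp".

```python
import platform
import re

MIN_KERNEL = (2, 4, 20)  # minimum kernel version named in the text

def parse_kernel(release):
    """Extract leading numeric parts, e.g. '2.4.20-8smp' -> (2, 4, 20)."""
    m = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if not m:
        raise ValueError("unrecognized kernel release: %r" % release)
    # Missing components are treated as 0.
    return tuple(int(g) if g else 0 for g in m.groups())

def kernel_ok(release, minimum=MIN_KERNEL):
    """Return True if the release is at least the minimum version."""
    return parse_kernel(release) >= minimum

if platform.system() == "Linux":
    print("kernel", platform.release(), "ok:", kernel_ok(platform.release()))
else:
    print("not running Linux; kernel check skipped")
```

Tuple comparison handles the ordering, so "2.6.8" counts as newer than
"2.4.20" even though 8 is smaller than 20.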

Documentation
Installation instructions can be found in the file TextpressoManual.pdf,
which is included in the compressed tar package.

A user guide is available online.

Contact
Hans-Michael Muller, mueller (at) caltech.edu
Downloads

http://www.textpresso.org/textpresso/downloads.html



Can I download anywhere?



Hi,

Sorry for the delay; I've been on vacation. The Textpresso download link seems to be working now. It must have been a transient problem.

Scott



Facts about "Textpresso"
Available on platform: web
Has URL: http://textpresso.org, http://textpresso.org/downloads.html, http://textpresso-www.caltech.edu/cgi-bin/celegans/user_guide, http://whis.caltech.edu/textpresso/, http://textpresso.yeastgenome.org/textpresso/ and http://www.textpresso.org/celegans/
Has description: Textpresso is an information extracting and processing (text mining) package for biological literature whose capabilities go far beyond that of a simple keyword search engine. The two key elements are the collection of the full text of scientific articles split into individual sentences, and the implementation of semantic categories, for which a database of articles and individual sentences can be searched. The source of the full text articles are PDFs, and additional bibliographical information that is obtained from other citation databases can be processed as well. Alere is a package of scripts that can be used to construct a corpus (retrieve articles) for use with Textpresso. Textpresso is supported by a grant from the National Human Genome Research Institute at the US National Institutes of Health # HG004090.
Has development status: active
Has input format: Plain text, PDF and html
Has licence: Modified GPL
Has logo: TextpressoLogo.jpg
Has output format: XML and text
Has software maturity status: mature
Has support status: active
Has title: Textpresso User Guide, Textpresso for Sea Urchin, Textpresso for S. cerevisiae and Textpresso for C. elegans
Has topic: Textpresso
Is open source: Caveats apply
Link type: website, download, documentation and wild URL
Tool functionality or classification: Literature tools and Text mining
Written in language: Perl