Difference between revisions of "Textpresso"

From GMOD
(New page: ; '''Description'''<br /> : Textpresso is a text mining system for scientific literature whose<br /> capabilities go far beyond that of a simple keyword search engine. The two<br /> key ...)
 
m
 
(22 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 +
{{Tool data
 +
|use template=yes
 +
|name=Textpresso
 +
|status=mature
 +
|dev=active
 +
|support=active
 +
|type=Literature tools, Text mining
 +
|platform=web
 +
|about=Textpresso is an information extraction and processing (text mining) package for biological literature whose capabilities go far beyond those of a simple keyword search engine. Its two key elements are a collection of full-text scientific articles split into individual sentences, and a set of semantic categories that can be used to search the database of articles and individual sentences. The full text is extracted from PDFs, and additional bibliographic information obtained from other citation databases can be processed as well.  [http://ilex.caltech.edu/trac/alere/ Alere] is a package of scripts that can be used to construct a corpus (retrieve articles) for use with '''Textpresso'''.  Textpresso is supported by grant # HG004090 from the National Human Genome Research Institute at the US National Institutes of Health.
 +
|open source=Caveats apply
 +
|licence=Modified GPL
 +
|input=Plain text, PDF, html
 +
|output=XML, text
 +
|language=Perl
 +
|audience=public
 +
|value=yes=Yes
 +
|logo=TextpressoLogo.jpg
 +
|contact=[mailto:mueller@caltech.edu Hans-Michael Müller]
 +
|getting started preamble=The package is designed for [[:Category:Linux|Linux]] operating systems and has been tested on Intel x86-based hardware. The minimum disk space required is around 6GB per 1000 full-text papers; half of it is used by the publicly (via WWW) accessible database, while the other half is needed for database preparation and maintenance. If necessary, the latter can be reduced.
 +
|req=* Software for a  world wide web server such as Apache needs to be installed, and an Internet connection should exist
 +
* Perl 5.6.1 or higher  should be present, and the most common Perl packages.
 +
* The installation script requires bash
 +
* {{CPAN|XML::Checker::Parser}}
 +
* {{CPAN|XML::DOM::Parser}}
 +
* {{CPAN|XML::XQL::DOM}}
 +
* {{CPAN|XML::Checker::Parser}},
 +
* {{CPAN|Mailer::Mail}} (in MailTools-1.58)
 +
* {{CPAN|PDF::Create}} (in PDF-Create).
 +
* If the model organism database is based on ACeDB then {{CPAN|AcePerl}} is required
 +
* XPDF  (http://www.foolabs.com/xpdf/), the pdftotext converter
 +
* RBT, a part-of-speech tagger developed by Eric Brill  ([http://research.microsoft.com/~brill/blog.htm blog], [http://research.microsoft.com/~brill/ homepage]''deprecated'').  RBT seems to be no longer available at JHU.  A copy appears to be available at [http://www.cst.dk/download/tagger/RBT1_14.tar.Z Københavns Universitet] (I didn't download and check it).  RBT is distributed free of charge  under a license of the Massachusetts Institute of Technology and the University of Pennsylvania. If you want to recompile either of the packages,  you additionally need a C compiler.
  
 +
This package has been tested with the Linux RedHat 9.0 distribution  (http://www.redhat.com) and Debian Linux 3.1 (http://www.debian.org) . Both  work with a 2.4.20 kernel or higher.
 +
|doc=* [http://www.textpresso.org/celegans/misc/Textpresso-2.0-documentation/ Installation Guide]
 +
* [http://textpresso-www.caltech.edu/cgi-bin/celegans/user_guide User Guide].
  
; '''Description'''<br />
+
=== Textpresso 2 Extensions ===
: Textpresso is a text mining system for scientific literature whose<br /> capabilities go far beyond that of a simple keyword search engine. The two<br /> key elements are the collection of the full text of scientific articles split<br /> into individual sentences, and the implementation of semantic categories, for<br /> which a database of articles and individual sentences can be searched. The<br /> source of the full text articles are PDFs, and additional bibliographical<br /> information that is obtained from other citation databases can be processed<br /> as well.
+
  
; '''Demo &amp; Screenshots'''<br />
+
A [http://en.wikipedia.org/wiki/Fork_%28software_development%29 fork] of Textpresso has been created that contains a number of extensions to Textpresso 2. These include
: Please visit the live main site at<br />[http://www.textpresso.org www.textpresso.org] for examples and<br /> screenshots.
+
* Interface overhaul, including [[Glossary#AJAX|AJAX]] and heavy integration of [http://datatables.net jQuery DataTables], user authentication, and a GUI for managing the literature corpus.
 +
* Modularization and customization for better database support
 +
* Addition of a plug-in API
 +
* Speed increase
  
; '''Requirements'''<br />
+
The [http://sourceforge.net/projects/textpresso extended version of Textpresso 2 is available at SourceForge].
: The package is designed for Linux operating systems and is tested to run on<br /> an Intel x86 based hardware. The required minimal disk space is around 6GB<br /> per 1000 full text papers, half of it is used by the publically (via WWW)<br /> accessible database, while the other half is needed for database preparation<br /> and maintenance. (If necessary, the latter can be reduced.) Software for a<br /> world wide web server such as Apache needs to be installed, and an Internet<br /> connection should exist. Furthermore, the standard Perl 5.6.1 or higher<br /> should be present, and the most common Perl packages. Thes installation<br /> script requires a bash shell. The Textpresso system requires the modules<br /> XML::Checker::Parser, XML::DOM::Parser, XML::XQL::DOM and<br /> XML::Checker::Parser, which usually come with a Linux distribution. If a<br /> standard Perl package is missing, it can be downloaded and installed from<br /> http://www.cpan.org. There are two non-standard Perl modules required,<br /> Mailer::Mail (in MailTools-1.58) and PDF::Create (in PDF-Create). They too<br /> can be downloaded from http://www.cpan.org. If a model organism database is<br /> used and based on ACeDB (http://www.acedb.org), the Perl module AcePerl is<br /> required. Textpresso uses two software packages: XPDF<br /> (http://www.foolabs.com/xpdf/) is distributed under the GNU general public<br /> license and provides the pdftotext converter. The other package contains a<br /> part-of- speech tagger developed by Eric Brill<br /> (http://research.microsoft.com/~brill/). It is distributed free of charge<br /> under a license of the Massachusetts Institute of Technology and the<br /> University of Pennsylvania. If you want to recompile either of the packages,<br /> you additionally need a C compiler, such as gcc (GNU project).
+
  
This package has been tested with the Linux RedHat 9.0 distribution<br /> (http://www.redhat.com) and Debian Linux 3.1 (http://www.debian.org) . Both<br /> work with a 2.4.20 kernel or higher.
+
These extensions were written by Nathan Liles of the [[User:JimHu|Hu Lab]] at Texas A&M University. Nathan [[:Image:Jan2010Testpresso.pdf|presented this work]] at the [[January 2010 GMOD Meeting]]. The Textpresso team plans to fold these extensions back into the main Textpresso code base in the future.
 +
}}
  
; '''Documentation'''<br />
 
: Installation instruction can be found in the tarzipped package file and is<br /> called TextpressoManual.pdf.
 
  
A user guide is available<br />[http://www.textpresso.org/doc/userguide/doc-con.html#top <br /> online].
 
  
; '''Contact'''<br />
 
: Hans-Michael Muller, mueller (at) caltech.edu
 
  
; '''Downloads'''<br />
+
{{SemanticLink
: [http://www.textpresso.org/textpresso/downloads.html <br /> http://www.textpresso.org/textpresso/downloads.html]
+
|linkurl=http://textpresso.org
 +
|linktype=website
 +
}}
 +
{{SemanticLink
 +
|linkurl=http://textpresso.org/downloads.html
 +
|linktype=download
 +
}}
 +
{{SemanticLink
 +
|linkurl=http://textpresso-www.caltech.edu/cgi-bin/celegans/user_guide
 +
|linktitle=Textpresso User Guide
 +
|linktype=documentation
 +
}}
 +
{{SemanticLink
 +
|linkurl=http://whis.caltech.edu/textpresso/
 +
|linktitle=Textpresso for Sea Urchin
 +
|linktype=wild URL
 +
}}
 +
{{SemanticLink
 +
|linkurl=http://textpresso.yeastgenome.org/textpresso/
 +
|linktitle=Textpresso for S. cerevisiae
 +
|linktype=wild URL
 +
}}
 +
{{SemanticLink
 +
|linkurl=http://www.textpresso.org/celegans/
 +
|linktitle=Textpresso for C. elegans
 +
|linktype=wild URL
 +
}}
  
 
+
[[Category:GMOD Components]]
 
+
[[Category:Textpresso]]
 
+
[[Category:Annotation]]
Can I download anywhere?
+
[[Category:GMOD Component]]
 
+
 
+
 
+
 
+
Hi,
+
 
+
Sorry for the delay; I've been on vacation. The Textpresso download link seems to be working now. It must have been a transient problem.
+
 
+
Scott
+
 
+
 
+
+

Latest revision as of 17:56, 17 October 2013

Status
  • Mature release
  • Development: active
  • Support: active
Licence


Modified GPL

Resources


About Textpresso

Textpresso is an information extraction and processing (text mining) package for biological literature whose capabilities go far beyond those of a simple keyword search engine. Its two key elements are a collection of full-text scientific articles split into individual sentences, and a set of semantic categories that can be used to search the database of articles and individual sentences. The full text is extracted from PDFs, and additional bibliographic information obtained from other citation databases can be processed as well. Alere is a package of scripts that can be used to construct a corpus (retrieve articles) for use with Textpresso. Textpresso is supported by grant # HG004090 from the National Human Genome Research Institute at the US National Institutes of Health.
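
The idea of searching sentences by semantic category can be illustrated with a small sketch. This is not Textpresso's implementation (Textpresso itself is written in Perl); the category lexicon, the naive sentence splitter, and the function names below are all hypothetical and chosen only to show the principle.

```python
import re

# Hypothetical category lexicon, for illustration only.
# A real system would use much larger, curated term lists.
CATEGORIES = {
    "gene": {"lin-12", "daf-16"},
    "regulation": {"activates", "represses", "regulates"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def categories_in(sentence):
    """Return the names of all categories with at least one term in the sentence."""
    words = set(re.findall(r"[\w-]+", sentence.lower()))
    return {name for name, terms in CATEGORIES.items() if words & terms}


def search(text, required):
    """Return the sentences that contain a term from every required category."""
    wanted = set(required)
    return [s for s in split_sentences(text) if wanted <= categories_in(s)]


if __name__ == "__main__":
    sample = (
        "daf-16 regulates lifespan. "
        "The worms were grown at 20 degrees. "
        "lin-12 is a receptor."
    )
    print(search(sample, ["gene", "regulation"]))
```

Requiring several categories in the same sentence is what distinguishes this kind of query from a plain keyword search over whole documents.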


Visit the Textpresso website (http://textpresso.org).


Downloads

  • http://textpresso.org/downloads.html


Using Textpresso

The package is designed for Linux operating systems and has been tested on Intel x86-based hardware. The minimum disk space required is around 6GB per 1000 full-text papers; half of it is used by the publicly (via WWW) accessible database, while the other half is needed for database preparation and maintenance. If necessary, the latter can be reduced.
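
As a rough planning aid, the disk figure above can be turned into a small calculation. This is a minimal sketch that assumes the requirement scales linearly with corpus size, which the text does not state explicitly; the function name and return keys are hypothetical.

```python
# Figure quoted above: about 6 GB per 1000 full-text papers (a stated minimum).
GB_PER_1000_PAPERS = 6.0


def estimate_disk_gb(num_papers):
    """Return a rough minimum disk estimate in GB, assuming linear scaling."""
    if num_papers < 0:
        raise ValueError("num_papers must be non-negative")
    total = GB_PER_1000_PAPERS * num_papers / 1000.0
    half = total / 2.0
    return {
        "total": total,
        "public_database": half,  # publicly accessible database
        "preparation": half,      # preparation and maintenance; can be reduced
    }


if __name__ == "__main__":
    print(estimate_disk_gb(25000))
```

For a corpus of 25,000 papers this gives about 150 GB in total, of which roughly 75 GB serves the public database.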

System Requirements

  • A web server such as Apache must be installed, and an Internet connection should be available
  • Perl 5.6.1 or higher, along with the most common Perl packages
  • bash, which the installation script requires
  • XML::Checker::Parser
  • XML::DOM::Parser
  • XML::XQL::DOM
  • Mailer::Mail (in MailTools-1.58)
  • PDF::Create (in PDF-Create)
  • AcePerl, if the model organism database is based on ACeDB
  • XPDF (http://www.foolabs.com/xpdf/), which provides the pdftotext converter
  • RBT, a part-of-speech tagger developed by Eric Brill (blog; homepage, deprecated). RBT no longer seems to be available from JHU. A copy appears to be available from Københavns Universitet (not downloaded or verified). RBT is distributed free of charge under a license of the Massachusetts Institute of Technology and the University of Pennsylvania. Recompiling either XPDF or RBT additionally requires a C compiler.

This package has been tested with the Linux RedHat 9.0 distribution (http://www.redhat.com) and Debian Linux 3.1 (http://www.debian.org). Both work with a 2.4.20 kernel or higher.
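
Before installing, it can help to check which prerequisites are already present. The following is a minimal sketch, not part of the Textpresso distribution: the helper names are hypothetical, and it assumes that each Perl module can be loaded under the package name given to it (the names in the list above may need adjusting to the loadable module name).

```python
import shutil
import subprocess


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [c for c in commands if which(c) is None]


def missing_perl_modules(modules, run=subprocess.run):
    """Return the modules that `perl -M<module> -e 1` fails to load."""
    missing = []
    for module in modules:
        result = run(
            ["perl", f"-M{module}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            missing.append(module)
    return missing


if __name__ == "__main__":
    # Command names taken from the requirements above.
    absent = missing_commands(["perl", "bash", "pdftotext"])
    print("Missing commands:", absent or "none")
    if "perl" not in absent:
        # Assumption: these names are loadable as written; adjust if not.
        modules = ["XML::Checker::Parser", "XML::XQL::DOM", "PDF::Create"]
        print("Missing Perl modules:", missing_perl_modules(modules) or "none")
```

The `which` and `run` parameters exist only so the checks can be exercised without touching the real system; in normal use the defaults are sufficient.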


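Documentation

  • Installation Guide (http://www.textpresso.org/celegans/misc/Textpresso-2.0-documentation/)
  • User Guide (http://textpresso-www.caltech.edu/cgi-bin/celegans/user_guide)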

Textpresso 2 Extensions

A fork of Textpresso has been created that contains a number of extensions to Textpresso 2. These include

  • Interface overhaul, including AJAX and heavy integration of jQuery DataTables, user authentication, and a GUI for managing the literature corpus.
  • Modularization and customization for better database support
  • Addition of a plug-in API
  • Speed increase

The extended version of Textpresso 2 is available at SourceForge.

These extensions were written by Nathan Liles of the Hu Lab at Texas A&M University. Nathan presented this work at the January 2010 GMOD Meeting. The Textpresso team plans to fold these extensions back into the main Textpresso code base in the future.



Contacts and Mailing Lists

For support, please contact the Textpresso developer, Hans-Michael Müller.

Textpresso in the wild

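Public installations of Textpresso:

  • Textpresso for Sea Urchin (http://whis.caltech.edu/textpresso/)
  • Textpresso for S. cerevisiae (http://textpresso.yeastgenome.org/textpresso/)
  • Textpresso for C. elegans (http://www.textpresso.org/celegans/)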



More on Textpresso

See Category:Textpresso



Facts about "Textpresso"
Available on platform: web
Has URL: http://textpresso.org, http://textpresso.org/downloads.html, http://textpresso-www.caltech.edu/cgi-bin/celegans/user_guide, http://whis.caltech.edu/textpresso/, http://textpresso.yeastgenome.org/textpresso/ and http://www.textpresso.org/celegans/
Has description: Textpresso is an information extracting and processing (text mining) package for biological literature whose capabilities go far beyond that of a simple keyword search engine. The two key elements are the collection of the full text of scientific articles split into individual sentences, and the implementation of semantic categories, for which a database of articles and individual sentences can be searched. The source of the full text articles are PDFs, and additional bibliographical information that is obtained from other citation databases can be processed as well. Alere is a package of scripts that can be used to construct a corpus (retrieve articles) for use with Textpresso. Textpresso is supported by a grant from the National Human Genome Research Institute at the US National Institutes of Health # HG004090.
Has development status: active
Has input format: Plain text, PDF and html
Has licence: Modified GPL
Has logo: TextpressoLogo.jpg
Has output format: XML and text
Has software maturity status: mature
Has support status: active
Has title: Textpresso User Guide, Textpresso for Sea Urchin, Textpresso for S. cerevisiae and Textpresso for C. elegans
Has topic: Textpresso
Is open source: Caveats apply
Link type: website, download, documentation and wild URL
Tool functionality or classification: Literature tools and Text mining
Written in language: Perl
Has subobject: Textpresso#http://textpresso.org, Textpresso#http://textpresso.org/downloads.html, Textpresso#http://textpresso-www.caltech.edu/cgi-bin/celegans/user_guide, Textpresso#http://whis.caltech.edu/textpresso/, Textpresso#http://textpresso.yeastgenome.org/textpresso/ and Textpresso#http://www.textpresso.org/celegans/