Web-apollo-meeting-2011-4-4

  • Chris questions
    • I've put together some thoughts and questions about the project from the perspective of the groups looking to use WebApollo for their own community annotation projects. I know that a lot of these won't really have an answer until the project is further along, and the answers will differ from one project to another. Many of these questions overlap with the ones we discussed during the hackathon.
    1. Is there anything we need to do to prepare our data to facilitate use by WebApollo? Would it be better to pull from databases or use preprocessed flatfiles? What would the trade-offs be in terms of disk space and server resource usage?
      1. Support for pulling straight from databases is essential and will support UCSC, Ensembl, and Chado
      2. Screen real estate is critical, and it is limited
      3. Flatfiles are fine if the data is not changing much or at all
      4. For large datasets, the flatfiles will get fairly big, but that should be ok for the frontend
    2. What additional requirements does WebApollo have on the server side on top of the requirements JBrowse has?
      1. If pulling from UCSC?
      2. If housing all data locally?
      3. What is needed to run an edit server?
    3. How do the system resources compare to Apollo classic in terms of memory and bandwidth usage?
    4. Will the system scale well to annotation projects with very large datasets/very large chromosomes?
      1. They are already loading whole chromosomes
      2. It should scale well
    5. JBrowse is fast at loading large regions, but I don't know how it compares with Apollo (classic or WebApollo). Are there any estimates for the resources needed to serve a really large annotation project (one to two hundred annotators at peak load)? Will there be issues in running multiple annotation projects simultaneously from one server?
    6. What sort of memory or bandwidth overhead will there be for loading a multi-megabase sequence with a dozen evidence tracks? Will there also be issues on the client side?
      1. The limit is more on how much memory the browser can use
      2. If it gets to be an issue at some point, they can look at optimizing memory usage
      3. Monitor caching on the client side; if it gets excessive, users will have to be told how to set the cache size
    7. On projects where the genome is not as well polished, there are many unplaced scaffolds (ChrUn), on the order of thousands. Will there be an option to type in the chromosome name and position in addition to or instead of a drop down box?
      1. That has not been requested yet but can be done
      2. It would be good to come up with ways to remove the unwanted parts of the sequence (slice-view); this is also what is needed for a scaffold view.
      3. The edit engine is not ready for this yet, but it should not be problematic to implement
      4. Related to the lazy loading mechanisms that JBrowse uses.
      5. Ed wrote a caching system for WebApollo that does loading based on JBrowse's lazy loading
      6. Getting the sequence might be an issue
      7. Cross-scaffold annotation will make slice-view a higher priority.
      8. This should be a UI issue, as long as the order and orientation of the scaffolds are known
      9. This brings up a different type of editing: reorganizing the scaffolds to offer corrections to the assembly
      • Need a generic way
    8. We need to evaluate how NCBI and UCSC handle unplaced scaffolds. Some groups concatenate all the unplaced scaffolds into one sequence, which may make annotation problematic. In the past, we had split all the ChrUn scaffolds into separate sequences, but this may be a problem if we are to keep compatibility with UCSC.
      • Might need to embed the Georgetown splitter software into the retrieval software. It is hard to make this work across different genome projects. Perhaps an initial configuration phase to gather the coordinates (from NCBI or whoever has the AGP file) will be needed first (see the sketch at the end of this question list).
    9. What system requirements will there be (if any) for the end users?
      • Hardware:
        • Memory requirements (scaffold vs chromosome scale regions)
        • CPU?
      • Software:
        • OS/browser/other requirements?
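    • As a rough illustration of the configuration step mentioned in item 8, the sketch below maps a position on a concatenated ChrUn sequence back to its component scaffold using the assembly's AGP file. The names, coordinate handling, and TypeScript setting are assumptions for illustration only; nothing here is part of WebApollo.

        // Hypothetical sketch: translate a position on a concatenated ChrUn
        // sequence back to the underlying scaffold using the assembly's AGP file.
        import { readFileSync } from "fs";

        interface AgpComponent {
          object: string;       // e.g. "chrUn" (the concatenated sequence)
          objectStart: number;  // 1-based start on the concatenated sequence
          objectEnd: number;
          scaffold: string;     // component_id, e.g. an unplaced scaffold name
          scaffoldStart: number;
          orientation: string;  // "+" or "-"
        }

        function parseAgp(path: string): AgpComponent[] {
          const rows: AgpComponent[] = [];
          for (const line of readFileSync(path, "utf8").split("\n")) {
            if (!line || line.startsWith("#")) continue;
            const f = line.split("\t");
            if (f[4] === "N" || f[4] === "U") continue; // skip gap lines
            rows.push({
              object: f[0],
              objectStart: parseInt(f[1], 10),
              objectEnd: parseInt(f[2], 10),
              scaffold: f[5],
              scaffoldStart: parseInt(f[6], 10),
              orientation: f[8],
            });
          }
          return rows;
        }

        // Translate a 1-based position on the concatenated object to (scaffold, position).
        function mapToScaffold(components: AgpComponent[], object: string, pos: number) {
          for (const c of components) {
            if (c.object === object && pos >= c.objectStart && pos <= c.objectEnd) {
              const scaffoldPos = c.orientation === "-"
                ? c.scaffoldStart + (c.objectEnd - pos)       // reverse-oriented component
                : c.scaffoldStart + (pos - c.objectStart);
              return { scaffold: c.scaffold, position: scaffoldPos };
            }
          }
          return null; // position falls in a gap or outside the object
        }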
  • UI
    • The server is now returning CDS features, but JBrowse doesn't yet handle these and is still using separate UTR features. JSONUtils.createJBrowseFeature() parses the CDS feature returned from the server and, if the PROCESS_CDS flag is set to true, creates UTR/CDS JBrowse features. This, however, breaks the selection model for the annotation track. Until we change JBrowse's data model, perhaps we can change the selection behavior to select all features that are adjacent to the one selected, so that features 1-5, 6-8, and 9-12 would all be selected when any of them is selected? (A rough sketch of this rule follows at the end of this section.)
    • Server will return flags in JSON for exons that have boundaries that are non-canonical. This keeps to the model of having the server deal with all the biological issues. The UI can then use this flag to display non-canonical splice sites.
    • Added code to communicate with the server for splitting exons and making introns (two separate operations).
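    • A minimal sketch of the adjacency-based selection rule proposed above, assuming 1-based inclusive coordinates and a simplified feature shape rather than JBrowse's real data model:

        // When one sub-feature (e.g. a UTR or CDS piece) is clicked, also select
        // every feature in the chain of directly touching features, so 1-5, 6-8,
        // and 9-12 are selected together when any one of them is clicked.
        interface Feature {
          start: number; // 1-based inclusive
          end: number;
        }

        function selectAdjacent(features: Feature[], clicked: Feature): Feature[] {
          // Sort by start so adjacency can be checked between neighbours.
          const sorted = [...features].sort((a, b) => a.start - b.start);
          const i = sorted.indexOf(clicked);
          if (i < 0) return [];
          // Walk left while the previous feature ends right before the next one starts.
          let lo = i;
          while (lo > 0 && sorted[lo - 1].end + 1 === sorted[lo].start) lo--;
          // Walk right the same way.
          let hi = i;
          while (hi < sorted.length - 1 && sorted[hi].end + 1 === sorted[hi + 1].start) hi++;
          return sorted.slice(lo, hi + 1);
        }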
  • Server
    • Handles lazy loading and caching of genomic sequence. Makes use of the same chunks used by JBrowse, so doesn't require any extra pre-processing.
    • Added code for the "Make intron" operation. As discussed, it will default to finding the closest acceptor and donor sites from the selected position (1 bp), but the minimum length of the automatically calculated intron will be configurable (a rough sketch of this default follows below). Note that this is only the default behavior, and the curator can always manually drag an end to override the exon boundary.
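    • A minimal sketch of that default "Make intron" behavior, forward strand only; the sequence access, 0-based coordinates, and the 40 bp minimum are assumptions for illustration, not the server's actual implementation:

        // From the selected base, look left for the nearest GT donor and right for
        // the nearest AG acceptor, honoring a configurable minimum intron length.
        function makeIntron(
          seq: string,           // genomic sequence of the region
          selected: number,      // 0-based offset of the selected base
          minIntronLength = 40   // hypothetical configurable minimum
        ): { start: number; end: number } | null {
          // Nearest donor (GT) at or to the left of the selection.
          let donor = -1;
          for (let i = selected; i >= 0; i--) {
            if (seq.substring(i, i + 2).toUpperCase() === "GT") { donor = i; break; }
          }
          if (donor < 0) return null;

          // Nearest acceptor (AG) to the right that satisfies the minimum length.
          for (let j = Math.max(selected, donor + minIntronLength - 2); j + 2 <= seq.length; j++) {
            if (seq.substring(j, j + 2).toUpperCase() === "AG") {
              return { start: donor, end: j + 2 }; // half-open interval [start, end)
            }
          }
          return null; // no suitable acceptor found
        }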
  • Jay
    • Not much to report this week, but we did talk for a couple of hours last week and figured out a lot of details.