Data Integration for Systems Biology using the ONDEX system
ONDEX Visualization ToolKit (OVTK) User Tutorial
Rothamsted Research, Biomathematics and Bioinformatics
This section introduces the ONDEX Visualization ToolKit (OVTK) user interface. A network consists of concepts (genes, proteins, metabolites) and of relations, i.e. the interactions that link concepts.
First of all, we will look at the basic user interface of OVTK. Then we will load up a network to show all menu features of OVTK and some of the core functionality such as layout algorithms, annotators and filters.
Launch OVTK - You should see a window that looks like this:
It is possible to undock the top toolbar of the OVTK by a simple “drag and drop”. When this toolbar is undocked, it is possible to re-dock it by clicking on its far left side and again dragging and dropping to its original position.
We will now load a network to show all features of OVTK user interface.
You should see the following:
The first window that opens is the metagraph view. Note that the main network is minimized in the bottom left-hand side corner. The metagraph gives an overview of all the different types of concepts present on the main network, as well as all the different types of relations which link them.
The metagraph view has four buttons at the bottom of its window:
Trying to understand the metagraph before opening the main network usually helps. The main network is opened with the “Circular” layout, which is a circular arrangement of all the concept classes in the network.
Moving concepts around in the metagraph should help you to make sense of it. This is an example of how the metagraph can be laid out:
A protein is encoded by a gene (en_by).
A protein is part of a protein complex (is_p).
A protein is consumed by or produced by a reaction (cs_by, pd_by).
An enzyme is a protein or a protein complex (is_a).
An enzyme has a catalysing class EC (cat_c).
An enzyme is co-factored (or activated) by a compound (co_by).
A compound is consumed by or produced by a reaction (cs_by, pd_by).
A reaction is a member of, or is part of, a pathway (m_isp).
A reaction is catalysed by an enzyme (ca_by).
Finally, most concepts can be mentioned in publications, i.e. published in (pub_in).
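The sentences above can also be written down as a small data structure. The following Python sketch encodes each one as a (source class, relation type, target class) triple. The spelled-out class names and the helper function are illustrative assumptions and are not part of OVTK.

```python
# Each triple reads: <source concept class> --<relation type>--> <target concept class>.
# Class names are spelled out for readability; they are not OVTK identifiers.
# pub_in is left out because it applies to "most" concept classes rather than one.
METAGRAPH = [
    ("Protein", "en_by", "Gene"),
    ("Protein", "is_p", "Protein complex"),
    ("Protein", "cs_by", "Reaction"),
    ("Protein", "pd_by", "Reaction"),
    ("Enzyme", "is_a", "Protein"),
    ("Enzyme", "is_a", "Protein complex"),
    ("Enzyme", "cat_c", "EC"),
    ("Enzyme", "co_by", "Compound"),
    ("Compound", "cs_by", "Reaction"),
    ("Compound", "pd_by", "Reaction"),
    ("Reaction", "m_isp", "Pathway"),
    ("Reaction", "ca_by", "Enzyme"),
]


def relations_from(concept_class):
    """Return the (relation type, target class) pairs that start at a concept class."""
    return [(rel, dst) for src, rel, dst in METAGRAPH if src == concept_class]


for rel, dst in relations_from("Enzyme"):
    print(f"Enzyme --{rel}--> {dst}")
```

Reading the metagraph as a list of triples like this makes it easier to check which relation types you should expect between two concept classes in the main network.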
Right-clicking on concepts/relations on the metagraph shows the number of concepts/relations of that type in the network. It also gives users the opportunity to untick “Visible”. This will make concepts/relations of that type invisible on the main network. The concept/relation in the metagraph will appear in a paler colour.
Note: Making concepts invisible in the network will not delete them from the network altogether. If you wish to do so, use Network -> Synchronise Network. A window will pop up to ask you for confirmation.
In the metadata legend, clicking on colours/shapes will allow you to pick different colours/shapes for concept classes, controlled vocabularies and relation types (see the three tabs). To see the change take effect, refresh the metadata (bottom button in the metadata legend) and use View -> Update Display.
For example, a reaction is of the same colour as three other concept classes (gene, protein complex and compound). In order to make it more distinct, we are going to change its colour by clicking on the coloured rectangle (first column). We get the following window popping up:
Clicking on the coloured rectangle in the new window will pop open a “Pick a colour” window:
Select a colour, click on OK, Apply, Refresh Metadata and View -> Update Display to finally get the following:
The metadata legend has three tabs: Concept Classes, Controlled Vocabulary and RelationType Sets. “ca_by” was originally in yellow and therefore not very visible. In the following screenshot its colour was changed to purple:
To move on to the main network, you can click on “Main Network” in the “Metagraph View” window or maximize the visualization window that has been sitting in the bottom left-hand side corner all along.
Opening the visualization window can take a while when there are a lot of concepts and relations to draw. Nevertheless, you will obtain something like this:
Rather than looking at this overwhelming graph, let us select a single pathway to analyse: GDP-L-fucose biosynthesis II
This leaves us with the GDP-L-fucose biosynthesis II (from L-fucose) pathway only:
You can use Layouts → Hierarchical for OVTK to organise the concepts hierarchically:
You can then select a concept (for example the pathway itself) and click on the “i” icon (a tooltip shows when you mouse over the icon) to see information about it:
The information gives a link to a website:
Clicking on this link opens a web browser at the following page:
Study the reaction shown in the website and try to understand how it is represented in OVTK.
Help: The first thing you might want to do is use the View menu to add concept and relation labels on the network so you know what is what without needing the “Item Information” window. Use
Help: To move concepts around,
Following this process, you obtain:
Note: the “Metagraph View” window has been minimized.
In the above screenshot, the reactions previously shown on the webpage have been reconstituted. Here, the compound that is a product of the first reaction and a substrate of the second has been selected and is automatically shown in the “Item Information” window. Note: SMILES stands for “simplified molecular input line entry specification”.
Note: If you have closed the Metadata Legend (previously launched from the metagraph view window), you may launch it again using Config -> Metadata Appearance.
OVTK provides two mechanisms for zooming:
i) Using the mouse: move the scroll wheel forwards and backwards
ii) Using the two zoom buttons on the toolbar to zoom in and out of the network
Selecting an area to look at from the overview
When studying large graphs, using the Network -> Satellite View to move and navigate across the network can be very useful.
Manually rearrange a network
Select the concepts of interest and move them around. See exercise in Section 1.2.
Automatically rearrange a network
There are a variety of layout algorithms which can be found in the Layouts menu. Select Layout -> OVTK Layouts (the GEM Algorithm is probably the most popular layout).
Rotating and Shearing your network
Finally, you can rotate and shear your network within OVTK. In the “transforming” mode (see Section 1.4 for the icon), shift + mouse will rotate and control + mouse will shear.
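Rotation and shear are both simple geometric transformations of the concept coordinates. The Python sketch below shows what each one does to a pair of 2D positions. It is only an illustration of the geometry, not OVTK code; the function names and the example layout are assumptions, and the direction of rotation on screen may differ because screen coordinates usually have the y axis pointing down.

```python
import math


def rotate(point, angle_deg, centre=(0.0, 0.0)):
    """Rotate a 2D point anti-clockwise around a centre by angle_deg degrees."""
    x, y = point[0] - centre[0], point[1] - centre[1]
    a = math.radians(angle_deg)
    return (centre[0] + x * math.cos(a) - y * math.sin(a),
            centre[1] + x * math.sin(a) + y * math.cos(a))


def shear_x(point, factor):
    """Shear a 2D point horizontally: x is shifted in proportion to y."""
    x, y = point
    return (x + factor * y, y)


# Hypothetical layout with two concepts.
layout = {"A": (1.0, 0.0), "B": (0.0, 1.0)}
rotated = {name: rotate(p, 90) for name, p in layout.items()}
sheared = {name: shear_x(p, 0.5) for name, p in layout.items()}
print(rotated)
print(sheared)
```

Rotation keeps all distances between concepts unchanged, whereas shearing slants the layout and therefore changes them.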
The toolbar is composed of several icons followed by a search bar, which is explained in further detail in Section 3.1. The icons are as follows:
Add a Concept or Relation
Information on selected Concept or Relation
Edit selected Concept or Relation
Delete selected Concept or Relation
Copy whole Network as new
Set mouse to “transforming” mode (to move within the network)
Set mouse to “picking” mode (to select concepts, drag and drop them)
We will briefly run through all the menus available in OVTK.
The File menu contains basic file functionality:
creates a new network
opens an OVTK session file
saves a session file
prints the visualization window
exports the visualization window
imports files in Network WorkBench format
creates workflows for data integration (see Section 4.1)
exits the OVTK
The Config menu contains:
for login user and password
to modify colours/shapes of concepts/relations
to change concept label font
to change relation label font
to save the coordinates, colours and shapes of the concepts/relations
The Network menu contains:
for a list of all the concepts
for a list of all the relations
for a movable overview of the network
Displays a metagraph containing all the types of concepts/relations
for a permanent window displaying information on selected items of the network
displays statistics about the network (see Section 5.3)
centers the network within the visualization window
removes invisible items from the network
The View menu contains:
shows names of concepts on the main network
shows names of relations on the main network
colours concepts by controlled vocabularies
colours concepts by type of concept
updates the network with information previously saved about concept colour
updates the network with information previously saved about concept shape
updates the network with information previously saved about relation colour
to update the visualization window with recent modifications
smoothes relations in the network
The Layouts menu offers a collection of algorithms for visually organising the network:
for a circular arrangement of concept classes
for a hierarchy of concept classes
for a flip around of selected concepts (mirror effect)
for the Kamada-Kawai layout algorithm
for a force-directed algorithm based on attributes values
for the GEM layout algorithm
for a layout determined by the selection of a focus concept
for a force-directed algorithm based on relation types
for the latest saved layout
for the Sugiyama layout algorithm (“Visualization of structural information: Automatic drawing of compound digraphs”, 1991, by Sugiyama, K. and Misue, K.)
for a directed rooted tree
for options on the parameters of some algorithms (some algorithms do not have options, others do but are not yet supported)
tick for automatic relayouting of the network when the visualization window’s size is modified
The Annotators menu offers a collection of algorithms that change the appearance of concepts and relations based on their properties:
to resize concepts based on their centrality within the network (see Section 3.3)
to colour concepts based on one of their attribute’s value (see Section 5.4)
to resize concepts based on one of their attribute’s value (see Section 5.1)
to resize edge width based on one of their attribute’s value
to shape concepts based on one of their attribute’s value (see Section 5.4)
to virtually remove one concept at a time and calculate the resulting structural changes in the network
The Filters menu offers a collection of algorithms for reducing the network to the concepts and relations of interest:
computes the shortest paths between all possible pairs of concepts and removes all relations that are not part of these shortest paths (see Section 5.3 for an example)
filters out some particular class of concepts
filters out concepts from a specified data source
to see only a particular number of neighbours of a particular concept (see Section 5.2 for an example)
filters out some particular type of relations
computes the Dijkstra shortest path algorithm from a particular concept (see Section 3.3 for an example)
selects a threshold for the value of one of the concepts’ attributes (see Section 5.2 for an example)
filters out all unconnected concepts
The Help menu contains:
a brief message about the project
a documentation of OVTK with screenshots
this tutorial is available in html format through this menu
Filters and Annotators will be further explored later on in this tutorial. For now, take some time to familiarise yourself with the toolbar and menus that have been introduced in the last two subsections. Play around with the currently opened network and please ask any questions you may have.
In this section we will show how to load in your own networks and associated data into OVTK. It is possible to open biological networks into OVTK by:
OVTK can read files written in the following formats:
For each concept/relation, right-click and view/edit properties.
Note: In the “Item Information” window, you can only view information and cannot modify anything.
It is also possible to create new, empty networks that concepts and relations can be manually added to. To create an empty network, go to File → New empty network. Click on the icon for “Add a concept or relation”.
If you wish to add a concept, click on “Add a new Concept”. You will need to fill in all the fields highlighted in orange as they are compulsory. This will require creating Controlled Vocabularies, etc.
If you wish to add a relation, click on “Add a new Relation”. You will need to select concepts beforehand (shift + left mouse button to create a select area) so the drop-down lists offer these concepts as origin/target of the relation.
You can save your OVTK sessions and export visualizations.
Searching and filtering networks is another powerful feature of OVTK. In this section we will cover different ways of navigating and analysing networks.
OVTK includes a Search feature, which enables you to quickly find concepts and relations.
You may specify in which field(s) you wish to search: parser identifier (PID), annotation, description, name and accession.
If you select (single click) one item in the search results, you will need to single click on the icon “Zoom In” in order for OVTK to zoom in on this particular concept. You can then use the mouse scroll to get to the concept.
If you select several items (using the Control key) in the search results, OVTK will automatically zoom in to show all those items as close as possible.
Configuring an OVTK Search
At the end of the toolbar, there are two options that can be configured for each search.
As shown in Section 1.2, the simplest way of filtering is through using the context drop-down list within the visualization window itself when available. It is sometimes also possible to filter the context list by selecting a context class in the second drop-down list.
This section is about the Filters menu. Filters allow you to quickly select multiple concepts or relations of interest by comparing concept and relation attributes loaded onto OVTK networks to properties you specify.
For this section and the next, we will work with a social network rather than a biological network. We will take advantage of the fact that Ondex is domain-independent to show some examples which are easy to understand. Other filters and annotators will be explored in Section 5 using biological networks.
Close the Aracyc network and load a new one in: foaf.xml.gz, friend of a friend network. The metagraph shows only one type of concept and one type of relation. This is normal as this friend of a friend network is simply composed of individuals who know one another.
How many people are there in the network?
How many relations are there in the network?
Have a look at some of the people. Here, for example, Marjorie McShane is observed:
In this section, we are going to study the Shortest Path Filter (Filters → Shortest Path). This filter is based on the Dijkstra algorithm.
The algorithm will then compute the shortest paths from the selected concept(s) to all other concepts. Selecting an attribute name in step 2 will result in a minimum weight tree of the network.
We can apply the GEM layout (Layouts → GEM Algorithm) to have a better overview of the network. In the following screenshot, we selected “Randy Schauer” in the network.
We then applied the shortest path filter as shown below:
The filter deleted all the relations which were not part of a shortest path from Randy Schauer to another person.
Note: Selecting more than one concept will result in the network showing all the shortest paths from these root concepts to all other concepts on the network (it will not result in the shortest path between the selected concepts).
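The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the OVTK implementation: it runs Dijkstra from each selected root and keeps every relation that lies on some shortest path from that root. The function names and the tiny example network are assumptions; each relation carries a weight, which would be 1 when no attribute is selected.

```python
import heapq
from collections import defaultdict


def dijkstra(adjacency, root):
    """Return the shortest distance from root to every reachable node."""
    dist = {root: 0.0}
    queue = [(0.0, root)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry
        for neighbour, w in adjacency[node]:
            candidate = d + w
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


def shortest_path_filter(edges, roots):
    """Keep the undirected edges that lie on a shortest path from any root."""
    adjacency = defaultdict(list)
    for u, v, w in edges:
        adjacency[u].append((v, w))
        adjacency[v].append((u, w))
    kept = set()
    for root in roots:
        dist = dijkstra(adjacency, root)
        for u, v, w in edges:
            # An edge is on a shortest path if it bridges exactly its own weight.
            if u in dist and v in dist and abs(abs(dist[u] - dist[v]) - w) < 1e-9:
                kept.add((u, v))
    return kept


# Hypothetical network: the a-b relation is not needed to reach anyone from "root".
edges = [("root", "a", 1.0), ("root", "b", 1.0), ("a", "b", 1.0), ("b", "c", 1.0)]
kept = shortest_path_filter(edges, ["root"])
print(sorted(kept))
```

With several roots, the result is the union of the relations kept for each root, which matches the note above: the paths run from each root to all other concepts, not between the roots.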
In this section, we are going to study the Betweenness Centrality Annotator in the friend of a friend network loaded in the previous section. To do this, close the shortest path filter without saving its results and click on Annotators → Betweenness Centrality. You get:
Clicking on a column’s name sorts the table according to the values contained in that column. Here we can sort by “score”, which represents the betweenness centrality measure of each concept within the network. A concept with a betweenness centrality of 1 is “central” to the network. The table shows Tim Finin is the most popular person in this network of friends:
Clicking on the visualization window (after clicking on “Annotate Graph”) will scale all concepts based on their score:
It is now easy to know what concepts are important in the network simply by observing the size of a concept.
In the above screenshot, the item information window gives details on Tim Finin. Deleting this person would change the network the most.
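To make the score more concrete, the following Python sketch computes betweenness centrality with Brandes' algorithm for an unweighted, undirected network. It is not the OVTK implementation. In particular, dividing by the highest score so that the most central concept gets a value of 1 is an assumption about how the annotator normalises its scores, and the example network is hypothetical.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph.

    adjacency maps each node to the set of its neighbours.
    Returns scores scaled so that the highest score is 1.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths (sigma).
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    top = max(score.values(), default=0.0)
    return {v: (s / top if top else 0.0) for v, s in score.items()}


# Hypothetical star-shaped network: every shortest path runs through "hub".
friends = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = betweenness(friends)
print(scores)
```

A concept scores highly when many shortest paths between other concepts pass through it, which is why removing such a concept disrupts the network the most.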
The OVTK stands for the ONDEX Visualization ToolKit and is used to visualize networks produced by running an ONDEX workflow. Traditionally, the data integration is done by writing a workflow in XML format and by running it in the back-end. We will cover this in Section 4.2. First of all, we are going to introduce the Launcher which allows users to prepare and run an ONDEX workflow within the OVTK. In both sections, we are going to work with an example which will later be analysed in Section 5.4.
For this example, techniques of comparative genomics were combined with data integration methods to predict pathogenicity genes in pathogenic organisms. The InParanoid algorithm for the prediction of orthologous groups of proteins was implemented as part of the ONDEX data warehouse.
The PHI-base database (http://www.phi-base.org/) is a reference database of virulence and pathogenicity genes validated by gene disruption experiments. It contains information on experimentally verified pathogenicity, virulence and effector genes from bacterial, fungal and Oomycete pathogens identified from the scientific literature. Data integration methods for both the contents of PHI-base and the genome sequence of Botrytis cinerea (http://www.broad.mit.edu/annotation/genome/botrytis_cinerea/) were developed for ONDEX.
The Launcher is available from File → Launcher in the menus and looks like this when first opened up:
In the top-left drop-down list, “All” is selected by default, which means the column below it lists all the possible steps that can make up a workflow. If you select a type of step in the drop-down list, the list of steps in the column below is filtered to show only steps of that kind.
All of the available steps are documented under "Developer\Document". A simple example of a workflow is to first select parsers (to import data), then select mapping methods and, finally, select some filtering to improve the quality of the graph. Any field that is not grayed out is required.
Note: Running this workflow with the whole genome sequence from Botrytis will take about an hour. If you wish to see quick results during this hands-on tutorial, we suggest you use a fasta file which has been prepared for this purpose: short.fasta.
This file contains a tenth of the entire genome and will take about 3 minutes. The results show a smaller graph which can be analysed nevertheless. In Section 5.4, you will be able to load up the results for the whole genome which are saved under the folder for application case 4 as botrytis_results.xml.gz.
Note: Before running a workflow in the launcher, make sure you save your current workflow so you can re-open it as you might not get all the parameters right the first time around.
For this example, the steps needed in the workflow are as follows:
Parsing the file containing the PHI-base database (Parser – Oxl import). Note: A “Create new graph” pre-step is automatically
Parsing the file downloaded from the Broad Institute website for the sequence of Botrytis cinerea (Parser –
Note: The taxonomy ID for Botryotinia fuckeliana was found on http://www.ncbi.nlm.nih.gov/Taxonomy/ (40559).
CC stands for Concept Class and is therefore “Protein”.
SeqType stands for Sequence Type and is in this case “AA”, amino acid.
After parsing the two data sources, they are mapped to each other using the “inparanoid” mapping, which uses pair-wise sequence alignment results generated by BLAST (http://www.ncbi.nlm.nih.gov/blast/download.shtml) (Mapping – InParanoid).
The Evalue parameter is passed on to BLAST.
“cutoff” and “overlap” are post filter parameters which specify a bit score cutoff and the minimum length of the match compared to the longest sequence.
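The two post-filter parameters can be illustrated with a short Python sketch. It assumes that “overlap” is the aligned length divided by the length of the longer of the two sequences, as described above. The class, the function name, the hit values and the thresholds used here are all made up for illustration and are not OVTK defaults.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    query: str
    subject: str
    bit_score: float
    match_length: int    # aligned length in residues
    query_length: int
    subject_length: int


def passes_post_filter(hit, cutoff, overlap):
    """True if the hit reaches the bit score cutoff and covers enough of the longer sequence."""
    longest = max(hit.query_length, hit.subject_length)
    return hit.bit_score >= cutoff and hit.match_length / longest >= overlap


# Hypothetical hits: only the first one passes both tests.
hits = [
    BlastHit("q1", "s1", bit_score=120.0, match_length=300, query_length=320, subject_length=400),
    BlastHit("q1", "s2", bit_score=35.0, match_length=300, query_length=320, subject_length=310),
    BlastHit("q2", "s3", bit_score=90.0, match_length=100, query_length=150, subject_length=500),
]
kept = [h for h in hits if passes_post_filter(h, cutoff=50.0, overlap=0.5)]
print([(h.query, h.subject) for h in kept])
```

The second hit fails on bit score and the third on overlap, which shows that the two parameters filter independently of each other.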
To reduce the complexity of the network, several concept classes are filtered out (Filter – ConceptClass).
In PHI-base, a protein is always linked to an “Interaction” which carries the specific phenotype. To merge the information from both the protein and the Interaction, the relation type set “phi1” is collapsed and the connected concepts are fused into a single concept of concept class “Interaction:Protein” (Transformer – Relation Collapser).
The unconnected filter is used to remove any isolated concepts, i.e. those without any relations to the rest of the network (Filter – Unconnected).
The next step will only keep clusters of concepts in the network which contain at least one concept of concept class Protein from Botryotinia fuckeliana. (Filter
The resulting network is exported as OXL (Export – Oxl export) to a Gzip compressed XML file called short_botrytis_results.xml
Finally, a tab-separated text file called short_botrytis_clusters.txt is created, which lists all associated phenotypes for the members of each cluster in the network (Export – Clusters).
Note: GDS stands for General Data Store. This is where the phenotypic annotation is stored under the attribute name “Pheno”.
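The last filtering and export steps can be sketched in Python as follows. The sketch treats a cluster as a connected component, drops isolated concepts and clusters without a protein from the organism of interest, and writes one tab-separated row per remaining cluster. The column layout, the dictionary keys and the example concepts are assumptions made for illustration; the real layout of the clusters file is not described in this tutorial.

```python
import csv
import io
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters of mutually reachable concepts."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, cluster = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            cluster.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(cluster))
    return clusters


def export_clusters(concepts, edges, taxid, out):
    """Write one row per kept cluster: cluster number, members, phenotypes."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["cluster", "members", "phenotypes"])
    number = 0
    for cluster in connected_components(sorted(concepts), edges):
        if len(cluster) < 2:
            continue  # isolated concept, as removed by the unconnected filter
        if not any(concepts[c].get("taxid") == taxid for c in cluster):
            continue  # no protein from the organism of interest
        number += 1
        phenotypes = sorted({concepts[c]["Pheno"] for c in cluster if "Pheno" in concepts[c]})
        writer.writerow([number, ",".join(cluster), ",".join(phenotypes)])


# Hypothetical concepts; "Pheno" mirrors the attribute name mentioned above.
concepts = {
    "bc1": {"taxid": "40559"},
    "bc2": {"taxid": "40559"},
    "phi1": {"Pheno": "example phenotype A"},
    "phi2": {"Pheno": "example phenotype B"},
    "phi3": {"Pheno": "example phenotype B"},
}
edges = [("bc1", "phi1"), ("phi2", "phi3")]
buffer = io.StringIO()
export_clusters(concepts, edges, taxid="40559", out=buffer)
print(buffer.getvalue())
```

In this example only the cluster containing bc1 survives: bc2 is isolated, and the phi2/phi3 cluster contains no protein with taxonomy ID 40559.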
Should you wish to see the results of your oxl network loading automatically when the workflow is finished running, copy and paste the graph number after “Load result in OVTK when complete” (in this case, “@graph7”). Loading short_botrytis_results.xml in OVTK will look similar to this screenshot:
The exact same workflow could have been written in an XML file as follows (here using the whole Botrytis genome sequence):
<Ondex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="ONDEXParameters.xsd">
<DefaultGraph name="phibase" type="memory">
<Parser name="oxl" datadir="/importdata/oxl">
<Parser name="fasta" datadir="/importdata/fasta">
<Parameter name="PathToBlast">D:/Program Files/Eclipse/Ondex-data/blast/bin</Parameter>
<Export name="oxl" datafile="/importdata/oxl/botrytis_results.xml">
<Export name="clusters" datafile="/importdata/oxl/botrytis_clusters.txt">
Customising the way you visualise and manipulate networks is a key function of OVTK. This is achieved through using a combination of annotators, filters, searches and layouts.
The best way to learn about OVTK is to try to use it on some real examples. In this tutorial we will look at four application cases:
Author: Jan Taubert
We integrated AraCyc (http://www.arabidopsis.org/biocyc/index.jsp), a database containing pathway information for the plant Arabidopsis thaliana, with data from the former DRASTIC-INSIGHT database for information on plant gene expression. Additionally we loaded microarray expression data onto the concepts of the network using the General Data Store. The expression data has been analysed for statistical significance and normalized.
We would like to show that an integrative approach to exploring microarray expression data can improve understanding in the context of pathways and might yield new biological insights.
After loading the OXL data file pathways.xml.gz, a meta-graph is displayed:
The different concept classes are:
All eleven concept classes can be identified in the main visualisation, which uses a circular layout. Each circle corresponds to one concept class, e.g. the circle with green round concepts on the left corresponds to Treatments from DRASTIC, whereas the circle with orange star shaped concepts corresponds to Pathways from AraCyc.
The whole network contains 24541 concepts and 45153 relations. Using context information we can display only the relevant sub network instead of having to work with the whole network.
From the Context drop-down list, please select gamma-glutamyl-cycle. The resulting network still has the circular layout inherited from the whole network.
The GEM layout (Layouts -> GEM Algorithm) can easily be applied to this smaller network, which produces a more pleasant representation of the network.
Now additional features, such as displaying concept and relation labels (View -> Show Concept Labels, View -> Show Relation Labels) and anti-aliased painting (View -> Anti-aliased), can be turned on to enrich the current visualisation. Using the mouse wheel you can zoom into the network view. Try to locate a group of membrane alanyl aminopeptidases.
Here only one Gene (blue triangle) has associated information from the DRASTIC database (green circle). For each of these three genes, additional information can be displayed by right-clicking on the concept.
By clicking on Edit Concept Properties you are able to inspect all the properties assigned to this concept. Selecting View/Edit Concept General Data Store displays a tab based representation of all values in the General Data Store.
In this case the values in the General Data Store are the microarray expression data listed according to the treatment. Now we can use the Scale by Value annotator (Annotators -> Scale by Value) to actually map the values of the General Data Store to the visualisation of the corresponding concepts.
The annotator supports multiple value selection. Here all treatments of the form xN60C are selected. Annotate Graph will perform the changes to the visualisation.
The concepts are now drawn as pie charts. Each pie chart is divided into as many slices as treatments were selected in the annotator. The slices are ordered in the mathematically positive direction (anti-clockwise), e.g. the 3.3N60C treatment expression value is the upper right slice. Visual inspection of this network now reveals that the Gene in the middle shows an irregular expression pattern, whereas the other two Genes are oppositely regulated (red for up-regulation and green for down-regulation).
Further inspection of the value in the General Data Store supports the hypothesis that this measurement might be wrong and the Gene is behaving in the same way as the other down-regulated Gene.
In addition we can annotate one of these three Genes with information from the DRASTIC database showing that this particular Gene has a significant change in expression under the associated conditions.
In this example we showed how visualisation of expression patterns can leverage the understanding in terms of comparing conditions and influenced pathways. We identified one particular Gene with an irregular expression pattern. Furthermore data integration helped to enrich pathways from AraCyc with information about influential treatments from DRASTIC.
Author: Keywan Hassani-Pak
In this application case we are going to show the integration of major public databases to provide a comprehensive gene annotation network. We have parsed and integrated GOA EBI, and Gene Ontology step by step in the ONDEX backend to build a core database. In this example we use the GO annotations of Arabidopsis thaliana; however, this core database can be created for any species. From
The functional annotation of genes is still a major challenge in the post-genomic era. Traditional manual annotations by literature curation are reliable and of high quality. However, as both the volume of literature and the number of genes requiring characterization increase, manual processing capabilities are becoming overloaded. To efficiently annotate genes with controlled vocabularies such as Gene Ontology (GO), computational methods to automate the process of functional annotation are required. We are working towards a novel annotation system for ONDEX that includes data integration and literature analysis methods to predict the function of previously unannotated genes.
Our first aim is to create a comprehensive integrated network in which genes are enriched with manual (if available) and automatic annotations such as text mining based annotations. The second aim is to show the information content of the core database and how to navigate through the network gene by gene.
The data integration was done in the ONDEX backend. Some intermediate steps were exported into the OXL format. Here we are going to load these steps and show some examples. As the first step we load the GOAEBI_MANUAL.xml.gz file into OVTK. The Metagraph View is displayed. We can see that Proteins are connected to the three GO classes (Molecular Function, Biological Process and Cellular Component). In this case every relation has a publication assigned to it (ternary relation). The publication contains the evidence for the relation between Protein and GO. Clicking on the Metadata Legend button shows some information about the number of the concepts in this first graph. For example 1157 unique publications are referenced in the GOA Arabidopsis file.
Let’s search for “at2g” to get a list of genes located on the second chromosome of Arabidopsis. Select several of the results and apply the Neighbourhood Filter with a depth of 1 or 2. Now we open the main graph and apply the GEM Algorithm Layout. We can see a graph with several clusters, which shows GO concepts surrounded by associated proteins.
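The effect of the depth setting can be illustrated with a breadth-first search that stops after a fixed number of relations. The Python sketch below is not the OVTK implementation; the function name and the small protein/GO network are hypothetical.

```python
from collections import deque


def neighbourhood(adjacency, seeds, depth):
    """Return every node within `depth` relations of any seed node."""
    reached = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if reached[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in reached:
                reached[neighbour] = reached[node] + 1
                queue.append(neighbour)
    return set(reached)


# Hypothetical network: two proteins share one GO concept.
network = {
    "protein_1": {"go_a"},
    "go_a": {"protein_1", "protein_2"},
    "protein_2": {"go_a", "go_b"},
    "go_b": {"protein_2"},
}
print(sorted(neighbourhood(network, ["protein_1"], depth=1)))
print(sorted(neighbourhood(network, ["protein_1"], depth=2)))
```

With a depth of 1 only the directly annotated GO concept is kept; with a depth of 2 the other protein sharing that GO concept also appears, which is what produces the clusters of proteins around GO concepts.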
The proteins have no sequence or further details right now. However they contain UNIPROT accessions, which we will use for the next integration step. Now let us load GOAEBI_MANUAL_UNIPROT.xml.gz and have a look at the Metadata Legend. We see that the number of Publications has increased to 2446 and 199 new EC concepts are in the graph now. This is because UNIPROT proteins have their own publications and ECs.
We perform the same accession based integration approach with MEDLINE and GO accessions. The results can be loaded from GOAEBI_MANUAL_UNIPROT_MEDLINE_GO.xml.gz. This file is an integrated core database which contains sequences, abstracts and the GO hierarchy. It can be used for many automatic GO annotation approaches as a manual reference set.
Now we are going to show on some example proteins the information content of the integrated gene annotation network. First we search for “bre1” and apply the Shortest Path Filter. This shows the protein in the centre with its publications and GO annotations. Additionally the GO hierarchy is visible.
Next we search for “bik1” and apply the Neighbourhood Filter with depth 1. Then we open the Item Information window and click on one of the publications. The title, abstract, year and journal etc of the publication are shown in this window.
As we can see, not every publication was used as a source for functional GO annotation. We are now interested in automatically extracting GO terms from these publications. This text mining based mapping between publications and GO concepts was carried out in the ONDEX backend. We load the results into OVTK:
Search for the “mgp” gene and apply the Shortest Path Filter. The blue relations (is_r) are created automatically by the text mining mapping method. A score and evidence sentences are assigned to each of those relations (these can be seen in the Info Window). Applying the “Scale relation width by value” annotator will highlight strong relationships.
We perform the same search with “FH8” and apply the Neighbourhood Filter with depth 2. One can see on the right side that one publication (PMID:11130712) has many links to proteins; it seems to be a large-scale genomic DNA paper. On the left side, manual and automatically extracted GO annotations can be compared.
As a last step of the analysis we have blasted two target sequences against our integrated database. So let us load GOAEBI_MANUAL_UNIPROT_MEDLINE_GO_TM_BLAST_noHirarchy.xml.gz into OVTK. We search for “target1” and apply the Shortest Path filter again. Then we annotate the graph relations by E-Value in Scale Relation Width by Value. We can see our target protein in the centre connected to several ARR proteins based on sequence similarity.
We have integrated major public databases into one core database. OVTK provides an efficient way to navigate these gene annotation networks graphically. This application case showed that integrated gene annotation networks can provide substantial support for semi-automated genome annotation projects. Using text mining as an automatic GO annotation method opens up the wealth of indispensable knowledge in the scientific literature.
Author: Jochen Weile
This application case uses data from an ongoing project that examines topological comparison on protein-protein interaction networks (PPIs) with the help of graph hierarchies that serve as scaffolds for the comparison.
· We will have a look at the single interim steps of the workflow to understand the whole process as well as the structure of the graph.
· We will learn about the OVTK’s abilities to serve as a tool for quality assessment and data exploration.
Integrated databases: We integrate several databases during the time of the workflow: PPI data from a high-throughput YTH-array experiment on Campylobacter jejuni, the respective PPI datasets for E. coli and Helicobacter pylori from the IntAct database, the Pfam-A HMM library and the Gene Ontology. See fig. 5.3.1.
Step 1 (Integration of PPIs): Let us have a look at the workflow that was used to create the data for this tutorial. Open the workflow launcher (File→Launcher) and load the file DemoWorkflow.xml (File→Open). Do not execute it, though. It requires special hardware and takes several hours to run. Anyway, we can now have a look at the single integration and processing steps in it.
· PSI-MI parser: The PSI-MI parser imports the PPI data from IntAct and from the C. jejuni experiment flatfiles. It creates concepts for each participant of an interaction and links them with relations. The relation types match the description of the interaction. It may be a general “interacts with”, or more specifically “physical interaction”, “colocalization” or “is part of”.
· Uniprot2GOA transformer: This transformer uses the Uniprot IDs found in concepts with a given Taxonomy ID (in this case 83333, that is E.coli) to find the corresponding GO annotations in the GOA database. It then attaches these annotations to the concepts.
So we now have our interaction networks with annotated E.coli entries. Let’s have a look at them! They have been exported to Step1.xml.gz. Switch back to the main OVTK window and open the file (File→Open).
After a little wait (~60,000 relations are being loaded) the metagraph window will appear. The metagraph gives an overview of the main graph representing every concept class as a node and every relation type that connects part of the concepts in two concept classes as an edge. The real graph visualization is minimized at the lower left corner of the main window. I would not recommend opening it though, since it would take some time to render all those 60,000 relations. For starters it will be enough to examine the metagraph.
Figure 5.3.2: Metagraph at step 1 of the workflow.
If you right-click on nodes or edges you will see a context menu popping up. It explains the represented type and the number of associated elements in the main graph. In the example in fig.5.3.2. this would be the type RNA that represents 2 concepts. The menu also offers to change the visibility of these elements in the main graph.
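The way a metagraph summarises the main graph can be sketched in a few lines of Python. This is only an illustration and not the OVTK implementation; the dictionary and tuple data model and the function name build_metagraph are assumptions made for this example, and the toy data is invented.

```python
from collections import Counter

def build_metagraph(concepts, relations):
    """Summarise a graph by concept class and relation type.

    concepts:  dict mapping concept id -> concept class (hypothetical model)
    relations: list of (from_id, relation_type, to_id) tuples
    """
    # One metagraph node per concept class, with the number of concepts in it.
    meta_nodes = Counter(concepts.values())
    # One metagraph edge per (source class, relation type, target class).
    meta_edges = Counter(
        (concepts[src], rel_type, concepts[dst])
        for src, rel_type, dst in relations
    )
    return meta_nodes, meta_edges

# Toy data, invented for illustration only.
concepts = {"p1": "Protein", "p2": "Protein", "p3": "Protein",
            "r1": "RNA", "r2": "RNA"}
relations = [
    ("p1", "physical interaction", "p2"),
    ("p2", "physical interaction", "p3"),
    ("p1", "interacts with", "r1"),
]
meta_nodes, meta_edges = build_metagraph(concepts, relations)
print(meta_nodes["RNA"])
print(meta_edges[("Protein", "physical interaction", "Protein")])
```

The counts held by each metagraph node and edge correspond to the numbers shown in the context menu described above.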
You will probably be wondering now, why there are other concept classes than “Protein” in the graph. That is a legitimate question. There is a quite simple answer: The E.coli dataset from IntAct did not only contain entries about protein-protein interaction, but also about other kinds of molecular interactions. Since we do not need them here, however, we might just as well get rid of those entries.
Step 2 (Filtering the data): To do so in the OVTK we would make those data points first invisible and then remove all invisible elements from the graph. Right-click on all the metaconcepts (nodes in the metagraph) except “Protein” and then click their visibility checkboxes in the appearing context menus. You will see that this does not only affect the metaconcepts themselves, but also their connected metarelations. This is because relations cannot exist without a source or target concept.
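The rule that a relation cannot stay visible without both of its end concepts can be sketched as follows. The data model and the name visible_relations are hypothetical and only serve this illustration; OVTK handles this internally.

```python
def visible_relations(relations, concept_class, hidden_classes):
    """Return the relations whose source and target concepts are both visible."""
    return [
        (src, rel_type, dst)
        for src, rel_type, dst in relations
        if concept_class[src] not in hidden_classes
        and concept_class[dst] not in hidden_classes
    ]

# Toy data, invented for illustration only.
concept_class = {"p1": "Protein", "p2": "Protein", "r1": "RNA"}
relations = [
    ("p1", "physical interaction", "p2"),
    ("p1", "interacts with", "r1"),
]
# Hiding the "RNA" metaconcept also hides the relation that ends in r1.
kept = visible_relations(relations, concept_class, hidden_classes={"RNA"})
print(kept)
```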
Now that all unnecessary elements are invisible, we can remove them from the graph. This can be done with the graph synchronization function (Network→Synchronize Network). However, this would take quite a long time on such a big graph (~30 min.), because consistency with the visualization would have to be maintained all the time. Luckily there is a quicker way: the ONDEX workflow.
Bring the Launcher window to the front again and have a look at the next elements in the workflow:
So all we need to do to see the results of this cleanup is to switch back to OVTK and load the file Step2.xml. We want it to look nice when we open it later, so we apply a layout to it: The GEM layout (Layouts→GEM Algorithm). This will take a while so let’s move the progress monitor some place where it doesn’t obstruct us and do something else in the meantime.
Start the statistics module (Network→Statistics). You see a window that is divided into three main areas. Top left is the listing panel. It contains three lists with different graph elements that can be used as variables or filters for the module. Top right is the selection panel. It contains the variable field with two buttons that load it with the currently selected element or unload it, respectively. Below the variable field you can see the filter list with the same kind of load/unload buttons beside it. The filter field can be loaded with an arbitrary number of elements, which will be handled as conditions linked with an “or” operation. If attribute names are loaded into the filter list they may be assigned a value.
Finally, the panel on the bottom is the display panel. It shows the number of concepts that were found to meet the selected conditions and the respective number of relations. If the variable can contain numbers, it also shows the mean and standard deviation values and a histogram.
Figure 5.3.3: The statistics module.
This probably all sounds quite complicated so let’s just try it out and see an example. Select the attribute “CONF” from the listing panel and load it into the variable field by clicking the respective button. You should now see a histogram of the confidence values that occur in the graph. Most of the interactions seem quite unlikely, with the highest peak at 0.1 and the mean value at 0.3354. This seems to confirm that YTH is a very hypersensitive methodology.
You can also see the number of relations that hold confidence values: 11,900. This is a surprisingly small number, considering there are about 60,000 relations in the whole graph.
So let’s find out where they come from. Unload the variable again and select the “TAXID” attribute. Load it into the filter field and assign the value “197” (i.e. C.jejuni). Now load the confidence attribute into the variable field again.
The histogram seems unchanged but we see that the number of relations is a bit lower: 11,885. Does that mean there are only 11,900 – 11,885 = 15 relations with confidence values in the whole rest of the dataset? That would explain why the histogram looks almost the same as before. Let’s find out.
Unload the variable, change the TAXID value to 83333 (i.e. E.coli) and reload the confidence attribute into the variable field.
Indeed, there are only 15 relations: Six with value 0.4, five with value 0.6 and four with value 0.8.
That would mean that the H.pylori data does not contain any confidence values. Find out; you know what to do (H.pylori-TAXID=210).
The commonality between the E.coli and the H.pylori dataset is that they both originate from the IntAct database while the C.jejuni data comes from the experimental flatfiles. It would seem that the IntAct dataset is of quite poor quality in these cases.
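What the statistics module did for us in the last few steps can be mimicked with a small Python sketch. The dictionary-based relation records and the function name conf_statistics are assumptions for illustration. The E.coli values are the fifteen reported above; the H.pylori record is a made-up placeholder without a confidence value. The choice of the population standard deviation is also an assumption, since the tutorial does not say which variant the module uses.

```python
from statistics import mean, pstdev

def conf_statistics(relations, taxids):
    """Count relations and summarise CONF for relations whose TAXID is in taxids.

    Several filter values are combined with a logical "or".
    """
    values = [
        rel["CONF"]
        for rel in relations
        if rel["TAXID"] in taxids and rel.get("CONF") is not None
    ]
    if not values:
        return 0, None, None
    return len(values), mean(values), pstdev(values)

# E.coli (TAXID 83333) values as reported in the text:
# six with 0.4, five with 0.6 and four with 0.8.
relations = (
    [{"TAXID": 83333, "CONF": 0.4}] * 6
    + [{"TAXID": 83333, "CONF": 0.6}] * 5
    + [{"TAXID": 83333, "CONF": 0.8}] * 4
    # Hypothetical H.pylori relation without a confidence value.
    + [{"TAXID": 210}]
)
count, avg, sd = conf_statistics(relations, taxids={83333})
print(count, round(avg, 4), round(sd, 4))
```

Calling conf_statistics(relations, taxids={210}) on this toy data finds no values at all, which mirrors the H.pylori observation.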
Anyway, the layout process should be finished by now. If it is, we can have a look at the graph now. However, in order to not wait for too much rendering we should hide the relations first. Open the metagraph window again and right-click on the “physical interaction” metarelation. Unselect the visibility checkbox in the context menu. If you think your machine is fast enough you can also omit this step, but be prepared for some 30 seconds of GUI freezing. Now de-iconify the main visualization window (lower left corner) and click the “center network” button in the toolbar. There they are: two black spots. These are the two main components of the C.jejuni and the E.coli PPIs. Scattered around them you can see some fragments, among them the small set of H.pylori data.
To see the differences between the species we could apply different colors to them. Open the Color Annotator (Annotators→Color by Value). Select the “TAXID” attribute from the list and click on “Annotate graph”. A legend will appear, explaining the color code. (You will probably not see any change from the current zoom level unless you have activated anti-aliasing, which I would not recommend at the moment).
Figure 5.3.4: Two interaction networks in GEM layout colored by Taxonomy ID.
Step 2.1 (Examining a subgraph): Zoom in on the C.jejuni main component (the less dense one): Hold down the shift-key and draw a rectangle over the component to select it. Then click the “Zoom in” button on the toolbar. Alternatively you can use the mouse wheel.
We can see the effects of the graph’s scale-free property: many nodes with low degree attached to few nodes with high degree. But to see this a bit more clearly we could have a look at a smaller subset of it. To this end we pick a node that gives a nice example: the “CRISPR-associated protein”. The easiest way to do so is to use the search function. Enter “crispr” in the search field (top right) and press enter (or click search). A list appears that shows the result. Select it and you will see the node being highlighted in the graph.
Now open the neighborhood filter (Filters→Neighborhood), specify search depth as ‘3’ and click “Filter Network”. You can now apply the GEM layout again; it will be much quicker on a smaller graph like this. You may also activate anti-aliasing now if you wish.
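Conceptually, a neighborhood filter with a given depth keeps every node that can be reached from the selected node in at most that many steps. A minimal sketch using breadth-first search is shown below; the adjacency-list model and the function name are assumptions for illustration, not the OVTK code.

```python
from collections import deque

def neighbourhood(adjacency, seed, depth):
    """Return all nodes within `depth` hops of `seed` (breadth-first search)."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)

# Toy undirected chain a - b - c - d - e, invented for illustration.
adjacency = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
print(sorted(neighbourhood(adjacency, "a", 3)))
```

With depth 3 the node "e", which is four steps away from "a", is left out.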
In order to examine this subgraph a bit more we would have to synchronize it first. But to spare you the waiting time I prepared the result for you. You can close all internal windows and open Step2_mini1.xml.gz. Let’s find out more about the internal structure of this subgraph.
Choose the ‘Betweenness centrality’ annotator from the menu (Annotators→Betweenness Centrality). Wait a few seconds for it to compile the required data; then a window will open that shows a list of the nodes in the graph with their respective BC scores. Click on the score category to sort them and then click on the entry with the highest value. It will then be highlighted in the graph. Click on “Annotate Graph” and minimize the window.
Figure 5.3.5: Betweenness centrality filter applied to a smaller subset of the graph.
You will see that a few nodes are scaled larger than most of the others. Those are the ones that feature the highest BC score. To learn more about them click on the info button in the tool bar and select one of them.
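To get a feeling for what a BC score measures, here is a compact sketch of betweenness centrality for an unweighted, undirected graph, following the algorithm of Brandes (2001) from the reference list. It is an illustration under our own assumptions (adjacency lists as input, no normalisation) and is not taken from the OVTK source.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Brandes' algorithm for unweighted, undirected graphs."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}

# Toy star graph: one hub connected to three leaves.
adjacency = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
scores = betweenness_centrality(adjacency)
print(scores)
```

In the star graph every shortest path between two leaves runs through the hub, so the hub receives the highest score and the leaves receive zero.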
If you like, you can also save the results of the analysis in an Excel spreadsheet. De-iconify the annotator window and click on “Export results”.
Step 2.2 (Examining a small subgraph): By applying the Neighborhood filter again with a lower depth (2) we can get an even smaller subset. This time we can call the synchronization for real. It will only take about a minute. Afterwards we can apply the GEM layout anew. Now we can see clearly the scale-free property of the graph. Many low degree nodes associated to very few high-degree nodes.
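The "many low-degree nodes, few high-degree nodes" pattern can also be checked numerically by counting how many nodes have each degree. The following sketch assumes a plain edge list and uses an invented toy graph; the function name degree_histogram is ours.

```python
from collections import Counter

def degree_histogram(edges):
    """Map each degree to the number of nodes that have it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

# Toy hub-and-spoke graph: one hub with five leaves plus one extra edge.
edges = [("hub", f"leaf{i}") for i in range(5)] + [("leaf0", "extra")]
histogram = degree_histogram(edges)
for degree, count in sorted(histogram.items()):
    print(degree, count)
```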
Our graph is now small enough to run a more complex analysis: the “Virtual Knock-out” annotator (Annotators→Virtual Knock-Out). After a little wait (O(n^3)) another internal window will appear showing a table with the results for each node. It contains the following columns:
Figure 5.3.6: Knockout filter being applied to a small subset of the PPI.
Just like the Betweenness-centrality annotator you can sort the entries by their score, select the highest entry and click on “Annotate Graph”. Then minimize the window to see the effect on the graph; or you might export the results to an Excel spreadsheet again.
You can use the item information window again to explore the graph a bit.
To find out more about the significance, we can apply some more filters. Let’s try the shortest path filter. Select our central CRISPR protein again, open the shortest path filter (Filters→Shortest Path) and choose the confidence attribute as edge weight. Since the confidence is a probability value we have to invert it (so lower probabilities result in a greater distance value). Now click “Filter graph”.
We can see that a lot of paths that seemed visually longer were actually considered improbable.
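The effect of using inverted confidence values as edge weights can be reproduced with a small Dijkstra sketch. We assume here that the inversion is distance = 1 - confidence; the tutorial does not state the exact formula OVTK uses, so treat this as one plausible choice. The edge list and node names are invented.

```python
import heapq

def shortest_paths(edges, source):
    """Dijkstra's algorithm using distance = 1 - confidence on each edge."""
    graph = {}
    for u, v, conf in edges:
        weight = 1.0 - conf  # assumed inversion: low confidence = long edge
        graph.setdefault(u, []).append((v, weight))
        graph.setdefault(v, []).append((u, weight))
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in graph[u]:
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    return dist, parent

# Toy edges (node, node, confidence), invented for illustration.
edges = [("crispr", "a", 0.9), ("a", "b", 0.9), ("crispr", "b", 0.1)]
dist, parent = shortest_paths(edges, "crispr")
print(parent["b"], round(dist["b"], 2))
```

Although "b" is a direct neighbour of the source, the two-step route through "a" is shorter because the direct interaction has a low confidence value.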
There is a nice layout trick that we could apply now: click “Re-layout” on the toolbar, then close the shortest path filter. You will be asked if you want to keep the changes of the filter; click “No”. You will now see the complete subgraph laid out according to the shortest path tree structure. (See fig. 5.3.7)
Figure 5.3.7: Subgraph around the CRISPR protein laid out according to its shortest path structure.
If you want to, you can try the same with the all-pairs shortest path filter (Filters→All pairs shortest path). But it might take some time, since it is another O(n^3) algorithm. It does the same as the shortest path filter, except that it does not consider only one source node but computes the shortest paths between all possible pairs of nodes.
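To see where a cubic running time can come from, here is a sketch of the Floyd-Warshall algorithm, a well-known all-pairs method. Whether the OVTK filter uses this particular algorithm is not stated in this tutorial, so it serves only as an example of the complexity class. The node names and weights are invented.

```python
def all_pairs_shortest_paths(nodes, weights):
    """Floyd-Warshall: three nested loops over the nodes, hence O(n^3)."""
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in weights.items():
        # Undirected graph: store each edge in both directions.
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Toy triangle, invented for illustration.
nodes = ["a", "b", "c"]
weights = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0}
dist = all_pairs_shortest_paths(nodes, weights)
print(dist["a"]["c"])
```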
Let’s try a more effective filter for significance analysis: the significance filter (Filters→Significance).
Select the confidence attribute and choose a threshold value in the chart. Then click “Filter graph”.
Figure 5.3.8: Subgraph before application of the significance filter.
Figure 5.3.9: Subgraph after application of the significance filter
We can see that a whole lot of interaction relations are removed. This is actually not surprising because the histogram already suggests that most of the interactions have a confidence value below 0.3. To see how this affects the complete graph, you can load Step2.xml.gz again and apply the significance filter to it.
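In essence, such a filter is a simple threshold test on one attribute. The sketch below assumes that relations at or above the threshold are kept and that relations without a confidence value are left untouched; both are assumptions for illustration, as are the function name and the toy data.

```python
def significance_filter(relations, threshold):
    """Keep relations whose confidence reaches the threshold.

    Relations without a confidence value (None) are kept unchanged,
    which is an assumption of this sketch.
    """
    return [
        (u, v, conf)
        for u, v, conf in relations
        if conf is None or conf >= threshold
    ]

# Toy relations (node, node, confidence), invented for illustration.
relations = [("a", "b", 0.1), ("b", "c", 0.35), ("c", "d", None)]
kept = significance_filter(relations, 0.3)
print(kept)
```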
Figure 5.3.10: Complete PPI of C.jejuni after application of the significance filter.
Since we found out earlier that neither the E.coli nor the H.pylori data features confidence values we already knew that this filter would not affect those two datasets. However it might still be a good idea to apply it in the workflow as well. Switch back to the launcher and you will see that the next step is indeed:
Step 3 (Seed mapping): If we look further on in the workflow we see the next steps that are part of the Pfam-based seed mapping, which works as follows:
In the next step we try to get rid of those false positives by trying to confirm each orthology relation with common protein family hits. All orthologies that fail to present at least one shared Pfam module are removed. Also, to proceed as conservatively as possible, we have a look at the protein’s GO annotations (where existent). If they differ too much between the two ends of an orthology it is deleted as well. To this end we need a graph configuration as it can be seen in fig. 5.3.11.
Figure 5.3.11: Metagraph of the configuration used for the Pfam-based seed mapping.
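The Pfam-based confirmation described above boils down to a set intersection per orthology pair. The following sketch covers only the shared-Pfam test and leaves out the GO comparison. The protein identifiers and family labels are invented placeholders, and the function name confirm_orthologies is ours.

```python
def confirm_orthologies(orthologies, pfam_hits):
    """Keep only orthology pairs that share at least one Pfam family."""
    confirmed = []
    for a, b in orthologies:
        shared = pfam_hits.get(a, set()) & pfam_hits.get(b, set())
        if shared:
            confirmed.append((a, b))
    return confirmed

# Toy data; identifiers and family labels are placeholders.
pfam_hits = {
    "cj_1": {"famA", "famB"},
    "ec_1": {"famB"},
    "cj_2": {"famC"},
    "ec_2": {"famD"},
}
orthologies = [("cj_1", "ec_1"), ("cj_2", "ec_2"), ("cj_3", "ec_1")]
confirmed = confirm_orthologies(orthologies, pfam_hits)
print(confirmed)
```

Pairs in which one protein has no Pfam hit at all are dropped as well, since an empty set cannot share a family with anything.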
Hence the next steps in the workflow are:
Now, let’s have a look at the graph after the application of the seed mapping. Bring the OVTK to front again, close all internal windows and open Step4_layout.xml.gz. In the metagraph you will see that there is a new relation present in the graph:
To spare you waiting for the layout process, I already saved one in the general data stores of the concepts for you. Just choose Layouts→Static to load it. Then hide the “physical interaction” relation using the metagraph context menu. You may then de-iconify the main graph window.
Figure 5.3.12: The Two PPIs of E.coli and C.jejuni with orthology mapping.
At first glance it looks quite good. Let’s have a look at the statistics. Open the statistics module once more. The scores for the mappings are also stored in an attribute called confidence. Hence we have to make sure that we only include orthologies in our statistics. Load the relation type “ortho” into the filter panel. Then load the “CONF” attribute into the variable field. You will see something like the contents of fig. 5.3.13.
There are no orthologies with a score lower than 50. This is because we chose 50 as our cutoff for the Inparanoid algorithm. The histogram rises between scores 50 and 100. This seems reasonable, since we can expect the probability of sharing a Pfam hit to rise with increasing sequence similarity. From score 100 on, we see the number of occurrences falling again, following the typical BLAST hit score distribution.
In total we see a number of 948 orthology relations. We must not forget that this is based on bi-directional BLAST hits, so we have two relations for each orthology. So we have 948/2 = 474 actual mappings. How much is that compared to the size of our C.jejuni dataset? To find out, empty the fields again and load the “TAXID” into the filter field. Assign value 197 for C.jejuni. Then load the concept class “Protein” into the variable field. And, the answer is: 1321. So we have a share of 474/1321 = 0.359 = 35.9% of our C.jejuni proteins mapped to E.coli and H.pylori proteins.
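The arithmetic above can be double-checked in a few lines. The two input numbers are the ones reported by the statistics module; the variable names are ours.

```python
ortho_relations = 948      # orthology relations reported by the statistics module
cjejuni_proteins = 1321    # proteins with TAXID 197

# Each mapping is stored as two relations, one per direction.
mappings = ortho_relations // 2
share = mappings / cjejuni_proteins
print(mappings, f"{share:.1%}")
```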
Figure 5.3.13: Statistics module showing a histogram of the BLAST scores in the orthologies.
Step 4 (Hierarchy construction): In the next step we build up a clustering hierarchy on our PPIs. This is supposed to work as a scaffold for the comparison that is to be made between the PPIs. Have a look at the workflow again.
Let us have a look at the result. Load the file Step5.xml.gz. We see that the metagraph has changed again. Now we have hierarchy nodes that are connected to themselves and to the proteins with “is part of” relations. The Hierarchy is basically a tree, so we could use a tree layout to display that graph. But first we have to make sure that the interaction and mapping relations don’t interfere with that tree structure. So we just hide them with the metagraph.
Now we can apply our tree layout (Layouts→Tree). However our tree structure is based on “is part of” relations instead of “includes” relations. So we have to reverse the order. Click Layouts→Layout Options and check the box for “reversed edge direction” in the appearing window, then click “Refresh Layout”. Now you may de-iconify the visualization window.
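Why the edge direction matters for a tree layout can be seen in a small sketch: the "is part of" relations point from child to parent, while a top-down layout needs to walk from parent to children. The helper names and the toy hierarchy are invented for illustration.

```python
def children_from_part_of(part_of_relations):
    """Turn child -> parent "is part of" relations into a parent -> children map."""
    children = {}
    for child, parent in part_of_relations:
        children.setdefault(parent, []).append(child)
    return children

def depth_of(children, root):
    """Assign each node its depth below the root, as a tree layout would need."""
    depths = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            depths[child] = depths[node] + 1
            stack.append(child)
    return depths

# Toy hierarchy: two proteins are part of cluster h1, which is part of the root.
part_of = [("p1", "h1"), ("p2", "h1"), ("h1", "root")]
children = children_from_part_of(part_of)
depths = depth_of(children, "root")
print(depths)
```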
What may look like an obscure line is in fact the two tree structures. You can see this if you zoom in on it.
Now we will try to visualize the mapping between the two proteomes again. Zoom in on the gap between the two trees and start marking the smaller one from there. The “Color by value” annotator might help you find it.
When you have marked the tree, use the flip layout (Layouts→Flip) to reverse it and then drag it below the other tree. Now we can set the orthology relations to visible. Use the metagraph to do so.
You may explore the graph a bit using the item information panel.
Figure 5.3.14: The two PPIs with their hierarchies and their seed mappings.
You might also have a look at a smaller example. Close all internal frames and load the file Step5_mini.xml.gz. Again I have prepared a layout for you. You can load it again using the static layout.
Figure 5.3.15: Small subgraphs of two PPIs with their hierarchies and seed mappings.
You may explore the graph a bit using the item information panel or any annotators.
The last step in this processing workflow would of course be the actual topological comparison. It is, however, still under development. Anyway, I hope you have gained some feeling for the idea behind this project during this tutorial.
In this tutorial we have learned about different visualization and data exploration possibilities in the OVTK. We have also begun to understand the basics of the graph data structure of ONDEX and the way the workflow engine can extend and alter it.
Finally, we have understood the different steps that are made in ONDEX’s PPI comparison project.
We integrated PHI-base (http://www.phi-base.org/), a database containing expertly curated molecular and biological information on genes proven to affect the outcome of pathogen-host interactions. Additionally we loaded the genome sequence of Botrytis cinerea (http://www.broad.mit.edu/annotation/genome/botrytis_cinerea/).
Running our implementation of the InParanoid algorithm based on BLAST mappings of the genomic data gives us new biological insights. Indeed PHI-base contains a lot of annotation which can be displayed on the network using the “Annotators” menu. This type of visualization in OVTK allows us to gain new hypotheses on the Botrytis genome.
After loading the OXL data file botrytis_results.xml.gz, a metagraph is displayed. In the metagraph, red circles are genes from Botrytis and blue triangles are genes from PHI-base (see explanation for “Interaction:Protein” Section 4.1).
They are linked by ortholog and paralog relations indicating whether genes were separated by the event of speciation (ortholog) or genetic duplication (paralog). Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one.
Concepts imported from PHI-base contain a lot of annotation. In order to make this annotation visible, we are going to use the Annotators menu. First of all, let us use the Colour by Value annotator.
As a lot of phenotypic information is encoded in PHI-base, we select the first attribute: “Phenotype”. Then click on “Annotate Graph”. We get a colour legend and colour annotation on the triangles (PHI-base concepts) in the graph:
We use Layouts → GEM Algorithm in order to visualize clusters:
We may now zoom in on a particular cluster:
Another Annotator we may use is the Shape by Value Annotator as PHI-base also contains information on the pathogen species. We select the attribute “TAXID” (Taxonomy Identifier) in the list:
We can look up the concepts’ TAXID by editing their concept properties:
By clicking on “View/Edit Concept General Data Store”, we get:
The taxonomy ID is under the TAXID tab. When there are a lot of tabs, you may have to click on a right hand side arrow to get to it.
Once you know what TAXID a concept holds, you may enter it in the box below and choose a shape you would like to associate to it:
The results are displayed once you click on the graph:
Let us change the shapes of all the PHI-base concepts in this cluster by using the same annotator several times. (Note: Entering more than one TAXID in the blank box would make the same shape be associated to all of those TAXIDs.)
In this example we have shown how data integration combined with visualization of annotation from a manually curated database can yield new hypotheses for a genome.
OVTK is open source and the latest nightly build can be obtained from: http://www4.rothamsted.bbsrc.ac.uk/ondex/latest_snapshots/ovtk2_packaged-distro.jar
OVTK requires a JAVA 6 runtime environment (http://java.sun.com/). It can be used under both Windows and Linux. You can execute the downloaded file using the following command:
java -Xmx512M -jar ovtk2_packaged-distro.jar
The parameter -Xmx512M is used to specify the maximum amount of memory JAVA can use. The minimum amount for OVTK is 256 MByte. The more memory is available, the larger the networks OVTK can handle.
During the first run several resources are created. These include two directories, “data” and “net”. The “data” directory contains examples, help files, configuration files and images used by OVTK. The “net” directory is part of the scripting engine of OVTK, which also created the corresponding help file “Scripting_ref.htm”. Last but not least a log file called “ondex.log” is created.
 See Section 6.2 for more information or README file on CD.
 Kamada T. and Kawai S. (1989). An algorithm for drawing general undirected graphs. Information Processing Letters, 31, 7--15.
 Frick A., Ludwig A. and Mehldau H. (1994). A Fast Adaptive Layout Algorithm for Undirected Graphs (Extended Abstract and System Demonstration). Lecture Notes In Computer Science; Vol. 894. Proceedings of the DIMACS International Workshop on Graph Drawing, 388--403.
 Brandes U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, Vol. 25, 163--177.
 Remm M., Storm C.E., Sonnhammer E.L. (2001). Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of molecular biology, 314(5):1041-52, PubMed ID: 11743721.
 J. R. Parrish, J. Yu, G. Liu, J. A. Hines, J. E. Chan, B. A. Mangiola, H. Zhang, S. Pacifico, F. Fotouhi, V. J. DiRita, T. Ideker, P. Andrews, and R. L. Finley. A proteome-wide protein interaction map for Campylobacter jejuni. Genome Biology, 8(7):R130, 2007.
 M. Remm, C.E. Storm, E.L. Sonnhammer; Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314(5) 1041-1052; 2001
 A. Clauset, C. Moore, and M. Newman. Structural inference of hierarchies in networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.