Query Primer

Queries are a major feature of DiscoverySpace and are the most complex and powerful new functionality of this fourth release. A Query is a question asked of the database, much like a web search on Google or Yahoo. However, Queries are more structured than web searches and allow the user to search particular properties of things. For example, "Get me all things with a certain name", or "Get me all things less than a certain age", etc. The Query allows the user to define such a subset of the data available in the database using a graphical query builder. It is imperative that new users learn how to use Queries in order to make full use of the functionality that DiscoverySpace has to offer. The skills used in constructing queries are also of vital importance when using the DiscoverySpace Explorer, which follows a similar methodology.

Table of Contents

  1. Introduction
  2. Definition of Terms
  3. Defining the path of the Query
  4. Applying Constraints
  5. The "In" Clause

1. Introduction

This primer will introduce the user to the core concepts of the Query with some example use cases. Available datatypes differ for each user, so you might not have access to the specific datatypes given in the examples. Nonetheless, the general usage remains the same regardless of the specific datatypes available. Also be aware that there may be many different ways to reach the same result, and no one method may be 'correct'.

2. Definition of Terms

DiscoverySpace uses the term resource to describe an individual, unique record, object or entity. Each resource belongs to a class of resource. For example, the record "Refseq NM_001004214" is a resource which belongs to the "Refseq Genes" class. DiscoverySpace offers a large and ever-expanding selection of classes including "GSC SAGE Libraries", "Refseq Genes", "Swissprot Proteins", "GO Terms", etc. Each class of resource has its own properties (for example, accession, name, description, Taxon, etc.) which describe the attributes of an individual resource and its relationships to other resources.

In order to utilize the features of DiscoverySpace one has to define sets of resources. Such a set can be the result of a Query. For example, in the case of the SAGE experiments one has to be able to define sets of Tag Sequence resources. Once these sets of resources have been defined, one can explore, compare and analyze the sets using DiscoverySpace's inbuilt tools. The central method in DiscoverySpace for defining sets of resources is via the Query. Additionally, one may define sets by using Data definitions. Presently DiscoverySpace supports sets of one class only and does not allow sets of resources of mixed classes (e.g. a set of GO Term resources and Tag Sequence resources).

The Query builder represents a query as a graphical 'query tree' of properties which describe the qualities of a set of resources. The user can navigate these properties to explore the data model, retrieve associated sets of data and constrain the resulting set by applying constraints to the tree. There are two categories of property: literal properties and resource properties.

Literal properties are the 'attributes' (or 'fields') of a resource and have an atomic, simple datatype as an object (an integer, a decimal, a date or text). For instance, the 'age' of a Person resource would be described using a literal property with an integer object. Literal properties are denoted by the 'document' icon in the query tree and, crucially, may have constraints applied to them. DiscoverySpace does not allow the user to collect sets of literal objects, only of resource objects. For example, one cannot collect the set of all names of a given set of Refseq resources.

Resource properties are 'links' (or 'foreign keys') to related resources and are denoted in the query tree by the 'folder' icon. For instance, the siblings of a Person resource would be described using a resource property with a Person object. Constraints may not be attached directly to a resource property, but they may be attached to the literal properties of that resource. In addition, by following resource properties the user can collect a set of related resources; one may wish to collect the set of 'siblings' resources, or the set of the 'parents' of the 'parents' of a given set.
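The distinction between the two categories of property can be sketched in a few lines of Python. This is only an illustrative model, not DiscoverySpace's own API: the Resource class, its field names and the example Person data are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Atomic values a literal property may hold (dates omitted for brevity).
Literal = Union[int, float, str]


@dataclass(eq=False)  # compare by identity; resources may link to each other
class Resource:
    """Hypothetical stand-in for one resource of a given class."""
    cls: str                                   # class name, e.g. "Person"
    literals: Dict[str, Literal] = field(default_factory=dict)
    # Resource properties point at other resources; left out of repr
    # because sibling links are circular.
    links: Dict[str, List["Resource"]] = field(default_factory=dict, repr=False)


alice = Resource("Person", {"name": "Alice", "age": 34})
bob = Resource("Person", {"name": "Bob", "age": 29})
alice.links["siblings"] = [bob]
bob.links["siblings"] = [alice]

# A literal property yields a plain value, which is what a constraint tests.
age_ok = alice.literals["age"] < 40

# A resource property yields further resources, which can be collected as a set.
sibling_names = [r.literals["name"] for r in alice.links["siblings"]]
print(age_ok, sibling_names)
```

In this model a constraint can only ever inspect an entry in literals, while the result of a query is always a list of Resource objects, mirroring the rule that only resources (never literal values) can be collected.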

3. Defining the path of the Query

The Query panel of the Query Properties presents all the classes available on the database in a drop-down (for example "GSC SAGE Libraries" or "GO Terms"). The class selected is the starting set of resources which anchors the query. In Figure 1 (below) the starting class is "Kegg Genes". This means that the user is starting her query with the set of all Kegg Genes available from the database. This starting set should be either the class that the user is interested in or a set from which the user wishes to get an associated set. An example of such an associated set would be if I wanted to get the set of all Refseq genes mapped to my given set of Tag Sequences. In such a case I should start with the class of Tag Sequences and go from there to the associated Refseqs. Once the start set has been selected, the user is presented with the query tree display which describes the properties of the start set, its attributes and its links to other resources.

An image of the Query

Figure 1) A detail from the top section of the query panel. The user has used the drop-down to select the set of Kegg Genes as the start set. In the tree below the drop-down one can see the properties of the Kegg Genes class. At the top of the tree (marked by document icons) are the literal properties of a Kegg Gene: accession, names, description, etc. Below these are the resource properties (marked by folder icons) which link to related resources: the Locuslinks related to the set of Kegg Genes, the pathways related to the set of Kegg Genes, etc. These resource properties can themselves be expanded to show their own properties, and so on.

With a starting set chosen, the user then needs to consider the end dataset that she requires. By following the links from the start set the user can navigate to associated sets of resources. To collect one of these associated sets, the user needs to mark the set as the end of the query. This is done by selecting the required resource property and clicking the "End" button. As noted before, literal property nodes cannot be set as the destination of a query. Figures 2a-2c illustrate how changing the end marker affects the results of a query.

An image of the Query

Figure 2a) A detail from the top section of the query panel. The query defines the set of all 'GSC SAGE Libraries'. Here the user has chosen 'GSC SAGE Libraries' as her start set. The user has not moved the end marker, so the start and end of the query are the same. Therefore the user is starting with the set of all 'GSC SAGE Libraries' and is collecting that same set. The user can expect a result set containing the 300+ GSC SAGE Libraries available on the database.

An image of the Query

Figure 2b) A detail from the top section of the query panel. The query defines the set of all Taxons of all 'GSC SAGE Libraries'. This query is identical to the query in 2a) except that the end marker has been moved to the Taxon property of the libraries. Therefore the user is starting with the set of all 'GSC SAGE Libraries' and is collecting the Taxons of those libraries. The user can expect a result set containing the 10 or so Taxons from which SAGE libraries have been generated.

An image of the Query

Figure 2c) A detail from the top section of the query panel. The query defines the set of all Tag Sequences of all 'GSC SAGE Libraries'. This query is identical to the query in 2b) except that the destination marker has been moved further down the tree to collect the 'Sequences' of the 'Experimental SAGE Tags' of all 'GSC SAGE Libraries'. WARNING: As shown, this query will return a vast result set, so do not run it without attaching additional constraints (see below).
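The effect of moving the end marker can be sketched as a walk along resource properties. The follow function, the dict-based resources and the library and taxon names below are hypothetical and exist only for illustration; the sketch says nothing about how DiscoverySpace evaluates a query internally.

```python
from typing import Any, Dict, List

Resource = Dict[str, Any]  # hypothetical: a resource modelled as a plain dict


def follow(start_set: List[Resource], path: List[str]) -> List[Resource]:
    """Follow each resource property in `path` from the start set and
    return the de-duplicated set reached at the end marker."""
    current = start_set
    for prop in path:
        seen, reached = set(), []
        for res in current:
            for target in res.get(prop, []):
                if id(target) not in seen:   # keep each resource only once
                    seen.add(id(target))
                    reached.append(target)
        current = reached
    return current


# Made-up sample data: three libraries sharing two taxons.
human = {"name": "Homo sapiens"}
mouse = {"name": "Mus musculus"}
libraries = [
    {"name": "LIB_A", "Taxon": [human]},
    {"name": "LIB_B", "Taxon": [human]},
    {"name": "LIB_C", "Taxon": [mouse]},
]

same_set = follow(libraries, [])        # as in Figure 2a: end marker left on the start set
taxons = follow(libraries, ["Taxon"])   # as in Figure 2b: end marker moved to Taxon
print(len(same_set), [t["name"] for t in taxons])
```

An empty path returns the start set itself, while each extra step in the path moves the collected set one link further from the start class, which is why the query of Figure 2c can grow so large.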

4. Applying Constraints

Constraints are logical rules which enable the user to reduce the result set to just those resources which she requires. Unconstrained sets return all of the resources of a given class in the database: for example, all 300+ GSC SAGE Libraries, or all 4000 GO Terms, or all 14 million experimental SAGE tags. However, it is very rare that we wish to deal with all the resources of a certain class, and one has to be careful because there is a limit to the number of resources that DiscoverySpace can fit into memory.

Constraints can be added to any literal property in the query tree and many constraints may be applied to one property. Constraints restrict a property with a condition, chosen from a drop-down, and a value entered by the user. Available conditions include "equals", "less than", "in", etc., as can be seen from Figure 3 (below). Each condition can also be negated by checking the "Not" box. For example, using the "starts with" condition together with the "Not" box I can reduce the result set to only those "GO Terms" which do not have a term type of "function".

On a more technical point it should be specified that constraints are combined together using an "And" operation. Currently there is no support for the "Or" operation except for the limited capabilities of the "In" clause (see below). In future releases of DiscoverySpace we hope to extend the Query to provide "Or" support.
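The "And" combination and the "Not" box can be modelled with a short sketch. The condition table, the tuple layout of a constraint and the sample tag data are assumptions made for this example and are not taken from DiscoverySpace itself.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical condition names mapped to two-argument tests.
CONDITIONS: Dict[str, Callable[[Any, Any], bool]] = {
    "equals": operator.eq,
    "less than": operator.lt,
    "starts with": lambda value, prefix: str(value).startswith(prefix),
}

# One constraint: (literal property, condition, value, "Not" box checked?)
Constraint = Tuple[str, str, Any, bool]


def matches(resource: Dict[str, Any], constraints: List[Constraint]) -> bool:
    """True only if every constraint holds (the implicit "And")."""
    for prop, condition, value, negated in constraints:
        result = CONDITIONS[condition](resource[prop], value)
        if negated:
            result = not result          # the "Not" box flips the test
        if not result:
            return False                 # one failure rejects the resource
    return True


# Made-up tags for illustration only.
tags = [
    {"id": "t1", "quality": 0.995, "duplicate_ditag": False},
    {"id": "t2", "quality": 0.950, "duplicate_ditag": False},
    {"id": "t3", "quality": 0.999, "duplicate_ditag": True},
]
constraints = [
    ("quality", "less than", 0.99, True),       # NOT (quality < 0.99)
    ("duplicate_ditag", "equals", True, True),  # NOT (duplicate_ditag = True)
]
kept = [t["id"] for t in tags if matches(t, constraints)]
print(kept)
```

Because every constraint must pass, adding a constraint can only shrink the result set or leave it unchanged; there is no way in this model to widen it, which is the practical consequence of having no "Or".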

Note that selecting a resource property node from the query tree displays all the constraints of the descendants of that node. Therefore selecting the start node will display all the constraints that have been applied to the whole query tree.

An image of the Query

Figure 3) An image of the Query panel. In this example the user has attached a constraint to the "Name" property of the class "GEO SAGE Library". The location of the constraint is represented in the query tree by the green triangle. The detail of the constraint itself is displayed in the bottom half of the panel. Here one can see that the user has the choice of a number of available conditions such as equal ("="), "like", and more than or equal to (">="). After choosing the condition to use, the user then needs to enter a value for the constraint.

An image of the Query

Figure 4) An image of the Query panel. This query defines a set of all GSC Human SAGE Libraries. To do this the user first selected "GSC Human SAGE Libraries" as the starting datatype of the query (leaving the destination marker in place), then placed a constraint on the accession of the Taxonomy of the libraries. In this example the user has selected all Libraries with a Taxon with accession '9606', the NCBI Taxon for Homo sapiens.

An image of the Query

Figure 5) An image of the Query panel. This query defines the set of all Tag Sequences from GEO Library GSM1. The user has first selected the start class 'GEO SAGE Library' and has set the end marker to collect the 'Tag Sequences' of the 'Experimental Tags' of the 'GEO SAGE Libraries'. She has then attached a constraint to select only the 'GEO SAGE Libraries' with name 'GSM1'.

An image of the Query

Figure 6) An image of the Query panel. This query defines the set of Tag Sequences from GSC SAGE Library SHS05. The user has first selected the start class 'GSC SAGE Libraries' and has set the end marker to collect the 'Tag Sequences' of the 'Experimental SAGE Tags' of the 'GSC SAGE Libraries'. She has then attached constraints to select only the libraries with name 'SHS05', and only the Experimental SAGE Tags which are not duplicate ditags and which have a Quality of over 0.99. All of these constraints are combined together to constrain the resulting set.

5. The "In" Clause

Of all the Conditions available for Query constraints, the "In" clause deserves special attention. The "In" clause allows the user to constrain a given property to be one of many in a set of values. It is best regarded as an "equals" operation for multiple values. Technically the "In" clause is a series of "equals" operations which have been combined with an "Or". The true power of the "In" clause is that it can handle thousands of values, and allows the user to quickly construct sets from long lists of values. For example, a user might have a long list of gene names. In order to get all the Refseq records for those gene names she merely needs to attach an "In" clause constraint on the appropriate property and paste those names into it. The screenshots below are intended to make the usage of the "In" clause clear.

An image of the Query

Figure 7) An image of the Query panel. Here the user has selected the starting set of GO Terms and is restricting that set with a constraint upon the Accession property. In the lower half of the window one can see that the "In" clause has been selected for this constraint and that the value column contains an empty list ("[]") of values. The user needs to click this field to edit the list.

An image of the Query

Figure 8) An image of the Query panel. Here the user is editing the contents of the value list of the "In" clause. In this image the user has pasted 28 GO Accessions into the list. The user can also import values from a file or can enter values directly into the table. Clicking OK adds these values to the list.

An image of the Query

Figure 9) An image of the Query panel and the result of the action in Figure 8. One can see that the value list of the "In" clause now contains the values from Figure 8. This Query will now return all GO Terms that have an Accession which is in the given list of values.
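The description of the "In" clause as a series of "equals" operations combined with an "Or" can be checked with a small sketch. Both functions, the pasted text and the accession strings are hypothetical placeholders invented for this example; they are not real GO accessions and not DiscoverySpace code.

```python
from typing import Any, Dict, Iterable, List

Resource = Dict[str, Any]  # hypothetical: a resource modelled as a plain dict


def in_clause(resources: List[Resource], prop: str,
              values: Iterable[Any]) -> List[Resource]:
    """Keep resources whose property equals any one of the given values."""
    wanted = set(values)   # set membership stays fast for thousands of values
    return [r for r in resources if r[prop] in wanted]


def or_of_equals(resources: List[Resource], prop: str,
                 values: Iterable[Any]) -> List[Resource]:
    """The same filter written out as equals OR equals OR ..."""
    candidates = list(values)
    return [r for r in resources if any(r[prop] == v for v in candidates)]


# Placeholder accessions, one per line, as if pasted into the value list.
pasted = """ACC_0002
ACC_0004
ACC_9999"""
value_list = pasted.split()

terms = [{"accession": f"ACC_{n:04d}"} for n in range(1, 6)]
selected = in_clause(terms, "accession", value_list)
print([t["accession"] for t in selected])
```

Values in the list that match nothing (ACC_9999 here) are simply ignored, and both formulations return the same resources; the set-based version is just the cheaper way to express the long "Or".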

Page last modified Jun 08, 2010