Query Primer
Table of Contents
- Introduction
- Definition of Terms
- Defining the path of the Query
- Applying Constraints
- The "In" Clause
1. Introduction
This primer introduces the user to the core concepts of the Query with some example use cases. Available datatypes differ for each user, so you might not have access to the specific datatypes given in the examples. Nonetheless, the general usage remains the same regardless of the specific datatypes available. Also be aware that there may be many different ways to reach the same result, and no one method is necessarily 'correct'.
2. Definition of Terms
DiscoverySpace uses the term resource to describe an individual, unique record, object or entity. Each resource belongs to a class of resource. For example, the record "Refseq NM_001004214" is a resource which belongs to the "Refseq Genes" class. DiscoverySpace offers a large, and ever expanding, selection of classes including "GSC SAGE Libraries", "Refseq Genes", "Swissprot Proteins", "GO Terms", etc. Each class of resource has its own properties (for example, accession, name, description, Taxon, etc) which describe the attributes of an individual resource and its relationships to other resources.
In order to utilize the features of DiscoverySpace one has to define sets of resources. Such a set can be the result of a Query. For example, in the case of the SAGE experiments one has to be able to define sets of Tag Sequence resources. Once these sets of resources have been defined, one can explore, compare and analyze the sets using DiscoverySpace's inbuilt tools. The central method in DiscoverySpace for defining sets of resources is via the Query. Additionally, one may define sets by using Data definitions. Presently DiscoverySpace supports sets of one class only and does not allow sets of resources of mixed classes (e.g. a set containing both GO Term resources and Tag Sequence resources).
The Query builder represents a query as a graphical 'query tree' of properties which describe the qualities of a set of resources. The user can navigate these properties to explore the data model, retrieve associated sets of data and narrow the resulting set by applying constraints to the tree. There are two categories of property: literal properties and resource properties.
Literal properties are the 'attributes' (or 'fields') of a resource and have an atomic, simple datatype as an object (an integer, a decimal, a date or text). For instance, the 'age' of a Person resource would be described using a literal property with an integer object. Literal properties are denoted by the 'document' icon in the query tree and, crucially, may have constraints applied to them. DiscoverySpace does not allow the user to collect sets of literal objects, only of resource objects. For example, one cannot collect the set of all names of a given set of Refseq resources.
Resource properties are 'links' (or 'foreign keys') to related resources and are denoted in the query tree by the 'folder' icon. For instance, the siblings of a Person resource would be described using a resource property with a Person object. Constraints may not be attached directly to a resource property, but they may be attached to the literal properties of that resource. In addition, by following resource properties the user can collect a set of related resources; one may wish to collect the set of 'siblings' resources, or the set of the 'parents' of the 'parents' of a given set.
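The distinction between the two categories of property can be sketched in a few lines of Python. This is only an illustrative model, not DiscoverySpace's own data structures: the Person class, its fields and the follow helper are all hypothetical names chosen to match the 'age', 'siblings' and 'parents' examples above.

```python
from dataclasses import dataclass, field
from typing import List


# eq=False keeps identity-based hashing, so resources can be stored in sets.
@dataclass(eq=False)
class Person:
    name: str  # literal property (text): constraints may be applied
    age: int   # literal property (integer): constraints may be applied
    # Resource properties: links to other Person resources.
    parents: List["Person"] = field(default_factory=list)
    siblings: List["Person"] = field(default_factory=list)


def follow(resources, prop):
    """Collect the set of resources linked from `resources` via a resource property."""
    return {target for res in resources for target in getattr(res, prop)}


# Hypothetical three-generation family.
grandparent = Person("Ada", 80)
parent = Person("Beth", 55, parents=[grandparent])
child = Person("Cal", 20, parents=[parent])

# The 'parents' of the 'parents' of a given set.
grandparents = follow(follow({child}, "parents"), "parents")
names = sorted(p.name for p in grandparents)
```

Note that the result of follow is always a set of resources, never a set of literal values, which mirrors the restriction described above.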
3. Defining the path of the Query
The Query panel of the Query Properties presents all the classes available on the database in a drop-down (for example "GSC SAGE Libraries" or "GO Terms"). The class selected is the starting set of resources which anchors the query. In Figure 1 (below) the starting class is "Kegg Genes". This means that the user is starting her query with the set of all Kegg Genes available from the database. This starting set should be either the class that the user is interested in or a set from which the user wishes to get an associated set. An example of such an associated set would be the set of all Refseq genes mapped to a given set of Tag Sequences. In such a case the user should start with the class of Tag Sequences and go from there to the associated Refseqs. Once the start set has been selected, the user is presented with the query tree display, which describes the properties of the start set, its attributes and its links to other resources.
With a starting set chosen, the user then needs to consider the resulting end dataset that she requires. By following the links from the start set the user can navigate to associated sets of resources. To collect one of these associated sets, the user needs to mark the set as the end of the query. This is done by selecting the required resource property and clicking the "End" button. As noted before, literal property nodes cannot be set as the destination of a query. Figures 2a-2c illustrate how changing the end marker affects the results of a query.
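The effect of moving the end marker can be approximated with a small Python sketch. It is a toy model under stated assumptions, not how DiscoverySpace executes a query: resources are plain dictionaries, the property name refseq_genes and all identifiers such as TAG_X and GENE_1 are made up, and the position of the end marker is expressed as a depth along the path.

```python
def run_query(start_set, path, end):
    """Follow `path` (a list of resource property names) from the start set
    and return the set found at depth `end` (0 means the start set itself)."""
    current = list(start_set)
    for prop in path[:end]:
        reached = []
        for res in current:
            for target in res.get(prop, []):
                # Keep each linked resource once, compared by identity.
                if not any(target is seen for seen in reached):
                    reached.append(target)
        current = reached
    return current


# Hypothetical data: two Tag Sequences mapped to Refseq genes.
gene_1 = {"accession": "GENE_1"}
gene_2 = {"accession": "GENE_2"}
tag_x = {"sequence": "TAG_X", "refseq_genes": [gene_1, gene_2]}
tag_y = {"sequence": "TAG_Y", "refseq_genes": [gene_2]}
tags = [tag_x, tag_y]

# End marker left on the start node: the query returns Tag Sequences.
end_at_start = run_query(tags, ["refseq_genes"], end=0)

# End marker moved to the resource property: the query returns Refseq genes.
end_at_genes = run_query(tags, ["refseq_genes"], end=1)
```

The same start set and the same path give a different class of result depending only on where the end is placed.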
4. Applying Constraints
Constraints are logical rules which enable the user to reduce the result set to just those resources which she requires. Unconstrained sets return all of the resources of a given class in the database: for example, all 300+ GSC SAGE Libraries, all 4000 GO Terms, or all 14 million experimental SAGE tags. However, it is very rare that we wish to deal with all the resources of a certain class, and care is needed because there is a limit to the number of resources that DiscoverySpace can fit into memory.
Constraints can be added to any literal property in the query tree, and many constraints may be applied to one property. Constraints restrict a property with a condition, chosen from a drop-down, and a value entered by the user. Available conditions include "equals", "less than", "in", etc., as can be seen from Figure 3 (below). Each condition can also be negated by checking the "Not" box. For example, using the "starts with" condition with the "Not" box checked, I can reduce the result set to only those "GO Terms" which do not have a term type of "function".
On a more technical note, all constraints in a query are combined using an "And" operation. Currently there is no support for the "Or" operation except for the limited capabilities of the "In" clause (see below). In future releases of DiscoverySpace we hope to extend the Query to provide "Or" support.
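The "And" combination and the "Not" box can be illustrated with a minimal Python sketch. It assumes resources are plain dictionaries and covers only three of the conditions; the function names, the term_type and name keys and the sample GO Terms are hypothetical and do not reflect DiscoverySpace's internals.

```python
import operator

# A few of the available conditions, keyed by their drop-down label.
CONDITIONS = {
    "equals": operator.eq,
    "less than": operator.lt,
    "starts with": lambda value, arg: str(value).startswith(arg),
}


def make_constraint(prop, condition, arg, negate=False):
    """Build a test on one literal property; `negate` models the "Not" box."""
    test = CONDITIONS[condition]

    def check(resource):
        result = test(resource[prop], arg)
        return not result if negate else result

    return check


def apply_constraints(resources, constraints):
    """Keep only resources that satisfy every constraint ("And")."""
    return [res for res in resources if all(c(res) for c in constraints)]


# Hypothetical GO Term resources.
go_terms = [
    {"name": "term one", "term_type": "function"},
    {"name": "term two", "term_type": "process"},
    {"name": "other", "term_type": "process"},
]

constraints = [
    # "starts with" with the "Not" box checked.
    make_constraint("term_type", "starts with", "function", negate=True),
    make_constraint("name", "starts with", "term"),
]
result = apply_constraints(go_terms, constraints)
```

Because every constraint must hold, adding a constraint can only shrink the result set; there is no way in this model to widen it with an "Or".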
Note that selecting a resource property node from the query tree displays all the constraints of the descendants of that node. Therefore selecting the start node will display all the constraints that have been applied to the whole query tree.
5. The "In" Clause
Of all the conditions available in Query constraints, the "In" clause deserves special attention. The "In" clause allows the user to constrain a given property to be one of many in a set of values. It is best regarded as an "equals" operation for multiple values. Technically the "In" clause is a series of "equals" operations which have been combined with an "Or". The true power of the "In" clause is that it can handle thousands of values, and allows the user to quickly construct sets from long lists of values. For example, a user might have a long list of gene names. In order to get all the Refseq records for those gene names she merely needs to attach an "In" clause constraint to the appropriate property and paste those names into it. The screenshots below are intended to make the usage of the "In" clause clear.
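The equivalence between an "In" clause and a series of "equals" operations joined by "Or" can be checked with a short Python sketch. Everything here is an assumption made for illustration: the helper names, the gene_name and accession keys, and the placeholder values such as GENE_A are hypothetical, and the set-based lookup is simply one convenient way to test membership against a long list.

```python
def parse_pasted(text):
    """Turn a pasted block of text into a list of values, one per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def in_clause(prop, values):
    """Constrain `prop` to be any one of `values`."""
    wanted = set(values)  # set lookup stays fast for thousands of values
    return lambda res: res[prop] in wanted


def or_of_equals(prop, values):
    """The same constraint spelled out as "equals" tests joined by "Or"."""
    return lambda res: any(res[prop] == value for value in values)


# Hypothetical Refseq records and a pasted list of gene names.
records = [
    {"accession": "REC_1", "gene_name": "GENE_A"},
    {"accession": "REC_2", "gene_name": "GENE_B"},
    {"accession": "REC_3", "gene_name": "GENE_C"},
]
pasted = """
GENE_A
GENE_C
"""
gene_names = parse_pasted(pasted)

by_in = [r["accession"] for r in records if in_clause("gene_name", gene_names)(r)]
by_or = [r["accession"] for r in records if or_of_equals("gene_name", gene_names)(r)]
```

Both formulations select the same records; the "In" form is simply far less work for the user than entering one "equals" constraint per value.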