Query engine
This technical document describes the spoc query engine, a set of classes that implements spoc's interface for querying multi-dimensional genomic data.
Principles
Composable pieces
Spoc's query engine consists of composable pieces that can be combined to produce an expressive query language. These pieces represent basic operations on genomic data that are easily implemented and understood on their own. This allows a great degree of flexibility, while also allowing predefined recipes that less experienced users can get started with.
Lazy evaluation
The spoc query engine is designed with lazy evaluation as a guiding principle. This means that data queries are only executed when they are needed to minimize loading data into memory and computational overhead. To enable this, spoc queries have a construction phase, which specifies the operations to be executed and an execution phase, that actually executes the query.
Query plans and query steps
The most important ingredient in this query language is a class that implements the QueryStep
protocol. This protocol serves two purposes:
- It exposes a way to validate the data schema during query building
- It implements adding itself to a query
This way, query steps can be combined into a query plan that specifies the analysis to be executed. In generaly, spoc supports the following types of query steps:
- Overlap: Overlaps genomic data with a set of genomic intervals
- Aggregation: Aggregates the data passed into the query using an aggregation function
- Transform: Adds columns to query based on a fixed or custom computation
Specific examples of query steps are:
- DistanceTransformation: Example of a transfomration. Adds distance of genomic positions to regions added by Overlap
- DistanceAggregation: Exaple of an aggregation. Aggregates the distances to genomic regions using an aggregation function.
Input and output of query steps
A query step takes as input a class that implements the GenomicData
protocol. This protocol allows retrievel of the data schema (a thin wrapper over a pandera dataframe schema) as well as the data itself. The output of a query step is again a class that ipmlements the GenomicData
protocol to allow composition. Specific examples of possible inputs are:
- Pixels: Represents input pixels
- Contacts: Represents input contacts
- QueryPlan: The result of a query step
Composition of query steps
To allow specifying complex queries, query steps need to be combined. This is done using the Query
class. It takes a query plan (a list of QueryStep
instances) as input, exposes the build
method, which takes input data, validates all query steps and adds them to the resulting QueryPlan
instance that is returned.
Manifestation of results
So far, we have only talked about specifying the query to be executed, but not how to actually execute it. A QueryPlan
has a compute()
method that returns the manifested dataframe as a pd.DataFrame
instance. This is the step that actually executes the specified query.
Examples
Selecting a subset of contacts at a single genomic position
In this example, we want to select a subset of genomic contacts at a single location. For this, we first load the required input data:
from spoc.query_engine import Overlap, Anchor, Query
from spoc.contacts import Contacts
import pandas as pd
contacts = Contacts.from_uri("../tests/test_files/contacts_unlabelled_2d_v2.parquet::2")
Then we specify a target region
First, we want to select all contacts where any of the fragments constituting the contact overlaps the target region. To perform this action, we use the Overlap class and pass the target region as well as an instance of the Anchor
class. The Anchor
dataclass allows us to specify how we want to filter contacts for region overlap. It has two attributes mode
and anchors
. Anchors
indicates the positions we want to filter on (default is all positions) and mode
specifies whether we require all positions to overlap or any position to overlap. So for example, if we want all of our two-way contacts for which any of the positions overlap, we would use Anchor(mode='ANY', anchors=[1,2])
.
A query plan is a list of qury steps that can be used in the basic query class
The .build
method executes the query plan and retuns a QueryPlan
object
<spoc.query_engine.QueryPlan at 0x1804b711a00>
The .load_result
method of the QueryResult
object can be executed using .load_result
, which returns a pd.DataFrame
. The resulting dataframe has additional columns that represent the regions, with which the input contacts overlapped.
<class 'pandas.core.frame.DataFrame'>
chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | region_chrom | region_start | region_end | region_id | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 100 | 400 | 0 |
1 | chr1 | 2000 | 3000 | chr1 | 200 | 300 | chr1 | 100 | 400 | 0 |
2 | chr1 | 3000 | 4000 | chr1 | 300 | 400 | chr1 | 100 | 400 | 0 |
We can also restrict the positions to filter on, by passing different anchor parameters. For example, we can filter for contacts, where the first position overlaps with our target:
query_steps = [
Overlap(target_region, anchor_mode=Anchor(mode="ANY", anchors=[1]))
]
Query(query_steps=query_steps)\
.build(contacts)\
.compute()\
.filter(regex=r"chrom|start|end|id")
chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | region_chrom | region_start | region_end | region_id | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 100 | 400 | 0 |
This time, only the first contact overlaps.
The same functionality is implemented also for the Pixels class
Selecting a subset of contacts at multiple genomic regions
The Overlap class is also capable of selecting contacts at multiple genomic regions. Here, the behavior of Overlap
deviates from a simple filter, because if a given contact overlaps with multiple regions, it will be returned multiple times.
Specify target regions
target_regions = pd.DataFrame({
"chrom": ['chr1', 'chr1'],
"start": [100, 150],
"end": [400, 200],
})
query_steps = [
Overlap(target_regions, anchor_mode=Anchor(mode="ANY", anchors=[1]))
]
Query(query_steps=query_steps)\
.build(contacts)\
.compute()\
.filter(regex=r"chrom|start|end|id")
chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | region_chrom | region_start | region_end | region_id | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 100 | 400 | 0 |
1 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 150 | 200 | 1 |
In this example, the contact overlapping both regions is duplicated.
The same functionality is implemented also for the pixels class.
Calculating the distance to a target region and aggregating the result
In this example, we calculate the distance of pixels to target regions and aggregate based on the distances. This is a very common use case in so-called pileup analyses, where we want to investigate the average behavior around regions of interest.
from spoc.pixels import Pixels
from spoc.query_engine import DistanceTransformation, DistanceMode
import pandas as pd
import numpy as np
from itertools import product
First we define a set of target pixels
def complete_synthetic_pixels():
"""Pixels that span two regions densely"""
np.random.seed(42)
# genomic region_1
pixels_1 = [
{
"chrom": tup[0],
"start_1": tup[1],
"start_2": tup[2],
"start_3": tup[3],
"count": np.random.randint(0, 10),
}
for tup in product(
["chr1"],
np.arange(900_000, 1_150_000, 50_000),
np.arange(900_000, 1_150_000, 50_000),
np.arange(900_000, 1_150_000, 50_000),
)
]
# genomic region_2
pixels_2 = [
{
"chrom": tup[0],
"start_1": tup[1],
"start_2": tup[2],
"start_3": tup[3],
"count": np.random.randint(0, 10),
}
for tup in product(
["chr2"],
np.arange(900_000, 1_150_000, 50_000),
np.arange(900_000, 1_150_000, 50_000),
np.arange(900_000, 1_150_000, 50_000),
)
]
return pd.concat((pd.DataFrame(pixels_1), pd.DataFrame(pixels_2)))
Then we define the target regions we are interested in.
target_regions = pd.DataFrame(
{
"chrom": ["chr1", "chr2"],
"start": [900_000, 900_000],
"end": [1_100_000, 1_100_000],
}
)
We are then interested in selecting all contacts that are contained within these pixels and then calculate the distance to them. The selection step can be done with the Overlap
class that we described above. The distance transformation can be done with the DistanceTransformation
query step. This query step takes an instance of genomic data that contains regions (as defined by it's schema) and calculates the distance to all position columns. All distances are calculated with regards to the center of each assigned region. Since genomic positions are defined by a start and end,the DistanceTransformation
query step has a DistanceMode
parameter that defines whether we would like to calculate the distance with regard to the start of a genomic position, the end or it's center.
query_steps = [
Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
DistanceTransformation(
distance_mode=DistanceMode.LEFT,
),
]
We can then execute this query plan using the Query class. This well add an distance column to the genomic dataset returned.
chrom_1 | chrom_2 | chrom_3 | region_chrom | distance_1 | distance_2 | distance_3 | |
---|---|---|---|---|---|---|---|
0 | chr1 | chr1 | chr1 | chr1 | -100000.0 | -100000.0 | -100000.0 |
1 | chr1 | chr1 | chr1 | chr1 | -100000.0 | -100000.0 | -50000.0 |
2 | chr1 | chr1 | chr1 | chr1 | -100000.0 | -100000.0 | 0.0 |
3 | chr1 | chr1 | chr1 | chr1 | -100000.0 | -100000.0 | 50000.0 |
4 | chr1 | chr1 | chr1 | chr1 | -100000.0 | -100000.0 | 100000.0 |
... | ... | ... | ... | ... | ... | ... | ... |
245 | chr2 | chr2 | chr2 | chr2 | 100000.0 | 100000.0 | -100000.0 |
246 | chr2 | chr2 | chr2 | chr2 | 100000.0 | 100000.0 | -50000.0 |
247 | chr2 | chr2 | chr2 | chr2 | 100000.0 | 100000.0 | 0.0 |
248 | chr2 | chr2 | chr2 | chr2 | 100000.0 | 100000.0 | 50000.0 |
249 | chr2 | chr2 | chr2 | chr2 | 100000.0 | 100000.0 | 100000.0 |
250 rows × 7 columns
Aggregating genomic data based on it's distance to a target region
In this example, we extend the above use-case to aggregate the results based on the distance columns added. This is a common use-case to calculate aggregate statistics for different distance levels. To achieve this, we employ the same query plan as above and extend it using the DistanceAggregation
query step.
The DistanceAggregation
class requires the following parameters:
- value_columns
: Thie specifies the value to aggregate
- function
: The aggregation function to use. This is the enumerated type AggregationFunction
- densify_output
: Whether missing distance values should be filled with empty values (specific empty value depends on the aggregation function)
Note that there are two different average functions available, AVG
and AVG_WITH_EMPTY
. AVG
performs and average over all available columns, where as AVG_WITH_EMPTY
counts missing distances per regions as 0.
query_steps = [
Overlap(target_regions, anchor_mode=Anchor(mode="ALL")),
DistanceTransformation(),
DistanceAggregation(
value_column='count',
function=AggregationFunction.AVG,
),
]
distance_1 | distance_2 | distance_3 | count | |
---|---|---|---|---|
0 | -100000.0 | -100000.0 | -100000.0 | 4.5 |
1 | -100000.0 | -100000.0 | -50000.0 | 3.0 |
2 | -100000.0 | -100000.0 | 0.0 | 5.5 |
3 | -100000.0 | -100000.0 | 50000.0 | 5.0 |
4 | -100000.0 | -100000.0 | 100000.0 | 6.0 |
... | ... | ... | ... | ... |
120 | 100000.0 | 100000.0 | -100000.0 | 8.0 |
121 | 100000.0 | 100000.0 | -50000.0 | 4.5 |
122 | 100000.0 | 100000.0 | 0.0 | 4.5 |
123 | 100000.0 | 100000.0 | 50000.0 | 4.5 |
124 | 100000.0 | 100000.0 | 100000.0 | 0.0 |
125 rows × 4 columns
In addition, we can also aggregate on a subset of distance positions, using the position_list
parameter:
query_steps = [
Overlap(target_regions, anchor_mode=Anchor(mode="ALL")),
DistanceTransformation(),
DistanceAggregation(
value_column='count',
function=AggregationFunction.AVG,
position_list=[1,2]
),
]
distance_1 | distance_2 | count | |
---|---|---|---|
0 | -100000.0 | -100000.0 | 4.8 |
1 | -100000.0 | -50000.0 | 4.5 |
2 | -100000.0 | 0.0 | 5.3 |
3 | -100000.0 | 50000.0 | 4.7 |
4 | -100000.0 | 100000.0 | 5.3 |
5 | -50000.0 | -100000.0 | 4.8 |
6 | -50000.0 | -50000.0 | 4.4 |
7 | -50000.0 | 0.0 | 4.5 |
8 | -50000.0 | 50000.0 | 5.4 |
9 | -50000.0 | 100000.0 | 3.4 |
10 | 0.0 | -100000.0 | 2.0 |
11 | 0.0 | -50000.0 | 3.5 |
12 | 0.0 | 0.0 | 4.4 |
13 | 0.0 | 50000.0 | 5.4 |
14 | 0.0 | 100000.0 | 4.3 |
15 | 50000.0 | -100000.0 | 5.3 |
16 | 50000.0 | -50000.0 | 4.7 |
17 | 50000.0 | 0.0 | 4.0 |
18 | 50000.0 | 50000.0 | 4.2 |
19 | 50000.0 | 100000.0 | 6.1 |
20 | 100000.0 | -100000.0 | 5.4 |
21 | 100000.0 | -50000.0 | 2.8 |
22 | 100000.0 | 0.0 | 3.6 |
23 | 100000.0 | 50000.0 | 5.2 |
24 | 100000.0 | 100000.0 | 4.3 |