Skip to content

Query engine

This technical document describes the spoc query engine, a set of classes that implements spoc's interface for querying multi-dimensional genomic data.

Principles

Composable pieces

Spoc's query engine consists of composable pieces that can be combined to produce an expressive query language. These pieces represent basic operations on genomic data that are easily implemented and understood on their own. This allows a great degree of flexibility, while also allowing predefined recipes that less experienced users can get started with.

Lazy evaluation

The spoc query engine is designed with lazy evaluation as a guiding principle. This means that data queries are only executed when they are needed to minimize loading data into memory and computational overhead. To enable this, spoc queries have a construction phase, which specifies the operations to be executed and an execution phase, that actually executes the query.

Query plans and query steps

The most important ingredient in this query language is a class that implements the QueryStep protocol. This protocol serves two purposes:

  • It exposes a way to validate the data schema during query building
  • It implements adding itself to a query

This way, query steps can be combined into a query plan that specifies the analysis to be executed. In generaly, spoc supports the following types of query steps:

  • Overlap: Overlaps genomic data with a set of genomic intervals
  • Aggregation: Aggregates the data passed into the query using an aggregation function
  • Transform: Adds columns to query based on a fixed or custom computation

Specific examples of query steps are:

  • DistanceTransformation: Example of a transfomration. Adds distance of genomic positions to regions added by Overlap
  • DistanceAggregation: Exaple of an aggregation. Aggregates the distances to genomic regions using an aggregation function.

Input and output of query steps

A query step takes as input a class that implements the GenomicData protocol. This protocol allows retrievel of the data schema (a thin wrapper over a pandera dataframe schema) as well as the data itself. The output of a query step is again a class that ipmlements the GenomicData protocol to allow composition. Specific examples of possible inputs are:

  • Pixels: Represents input pixels
  • Contacts: Represents input contacts
  • QueryPlan: The result of a query step

Composition of query steps

To allow specifying complex queries, query steps need to be combined. This is done using the Query class. It takes a query plan (a list of QueryStep instances) as input, exposes the build method, which takes input data, validates all query steps and adds them to the resulting QueryPlan instance that is returned.

Manifestation of results

So far, we have only talked about specifying the query to be executed, but not how to actually execute it. A QueryPlan has a compute() method that returns the manifested dataframe as a pd.DataFrame instance. This is the step that actually executes the specified query.

Examples

Selecting a subset of contacts at a single genomic position

In this example, we want to select a subset of genomic contacts at a single location. For this, we first load the required input data:

from spoc.query_engine import Overlap, Anchor, Query
from spoc.contacts import Contacts
import pandas as pd

contacts = Contacts.from_uri("../tests/test_files/contacts_unlabelled_2d_v2.parquet::2")

Then we specify a target region

target_region = pd.DataFrame({
    "chrom": ['chr1'],
    "start": [100],
    "end": [400],
})

First, we want to select all contacts where any of the fragments constituting the contact overlaps the target region. To perform this action, we use the Overlap class and pass the target region as well as an instance of the Anchor class. The Anchor dataclass allows us to specify how we want to filter contacts for region overlap. It has two attributes mode and anchors. Anchors indicates the positions we want to filter on (default is all positions) and mode specifies whether we require all positions to overlap or any position to overlap. So for example, if we want all of our two-way contacts for which any of the positions overlap, we would use Anchor(mode='ANY', anchors=[1,2]).

query_steps = [
    Overlap(target_region, anchor_mode=Anchor(mode="ANY", anchors=[1,2]))
]

A query plan is a list of qury steps that can be used in the basic query class

query = Query(query_steps=query_steps)

The .build method executes the query plan and retuns a QueryPlan object

result = query.build(contacts)
result
<spoc.query_engine.QueryPlan at 0x1804b711a00>

The .load_result method of the QueryResult object can be executed using .load_result, which returns a pd.DataFrame. The resulting dataframe has additional columns that represent the regions, with which the input contacts overlapped.

df = result.compute()
print(type(df))
df.filter(regex=r"chrom|start|end|id")
<class 'pandas.core.frame.DataFrame'>
chrom_1 start_1 end_1 chrom_2 start_2 end_2 region_chrom region_start region_end region_id
0 chr1 100 200 chr1 1000 2000 chr1 100 400 0
1 chr1 2000 3000 chr1 200 300 chr1 100 400 0
2 chr1 3000 4000 chr1 300 400 chr1 100 400 0

We can also restrict the positions to filter on, by passing different anchor parameters. For example, we can filter for contacts, where the first position overlaps with our target:

query_steps = [
    Overlap(target_region, anchor_mode=Anchor(mode="ANY", anchors=[1]))
]
Query(query_steps=query_steps)\
    .build(contacts)\
    .compute()\
    .filter(regex=r"chrom|start|end|id")
chrom_1 start_1 end_1 chrom_2 start_2 end_2 region_chrom region_start region_end region_id
0 chr1 100 200 chr1 1000 2000 chr1 100 400 0

This time, only the first contact overlaps.

The same functionality is implemented also for the Pixels class

Selecting a subset of contacts at multiple genomic regions

The Overlap class is also capable of selecting contacts at multiple genomic regions. Here, the behavior of Overlap deviates from a simple filter, because if a given contact overlaps with multiple regions, it will be returned multiple times.

Specify target regions

target_regions = pd.DataFrame({
    "chrom": ['chr1', 'chr1'],
    "start": [100, 150],
    "end": [400, 200],
})
query_steps = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY", anchors=[1]))
]
Query(query_steps=query_steps)\
    .build(contacts)\
    .compute()\
    .filter(regex=r"chrom|start|end|id")
chrom_1 start_1 end_1 chrom_2 start_2 end_2 region_chrom region_start region_end region_id
0 chr1 100 200 chr1 1000 2000 chr1 100 400 0
1 chr1 100 200 chr1 1000 2000 chr1 150 200 1

In this example, the contact overlapping both regions is duplicated.

The same functionality is implemented also for the pixels class.

Calculating the distance to a target region and aggregating the result

In this example, we calculate the distance of pixels to target regions and aggregate based on the distances. This is a very common use case in so-called pileup analyses, where we want to investigate the average behavior around regions of interest.

from spoc.pixels import Pixels
from spoc.query_engine import DistanceTransformation, DistanceMode
import pandas as pd
import numpy as np
from itertools import product

First we define a set of target pixels

def complete_synthetic_pixels():
    """Pixels that span two regions densely"""
    np.random.seed(42)
    # genomic region_1
    pixels_1 = [
        {
            "chrom": tup[0],
            "start_1": tup[1],
            "start_2": tup[2],
            "start_3": tup[3],
            "count": np.random.randint(0, 10),
        }
        for tup in product(
            ["chr1"],
            np.arange(900_000, 1_150_000, 50_000),
            np.arange(900_000, 1_150_000, 50_000),
            np.arange(900_000, 1_150_000, 50_000),
        )
    ]
    # genomic region_2
    pixels_2 = [
        {
            "chrom": tup[0],
            "start_1": tup[1],
            "start_2": tup[2],
            "start_3": tup[3],
            "count": np.random.randint(0, 10),
        }
        for tup in product(
            ["chr2"],
            np.arange(900_000, 1_150_000, 50_000),
            np.arange(900_000, 1_150_000, 50_000),
            np.arange(900_000, 1_150_000, 50_000),
        )
    ]
    return pd.concat((pd.DataFrame(pixels_1), pd.DataFrame(pixels_2)))
pixels = Pixels(complete_synthetic_pixels(), number_fragments=3, binsize=50_000)

Then we define the target regions we are interested in.

target_regions = pd.DataFrame(
        {
            "chrom": ["chr1", "chr2"],
            "start": [900_000, 900_000],
            "end": [1_100_000, 1_100_000],
        }
    )

We are then interested in selecting all contacts that are contained within these pixels and then calculate the distance to them. The selection step can be done with the Overlap class that we described above. The distance transformation can be done with the DistanceTransformation query step. This query step takes an instance of genomic data that contains regions (as defined by it's schema) and calculates the distance to all position columns. All distances are calculated with regards to the center of each assigned region. Since genomic positions are defined by a start and end,the DistanceTransformation query step has a DistanceMode parameter that defines whether we would like to calculate the distance with regard to the start of a genomic position, the end or it's center.

query_steps = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
    DistanceTransformation(
        distance_mode=DistanceMode.LEFT,
    ),
]

We can then execute this query plan using the Query class. This well add an distance column to the genomic dataset returned.

Query(query_steps=query_steps)\
    .build(pixels)\
    .compute()\
    .filter(regex=r"chrom|distance")
chrom_1 chrom_2 chrom_3 region_chrom distance_1 distance_2 distance_3
0 chr1 chr1 chr1 chr1 -100000.0 -100000.0 -100000.0
1 chr1 chr1 chr1 chr1 -100000.0 -100000.0 -50000.0
2 chr1 chr1 chr1 chr1 -100000.0 -100000.0 0.0
3 chr1 chr1 chr1 chr1 -100000.0 -100000.0 50000.0
4 chr1 chr1 chr1 chr1 -100000.0 -100000.0 100000.0
... ... ... ... ... ... ... ...
245 chr2 chr2 chr2 chr2 100000.0 100000.0 -100000.0
246 chr2 chr2 chr2 chr2 100000.0 100000.0 -50000.0
247 chr2 chr2 chr2 chr2 100000.0 100000.0 0.0
248 chr2 chr2 chr2 chr2 100000.0 100000.0 50000.0
249 chr2 chr2 chr2 chr2 100000.0 100000.0 100000.0

250 rows × 7 columns

Aggregating genomic data based on it's distance to a target region

In this example, we extend the above use-case to aggregate the results based on the distance columns added. This is a common use-case to calculate aggregate statistics for different distance levels. To achieve this, we employ the same query plan as above and extend it using the DistanceAggregation query step.

from spoc.query_engine import DistanceAggregation, AggregationFunction

The DistanceAggregation class requires the following parameters: - value_columns: Thie specifies the value to aggregate - function: The aggregation function to use. This is the enumerated type AggregationFunction - densify_output: Whether missing distance values should be filled with empty values (specific empty value depends on the aggregation function)

Note that there are two different average functions available, AVG and AVG_WITH_EMPTY. AVG performs and average over all available columns, where as AVG_WITH_EMPTY counts missing distances per regions as 0.

query_steps = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ALL")),
    DistanceTransformation(),
    DistanceAggregation(
        value_column='count',
        function=AggregationFunction.AVG,
    ),
]
Query(query_steps=query_steps)\
    .build(pixels)\
    .compute()
distance_1 distance_2 distance_3 count
0 -100000.0 -100000.0 -100000.0 4.5
1 -100000.0 -100000.0 -50000.0 3.0
2 -100000.0 -100000.0 0.0 5.5
3 -100000.0 -100000.0 50000.0 5.0
4 -100000.0 -100000.0 100000.0 6.0
... ... ... ... ...
120 100000.0 100000.0 -100000.0 8.0
121 100000.0 100000.0 -50000.0 4.5
122 100000.0 100000.0 0.0 4.5
123 100000.0 100000.0 50000.0 4.5
124 100000.0 100000.0 100000.0 0.0

125 rows × 4 columns

In addition, we can also aggregate on a subset of distance positions, using the position_list parameter:

query_steps = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ALL")),
    DistanceTransformation(),
    DistanceAggregation(
        value_column='count',
        function=AggregationFunction.AVG,
        position_list=[1,2]
    ),
]
Query(query_steps=query_steps)\
    .build(pixels)\
    .compute()
distance_1 distance_2 count
0 -100000.0 -100000.0 4.8
1 -100000.0 -50000.0 4.5
2 -100000.0 0.0 5.3
3 -100000.0 50000.0 4.7
4 -100000.0 100000.0 5.3
5 -50000.0 -100000.0 4.8
6 -50000.0 -50000.0 4.4
7 -50000.0 0.0 4.5
8 -50000.0 50000.0 5.4
9 -50000.0 100000.0 3.4
10 0.0 -100000.0 2.0
11 0.0 -50000.0 3.5
12 0.0 0.0 4.4
13 0.0 50000.0 5.4
14 0.0 100000.0 4.3
15 50000.0 -100000.0 5.3
16 50000.0 -50000.0 4.7
17 50000.0 0.0 4.0
18 50000.0 50000.0 4.2
19 50000.0 100000.0 6.1
20 100000.0 -100000.0 5.4
21 100000.0 -50000.0 2.8
22 100000.0 0.0 3.6
23 100000.0 50000.0 5.2
24 100000.0 100000.0 4.3