
Spoc data structures

This notebook explains the data structures that are available within spoc and shows how they relate to each other. On a high level, spoc provides data structures for all parts of the transformation pipeline from raw reads to aggregated pixels.

Often, these data structures (with the exception of the pixels class) will not be used in everyday analysis tasks, but rather within analysis pipelines.

Data frame schemas

Spoc data structures are wrappers around tabular data containers such as pandas.DataFrame or dask.dataframe.DataFrame. To ensure that the underlying data complies with the format that spoc expects, spoc implements dataframe validation using pandera. The underlying schemas reside in the spoc.dataframe_models module.

I/O

Reading and writing of spoc data structures is managed by the spoc.io package, specifically by the FileManager class. Examples of using the FileManager can be found with the specific data structure.

Fragments

Fragments encapsulate a data structure that can hold a dynamic number of aligned fragments per sequencing unit. In a Pore-C experiment, a sequencing unit is the sequencing read, which holds multiple fragments per read. In principle, this structure can also be used for other experiment types that generate aligned fragments grouped together by an id, for example SPRITE.

Reading fragments using FileManager

from spoc.io import FileManager
fragments = FileManager().load_fragments("../tests/test_files/good_porec.parquet")

The Fragments class has a data accessor for the underlying fragments:

fragments.data.head()
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter
0 chr1 1 4 True dummy 1 4 1 1 1 1 True
1 chr1 2 5 True dummy 2 5 1 2 2 2 True
2 chr1 3 6 True dummy 3 6 1 3 3 3 True

The Fragments class constructor validates the underlying data structure using pandera and the dataframe schemas in spoc.dataframe_models.

from pandera.errors import SchemaError
try:
    FileManager().load_fragments("../tests/test_files/bad_porec.parquet")
except SchemaError as e:
    print(str(e).split("\n")[0])

The Fragments class also supports reading data as a dask dataframe.

from spoc.io import DataMode
fragments = FileManager(DataMode.DASK).load_fragments("../tests/test_files/good_porec.parquet")
fragments.data
Dask DataFrame Structure:
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter
npartitions=1
object int64 int64 bool object int64 int64 int64 int64 int64 int64 bool
... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: validate, 2 graph layers

Note that when reading as a dask dataframe, schema validation is deferred until the dask task graph is evaluated.

fragments = FileManager(DataMode.DASK).load_fragments("../tests/test_files/bad_porec.parquet")
try:
    fragments.data.compute()
except SchemaError as e:
    print(str(e).split("\n")[0])
column 'chrom' not in dataframe

Annotating fragments

Fragments can carry metadata that adds additional information, which can be propagated through the analysis pipeline. FragmentAnnotator uses a dictionary called a label library, which maps compound fragment ids to metainformation, to annotate fragments. These ids are concatenations of the read_id, chromosome, start and end of the mapping.
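For illustration, the compound ids could be built roughly like this (a sketch of the id convention only; spoc's own construction may differ in detail):

```python
import pandas as pd

# Sketch: build compound fragment ids of the form
# "<read_name>_<chrom>_<start>_<end>", matching the label library keys below.
fragments = pd.DataFrame({
    "read_name": ["dummy", "dummy"],
    "chrom": ["chr1", "chr1"],
    "start": [1, 2],
    "end": [4, 5],
})
compound_ids = (
    fragments["read_name"] + "_" + fragments["chrom"] + "_"
    + fragments["start"].astype(str) + "_" + fragments["end"].astype(str)
)
print(compound_ids.tolist())  # ['dummy_chr1_1_4', 'dummy_chr1_2_5']
```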

fragments = FileManager().load_fragments("../tests/test_files/good_porec.parquet")
label_library = FileManager().load_label_library("../tests/test_files/ll1.pickle")
label_library
{'dummy_chr1_1_4': True, 'dummy_chr1_2_5': False}
from spoc.fragments import FragmentAnnotator
annotated_fragments = FragmentAnnotator(label_library).annotate_fragments(fragments)
annotated_fragments.data.head()
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter metadata
0 chr1 1 4 True dummy 1 4 1 1 1 1 True SisterB
1 chr1 2 5 True dummy 2 5 1 2 2 2 True SisterA

Contacts

While the fragment representation retains flexibility, it is often not practical to have contacts of multiple orders and types in different rows of the same file. To this end, we employ the contact representation, where each row contains one contact of a defined order, e.g. a duplet or a triplet. The Contacts class is a generic wrapper around the data structure that holds this representation and can represent contacts of different orders. The class that creates contacts from fragments is called FragmentExpander, which can generate contacts of arbitrary order.
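Conceptually, expansion enumerates all combinations of a given size among the fragments of each read. A sketch of the idea using itertools (not spoc's implementation):

```python
from itertools import combinations

# Fragments of a single read, identified here by their start coordinate:
read_fragments = [1, 2, 3, 4]

# Contacts of order 2 are all unordered pairs of fragments within the read:
duplets = list(combinations(read_fragments, 2))
print(duplets)  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```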

import pandas as pd
from spoc.fragments import FragmentExpander
fragments = FileManager().load_fragments("../tests/test_files/fragments_unlabelled.parquet")
fragments.data.head()
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter
0 chr1 1 4 True dummy 1 4 1 1 1 1 True
1 chr1 2 5 True dummy 2 5 1 2 2 2 True
2 chr1 3 6 True dummy 3 6 1 3 3 3 True
3 chr1 4 7 True dummy 4 7 1 4 4 4 True
4 chr1 5 8 True dummy2 5 8 1 5 5 5 True
contacts = FragmentExpander(number_fragments=2).expand(fragments)
contacts.data.head()
read_name read_length chrom_1 start_1 end_1 mapping_quality_1 align_score_1 align_base_qscore_1 chrom_2 start_2 end_2 mapping_quality_2 align_score_2 align_base_qscore_2
0 dummy 1 chr1 1 4 1 1 1 chr1 2 5 2 2 2
1 dummy 1 chr1 1 4 1 1 1 chr1 3 6 3 3 3
2 dummy 1 chr1 1 4 1 1 1 chr1 4 7 4 4 4
3 dummy 1 chr1 2 5 2 2 2 chr1 3 6 3 3 3
4 dummy 1 chr1 2 5 2 2 2 chr1 4 7 4 4 4

FragmentExpander also propagates metadata that is associated with fragments:

fragments_labelled = FileManager().load_fragments("../tests/test_files/fragments_labelled.parquet")
contacts_labelled = FragmentExpander(number_fragments=2).expand(fragments_labelled)
contacts_labelled.data.head()
read_name read_length chrom_1 start_1 end_1 mapping_quality_1 align_score_1 align_base_qscore_1 metadata_1 chrom_2 start_2 end_2 mapping_quality_2 align_score_2 align_base_qscore_2 metadata_2
0 dummy 1 chr1 1 4 1 1 1 SisterA chr1 2 5 2 2 2 SisterB
1 dummy 1 chr1 1 4 1 1 1 SisterA chr1 3 6 3 3 3 SisterA
2 dummy 1 chr1 1 4 1 1 1 SisterA chr1 4 7 4 4 4 SisterB
3 dummy 1 chr1 2 5 2 2 2 SisterB chr1 3 6 3 3 3 SisterA
4 dummy 1 chr1 2 5 2 2 2 SisterB chr1 4 7 4 4 4 SisterB

The Contacts class retains the information as to whether the expanded contacts contain metadata:

contacts_labelled.contains_metadata
True

Persisting contacts

Contacts can be persisted using the file manager, which writes the data as well as the global parameters to a parquet file. Loading the file restores the global parameters.

contacts_labelled.get_global_parameters()
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)
FileManager().write_contacts("test.parquet", contacts_labelled)

The contacts file format is a directory containing different specific contact instances, such as contacts of different contact order. We can specify which of those to load by passing the contacts parameters:

contacts = FileManager().load_contacts("test.parquet", contacts_labelled.get_global_parameters())
contacts.get_global_parameters()
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)

Additionally, contacts can be loaded using a URI that specifies the parameters separated by ::. This method of loading requires the parameters path and number_fragments, but will match the rest to the available contacts.

from spoc.contacts import Contacts
loaded_contacts = Contacts.from_uri('test.parquet::2')
loaded_contacts.get_global_parameters()
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)

Symmetry

Unlabelled contacts

The quantification of genomic interactions in conventional (2-way) Hi-C assumes that there is no difference in the order of interactions: whether a genomic location is measured in the first or the second read of a paired-end sequencing experiment carries the same information. During preprocessing, conventional Hi-C data is therefore flipped based on some convention (often that the first read has the smaller genomic location according to a sort order) and only the upper triangular interaction matrix is stored.

When we move to higher-order genomic contacts, a similar reasoning applies (except for special use cases), and we can thus flip genomic contacts such that genomic coordinates increase monotonically from the first to the last contact position (we mean this operation when we refer to flipping below). This produces a symmetric, high-dimensional tensor, meaning that permuting its dimensions does not change the associated values.
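For unlabelled contacts, flipping amounts to sorting the coordinates within each contact row. A minimal sketch of the idea (real flipping moves entire fragments, i.e. start/end pairs and their associated columns, together; here only the start coordinates are shown):

```python
import numpy as np
import pandas as pd

# Three 2-way contacts; only read2 and read3 violate start_1 <= start_2.
contacts = pd.DataFrame({
    "start_1": [100, 2000, 3000],
    "start_2": [1000, 200, 300],
})

# Sort coordinates within each row so the representation is order-invariant.
flipped = pd.DataFrame(
    np.sort(contacts[["start_1", "start_2"]].to_numpy(), axis=1),
    columns=["start_1", "start_2"],
)
print(flipped["start_1"].tolist())  # [100, 200, 300]
```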

In spoc, this logic is implemented in the ContactManipulator class, in the .flip_symmetric_contacts method.

from spoc.contacts import Contacts
contacts = Contacts.from_uri("../tests/test_files/contacts_unlabelled_2d_v2.parquet::2")
contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2
0 read1 100 200 1000 2000
1 read2 2000 3000 200 300
2 read3 3000 4000 300 400

This particular contacts dataframe has one contact that conforms with the convention that the first contact should be smaller than the second (read1), whereas the other two contacts don't conform with that convention. Using the .flip_symmetric_contacts method we can fix this:

from spoc.contacts import ContactManipulator
flipped_contacts = ContactManipulator().flip_symmetric_contacts(contacts)
flipped_contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2
0 read1 100 200 1000 2000
1 read2 200 300 2000 3000
2 read3 300 400 3000 4000

Symmetry-flipped contacts have the flag symmetry_flipped set to True:

flipped_contacts.symmetry_flipped
True

These operations are available for arbitrary contact cardinalities:

contacts = Contacts.from_uri("../tests/test_files/contacts_unlabelled_3d_v2.parquet::3")
contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2 start_3 end_3
0 read1 100 200 1000 2000 250 300
1 read2 2000 3000 200 300 400 500
2 read3 3000 4000 300 400 100 200
flipped_contacts = ContactManipulator().flip_symmetric_contacts(contacts)
flipped_contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2 start_3 end_3
0 read1 100 200 250 300 1000 2000
1 read2 200 300 400 500 2000 3000
2 read3 100 200 300 400 3000 4000

Labelled contacts

For labelled contacts, the situation is more complex, as we have to deal with different flavours of symmetry. If we take the example of triplets, that is, genomic contacts of order 3, with binary labels (denoted A or B), there are 8 possible label combinations:

 (AAA, BBB, BAA, ABA, AAB, ABB, BAB, BBA)

If we extend the argument that order of interactions is unimportant, this reduces to 4 possible arrangements of labels:

 (AAA, BBB, ABB, BAA)

It is often the case for binary labels that the specific label type is unimportant; the only important information is whether the labels were different (e.g. for sister-specific labels or homologous chromosome labels). In such a situation, the possible arrangements reduce further to 2 label types:

 (AAA, AAB)
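The reduction from 8 ordered label combinations to 4 unordered arrangements, and to 2 when A and B are interchangeable, can be verified in a few lines. This sketch uses alphabetically sorted representatives (e.g. AAB rather than BAA):

```python
from itertools import product

# All ordered binary label combinations for triplets: 2**3 = 8
ordered = ["".join(c) for c in product("AB", repeat=3)]
assert len(ordered) == 8

# If the order of fragments is unimportant, sort labels within each contact:
unordered = sorted({"".join(sorted(c)) for c in ordered})
print(unordered)  # ['AAA', 'AAB', 'ABB', 'BBB'] -> 4 arrangements

# If A and B are interchangeable, map each state to the alphabetically
# first of itself and its A<->B-swapped (re-sorted) counterpart:
swap = str.maketrans("AB", "BA")
equated = sorted({min(s, "".join(sorted(s.translate(swap)))) for s in unordered})
print(equated)  # ['AAA', 'AAB'] -> 2 label types
```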

For those contact types, two different “rules” of symmetry apply. For the situation of all-equal labels (AAA or BBB), the same rules apply as for unlabelled contacts, since there is no difference in order. For the other situation (ABB or BBA, which we denote as ABB from now on), only permutations within one label type produce the same value. If we have the label state ABB and denote permutations as tuples of length three with (0,1,2) being the identity permutation, then only the permutations (0,1,2) and (0,2,1) are equivalent. Practically, this means that we can flip contacts that are related by these permutations such that their genomic coordinates are monotonically increasing, but we cannot do this for contacts that are not related through a symmetry relation.

This reasoning generalizes to higher orders, where contacts can be flipped if they are related via a symmetry relation. Practically speaking, if we have a higher-order contact of order n with two possible label states, with k labels of type A and (n-k) labels of type B, we can flip fragments within type A and within type B. For example, if we have a contact of order 4 with the configuration AABB, we can flip within the As and within the Bs based on genomic coordinates, but not between them. The same reasoning applies when there is more than one possible label state: we can flip within one label state, but not between them.
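As a sketch of the within-label-group rule: for a label-sorted contact with state AAB, only the coordinates of the two A fragments may be sorted among themselves, while the B fragment keeps its position. A simplified illustration (not spoc's implementation; only the start coordinates are shown):

```python
import numpy as np
import pandas as pd

# One contact of order 3 with label state (A, A, B): positions 1 and 2
# carry label A and may be swapped; position 3 (label B) must stay put.
contact = pd.DataFrame({"start_1": [5000], "start_2": [2000], "start_3": [200]})

# Sort coordinates only within the A group (columns start_1 and start_2):
a_group = ["start_1", "start_2"]
contact[a_group] = np.sort(contact[a_group].to_numpy(), axis=1)
print(contact.iloc[0].tolist())  # [2000, 5000, 200]
```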

This logic is implemented in spoc in the ContactManipulator class, with two methods:

  • The .equate_binary_labels method can be used to specify whether, in a binary label situation, the specific label values matter (e.g. whether AA is equivalent to BB). This method is optional and can be applied prior to the flipping procedure.
  • The .flip_symmetric_contacts method flips symmetric contacts based on the rules specified above.

Note that all symmetry operations require the contact metadata to be alphabetically sorted. This can either be done explicitly via the .sort_labels method, or it is performed automatically within the other methods. For example, the metadata order ABA will be converted to AAB by the .sort_labels operation.

Let's look at an example! Here, we have 3d contacts with binary labels:

contacts = Contacts.from_uri("../tests/test_files/contacts_labelled_3d_v2.parquet::3")
contacts.data.head().filter(regex="(read_name|start|end|metadata)")
read_name start_1 end_1 metadata_1 start_2 end_2 metadata_2 start_3 end_3 metadata_3
0 read1 100 200 A 200 300 B 1000 2000 B
1 read2 5000 5500 A 2000 3000 A 200 300 B
2 read3 800 900 B 3000 3200 B 3000 4000 B
contacts.number_fragments, contacts.get_label_values()
(3, {'A', 'B'})

This dataframe contains three contacts with the following label states:

  • (ABB)
  • (AAB)
  • (BBB)
In this analysis use case, we want to equate the binary labels since we don't have a biological reason to believe that there is any difference between the labels. The only information that is important for us is whether the contacts happened between different label states or the same label state. Therefore, we use the .equate_binary_labels method to replace all occurrences of the same contact combination with their alphabetically first representative. For example, the label state (ABB) is a contact where two parts come from one label state and one part comes from another. In our logic, this is equivalent to (AAB), which is the alphabetically first label state and therefore replaces it. Following the same logic, (BBB) will be replaced by (AAA).

equated_contacts = ContactManipulator().equate_binary_labels(contacts)
equated_contacts.data.head().filter(regex="(read_name|start|end|metadata)")
read_name start_1 end_1 metadata_1 start_2 end_2 metadata_2 start_3 end_3 metadata_3
0 read1 100 200 A 200 300 A 1000 2000 B
1 read2 5000 5500 A 2000 3000 A 200 300 B
2 read3 800 900 A 3000 3200 A 3000 4000 A

As you can see, the occurrence of (ABB) has been replaced by (AAB) and the occurrence of (BBB) has been replaced by (AAA). We can now flip these contacts based on the symmetry logic explained above using the .flip_symmetric_contacts method.

flipped_labelled_contacts = ContactManipulator().flip_symmetric_contacts(equated_contacts)
flipped_labelled_contacts.data.head().filter(regex="(read_name|start|end|metadata)")
read_name start_1 end_1 metadata_1 start_2 end_2 metadata_2 start_3 end_3 metadata_3
0 read1 100 200 A 200 300 A 1000 2000 B
1 read2 2000 3000 A 5000 5500 A 200 300 B
2 read3 800 900 A 3000 3200 A 3000 4000 A

Pixels

Concept

Pixels represent aggregated information that counts the number of genomic contacts per genomic bin. In this context, a genomic bin is a genomic interval of fixed size (for example 10 kb), and a genomic contact is defined as above as an interaction between the corresponding bins. Pixels represent a single contact order, binsize and labelling state (if metadata is provided) and are intended as the central data structure that an analyst uses to interact with multiway genomic data to answer questions about genomic structure.
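The bin assignment itself is simple integer arithmetic: a genomic position maps to the start of its enclosing fixed-size bin. A sketch of the concept (not spoc's implementation):

```python
# Sketch: assign genomic positions to fixed-size bins by flooring to the
# nearest bin boundary.
def bin_start(position: int, bin_size: int) -> int:
    return (position // bin_size) * bin_size

print(bin_start(100_010, 100_000))      # 100000
print(bin_start(25_000_800, 100_000))   # 25000000
```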

Implementation

Within spoc, pixel instances can be generated from genomic contacts through the GenomicBinner class.

from spoc.pixels import GenomicBinner
contacts = Contacts.from_uri("../tests/test_files/contacts_for_pixels_3d_v2.parquet::3")
contacts.data.head().filter(regex="(read_name|chrom|start|end|metadata)")
read_name chrom_1 start_1 end_1 metadata_1 chrom_2 start_2 end_2 metadata_2 chrom_3 start_3 end_3 metadata_3
0 a chr1 100010 100015 SisterA chr1 500010 500050 SisterB chr1 600100 600200 SisterB
1 b chr1 5000010 5000050 SisterB chr1 7000050 7000070 SisterA chr4 2000300 2000400 SisterA
2 c chr1 10000010 10000050 SisterA chr1 25000800 25000900 SisterB chr1 6000050 6000600 SisterA
3 d chr1 10000010 10000050 SisterA chr1 25001000 25002000 SisterA chr1 6000010 6000700 SisterB

GenomicBinner takes a binsize and aggregates contacts by counting the number of contacts that fall within each genomic bin. As with all other preprocessing functionality, GenomicBinner accepts either pandas dataframes or dask dataframes. The output is consistent with the input: passing a pandas dataframe returns a pandas dataframe, and passing a dask dataframe returns a dask dataframe.

pixels = GenomicBinner(bin_size=100_000).bin_contacts(contacts)

The default behavior of GenomicBinner is to filter for contacts that are on the same chromosome and produce an intrachromosomal pixels instance. This allows a compact representation of pixels that stores only one chromosome field per pixel.

pixels.data.head()
chrom start_1 start_2 start_3 count
0 chr1 100000 500000 600000 1
1 chr1 10000000 25000000 6000000 2

The pixels class additionally carries all information associated with the stored data. This information is taken from the contacts object passed to GenomicBinner:

pixels.number_fragments, pixels.binsize, pixels.binary_labels_equal, pixels.symmetry_flipped, pixels.metadata_combi
(3, 100000, False, False, None)

If interchromosomal contacts are needed, this can be specified via the .bin_contacts method of GenomicBinner, which causes the pixel schema to incorporate a chromosome column for each pixel dimension.

pixels_w_inter = GenomicBinner(bin_size=100_000).bin_contacts(contacts, same_chromosome=False)
pixels_w_inter.data.head()
chrom_1 start_1 chrom_2 start_2 chrom_3 start_3 count
0 chr1 100000 chr1 500000 chr1 600000 1
1 chr1 10000000 chr1 25000000 chr1 6000000 2
2 chr1 5000000 chr1 7000000 chr4 2000000 1

Labelled contacts

The concept of pixels representing a single labelling state simplifies the interfaces of downstream processing functionality, but offloads responsibility to the analyst to ensure that the biological question at hand is adequately addressed by the chosen labelling state. The pixels class does not perform any labelling-state checks and forwards the combination(s) of labelling states present in the contacts class used for their construction. This allows for greater flexibility with regard to the questions that can be answered.

The ContactManipulator class contains functionality to filter contacts for a given labelling state before downstream aggregation into pixels:

contacts.data.head().filter(regex="(read_name|chrom|start|end|metadata)")
read_name chrom_1 start_1 end_1 metadata_1 chrom_2 start_2 end_2 metadata_2 chrom_3 start_3 end_3 metadata_3
0 a chr1 100010 100015 SisterA chr1 500010 500050 SisterB chr1 600100 600200 SisterB
1 b chr1 5000010 5000050 SisterB chr1 7000050 7000070 SisterA chr4 2000300 2000400 SisterA
2 c chr1 10000010 10000050 SisterA chr1 25000800 25000900 SisterB chr1 6000050 6000600 SisterA
3 d chr1 10000010 10000050 SisterA chr1 25001000 25002000 SisterA chr1 6000010 6000700 SisterB
filtered_contacts = ContactManipulator().subset_on_metadata(contacts, metadata_combi=['SisterA', 'SisterB', 'SisterB'])
filtered_contacts.data.head().filter(regex="(read_name|chrom|start|end|metadata)")
read_name chrom_1 start_1 end_1 metadata_1 chrom_2 start_2 end_2 metadata_2 chrom_3 start_3 end_3 metadata_3
0 a chr1 100010 100015 SisterA chr1 500010 500050 SisterB chr1 600100 600200 SisterB

The resulting contacts carry the filtered metadata combination as an attribute:

filtered_contacts.metadata_combi
['SisterA', 'SisterB', 'SisterB']

GenomicBinner carries this information forward and adds it to the pixels object

pixels_filtered = GenomicBinner(bin_size=100_000).bin_contacts(filtered_contacts)
pixels_filtered.data.head()
chrom start_1 start_2 start_3 count
0 chr1 100000 500000 600000 1
pixels_filtered.metadata_combi
['SisterA', 'SisterB', 'SisterB']

Persisting pixels

Pixels come in a multitude of different flavors, and it would thus be cumbersome to keep track of pixels of different binsize, labelling state or contact order that belong to the same source fragments. We therefore developed a pixel file format that exposes a unified name for pixels that belong together, as well as functionality to load exactly the pixels that we want.

The pixel file format

The pixel file format is a directory that contains a number of parquet files as well as a config file that holds information about which specific pixels are available. As with the other I/O operations, the FileManager object is responsible for saving pixels.

import os
FileManager().write_pixels("test_pixels.parquet", pixels)
os.listdir('test_pixels.parquet')
['5f0e4d6a6c0e5afb82c3a6ec4bce1635.parquet', 'metadata.json']
  • The pixel folder at test_pixels.parquet contains a single specific pixel file as well as a metadata file

The FileManager class contains functionality to read the metadata file and list available pixels

FileManager().list_pixels("test_pixels.parquet")
[PixelParameters(number_fragments=3, binsize=100000, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)]
  • The list_pixels method lists all available pixels in the form of PixelParameters, a pydantic dataclass that holds the parameters that describe a specific pixel file

Writing additional pixels updates the metadata file

pixels_filtered._metadata_combi = ['A', 'B', 'B'] # make metadata shorter
FileManager().write_pixels("test_pixels.parquet", pixels_filtered)
FileManager().list_pixels("test_pixels.parquet")
[PixelParameters(number_fragments=3, binsize=100000, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True),
 PixelParameters(number_fragments=3, binsize=100000, metadata_combi=['A', 'B', 'B'], label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)]

Loading pixels

Pixel files can be loaded by passing the PixelParameters pydantic class:

from spoc.file_parameter_models import PixelParameters
loaded_pixels = FileManager().load_pixels("test_pixels.parquet", PixelParameters(number_fragments=3, binsize=100_000, same_chromosome=True))

Additionally, pixels can be loaded using a URI that specifies the parameters separated by ::. This method of loading requires the parameters path, number_fragments and binsize, but will match the rest to the available pixels.

from spoc.pixels import Pixels
loaded_pixels = Pixels.from_uri('test_pixels.parquet::3::100000::ABB')
loaded_pixels.get_global_parameters()
PixelParameters(number_fragments=3, binsize=100000, metadata_combi=['A', 'B', 'B'], label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)

If the specified URI is not specific enough and multiple pixels match, an error is raised.
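Conceptually, URI resolution matches the given parameters against all available entries and fails if more than one remains. A pure-Python sketch of this matching logic (not spoc's implementation; the parameter entries and error type are illustrative):

```python
# Sketch: resolve a partial parameter specification against available
# pixel entries; ambiguous specifications raise an error.
available = [
    {"number_fragments": 3, "binsize": 100000, "metadata_combi": None},
    {"number_fragments": 3, "binsize": 100000, "metadata_combi": ("A", "B", "B")},
]

def resolve(spec: dict) -> dict:
    matches = [p for p in available if all(p[k] == v for k, v in spec.items())]
    if len(matches) > 1:
        raise ValueError(f"{len(matches)} pixel files match the given URI")
    if not matches:
        raise ValueError("No pixel file matches the given URI")
    return matches[0]

# Unambiguous: the metadata combination narrows the match to one entry
print(resolve({"number_fragments": 3, "metadata_combi": ("A", "B", "B")}))

# Ambiguous: two entries share order 3 and binsize 100000
try:
    resolve({"number_fragments": 3, "binsize": 100000})
except ValueError as e:
    print(e)  # 2 pixel files match the given URI
```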