Spoc data structures
This notebook explains the data structures that are availabel within spoc and shows how they relate to each other. On a high level, spoc provides data structures for all parts of the transformation pipline from raw reads to aggregated pixels.
Often, these data structures (with the exception of the pixels class) will not be used within every day analysis tasks, but rather within analysis pipelines.
Data frame schemas
Spoc data structures are wrappers around tabular data containers such as panda.DataFrame
or dask.dataframe.DataFrame
. To ensure that the underlying data complies with the format that spoc expects, spoc implements dataframe validation using pandera
. The underlying schemas reside in the spoc.dataframe_models
file.
I/O
Reading and writing of spoc data structures is managed by the spoc.io
package, specifically by the FileManager
class. Examples of using the FileManager can be found with the specific data structure.
Fragments
Fragments encapsulate a data structure that can hold a dynamic number of aligned fragments per sequencing unit. In a Pore-C experiment, a sequencing unit is the sequencing read that holds multiple fragments per read. In theory, this structure can also be used for other experiment types that generate aligned fragments that are grouped together by an id, for exapmle SPRITE
Reading fragments using FileManager
Fragments class has data accessor for fragments
chrom | start | end | strand | read_name | read_start | read_end | read_length | mapping_quality | align_score | align_base_qscore | pass_filter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 1 | 4 | True | dummy | 1 | 4 | 1 | 1 | 1 | 1 | True |
1 | chr1 | 2 | 5 | True | dummy | 2 | 5 | 1 | 2 | 2 | 2 | True |
2 | chr1 | 3 | 6 | True | dummy | 3 | 6 | 1 | 3 | 3 | 3 | True |
The fragments class constructor validates the underlying data structure using pandera and the dataframe schemas in spoc.dataframe_models
try:
FileManager().load_fragments("../tests/test_files/bad_porec.parquet")
except SchemaError as e:
print(str(e).split("\n")[0])
Fragments class also supports reading as dask dataframe
from spoc.io import DataMode
fragments = FileManager(DataMode.DASK).load_fragments("../tests/test_files/good_porec.parquet")
chrom | start | end | strand | read_name | read_start | read_end | read_length | mapping_quality | align_score | align_base_qscore | pass_filter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
npartitions=1 | ||||||||||||
object | int64 | int64 | bool | object | int64 | int64 | int64 | int64 | int64 | int64 | bool | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Note that if reading from dask dataframes, schema evaluation is deferred until the dask taskgraph is evaluated
column 'chrom' not in dataframe
Annotating fragments
Fragments can carry metadata that add additional information, which can be propagated in the analysis pipeline. FragmentAnnotator
uses a dictionary called label library that contains compound fragment ids and metainformation to annotate fragments. These ids are concatenations of the read_id, chromosome, start and end of the mapping.
{'dummy_chr1_1_4': True, 'dummy_chr1_2_5': False}
chrom | start | end | strand | read_name | read_start | read_end | read_length | mapping_quality | align_score | align_base_qscore | pass_filter | metadata | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 1 | 4 | True | dummy | 1 | 4 | 1 | 1 | 1 | 1 | True | SisterB |
1 | chr1 | 2 | 5 | True | dummy | 2 | 5 | 1 | 2 | 2 | 2 | True | SisterA |
Contacts
While the fragment representation retains flexibility, it is often not practical to have contacts of multiple orders and types in different rows of the same file. To this end, we employ the contact representation, where each row contains one contact of a defined order, e.g. a duplet, or a triplet. The Contact
class is a wrapper around the data structure that holds this representation.
The Contacts
class is a generic interface that can represent different orders.
The class that creates contacts from fragments is called FragmentExpander
, which can be used to generate contacts of arbitrary order.
chrom | start | end | strand | read_name | read_start | read_end | read_length | mapping_quality | align_score | align_base_qscore | pass_filter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 1 | 4 | True | dummy | 1 | 4 | 1 | 1 | 1 | 1 | True |
1 | chr1 | 2 | 5 | True | dummy | 2 | 5 | 1 | 2 | 2 | 2 | True |
2 | chr1 | 3 | 6 | True | dummy | 3 | 6 | 1 | 3 | 3 | 3 | True |
3 | chr1 | 4 | 7 | True | dummy | 4 | 7 | 1 | 4 | 4 | 4 | True |
4 | chr1 | 5 | 8 | True | dummy2 | 5 | 8 | 1 | 5 | 5 | 5 | True |
read_name | read_length | chrom_1 | start_1 | end_1 | mapping_quality_1 | align_score_1 | align_base_qscore_1 | chrom_2 | start_2 | end_2 | mapping_quality_2 | align_score_2 | align_base_qscore_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | dummy | 1 | chr1 | 1 | 4 | 1 | 1 | 1 | chr1 | 2 | 5 | 2 | 2 | 2 |
1 | dummy | 1 | chr1 | 1 | 4 | 1 | 1 | 1 | chr1 | 3 | 6 | 3 | 3 | 3 |
2 | dummy | 1 | chr1 | 1 | 4 | 1 | 1 | 1 | chr1 | 4 | 7 | 4 | 4 | 4 |
3 | dummy | 1 | chr1 | 2 | 5 | 2 | 2 | 2 | chr1 | 3 | 6 | 3 | 3 | 3 |
4 | dummy | 1 | chr1 | 2 | 5 | 2 | 2 | 2 | chr1 | 4 | 7 | 4 | 4 | 4 |
Fragment expander also allows us to deal with metadata that is associated with fragments
read_name | read_length | chrom_1 | start_1 | end_1 | mapping_quality_1 | align_score_1 | align_base_qscore_1 | metadata_1 | chrom_2 | start_2 | end_2 | mapping_quality_2 | align_score_2 | align_base_qscore_2 | metadata_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | dummy | 1 | chr1 | 1 | 4 | 1 | 1 | 1 | SisterA | chr1 | 2 | 5 | 2 | 2 | 2 | SisterB |
1 | dummy | 1 | chr1 | 1 | 4 | 1 | 1 | 1 | SisterA | chr1 | 3 | 6 | 3 | 3 | 3 | SisterA |
2 | dummy | 1 | chr1 | 1 | 4 | 1 | 1 | 1 | SisterA | chr1 | 4 | 7 | 4 | 4 | 4 | SisterB |
3 | dummy | 1 | chr1 | 2 | 5 | 2 | 2 | 2 | SisterB | chr1 | 3 | 6 | 3 | 3 | 3 | SisterA |
4 | dummy | 1 | chr1 | 2 | 5 | 2 | 2 | 2 | SisterB | chr1 | 4 | 7 | 4 | 4 | 4 | SisterB |
The contact class retains the information as to whether the expanded contacts contain metadata
True
Persisting contacts
Contacts can be persisted using the file manager, which writes the data as well as the gobal parameters to a parquet file. Loading the file restores the global parameters.
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)
The contacts fileformat is a directory containing different specific contact instances such as contacts with different contact order etc. We can specify which of those to load by bassing the contacts parameters
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)
Additionally, contacts can be loaded using a URI that specifies the parameters seperated by ::
. This method of loading requires the parameters path and number_fragments but will match the rest to the available pixels.
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)
Symmetry
Unlabelled contacts
The quantification of genomic interactions in conventional (2-way) Hi-C assumes that there is no difference in the order of interactions. This means that whether a genomic location is measured in the first read or second read of a paired-end sequencing experiment carries the same information. This means that during preprocessing, conventional Hi-C data is flipped based on some convention (often that the first read has a smaller genomic location based on some sort order) and then only the upper triangular interaction matrix is stored.
When we talk about higher genomic order, a similar reasoning can apply (except for special use-cases) and we thus can flip genomic contacts such that genomic coordinates are monotonically increasing from lower to higher order (we mean this order if we refer to flipping below). This produces a symmetric, high-dimensional tensor, meaning that every permutation of dimensions does not change the associated value.
In spoc
, this logic is implemented in the ContactManipulator
class, in the .flip_symmetric_contacts
method.
read_name | start_1 | end_1 | start_2 | end_2 | |
---|---|---|---|---|---|
0 | read1 | 100 | 200 | 1000 | 2000 |
1 | read2 | 2000 | 3000 | 200 | 300 |
2 | read3 | 3000 | 4000 | 300 | 400 |
This particular contacts dataframe has one contact that conforms with the convention that the first contact should be smaller than the second (read1
), whereas the other two contacts don't conform with that convention. Using the .flip_symmetric_contacts
method we can fix this:
read_name | start_1 | end_1 | start_2 | end_2 | |
---|---|---|---|---|---|
0 | read1 | 100 | 200 | 1000 | 2000 |
1 | read2 | 200 | 300 | 2000 | 3000 |
2 | read3 | 300 | 400 | 3000 | 4000 |
Symmetry flipped contacts have the flag symmetry_flipped
set to true
True
These operations are available for arbitrary contact cardinalities:
read_name | start_1 | end_1 | start_2 | end_2 | start_3 | end_3 | |
---|---|---|---|---|---|---|---|
0 | read1 | 100 | 200 | 1000 | 2000 | 250 | 300 |
1 | read2 | 2000 | 3000 | 200 | 300 | 400 | 500 |
2 | read3 | 3000 | 4000 | 300 | 400 | 100 | 200 |
flipped_contacts = ContactManipulator().flip_symmetric_contacts(contacts)
flipped_contacts.data.head().filter(regex="(read_name|start|end)")
read_name | start_1 | end_1 | start_2 | end_2 | start_3 | end_3 | |
---|---|---|---|---|---|---|---|
0 | read1 | 100 | 200 | 250 | 300 | 1000 | 2000 |
1 | read2 | 200 | 300 | 400 | 500 | 2000 | 3000 |
2 | read3 | 100 | 200 | 300 | 400 | 3000 | 4000 |
Labelled contacts
For labelled contacts, the situation is more complex as we have to deal with different flavours of symmetry. If we take the example of triplets, that is genomic contacts of order 3, and binary labels (denoted as A or B), there are 8 possible contact orders:
If we extend the argument that order of interactions is unimportant, this reduces to 4 possible arrangements of labels:
It is often the case that for binary contacts, the specific label type is unimportant, the only important information is whether the labels where different (e.g. for sister specific labels or homologous chromosome labels). In such a situation, the possible arrangements reduce further to 2 possible label types:
For those contact types, two different “rules” for symmetry apply, for the situation of all similar labels (AAA or BBB), the same rules apply as for unlabelled contacts as there is no difference in order. For the other situation (ABB or BBA, which we denote as ABB from now on), only permutations within one label type produce the same value, meaning that if we have the label state ABB, and denote permutations as tuples of length three with (0,1,2) being the identity permutation then only the permutations (0,1,2) and (0,2,1) are identical. Practically, this means that we can flip contacts that are related by these permutations such that their genomic coordinates are monotonically increasing, but we cannot do this for contacts that are not related through a symmetry relation.
This reasoning can be generalized to higher dimensions, where contacts can be flipped if they can be related via a symmetry relation. Practically speaking, this means that we have a higher-order contact with two possible label states of order n with k labels of type A and (n-k) labels of type B, we can flip the labels within type A and within type B. For example, if we have a contact of order 4 with the configuration AABB, we can flip within A and within B based on genomic coordinates, but not within them. This reasoning also applies to the situation where we have more than one possible label state. Also here, we can flip within one label state, but not between them.
This logic is implemented in spoc
in the ContactManipulator
class, with to methods:
- The
.equate_binary_labels
method can be used to specify whether in a binary label situation, the labels shold be different or not (e.g. whether AA is equivaltent to BB) or not. This method is optional and can be used prior to the flipping procedure. - The
.flip_symmetric_contacts
method flips symmetric contacts base don the rules specified above
Note that all symmetry operations require the contact metadata to be alphabetically sorted, this can be either done explicitely via the .sort_labels
method, or is performed automatically within the other methods. For example, the metadat order of ABA
will be converted to AAB
by the .sort_labels
operation.
Let's look at an example! Here, we have 3d contacts that have binary labels
read_name | start_1 | end_1 | metadata_1 | start_2 | end_2 | metadata_2 | start_3 | end_3 | metadata_3 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | read1 | 100 | 200 | A | 200 | 300 | B | 1000 | 2000 | B |
1 | read2 | 5000 | 5500 | A | 2000 | 3000 | A | 200 | 300 | B |
2 | read3 | 800 | 900 | B | 3000 | 3200 | B | 3000 | 4000 | B |
(3, {'A', 'B'})
This dataframe contains three contacts, with the following label state:
In this analysis use case, we want to equate the binary labels since we don't have a biological reason to belive that there is any difference between the labels. The only information that is important for us is whether the contacts happened between different label states or the same label state. Therefore, we use the.equate_binary_labels
method to replace all occurences of the same contact combination with their alphabetically first example. For example, the label state (ABB)
is a contact, where two parts come from one label state and one part comes from another. In our logic, this equivalent to (AAB)
, which is the alphabetically first label, which we therefore use to replace it. Following the same logic, (BBB)
will be replaced by (AAA)
.
/users/michael.mitter/.conda/envs/spoc-dev/lib/python3.8/site-packages/pandas/core/frame.py:3607: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._set_item(key, value)
read_name | start_1 | end_1 | metadata_1 | start_2 | end_2 | metadata_2 | start_3 | end_3 | metadata_3 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | read1 | 100 | 200 | A | 200 | 300 | A | 1000 | 2000 | B |
1 | read2 | 5000 | 5500 | A | 2000 | 3000 | A | 200 | 300 | B |
2 | read3 | 800 | 900 | A | 3000 | 3200 | A | 3000 | 4000 | A |
As you can see, the occurence of (ABB)
has been replaced by (AAB)
and the occurence of (BBB)
has been replaced by (AAA)
. We can now reduce the symmetry of these contacts based on the logic explained above using the .flip_symmetric_contacts
method.
read_name | start_1 | end_1 | metadata_1 | start_2 | end_2 | metadata_2 | start_3 | end_3 | metadata_3 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | read1 | 100 | 200 | A | 200 | 300 | A | 1000 | 2000 | B |
1 | read2 | 2000 | 3000 | A | 5000 | 5500 | A | 200 | 300 | B |
2 | read3 | 800 | 900 | A | 3000 | 3200 | A | 3000 | 4000 | A |
Pixels
Concept
Pixels represent aggregated information that count the number of genomic contacts per genomic bin. In this context, a genomic bin is a genomic interval with a fixed size (for example 10 kb) and a genomic contact is defined as above as an interaction between the corresponding bins. Pixels represent a single contact order, binsize and labelling state (if metadata is provided) and are thought to be the central datastructure that an analyst will use to interact with multiway genomic data to answer questions about genomic structure.
Implementation
Within spoc
, pixel instances can be generated from genomic contacts through the GenomicBinner
class.
read_name | chrom_1 | start_1 | end_1 | metadata_1 | chrom_2 | start_2 | end_2 | metadata_2 | chrom_3 | start_3 | end_3 | metadata_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | a | chr1 | 100010 | 100015 | SisterA | chr1 | 500010 | 500050 | SisterB | chr1 | 600100 | 600200 | SisterB |
1 | b | chr1 | 5000010 | 5000050 | SisterB | chr1 | 7000050 | 7000070 | SisterA | chr4 | 2000300 | 2000400 | SisterA |
2 | c | chr1 | 10000010 | 10000050 | SisterA | chr1 | 25000800 | 25000900 | SisterB | chr1 | 6000050 | 6000600 | SisterA |
3 | d | chr1 | 10000010 | 10000050 | SisterA | chr1 | 25001000 | 25002000 | SisterA | chr1 | 6000010 | 6000700 | SisterB |
GenomicBinner
takes a binsize and aggregates contacts by counting the number of contacts that fall within a genomic bin. As with all the other preprocessing functionalities, GenomicBinner
can take arguments as either pandas dataframes or dask dataframes. The output is consistent, meaning that when passing a pandas dataframe, a pandas dataframe is returned and if passing a dask dataframe, a dask dataframe is returned.
The default behavior of genomic binner is to filter for contacts that are on the same chromosome and produce an intrachromosomal pixels instance. This behavior allows us to create a compact representation of pixels, which only store one chromosome field per pixel.
chrom | start_1 | start_2 | start_3 | count | |
---|---|---|---|---|---|
0 | chr1 | 100000 | 500000 | 600000 | 1 |
1 | chr1 | 10000000 | 25000000 | 6000000 | 2 |
The pixels class additionally contains all information associated with the data stored. This data is taken from the contacts object passed to genomic binner.
pixels.number_fragments, pixels.binsize, pixels.binary_labels_equal, pixels.symmetry_flipped, pixels.metadata_combi
(3, 100000, False, False, None)
If interchromosomal contacts are needed, this can be passed to the .bin_contacts
method of genomic binner and will cause the pixel schema to incorporate chromosome columns for each of the corresponding pixel dimensions.
chrom_1 | start_1 | chrom_2 | start_2 | chrom_3 | start_3 | count | |
---|---|---|---|---|---|---|---|
0 | chr1 | 100000 | chr1 | 500000 | chr1 | 600000 | 1 |
1 | chr1 | 10000000 | chr1 | 25000000 | chr1 | 6000000 | 2 |
2 | chr1 | 5000000 | chr1 | 7000000 | chr4 | 2000000 | 1 |
Lablled contacts
The concept of pixels representing a single labeling state allows simplification of the interfaces to downstream processing functionality, but offloads responsibiltiy to the analyst to ensure that the biological question at hand is adequately adressed by the chosen labelling state. The pixels class does not perform any labelling state checks and forwards the combination(s) of labelling states present in the contacts class used for their construction. This allows for greater flexibility with regards to the questions that can be answered.
The ContactManipulator
class contains functionality to filter contacts for a given labelling state to perform downstream aggregation of pixels.
read_name | chrom_1 | start_1 | end_1 | metadata_1 | chrom_2 | start_2 | end_2 | metadata_2 | chrom_3 | start_3 | end_3 | metadata_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | a | chr1 | 100010 | 100015 | SisterA | chr1 | 500010 | 500050 | SisterB | chr1 | 600100 | 600200 | SisterB |
1 | b | chr1 | 5000010 | 5000050 | SisterB | chr1 | 7000050 | 7000070 | SisterA | chr4 | 2000300 | 2000400 | SisterA |
2 | c | chr1 | 10000010 | 10000050 | SisterA | chr1 | 25000800 | 25000900 | SisterB | chr1 | 6000050 | 6000600 | SisterA |
3 | d | chr1 | 10000010 | 10000050 | SisterA | chr1 | 25001000 | 25002000 | SisterA | chr1 | 6000010 | 6000700 | SisterB |
filtered_contacts = ContactManipulator().subset_on_metadata(contacts, metadata_combi=['SisterA', 'SisterB', 'SisterB'])
read_name | chrom_1 | start_1 | end_1 | metadata_1 | chrom_2 | start_2 | end_2 | metadata_2 | chrom_3 | start_3 | end_3 | metadata_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | a | chr1 | 100010 | 100015 | SisterA | chr1 | 500010 | 500050 | SisterB | chr1 | 600100 | 600200 | SisterB |
The resulting contacts carry the filtered metadata combination as an attribute:
['SisterA', 'SisterB', 'SisterB']
GenomicBinner
carries this information forward and adds it to the pixels object
chrom | start_1 | start_2 | start_3 | count | |
---|---|---|---|---|---|
0 | chr1 | 100000 | 500000 | 600000 | 1 |
['SisterA', 'SisterB', 'SisterB']
Persisting pixels
Pixels come in a multitude of different flavors and it would thus be cumbsersome to have to keep track of pixels of different binsize, labelling state or contact order that belong to the same source fragments. We thus developed a pixel file format that exposes a unified name to pixels that belong together as well as functoinality to load the exact pixels that we want.
The pixel file format
The pixel file format is a directory that contains a number of parquet files as well as a config file that holds information about what specific pixels are available. As with the other io-operations, the FileManager
object is responsible for saving pixels.
['5f0e4d6a6c0e5afb82c3a6ec4bce1635.parquet', 'metadata.json']
- The pixel foler at
test.parquet
contains a single specific pixel file as well as a metadata file
The FileManager
class contains functionality to read the metadata file and list available pixels
[PixelParameters(number_fragments=3, binsize=100000, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)]
- The
list_pixels
method lists all available pixels in the form ofPixelParameters
, a pydantic dataclass that holds the parameters that describe a specific pixel file
Writing additional pixels updates the metadata file
[PixelParameters(number_fragments=3, binsize=100000, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True),
PixelParameters(number_fragments=3, binsize=100000, metadata_combi=['A', 'B', 'B'], label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)]
Loading pixels
Pixel files can be loaded using the parameter pydantic class
loaded_pixels = FileManager().load_pixels("test_pixels.parquet", PixelParameters(number_fragments=3, binsize=100_000, same_chromosome=True))
Additionally, pixels can be loaded using a URI that specifies the parameters seperated by ::
. This method of loading requires the parameters path, number_fragments and binsize, but will match the rest to the available pixels.
PixelParameters(number_fragments=3, binsize=100000, metadata_combi=['A', 'B', 'B'], label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)
If the specified uri is not specific enough and multiple pixels match, an error is raised