
Spoc data structures

This notebook explains the data structures that are available within spoc and shows how they relate to each other. On a high level, spoc provides data structures for all parts of the transformation pipeline from raw reads to aggregated pixels.

Often, these data structures (with the exception of the pixels class) will not be used in everyday analysis tasks, but rather within analysis pipelines.

Data frame schemas

Spoc data structures are wrappers around tabular data containers such as pandas.DataFrame or dask.dataframe.DataFrame. To ensure that the underlying data complies with the format that spoc expects, spoc implements dataframe validation using pandera. The underlying schemas reside in the spoc.dataframe_models module.

I/O

Reading and writing of spoc data structures is managed by the spoc.io package, specifically by the FileManager class. Examples of using the FileManager can be found with the specific data structure.

Fragments

Fragments encapsulate a data structure that can hold a dynamic number of aligned fragments per sequencing unit. In a Pore-C experiment, a sequencing unit is the sequencing read, which holds multiple fragments per read. In principle, this structure can also be used for other experiment types that generate aligned fragments grouped together by an id, for example SPRITE.

Reading fragments using FileManager

from spoc.io import FileManager
fragments = FileManager().load_fragments("../tests/test_files/good_porec.parquet")

The Fragments class has a data accessor for the underlying fragments:

fragments.data.head()
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter
0 chr1 1 4 True dummy 1 4 1 1 1 1 True
1 chr1 2 5 True dummy 2 5 1 2 2 2 True
2 chr1 3 6 True dummy 3 6 1 3 3 3 True

The Fragments class constructor validates the underlying data structure using pandera and the dataframe schemas in spoc.dataframe_models.

from pandera.errors import SchemaError
try:
    FileManager().load_fragments("../tests/test_files/bad_porec.parquet")
except SchemaError as e:
    print(str(e).split("\n")[0])

The Fragments class also supports reading data as a dask dataframe.

from spoc.io import DataMode
fragments = FileManager(DataMode.DASK).load_fragments("../tests/test_files/good_porec.parquet")
fragments.data
Dask DataFrame Structure:
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter
npartitions=1
object int64 int64 bool object int64 int64 int64 int64 int64 int64 bool
... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: validate, 2 graph layers

Note that when reading as a dask dataframe, schema validation is deferred until the dask task graph is evaluated.

fragments = FileManager(DataMode.DASK).load_fragments("../tests/test_files/bad_porec.parquet")
try:
    fragments.data.compute()
except SchemaError as e:
    print(str(e).split("\n")[0])
column 'chrom' not in dataframe

Annotating fragments

Fragments can carry metadata that adds additional information, which can be propagated through the analysis pipeline. FragmentAnnotator uses a dictionary called a label library, which maps compound fragment ids to metainformation, to annotate fragments. These ids are concatenations of the read_id, chromosome, start and end of the mapping.
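For illustration, the compound ids could be built roughly like this (a sketch of the id convention only; spoc's own construction may differ in detail):

```python
import pandas as pd

# Sketch: build compound fragment ids of the form
# "<read_name>_<chrom>_<start>_<end>", matching the label library keys below.
fragments = pd.DataFrame({
    "read_name": ["dummy", "dummy"],
    "chrom": ["chr1", "chr1"],
    "start": [1, 2],
    "end": [4, 5],
})
compound_ids = (
    fragments["read_name"] + "_" + fragments["chrom"] + "_"
    + fragments["start"].astype(str) + "_" + fragments["end"].astype(str)
)
print(compound_ids.tolist())  # ['dummy_chr1_1_4', 'dummy_chr1_2_5']
```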

fragments = FileManager().load_fragments("../tests/test_files/good_porec.parquet")
label_library = FileManager().load_label_library("../tests/test_files/ll1.pickle")
label_library
{'dummy_chr1_1_4': True, 'dummy_chr1_2_5': False}
from spoc.fragments import FragmentAnnotator
annotated_fragments = FragmentAnnotator(label_library).annotate_fragments(fragments)
annotated_fragments.data.head()
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter metadata
0 chr1 1 4 True dummy 1 4 1 1 1 1 True SisterB
1 chr1 2 5 True dummy 2 5 1 2 2 2 True SisterA

Contacts

While the fragment representation retains flexibility, it is often not practical to have contacts of multiple orders and types in different rows of the same file. To this end, we employ the contact representation, where each row contains one contact of a defined order, e.g. a duplet or a triplet. The Contacts class is a generic wrapper around the data structure that holds this representation and can represent contacts of different orders. The class that creates contacts from fragments is called FragmentExpander, which can generate contacts of arbitrary order.
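Conceptually, expansion enumerates all combinations of a given size among the fragments of each read. A sketch of the idea using itertools (not spoc's implementation):

```python
from itertools import combinations

# Fragments of a single read, identified here by their start coordinate:
read_fragments = [1, 2, 3, 4]

# Contacts of order 2 are all unordered pairs of fragments within the read:
duplets = list(combinations(read_fragments, 2))
print(duplets)  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```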

import pandas as pd
from spoc.fragments import FragmentExpander
fragments = FileManager().load_fragments("../tests/test_files/fragments_unlabelled.parquet")
fragments.data.head()
chrom start end strand read_name read_start read_end read_length mapping_quality align_score align_base_qscore pass_filter
0 chr1 1 4 True dummy 1 4 1 1 1 1 True
1 chr1 2 5 True dummy 2 5 1 2 2 2 True
2 chr1 3 6 True dummy 3 6 1 3 3 3 True
3 chr1 4 7 True dummy 4 7 1 4 4 4 True
4 chr1 5 8 True dummy2 5 8 1 5 5 5 True
contacts = FragmentExpander(number_fragments=2).expand(fragments)
contacts.data.head()
read_name read_length chrom_1 start_1 end_1 mapping_quality_1 align_score_1 align_base_qscore_1 chrom_2 start_2 end_2 mapping_quality_2 align_score_2 align_base_qscore_2
0 dummy 1 chr1 1 4 1 1 1 chr1 2 5 2 2 2
1 dummy 1 chr1 1 4 1 1 1 chr1 3 6 3 3 3
2 dummy 1 chr1 1 4 1 1 1 chr1 4 7 4 4 4
3 dummy 1 chr1 2 5 2 2 2 chr1 3 6 3 3 3
4 dummy 1 chr1 2 5 2 2 2 chr1 4 7 4 4 4

FragmentExpander also propagates metadata that is associated with fragments:

fragments_labelled = FileManager().load_fragments("../tests/test_files/fragments_labelled.parquet")
contacts_labelled = FragmentExpander(number_fragments=2).expand(fragments_labelled)
contacts_labelled.data.head()
read_name read_length chrom_1 start_1 end_1 mapping_quality_1 align_score_1 align_base_qscore_1 metadata_1 chrom_2 start_2 end_2 mapping_quality_2 align_score_2 align_base_qscore_2 metadata_2
0 dummy 1 chr1 1 4 1 1 1 SisterA chr1 2 5 2 2 2 SisterB
1 dummy 1 chr1 1 4 1 1 1 SisterA chr1 3 6 3 3 3 SisterA
2 dummy 1 chr1 1 4 1 1 1 SisterA chr1 4 7 4 4 4 SisterB
3 dummy 1 chr1 2 5 2 2 2 SisterB chr1 3 6 3 3 3 SisterA
4 dummy 1 chr1 2 5 2 2 2 SisterB chr1 4 7 4 4 4 SisterB

The Contacts class retains the information as to whether the expanded contacts contain metadata:

contacts_labelled.contains_metadata
True

Persisting contacts

Contacts can be persisted using the file manager, which writes the data as well as the global parameters to a parquet file. Loading the file restores the global parameters.

contacts_labelled.get_global_parameters()
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)
FileManager().write_contacts("test.parquet", contacts_labelled)

The contacts file format is a directory containing different specific contact instances, such as contacts of different contact order. We can specify which of those to load by passing the contacts parameters:

contacts = FileManager().load_contacts("test.parquet", contacts_labelled.get_global_parameters())
contacts.get_global_parameters()
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)

Additionally, contacts can be loaded using a URI that specifies the parameters separated by ::. This method of loading requires the parameters path and number_fragments, but will match the rest to the available contacts.

from spoc.contacts import Contacts
loaded_contacts = Contacts.from_uri('test.parquet::2')
loaded_contacts.get_global_parameters()
ContactsParameters(number_fragments=2, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False)

Symmetry

Unlabelled contacts

The quantification of genomic interactions in conventional (2-way) Hi-C assumes that there is no difference in the order of interactions: whether a genomic location is measured in the first or the second read of a paired-end sequencing experiment carries the same information. During preprocessing, conventional Hi-C data is therefore flipped based on some convention (often that the first read has the smaller genomic location according to a sort order) and only the upper triangular interaction matrix is stored.

When we move to higher-order genomic contacts, a similar reasoning applies (except for special use cases), and we can thus flip genomic contacts such that genomic coordinates increase monotonically from the first to the last contact position (we mean this operation when we refer to flipping below). This produces a symmetric, high-dimensional tensor, meaning that permuting its dimensions does not change the associated values.
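For unlabelled contacts, flipping amounts to sorting the coordinates within each contact row. A minimal sketch of the idea (real flipping moves entire fragments, i.e. start/end pairs and their associated columns, together; here only the start coordinates are shown):

```python
import numpy as np
import pandas as pd

# Three 2-way contacts; only read2 and read3 violate start_1 <= start_2.
contacts = pd.DataFrame({
    "start_1": [100, 2000, 3000],
    "start_2": [1000, 200, 300],
})

# Sort coordinates within each row so the representation is order-invariant.
flipped = pd.DataFrame(
    np.sort(contacts[["start_1", "start_2"]].to_numpy(), axis=1),
    columns=["start_1", "start_2"],
)
print(flipped["start_1"].tolist())  # [100, 200, 300]
```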

In spoc, this logic is implemented in the ContactManipulator class, in the .flip_symmetric_contacts method.

from spoc.contacts import Contacts
contacts = Contacts.from_uri("../tests/test_files/contacts_unlabelled_2d_v2.parquet::2")
contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2
0 read1 100 200 1000 2000
1 read2 2000 3000 200 300
2 read3 3000 4000 300 400

This particular contacts dataframe has one contact that conforms with the convention that the first contact should be smaller than the second (read1), whereas the other two contacts don't conform with that convention. Using the .flip_symmetric_contacts method we can fix this:

from spoc.contacts import ContactManipulator
flipped_contacts = ContactManipulator().flip_symmetric_contacts(contacts)
flipped_contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2
0 read1 100 200 1000 2000
1 read2 200 300 2000 3000
2 read3 300 400 3000 4000

Symmetry-flipped contacts have the flag symmetry_flipped set to True:

flipped_contacts.symmetry_flipped
True

These operations are available for arbitrary contact cardinalities:

contacts = Contacts.from_uri("../tests/test_files/contacts_unlabelled_3d_v2.parquet::3")
contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2 start_3 end_3
0 read1 100 200 1000 2000 250 300
1 read2 2000 3000 200 300 400 500
2 read3 3000 4000 300 400 100 200
flipped_contacts = ContactManipulator().flip_symmetric_contacts(contacts)
flipped_contacts.data.head().filter(regex="(read_name|start|end)")
read_name start_1 end_1 start_2 end_2 start_3 end_3
0 read1 100 200 250 300 1000 2000
1 read2 200 300 400 500 2000 3000
2 read3 100 200 300 400 3000 4000

Labelled contacts

For labelled contacts, the situation is more complex, as we have to deal with different flavours of symmetry. If we take the example of triplets, that is, genomic contacts of order 3, with binary labels (denoted A or B), there are 8 possible label combinations:

 (AAA, BBB, BAA, ABA, AAB, ABB, BAB, BBA)

If we extend the argument that order of interactions is unimportant, this reduces to 4 possible arrangements of labels:

 (AAA, BBB, ABB, BAA)

It is often the case for binary labels that the specific label type is unimportant; the only important information is whether the labels were different (e.g. for sister-specific labels or homologous chromosome labels). In such a situation, the possible arrangements reduce further to 2 label types:

 (AAA, AAB)
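The reduction from 8 ordered label combinations to 4 unordered arrangements, and to 2 when A and B are interchangeable, can be verified in a few lines. This sketch uses alphabetically sorted representatives (e.g. AAB rather than BAA):

```python
from itertools import product

# All ordered binary label combinations for triplets: 2**3 = 8
ordered = ["".join(c) for c in product("AB", repeat=3)]
assert len(ordered) == 8

# If the order of fragments is unimportant, sort labels within each contact:
unordered = sorted({"".join(sorted(c)) for c in ordered})
print(unordered)  # ['AAA', 'AAB', 'ABB', 'BBB'] -> 4 arrangements

# If A and B are interchangeable, map each state to the alphabetically
# first of itself and its A<->B-swapped (re-sorted) counterpart:
swap = str.maketrans("AB", "BA")
equated = sorted({min(s, "".join(sorted(s.translate(swap)))) for s in unordered})
print(equated)  # ['AAA', 'AAB'] -> 2 label types
```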

For those contact types, two different “rules” of symmetry apply. For the situation of all-equal labels (AAA or BBB), the same rules apply as for unlabelled contacts, since there is no difference in order. For the other situation (ABB or BBA, which we denote as ABB from now on), only permutations within one label type produce the same value. If we have the label state ABB and denote permutations as tuples of length three with (0,1,2) being the identity permutation, then only the permutations (0,1,2) and (0,2,1) are equivalent. Practically, this means that we can flip contacts that are related by these permutations such that their genomic coordinates are monotonically increasing, but we cannot do this for contacts that are not related through a symmetry relation.

This reasoning generalizes to higher orders, where contacts can be flipped if they are related via a symmetry relation. Practically speaking, if we have a higher-order contact of order n with two possible label states, with k labels of type A and (n-k) labels of type B, we can flip fragments within type A and within type B. For example, if we have a contact of order 4 with the configuration AABB, we can flip within the As and within the Bs based on genomic coordinates, but not between them. The same reasoning applies when there is more than one possible label state: we can flip within one label state, but not between them.
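As a sketch of the within-label-group rule: for a label-sorted contact with state AAB, only the coordinates of the two A fragments may be sorted among themselves, while the B fragment keeps its position. A simplified illustration (not spoc's implementation; only the start coordinates are shown):

```python
import numpy as np
import pandas as pd

# One contact of order 3 with label state (A, A, B): positions 1 and 2
# carry label A and may be swapped; position 3 (label B) must stay put.
contact = pd.DataFrame({"start_1": [5000], "start_2": [2000], "start_3": [200]})

# Sort coordinates only within the A group (columns start_1 and start_2):
a_group = ["start_1", "start_2"]
contact[a_group] = np.sort(contact[a_group].to_numpy(), axis=1)
print(contact.iloc[0].tolist())  # [2000, 5000, 200]
```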

This logic is implemented in spoc in the ContactManipulator class, with two methods:

  • The .equate_binary_labels method can be used to specify whether, in a binary label situation, the specific label values matter (e.g. whether AA is equivalent to BB). This method is optional and can be applied prior to the flipping procedure.
  • The .flip_symmetric_contacts method flips symmetric contacts based on the rules specified above.

Note that all symmetry operations require the contact metadata to be alphabetically sorted. This can either be done explicitly via the .sort_labels method, or it is performed automatically within the other methods. For example, the metadata order ABA will be converted to AAB by the .sort_labels operation.

Let's look at an example! Here, we have 3d contacts with binary labels:

contacts = Contacts.from_uri("../tests/test_files/contacts_labelled_3d_v2.parquet::3")
contacts.data.head().filter(regex="(read_name|start|end|metadata)")
read_name start_1 end_1 metadata_1 start_2 end_2 metadata_2 start_3 end_3 metadata_3
0 read1 100 200 A 200 300 B 1000 2000 B
1 read2 5000 5500 A 2000 3000 A 200 300 B
2 read3 800 900 B 3000 3200 B 3000 4000 B
contacts.number_fragments, contacts.get_label_values()
(3, {'A', 'B'})

This dataframe contains three contacts with the following label states:

  • (ABB)
  • (AAB)
  • (BBB)
In this analysis use case, we want to equate the binary labels since we don't have a biological reason to believe that there is any difference between the labels. The only information that is important for us is whether the contacts happened between different label states or the same label state. Therefore, we use the .equate_binary_labels method to replace all occurrences of the same contact combination with their alphabetically first representative. For example, the label state (ABB) is a contact where two parts come from one label state and one part comes from another. In our logic, this is equivalent to (AAB), which is the alphabetically first label state and therefore replaces it. Following the same logic, (BBB) will be replaced by (AAA).

equated_contacts = ContactManipulator().equate_binary_labels(contacts)
equated_contacts.data.head().filter(regex="(read_name|start|end|metadata)")
read_name start_1 end_1 metadata_1 start_2 end_2 metadata_2 start_3 end_3 metadata_3
0 read1 100 200 A 200 300 A 1000 2000 B
1 read2 5000 5500 A 2000 3000 A 200 300 B
2 read3 800 900 A 3000 3200 A 3000 4000 A

As you can see, the occurrence of (ABB) has been replaced by (AAB) and the occurrence of (BBB) has been replaced by (AAA). We can now flip these contacts based on the symmetry logic explained above using the .flip_symmetric_contacts method.

flipped_labelled_contacts = ContactManipulator().flip_symmetric_contacts(equated_contacts)
flipped_labelled_contacts.data.head().filter(regex="(read_name|start|end|metadata)")
read_name start_1 end_1 metadata_1 start_2 end_2 metadata_2 start_3 end_3 metadata_3
0 read1 100 200 A 200 300 A 1000 2000 B
1 read2 2000 3000 A 5000 5500 A 200 300 B
2 read3 800 900 A 3000 3200 A 3000 4000 A

Pixels

Concept

Pixels represent aggregated information that counts the number of genomic contacts per genomic bin. In this context, a genomic bin is a genomic interval of fixed size (for example 10 kb), and a genomic contact is defined as above as an interaction between the corresponding bins. Pixels represent a single contact order, binsize and labelling state (if metadata is provided) and are intended as the central data structure that an analyst uses to interact with multiway genomic data to answer questions about genomic structure.
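The bin assignment itself is simple integer arithmetic: a genomic position maps to the start of its enclosing fixed-size bin. A sketch of the concept (not spoc's implementation):

```python
# Sketch: assign genomic positions to fixed-size bins by flooring to the
# nearest bin boundary.
def bin_start(position: int, bin_size: int) -> int:
    return (position // bin_size) * bin_size

print(bin_start(100_010, 100_000))      # 100000
print(bin_start(25_000_800, 100_000))   # 25000000
```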

Implementation

Within spoc, pixel instances can be generated from genomic contacts through the GenomicBinner class.

from spoc.pixels import GenomicBinner
contacts = Contacts.from_uri("../tests/test_files/contacts_for_pixels_3d_v2.parquet::3")
contacts.data.head().filter(regex="(read_name|chrom|start|end|metadata)")
read_name chrom_1 start_1 end_1 metadata_1 chrom_2 start_2 end_2 metadata_2 chrom_3 start_3 end_3 metadata_3
0 a chr1 100010 100015 SisterA chr1 500010 500050 SisterB chr1 600100 600200 SisterB
1 b chr1 5000010 5000050 SisterB chr1 7000050 7000070 SisterA chr4 2000300 2000400 SisterA
2 c chr1 10000010 10000050 SisterA chr1 25000800 25000900 SisterB chr1 6000050 6000600 SisterA
3 d chr1 10000010 10000050 SisterA chr1 25001000 25002000 SisterA chr1 6000010 6000700 SisterB

GenomicBinner takes a binsize and aggregates contacts by counting the number of contacts that fall within each genomic bin. As with all other preprocessing functionality, GenomicBinner accepts either pandas dataframes or dask dataframes. The output is consistent with the input: passing a pandas dataframe returns a pandas dataframe, and passing a dask dataframe returns a dask dataframe.

pixels = GenomicBinner(bin_size=100_000).bin_contacts(contacts)

The default behavior of GenomicBinner is to filter for contacts that are on the same chromosome and produce an intrachromosomal pixels instance. This allows a compact representation of pixels that stores only one chromosome field per pixel.

pixels.data.head()
chrom start_1 start_2 start_3 count
0 chr1 100000 500000 600000 1
1 chr1 10000000 25000000 6000000 2

The pixels class additionally carries all information associated with the stored data. This information is taken from the contacts object passed to GenomicBinner:

pixels.number_fragments, pixels.binsize, pixels.binary_labels_equal, pixels.symmetry_flipped, pixels.metadata_combi
(3, 100000, False, False, None)

If interchromosomal contacts are needed, this can be specified via the .bin_contacts method of GenomicBinner, which causes the pixel schema to incorporate a chromosome column for each pixel dimension.

pixels_w_inter = GenomicBinner(bin_size=100_000).bin_contacts(contacts, same_chromosome=False)
pixels_w_inter.data.head()
chrom_1 start_1 chrom_2 start_2 chrom_3 start_3 count
0 chr1 100000 chr1 500000 chr1 600000 1
1 chr1 10000000 chr1 25000000 chr1 6000000 2
2 chr1 5000000 chr1 7000000 chr4 2000000 1

Labelled contacts

The concept of pixels representing a single labelling state simplifies the interfaces of downstream processing functionality, but offloads responsibility to the analyst to ensure that the biological question at hand is adequately addressed by the chosen labelling state. The pixels class does not perform any labelling-state checks and forwards the combination(s) of labelling states present in the contacts class used for their construction. This allows for greater flexibility with regard to the questions that can be answered.

The ContactManipulator class contains functionality to filter contacts for a given labelling state before downstream aggregation into pixels:

contacts.data.head().filter(regex="(read_name|chrom|start|end|metadata)")
read_name chrom_1 start_1 end_1 metadata_1 chrom_2 start_2 end_2 metadata_2 chrom_3 start_3 end_3 metadata_3
0 a chr1 100010 100015 SisterA chr1 500010 500050 SisterB chr1 600100 600200 SisterB
1 b chr1 5000010 5000050 SisterB chr1 7000050 7000070 SisterA chr4 2000300 2000400 SisterA
2 c chr1 10000010 10000050 SisterA chr1 25000800 25000900 SisterB chr1 6000050 6000600 SisterA
3 d chr1 10000010 10000050 SisterA chr1 25001000 25002000 SisterA chr1 6000010 6000700 SisterB
filtered_contacts = ContactManipulator().subset_on_metadata(contacts, metadata_combi=['SisterA', 'SisterB', 'SisterB'])
filtered_contacts.data.head().filter(regex="(read_name|chrom|start|end|metadata)")
read_name chrom_1 start_1 end_1 metadata_1 chrom_2 start_2 end_2 metadata_2 chrom_3 start_3 end_3 metadata_3
0 a chr1 100010 100015 SisterA chr1 500010 500050 SisterB chr1 600100 600200 SisterB

The resulting contacts carry the filtered metadata combination as an attribute:

filtered_contacts.metadata_combi
['SisterA', 'SisterB', 'SisterB']

GenomicBinner carries this information forward and adds it to the pixels object

pixels_filtered = GenomicBinner(bin_size=100_000).bin_contacts(filtered_contacts)
pixels_filtered.data.head()
chrom start_1 start_2 start_3 count
0 chr1 100000 500000 600000 1
pixels_filtered.metadata_combi
['SisterA', 'SisterB', 'SisterB']

Persisting pixels

Pixels come in a multitude of different flavors, and it would thus be cumbersome to keep track of pixels of different binsize, labelling state or contact order that belong to the same source fragments. We therefore developed a pixel file format that exposes a unified name for pixels that belong together, as well as functionality to load exactly the pixels that we want.

The pixel file format

The pixel file format is a directory that contains a number of parquet files as well as a config file that holds information about which specific pixels are available. As with the other I/O operations, the FileManager object is responsible for saving pixels.

import os
FileManager().write_pixels("test_pixels.parquet", pixels)
os.listdir('test_pixels.parquet')
['5f0e4d6a6c0e5afb82c3a6ec4bce1635.parquet', 'metadata.json']
  • The pixel folder at test_pixels.parquet contains a single specific pixel file as well as a metadata file

The FileManager class contains functionality to read the metadata file and list available pixels

FileManager().list_pixels("test_pixels.parquet")
[PixelParameters(number_fragments=3, binsize=100000, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)]
  • The list_pixels method lists all available pixels in the form of PixelParameters, a pydantic dataclass that holds the parameters that describe a specific pixel file

Writing additional pixels updates the metadata file

pixels_filtered._metadata_combi = ['A', 'B', 'B'] # make metadata shorter
FileManager().write_pixels("test_pixels.parquet", pixels_filtered)
FileManager().list_pixels("test_pixels.parquet")
[PixelParameters(number_fragments=3, binsize=100000, metadata_combi=None, label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True),
 PixelParameters(number_fragments=3, binsize=100000, metadata_combi=['A', 'B', 'B'], label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)]

Loading pixels

Pixel files can be loaded by passing the PixelParameters pydantic class:

from spoc.file_parameter_models import PixelParameters
loaded_pixels = FileManager().load_pixels("test_pixels.parquet", PixelParameters(number_fragments=3, binsize=100_000, same_chromosome=True))

Additionally, pixels can be loaded using a URI that specifies the parameters separated by ::. This method of loading requires the parameters path, number_fragments and binsize, but will match the rest to the available pixels.

from spoc.pixels import Pixels
loaded_pixels = Pixels.from_uri('test_pixels.parquet::3::100000::ABB')
loaded_pixels.get_global_parameters()
PixelParameters(number_fragments=3, binsize=100000, metadata_combi=['A', 'B', 'B'], label_sorted=False, binary_labels_equal=False, symmetry_flipped=False, same_chromosome=True)

If the specified URI is not specific enough and multiple pixels match, an error is raised.
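Conceptually, URI resolution matches the given parameters against all available entries and fails if more than one remains. A pure-Python sketch of this matching logic (not spoc's implementation; the parameter entries and error type are illustrative):

```python
# Sketch: resolve a partial parameter specification against available
# pixel entries; ambiguous specifications raise an error.
available = [
    {"number_fragments": 3, "binsize": 100000, "metadata_combi": None},
    {"number_fragments": 3, "binsize": 100000, "metadata_combi": ("A", "B", "B")},
]

def resolve(spec: dict) -> dict:
    matches = [p for p in available if all(p[k] == v for k, v in spec.items())]
    if len(matches) > 1:
        raise ValueError(f"{len(matches)} pixel files match the given URI")
    if not matches:
        raise ValueError("No pixel file matches the given URI")
    return matches[0]

# Unambiguous: the metadata combination narrows the match to one entry
print(resolve({"number_fragments": 3, "metadata_combi": ("A", "B", "B")}))

# Ambiguous: two entries share order 3 and binsize 100000
try:
    resolve({"number_fragments": 3, "binsize": 100000})
except ValueError as e:
    print(e)  # 2 pixel files match the given URI
```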