Preprocessing tasks

HiCognition provides a number of concrete preprocessing tasks that correspond to the abstract tasks defined in the concepts section. They fall into two groups: tasks that aggregate a single genomic feature at a given region set, and tasks that aggregate a collection of features. The preprocessing tasks map onto the types of widgets available for exploration, although some tasks produce data for multiple widgets.

Tasks for single genomic features

These tasks aggregate a single genomic feature on a genomic region set and thus make this feature available for exploration.

           is aggregated
feature -------------------> region-set

Aggregate a 1D-feature at a genomic region set

This task produces data for the 1D-average widget and the Stacked lineprofile widget and amounts to extracting the signal of a bigwig file at a genomic region set (see the widgets section for a detailed description of the algorithm).
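As a rough illustration of what "extracting the signal at a region set" means, the following minimal sketch cuts a fixed-size window around each region center out of a binned signal and then averages the windows. It is not HiCognition's implementation: the toy `signal` list stands in for values read from a bigwig file, and the helper names `extract_windows` and `average_profile` are hypothetical.

```python
from statistics import mean


def extract_windows(signal, centers, half_width):
    """Return one fixed-size window of `signal` per region center.

    Positions outside the signal are filled with None, so every
    window has the same length (2 * half_width + 1).
    """
    windows = []
    for center in centers:
        window = []
        for pos in range(center - half_width, center + half_width + 1):
            window.append(signal[pos] if 0 <= pos < len(signal) else None)
        windows.append(window)
    return windows


def average_profile(windows):
    """Average the windows column by column, ignoring missing values."""
    profile = []
    for column in zip(*windows):
        values = [v for v in column if v is not None]
        profile.append(mean(values) if values else None)
    return profile


# Toy binned signal (assumption: a stand-in for values read from a bigwig file).
signal = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 6.0, 2.0, 0.0]
centers = [2, 7]  # bin index of each region's center

stacked = extract_windows(signal, centers, half_width=1)  # one row per region
average = average_profile(stacked)                         # one value per position
```

In this sketch, `stacked` corresponds to the kind of per-region data a stacked line profile could show, and `average` to a 1D average across all regions.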

Aggregate a 2D-feature at a genomic region set

This task produces data for the 2D-average widget and the 2D-feature embedding widget and amounts to extracting the signal of a multiresolution cooler file at a genomic region set (see the widgets section for a detailed description of the algorithm).
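The 2D case can be pictured in the same way: a square submatrix of the contact matrix is cut out around each region and the squares are averaged element by element. The sketch below is only an illustration under simple assumptions (a small dense matrix in memory, regions given as bin indices on the diagonal); the function names are hypothetical and reading from a multiresolution cooler file is not shown.

```python
def extract_square(matrix, center, half_width):
    """Return the square submatrix centered on the diagonal at bin `center`."""
    lo, hi = center - half_width, center + half_width + 1
    if lo < 0 or hi > len(matrix):
        raise ValueError("window extends beyond the matrix")
    return [row[lo:hi] for row in matrix[lo:hi]]


def average_squares(squares):
    """Average equally sized square matrices element by element."""
    size = len(squares[0])
    count = len(squares)
    return [
        [sum(square[i][j] for square in squares) / count for j in range(size)]
        for i in range(size)
    ]


# Toy symmetric contact matrix (assumption: a stand-in for cooler data).
matrix = [
    [4, 2, 1, 0, 0, 0],
    [2, 4, 2, 1, 0, 0],
    [1, 2, 4, 2, 1, 0],
    [0, 1, 2, 4, 2, 1],
    [0, 0, 1, 2, 8, 2],
    [0, 0, 0, 1, 2, 4],
]
centers = [1, 4]  # bin index of each region's center

squares = [extract_square(matrix, center, half_width=1) for center in centers]
pileup = average_squares(squares)
```

Here `pileup` plays the role of a 2D average, while the individual `squares` are the per-region inputs that an embedding could be computed from.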

Tasks for collections of genomic features

These tasks aggregate a collection of genomic features on a genomic region set and make this collection available for exploration.

feature1 ---
            |
feature2 ---|                       is aggregated
            |------>  Collection -------------------> region-set
    .       |
            |   
featureN ---

Aggregate a 1D-feature collection at a genomic region set

This task produces data for the 1D-feature embedding widget and amounts to embedding the genomic region set using the 1D-features into a 2D-space (see the widgets section for a detailed description of the algorithm).

Aggregate a Region collection at a genomic region set

This task produces data for the Association widget and amounts to running LOLA on the region set of interest with the region collection (see the widgets section for a detailed description of the algorithm).
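To give an intuition for this kind of enrichment analysis, the sketch below counts how many query regions overlap a region set from the collection and scores the overlap with a one-sided Fisher's exact test against a universe of regions. LOLA itself is an R package; this Python code is not its API and not HiCognition's implementation. All function names, the toy regions, and the assumption that the query is a subset of the universe are illustrative only.

```python
from math import comb


def overlaps(region, others):
    """True if `region` (chrom, start, end) overlaps any region in `others`."""
    chrom, start, end = region
    return any(c == chrom and s < end and start < e for c, s, e in others)


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(total - row1, col1 - k) for k in range(a, upper + 1)
    )
    return tail / comb(total, col1)


def enrichment(query, universe, region_set):
    """Return (support, p-value); assumes every query region is in the universe."""
    a = sum(overlaps(r, region_set) for r in query)          # query, overlapping
    b = sum(overlaps(r, region_set) for r in universe) - a   # non-query, overlapping
    c = len(query) - a                                       # query, not overlapping
    d = len(universe) - len(query) - b                       # non-query, not overlapping
    return a, fisher_greater(a, b, c, d)


# Toy data: six universe regions, the first three of which form the query.
universe = [("chr1", i * 100, i * 100 + 50) for i in range(6)]
query = universe[:3]
collection = {
    "set_hitting_query": [("chr1", 10, 20), ("chr1", 110, 120), ("chr1", 210, 220)],
    "set_missing_query": [("chr1", 310, 320)],
}

results = {name: enrichment(query, universe, rs) for name, rs in collection.items()}
```

In this toy example, the set that overlaps all three query regions gets a small p-value, while the set that only overlaps a non-query region gets a p-value of 1.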

Duration of tasks

A visual exploration tool is only useful if the tasks it fulfills can be completed in a reasonable time. Therefore, we have worked hard to minimize the time required for each preprocessing step. If you use the default configuration and a machine that complies with our hardware requirements, none of the jobs should take longer than ~3 minutes. The following table gives a rough estimate of how long the different preprocessing steps are expected to run for common input sizes.

Region size   Preprocessing task                                          Duration [min]
1000          Aggregate a 1D-feature at a genomic region set              0.1
1000          Aggregate a 2D-feature at a genomic region set              0.6 *
1000          Aggregate a 1D-feature collection at a genomic region set   0.5
1000          Aggregate a Region collection at a genomic region set       0.5
50000         Aggregate a 1D-feature at a genomic region set              0.5
50000         Aggregate a 2D-feature at a genomic region set              3 *
50000         Aggregate a 1D-feature collection at a genomic region set   2.5
50000         Aggregate a Region collection at a genomic region set       2.5

* The first time you run the Aggregate a 2D-feature at a genomic region set task for a particular region set, expect it to take ~2x as long, because the Observed/expected values are calculated on the first run and then cached for future runs.
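The effect of this caching can be sketched as follows. The example computes expected values as the mean of each diagonal of a toy contact matrix, stores them in a dictionary, and reuses them on the second request. How HiCognition actually computes, keys, and stores its cache is not described here, so `expected_cache`, `cache_key`, and the function names are assumptions made for illustration.

```python
expected_cache = {}  # hypothetical in-memory cache
compute_count = 0    # counts how often the expensive step actually runs


def diagonal_means(matrix):
    """Expected value per distance: the mean of each diagonal of the matrix."""
    global compute_count
    compute_count += 1
    n = len(matrix)
    return [sum(matrix[i][i + d] for i in range(n - d)) / (n - d) for d in range(n)]


def get_expected(cache_key, matrix):
    """Compute expected values on the first call, reuse them afterwards."""
    if cache_key not in expected_cache:
        expected_cache[cache_key] = diagonal_means(matrix)
    return expected_cache[cache_key]


def observed_over_expected(matrix, expected):
    """Divide every entry by the expected value at its distance from the diagonal."""
    n = len(matrix)
    return [
        [
            matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy symmetric contact matrix (assumption: a stand-in for real data).
matrix = [
    [4, 2, 2],
    [2, 8, 4],
    [2, 4, 6],
]

first = get_expected("toy_key", matrix)   # computed and stored
second = get_expected("toy_key", matrix)  # served from the cache
oe = observed_over_expected(matrix, first)
```

After both calls, `compute_count` is still 1, which mirrors why only the first run pays the extra cost.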

If you change the configuration for window sizes and bin sizes, the jobs may take much longer and require more memory.

That being said, on-demand preprocessing is just one of the potential user flows. We think many preprocessing steps can be submitted in bulk, because a large part of exploration tasks involves common "dataset ingredients".