Help

Using DeepArk

DeepArk is a resource for studying regulatory genomics in model organisms. Specifically, DeepArk is a set of four convolutional neural networks that predict biochemical regulatory activity (e.g. transcription factors, histone marks, chromatin accessibility, RNA polymerase) directly from genomic sequences. For a given input sequence of 4095 base pairs, DeepArk outputs a set of probabilities indicating whether different regulatory features are predicted to be present at the center of that sequence. DeepArk can be used to predict the regulatory activity of input sequences, to predict the regulatory effects of genomic variants, and to profile an input sequence to identify the regions that appear to be especially predictive of regulatory activity. DeepArk provides in silico predictions that may be used in computational analyses, and we expect that its predictive capabilities will also be useful for a number of experimental applications (e.g. predictive genome editing of regulatory regions). Thus, DeepArk is intended to be a tool for both computational and experimental researchers studying regulatory genomics. Further information about DeepArk's applications can be found in the sections below, and additional information about the regulatory features modeled by DeepArk is available in the next section. To learn more about the DeepArk models and read about some example applications of DeepArk to both computational and experimental regulatory genomics, please see our publication on DeepArk.

Understanding DeepArk's regulatory feature information

The output files produced by DeepArk use accessions as the names of the predicted features. These accessions are generally written as SRX followed by a series of digits, and do not directly tell you much about the regulatory feature being predicted. To make sense of them, you will need to match them with the corresponding metadata. Metadata for the features predicted by DeepArk may be found in the DeepArk GitHub repository. A brief explanation of the metadata file format is provided here. Note that these metadata files also include the test set performance of DeepArk, which may be relevant to performance-critical applications where users need to retain only the highest-performing regulatory feature predictions. A very limited number of features may have NA or NaN for their test set performance. These are features that had fewer than 50 positive examples in the test set. Additional information about our testing procedures is available in our publication on DeepArk. For users interested in binarizing DeepArk's outputs using a probability threshold, we provide a set of thresholds, along with their recall, precision, and false positive rates, in the DeepArk GitHub repository. We recommend binarizing DeepArk's outputs in large-scale performance-critical applications where users need their predictions to have a specific recall, precision, or false positive rate.
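As a rough illustration of the binarization described above, the following Python sketch applies a per-feature probability threshold to a set of predictions. The accessions, probabilities, and threshold values are made up for the example, and the dictionary layout is an assumption rather than the format of the files in the DeepArk GitHub repository.

```python
def binarize(probabilities, thresholds):
    """Return {accession: bool} for every feature that has a threshold.

    Both arguments are plain dictionaries keyed by feature accession.
    Features with no threshold entry are skipped (an assumption made
    for this sketch, not documented DeepArk behavior).
    """
    calls = {}
    for accession, probability in probabilities.items():
        threshold = thresholds.get(accession)
        if threshold is None:
            continue
        calls[accession] = probability >= threshold
    return calls


# Hypothetical accessions and values, for illustration only.
probabilities = {"SRX0000001": 0.82, "SRX0000002": 0.10, "SRX0000003": 0.55}
thresholds = {"SRX0000001": 0.50, "SRX0000002": 0.30}

calls = binarize(probabilities, thresholds)
```

In practice the threshold for each feature would be chosen from the provided set according to the recall, precision, or false positive rate that the application requires.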

Supported file formats and examples

DeepArk supports both FASTA and BED files for prediction and sequence profiling, as well as VCF files for variant effect prediction. Additional information about these formats, as well as other formats commonly used for genomics, may be found here. Note that DeepArk requires that entry names for BED files and FASTA files be unique. We include examples of each file format below:

FASTA
BED (use with the ce11 reference genome)
VCF (use with the mm10 reference genome)
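
Because entry names must be unique, it can be useful to check a FASTA file before uploading it. The Python sketch below is one possible check. It assumes that the entry name is the first whitespace-delimited token on each header line, which may not match how the server parses names.

```python
from collections import Counter


def duplicate_fasta_names(fasta_text):
    """Return a sorted list of entry names that occur more than once."""
    names = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if fields:  # ignore header lines that carry no name
            names.append(fields[0])
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Toy input with one repeated name.
example = ">seq1\nACGT\n>seq2\nACGT\n>seq1 second copy\nTTTT\n"
duplicates = duplicate_fasta_names(example)
```

An empty result means that no name was repeated. The same idea applies to the name column of a BED file.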

Regulatory activity prediction

The regulatory activity prediction tool predicts the probability of a regulatory feature (e.g. transcription factor binding) for an input DNA sequence. The input can be specified either as a set of DNA sequences in a FASTA file or as a BED file that lists the coordinates of the sequences. If using a BED file, you must also specify which of the genomes you would like to use. If you want to use a BED file with a genome assembly version that we do not support, consider using the UCSC LiftOver tool to convert your coordinates into valid coordinates for one of the versions that we do support. Note that all input sequences must be 4095 base pairs in length. If an input sequence is longer than 4095 base pairs, then only the center 4095 base pairs of the sequence will be used. Furthermore, the start and end coordinates in a BED file should not fall outside the limits of their respective chromosome. If you intend to make predictions for over 20,000 examples, we recommend that you use the standalone source code and models instead of this server. Below is a screenshot of the output from the regulatory activity prediction tool.
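
The length requirement above can be illustrated with a short Python sketch that crops a longer sequence to its center 4095 base pairs and builds a 4095 base pair BED interval around a position of interest. The function names are hypothetical, and the handling of sequences whose excess length is odd (the extra base is dropped from the right) is an assumption that may differ from what the server does.

```python
WINDOW = 4095  # input length expected by DeepArk


def center_window(sequence, window=WINDOW):
    """Return the center `window` bases of `sequence`."""
    if len(sequence) < window:
        raise ValueError("sequence is shorter than the required window")
    start = (len(sequence) - window) // 2
    return sequence[start:start + window]


def bed_window(center, chrom_length, window=WINDOW):
    """Return a 0-based, half-open (start, end) interval around `center`.

    `center` is a 0-based position. Raises ValueError if the interval
    would fall outside the chromosome.
    """
    half = window // 2  # 2047 bases on each side of the center
    start, end = center - half, center + half + 1
    if start < 0 or end > chrom_length:
        raise ValueError("window falls outside the chromosome")
    return start, end


cropped = center_window("A" * 1000 + "C" * 4095 + "A" * 1000)
interval = bed_window(5000, chrom_length=100_000)
```

Here `interval` spans exactly 4095 bases, with position 5000 at its center.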

Variant effect prediction

We anticipate that the variant effect prediction tool will be particularly useful for those interested in prioritizing a large set of regulatory variants by their putative effects on different regulatory features, as well as those interested in precision editing of the regulatory genome. The variant effect prediction tool predicts the change in the probability of a regulatory feature due to a specific genomic variant. This change is reported as both the difference and the log₂ fold change between the probability of the feature occurring in the alternate sequence and in the reference sequence. Variants should be uploaded in a VCF file, and you must specify which of the genomes you would like to use as a reference. The reference alleles in the VCF must match the reference genome. Indels should be stored so that the alternate and reference alleles both include at least one base pair. If a variant has multiple alternate alleles associated with it, each alternate allele should be listed on a separate line. Variants that are not represented as nucleotides in the VCF are not supported, nor are variants larger than 100 base pairs. The 4095 base pair window around the variant should not fall outside the limits of the specified chromosome. If you intend to make predictions for over 20,000 examples, we recommend that you use the standalone source code and models instead of this server. Below is a screenshot of the output from the variant effect prediction tool.
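
To make the two effect scores concrete, the Python sketch below computes the difference and the log₂ fold change from a pair of predicted probabilities. It follows the description above literally, using the ratio of the two probabilities; the function name and the input values are illustrative, and the server's exact computation may differ.

```python
import math


def variant_effect(p_ref, p_alt):
    """Return (difference, log2 fold change) of alternate vs. reference.

    Both inputs must be probabilities in (0, 1]; a probability of zero
    would make the fold change undefined.
    """
    for p in (p_ref, p_alt):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    difference = p_alt - p_ref
    log2_fold_change = math.log2(p_alt / p_ref)
    return difference, log2_fold_change


# Illustrative values: the variant doubles the predicted probability.
difference, log2_fold_change = variant_effect(p_ref=0.25, p_alt=0.5)
```

The two scores are complementary: the difference emphasizes changes at features that are already likely to be present, while the log₂ fold change emphasizes relative changes, including those at low-probability features.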

Genomic sequence profiling

The genomic sequence profiler can be particularly useful for uncovering portions of input sequences that appear especially predictive of a given regulatory feature (e.g. binding sites that are critical to a transcription factor binding to the input sequence). The genomic sequence profiling tool predicts the probability of a regulatory feature for an input DNA sequence, and predicts how every possible mutation in that sequence might change the probability of that regulatory feature. This change is reported as both the difference and the log fold change between the probability of the feature occurring in the mutated sequence and in the reference sequence. The input can be specified either as a set of DNA sequences in a FASTA file or as a BED file that lists the coordinates of the sequences. If using a BED file, you must also specify which of the genomes you would like to use. If you want to use a BED file with a genome assembly version that we do not support, consider using the UCSC LiftOver tool to convert your coordinates into valid coordinates for one of the versions that we do support. Note that all input sequences must be 4095 base pairs in length. If an input sequence is longer than 4095 base pairs, then only the center 4095 base pairs of the sequence will be used. Furthermore, the start and end coordinates in a BED file should not fall outside the limits of their respective chromosome. If you intend to make predictions for more than one example, we recommend that you use the standalone source code and models instead of this server. If you are unsure which feature to investigate, we recommend first identifying which features are active in your sequence of interest by predicting its regulatory activity. Below is a screenshot of the sequence profiler's output.
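
The profiling procedure can be pictured with the following Python sketch, which enumerates the mutated sequences that would each need to be scored. It assumes that "every possible mutation" means every single-nucleotide substitution, which is the usual meaning in this kind of analysis but is an assumption here, and the function name is hypothetical.

```python
BASES = "ACGT"


def single_base_mutations(sequence):
    """Yield (position, ref_base, alt_base, mutated_sequence) tuples.

    Positions are 0-based. Each position is replaced in turn by each
    of the other bases.
    """
    sequence = sequence.upper()
    for position, ref_base in enumerate(sequence):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue
            mutated = sequence[:position] + alt_base + sequence[position + 1:]
            yield position, ref_base, alt_base, mutated


# A three-base toy sequence produces 3 positions x 3 substitutions.
mutations = list(single_base_mutations("ACG"))
```

Under this assumption, a full 4095 base pair input yields 4095 × 3 = 12,285 mutated sequences per example, which helps explain why the standalone code is recommended for profiling more than one example.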