Model-Based CpG Islands

Lists of CpG islands

Below are lists of model-based CpG islands for a number of species, created using the method described in Wu et al. (2010), Biostatistics. All lists were generated using a posterior probability threshold of 0.99, except for D. melanogaster (fruit fly), for which 0.975 was used. All files are tab-delimited text. The model-based CpG islands are also available as custom tracks on the UCSC Genome Browser.

R software package

makeCGI is an R software package for obtaining CpG islands (CGIs) from a genome (version 1.1 is here). It iteratively fits two hidden Markov models (HMMs) on GC content and the observed-to-expected CpG ratio to obtain posterior probabilities that genomic regions are CpG islands. The CpG islands are then defined by thresholding these posterior probabilities.
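To make the quantities above concrete, here is a minimal base R sketch of the two per-window features and of the final thresholding step. It is an illustration only, not the code inside makeCGI: the function names window_features and call_islands are hypothetical, the observed-to-expected ratio uses the conventional formula (CpG count times window length, divided by the product of the C and G counts), and the posterior probabilities are assumed to arrive as one value per consecutive genomic window. The sketch treats a window as part of an island when its posterior is at or above the threshold.

```r
# Hypothetical helper: GC content and observed/expected CpG ratio
# for one window of sequence given as a character string.
window_features <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  n <- length(bases)
  n_c <- sum(bases == "C")
  n_g <- sum(bases == "G")
  # A CpG is a C immediately followed by a G on the same strand.
  n_cpg <- if (n > 1) sum(bases[-n] == "C" & bases[-1] == "G") else 0
  # Expected CpG count if C and G occurred independently.
  expected <- n_c * n_g / n
  c(gc_content = (n_c + n_g) / n,
    obs_exp = if (expected > 0) n_cpg / expected else NA_real_)
}

# Hypothetical helper: merge consecutive windows whose posterior
# probability reaches the threshold into islands (window indices).
call_islands <- function(posterior, threshold = 0.99) {
  runs <- rle(posterior >= threshold)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  data.frame(start = starts[runs$values], end = ends[runs$values])
}

feat <- window_features("ACGCGT")
islands <- call_islands(c(0.1, 0.995, 0.999, 0.2, 0.5, 1.0))
print(feat)
print(islands)
```

The start and end columns are indices into the posterior vector; converting them to genomic coordinates would depend on the window size and offset, which this sketch does not assume.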

The software package can be downloaded from here. It depends on the Bioconductor packages BSgenome and Biostrings. The input DNA sequence can be either a BSgenome package or a text file in FASTA (.fa) format. Follow the steps below to use the package:

  1. Load the library:
      library(makeCGI)
  2. Set up default parameters:
      .CGIoptions <- CGIoptions()
  3. Start running:
      makeCGI(.CGIoptions)

Three folders, "counts", "rawdata", and "result", will be created under the current working directory to store intermediate result files. The final results will be saved as an R data file in the "result" directory, named "CGI-[species name].rda", e.g., "CGI-Hsapiens.rda".
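Once a run has finished, the saved file can be inspected from R. The sketch below assumes nothing about what the file contains: it loads the .rda into a separate environment and reports whatever objects are stored in it. The helper name inspect_rda is hypothetical, and the path simply follows the naming pattern described above.

```r
# Hypothetical helper: load an .rda file into its own environment
# and summarise each object it contains.
inspect_rda <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  for (nm in obj_names) {
    cat("Object:", nm, "\n")
    str(env[[nm]], max.level = 1)
  }
  invisible(obj_names)
}

# Path follows the "CGI-[species name].rda" pattern; adjust as needed.
result_file <- file.path("result", "CGI-Hsapiens.rda")
if (file.exists(result_file)) {
  inspect_rda(result_file)
}
```

Loading into a fresh environment avoids overwriting objects of the same name in the current workspace.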

This program can require a large amount of memory, depending on the size of the genome being analyzed, and the computation time can be substantial.

References

  1. Wu H, Caffo B, Jaffe HA, Feinberg AP, Irizarry RA (2010) Redefining CpG Islands Using a Hierarchical Hidden Markov Model. Biostatistics 11(3): 499-514.
  2. Irizarry RA, Wu H, Feinberg AP (2009) A Species-Generalized Probabilistic Model-Based Definition of CpG Islands. Mammalian Genome 20(9-10): 674-680.