Examples of using the ENCODE web sites

Some general points:

The Nature site (http://www.nature.com/encode/) provides direct access to the core articles of the ENCODE project published in the September 6, 2012 issue of Nature, and to 29 "companion papers" in Nature, Genome Research and Genome Biology. Some papers in other journals are also listed as "additional research papers."

The main innovative feature of this site is "the Nature ENCODE explorer" which allows one to "engage with the material by following 'Threads', each one dedicated to a theme discussed in more than one paper." There are 13 threads altogether. "Threads ...complement the papers by highlighting and bringing together topics that are otherwise covered only in subsections of individual papers. Each Thread consists of relevant paragraphs, figures and tables from across the papers, united around a specific theme." Although this organization is useful for a quick prediction of which of the 30 papers might be most relevant to a specific topic, eventually you're probably going to want to look at the whole paper anyway, not just someone else's selection of excerpts.

The search utility on the Nature site (http://www.nature.com/encode/) appears to be based on an index of the titles of the core papers, and the text that appears in the 13 threads. Searching for a word appearing in a paper, but not included in any of the threads, will generate a message "Sorry, no results were found."

Many of the papers include supplementary data, most often as Excel files, but these aren't indexed on the site. The only way to get them is to download the files from the journal.

For raw data, and especially for browsing the genome, you will want to go to the UCSC ENCODE site: http://genome.ucsc.edu/ENCODE/


Example 1: looking for information on a specific gene

If you are interested in a particular gene and want to find out what is currently known about its 5' and 3' flanking regions, potential regulatory elements, whether alternatively-spliced mRNAs have been found, and much more, the UCSC ENCODE site has it all, overwhelmingly so.

On the UCSC site, from the home page, select Genomes in the top menu bar. This defaults to the human genome, but you can choose any of a long list of other animals, or yeast. The default human genome version is the February 2009 assembly (designated GRCh37, and called hg19 at UCSC), and this is the version to which all the ENCODE data have been added.

Enter your gene name, or some other identifier, into the search box. If more than one gene matches that name, you'll get a list from which to choose one. You will then be taken to the genome browser, which displays a region of a chromosome, showing the region occupied by your chosen gene, its exons and introns, and a selection of tracks displaying other information.

Any gene will work for this exercise, although the amount of information retrieved will obviously vary depending on how thoroughly it has been studied. I chose POC1A, which encodes a conserved protein required for centriole stability. If you want to start with a typical browser screen to explore, here is the link to the POC1A gene.
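A browser link of this kind can also be built by hand. The short Python sketch below assumes the browser accepts a gene name or a coordinate range through the `db` and `position` parameters of its `hgTracks` page; the function name `browser_url` is made up for this illustration, and the URL pattern is worth checking against the UCSC help pages before relying on it.

```python
from urllib.parse import urlencode

# Assumed entry point for the UCSC Genome Browser display.
UCSC_BROWSER = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_url(query, assembly="hg19"):
    """Return a Genome Browser URL for a gene name or a chr:start-end range.

    `assembly` is the UCSC database name; hg19 is UCSC's name for the
    February 2009 human assembly (GRCh37).
    """
    params = {"db": assembly, "position": query}
    return f"{UCSC_BROWSER}?{urlencode(params)}"


if __name__ == "__main__":
    print(browser_url("POC1A"))
```

Pasting the printed URL into a web browser should be equivalent to typing the gene name into the search box, provided the assumption about the parameters holds.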

The tracks are identified on the left side of the browser window. Right-clicking with the mouse will bring up a menu that lets you hide or expand the various tracks.

At the top of the screen are arrows to scroll to the left or right, and to zoom in or out. Use these features to examine the regions upstream and downstream of your gene. The button marked "base" will display 148 bases of DNA sequence centered on the point of your cursor.
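Instead of scrolling, you can also type a coordinate range into the position box. The sketch below is one way to work out the ranges for the flanking regions. It assumes 1-based, inclusive coordinates as shown in the position box; the function name and the example coordinates are placeholders for illustration, not the real location of POC1A. Note that for a gene on the minus strand, "upstream" (the 5' side) lies at higher coordinates.

```python
def flanking_regions(chrom, start, end, strand, flank=5000):
    """Return (upstream, downstream) position strings for a gene.

    start and end are 1-based, inclusive gene coordinates; flank is the
    number of bases to show on each side. The left flank is clipped so it
    never starts before position 1.
    """
    if not 1 < start <= end:
        raise ValueError("expected 1 < start <= end")
    left = (max(1, start - flank), start - 1)   # lower coordinates
    right = (end + 1, end + flank)              # higher coordinates
    if strand == "+":
        upstream, downstream = left, right
    elif strand == "-":
        upstream, downstream = right, left
    else:
        raise ValueError("strand must be '+' or '-'")
    return (f"{chrom}:{upstream[0]}-{upstream[1]}",
            f"{chrom}:{downstream[0]}-{downstream[1]}")


if __name__ == "__main__":
    # Placeholder coordinates for a hypothetical minus-strand gene.
    up, down = flanking_regions("chr3", 100000, 120000, "-")
    print("upstream:  ", up)
    print("downstream:", down)
```

The sketch does not clip the right flank at the end of the chromosome, so very large flank values near a chromosome end may give a range the browser rejects.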

Clicking within the browser window will usually bring up a page of additional information; what you get depends on what track, and what item, you clicked on. Usually there will be at least a paragraph explaining what you're looking at, and references to the sources for the data.

Clicking on the gene sequence itself in the browser will give you a page with the annotation for the gene - what it codes for, what the protein does, etc. - plus links to the raw sequence, the mRNA sequence and protein sequence, gene expression data, references and more.

Below the browser window there are drop-down menus for hiding or displaying many additional features.

Specific to ENCODE, the default display includes clusters of DNaseI hypersensitivity regions and transcription factor binding sites identified by ChIP-seq analysis. Other tracks of ENCODE data can also be selected.


Example 2: Attempt to find new SNPs that have been discovered in the ENCODE project

Single nucleotide polymorphisms (SNPs) are used as markers for many inherited diseases and other traits. For example, they are the basis of prenatal blood tests to determine if a woman is a carrier for cystic fibrosis, sickle-cell anemia, Tay-Sachs disease, and some other conditions. I wanted to see if the ENCODE project had turned up new medically useful SNPs, or new information on diseases that are believed to have some genetic basis but for which the cause is still unknown.

I started with the Nature ENCODE site. Using their search box, I first tried searching for "disease":

This retrieved:

    one research paper from the core set
    two additional research papers
    three threads

All three of these papers contained the word "disease" in their titles.

Searching for "disease AND SNP" reduced the output to the three threads, apparently because none of the papers had the word SNP in its title, although all three were indeed highly relevant to my search.

The threads had links to some additional papers, some of which were also relevant.

Hardison (Genome-wide Epigenetic Data Facilitate Understanding of Disease Susceptibility Association Studies) focuses mostly on susceptibility to disease determined by multiple gene loci, as opposed to single-gene mutations (like cystic fibrosis) that have been known for a long time from studies of inheritance in families, and in some cases animal models. Hardison has a nice diagram and explanation of how searches for disease-susceptibility SNPs are done:

"Mapping experiments examine SNPs to ascertain the genotypes that are significantly more prevalent in the affected group than in the non-affected group; these genotypes are associated with the trait of interest."

For example, a group of people with coronary artery disease might be compared to a healthy group of similar age and ethnicity. SNPs at which one allele is significantly more frequent in the affected group than in the healthy group are identified and mapped to the genome. The chromosomal region surrounding each SNP position is then examined to determine if it contains cis-regulatory modules ("CRMs") like promoters or enhancers, which are binding sites for transcription factors, or regions of methylation or histone modifications. This paper contains good explanations of how such features can be identified using ChIP-seq and other techniques.
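To make the comparison concrete, here is a minimal sketch of one common way to test a single SNP: a Pearson chi-square test on a 2x2 table of allele counts. This is a textbook illustration, not the method of any particular ENCODE paper, and the counts in the example are invented.

```python
import math


def allele_association(case_a, case_b, control_a, control_b):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    case_a / case_b: counts of allele A and allele B in the affected group.
    control_a / control_b: the same counts in the unaffected group.
    Returns (chi-square statistic, p-value).
    """
    row_case = case_a + case_b
    row_control = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    if 0 in (row_case, row_control, col_a, col_b):
        raise ValueError("every row and column of the table needs a nonzero total")
    n = row_case + row_control
    chi2 = (n * (case_a * control_b - case_b * control_a) ** 2
            / (row_case * row_control * col_a * col_b))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Invented counts: allele A is seen 60 times in cases, 40 in controls.
    chi2, p = allele_association(60, 40, 40, 60)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Because a genome-wide study repeats this kind of test at a very large number of SNPs, real studies apply a much stricter significance threshold than would be used for a single test.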

Hardison's paper focuses mainly on SNPs already known from population studies ("genome-wide association studies" or GWAS), but uses the ENCODE data on regulatory elements to try to determine how the SNP allele might influence expression of its associated gene.

Maurano et al. (Systematic Localization of Common Disease-Associated Variation in Regulatory DNA) primarily used identification of DNaseI-hypersensitive sites (DHS), which are markers for cis-regulatory elements "within which the cooperative binding of transcription factors creates focal alterations in chromatin structure". These sites appear to be particularly useful for identifying enhancers that may not be close to the gene that they regulate. They also compared such sites in fetal vs. adult tissues, and among populations with various diseases with multigene involvement (as described by Hardison, above).

Schaub et al. (Linking Disease Associations with Regulatory Information in the Human Genome) mined the ENCODE data for what they call "functional SNPs", defined as a "SNP that appears in a region identified as associated with a biochemical event in at least one ENCODE cell line. Functional SNPs can be further subdivided into SNPs that overlap coding or noncoding transcripts, and SNPs that appear in regions identified as potentially regulatory, such as ChIP-seq peaks and DNase I–hypersensitive sites." As in the other studies, these SNPs are typically in promoter or enhancer regions, and are often identified by binding of specific transcription factors. Schaub et al. took this a step further, however, in looking for additional functional SNPs nearby (and genetically linked) in each region of interest. Taken together, these results confirmed many previously identified associations as highly likely to be relevant to a disease or trait. They have some interesting tables of examples.
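The core of the "functional SNP" idea is an overlap test: does a SNP position fall inside any annotated region? The sketch below shows one simple way to do that. It is not the pipeline used by Schaub et al.; the function names, SNP names, and coordinates are all made up, and regions are treated as 1-based with both ends included.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_regions(regions):
    """Merge overlapping (chrom, start, end) regions, grouped by chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sorted(regions):
        merged = by_chrom[chrom]
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return by_chrom


def snps_in_regions(snps, regions):
    """Return names of SNPs (name, chrom, pos) that fall inside any region."""
    merged = merge_regions(regions)
    hits = []
    for name, chrom, pos in snps:
        intervals = merged.get(chrom, [])
        # Index of the last merged interval whose start is <= pos.
        i = bisect_right(intervals, [pos, float("inf")]) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Invented regions (e.g. ChIP-seq peaks or DNase I-hypersensitive sites).
    regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
    snps = [("snpA", "chr1", 250), ("snpB", "chr1", 301),
            ("snpC", "chr2", 50), ("snpD", "chr3", 10)]
    print(snps_in_regions(snps, regions))
```

Merging the regions first keeps the lookup to a single binary search per SNP, which matters once the region list grows to the size of a real peak file.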

Another very relevant paper that turned up in the threads cited in the original search:

Ni et al. (Simultaneous SNP Identification and Assessment of Allele-Specific Bias from ChIP-seq D) used the ENCODE data for several transcription factors and DHS sites to identify a huge number of new SNPs. These were then analyzed for overlap with previously-identified disease-associated SNPs, and for the frequency of the possible alleles of the new SNPs in the worldwide population data.

Finally, the lead article in the Nature ENCODE issue, An integrated encyclopedia of DNA elements in the human genome, includes a supplementary Excel file with a long list of diseases and traits identified by SNPs, and correlated with specific transcription factors or DNase I hypersensitivity sites. If you were looking for all the transcription factors that seem to have an association with a particular disease, then this would be a place to start.

Unfortunately, however, I found it very difficult to move from these papers to looking for SNPs associated with any particular disease, if it wasn't specifically discussed in the paper. I think the best way to go about this is probably to download raw data from the UCSC site:

There are many choices here. One approach is to use the pull-down menus in the second section to build a filter that reads "lab producing data" "is among", followed by a selection box.

Click in the box that says "ANY" and you'll get a list of the contributing laboratories. Select the one that corresponds to the paper that interests you.

Another repository for the ENCODE data is at NCBI Gene Expression Omnibus. This site allows searching data sets from individual labs or projects.

Two other resources for genetics of human disease are the Genetic Association Database and OMIM (Online Mendelian Inheritance in Man), but these aren't based on the ENCODE data.


Example 3: Finding information on individual transcription factors and their binding sites

The ENCODE papers are packed with discussion of various transcription factors, but they presume a familiarity that many readers won't have. Fortunately, there's a separate site that organizes this information nicely:

Factorbook presents a matrix of transcription factors and the cell lines in which they have been mapped. Clicking on the name of each transcription factor brings up a very informative page describing it in detail, showing its protein structure and binding sites, and providing links to various other databases.



