How-To: Genomic Assembly
Modern genomic projects are a three-phase undertaking, whereby a given organism’s genome is sequenced, assembled, and annotated. A very general, but possibly helpful, overview of this process is provided below.
- Sequencing: Shotgun (WGS) sequencing is a fairly straight-forward process. High molecular weight DNA is first sheared into random fragments. These fragments are then size-selected into (2, 10, 50, 150 kb) libraries, cloned into proper sequencing vectors, and transformed into bacterial cell lines for colony generation and selection. Insert-containing vectors undergo classical chain-termination (Sanger) sequencing in a bi-directional fashion. “Mate pair” (read and end-read) sequences are generated for each insert. These sequences, and also the physical distance information linking reads within each pair, help structure assembly efforts.
- Assembly: The idea now is to link these sequences into larger continuous regions, allowing for eventual genomic “closure”. Ideally, this would mean that the only sequencing gaps remaining in an assembly are those representing the physical separation of the individual chromosomes of an organism. Realistically, this means a working draft of the whole genome with a myriad of gap-closure issues. An effective assembly algorithm must optimally identify overlapping sequence reads within the bounds of mate-pair distance data, while dealing with A/T richness and the highly-repetitive nature of certain genomes. It’s a difficult, but not insurmountable, computational problem. A whole slew of refined assembly algorithms see current use, including the TIGR Assembler, Celera Assembler, Arachne, and PCAP.
- Annotation: This is the phase in which information of biological relevance is attached to the raw genomic construct. It’s a dynamic effort to decipher both the structural nature of genes and regulatory elements, and perhaps more importantly, to ascribe functional purpose to the molecular actors spawned by these structures. Approaches may involve homology-based algorithms (BLAST), complex stochastic modeling (hidden Markov models), structural motif recognition, comparative genomics (whole-genome alignment), and the overlay of EST data (transcriptome alignment).
Each step presents a unique set of challenges. Now, a miniature exercise to mimic how an organism’s genome is assembled.
Sequences and Chromatographs:
We begin with 1648 shotgun sequences output from an ABI sequencing machine. These sequences come in the form of chromatographs, and reads from the same vector are paired by filename (forward and reverse, arbitrarily).

Here, I’ve set a low-quality region next to a high-quality region from a single chromatograph representing a single sequencing read:

Base calling:
There’s an established method to gauge the probability that each individual base is correctly “called” (labeled) according to parameters drawn from the chromatograph. We’ll use Phred to assign numerical values to each base position, representing base quality. Phred uses Fourier methods to examine base traces surrounding each potential position in our dataset, determining where a peak should be under the assumption that there’s no compression. Next, Phred examines the individual traces to determine the precise center of true peaks, and their areas relative to neighboring peaks. A dynamic programming algorithm is used to map observed peaks to predicted peaks. Phred then evaluates the trace surrounding the final positional centers using five quality parameter values to quantify trace quality.
This quality value (QV) is related to the probability of a base-call error (P_e):
QV = – 10 * log_10(P_e)
A single command gets the job done:
phred -id “directory of sequences” -sa …/basecalls -qa …/basequals
This outputs two files,
basecalls: contains the actual bases called for each position
basequals: contains the quality values (QV) for each base call
Base quals and calls for a corresponding reverse read.

Trimming, Vector contamination:
Armed with this information, we can “trim” our sequence data to only include high probability bases. We should also identify and remove any vector contaminants. We’ll use Lucy to simultaneously accomplish both. The sequencing vector used to generate our library of reads is the PUC19 vector. Lucy intakes both the full-length sequence of the cloning vector and its end “splice sites” in both orientations, as shown below:

Again, a single command does it all:
lucy -vector ‘PUC19′ ‘PUC19splice.lucy’ -e 0.05 0.045 -debug ‘lucy.debug’ -output ‘lucy.seq’ ‘lucy.qul’ ‘basecalls’ ‘basequals’
Where ‘basecalls’ and ‘basequals’ refer to files generated by Phred.
Assembly:
We’ll use the TIGR suite of assembly tools for this step.
“run_TA” is a csh shell script for running the TIGR Assembler without requiring an explicit listing of all input parameters.
Assembly command:
run_TA ‘-q …/project.qul’ …project.seq
This command will create the files ‘project.align’ ‘project.asm’ ‘project.error’ ‘project.fasta’. Only ‘project.asm’ is of interest at this juncture. We’ll use it to “group” continuously assembled pieces of DNA, revealing overall structure and any remaining gaps.
First, we parse the “asm” file for “Grouper” processing:
perl asm_to_group.pl -i “project.asm”
output: ‘group.inp’
Second, we “group” related reads in a segmented assembly:
grouper ‘group.inp’
output: ‘contained.out’ ‘display.out’ ‘group.inp.noc’ ‘group.out’
‘display.out’- the graphical (ascii) display, containing summarized information about group content and structuring.
In the case of these reads, it looks like this:

This allows is to measure the effectiveness of our assembly campaign. Note 5 “gaps” listed as either “phy” (physical) or “seq” (sequence). Gaps 1 and 5 are physical gaps representing the variable ends of a potential full assembly, and therefore, can be ignored. This leaves us with 3 sequencing gaps (2,3,4) and 1 physical gap (3). The physical gap is problematic, signaling a lack of read-pair linking information between two sets of assembled groups. The sequencing gaps can be dealt with more easily, but will require more benchwork. Next up, filling gaps using simulated PCR.
Gap-closure:
A common gap-closure technique is to use PCR to bridge two ordered, linked continuous stretches of DNA. Performing PCR for each assembly gap is a time-consuming and expensive process. To account for this, a genomics consortium may decide to hide a large set of chromatographs, generated by random PCR, from the programmers and bioinformaticians involved in assembly. To gain access to a given read, an assembler will need to provide a “bait” PCR primer sequence that falls within the appropriate window. If you’re given a limited number of PCR bait attempts, it would be wise to input 20-mers at the end of a given block, or a reverse-complemented 20-mer at the beginning of a block.
Once you’ve acquired enough PCR reads to help smooth over sequencing gaps, it’s time to re-assemble.
Re-assembly:
This time, we’ll use phredPhrap and consed to re-assemble and view our updated assembly.
mkdir chromat_dir edit_dir phd_dir
symbolic link for all *.gz chromatographs (original and pcr-generated) to chromat_dir
ln -s …/*.gz …/chromat_dir
running phredPhrap assembler (script) from edit_dir folder
nice phredPhrap
outputs set of files, view “project.fasta.screen.ace.1″ with consed
consed &
consed re-assembly results

! single contig (contig1) 104,461 bp in length. Success.
We can also view a coverage map that shows the original overlapping reads under the full-length assembled contig.

Colorful and stuff.
About this entry
You’re currently reading “How-To: Genomic Assembly,” an entry on Ramblings & Musings
- Published:
- October 23, 2007 / 11:57 pm
- Category:
- Science.
- Tags:
No comments yet
Jump to comment form | comments rss [?] | trackback uri [?]