DAWG VERSION 1.2-RELEASE

Copyright (c) (2004-2009) Reed A. Cartwright - All rights reserved.

DESCRIPTION

Dawg is an application that will simulate nucleotide evolution with
gaps.

ABSTRACT

DNA Assembly with Gaps (Dawg) is an application designed to simulate the
evolution of recombinant DNA sequences in continuous time based on the
robust general time reversible model with gamma and invariant rate
heterogeneity and a novel length-dependent model of gap formation. The
application accepts phylogenies in Newick format and can return the
sequence of any node, so users can record the exact evolutionary history
if they choose. Dawg records the gap history of every lineage to produce
the true alignment in the output. Many options are available to allow
users to customize their simulations and results.

Many tools and procedures exist for reconstructing alignments and
phylogenies and for estimating evolutionary parameters from extant data.
True phylogenies and alignments are known only in very rare instances. In
the absence of data with known true phylogenies, we must rely on
simulations to test the accuracy of such procedures. Proper simulation
of sequence evolution should involve both nucleotide substitution and
indel formation. However, existing tools for simulating sequence
evolution either do not include indels, like Seq-gen or evolver, or
include a rather inexact model of indel formation, like Rose. I
developed Dawg to fill in these gaps.

CONTACT

racartwr@ncsu.edu or reed@scit.us

Reed A. Cartwright, PhD
Postdoctoral Research Associate
Department of Genetics
Bioinformatics Research Center
North Carolina State University
Campus Box 7566
Raleigh, NC 27695-7566

Most work was done while I was a PhD student:

Department of Genetics
University of Georgia
Athens, GA

REFERENCE

Cartwright, R.A. (2005) DNA Assembly With Gaps (Dawg): Simulating Sequence
Evolution. Bioinformatics 21 (Suppl. 3): iii31-iii38

LICENSE

See COPYING for license information.

DOWNLOAD

Dawg can be downloaded from <http://scit.us/projects/dawg/>.

INSTALLATION

See Dawg's website for binary packages for Windows, Mac OS X, and other
systems. Alternatively, you can compile Dawg from source. Building Dawg
from source requires CMake 2.6 (http://www.cmake.org/). Many
Unix-like operating systems can install CMake through their package
systems. Extract the Dawg source code and issue the following commands in
the extracted directory:

    cmake .
    make
    make install

The '-G' option to cmake specifies a different build system, e.g. Unix
Makefiles versus a KDevelop3 project. The '-D' option to cmake sets
cmake variables from the command line:

    cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr .
    make
    make install

This will build an optimized version of Dawg and install it to '/usr/bin'.
To specify your own build flags, set the environment variables
CFLAGS and LDFLAGS as necessary, then specify an empty build type:

    cmake -DCMAKE_BUILD_TYPE= .

See CMake's manual for additional information.

If you want to build the source code on Windows, you will need to install Flex
and Bison from the Gnuwin32 project <http://gnuwin32.sourceforge.net/> and make
sure that they are in your path. You can then run the CMake GUI. If
you would prefer to run the command line version, open a command console
through the Visual Studio tools shortcut (or a similar shortcut). This will add
the required compiler programs to your command console environment. After
changing to the source code directory, issue the following commands:

    cmake -G "NMake Makefiles" .
    nmake

If successful, you should find dawg.exe in the "src" directory.

If you are trying to compile Dawg on a UNIX machine that does not have CMake
installed, and you can't install it from a package, then you may need to install
it locally. After downloading and extracting CMake in your home directory,
change to its directory and issue the following commands:

    ./configure --prefix=$HOME
    make
    make install

If "make" fails, try using "gmake" instead.

EXAMPLES

    example0.dawg - minimal
    example1.dawg - typical usage
    example2.dawg - simple indel formation
    example3.dawg - robust indel formation
    example4.dawg - recombination

COMMAND LINE USAGE

    dawg -[scubvhqew?] [-o outputfile] file1 [file2...]
      -s: process files serially [default]
      -c: process files combined together
      -u: unbuffered output
      -b: buffered output [default]
      -q: disable error and warning reports (quiet)
      -e: enable error reports [default]
      -w: enable warning reports [default]
      -v: display version information
      -h: display help information
      -?: same as -h
      -o outputfile: override output filename in the configuration file

Dawg will read from stdin if a filename is "-".

FILE FORMAT

The file format takes a series of statements in the form of "[name]" or
"name = value", where "name" is alphanumeric and "value" can be a string,
number, Boolean, tree, or vector of values. The former specifies a heading,
which is prefixed to the names that follow it and can simplify variable
assignment. A single value is equivalent to a vector with a single entry.

When using headings, the following groups of statements are equivalent:

    Out.Block.Head = "A Comment"
    Out.Block.Tail = "B Comment"

    [Out.Block]
    Head = "A Comment"
    Tail = "B Comment"

    Out.Block.Head = "A Comment"
    [Out.Block]
    Tail = "B Comment"

    [Out.Block]
    Head = "A Comment"
    []
    Out.Block.Tail = "B Comment"

    [Out]
    Block.Head = "A Comment"
    [Out.Block]
    Tail = "B Comment"

    [Out]
    Block.Head = "A Comment"
    [.Block]
    Tail = "B Comment"

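The heading rules can be illustrated with a small Python sketch. It is not Dawg's parser: the function name `resolve` is hypothetical, only quoted string values are handled, and the treatment of "[.Block]" as an extension of the current heading is an assumption drawn from the equivalent groups listed above.

```python
def resolve(text):
    """Expand headings so every setting gets its fully qualified name."""
    heading = ""      # current heading prefix, e.g. "Out.Block"
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            if name.startswith("."):
                # Assumption: a leading dot extends the current heading.
                heading = (heading + name).lstrip(".")
            else:
                # "[]" clears the heading; any other name replaces it.
                heading = name
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        full = f"{heading}.{key}" if heading else key
        settings[full] = value.strip().strip('"')
    return settings


# The six equivalent groups of statements from the text above.
FORMS = [
    'Out.Block.Head = "A Comment"\nOut.Block.Tail = "B Comment"',
    '[Out.Block]\nHead = "A Comment"\nTail = "B Comment"',
    'Out.Block.Head = "A Comment"\n[Out.Block]\nTail = "B Comment"',
    '[Out.Block]\nHead = "A Comment"\n[]\nOut.Block.Tail = "B Comment"',
    '[Out]\nBlock.Head = "A Comment"\n[Out.Block]\nTail = "B Comment"',
    '[Out]\nBlock.Head = "A Comment"\n[.Block]\nTail = "B Comment"',
]

EXPECTED = {"Out.Block.Head": "A Comment", "Out.Block.Tail": "B Comment"}
```

Under these assumptions, `resolve(form)` returns the same dictionary, `EXPECTED`, for every entry in `FORMS`.
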
Values can be specified via the following syntaxes.

    string:  "[char-sequence]"
             '[char-sequence]'
             """[multi-line char-sequence]""" (removes initial and final newlines)
             '''[multi-line char-sequence]''' (preserves initial and final newlines)
    number:  [sign]digits[.digits][(e|E)[sign]digits]
    boolean: true|false
    tree:    Newick Format
    vector:  { value, value, ...}

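As a quick check of the number syntax, the grammar "[sign]digits[.digits][(e|E)[sign]digits]" can be written as a regular expression. The Python sketch below is a literal transcription of that line; the names `NUMBER` and `is_number` are hypothetical, and Dawg's actual lexer may accept forms this pattern rejects.

```python
import re

# [sign]digits[.digits][(e|E)[sign]digits], transcribed literally.
NUMBER = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def is_number(token):
    """Return True if the whole token matches the number grammar."""
    return NUMBER.fullmatch(token) is not None
```

With this pattern, "1.0", "-3", "2.5e-3", and "1E6" are numbers, while ".5", "1.", and "1e" are not, because the grammar requires digits before the decimal point and after any "." or exponent marker.
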
OPTIONS

Types: N = number, S = string, B = Boolean, T = tree. A leading V denotes a
vector, so VVN is a vector of vectors of numbers.

Name              Type  Description
--------------------------------------------------------------------------
Tree              VT    phylogeny
TreeScale         N     coefficient to scale branch lengths by
Sequence          VS    root sequences
Length            VN    length of generated root sequences
Rates             VVN   rate of evolution of each root nucleotide
Model             S     model of evolution: GTR|JC|K2P|K3P|HKY|F81|F84|TN
Freqs             VN    nucleotide (ACGT) frequencies
Params            VN    parameters for the model of evolution
Width             N     block width for indels and recombination
Scale             VN    block position scales
Gamma             VN    coefficients of variance for rate heterogeneity
Alpha             VN    shape parameters
Iota              VN    proportions of invariant sites
GapModel          VS    models of indel formation: NB|PL|US
Lambda            VN    rates of indel formation
GapParams         VVN   parameters for the indel model
Reps              N     number of data sets to output
File              S     output file
Format            S     output format: Fasta|Nexus|Phylip|Clustal
GapSingleChar     B     output gaps as a single character
GapPlus           B     distinguish insertions from deletions in alignment
KeepFlank         N     undeletable flanking regions N nucs from sequence
KeepEmpty         B     preserve empty columns in final alignment
LowerCase         B     output sequences in lowercase
Translate         B     translate output sequences to amino acids
Seed              VN    pseudo-random-number-generator seed (integers)
Out.Block.Head    S     string to insert at the start of the output
Out.Block.Tail    S     string to insert at the end of the output
Out.Block.Before  S     string to insert before a sequence set in the output
Out.Block.After   S     string to insert after a sequence set in the output
Out.Subst         B     do variable substitution in Out.Block.*

DEFAULTS

    TreeScale = 1.0
    Length = 100
    Model = "JC"
    Freqs = {0.25,0.25,0.25,0.25}
    Params = {1.0,1.0,1.0,1.0,1.0,1.0}
    Width = 1
    Scale = 1.0
    Gamma = 0.0
    Iota = 0.0
    GapModel = "US"
    GapParams = 1.0
    Reps = 1
    Format = "Fasta"
    GapSingleChar = false
    GapPlus = false
    LowerCase = false
    Translate = false
    Out.Subst = true

VARIABLE SUBSTITUTION

If Out.Subst is true (the default), then Dawg will perform variable substitution
in any Out.Block that it outputs. Currently three variables are supported:

    %r is replaced by the current dataset number
    %R is replaced by the total number of datasets
    %% is replaced by a percent sign

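The substitution rules can be sketched in a few lines of Python. This is an illustration only, not Dawg's code: the function name `substitute` is hypothetical, and leaving unrecognized sequences such as "%x" unchanged is an assumption.

```python
import re


def substitute(block, rep, total):
    """Replace %r, %R and %% in one left-to-right pass."""
    table = {"%r": str(rep), "%R": str(total), "%%": "%"}
    # A single pass ensures that "%%r" yields a literal "%r"
    # rather than being expanded a second time.
    return re.sub(r"%[rR%]", lambda m: table[m.group(0)], block)
```

For example, `substitute("dataset %r of %R", 3, 10)` gives "dataset 3 of 10", and `substitute("100%%", 1, 1)` gives "100%".
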
OUTPUT FILE

Dawg can automatically detect the format of the output file based on its extension.
Supported extensions and their formats are:

    Clustal: aln, poo, txt, out, Clustal
    Fasta:   fas, Fasta
    Nexus:   nex, Nexus
    Phylip:  phy, Phylip

Dawg also supports the filename format of "ext:file" to output to "file" with
the format specified by extension "ext". That way one can use "nex:-" to output
to stdout in Nexus format.

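The lookup described above might work roughly as in the following Python sketch. It is not taken from Dawg's source: the names `LOOKUP` and `parse_output` are hypothetical, and the case-insensitive matching, the handling of unknown extensions (returning None), and treating a prefix as "ext:" only when it is a known extension are all assumptions.

```python
import os

EXTENSIONS = {
    "Clustal": ("aln", "poo", "txt", "out", "clustal"),
    "Fasta": ("fas", "fasta"),
    "Nexus": ("nex", "nexus"),
    "Phylip": ("phy", "phylip"),
}
# Invert the table: extension -> format name.
LOOKUP = {ext: fmt for fmt, exts in EXTENSIONS.items() for ext in exts}


def parse_output(name):
    """Return (format or None, filename) for an output file setting."""
    prefix, sep, rest = name.partition(":")
    if sep and prefix.lower() in LOOKUP:
        # "ext:file" form: the prefix selects the format explicitly.
        return LOOKUP[prefix.lower()], rest
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    return LOOKUP.get(ext), name
```

Here `parse_output("nex:-")` yields `("Nexus", "-")`, matching the stdout example in the text, and `parse_output("sim.phy")` yields `("Phylip", "sim.phy")`.
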
NOTES

The meaning of the "Params" vector is different for each substitution model.

    GTR: Substitution rates A-C, A-G, A-T, C-G, C-T, G-T
    JC:  Ignored
    K2P: Transition rate, Transversion rate
    K3P: Alpha (Transitions), Beta (A-T & G-C), Gamma (A-C & G-T)
    HKY: Transition rate, Transversion rate
    F81: Ignored
    F84: Kappa
    TN:  Alpha1 (A-G), Alpha2 (C-T), Beta (Transversions)

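To show how the GTR "Params" order lines up with "Freqs", here is a Python sketch of the textbook general time reversible rate matrix, with off-diagonal entries q[i][j] = r[i][j] * freq[j] and rows summing to zero. It uses the A-C, A-G, A-T, C-G, C-T, G-T order listed above and the ACGT order of "Freqs". The function name `gtr_matrix` is hypothetical, and any rescaling Dawg applies to the matrix is not shown.

```python
def gtr_matrix(params, freqs):
    """Build an unnormalized GTR rate matrix in A, C, G, T order."""
    ac, ag, at, cg, ct, gt = params
    # Symmetric exchangeability matrix; rows and columns are A, C, G, T.
    r = [
        [0.0, ac, ag, at],
        [ac, 0.0, cg, ct],
        [ag, cg, 0.0, gt],
        [at, ct, gt, 0.0],
    ]
    q = [[r[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        # Each diagonal entry makes its row sum to zero.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q
```

With the default settings (all six parameters 1.0 and all frequencies 0.25), every off-diagonal entry is 0.25 and every diagonal entry is -0.75, which is the Jukes-Cantor pattern up to scale.
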
Parameter "Freqs" is ignored by the models "JC", "K2P", and "K3P".

If "Lambda" is a single value, then it specifies the total rate of indel
formation, split evenly between insertions and deletions, e.g. "Lambda = 0.1"
is the same as "Lambda = {0.05, 0.05}". When two values are given, the first
parameter is the insertion rate and the second parameter is the deletion rate.

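The "Lambda" rule can be expressed as a short Python sketch. The function name `indel_rates` is hypothetical, and treating a one-element vector like a single value follows the statement in FILE FORMAT that a single value is equivalent to a vector with one entry.

```python
def indel_rates(lam):
    """Return (insertion_rate, deletion_rate) from a Lambda setting."""
    values = list(lam) if isinstance(lam, (list, tuple)) else [lam]
    if len(values) == 1:
        # A single value is the total rate, split evenly.
        return (values[0] / 2.0, values[0] / 2.0)
    insertion, deletion = values
    return (float(insertion), float(deletion))
```

So `indel_rates(0.1)` returns `(0.05, 0.05)`, while `indel_rates([0.02, 0.08])` keeps the two rates as given.
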
The first parameter of "GapModel" specifies the distribution model of
insertion sizes. The second parameter specifies the distribution model of
deletion sizes. If only one parameter is given, it is the model for both
insertions and deletions.

The first parameter of "GapParams" is a vector specifying the parameters for the
gap model of insertions. Likewise, the second parameter is a vector specifying
the parameters for the gap model of deletions. If "GapParams" is not a vector
of vectors, then it specifies the vector of parameters for both insertions and
deletions.

The meaning of the "GapParams" vector is different for each gap model.

    US: The distribution of gap sizes.
    NB: The number of failures (r), the probability of success (q).
    PL: The rate parameter (a), the maximum gap size.

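The pairing rules for "GapModel" and "GapParams" can be sketched in Python as follows. The names `as_list` and `gap_settings` are hypothetical, and the sketch only sorts settings into insertion and deletion slots; it does not implement the US, NB, or PL distributions themselves.

```python
def as_list(value):
    """Treat a single value as a vector with one entry."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def gap_settings(gap_model, gap_params):
    """Pair each of insertion and deletion with its model and parameters."""
    models = as_list(gap_model)
    if len(models) == 1:
        models = models * 2           # one model serves both
    params = as_list(gap_params)
    if params and all(isinstance(p, (list, tuple)) for p in params):
        nested = [list(p) for p in params]
        if len(nested) == 1:
            nested = nested * 2
    else:
        # Not a vector of vectors: the same parameters serve both.
        nested = [list(params), list(params)]
    return {
        "insertion": (models[0], nested[0]),
        "deletion": (models[1], nested[1]),
    }
```

For instance, `gap_settings("PL", [1.5, 100])` assigns the power-law model with parameters `[1.5, 100]` to both insertions and deletions, while `gap_settings(["NB", "PL"], [[1, 0.5], [1.5, 100]])` gives insertions the NB settings and deletions the PL settings.
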
To create a recombinant tree, you may need to specifically describe and label
the inner nodes at which the recombination events occur. See example4.dawg.

"Gamma" takes precedence over "Alpha".

"Sequence" takes precedence over "Length".

If Out.Block.* is the name of a file, the block's contents are read from that
file.

The following vector parameters have a size of "Width": "Scale", "Alpha",
"Gamma", and "Iota". If their size is less than "Width", then the first value in
the vector will be used to fill in the rest of the values, e.g. "Scale = 1.0"
is the same as "Scale = {1.0,1.0,1.0}" when "Width = 3".
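
The fill rule for "Width"-sized vectors can be sketched in Python. The function name `fill_to_width` is hypothetical, and returning vectors that are already long enough unchanged is an assumption, since the text only covers the case where the vector is too short.

```python
def fill_to_width(values, width):
    """Pad a vector to `width` entries using its first value."""
    vec = list(values) if isinstance(values, (list, tuple)) else [values]
    if vec and len(vec) < width:
        # The first value fills every missing position.
        vec = vec + [vec[0]] * (width - len(vec))
    return vec
```

Thus `fill_to_width(1.0, 3)` gives `[1.0, 1.0, 1.0]`, and `fill_to_width([2.0, 3.0], 3)` gives `[2.0, 3.0, 2.0]`, because the missing third entry is taken from the first value.
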