All input files supplied by the user are expected to be ’formatted’ (in a Fortran sense),
i.e. should be plain text or ASCII files.
Non-standard characters may cause problems!
HINT: Take great care when switching from DOS/Windows to Linux or
vice versa: Remember that these operating systems use different
end-of-line coding - this means that you may have to ‘translate’ files using
a utility like dos2unix (unix2dos) or fromdos (todos).
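Where neither dos2unix nor fromdos is at hand, the end-of-line translation can be sketched in a few lines of Python. This is only an illustration of what such a utility does; the function names below are hypothetical and not part of WOMBAT.

```python
def to_unix_line_endings(raw: bytes) -> bytes:
    """Convert DOS/Windows (CR LF) and bare CR line ends to Unix (LF)."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Write a copy of src with Unix line ends to dst."""
    # Binary mode keeps Python from translating line ends on its own.
    with open(src, "rb") as fin:
        data = fin.read()
    with open(dst, "wb") as fout:
        fout.write(to_unix_line_endings(data))
```

Calling convert_file on a data file written under Windows would yield a copy with Unix line ends; the file names to use are up to the user.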
6.2 Data File
The data file is mandatory. It gives the traits to be analysed, and all information on
effects in the model of analysis. It is expected to have the following features:
-
1.
- There is no ’default’ name for the data file. File names up to 30 characters
long are accommodated.
-
2.
- Variables in the data file should be in fixed width columns, separated by
spaces.
-
3.
- Each column, up to the maximum number of columns to be considered (=
number of variables specified in the parameter file), must have a numerical
value – even if this column is not used in the analysis, i.e. no ‘blank’ values!
-
4.
- All codes of effects to be considered (fixed, random or ‘extra’ effects) must
be positive integer variables, i.e. consist of a string of digits only.
The maximum value allowed for a code is 2147483647, i.e. just over 2
billion.
-
5.
- All traits and covariables (including control variables) are read as
real values, i.e. may contain digits, plus or minus signs, and
Fortran type formatting directives only.
N.B.: Calculations in WOMBAT use an operational zero
(default value: 10^-8), treating all smaller values as zero.
To avoid numerical problems, please ensure your traits are
scaled so that their variances are in a moderate range
(something like 10^-5 to 10^5).
-
6.
- Any alphanumeric strings in the part of the data file to be read by
WOMBAT are likely to produce errors !
-
7.
- For multi-trait analyses, there should be one record for each trait recorded for an
individual.
The trait number for the record should be given in the first column.
No special codes for ’missing values’ are available – missing traits are simply
absent records in the data file.
-
8.
- The data file must be sorted in ascending order, according to:
-
i)
- the individual (or ’subject’) for which traits are recorded, and
-
ii)
- according to the trait number within individual.
-
iii)
- For RR analyses, records are expected to be sorted according to the value
of the control variable (within individual and trait number) in
addition.
N.B.: WOMBAT does not allow ‘repeated’ records for
individual points on the trajectory in RR analyses, i.e.
you can not have multiple observations for an individual
with the same value of the control variable.
-
9.
- For multivariate analyses combining traits with repeated and single records, the
traits with repeated records need to have a lower trait number than those with
single records only.
To facilitate annotation of the data file (e.g. column headers, date of creation,
source), WOMBAT will skip lines with a ’#’ (hash sign) in column 1 at
the beginning of the file - there is no limit on the number, n, of such lines,
but they must represent the first n lines (any ’#’ elsewhere will cause an
error).
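The rules above lend themselves to a quick check before a run. The following is a minimal sketch, assuming a multi-trait layout with the trait number in the first column and the individual code in the second; the column positions, the function name and the messages are illustrative assumptions, not WOMBAT conventions. Note that Python's float() is stricter than a Fortran read in some respects (it rejects exponents written with a D, for example).

```python
def check_data_lines(lines, n_columns, id_col=1, trait_col=0):
    """Return a list of (line_number, message) for rows that break the rules."""
    problems = []
    previous_key = None
    in_header = True
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            # Annotation lines are only allowed before the first record.
            if not in_header:
                problems.append((number, "'#' line after the first data record"))
            continue
        in_header = False
        fields = line.split()
        if len(fields) < n_columns:
            problems.append((number, "fewer columns than expected"))
            continue
        try:
            values = [float(f) for f in fields[:n_columns]]
        except ValueError:
            problems.append((number, "non-numeric value"))
            continue
        # Records must be in ascending order of individual, then trait number.
        key = (values[id_col], values[trait_col])
        if previous_key is not None and key < previous_key:
            problems.append((number, "not sorted by individual and trait"))
        previous_key = key
    return problems
```

An empty list means none of the checked rules is violated; it does not guarantee that WOMBAT will accept the file.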
6.3 Pedigree File
If the model of analysis contains random effect(s) which are assumed to be
distributed proportional to the numerator relationship matrix, a pedigree file is
required. It is expected to have the following features:
-
1.
- There is no ’default’ name for the pedigree file. File names up to 30
characters long are accommodated.
-
2.
- The pedigree file must contain one line for each animal in the data.
Additional lines with pedigree information for parents without records
themselves can be included.
-
3.
- Each line is expected to contain (at least) three integer variables:
-
(a)
- the animal code,
-
(b)
- the code for the animal’s sire,
-
(c)
- and the code for the animal’s dam.
All codes must be valid integers in the range of 0 to 2147483647.
For analyses distinguishing between genotyped and non-genotyped individuals,
a fourth column needs to be supplied containing a “1” for non-genotyped and
“2” or “3” for genotyped individuals.
If genetic groups are to be fitted, relevant information is to be given from
column 5 onwards; see subsection 4.17.10 for details.
Additional, optional variables in the fourth or fifth column can be:
-
(d)
- the animal’s inbreeding coefficient (real variable between 0 and 1),
-
(e)
- a code of 1 (males) or 2 (females) (integer), defining the number of
X chromosomes, if a corresponding relationship matrix is to be set up.
Note that these relate to older, seldom used model options which are
incompatible with genotype codes or genetic group information.
-
4.
- All animals must have a numerically higher code than either of their
parents.
Unknown parents are to be coded as “0”.
-
5.
- If maternal genetic effects are to be fitted in the model of analysis, all dams of
animals in the data must be ‘known’, i.e. have codes > 0.
-
6.
- The pedigree file does not need to be sorted. However, sorting according to
animal code (column 1) (in ascending order) is desirable and highly
recommended as it will reduce processing time, especially for large numbers of
individuals.
As for the data file, any lines at the beginning of the pedigree file with a ’#’ (hash
sign) in column 1 are ignored.
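The coding rules for the first three columns can be checked with a short script before running an analysis. The sketch below is an illustration only: the function name and messages are hypothetical, and as a simplification it skips ’#’ lines wherever they occur rather than only at the beginning of the file.

```python
MAX_CODE = 2147483647  # largest code allowed, as stated above


def check_pedigree_lines(lines):
    """Return a list of (line_number, message) for lines that break the rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # simplification: comment lines are skipped anywhere
        fields = line.split()
        try:
            animal, sire, dam = (int(f) for f in fields[:3])
        except ValueError:
            # Raised for non-integer fields and for fewer than three fields.
            problems.append((number, "need three integer codes"))
            continue
        if not all(0 <= code <= MAX_CODE for code in (animal, sire, dam)):
            problems.append((number, "code out of range"))
        # Parents must have lower codes than the animal; unknown parents are 0.
        if sire >= animal or dam >= animal:
            problems.append((number, "parent code not lower than animal code"))
    return problems
```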
6.4 Marker counts file
With the growing prevalence of genomic information and need for mixed model
analyses of such data, capabilities of WOMBAT have recently been extended to
perform some of the tasks required (see run options, chapter 5). These require a
file of allele counts for markers (or SNPs).
-
1.
- The ’default’ name for this input file is MarkerCounts.dat. Alternative
names (up to 30 characters long) can be specified in the parameter file
(see section 4.8).
-
2.
- The marker counts file must contain at least one line for each genotyped
animal.
-
3.
- Information required for each animal, to be read consecutively, comprises:
-
(a)
- the animal code (as in data or pedigree file) at the beginning of a
NEW line.
-
(b)
- the marker counts, typically 0, 1 and 2 (though real variables can
be accommodated for some forms of input; see below) for exactly m
markers, with m as specified in the parameter file. These may be on
the same line or extend over continuation lines.
-
4.
- There are different format options for MarkerCounts.dat or its equivalent:
-
(a)
- The default is a formatted file with space-separated variables. This is
expected to have the extension .dat in the file name. For this option, marker
counts are read as 4 Byte real variables.
-
(b)
- Alternatively, a file name with extension .BIN or .BI1 is read as a
binary file. In this case, marker allele counts are expected to have been
written out as integer*1 variables (1 Byte long), while the animal
codes are read as integer variables of standard length (4 or 8 Bytes).
-
(c)
- An extension of .BR4 specifies a binary input file where allele counts
have been written out as 4 Byte real variables.
-
5.
- NB: WOMBAT does absolutely NO checking of the contents of this file -
missing counts or mono-morphic markers may create problems!
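Since the contents of this file are not checked, it may be worth screening the default, formatted form of the file beforehand. The sketch below reads the values as one stream of space-separated items (animal code followed by m counts, possibly over continuation lines) and lists markers that show no variation. The function names are hypothetical, and the sketch does not verify that each animal code starts on a new line.

```python
def read_marker_counts(text, m):
    """Parse 'animal code + m counts' records from formatted file contents."""
    tokens = text.split()
    record_length = m + 1  # animal code followed by m counts
    if len(tokens) % record_length != 0:
        raise ValueError("number of values is not a multiple of m + 1")
    counts = {}
    for start in range(0, len(tokens), record_length):
        animal = int(tokens[start])
        counts[animal] = [float(t) for t in tokens[start + 1:start + record_length]]
    return counts


def monomorphic_markers(counts):
    """Return 0-based indices of markers with the same count in every animal."""
    rows = list(counts.values())
    if not rows:
        return []
    return [j for j in range(len(rows[0])) if len({row[j] for row in rows}) == 1]
```

A ValueError from read_marker_counts points to missing or surplus counts; what to do with any markers reported by monomorphic_markers is left to the user.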
6.5 Parameter File
WOMBAT acquires all information on the model of analysis from a parameter
file.
Rules to set up the parameter file are complex, and are described in detail in a
separate chapter (chapter 4).
6.6 Other Files
Depending on the model of analysis chosen, additional input files may be
required.
6.6.1 General inverse file
For each random effect fitted for which the covariance option GIN (see
subsection 4.10.4) has been specified, WOMBAT expects a file set up by the user
which contains the inverse of the matrix K (such as relationship or correlation
matrix) which determines the ‘structure’ of the covariance matrix for the random
effect. The following rules apply :
-
1.
- The file name should be equal to the name of the random effect, with
the extension .gin. For example, mother.gin for a random effect called
mother.
For random effect names containing additional information in round
brackets, for instance in RR analysis, only the part preceding the ‘(’ should
be used. In this case, be careful to name the effects in the model so that
no ambiguities arise!
-
2.
- The first line of the file should contain a real variable with value equal
to the log determinant of the covariance/general relationship matrix (NB:
This is the log determinant of the matrix K, not of the inverse K^-1; this
can generally be calculated as a ‘by-product’ during inversion).
This comprises a constant term in the (log) likelihood, i.e. any value can
be given (e.g. zero) if no comparisons between models are required.
Optionally, this can be followed (separated by space(s)) by the keyword
“DENSE”. If given, WOMBAT will store the elements of the general
relationship matrix in core, assuming it is dense, i.e. for n levels, an array
of size n(n+1)/2 is used. This can require substantial additional memory,
but reduces the overhead incurred by re-reading this matrix from disk for
every iteration, and may be advantageous if the matrix is (almost) dense,
such as the inverse of a genomic relationship matrix.
-
3.
- The file should then contain one line for each non-zero element in the
inverse. Each line is expected to contain three space-separated variables:
-
(a)
- An integer code for the ‘column’ number
-
(b)
- An integer code for the ‘row’ number
-
(c)
- A real variable specifying the element of the inverse
Here ‘row’ and ‘column’ numbers should range from 1 to N, where N is the
number of levels for the random effect.
Only the elements of the lower triangle of the inverse should be given and given
‘row-wise’, i.e. WOMBAT expects a ’column’ number which is less than or
equal to the ‘row’ number.
HINT: Calculations involved are more efficient if elements are given
in order (of the lower triangle)!
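The layout of a .gin file described above can be illustrated with a small helper that turns an inverse matrix, already computed by other means, into the lines of the file. This is a sketch under stated assumptions: the function name is hypothetical, the matrix is held as a list of lists, and the log determinant of K is supplied by the caller.

```python
def gin_lines(inverse, log_det_k, dense=False):
    """Format a symmetric inverse matrix (list of lists) as .gin file lines."""
    first = repr(float(log_det_k))  # log determinant of K, not of the inverse
    if dense:
        first += " DENSE"
    lines = [first]
    for row in range(len(inverse)):
        for col in range(row + 1):  # lower triangle only: column <= row
            value = inverse[row][col]
            if value != 0.0:
                # 1-based 'column' number, then 'row' number, then the element
                lines.append(f"{col + 1} {row + 1} {float(value)!r}")
    return lines
```

Because the loops run row by row through the lower triangle, the elements are written in the order the hint above recommends. For a random effect called mother, the lines would be written to mother.gin.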
6.6.1.1 Codes for GIN levels
By default, WOMBAT determines the number of levels for a random effect with
covariance option GIN from the data, renumbering them in ascending numerical
order. In some cases, however, we might want to fit additional levels, not represented
in the data. A typical example is an additional genetic effect, which can have levels
not in the data linked to those in the data through covariances arising from
co-ancestry.
If WOMBAT encounters row or column numbers greater than the number of random
effect levels found in the data, it will take the following action:
-
1.
- It is checked that this number does not exceed the maximum number
of random effects levels as specified in the parameter file. If it does,
WOMBAT stops (change parameter file if necessary).
-
2.
- WOMBAT looks for a file with the same name as the .gin file but
extension .codes; e.g. mother.codes for the random effect mother.
This file is expected to supply the codes for all levels of the random
effect: There has to be one line for each level with two space separated
integer variables, the running number (1st) and the code for the level
(2nd).
For an analysis using the run option --s1step where the user supplied
matrix represents the inverse of the joint relationship matrix between
genotyped and non-genotyped animals, the .codes file is required to have
a third column with the code 1 for non-genotyped and 2 or 3 for genotyped
individuals.
If the random effect represents an additive genetic effect and the model
of analysis fits respective, explicit group effects, pertaining information is
expected to be supplied from column 4 onwards; see subsection 4.17.10
for details.
-
3.
- If no such file is found, WOMBAT will look for a genetic effect (i.e. a
random effect with covariance option NRM) which has the same number of
levels as the current random effect. If found, it will simply copy the vector
of identities for that effect and proceed. (Hint: you may have to use run
option --noprune to utilise this feature).
-
4.
- Finally, if neither of these scenarios applies, WOMBAT will assume the
random levels are coded from 1 to N and try to proceed without any
further checking – this may cause problems!
6.6.1.2 Diagonal elements of GIN matrix
For runs which produce random effects solutions WITH standard errors, a file
containing the diagonal elements of the general relationship matrix K (NOT of the
inverse!) is recognised. This is expected to have the same name as the .gin file but
extension .hdiags or .gdiags. If found, these diagonal elements are used to
attempt computation of the accuracies corresponding to the prediction errors
computed. The file is expected to be formatted, with one line for each level of the
corresponding random effect. Each line should contain three space separated
variables:
-
1.
- The running number (integer)
-
2.
- The original (animal) code (integer)
-
3.
- The diagonal element (real). Missing values can be replaced by a value
of -9.0.
HINT: Output of such a file, with extension .hdiags, can be requested
when using WOMBAT with run option --hinv to build the joint inverse
relationship matrix for a single-step analysis, H^-1.
6.6.1.3 Elements of the general relationship matrix
Multivariate analyses estimating covariance matrices due to random effects with
covariance option GIN at reduced rank using the average information algorithm
require the product of the ‘original’ matrix, K, and a vector in each iterate. For such
analyses, WOMBAT allows for the following alternatives:
-
A)
- If a file with the same name as the .gin file but extension .matrix is
found, WOMBAT expects to read the elements of the lower triangle of the
symmetric matrix from this file.
-
B)
- If the original matrix is not available, it may be more convenient to supply
the Cholesky factor of the inverse K^-1 instead, in a file with the same name
as the .gin file but extension .chlsky. If this is given (and no .matrix file
is found), the required product is evaluated using two triangular solves in
each iterate.
-
C)
- If neither of these files is available, WOMBAT will attempt to carry out the
Cholesky factorisation if the GIN matrix is more than 75% dense. This is
done storing the matrix in full and using Lapack routines for symmetric,
positive definite matrices – if the matrix is large or not safely positive definite
this can cause problems. Otherwise, the program will stop.
To avoid setting up a .matrix or .chlsky file altogether, please disable
use of the average information algorithm by explicitly specifying one of the
other maximisation algorithms.
As for the .gin file, the .matrix or .chlsky file should be formatted, with one line
for each non-zero element containing three space-separated variables (but no line
corresponding to the determinant):
-
1.
- An integer code for the ‘column’ number
-
2.
- An integer code for the ‘row’ number
-
3.
- A real variable specifying the element of the matrix or Cholesky factor
HINT: Run option --hchol is provided to carry out a sparse Cholesky
factorisation as a separate step; see subsection 5.2.10.
6.6.2 Basis function file
If a regression on a user-defined set of basis functions has been chosen in the model
of analysis by specifying the code USR for a covariable (or ‘control’ variable in a RR
analysis), file(s) specifying the functions need to be supplied.
The form required for these files is:
-
1.
- The name of the file should be the name of the covariable (or ‘control’
variable), as given in the parameter file (model of analysis part), followed
by _USR, the number of coefficients, and the extension .baf.
EXAMPLE: If the model of analysis includes the effect
age and the maximum number of regression coefficients
for age is 7, the corresponding input file expected is
age_USR7.baf
N.B.: The file name does not include a trait number.
This implies, that for multivariate analyses the same basis
function is assumed to be used for a particular covariable
across all traits. The only differentiation allowed is that the
number of regression coefficients may be different (i.e. that
a subset of coefficients may be fitted for some traits); in
this case, the file supplied must correspond to the largest
number of coefficients specified.
-
2.
- There should be one row for each value of the covariable.
-
3.
- Rows should correspond to values of the covariable in ascending order.
-
4.
- The number of columns in the file must be equal to (or larger than) the number
of regression coefficients to be fitted (i.e. the order of fit) for the covariable.
-
5.
- The elements of the i-th row should be the user-defined functions evaluated for
the i-th value of the covariable.
EXAMPLE: Assume the covariable has possible values of 1, 3, 5, 7
and 9, and that we want to fit a cubic regression on ’ordinary’
polynomials, including the intercept. In this case, WOMBAT would
expect to find a file with 5 rows (corresponding to the 5 values of
the covariable) and 4 columns (corresponding to the 4
regression coefficients, i.e. intercept, linear, quadratic and
cubic):
1 1 1 1
1 3 9 27
1 5 25 125
1 7 49 343
1 9 81 729
Note that there is no leading column with the value of the covariable
(you can add it as the last column which is ignored by WOMBAT, if
you wish) – the association between value of covariable
and user defined function is made through the order of
records.
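The example above can be reproduced with a short script. It is a sketch only: the function names are hypothetical, and ordinary polynomials stand in for whatever user-defined functions are actually wanted.

```python
def polynomial_basis_rows(values, n_coefficients):
    """One row per covariable value (ascending), one column per coefficient."""
    rows = []
    for x in sorted(values):
        # Powers 0, 1, 2, ...: intercept, linear, quadratic, and so on.
        rows.append([x ** power for power in range(n_coefficients)])
    return rows


def baf_file_name(covariable, n_coefficients):
    """File name as described above: covariable name, _USR, number, .baf."""
    return f"{covariable}_USR{n_coefficients}.baf"


def write_baf(covariable, values, n_coefficients):
    """Write the basis function file into the current directory."""
    with open(baf_file_name(covariable, n_coefficients), "w") as out:
        for row in polynomial_basis_rows(values, n_coefficients):
            out.write(" ".join(str(v) for v in row) + "\n")
```

For covariable values 1, 3, 5, 7 and 9 with 4 coefficients, polynomial_basis_rows returns the five rows listed in the example.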
6.6.3 File with allele counts
For an analysis using the run option --snap, an additional input file is required
which supplies the counts for the reference allele for each QTL or SNP to be
considered. This has the default name SNPCounts.dat or SNPCountsR.dat,
depending on whether integer or real input is chosen. If both exist in the working
directory, WOMBAT will utilize the former and ignore the latter.
- SNPCounts.dat must be a formatted file with one row per QTL. Each
row should contain a single digit (usually 0, 1, 2) for all individuals
in the data, without any spaces between them! At present, there is no
provision for missing genotypes (these are readily imputed). In contrast
to most other input files used by WOMBAT, information is obtained in a
Fortran formatted read and a blank line is treated as a line of zeros. For
example, if there are 1000 individuals, each line should be 1000 characters
long. The number of SNPs processed is given by the number of records
(rows) in the file.
- SNPCountsR.dat accommodates the situation where – for some reason or
other – a format with space separated values is preferred. This removes
the restriction of a single digit. ‘Counts’ are read as real values, i.e. can
contain decimals. Values for a SNP can be spread over any number of
rows, but counts for each new SNP must begin on a new row.
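The two layouts can be contrasted with a pair of small formatting helpers. This is a sketch with hypothetical function names; it only builds the text of one row and leaves writing the file to the caller.

```python
def snp_count_line(counts):
    """One SNPCounts.dat row: a single digit per individual, no separators."""
    for c in counts:
        if not (isinstance(c, int) and 0 <= c <= 9):
            raise ValueError("each count must be a single digit")
    return "".join(str(c) for c in counts)


def snp_count_line_real(counts):
    """One SNPCountsR.dat row: space-separated values, decimals allowed."""
    return " ".join(str(float(c)) for c in counts)
```

With the first form, the length of each line equals the number of individuals, which gives an easy consistency check across SNPs.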
6.6.4 Files with results from part analyses
6.6.4.1 List of partial results
For a run with option --itsum or --pool, WOMBAT expects a number of files with
results from part analyses as input. Typically, these have been generated by
WOMBAT when carrying out these analyses; see subsection 7.2.6 for further
details.
For run option --pool, results can be given in a single file instead. For each part
analysis, this should contain the following information:
-
1.
- A line giving (space separated):
-
a)
- The number of traits in the part analysis
-
b)
- The (running) numbers of these traits in the full covariance matrix.
-
c)
- The relative weight to be given to this part; this can be omitted and,
if not given, is set to 1.
-
2.
- The elements of the upper triangle of the residual covariance matrix, given
row-wise.
-
3.
- For each random effect fitted, the elements of the upper triangle, given
row-wise. Each matrix must begin on a new line and the matrices must be given in
the same order as the corresponding VAR statements in the parameter
file.
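The layout for one part analysis in such a single file can be sketched as follows. The function names are hypothetical, and writing each matrix on a single line is just one arrangement that satisfies the rule that every matrix begins on a new line.

```python
def upper_triangle(matrix):
    """Elements of the upper triangle of a square matrix, given row-wise."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def pool_part_lines(trait_numbers, residual, random_effects, weight=None):
    """Lines describing one part analysis for a run with option --pool."""
    # Header: number of traits, their running numbers, optional weight.
    header = [len(trait_numbers), *trait_numbers]
    if weight is not None:
        header.append(weight)
    lines = [" ".join(str(v) for v in header)]
    lines.append(" ".join(str(v) for v in upper_triangle(residual)))
    # Random effects must follow the order of the VAR statements.
    for matrix in random_effects:
        lines.append(" ".join(str(v) for v in upper_triangle(matrix)))
    return lines
```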
6.6.5 ‘Utility’ files
WOMBAT will check for existence of other files with default names in the working
directory and, if they exist, acquire information from them.
6.6.5.1 File RunOptions
This file can be used as an alternative to the command line to specify run options
(see chapter 5).
It must have one line for each run option specified, e.g.
-v
--emalg
to specify a run with verbose output using the EM-algorithm.
6.6.5.2 File FileSynonyms
In some cases, WOMBAT expects input files with specific names. If files with
different default names have the same content, duplication can be avoided by setting
up a file FileSynonyms to ‘map’ specific files to a single input file. This file should
contain one line for each input file to be ‘mapped’ to another file. Each line should
give two file names (space separated) :
-
(a)
- The default name expected by WOMBAT.
-
(b)
- The name of the replacement file
EXAMPLE:
age.baf mybasefn.dat
damage.baf mybasefn.dat
[Not yet implemented !]
6.6.5.3 File RandomSeeds
To simulate data, WOMBAT requires two integer values to initialise the random
number generator. If the file RandomSeeds exists, it will attempt to read these
values from it. Both numbers can be specified on the same or different lines.
If the file does not exist in the working directory, or if an error reading is
encountered, initial numbers are instead derived from the date and time of
day.
WOMBAT writes out such a file in each simulation run, i.e. if RandomSeeds exists, it is
overwritten with a new pair of numbers!
6.6.6 File SubSetsList
For a run with option --itsum, WOMBAT expects to read a list of names
of files with results from subset analyses in a file with the standard name
SubSetsList. This is generated by WOMBAT (see subsection 7.3.9) if the part
analyses have been carried out using WOMBAT, but may need editing. In
particular, if a weighted summation is required, the default weights of ‘1.000’
need to be replaced ‘manually’ by appropriate values, selected by the user!
6.6.7 File(s) Pen*(.dat)
6.6.7.1 File PenTargetMatrix
For penalty options COVARM and CORREL a file with this name must be supplied
which gives the shrinkage target. This must be a positive definite matrix. The file
should be a plain text file and contain the elements of the upper triangle of the
matrix. It is read in ‘free’ format, i.e. variable numbers of elements per line are
allowed.
6.6.7.2 File PenBestPoints.dat
A run with the option --valid expects to read sets of estimates from a file with this
name. This is generated by WOMBAT when penalized estimation is specified, but
can be edited to suit or generated by other means. For each tuning factor, it should
contain:
-
(a)
- A line with the tuning factor (real variable) at the beginning
-
(b)
- The elements of the upper triangle of the estimate of the residual covariance matrix
(or equivalent) for this tuning factor. This is read in ‘free’ format, i.e. can
be given over as many lines as suitable.
-
(c)
- Starting on a new line: The elements of the upper triangle of the estimate of the
genetic covariance matrix (or equivalent) for this tuning factor. Again, this
is read in ‘free’ format.