Chapter 6
Input files for WOMBAT



6.1 Format

All input files supplied by the user are expected to be ’formatted’ (in a Fortran sense), i.e. should be plain text or ASCII files.

Non-standard characters may cause problems !

HINT: Take great care when switching from DOS/Windows to Linux or vice versa: Remember that these operating systems use different end-of-line coding - this means that you may have to ‘translate’ files using a utility like dos2unix (unix2dos) or fromdos (todos).
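Where dos2unix or a similar utility is not at hand, the same translation can be sketched in a few lines of Python. This is only an illustration and not part of WOMBAT; the function names are hypothetical.

```python
def to_unix_line_endings(raw: bytes) -> bytes:
    """Convert DOS/Windows (CRLF) and lone CR line endings to Unix (LF)."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def has_non_ascii(raw: bytes) -> bool:
    """Report whether any byte lies outside the 7-bit ASCII range."""
    return any(byte > 127 for byte in raw)
```

Reading the file in binary mode (`open(name, "rb")`), applying `to_unix_line_endings` and writing the result back in binary mode avoids any implicit newline translation by Python itself.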

6.2 Data File

The data file is mandatory. It gives the traits to be analysed, and all information on effects in the model of analysis. It is expected to have the following features:

1.
There is no ’default’ name for the data file. File names up to 30 characters long are accommodated.
2.
Variables in the data file should be in fixed width columns, separated by spaces.
3.
Each column, up to the maximum number of columns to be considered (= number of variables specified in the parameter file), must have a numerical value – even if this column is not used in the analysis, i.e. no ‘blank’ values !
4.
All codes of effects to be considered (fixed, random or ‘extra’ effects) must be positive integer variables, i.e. consist of a string of digits only.
The maximum value allowed for a code is 2147483647, i.e. just over 2 billion.
5.
All traits and covariables (including control variables) are read as real values, i.e. may contain digits, plus or minus signs, and Fortran type formatting directives only.

N.B.: Calculations in WOMBAT use an operational zero (default value: $10^{-8}$), treating all smaller values as zero. To avoid numerical problems, please ensure your traits are scaled so that their variances are in a moderate range (something like $10^{-5}$ to $10^{5}$).

6.
Any alphanumeric strings in the part of the data file to be read by WOMBAT are likely to produce errors !
7.
For multi-trait analyses, there should be one record for each trait recorded for an individual. The trait number for the record should be given in the first column.

No special codes for ’missing values’ are available – missing traits are simply absent records in the data file.

8.
The data file must be sorted in ascending order, according to :
i)
the individual (or ’subject’) for which traits are recorded, and
ii)
according to the trait number within individual.
iii)
For RR analyses, records are expected to be sorted according to the value of the control variable (within individual and trait number) in addition.

N.B.: WOMBAT does not allow ‘repeated’ records for individual points on the trajectory in RR analyses, i.e. you cannot have multiple observations for an individual with the same value of the control variable.

9.
For multivariate analyses combining traits with repeated and single records, the traits with repeated records need to have a lower trait number than those with single records only.

To facilitate annotation of the data file (e.g. column headers, date of creation, source), WOMBAT will skip lines with a ’#’ (hash sign) in column 1 at the beginning of the file - there is no limit on the number, n, of such lines, but they must represent the first n lines (any ’#’ elsewhere will cause an error).
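Several of the rules above can be checked before a run. The following is a minimal sketch, not a WOMBAT utility: it assumes a multi-trait layout with the trait number in column 1 and the individual's code in the column given by the hypothetical argument `animal_col` (zero-based), and it does not cover the additional ordering on the control variable required for RR analyses.

```python
MAX_CODE = 2147483647  # largest code value accepted, as stated above


def is_code(text):
    """True if text is a string of digits giving a positive integer code."""
    return text.isascii() and text.isdigit() and 0 < int(text) <= MAX_CODE


def check_data_lines(lines, n_cols, animal_col=1):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    previous_key = None
    in_header = True
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            if not in_header:
                problems.append(f"line {number}: '#' line after the first record")
            continue
        in_header = False
        fields = line.split()[:n_cols]
        if len(fields) < n_cols:
            problems.append(f"line {number}: fewer than {n_cols} columns")
            continue
        try:
            # Approximation: Python's float() is not identical to a Fortran read
            # (e.g. it rejects 'D' exponents and accepts 'nan').
            for text in fields:
                float(text)
        except ValueError:
            problems.append(f"line {number}: non-numeric value")
            continue
        if not (is_code(fields[0]) and is_code(fields[animal_col])):
            problems.append(f"line {number}: trait number or code is not a positive integer")
            continue
        key = (int(fields[animal_col]), int(fields[0]))  # individual, then trait
        if previous_key is not None and key < previous_key:
            problems.append(f"line {number}: not sorted by individual, then trait")
        previous_key = key
    return problems
```

Calling `check_data_lines(open("mydata.dat").read().splitlines(), n_cols=5)` (file name and column count are placeholders) returns a list of messages that can be printed or logged.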

6.3 Pedigree File

If the model of analysis contains random effect(s) which are assumed to be distributed proportional to the numerator relationship matrix, a pedigree file is required. It is expected to have the following features:

1.
There is no ’default’ name for the pedigree file. File names up to 30 characters long are accommodated.
2.
The pedigree file must contain one line for each animal in the data. Additional lines with pedigree information for parents without records themselves can be included.
3.
Each line is expected to contain (at least) three integer variables :
(a)
the animal code,
(b)
the code for the animal’s sire,
(c)
and the code for the animal’s dam.

All codes must be valid integers in the range of 0 to 2147483647.

For analyses distinguishing between genotyped and non-genotyped individuals, a fourth column needs to be supplied containing a “1” for non-genotyped and “2” or “3” for genotyped individuals.

If genetic groups are to be fitted, relevant information is to be given from column 5 onwards; see subsection 4.17.10 for details.

Additional, optional variables in the fourth or fifth column can be:

(d)
the animal’s inbreeding coefficient (real variable between 0 and 1),
(e)
a code of 1 (males) or 2 (females) (integer), defining the number of X chromosomes, if a corresponding relationship matrix is to be set up.

Note that these relate to older, seldom used model options which are incompatible with genotype codes or genetic group information.

4.
All animals must have a numerically higher code than either of their parents.
Unknown parents are to be coded as “0”.
5.
If maternal genetic effects are to be fitted in the model of analysis, all dams of animals in the data must be ‘known’, i.e. have codes > 0.
6.
The pedigree file does not need to be sorted. However, sorting according to animal code (column 1) (in ascending order) is desirable and highly recommended as it will reduce processing time, especially for large numbers of individuals.

As for the data file, any lines at the beginning of the pedigree file with a ’#’ (hash sign) in column 1 are ignored.
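The basic pedigree rules (three integer codes per line, codes within range, parents numbered lower than their progeny) lend themselves to a quick check. The sketch below is illustrative only; the function name is hypothetical and the optional columns 4 onwards are ignored.

```python
MAX_CODE = 2147483647  # upper limit for codes, as stated above


def check_pedigree_lines(lines):
    """Return a list of problems found in the lines of a pedigree file."""
    problems = []
    in_header = True
    for number, line in enumerate(lines, start=1):
        if in_header and line.startswith("#"):
            continue  # leading comment lines are skipped
        in_header = False
        try:
            # Raises ValueError for non-integer text or fewer than three fields.
            animal, sire, dam = (int(text) for text in line.split()[:3])
        except ValueError:
            problems.append(f"line {number}: expected three integer codes")
            continue
        if not all(0 <= code <= MAX_CODE for code in (animal, sire, dam)):
            problems.append(f"line {number}: code outside 0 to {MAX_CODE}")
        elif animal <= sire or animal <= dam:
            problems.append(f"line {number}: animal code not higher than parent codes")
    return problems
```

If maternal genetic effects are fitted, an additional pass checking that the dam code is greater than 0 for all animals in the data would be needed; this is not included here.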

6.4 Marker counts file

With the growing prevalence of genomic information and the need for mixed model analyses of such data, capabilities of WOMBAT have recently been extended to perform some of the tasks required (see run options). These require a file of allele counts for markers (or SNPs).

1.
The ’default’ name for this input file is MarkerCounts.dat. Alternative names (up to 30 characters long) can be specified in the parameter file (see section 4.8).
2.
The marker counts file must contain at least one line for each genotyped animal.
3.
Information required for each animal, to be read consecutively, comprises:
(a)
the animal code (as in data or pedigree file) at the beginning of a NEW line.
(b)
the marker counts, typically 0, 1 and 2 (though real variables can be accommodated for some forms of input; see below) for exactly m markers, with m as specified in the parameter file. These may be on the same line or extend over continuation lines.
4.
There are different form options for MarkerCounts.dat or equivalent:
(a)
The default is a formatted file with space-separated variables. This is expected to have the extension .dat in the filename. For this option marker counts are read as 4 Byte real variables.
(b)
Alternatively, a file name with extension .BIN or .BI1 is read as a binary file. In this case, marker allele counts are expected to have been written out as integer*1 variables (1 Byte long), while the animal codes are read as integer variables of standard length (4 or 8 Bytes).
(c)
An extension of .BR4 specifies a binary input file where allele counts have been written out as 4 Byte real variables.
5.
NB: WOMBAT does absolutely NO checking of the contents of this file - missing counts or mono-morphic markers may create problems!
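Since no checking is done on this file, it may be worth screening the counts before writing the default, formatted form. The sketch below covers only the space-separated .dat option; the function names are hypothetical and the binary layouts are not addressed.

```python
def format_marker_line(animal, counts, m):
    """Build one line: animal code followed by exactly m marker counts."""
    if len(counts) != m:
        raise ValueError(f"animal {animal}: expected {m} counts, got {len(counts)}")
    return " ".join([str(animal)] + [str(count) for count in counts])


def monomorphic_markers(count_rows):
    """Return zero-based indices of markers with the same count in all animals."""
    return [index for index, column in enumerate(zip(*count_rows))
            if len(set(column)) == 1]
```

For example, `monomorphic_markers([[0, 1, 2], [0, 2, 2]])` flags the first and third marker, which could then be dropped before the file is written.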

6.5 Parameter File

WOMBAT acquires all information on the model of analysis from a parameter file.

Rules to set up the parameter file are complex, and are described in detail in a separate chapter (chapter 4).

6.6 Other Files

Depending on the model of analysis chosen, additional input files may be required.

6.6.1 General inverse file

For each random effect fitted for which the covariance option GIN (see subsection 4.10.4) has been specified, WOMBAT expects a file set up by the user which contains the inverse of the matrix K (such as relationship or correlation matrix) which determines the ‘structure’ of the covariance matrix for the random effect. The following rules apply :

1.
The file name should be equal to the name of the random effect, with the extension .gin. For example, mother.gin for a random effect called mother.
For random effect names containing additional information in round brackets, for instance in RR analysis, only the part preceding the ‘(’ should be used. In this case, be careful to name the effects in the model so that no ambiguities arise!
2.
The first line of the file should contain a real variable with value equal to the log determinant of the covariance/general relationship matrix (NB: This is the log determinant of the matrix K, not of the inverse $K^{-1}$; this can generally be calculated as a ‘by-product’ during inversion).
This comprises a constant term in the (log) likelihood, i.e. any value can be given (e.g. zero) if no comparisons between models are required.
Optionally, this can be followed (separated by space(s)) by the keyword “DENSE”. If given, WOMBAT will store the elements of the general relationship matrix in core, assuming it is dense, i.e. for n levels, an array of size $n(n+1)/2$ is used. This can require substantial additional memory, but reduces the overhead incurred by re-reading this matrix from disk for every iteration, and may be advantageous if the matrix is (almost) dense, such as the inverse of a genomic relationship matrix.
3.
The file should then contain one line for each non-zero element in the inverse. Each line is expected to contain three space-separated variables :
(a)
An integer code for the ‘column’ number
(b)
An integer code for the ‘row’ number
(c)
A real variable specifying the element of the inverse

Here ‘row’ and ‘column’ numbers should range from 1 to N, where N is the number of levels for the random effect.
Only the elements of the lower triangle of the inverse should be given and given ‘row-wise’, i.e. WOMBAT expects a ’column’ number which is less than or equal to the ‘row’ number.

HINT: Calculations involved are more efficient if elements are given in order (of the lower triangle)!
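The layout of a .gin file described by the rules above can be sketched as follows. This is an illustration only: it assumes the inverse is already available as a full symmetric matrix (a list of lists) and that the log determinant of K has been computed elsewhere; the function names are hypothetical.

```python
def gin_lines(inverse, log_det_k, dense=False):
    """Return the lines of a .gin file for the given inverse of K.

    The first line holds the log determinant of K (not of the inverse),
    optionally followed by the keyword DENSE. Each further line gives
    column number, row number and value for one non-zero element of the
    lower triangle, in row-wise order.
    """
    header = repr(float(log_det_k))
    if dense:
        header += " DENSE"
    lines = [header]
    for row in range(len(inverse)):
        for col in range(row + 1):  # lower triangle: column <= row
            value = inverse[row][col]
            if value != 0.0:
                lines.append(f"{col + 1} {row + 1} {float(value)!r}")
    return lines


def dense_storage_elements(n_levels):
    """Number of elements held in core when DENSE is given: n(n+1)/2."""
    return n_levels * (n_levels + 1) // 2
```

Writing the result with `"\n".join(gin_lines(...))` to a file such as mother.gin gives elements in the order recommended in the hint above.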

6.6.1.1 Codes for GIN levels

By default, WOMBAT determines the number of levels for a random effect with covariance option GIN from the data, renumbering them in ascending numerical order. In some cases, however, we might want to fit additional levels, not represented in the data. A typical example is an additional genetic effect, which can have levels not in the data linked to those in the data through covariances arising from co-ancestry.

If WOMBAT encounters row or column numbers greater than the number of random effect levels found in the data, it will take the following action:

1.
It is checked that this number does not exceed the maximum number of random effects levels as specified in the parameter file. If it does, WOMBAT stops (change parameter file if necessary).
2.
WOMBAT looks for a file with the same name as the .gin file but extension .codes; e.g. mother.codes for the random effect mother. This file is expected to supply the codes for all levels of the random effect: There has to be one line for each level with two space separated integer variables, the running number (1st) and the code for the level (2nd).

For an analysis using the run option --s1step where the user supplied matrix represents the inverse of the joint relationship matrix between genotyped and non-genotyped animals, the .codes file is required to have a third column with the code 1 for non-genotyped and 2 or 3 for genotyped individuals.

If the random effect represents an additive genetic effect and the model of analysis fits respective, explicit group effects, pertaining information is expected to be supplied from column 4 onwards; see subsection 4.17.10 for details.

3.
If no such file is found, WOMBAT will look for a genetic effect (i.e. a random effect with covariance option NRM) which has the same number of levels as the current random effect. If found, it will simply copy the vector of identities for that effect and proceed. (Hint: you may have to use run option --noprune to utilise this feature).
4.
Finally, if neither of these scenarios applies, WOMBAT will assume the random levels are coded from 1 to N and try to proceed without any further checking – this may cause problems!
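To avoid relying on the fall-back behaviour, a .codes file can be generated alongside the .gin file. The sketch below assumes the level codes are supplied in the order used for the rows of the inverse; the function name and the optional `genotype_status` mapping (code to 1, 2 or 3, for the third column used with --s1step) are hypothetical.

```python
def codes_lines(level_codes, genotype_status=None):
    """Return lines with running number and level code, plus an optional third column."""
    lines = []
    for running_number, code in enumerate(level_codes, start=1):
        fields = [str(running_number), str(code)]
        if genotype_status is not None:
            fields.append(str(genotype_status[code]))
        lines.append(" ".join(fields))
    return lines
```

For a random effect called mother, these lines would be written to mother.codes.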

6.6.1.2 Diagonal elements of GIN matrix

For runs which produce random effects solutions WITH standard errors, a file containing the diagonal elements of the general relationship matrix K (NOT of the inverse!) is recognised. This is expected to have the same name as the .gin file but extension .hdiags or .gdiags. If found, these diagonal elements are used to attempt computation of the accuracies corresponding to the prediction errors computed. The file is expected to be formatted, with one line for each level of the corresponding random effect. Each line should contain three space separated variables:

1.
The running number (integer)
2.
The original (animal) code (integer)
3.
The diagonal element (real). Missing values can be replaced by a value of -9.0.

HINT: Output of such a file (with extension .hdiags) can be requested when using WOMBAT with run option --hinv to build the joint inverse relationship matrix for a single-step analysis, $H^{-1}$.

6.6.1.3 Elements of the general relationship matrix

Multivariate analyses estimating covariance matrices due to random effects with covariance option GIN at reduced rank using the average information algorithm require the product of the ‘original’ matrix, K and a vector in each iterate. For such analyses, WOMBAT allows for the following alternatives:

A)
If a file with the same name as the .gin file but extension .matrix is found, WOMBAT expects to read the elements of the lower triangle of the symmetric matrix from this file.
B)
If the original matrix is not available, it may be more convenient to supply the Cholesky factor of the inverse $K^{-1}$ instead, in a file with the same name as the .gin file but extension .chlsky. If this is given (and no .matrix file is found), the required product is evaluated using two triangular solves in each iterate.
C)
If neither of these files is available, WOMBAT will attempt to carry out the Cholesky factorisation if the GIN matrix is more than 75% dense. This is done storing the matrix in full and using Lapack routines for symmetric, positive definite matrices – if the matrix is large or not safely positive definite this can cause problems. Otherwise, the program will stop.
To avoid setting up a .matrix or .chlsky file altogether, please disable use of the average information algorithm by explicitly specifying one of the other maximisation algorithms.

As for the .gin file, the .matrix or .chlsky file should be formatted, with one line for each non-zero element containing three space-separated variables (but no line corresponding to the determinant):

1.
An integer code for the ‘column’ number
2.
An integer code for the ‘row’ number
3.
A real variable specifying the element of the matrix or Cholesky factor

HINT: Run option --hchol is provided to carry out a sparse Cholesky factorisation as a separate step; see subsection 5.2.10.

6.6.2 Basis function file

If a regression on a user-defined set of basis functions has been chosen in the model of analysis by specifying the code USR for a covariable (or ‘control’ variable in a RR analysis), file(s) specifying the functions need to be supplied.

The form required for these files is:

1.
The name of the file should be the name of the covariable (or ‘control’ variable), as given in the parameter file (model of analysis part), followed by _USR, the number of coefficients, and the extension .baf.

EXAMPLE: If the model of analysis includes the effect age and the maximum number of regression coefficients for age is 7, the corresponding input file expected is age_USR7.baf

N.B.: The file name does not include a trait number.

This implies that for multivariate analyses the same basis function is assumed to be used for a particular covariable across all traits. The only differentiation allowed is that the number of regression coefficients may be different (i.e. that a subset of coefficients may be fitted for some traits); in this case, the file supplied must correspond to the largest number of coefficients specified.

2.
There should be one row for each value of the covariable.
3.
Rows should correspond to values of the covariable in ascending order.
4.
The number of columns in the file must be equal to (or larger than) the number of regression coefficients to be fitted (i.e. the order of fit) for the covariable.
5.
The elements of the ith row should be the user-defined functions evaluated for the ith value of the covariable.

EXAMPLE: Assume the covariable has possible values of 1, 3, 5, 7 and 9, and that we want to fit a cubic regression on ’ordinary’ polynomials, including the intercept. In this case, WOMBAT would expect to find a file with 5 rows (corresponding to the 5 values of the covariable) and 4 columns (corresponding to the 4 regression coefficients, i.e. intercept, linear, quadratic and cubic):

  1   1   1    1 
  1   3   9   27 
  1   5  25  125 
  1   7  49  343 
  1   9  81  729

Note that there is no leading column with the value of the covariable (you can add it as the last column which is ignored by WOMBAT, if you wish) – the association between value of covariable and user defined function is made through the order of records.
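The file name rule and the polynomial example above can be reproduced with a short script. This is a sketch for ‘ordinary’ polynomials only; the function names are hypothetical, and any other set of user-defined functions would replace the power expression.

```python
def baf_filename(covariable, n_coefficients):
    """Name expected for the basis function file, e.g. age_USR7.baf."""
    return f"{covariable}_USR{n_coefficients}.baf"


def polynomial_basis_rows(values, n_coefficients):
    """One row per distinct covariable value, in ascending order.

    Column j holds value**j, so column 0 is the intercept.
    """
    return [[value ** power for power in range(n_coefficients)]
            for value in sorted(set(values))]


def format_rows(rows):
    """Render rows as space-separated lines ready to be written to the file."""
    return [" ".join(str(element) for element in row) for row in rows]
```

With values 1, 3, 5, 7 and 9 and four coefficients, `format_rows(polynomial_basis_rows([1, 3, 5, 7, 9], 4))` yields the five rows shown in the example, apart from column alignment.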

6.6.3 File with allele counts

For an analysis using the run option --snap, an additional input file is required which supplies the counts for the reference allele for each QTL or SNP to be considered. This has the default name SNPCounts.dat or SNPCountsR.dat, depending on whether integer or real input is chosen. If both exist in the working directory, WOMBAT will utilize the former and ignore the latter.

6.6.4 Files with results from part analyses

6.6.4.1 List of partial results

For a run with option --itsum or --pool, WOMBAT expects a number of files with results from part analyses as input. Typically, these have been generated by WOMBAT when carrying out these analyses; see subsection 7.2.6 for further details.

6.6.4.2 Single, user generated input file

For run option --pool, results can be given in a single file instead. For each part analysis, this should contain the following information:

1.
A line giving (space separated):
a)
The number of traits in the part analysis
b)
The (running) numbers of these traits in the full covariance matrix.
c)
The relative weight to be given to this part; this can be omitted and, if not given, is set to 1.
2.
The elements of the upper triangle of the residual covariance matrix, given row-wise.
3.
For each random effect fitted, the elements of the upper triangle, given row-wise. Each matrix must begin on a new line and the matrices must be given in the same order as the corresponding VAR statements in the parameter file.
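The block for one part analysis could be assembled as sketched below. This is an illustration under assumptions: matrices are passed as full symmetric lists of lists, each upper triangle is written on a single line, and the function names are hypothetical.

```python
def upper_triangle(matrix):
    """Elements of the upper triangle (including the diagonal), row-wise."""
    size = len(matrix)
    return [matrix[row][col] for row in range(size) for col in range(row, size)]


def part_analysis_lines(trait_numbers, residual, random_effects, weight=None):
    """Lines describing one part analysis for a single-file --pool input.

    trait_numbers: running numbers of the traits in the full covariance matrix.
    random_effects: matrices in the order of the VAR statements in the parameter file.
    weight: optional relative weight; omitted from the header line if None.
    """
    header = [str(len(trait_numbers))] + [str(number) for number in trait_numbers]
    if weight is not None:
        header.append(str(weight))
    lines = [" ".join(header)]
    for matrix in [residual] + list(random_effects):
        lines.append(" ".join(str(value) for value in upper_triangle(matrix)))
    return lines
```

Concatenating the lines returned for each part analysis, in turn, gives the contents of the single input file.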

6.6.5 ‘Utility’ files

WOMBAT will check for existence of other files with default names in the working directory and, if they exist, acquire information from them.

6.6.5.1 File RunOptions

This file can be used as an alternative to the command line to specify run options (see chapter 5).
It must have one line for each run option specified, e.g.
   -v
   --emalg
to specify a run with verbose output using the EM-algorithm.

6.6.5.2 File FileSynonyms

In some cases, WOMBAT expects input files with specific names. If files with different default names have the same content, duplication can be avoided by setting up a file FileSynonyms to ‘map’ specific files to a single input file. This file should contain one line for each input file to be ‘mapped’ to another file. Each line should give two file names (space separated) :

(a)
The default name expected by WOMBAT.
(b)
The name of the replacement file

EXAMPLE:

age.baf      mybasefn.dat 
damage.baf   mybasefn.dat

[Not yet implemented !]

6.6.5.3 File RandomSeeds

To simulate data, WOMBAT requires two integer values to initialise the random number generator. If the file RandomSeeds exists, it will attempt to read these values from it. Both numbers can be specified on the same or different lines. If the file does not exist in the working directory, or if an error reading is encountered, initial numbers are instead derived from the date and time of day.

WOMBAT writes out such file in each simulation run, i.e. if RandomSeeds exists, it is overwritten with a new pair of numbers !

6.6.6 File SubSetsList

For a run with option --itsum, WOMBAT expects to read a list of names of files with results from subset analyses in a file with the standard name SubSetsList. This is generated by WOMBAT (see subsection 7.3.9) if the part analyses have been carried out using WOMBAT, but may need editing. In particular, if a weighted summation is required, the default weights of ‘1.000’ need to be replaced ‘manually’ by appropriate values selected by the user!

6.6.7 File(s) Pen*(.dat)

6.6.7.1 File PenTargetMatrix

For penalty options COVARM and CORREL a file with this name must be supplied which gives the shrinkage target. This must be a positive definite matrix. The file should be a plain text file and contain the elements of the upper triangle of the matrix. It is read in ‘free’ format, i.e. variable numbers of elements per line are allowed.
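Because the target must be positive definite, it can help to verify this before writing the file. The sketch below uses a plain Cholesky factorisation as the test (it completes only for positive definite matrices) and writes one row of the upper triangle per line, which the ‘free’ format permits; the function names are hypothetical.

```python
import math


def is_positive_definite(matrix):
    """Attempt a Cholesky factorisation; return False if a pivot is not positive."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for row in range(size):
        for col in range(row + 1):
            remainder = matrix[row][col] - sum(
                lower[row][k] * lower[col][k] for k in range(col))
            if row == col:
                if remainder <= 0.0:
                    return False
                lower[row][col] = math.sqrt(remainder)
            else:
                lower[row][col] = remainder / lower[col][col]
    return True


def target_matrix_lines(matrix):
    """Upper triangle of the target, one matrix row per line."""
    if not is_positive_definite(matrix):
        raise ValueError("shrinkage target is not positive definite")
    size = len(matrix)
    return [" ".join(str(matrix[row][col]) for col in range(row, size))
            for row in range(size)]
```

Note that this test works in exact floating point terms and applies no tolerance, so a matrix that is only barely positive definite may pass here and still cause numerical difficulties.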

6.6.7.2 File PenBestPoints.dat

A run with the option --valid expects to read sets of estimates from a file with this name. This is generated by WOMBAT when penalized estimation is specified, but can be edited to suit or generated by other means. For each tuning factor, it should contain:

(a)
A line with the tuning factor (real variable) at the beginning
(b)
The elements of the upper triangle of the estimate of the residual covariance matrix (or equivalent) for this tuning factor. This is read in ‘free’ format, i.e. can be given over as many lines as suitable.
(c)
Starting on a new line: The elements of the upper triangle of the estimate of the genetic covariance matrix (or equivalent) for this tuning factor. Again, this is read in ‘free’ format.