6 Input ﬁles for WOMBAT

6.1 Format

All input files supplied by the user are expected to be ‘formatted’ (in the Fortran sense), i.e. they should be plain text (ASCII) files.

Non-standard characters may cause problems!

HINT: Take great care when switching from DOS/Windows to Linux or vice versa: Remember that these operating systems use diﬀerent end-of-line coding - this means that you may have to ‘translate’ ﬁles using a utility like dos2unix (unix2dos) or fromdos (todos).
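As a rough illustration of the two points above, the following Python sketch reports non-ASCII bytes and DOS-style line endings in the contents of a file, and converts the latter. The function names are made up for this example; this is not a WOMBAT utility, and dos2unix or fromdos do the same job.

```python
def check_plain_text(raw: bytes) -> dict:
    """Report non-ASCII bytes and DOS-style line endings in file contents."""
    non_ascii = [i for i, b in enumerate(raw) if b > 127]
    return {
        "non_ascii_offsets": non_ascii,      # byte positions outside ASCII
        "crlf_lines": raw.count(b"\r\n"),    # number of DOS/Windows line ends
    }


def to_unix(raw: bytes) -> bytes:
    """Convert DOS/Windows (CRLF) line endings to Unix (LF)."""
    return raw.replace(b"\r\n", b"\n")
```

The contents would typically be obtained with open(name, "rb").read(), so that the bytes are inspected exactly as stored on disk.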

6.2 Data File

The data ﬁle is mandatory. It gives the traits to be analysed, and all information on eﬀects in the model of analysis. It is expected to have the following features:

1. There is no ‘default’ name for the data file. File names up to 30 characters long are accommodated.

2. Variables in the data file should be in fixed width columns, separated by spaces.

3. Each column, up to the maximum number of columns to be considered (= number of variables specified in the parameter file), must have a numerical value, even if this column is not used in the analysis, i.e. no ‘blank’ values!

4. All codes of effects to be considered (fixed, random or ‘extra’ effects) must be positive integer variables, i.e. consist of a string of digits only.
The maximum value allowed for a code is 2147483647, i.e. just over 2 billion.

5. All traits and covariables (including control variables) are read as real values, i.e. may contain digits, plus or minus signs, and Fortran type formatting directives only.

N.B.: Calculations in WOMBAT use an operational zero (default value: 10⁻⁸), treating all smaller values as zero. To avoid numerical problems, please ensure your traits are scaled so that their variances are in a moderate range (something like 10⁻⁵ to 10⁵).

6. Any alphanumeric strings in the part of the data file to be read by WOMBAT are likely to produce errors!

7. For multi-trait analyses, there should be one record for each trait recorded for an individual. The trait number for the record should be given in the first column.

No special codes for ‘missing values’ are available: missing traits are simply absent records in the data file.

8. The data file must be sorted in ascending order, according to:

i): the individual (or ‘subject’) for which traits are recorded, and
ii): the trait number within individual.
iii): For RR analyses, records are additionally expected to be sorted according to the value of the control variable (within individual and trait number).

N.B.: WOMBAT does not allow ‘repeated’ records for individual points on the trajectory in RR analyses, i.e. you cannot have multiple observations for an individual with the same value of the control variable.

9. For multivariate analyses combining traits with repeated and single records, the traits with repeated records need to have a lower trait number than those with single records only.

To facilitate annotation of the data ﬁle (e.g. column headers, date of creation, source), WOMBAT will skip lines with a ’#’ (hash sign) in column 1 at the beginning of the ﬁle - there is no limit on the number, n, of such lines, but they must represent the ﬁrst n lines (any ’#’ elsewhere will cause an error).
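Several of these requirements can be checked before running an analysis. The Python sketch below is one possible pre-check, not part of WOMBAT: it allows ‘#’ lines only at the top, requires the first n_cols fields of each record to be numeric, and checks the sort order by individual and trait number. The function name and the argument subject_col (which column holds the individual's code) are assumptions made for this example, since the actual column layout is defined in the parameter file. Python's float() only approximates what a Fortran read accepts.

```python
def check_data_lines(lines, n_cols, subject_col, trait_col=0):
    """Return a list of (line number, message) problems found.

    n_cols      -- number of variables WOMBAT is told to read
    subject_col -- 0-based column holding the individual's code (assumed)
    trait_col   -- 0-based column holding the trait number (first column)
    """
    problems = []
    previous_key = None
    in_header = True
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            # comment lines are only allowed as the first n lines
            if not in_header:
                problems.append((number, "'#' line after the first record"))
            continue
        in_header = False
        fields = line.split()
        if len(fields) < n_cols:
            problems.append((number, "fewer columns than expected"))
            continue
        try:
            values = [float(f) for f in fields[:n_cols]]
        except ValueError:
            problems.append((number, "non-numeric value"))
            continue
        # ascending order: individual first, then trait number within it
        key = (values[subject_col], values[trait_col])
        if previous_key is not None and key < previous_key:
            problems.append((number, "records not in ascending order"))
        previous_key = key
    return problems
```

An empty list means none of the checked conditions was violated; it does not guarantee that WOMBAT will accept the file.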

6.3 Pedigree File

If the model of analysis contains random eﬀect(s) which are assumed to be distributed proportional to the numerator relationship matrix, a pedigree ﬁle is required. It is expected to have the following features:

1. There is no ‘default’ name for the pedigree file. File names up to 30 characters long are accommodated.

2. The pedigree file must contain one line for each animal in the data. Additional lines with pedigree information for parents without records themselves can be included.

3. Each line is expected to contain (at least) three integer variables:

(a): the animal code,
(b): the code for the animal’s sire, and
(c): the code for the animal’s dam.

All codes must be valid integers in the range of 0 to 2147483647.

For analyses distinguishing between genotyped and non-genotyped individuals, a fourth column needs to be supplied containing a “1” for non-genotyped and “2” or “3” for genotyped individuals.

If genetic groups are to be fitted, relevant information is to be given from column 5 onwards; see subsection 4.17.10 for details.

Additional, optional variables in the fourth or fifth column can be:

(d): the animal’s inbreeding coefficient (real variable between 0 and 1),
(e): a code of 1 (males) or 2 (females) (integer), defining the number of X chromosomes, if a corresponding relationship matrix is to be set up.

Note that these relate to older, seldom used model options which are incompatible with genotype codes or genetic group information.

4. All animals must have a numerically higher code than either of their parents.
Unknown parents are to be coded as “0”.

5. If maternal genetic effects are to be fitted in the model of analysis, all dams of animals in the data must be ‘known’, i.e. have codes > 0.

6. The pedigree file does not need to be sorted. However, sorting according to animal code (column 1) in ascending order is highly recommended as it will reduce processing time, especially for large numbers of individuals.

As for the data ﬁle, any lines at the beginning of the pedigree ﬁle with a ’#’ (hash sign) in column 1 are ignored.
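The basic rules for the first three columns lend themselves to a simple pre-check. The following Python sketch (a hypothetical helper, not something supplied with WOMBAT) verifies that each line starts with three integer codes in the permitted range and that every animal has a higher code than both of its parents. It does not look at optional further columns.

```python
MAX_CODE = 2147483647  # largest code allowed


def check_pedigree(lines):
    """Return a list of (line number, message) problems in pedigree lines."""
    problems = []
    in_header = True
    for number, line in enumerate(lines, start=1):
        if in_header and line.startswith("#"):
            continue  # leading '#' lines are ignored
        in_header = False
        try:
            # raises ValueError for non-integers or fewer than three fields
            animal, sire, dam = (int(f) for f in line.split()[:3])
        except ValueError:
            problems.append((number, "need three integer codes"))
            continue
        if not all(0 <= code <= MAX_CODE for code in (animal, sire, dam)):
            problems.append((number, "code out of range"))
        if animal <= max(sire, dam):
            problems.append((number, "animal code not higher than parents"))
    return problems
```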

6.4 Marker counts ﬁle

With the growing prevalence of genomic information and need for mixed model analyses of such data, capabilities of WOMBAT have recently been extended to perform some of the tasks required (see the corresponding run options). These require a file of allele counts for markers (or SNPs).

1. The ‘default’ name for this input file is MarkerCounts.dat. Alternative names (up to 30 characters long) can be specified in the parameter file (see section 4.8).

2. The marker counts file must contain at least one line for each genotyped animal.

3. The information required for each animal, to be read consecutively, comprises:

(a): the animal code (as in data or pedigree file) at the beginning of a NEW line.
(b): the marker counts, typically 0, 1 and 2 (though real variables can be accommodated for some forms of input; see below) for exactly m markers, with m as specified in the parameter file. These may be on the same line or extend over continuation lines.

4. There are different form options for MarkerCounts.dat or equivalent:

(a): The default is a formatted file with space-separated variables. This is expected to have the extension .dat in the file name. For this option marker counts are read as 4 Byte real variables.
(b): Alternatively, a file name with extension .BIN or .BI1 is read as a binary file. In this case, marker allele counts are expected to have been written out as integer*1 variables (1 Byte long), while the animal codes are read as integer variables of standard length (4 or 8 Bytes).
(c): An extension of .BR4 specifies a binary input file where allele counts have been written out as 4 Byte real variables.

5. NB: WOMBAT does absolutely NO checking of the contents of this file - missing counts or mono-morphic markers may create problems!
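Since no checking is done by WOMBAT, it may be worth screening the file beforehand. The Python sketch below is an illustration for the default, space-separated form only: it reads an animal code followed by exactly m counts (which may run over continuation lines) and lists markers that show the same count in all animals. The function names are invented for this example, and reading token by token is a simplification which does not verify that each animal code starts a new line.

```python
def read_marker_counts(text, m):
    """Parse space-separated marker counts: animal code, then exactly m counts.

    Counts may continue over following lines. Returns {animal code: [counts]}.
    """
    tokens = text.split()
    if len(tokens) % (m + 1) != 0:
        raise ValueError("file does not hold a whole number of animals")
    counts = {}
    for start in range(0, len(tokens), m + 1):
        animal = int(tokens[start])
        counts[animal] = [float(t) for t in tokens[start + 1:start + 1 + m]]
    return counts


def monomorphic_markers(counts):
    """Return 1-based indices of markers with the same count in all animals."""
    columns = zip(*counts.values())
    return [j for j, col in enumerate(columns, start=1) if len(set(col)) == 1]
```

A missing count shows up as a token total that is not a multiple of m + 1, which is why the reader refuses such input.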

6.5 Parameter File

WOMBAT acquires all information on the model of analysis from a parameter ﬁle.

Rules to set up the parameter ﬁle are complex, and are described in detail in a separate chapter (chapter 4).

6.6 Other Files

Depending on the model of analysis chosen, additional input ﬁles may be required.

6.6.1 General inverse ﬁle

For each random eﬀect ﬁtted for which the covariance option GIN (see subsection 4.10.4) has been speciﬁed, WOMBAT expects a ﬁle set up by the user which contains the inverse of the matrix K (such as relationship or correlation matrix) which determines the ‘structure’ of the covariance matrix for the random eﬀect. The following rules apply :

1. The file name should be equal to the name of the random effect, with the extension .gin. For example, mother.gin for a random effect called mother.
For random effect names containing additional information in round brackets, for instance in RR analysis, only the part preceding the ‘(’ should be used. In this case, be careful to name the effects in the model so that no ambiguities arise!

2. The first line of the file should contain a real variable with value equal to the log determinant of the covariance/general relationship matrix (NB: This is the log determinant of the matrix K, not of the inverse K⁻¹; this can generally be calculated as a ‘by-product’ during inversion).
This comprises a constant term in the (log) likelihood, i.e. any value can be given (e.g. zero) if no comparisons between models are required.
Optionally, this can be followed (separated by space(s)) by the keyword “DENSE”. If given, WOMBAT will store the elements of the general relationship matrix in core, assuming it is dense, i.e. for n levels, an array of size n(n+1)∕2 is used. This can require substantial additional memory, but reduces the overhead incurred by re-reading this matrix from disk for every iteration, and may be advantageous if the matrix is (almost) dense, such as the inverse of a genomic relationship matrix.

3. The file should then contain one line for each non-zero element in the inverse. Each line is expected to contain three space-separated variables:

(a): An integer code for the ‘column’ number
(b): An integer code for the ‘row’ number
(c): A real variable specifying the element of the inverse

Here ‘row’ and ‘column’ numbers should range from 1 to N, where N is the number of levels for the random effect.
Only the elements of the lower triangle of the inverse should be given and given ‘row-wise’, i.e. WOMBAT expects a ‘column’ number which is less than or equal to the ‘row’ number.

HINT: Calculations involved are more efficient if elements are given in order (of the lower triangle)!
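To make the layout concrete, the Python sketch below builds the lines of a .gin file from a small, dense matrix K. It inverts K by Gauss-Jordan elimination, accumulates the log determinant of K from the pivots as a by-product, and then lists the non-zero elements of the lower triangle of the inverse row-wise as ‘column row value’. This is a toy illustration under assumptions (pure Python, positive definite K, an arbitrary output precision and zero tolerance); for matrices of realistic size a proper sparse or Lapack-based routine would be used instead.

```python
import math


def invert_with_logdet(K):
    """Gauss-Jordan inversion of a positive definite matrix K.

    Returns (inverse of K, log determinant of K); the latter is the sum
    of the logs of the pivots met during elimination.
    """
    n = len(K)
    # augmented matrix [K | I]
    a = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(K)]
    logdet = 0.0
    for p in range(n):
        pivot = a[p][p]
        if pivot <= 0.0:
            raise ValueError("matrix is not positive definite")
        logdet += math.log(pivot)
        a[p] = [v / pivot for v in a[p]]
        for r in range(n):
            if r != p and a[r][p] != 0.0:
                factor = a[r][p]
                a[r] = [vr - factor * vp for vr, vp in zip(a[r], a[p])]
    return [row[n:] for row in a], logdet


def gin_lines(K, tol=1e-12):
    """Lines of a .gin file: log determinant, then 'column row value'."""
    inverse, logdet = invert_with_logdet(K)
    lines = [f"{logdet:.10f}"]
    for row in range(len(inverse)):
        for col in range(row + 1):  # lower triangle: column <= row
            if abs(inverse[row][col]) > tol:
                lines.append(f"{col + 1} {row + 1} {inverse[row][col]:.10f}")
    return lines
```

Looping over rows first and columns within rows gives the elements in order of the lower triangle, in line with the hint above.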

6.6.1.1 Codes for GIN levels

By default, WOMBAT determines the number of levels for a random effect with covariance option GIN from the data, renumbering them in ascending numerical order. In some cases, however, we might want to fit additional levels, not represented in the data. A typical example is an additional genetic effect, which can have levels not in the data linked to those in the data through covariances arising from co-ancestry.

If WOMBAT encounters row or column numbers greater than the number of random eﬀect levels found in the data, it will take the following action:

1. It is checked that this number does not exceed the maximum number of random effects levels as specified in the parameter file. If it does, WOMBAT stops (change parameter file if necessary).

2. WOMBAT looks for a file with the same name as the .gin file but extension .codes; e.g. mother.codes for the random effect mother. This file is expected to supply the codes for all levels of the random effect: There has to be one line for each level with two space separated integer variables, the running number (1st) and the code for the level (2nd).

For an analysis using the run option --s1step where the user supplied matrix represents the inverse of the joint relationship matrix between genotyped and non-genotyped animals, the .codes file is required to have a third column with the code 1 for non-genotyped and 2 or 3 for genotyped individuals.

If the random effect represents an additive genetic effect and the model of analysis fits respective, explicit group effects, pertinent information is expected to be supplied from column 4 onwards; see subsection 4.17.10 for details.

3. If no such file is found, WOMBAT will look for a genetic effect (i.e. a random effect with covariance option NRM) which has the same number of levels as the current random effect. If found, it will simply copy the vector of identities for that effect and proceed. (Hint: you may have to use the run time option --noprune to utilise this feature.)

4. Finally, if neither of these scenarios applies, WOMBAT will assume the random effect levels are coded from 1 to N and try to proceed without any further checking. This may cause problems!
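A .codes file as described in step 2 is simple to generate. The Python sketch below is a hypothetical helper, assuming the list of codes is already in the order of the running numbers used in the .gin file. It writes the running number and the code, plus the optional third column distinguishing genotyped and non-genotyped individuals. Group information from column 4 onwards is not covered.

```python
def codes_lines(level_codes, genotype_flags=None):
    """Build lines for a .codes file: running number, then original code.

    genotype_flags, if given, maps each code to 1 (non-genotyped) or
    2/3 (genotyped) and is written as a third column.
    """
    lines = []
    for running, code in enumerate(level_codes, start=1):
        fields = [running, code]
        if genotype_flags is not None:
            fields.append(genotype_flags[code])
        lines.append(" ".join(str(f) for f in fields))
    return lines
```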

6.6.1.2 Diagonal elements of GIN matrix

For runs which produce random effects solutions WITH standard errors, a file containing the diagonal elements of the general relationship matrix K (NOT of the inverse!) is recognised. This is expected to have the same name as the .gin file but extension .hdiags or .gdiags. If found, these diagonal elements are used to attempt computation of the accuracies corresponding to the prediction errors computed. The file is expected to be formatted, with one line for each level of the corresponding random effect. Each line should contain three space separated variables:

1.: The running number (integer)
2.: The original (animal) code (integer)
3.: The diagonal element (real). Missing values can be replaced by a value of -9.0.

HINT: Output of such a file, with extension .hdiags, can be requested when using WOMBAT with run option --hinv to build the joint inverse relationship matrix for a single-step analysis, H⁻¹.

6.6.1.3 Elements of the general relationship matrix

Multivariate analyses estimating covariance matrices due to random effects with covariance option GIN at reduced rank using the average information algorithm require the product of the ‘original’ matrix K and a vector in each iterate. For such analyses, WOMBAT allows for the following alternatives:

A): If a ﬁle with the same name as the .gin ﬁle but extension .matrix is found, WOMBAT expects to read the elements of the lower triangle of the symmetric matrix from this ﬁle.
B): If the original matrix is not available, it may be more convenient to supply the Cholesky factor of the inverse K⁻¹ instead, in a ﬁle with the same name as the .gin ﬁle but extension .chlsky. If this is given (and no .matrix ﬁle is found), the required product is evaluated using two triangular solves in each iterate.
C): If neither of these ﬁles is available, WOMBAT will attempt to carry out the Cholesky factorisation if the GIN matrix is more than 75% dense. This is done storing the matrix in full and using Lapack routines for symmetric, positive deﬁnite matrices – if the matrix is large or not safely positive deﬁnite this can cause problems. Otherwise, the program will stop.
To avoid setting up a .matrix or .chlsky ﬁle altogether, please disable use of the average information algorithm by explicitly specifying one of the other maximisation algorithms.

As for the .gin file, the .matrix or .chlsky file should be formatted, with one line for each non-zero element containing three space-separated variables (but no line corresponding to the determinant):

1.: An integer code for the ‘column’ number
2.: An integer code for the ‘row’ number
3.: A real variable specifying the element of the matrix or Cholesky factor

HINT: Run option --hchol is provided to carry out a sparse Cholesky factorisation as a separate step; see subsection 5.2.10.
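The ‘two triangular solves’ of alternative B) can be illustrated as follows. Assuming the Cholesky factor is a lower triangular matrix L with K⁻¹ = LL′ (the manual does not spell out the convention, so this is an assumption), the product Kv equals (L′)⁻¹L⁻¹v and is obtained by solving Ly = v followed by L′x = y. The Python sketch below uses dense lists for clarity only; it is not a description of how WOMBAT implements this step.

```python
def forward_solve(L, b):
    """Solve L y = b for a lower triangular matrix L."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def backward_solve_transposed(L, y):
    """Solve L' x = y, reading the transpose from the lower triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[j][i] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def k_times_vector(L, v):
    """Product K v, where the inverse of K equals L L'."""
    return backward_solve_transposed(L, forward_solve(L, v))
```

Neither K nor its inverse is formed explicitly, which is the point of supplying the factor.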

6.6.2 Basis function ﬁle

If a regression on a user-defined set of basis functions has been chosen in the model of analysis by specifying the code USR for a covariable (or ‘control’ variable in a RR analysis), file(s) specifying the functions need to be supplied.

The form required for these ﬁles is:

1. The name of the file should be the name of the covariable (or ‘control’ variable), as given in the parameter file (model of analysis part), followed by _USR, the number of coefficients, and the extension .baf.

EXAMPLE: If the model of analysis includes the effect age and the maximum number of regression coefficients for age is 7, the corresponding input file expected is age_USR7.baf

N.B.: The file name does not include a trait number.
This implies that for multivariate analyses the same basis function is assumed to be used for a particular covariable across all traits. The only differentiation allowed is that the number of regression coefficients may be different (i.e. that a subset of coefficients may be fitted for some traits); in this case, the file supplied must correspond to the largest number of coefficients specified.

2. There should be one row for each value of the covariable.

3. Rows should correspond to values of the covariable in ascending order.

4. The number of columns in the file must be equal to (or larger than) the number of regression coefficients to be fitted (i.e. the order of fit) for the covariable.

5. The elements of the i-th row should be the user-defined functions evaluated for the i-th value of the covariable.

EXAMPLE: Assume the covariable has possible values of 1, 3, 5, 7 and 9, and that we want to ﬁt a cubic regression on ’ordinary’ polynomials, including the intercept. In this case, WOMBAT would expect to ﬁnd a ﬁle with 5 rows (corresponding to the 5 values of the covariable) and 4 columns (corresponding to the 4 regression coeﬃcients, i.e. intercept, linear, quadratic and cubic):
  1   1   1    1 
  1   3   9   27 
  1   5  25  125 
  1   7  49  343 
  1   9  81  729
Note that there is no leading column with the value of the covariable (you can add it as the last column which is ignored by WOMBAT, if you wish) – the association between value of covariable and user deﬁned function is made through the order of records.
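The table in the example above can be produced with a few lines of code. The Python sketch below is a hypothetical helper: it evaluates the ‘ordinary’ polynomial terms for each distinct value of the covariable, in ascending order, and assembles the expected file name from the covariable name and the number of coefficients.

```python
def polynomial_basis_rows(values, n_coeff):
    """Rows of polynomial terms x**0 .. x**(n_coeff - 1), one row per
    distinct covariable value, in ascending order of the covariable."""
    return [[x ** k for k in range(n_coeff)] for x in sorted(set(values))]


def baf_file_name(covariable, n_coeff):
    """File name expected for a USR covariable, e.g. age_USR7.baf."""
    return f"{covariable}_USR{n_coeff}.baf"


# Values 1, 3, 5, 7 and 9 with a cubic fit including the intercept
rows = polynomial_basis_rows([1, 3, 5, 7, 9], 4)
text = "\n".join(" ".join(f"{v:4}" for v in row) for row in rows)
```

Other basis functions can be substituted by replacing the expression x ** k; the layout of the file stays the same.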

6.6.3 File with allele counts

For an analysis using the run option --snap, an additional input file is required which supplies the counts for the reference allele for each QTL or SNP to be considered. This has the default name SNPCounts.dat or SNPCountsR.dat, depending on whether integer or real input is chosen. If both exist in the working directory, WOMBAT will utilize the former and ignore the latter.

SNPCounts.dat must be a formatted file with one row per QTL. Each row should contain a single digit (usually 0, 1 or 2) for each individual in the data, without any spaces between them! At present, there is no provision for missing genotypes (these are readily imputed). In contrast to most other input files used by WOMBAT, information is obtained in a Fortran formatted read and a blank line is treated as a line of zeros. For example, if there are 1000 individuals, each line should be 1000 characters long. The number of SNPs processed is given by the number of records (rows) in the file.
SNPCountsR.dat accommodates the situation where – for some reason or other – a format with space separated values is preferred. This removes the restriction of a single digit. ‘Counts’ are read as real values, i.e. can contain decimals. Values for a SNP can be spread over any number of rows, but counts for each new SNP must begin on a new row.
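For the integer form, a row is simply the string of digits for one SNP. The Python sketch below (an illustrative helper, not part of WOMBAT) builds such rows from lists of counts and rejects anything that is not a single digit. The order of individuals within a row is taken as given and is not checked.

```python
def snp_count_lines(counts_by_snp):
    """One line per SNP: one digit per individual, no separators."""
    lines = []
    for counts in counts_by_snp:
        for c in counts:
            if not (isinstance(c, int) and 0 <= c <= 9):
                raise ValueError("SNPCounts.dat holds single digits only")
        lines.append("".join(str(c) for c in counts))
    return lines
```

With 1000 individuals, each string returned is 1000 characters long, as required above.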

6.6.4 Files with results from part analyses

6.6.4.1 List of partial results

For a run with option --itsum or --pool, WOMBAT expects a number of ﬁles with results from part analyses as input. Typically, these have been generated by WOMBAT when carrying out these analyses; see subsection 7.2.6 for further details.

6.6.4.2 Single, user generated input ﬁle

For run option --pool, results can be given in a single ﬁle instead. For each part analysis, this should contain the following information:

1. A line giving (space separated):

a): The number of traits in the part analysis
b): The (running) numbers of these traits in the full covariance matrix.
c): The relative weight to be given to this part; this can be omitted and, if not given, is set to 1.

2. The elements of the upper triangle of the residual covariance matrix, given row-wise.

3. For each random effect fitted, the elements of the upper triangle of its covariance matrix, given row-wise. Each matrix must begin on a new line and the matrices must be given in the same order as the corresponding VAR statements in the parameter file.
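The entry for one part analysis could be assembled along the following lines. This Python sketch is only an illustration with made-up function names: it writes the header line (number of traits, trait numbers, optional weight), followed by the upper triangle of the residual matrix and of each random effect's matrix, one matrix per line. Putting each matrix on a single line is a choice made here for simplicity.

```python
def upper_triangle(matrix):
    """Elements of the upper triangle (including diagonal), row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def part_analysis_lines(trait_numbers, residual, random_effects, weight=None):
    """Lines describing one part analysis for a single --pool input file.

    random_effects must be in the order of the VAR statements in the
    parameter file.
    """
    header = [len(trait_numbers), *trait_numbers]
    if weight is not None:
        header.append(weight)  # omitted weight defaults to 1
    lines = [" ".join(str(h) for h in header)]
    for matrix in [residual, *random_effects]:
        lines.append(" ".join(str(v) for v in upper_triangle(matrix)))
    return lines
```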

6.6.5 ‘Utility’ ﬁles

WOMBAT will check for existence of other ﬁles with default names in the working directory and, if they exist, acquire information from them.

6.6.5.1 File RunOptions

This ﬁle can be used as an alternative to the command line to specify run options (see chapter 5).
It must have one line for each run option speciﬁed, e.g.
-v
--emalg
to specify a run with verbose output using the EM-algorithm.

6.6.5.2 File FileSynonyms

In some cases, WOMBAT expects input ﬁles with speciﬁc names. If ﬁles with diﬀerent default names have the same content, duplication can be avoided by setting up a ﬁle FileSynonyms to ‘map’ speciﬁc ﬁles to a single input ﬁle. This ﬁle should contain one line for each input ﬁle to be ‘mapped’ to another ﬁle. Each line should give two ﬁle names (space separated) :

(a): The default name expected by WOMBAT.
(b): The name of the replacement ﬁle

EXAMPLE:

age.baf      mybasefn.dat 
damage.baf   mybasefn.dat

[Not yet implemented !]

6.6.5.3 File RandomSeeds

To simulate data, WOMBAT requires two integer values to initialise the random number generator. If the file RandomSeeds exists, it will attempt to read these values from it. Both numbers can be specified on the same or different lines. If the file does not exist in the working directory, or if a read error is encountered, initial numbers are instead derived from the date and time of day.

WOMBAT writes out such a file in each simulation run, i.e. if RandomSeeds exists, it is overwritten with a new pair of numbers!

6.6.6 File SubSetsList

For a run with option --itsum, WOMBAT expects to read a list of names of files with results from subset analyses in a file with the standard name SubSetsList. This is generated by WOMBAT (see subsection 7.3.9) if the part analyses have been carried out using WOMBAT, but may need editing. In particular, if a weighted summation is required, the default weights of ‘1.000’ need to be replaced ‘manually’ by appropriate values selected by the user!

6.6.7 File(s) Pen*(.dat)

6.6.7.1 File PenTargetMatrix

For penalty options COVARM and CORREL a ﬁle with this name must be supplied which gives the shrinkage target. This must be a positive deﬁnite matrix. The ﬁle should be a plain text ﬁle and contain the elements of the upper triangle of the matrix. It is read in ‘free’ format, i.e. variable numbers of elements per line are allowed.

6.6.7.2 File PenBestPoints.dat

A run with the option --valid expects to read sets of estimates from a ﬁle with this name. This is generated by WOMBAT when penalized estimation is speciﬁed, but can be edited to suit or generated by other means. For each tuning factor, it should contain:

(a): A line with the tuning factor (real variable) at the beginning
(b): The elements of the upper triangle of the estimate of the residual covariance matrix (or equivalent) for this tuning factor. This is read in ‘free’ format, i.e. can be given over as many lines as suitable.
(c): Starting on a new line: The elements of the upper triangle of the estimate of the genetic covariance matrix (or equivalent) for this tuning factor. Again, this is read in ‘free’ format.
