Sparse variable and covariance selection for high-dimensional seemingly unrelated Bayesian regression

High-throughput technology for molecular biomarkers is increasingly producing multivariate phenotype data exhibiting strong correlation structures. Existing approaches for combining such data with genetic variants for multivariate Quantitative Trait Loci analysis generally either ignore correlation structure or make other restrictive assumptions about the associations between phenotypes and genetic loci. We present a Bayesian Variable Selection (BVS) model with sparse variable and covariance selection for high-dimensional seemingly unrelated regressions. The model includes a matrix of binary variable selection indicators for multivariate regression, thus allowing different phenotype responses to be associated with different genetic predictors (a seemingly unrelated regressions framework). A general covariance structure is allowed for the residuals relating to the conditional dependencies between phenotype variables. The covariance structure may be dense (unrestricted) or sparse, with a graphical modelling prior. The graphical structure amongst the multivariate responses can be estimated as part of the model. To achieve feasible computation of the large and complex model space, we exploit a factorisation of the covariance matrix parameter to enable faster computation using Markov Chain Monte Carlo (MCMC) methods. We are able to infer associations with thousands of candidate predictors multivariately on hundreds of responses. We illustrate the model using a dataset of 158 NMR spectroscopy measured metabolites and over 9000 Single Nucleotide Polymorphisms on chromosome 16, measured in a cohort of more than 5000 people.


Introduction
Integrating multiple high-dimensional molecular biomarker data sets is a fundamental problem in genetic epidemiology and bioinformatics, in the search for molecular mediating mechanisms of the effects of genetic variants on ultimate clinical phenotypes. An important part of the analysis problem is to find associations between a set of genetic variants and downstream molecular phenotypes such as gene expression, proteomics, metabolomics or epigenetic data.
Since molecular phenotypes are assumed to be downstream of genetic variants in the causal pathway, the usual analysis framework is a regression model of phenotype on genotype. In the simplest analysis, multiple univariate regressions are performed for each phenotype-genotype pair, needing post-hoc adjustment for multiple comparisons and ignoring any correlations between genotypes and between phenotypes. Univariate approaches are unlikely to be the best strategy, considering that data where latent structures induce high level of correlation are becoming increasingly available to analyse simultaneously, for example serum metabolomic profiles ( There is an extensive literature in both classical and Bayesian frameworks on univariate multiple regression, where each phenotype, e.g. each metabolite, is regressed separately on all genetic variants. This approach enables the analyst to select a small set of predictive variants, and to find patterns of genetic variants associated with each individual molecular phenotype. Bayesian models may use shrinkage priors (Park and Casella, 2008;Griffin and Brown, 2012) to encourage the majority of regression coefficients to be shrunk to very small values. This approach requires a threshold to be set on the coefficients to select associations. Global-local shrinkage priors have been developed to overcome problems with too heavy shrinkage ). An alternative approach (Bayesian variable selection or BVS) is to explicitly select a small set of associations, using auxiliary latent variables (George and McCulloch, 1993;Kuo and Mallick, 1998;Clyde, 1999); this approach produces posterior probabilities of variable inclusion, enabling probabilistic decisions about which sets of variables are associated with the outcome of interest.
Brown, Vannucci, and Fearn (1998) and Brown, Vannucci, and Fearn (2002) developed a general framework for BVS models for multivariate outcomes. For this multivariate model with variable selection on multiple predictors, the posterior space is large and complex. In applications of this model, two alternative simplifying assumptions have been made in order to exploit conjugacy with respect to regression coefficients and residual covariance. One such simplification is used by Petretto et al. (2010), who present a model allowing for residual covariance between response variables, in the context of eQTL (expression Quantitative Trait Loci) analysis. In order to ensure conjugacy in the model and enable feasible computation times in the increasingly high-dimensional -omics setting, they assume the same set of predictors is selected in the regression equations for every response variable. Bhadra and Mallick (2013) make this same assumption on the predictors, but go a step further and include sparse residual covariance selection between regressions, using graphical modelling based on decomposable graphs. In both these models, the regression coefficients and residual covariances are integrated out, resulting in a model search over the space of the vector of variable selection indicator variables and hyperparameters.
The alternative simplifying assumption is to assume conditional independence between regressions. In this scenario different predictors can be permitted to be associated with different response variables, thus this model contains a matrix of variable selection indicator variables relating responses and predictors. Jia  The direction of this work is, accordingly, Quantitative Trait Locus mapping of correlated phenotypes.
In particular, we focus on metabolomics Quantitative Trait Locus (mQTL), a powerful approach typically used to identify genes associated with metabolic markers of diseases, where the multivariate response is generally of moderate size (in the order of hundreds of metabolites) and where both the assumption of common predictors and the assumption of independent residuals are likely to be unreasonable. We develop a fully Bayesian variable selection model which allows both for different predictors to be associated with different outcomes and for the outcomes to be residually correlated. We present two versions of the model, one with a dense covariance matrix and one with sparse covariance selection.
Each instance of the multivariate regression model has the form of a Seemingly Unrelated Regressions (SUR) model. Since the model loses the conjugacy of the earlier models, we exploit a factorisation of the covariance matrix parameter to enable faster computation using Markov Chain Monte Carlo (MCMC) methods. We are able to infer associations with thousands of candidate predictors multivariately on all the responses. In the case of sparse covariance, we also simultaneously infer an underlying graph detailing the conditional independence relationships between hundreds of response variables.
To illustrate our approach, we analyse a dataset from the Northern Finland Birth Cohort 1966 (NFBC66) 31-year follow-up, where the model developed here is used to uncover associations between 158 NMR spectroscopy measured metabolites and genome data; we will focus on directly genotyped Single Nucleotide Polymorphisms (SNPs) on chromosome 16, which results in over 9000 covariates and a sample size of more than 5000 people. The metabolites set comprises lipoprotein particle concentrations, low molecular weight metabolites such as amino acids, 3-hydroxybutyrate and creatinine, and different serum lipids, including free and esterified cholesterol, sphingomyelin and fatty acid saturation.
These data have been observed to exhibit strong residual correlation (Marttinen et al., 2014;Kettunen, Tukiainen, et al., 2012), even after accounting for the variance explained by all reported SNPs. Section 2 starts by introducing the general matrix Normal regression model and briefly reviews the variable selection prior used in this work. The crucial reparametrisation that allows us to factorise the likelihood is then presented alongside the relative priors in the transformed space. Section 3 provides the posterior computations for the parameters and details the MCMC used to estimate them. Finally in Section 5 the method is first validated against simulated data and then applied to the NFBC66 dataset.

Model
The model can be seen as a set of regressions for multivariate response Y = (y 1 , . . . , y s ), y k = (y 1k , . . . , y nk ) t , for k = 1, . . . , s, and corresponding covariate matrices X k with dimensions n × p k . The same set of covariates may be used for all regressions, but this is not necessary. Variable selection is performed on the covariates using binary indicators γ k = (γ k1 , . . . , γ kp k ) t where γ kj is 1 if covariate j is included in the regression for response k, and 0 if not. We use the shorthand notation X γ k for the columns of X k selected by the vector γ k , and similar for β γ k . Thus we can write the set of linked regressions as but most importantly the residuals will be correlated: We can also write the likelihood for this model as where Y = vec(Y ), vec() being the vectorisation operator, β γ = vec(β γ 1 , . . . , β γs ) and X γ is a block diagonal matrix with X γ k as the k th diagonal element.

Variable Selection Priors
As mentioned above, we want to select only a subset of the covariates for each outcome; the state-of-theart prior for this purpose is a Normal spike-and-slab prior ( introduced by Brown, Vannucci, and Fearn (1998)) for the regression coefficients where a point-mass on zero is associated with each γ kj = 0, while β kj follows a diffuse Normal distribution otherwise; calling β γ = {β kj |k ∈ 1, . . . s, j ∈ 1, . . . p k , γ kj = 1} Here w can be thought as a shrinking coefficient and V γ is a general prior covariance matrix with dimension dictated by the number of selected coefficients in γ.
Since we assume one common w scaling for all the selected regression coefficients, this prior makes sense only if all the Xs have also a single common scaling, i.e. if they represent comparable quantities and/or if they've been standardised. A simple generalisation would be to allow for a vector w of shrinkage coefficients as in Bae and Mallick (2004).
Several possibilities exist for V γ , the simplest being the one where V γ = I γ ; in this case the values of β γ are independent a priori and only w controls their variance. Another common choice is the g-prior, first proposed in Zellner (1986); in the case of correlated residuals this would imply V γ = We will in general centre and rescale the covariates X, the automatic scaling feature of the g-prior (Chipman et al., 2001) is of little interest and so are the computational advantages for the marginal likelihood since this is not analytically available in the Seemingly Unrelated Regression framework. In the following we thus use a priori independent regression coefficients, but any of the more complex prior only requires minor modifications to the expressions given.
To retain some flexibility we further specify an Inverse Gamma hyper-prior for w rather than fixing it to some arbitrary value.
The values of a w and b w can be chosen depending on the application.
Shifting the focus on the prior for γ, a number of sparsity inducing priors have been used in the literature, the most commons being an independent Bernoulli prior for each binary γ kj coefficient or a Binomial distribution on the model size; the desired prior sparisty level can thus be adjusted by simply tuning one marginal prior inclusion probability parameter.
Another successful choice in eQTL analysis is the hotspot detection prior used in Bottolo, Petretto, et al. (2011) where each inclusion probability is decomposed into a marginal effect for each response that regulate the sparsity level and a marginal predisposition for each predictor to be associated with a relevant number of outcomes. Because this matches with our motivating example, we will use the hotspot prior, details of which are given in Section 5.

Factorisation of the likelihood
As mentioned in the Introduction, we can only integrate out analytically both β and C by either assuming a diagonal C or the same γ k for all k, and conjugate priors on C and β, giving clear To overcome this issue we factorise the covariance matrix C iteratively as for all k = 2, · · · , s, with C (s) = C and C (1) = c 1 , and c 1 is null. Thus each C (k) is the marginal covariance matrix for responses 1, · · · , k, c k is the variance of response k, and c k is the vector of covariances between response k and responses {1, . . . , k − 1}.
With this decomposition the likelihood can be factorised as where µ i is the vector of means with µ ik = x i t γ k β γ k , the vector y i(k) stands for the vector of responses 1, · · · , k and similar for µ i(k) , and ϕ(y | µ, Σ) is the pdf for the Normal distribution with mean µ and covariance matrix Σ. Note that the joint distribution p(Y | X , β, γ, C) is the same regardless of the order used for the decomposition since we are simply factorising it by chain-conditioning.

Thus defining new parameters
the likelihood becomes a product of conditionally independent regressions: defining for convenience the matrix consisting of the first k − 1 residuals from the original linked regressions.
From (11) is straightforward to see that σ 2 k is the residual variance of response k conditioned on U (k−1) and ρ k is a real valued (k − 1)-vector of regression coefficients on the same U (k−1) residuals. The likelihood is thus decomposed into a product of independent (conditionally on the new parameters) factors over the outcomes and this opens up the possibility of parallelisation, which greatly increases the computational feasibility of this model. This type of factorisation was first proposed in the graphical modelling literature by Wermuth (1980) for multivariate Normal data with zero means and has been used in these so called covariance selection models for computational convenience and for interpretability of the now unconstrained transformed off-diagonal elements of the covariance matrix, see for example Pourahmadi (1999)  i.e. not resulting from variable selection, specifying a Jeffrey's prior on β, C.
We instead estimate our model completely in the reparametrised space, defining priors on the new parameters {σ 2 k , ρ k } starting from a usual Normal-Inverse Wishart distribution.
Note that this factorisation is related to the Cholesky decomposition of C, and not of C −1 as for example in Roverato (2000), although as we will see there is a straightforward relation between elements in C −1 and the reparametrised ρ k . Although for factorising the likelihood any ordering of the variables can be used for the decomposition, in terms of prior distribution some ordering are better suited to specific sparsity patterns in the precision matrix (Vecchia, 1988; Guinness, 2018).

Prior for a Dense Covariance Matrix
To allow a general covariance matrix with correlated residuals but no assumed structure, we start from a standard Inverse Wishart prior C ∼ IW(ν, M ). The priors in the transformed space are straightforward to calculate using the block decomposition of C. Since C (k) is a submatrix of C it also has an Inverse Wishart distribution. The new parameters are simply related to the block structure of the inverse of this matrix, with σ 2 k being the Schur complements of c k in C (k) , and ρ k = C −1 (k−1) c k , hence their priors can be calculated using standard matrix properties of the Inverse Wishart (Dempster, 1969;Dawid, 1981;Eaton, 1983;Roverato, 2000).

Prior for a Sparse Covariance Matrix
To model sparsity structure in the covariances we introduce the graph structure of the precision matrix as a parameter in the model, such that variables are conditionally independent if there is no direct link between them in the graph. In this section we use decomposable graphs to model sparsity, and show that the new parameters ρ k are simply related to the graph structure and hence retain the useful computational properties for MCMC estimation of the SUR models.

Perfect Elimination Order and Notation
A separator of a graph G is a completely-connected subgraph of G which separates the graph into two components such that any path between the two components must pass through the separator. The two components and the separator form a decomposition of G. The set of subgraphs which cannot be further decomposed form the prime components of the graph. If all the prime components are complete then the graph is said to be decomposable. In this case the prime components P q can be ordered in such a way that for every q > 1 there exists m < q such that where H q = ∪ q−1 l=1 P l for q = 2, · · · , Q. This is called the running intersection property Lauritzen (1996). With this notation, we also define the separators S q = P q ∩ H q and residuals R q = P q S q for q = 2, · · · , Q. The separators {S 2 , · · · , S Q } are not necessarily distinct from each other.
This (non-unique) ordering of the prime components is called a perfect ordering and has the property that nodes in residual R q are independent of nodes in H q S q , conditionally on the rest of the graph, and therefore the elements of the precision matrix relating pairs of nodes in those sets are structurally equal to zero.
The so-called perfect elimination ordering (also non-unique, even for a given perfect ordering of prime components) will be denoted ξ. This is defined by taking the nodes in P 1 in any order, followed by the nodes in R 2 in any order, and so on for the rest of the residuals up to R Q .  With the variables ordered accordingly to the perfect elimination ordering we have that Λ kl = 0 =⇒ ρ kl = 0, where Λ is the precision matrix C −1 ; this was proved by Paulsen, Power, and Smith (1989) (a short proof in our notation is also given in the Appendix). This means that the reparametrisation of the covariance matrix is a convenient way to encode the conditional independencies between variables.
In addition we see that conditionally on a given sparse graph, we will not need to estimate values of ρ kl for many pairs of variables, allowing for computational savings. More specifically, given a particular decomposable graph structure, ρ kl are only estimated for pairs of variables k, l contained in the same prime component of the graph.

Reparametrisation definition
Here we show further that the new parameters σ 2 k and ρ k are defined straightforwardly in terms of the covariance matrices of the prime components of the graph, and hence we easily calculate priors in the reparametrised space. Assuming the perfect elimination ordering, take any prime component P q , and define variable k to be the final variable in the residual R q . With this ordering of variables, ρ k relates variable k to the set of all variables in prime components 1, · · · , q, which is the set of nodes in the subgraph H q+1 . Thus from the definition of the reparametrisation (shown in the Appendix), where ω k is the final column of C −1 H q+1 without the last element, and ω k is the last diagonal entry of The marginal precision matrix for the subgraph H q+1 consisting of all prime components from 1 to q can be written as shown in the Appendix, which directly show that the elements of ρ k corresponding to variables in H q S q are zero. A further block matrix calculation shows that where we highlight that the off-diagonal blocks in C −1 Pq appear unmodified also in C −1 H q+1 , thanks to the zeros in the latter. Now define ρ k,Pq for the non-zero elements of ρ k . From the above equations we can see that this can be written in terms of the final column of C −1 Pq without the last element, and the last diagonal entry of C −1 Pq . Hence again by block matrix manipulation we see that we can write where C is the sub-matrix of C Pq with variable k removed, and c k,Pq is the final column of C Pq without the last element. This is the same as the original definition of ρ k , except that due to the structural zeros in the precision matrix we now only need to use the sub-covariance matrix C Pq corresponding to the prime component P q containing R q . Then consider This shows that σ 2 k also has the same definition in terms of the sub-covariance matrix C Pq .
Thus we see that with a perfect elimination ordering of variables, the new parameters ρ k , σ 2 k are completely specified by the marginal covariance matrices of the prime components of the decomposable graph. For decomposable graphs a convenient prior distribution for the covariance matrix is the Hyper Inverse Wishart (HIW), which is the conjugate prior for each prime component in a Gaussian graphical model. The Hyper Inverse Wishart C ∼ HIW G (ν, M ) is defined as the distribution such that the covariance matrix for each prime component is marginally Inverse Wishart: where M Pq is the sub-matrix of M corresponding to P q , and |P q | is the number of variables in the prime component.

Prior in the reparametrised space
For clarity, we relabel variables conditional on a given decomposable graph structure. To obtain the perfect elimination ordering of variables discussed above, we label variables with index t = (|S q |+1), · · · , |P q | within each residual R q for components q = 2, · · · , Q. For the first prime component all variables are labelled t = 1, · · · , |P 1 |.
With this labelling, the density of the HIW prior on the covariance matrix is transformed to where the densities are for all q and t except σ 2 11 ∼ IG((ν − s + 1)/2, m 11 /2).
A nice feature of this transformation is that we do not need to do the completion operation to fill in the covariances between the separated parts of prime components, so sparsity in the graph can substantially reduce the computational time for this part of the model.

Posterior Computations
Recall we started from the model and we reparametrised it into where ξ is the perfect elimination ordering derived from graph G.
By allowing γ to be completely unstructured we are not constrained to find the same set of predictors on the right hand side of Equation (27) for every outcome y k . This leads to a class of linear models called Seemingly Unrelated Regressions (SUR). Unfortunately it is well known, see for example Holmes, Denison, and Mallick (2002), that for the SUR model there is no joint conjugate prior on β, C.
The major consequence of this is that the strategy followed by Richardson, Bottolo, and Rosenthal The reparametrisation (8) and the introduction of a sparse precision matrix via the graph G allows us much more economical Gibbs perturbation of the parameters, allowing us to scale the MCMC procedure to high dimensional data-sets.
In particular it will be possible to show that most of the dependency on the previous parameter values will be hidden in the value of the estimated residuals U = Y − Xβ at the previous iteration.
Because of this Zellner and Ando (2010) name this procedure Direct Monte Carlo (DMC) and show how it is possible to derive a tractable expression to update all the parameters of the reparametrised residual covariance matrix C under the Jeffrey's prior; we follow their work but show however that is possible to recover the correct posterior full conditionals, avoiding unnecessary and computationally prohibitive resampling steps, that enables a Gibbs sampler which factorises over the regression outcomes as does the transformed likelihood; we provide below the expressions given our more general priors from previous Sections: and by defining the following index sets with respect to the perfect elimination order ξ we can write These will be used not only to update β γ , σ 2 and ρ in a Gibbs sampler fashion inside the MCMC, but also as proposal distributions when we will need to update their value while changing some of their conditioning parameters (namely γ and G).

Sampler details
Here we will give a brief description of the MCMC sampler used, full details can be found in the Appendix (Section 11). We

Structure learning
So far we have discussed reparametrisations of C given a specific graph G but we are instead interested in estimating its structure so that we can infer all the conditional (in)dependency relationships between the outcomes. Even for moderate number of outcomes s, the graph space is exceedingly big and thus posterior sampling can be challenging; to help the Markov Chain explore the whole space we take advantage of the junction tree representation of a decomposable graph as a state variable.
A spanning tree J of a graph G where each node is a subset of the vertices in G has the junction property if it is such that for any two nodes N q and N l every node on the path between them in J For any decomposable graph it is known that such representation exists ( see again for example Cowell et al. (2006) ). Note for example that is trivial to construct one such tree from the perfect ordering of the cliques, where the junction tree property is guaranteed by the running intersection property. As with the perfect ordering, a decomposable graph may have many junction trees representation; for a given junction tree, the implied graph is however uniquely determined. Green and Thomas (2013) introduce an efficient sampler that parametrise the model in terms of the junction tree J which allows local perturbations to multiple edges of G simultaneously, contrarily to the more common single-edge modification Giudici and Green (1999). The junction tree sampler ensures decomposability of the graph at all times and providing as a by-product a perfect elimination ordering ξ; this helps dramatically the mixing of the Markov Chain, especially in high dimensions, and avoids the need for maximal cardinality search type of algorithms to check for chordality and to build ξ.
With this in mind, Green and Thomas (2013) define a prior directly on the junction tree itself, proportional to a selected prior π(G) on G and uniform over the number of possible junction trees µ(G) for a specific decomposable G π(J) = π(G(J)) µ(G(J)) .
This allows to correct for the otherwise oversampling bias toward graphs that have more junction tree representations, but at the rather steep cost of computing µ(G); see Thomas and Green (2009).
Given that graphs with fewer edges will have more junction tree representations, and since we are specifically interested in sparse models, the oversampling bias is not necessarily a problem in our case, as long as is correctly accounted for. We thus specify the prior on G to be proportional to the number of possible junction tree representations; additionally we specify independent Bernoulli(η) prior on each edge of the graph ( Binomial on the cardinality edge-set |E| ), with a Beta prior on η to help the sampler push toward sparsity when needed.
Note that, even though each sample is from a decomposable model, the sampler allows us to discover non-decomposable graph structures via Bayesian Model Averaging of the marginal edge inclusion probabilities.
Finally it is worth mentioning that with changes in the graph, the perfect elimination ordering ξ will change as well and with it the conditioning order in Eq (24); this means that the ρ coefficients lose their interpretation as partial correlations from one iteration to the next. It is important to remark though that since the transformation between C and σ 2 , ρ is bijective, the MCMC will retain its validity.
As we are interested mainly in structure learning, both the variable and the covariance selections, we consider the reparametrised covariance as nuisance parameters.

MCMC sampling of the graphical structure
To sample the graph G we use the Junction Tree Sampler of Green and Thomas (2013), allowing us to connect and disconnect multiple nodes of the graph at the same time.
We have also implemented a randomisation move inspired by Thomas and Green (2009) that uniformly samples an equivalent junction tree given the current graph. This move rebases the junction tree at different root component and exchanges some of the links in the junction tree between nodes with identical separators. This does not affect the relation between J, the current junction tree, and the current graph but is purely a way to increase mixing as described in Green and Thomas (2013).
The covariance parameters are then jointly proposed with the new graph structure; the reparametrisation from C to σ 2 , ρ allows us to update only some of the coefficients, connected with the graph update. The easiest way to do this would be to compare with the current ξ and only update elements starting from the first modified index in the perfect elimination order corresponding to the proposed junction tree J . Their update use again the (possibly tempered) posterior full conditionals described in Section 3.1. The proposed triplet J , σ 2 qt , ρ qt is accepted or rejected it using Metropolis-Hastings.
Finally, η is updated via a Gibbs perturbation, as its full conditional is analytically available.

Results
We want to evaluate the performances of the reparametrised multivariate sparse SUR model and of our efficient sampler in both a simulated settings and on real mQTL data. The aim of the study detailed in under the constraint that o k × π j ≤ 1 ∀j, k where o k defines the sparsity level and π j describes the prior predisposition for each predictor to be an hotspot. In the MCMC we use MH to sample new points for o k and π j (see  for details).
The four hyperparameters are chosen so that the average model size for each outcome is small, as we want to enforce a strong sparsity in the model, while a π and b π are chosen so that E [π j ] = 2 and Var [π j ] = 2.
The prior for β kj |γ kj is N (0, w) and since we standardise and centre all responses and covariates, the hyperparameters for τ and w are set that these variances are centred on small values but at the same time allowing the respective prior to be diffuse.
All initialisation are random, except for the graph which is initialised empty (i.e. on the independent model).
The sampler with all the above specifications is implemented fully in C++ and freely available at https://github.com/mbant/Bayesian_SSUR . Similarly to Richardson, Bottolo, and Rosenthal (2010) and Bhadra and Mallick (2013) we set up our simulation study by randomly subsampling p = 300 SNPs from our real -omics dataset (see Section 5.2). This will form our covariate set X and allows us to mimic real correlation effects and linkage disequilibrium between genes that would be otherwise difficult to simulate artificially. The observed correlations between covariates ranges from very small to over 0.8 in absolute value.

Simulation Study
We then set n = 200 and s = 30 and proceed by selecting the correlation structure for the outcomes in form of a graph. We explore either a block diagonal structure with 4 components, a decomposable graph with 4 components or a non decomposable (cyclic) model with 5 components.
In order to present a range of possible association patterns between outcomes and predictors, we fix (conditionally on the selected graph structure) the γ matrix as shown in Figure 2. This is built so that a small group of different covariates is connected to each of • all outcomes, i.e. a true hotspot; • all outcomes within each Prime component; • all outcomes within each Residual component, so that each set of predictors is associated with different outcomes; • a set of selected outcomes that spans multiple components, so that the covariates are linked to both correlated and (conditionally) independent outcomes. Now that the structure is fixed, we sample the non-zero β γ coefficients as random normals β γ ∼ N (5, 1) (so that most of the real βs are distinct from zero) and the residuals' precision matrix C −1 from a W G (s + 2, M ) (using the BDgraph package of Mohammadi and Wit (2015)) with M = f C, f ∈ R + and C a correlation matrix with 0.85 on the off-diagonal elements; C is obtained by inverting the precision matrix and the residuals are finally sampled from u ∼ MN (0, I n , C).
We choose f at each repetition so that we can obtain a data-set with the desired empirical signal to noise ratio (within 10% of the desired value), measured as where σ 2 , ρ are the reparametrised values of the covariance matrix C and L is as defined in Eq (34).
We repeat the procedure 50 times for each combination of correlation structure and f /SN R to produce different datasets and average the result obtained from each run of 250000 iterations, for both our Sparse SUR model and the HESS model. We expect HESS to perform well even in the presence of correlated When looking at Table 1, that summarise the results for the graph reconstructions averaged over the simulations conditions, is important to note that the graph structure has much more implicit noise than the variable selection. While the regression coefficients β are fixed to a significantly greater than zero value, we are not only sampling from the G-Wishart to obtain the precision matrix but also, because   Table 1: Averaged true positive rates and false positive rates relative to graph reconstruction when the resulting posterior marginal edge inclusion probabilities are thresholded at the 0.5 level. The results correspond to our expectations; HESS is known to perform well even in cases where residuals are correlated (Bottolo, Petretto, et al., 2011), as long as SN R is not too low and we indeed observe that the simpler model detect with high probability the true effects. In most cases though, especially at low levels of 'signal', HESS estimates are more noisy and more false positives are picked up due to the confounding effect of the unaccounted for correlations. SSUR has a more marked separations between positive (very probable) and negative (very improbable) signals and overall returns less noisy estimates, especially at lower level of signal. SSUR is also able at the same time to recover appropriately the conditional (in)dependence structure of the residuals especially when the decomposability assumption is not violated; in the non-decomposable cases graph reconstruction seems to converge to a close (in the graph space), more dense chordal graph alternative (see Figure 5), which is what one would expect based on Fitch, Jones, and Massam (2014), see also the discussion in Section 6.1.

High dimensional -omics data analysis
Finally, we performed an analysis of human e-QTL data; the data come from the 31-year followup study of the Northern Finland Birth Cohort NFBC66 collection, previously described in details in Rantakallio (1969)  The posterior expectations of π uncover the presence of some hotspots ( i.e. genetic variants that are associated with multiple outcomes ), in particular we report rs4985124, rs931406, and most importantly rs12102766 and rs3764261 as shown in Figure 6 and Table 2.
Analysing the whole γ gives us a lot more information though, as shown in Figure 7, where we plotted posterior marginal inclusion probabilities for each predictor against their position in chromosome 16 with different outcomes being superimposed; we can see how some SNPs are associated with only one or a few outcomes and would thus be missed by only looking at hotspot detections; this results are presented in Table 3 as well.
By thresholding the posterior inclusion probabilities at 0.5 we discover a total of 38 associations, and the average Bayes FDR (bFDR, see e.g. Lewin et al. (2016)) is ∼ 0.0578. By increasing the threshold to 0.9 we would keep 32 coefficients, with a bFDR < 0.01.

SNPs Associated Metabolite rs4985124
FAw3.FA FAw6.FA   Our novel method that allows the residuals to be correlated was also able to uncover other potential XL.HDL.CE rs4843914 Cit rs2544630 Gln  important associations that warrant further investigations, in particular rs7191766, associated with multiple cholesterol-related phenotypes.

The decomposability assumption
One might think that the restriction to decomposable graphs would be too stringent for real applications; various attempt have in fact been made to relax such an assumption, using the G-Wishart Another disadvantage of course is that for a non-decomposable graph the correspondence between the reparametrised ρ and the zeros in the precision matrix C −1 is harder to derive, which makes unrestricted graphs less interesting in the context of this work. Lastly, as we will see later (see Section 4), assuming decomposability allows for the Junction Tree representation of G (Lauritzen, 1996), which gives us the ability to sample multi-edge modification of the graph.
The work of Fitch, Jones, and Massam (2014) concludes that, under model assumptions similar to ours, inference on G will asymptotically converge toward minimal triangulations of the true graph, i.e.
the graph with the least amount of extra edges so that chordality is obtained, and that inference on the covariance matrix is competitive in terms of prediction errors against penalised likelihood methods that estimate unrestricted graphs like the graphical lasso. In practice, assuming decomposability seems to be sensible and inference on the covariance matrix under such an assumption sound; the only obvious word of caution is against over-interpretation of the estimated graph.

Conclusion
This work presents a novel method to perform Bayesian variable selection in a matrix-variate regression settings jointly for multivariate outcomes that takes into account residual correlation while maintaining a flexible association pattern between responses and predictors. Although conjugacy is lost for this very general model, by virtue of a crucial reparametrisation of the covariance matrix the computations can be greatly sped up and thus our method is able to analyse a large number of both outcomes and covariates, thanks as well to the efficient C++ implementation, as demonstrated in the real data example. It is moreover possible to introduce further computational gains by assuming that the conditional independence structure between the residuals is sparse and inference on the resulting graph is straightforward to obtain; we also show that is possible to obtain simple expressions for the prior distribution of the reparametrised parameters starting from the usual Normal-(Hyper)Inverse Wishart prior for both the dense and the sparse cases.
We have shown in the simulation study how, in the presence of non-negligible residual correlations, our method exhibit better performances in selecting relevant predictors and is able at the same time to perform covariance selection. Nevertheless, this method is inherently slower than models that assume independent outcomes or other simplifying assumptions which allow to retain conjugacy, that can comparatively analyse a greater number of outcomes. Thanks to the very general formulation of the SUR models we expect the present work to find appli-cations beyond the mQTL-type data presented here, for example in finance, econometrics and other biological settings where linked regression models are already widespread.
Recall the original decomposition Using standard block matrix inverse identities, we see that the decompositions in Equations 46 and 47 are related by Hence we can write down the new parameters in the transformed space as with ω s = λ s and ω s = λ s .
So we notice that the new parameters are directly related to the marginal precision matrices of the subsets of response variables, and hence the ρ directly relate to the partial correlations amongst the subsets of variables, with the remaining variables marginalised out.
We can use the Cholesky decomposition of C to derive a relation between the new parameters and the complete precision matrix Λ. Let P be a lower triangular matrix with P kl = ρ kl (k > l) and zeros on the diagonal. The reparametrisation of the covariance matrix converts into where S is a diagonal matrix such that S kk = σ 2 k . Thus we see that Given that C is symmetric and positive definite, there exists a unique modified Cholesky decomposition such that C = LDL T . We thus see that the reparametrisation is related to the Cholesky decomposition with L = (I − P ) −1 and D = S.
From this we can easily write the precision matrix in terms of the new parameters: from which we can see that for k = 2, · · · , s.
8 Appendix: Proof that zeros in ρ k correspond to zeros in Λ The decomposable graph structure means that the precision matrix Λ for the entire graph can be written as where Q is the final prime component in the ordering above. The blocks of zeros are due to the conditional independence defined by the graph structure, since by definition R Q is conditionally independent of H Q S Q given the separator S Q .
With the perfect elimination ordering, variable s is the final variable in the residual of the final prime component R Q , corresponding to the final row/column in Λ as ordered here. For the variable s, we know from standard matrix operations that so we can see here that all elements of the vector ρ s corresponding to the partial correlations between variable s and variables in H Q S Q are zero. Next consider the remaining variables in R Q , labelling these s − 1, s − 2, . . . . For any variable k, we have the recursion formula relating k to variables m > k: From this it is easy to see that for any variable k in the final residual R Q , the zeros in ρ k are directly inherited from the ρ k , ρ k+1 , . . . , ρ s . All these variables have the same zeros in the λ k , λ k+1 , . . . , λ s sub-vectors of the precision matrix, so for all variables in R Q , the vector ρ k has zeros where λ k has zeros, corresponding to partial correlations with variables in H Q S Q .
Now consider the residual of the next to final prime component, R Q−1 . By definition of the decomposable graph structure, the variables in R Q−1 are separated from variables in H Q−1 S Q−1 , and so Λ H Q−1 S Q−1 ,R Q−1 = 0. The running intersection property of a decomposable graph implies that R Q must be separated from at least one of R Q−1 and H Q−1 S Q−1 . If R Q is separated from R Q−1 then ρ mk = 0 for all m ∈ R Q and k ∈ R Q−1 . If R Q is separated from H Q−1 S Q−1 then λ kl = 0 =⇒ λ ml = 0 =⇒ ρ ml = 0 for all m ∈ R Q and k ∈ R Q−1 . In either case this leads to the conclusion that λ kl = 0 =⇒ ρ kl = 0. Hence again, ρ k has zeros where λ k has zeros, corresponding to partial correlations between variables in R Q−1 and variables in H Q−1 S Q−1 .
The same argument applies to all successive residuals of the graph, so we see that for all variables, ρ kl = 0 where Λ kl = 0.
9 Appendix: Marginal precision matrix for subgraph Here we consider decomposable graphs. We show that the marginal precision matrix C −1 Hq for each subgraph H q has the same zero structure as the corresponding block of the precision matrix for the entire graph Λ Hq . First see that we can partition the precision matrix into the final residual and the rest of the graph: conformably with the covariance matrix: and from these we have Next we partition Λ H Q into the next residual R Q−1 , its separator S Q−1 and the remainder: where the zero blocks appear by definition of the decomposability of the graph.

Now calculate the 2nd term in Equation 62
: The running intersection property of a decomposable graph implies that either (a) R Q is separated In either case this leads to the block decomposition for the marginal precision matrix as that is the zero-structure of the overall graph is maintained for a subgraph where the nodes in R Q have been removed.
The same argument can be applied when removing successive residuals R Q−1 , R Q−2 etc, since each time the removal results in another decomposable graph.
10 Appendix : Computational Advantages of the reparametrised sparse model The reparametrisation from (6) to (8) is computationally beneficial in multiple ways. The most obvious one is that the likelihood is now computed as a product of independent factors, and thus its computation can be trivially parallelised once we know Xβ and U ρ and σ 2 .
Sparsity in both the variable selection coefficients, contrarily to shrinkage priors like the horseshoe Another, perhaps less obvious, advantage is the fact that the availability of all the full conditionals for a given index j (in Eq (68)) allows us to update only a subset of the coefficients when we update both Γ or G, proposing thus less daring move in the parameter space that have a higher probability of being accepted.

Appendix: Detailed sampler specification
Here we will details how the sampling of each parameter is performed. We start by re-stating the model and all the priors and then describe each Gibbs or Metropolis-within-Gibbs update for each parameter conditioned on all the others.
where C is an s × s general covariance matrix, which we reparametrise into with ξ being a (non-unique) global perfect elimination order, determined by G.
In the reparametrised space, this HIW prior implies where Q is the number of Prime components P q in the decomposable graph G, S q is the separator associated with component P q , with the convention that S 1 = ∅, and is R q the corresponding residual component. It is also assumed that inside each component the nodes are ordered in accordance with ξ.
under the constraint that o k × π j ≤ 1 ∀j, k In the following we will use p(·) to indicate the prior distributions and (·) for the likelihood function; we will avoid including explicitly the Jacobian of the transformation (68) as it will usually be included in the posterior full conditional for σ 2 , ρ . q(·) will usually be reserved for proposal probability distributions.
Following Bottolo and Richardson (2010) we will also define multiple chains c = 1, . . . N c , but as our likelihood (6) is not factorisable, we will use a slightly less complex tempering scheme. In particular each chain will only have one temperature t c associated ( rather than one for each outcome ) and we will only temper the likelihood and not the already diffused priors.

τ sampling
We use a simple random walk Metropolis update for τ ; given that the parameter is constrained to be positive, we propose the new point τ using a symmetric Normal jump on the log-space, centred in zero and with variance σ 2 τ : The MH ratio is then simply where the factorisation over k is a direct consequence of the diagonal scale matrix for the prior over C.

η sampling
Since the prior over the graph G is a product of independent Bernoulli on the edge-set E, the Beta prior on η is conjugate and we can analytically derive its posterior full-conditional distribution.
The update follows then where k,k 1 [E k,k ∈E] essentially counts how many edges are present in the current graph.

G sampling
Once a new junction tree jt has been proposed from the aforementioned Junction Tree sampler, we compute deterministically its set of prime components P , separators S and residual components R as well as its modified perfect elimination ordering ξ . Since we are not able to analytically integrate out the elements of the residual covariance matrix we also need to sample a proposed point σ 2 qt , ρ qt for them (see section 11.8 for details) that is in accordance with the newly proposed graph / junction tree.
Note that now that the likelihood is involved we also need to take into account t c , the temperature associated with the current chain c.
We need to consider as well that given different perfect elimination order, we are effectively changing the conditioning order in our reparametrisation. This makes it hard to track the actual values for σ 2 , ρ but, as the transformation between these and the original covariance matrix C is bijective, structure learning doesn't pose any problem.
By considering that our proposal for ρ and σ 2 (when we sample all the coefficients in particular) is actually equivalent to the standard Gibbs proposal for C given in Equation (??), essentially its posterior full conditional, in the MH ratio r will simplify with the relevant terms in the posterior distributions ratio, thus eliminating the dependence of the acceptance probability on this coefficients.

π sampling
To update each π j we perturb the current value using a Normal random walk in the log-space, similarly to Sec 11.1.
We repeat this procedure for a random subset of all j at each iteration.

o sampling
Once again we use a random walk Metropolis-Hastings perturbation of o k that proceeds in a similarly way as the update for π j ; the only difference being that the proposal is from a truncated Normal N (0, σ 2 o , −∞, − log (o k )) in order to ensure that o k ∈ (0, 1).
The update is therefore The acceptance probability is 11.7 w sampling Finally, the update for the prior variance of the slab component in the β prior is again a Gibbs sampling update. w has an Inverse-Gamma prior distribution and thus is conjugate to the prior over the non-zero β elements.
The update is thus where we update the prior parameters using the number of non-zero regression coefficients and their second moment.
See Sec 3.1 for additional details.

Proposal variances adaptation
For all the Normal Metropolis-Hastings updates ( τ , o k and π j ) we use a simple adaptive scheme ( see Haario, Saksman, and Tamminen (2001), Roberts and Rosenthal (2009) ) that targets the asymptotically optimal proposal distribution. We start by setting a reasonably small value σ 2 * = s (0) * and then every once in a while we update it using where * = (τ, o, π) and s (i) * is the posterior sample variance computed using the MCMC chain up to the current iteration i. To avoid wasting computations we keep track of the first two moments of each parameters as the Markov chain progresses. Finally for o and π a vector of these is obtained for each index k and j and we average them to get the actual proposal variance.
12 Appendix: Naïve Gibbs sampler for the SUR model As mentioned in the main text, a simple expression for the posterior full-conditionals on β, C is available but potentially computationally expensive is little care is given in optimising the matrix operations within.
Moreover the likelihood in the original parameters' space is O ns 3 and not trivially parallelisable.
We give nonetheless the full conditionals below: