<TeXmacs|1.99.7>

<style|<tuple|article|cite-author-year|std-latex>>

<\body>
  <\hide-preamble>
    <new-theorem|defi|Definition>

    <new-theorem|theo|Theorem>

    <new-theorem|prop|Proposition>
  </hide-preamble>

  <abstract-data|<\abstract>
    In computational biology, gene expression datasets are characterized by
    very few individual samples compared to a large number of measurements
    per sample. Thus, it is appealing to merge these datasets in order to
    increase the number of observations and diversify the data, allowing a
    more reliable selection of genes relevant to the biological problem.
    Besides, the increased size of a merged dataset facilitates its
    re-splitting into training and validation sets. This necessitates the
    introduction of the dataset as a random effect. In this context,
    extending a work of <cite-textual|LeeSha>, a method is proposed to select
    relevant variables among tens of thousands in a probit mixed regression
    model, considered as part of a larger hierarchical Bayesian model. Latent
    variables are used to identify subsets of selected variables and the
    grouping (or blocking) technique of <cite-textual|Liu> is combined with a
    Metropolis-within-Gibbs algorithm <cite-parenthesized|MonteCarloStatMethods>.
    The method is applied to a merged dataset made of three individual gene
    expression datasets, in which tens of thousands of measurements are
    available for each of several hundred human breast cancer samples. Even
    for this large dataset comprised of around 20000 predictors, the method
    is shown to be efficient and feasible. As an illustration, it is used to
    select the most important genes that characterize the estrogen receptor
    status of patients with breast cancer.
  </abstract>>

  <\center>
    <with|font-size|1.41|font-shape|small-caps|Bayesian Variable Selection
    for Probit Mixed Models Applied to Gene Selection ><vspace|2fn>

    Meïli Baragatti<rsup|<math|1,2,\<ast\>>>

    <vspace|1fn><with|font-shape|italic|<rsup|<math|1>> Ipsogen SA, Luminy
    Biotech Entreprises, Case 923, Campus de Luminy, 13288 Marseille Cedex 9,
    France.<next-line><rsup|<math|2>> Institut de Mathématiques de Luminy
    (IML), CNRS Marseille, case 907, Campus de Luminy, 13288 Marseille Cedex
    9, France.<next-line><rsup|<math|\<ast\>>> baragatt@iml.univ-mrs.fr,
    baragattimeili@hotmail.com. >
  </center>

  <\center>
    PREPRINT
  </center>

  <vspace|2fn>

  <no-indent>

  <with|font-shape|italic|Keywords>: Bayesian variable selection, random
  effects, probit mixed regression model, grouping technique (or blocking
  technique), Metropolis-within-Gibbs algorithm.

  <section|Introduction>

  Selection of variables is a common problem in many scientific fields, and
  particularly in bioinformatics. Gene expression profiling analyses are
  notorious for generating a very large number of predictors compared to the
  number of observations. Microarray and high-throughput sequencing
  technologies are important for finding genes that are implicated in
  biological processes including development, disease, and response to
  treatment, and they play an important role in the current trend towards
  personalized medicine. Identified genes or sequences can be used to
  classify future observations, influencing the treatment of patients.
  However, these experiments are expensive, and datasets often contain no
  more than 100 specimens. The goal, therefore, is to develop a method allowing
  variable selection from merged microarray datasets, each of them presenting
  its own individual experimental bias.<next-line>

  Several model-based approaches have been developed to select variables. A
  well-known example is SVM (Support Vector Machine) with a recursive feature
  elimination of the genes (<cite-textual|GuyonWeston>).
  <cite-textual|GeorgeMcCulloch> and <cite-textual|ChipmanGeorge> developed
  Bayesian variable selection with the use of Gibbs sampling for linear
  models; a review of this type of selection is provided by
  <cite-textual|OHara>. <cite-textual|TadesseSha2005> proposed a Bayesian
  variable selection in a model-based clustering approach, using a
  multivariate Gaussian mixture model. Recently
  <cite-textual|BottoloRichardson> proposed an algorithm based upon
  Evolutionary Monte Carlo. Binary responses are often encountered in
  biostatistical studies, which calls for probit or logistic models.
  Bayesian variable selection methods have been proposed by
  <cite-textual|LeeSha>, <cite-textual|ShaVannucci>,
  <cite-textual|ZhouWang1>, <cite-textual|ZhouWang2> and
  <cite-textual|YangSong> for probit regression, and by
  <cite-textual|ZhouLiu>, <cite-textual|ChenDey> and <cite-textual|Tuchler>
  for logistic regression. Extension to multi-category data has been done for
  the probit model in <cite-textual|AlbertChib>.<next-line>

  The motivation behind the variable selection method developed in this paper
  is to take the design of the study into account by using random effects in
  a mixed model. It is particularly suited to a merged microarray dataset
  design, and many such datasets are freely available from the NCBI GEO
  website <cite-parenthesized|GEO>. The increased size of a merged dataset
  may provide improved power, and facilitates its re-splitting into training
  and validation sets. In addition a merged set comprises more data diversity
  than an individual set, hence we can avoid bias due to a particular dataset
  as explained by various authors, see <cite-textual|MetaAnalysis> and
  references therein. Among all the methods previously proposed for variable
  selection, that of <cite-textual|Tuchler> considered mixed models. However,
  her approach was specific to logistic models, and the method was applied
  to datasets with only a few dozen predictors, whereas the aim of this paper
  is to select a few predictors among tens of thousands in a Bayesian
  framework. Recently <cite-textual|FruhwirthWagner> considered variable
  selection for random effects, but in this paper we are more interested in
  variable selection for the fixed effects, assuming that random effects are
  present.<next-line>

  The approach developed in this paper extends the approach of
  <cite-textual|GeorgeMcCulloch> and <cite-textual|LeeSha>.
  <cite-textual|GeorgeMcCulloch> introduced latent variables to identify
  subsets of selected variables in a linear model. Then <cite-textual|LeeSha>
  used these latent variables in a probit regression model, which is
  considered as part of a larger hierarchical Bayesian model. Our method
  extends the model used by <cite-textual|LeeSha> by adding random effects.
  We are then confronted with several difficulties. One concerns the
  simulation of conditional distributions, since full conditional
  distributions cannot be directly simulated. A solution is to use the
  grouping (or blocking) technique of <cite-textual|Liu>, and to combine
  Gibbs sampler and Metropolis-Hastings algorithms. Therefore the algorithm
  developed is a combination of the grouping method of Liu and the
  Metropolis-within-Gibbs algorithm <cite-parenthesized|MonteCarloStatMethods>.
  A computational difficulty due to the large number of genes has also been
  overcome by imposing a fixed number of selected genes at each iteration of
  the algorithm. As a consequence the influence of the value chosen for
  <with|font-shape|italic|the variable selection coefficient> of our model is
  reduced. That represents an advantage, since the value of this coefficient
  can impact the results of other methods, see for instance
  <cite-textual|BottoloRichardson> who proposed to put a hyperprior
  distribution on this coefficient.<next-line>

  In this paper, Affymetrix microarray data are used, so predictors (genes)
  will be referred to as \Pprobesets\Q, according to that technology. An
  Affymetrix U133plus2 microarray profiles all of the genes in the human
  genome, many of them more than once, using over 54000 gene-specific
  \Pprobesets\Q. Our Bayesian variable selection method for probit mixed
  models is developed to select a few important probesets, among tens of
  thousands, which are indicative of the activity of the estrogen receptor
  gene in breast cancer. The severity of this common and deadly disease is
  directly related to estrogen receptor (ER) status, which is traditionally
  measured biochemically.<next-line>Three different breast cancer datasets
  were used, all with clinically defined ER status. One microarray experiment
  was done per patient, and tens of thousands of probesets were measured per
  experiment. The dataset is introduced as a random effect in the model, thus
  accounting for the different experimental conditions implicit in each set.
  The three merged datasets were split into training and validation sets, and
  the relevance of the selected probesets was checked by fitting a probit
  mixed model on the training set and predicting the ER status for the
  patients from the validation set and other independent sets available from
  the NCBI GEO website. The stability and the sensitivity of the algorithm
  were also checked by using the relative weighted consistency measure of
  <cite-textual|Somol2008>.<next-line>

  The remainder of the paper is organized as follows. Section 2 describes the
  probit mixed model with latent variables. Section 3 gives the full
  conditional distributions necessary for the Gibbs sampling algorithm,
  outlines the algorithm and proposes a way to construct a classification
  rule using the selected probesets. Section 4 provides some experimental
  results on real datasets, on the relevance of selected probesets, and on
  the sensitivity and the stability of the method. Finally Section 5
  discusses the method.

  <section|Probit mixed model for gene selection>

  <subsection|The hierarchical model>

  Suppose that <math|n> binary events are observed, denoted by the
  <math|Y<rsub|i>>, <math|i=1,\<ldots\>,n>. The set of potential regressors
  is of size <math|p>, with <math|p\<gg\>n>. The goal is to select a subset
  of regressors related to the events <math|Y<rsub|1>,\<ldots\>,Y<rsub|n>>.
  The following probit mixed model is considered,

  <\equation*>
    P*<around|(|Y<rsub|i>=1\<mid\>U,\<beta\>|)>=p<rsub|i>=\<Phi\>*<around|(|X<rsub|i><rprime|'>*\<beta\>+Z<rsub|i><rprime|'>*U|)>,
  </equation*>

  where <math|\<Phi\>> stands for the standard Gaussian cumulative
  distribution function, and <math|X<rsub|i>> and <math|Z<rsub|i>> for the
  fixed and random effect regressors associated with the <math|i<rsup|t*h>>
  observation. The parameter <math|\<beta\>> corresponds to the fixed-effect
  coefficients and the parameter <math|U> to the random-effect coefficients.
  <math|X> and <math|Z> are design matrices associated with the fixed and
  random effects.<next-line>Assuming that we have <math|K> random effects,
  <math|U=<around|(|U<rsub|1><rprime|'>,\<ldots\>,U<rsub|K><rprime|'>|)><rprime|'>>.
  Each <math|U<rsub|l>> is of size <math|q<rsub|l>>, and
  <math|<big|sum><rsub|l=1><rsup|K>q<rsub|l>=q>. The size of <math|\<beta\>>
  is <math|p>. <vspace|0.4cm>

  Following <cite-textual|AlbertChib> and <cite-textual|LeeSha>, a vector of
  latent variables <math|L> is introduced. We write
  <math|L=<around|(|L<rsub|1>,\<ldots\>,L<rsub|n>|)>> and we assume that
  <math|L\<mid\>U,\<beta\>\<sim\>\<cal-N\><rsub|n>*<around|(|X*\<beta\>+Z*U,I<rsub|n>|)>>
  with <math|I<rsub|n>> the identity matrix. We then have

  <\equation>
    <label|LatentVar>Y<rsub|i>=<around*|{|<tabular*|<tformat|<cwith|1|-1|1|1|cell-halign|r>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|l>|<cwith|1|-1|2|2|cell-rborder|0ln>|<table|<row|<cell|1>|<cell|<text|if
    >L<rsub|i>\<gtr\>0>>|<row|<cell|0>|<cell|<text|if
    >L<rsub|i>\<less\>0,>>>>>|\<nobracket\>>
  </equation>

  To perform variable selection, a vector <math|\<gamma\>> of <math|p>
  indicator variables is introduced:

  <\equation*>
    \<gamma\><rsub|j>=<around*|{|<tabular*|<tformat|<cwith|1|-1|1|1|cell-halign|r>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|l>|<cwith|1|-1|2|2|cell-rborder|0ln>|<table|<row|<cell|1>|<cell|<text|if
    >\<beta\><rsub|j>\<neq\>0,<space|1em><text|variable
    >j<text| selected>>>|<row|<cell|0>|<cell|<text|if
    >\<beta\><rsub|j>=0,<space|1em><text|variable >j<text| not
    selected>.>>>>>|\<nobracket\>>
  </equation*>

  Given <math|\<gamma\>>, <math|\<beta\><rsub|\<gamma\>>> is the vector of
  all nonzero elements of <math|\<beta\>>, and <math|X<rsub|\<gamma\>>> is
  the matrix <math|X> with only the columns corresponding to the elements of
  <math|\<gamma\>> that are equal to 1.
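
  As a minimal illustration of the latent-variable formulation, the following Python sketch simulates data from the model with a single random effect (the dataset of origin). It is not part of the original method: the sizes, the seed, the coefficient values and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, q = 200, 50, 3              # observations, candidate predictors, datasets (hypothetical sizes)
X = rng.standard_normal((n, p))   # fixed-effect design matrix
dataset = rng.integers(0, q, n)   # dataset of origin of each observation
Z = np.eye(q)[dataset]            # random-effect design matrix, one indicator column per dataset

gamma = np.zeros(p, dtype=int)
gamma[:3] = 1                     # only the first three predictors are active
beta = np.zeros(p)
beta[:3] = [1.5, -2.0, 1.0]       # nonzero coefficients, i.e. beta_gamma
U = rng.normal(0.0, 0.5, q)       # one random intercept per dataset

# Latent Gaussian vector L ~ N_n(X beta + Z U, I_n); Y_i = 1 exactly when L_i is positive.
L = X @ beta + Z @ U + rng.standard_normal(n)
Y = np.greater(L, 0.0).astype(int)

# X beta equals X_gamma beta_gamma because the dropped coefficients are zero.
X_gamma = X[:, gamma == 1]
beta_gamma = beta[gamma == 1]
```

  Here `X_gamma` and `beta_gamma` are the reduced objects defined above, so the linear predictor can be computed from the selected columns only.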

  <subsection|Prior distributions>

  To complete the hierarchical model, some prior assumptions have to be made
  on <math|U\<mid\>D>, <math|\<beta\><rsub|\<gamma\>>\<mid\>\<gamma\>>,
  <math|\<gamma\>> and <math|D>, where <math|D> is a covariance matrix of
  dimension <math|q>.

  <\itemize>
    <item>If the data supports <math|\<gamma\><rsub|j>=0> over
    <math|\<gamma\><rsub|j>=1>, then the <math|j<rsup|t*h>> variable will not
    be needed in the model and we can let <math|\<beta\><rsub|j>=0>. We then
    focus on the prior distribution of the non null vector
    <math|\<beta\><rsub|\<gamma\>>>. Like <cite-textual|LeeSha>, we take the
    following conventional prior:

    <\equation>
      <label|priorbetagamma>\<beta\><rsub|\<gamma\>>\<mid\>\<gamma\>\<sim\>\<cal-N\><rsub|d>*<around|(|0,c*<around|(|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>|)><rsup|-1>|)>,<space|2em><text|with><space|2em>d=<big|sum><rsub|j=1><rsup|p>\<gamma\><rsub|j>.
    </equation>

    This prior corresponds to the g-prior of <cite-textual|Zellner86>, and
    <math|c> is a positive scale factor specified by the user.
    <cite-textual|BottoloRichardson> called it the variable selection
    coefficient. Several authors discussed the choice of its value, see
    <cite-textual|ChipmanGeorge>, <cite-textual|GeorgeFoster>,
    <cite-textual|ClydeGeorge> and <cite-textual|SmithKohn> among others.
    <cite-textual|Raftery97> used a similar form of prior. In our algorithm
    the value of <math|c> is fixed, but its influence on the results is
    limited (see the discussion).

    <item>The <math|\<gamma\><rsub|j>> are assumed to be independent
    Bernoulli variables, with

    <\equation*>
      P*<around|(|\<gamma\><rsub|j>=1|)>=\<pi\><rsub|j>,<space|2em>0\<leq\>\<pi\><rsub|j>\<leq\>1.
    </equation*>

    We do not want to use prior knowledge to favor any probesets, so we set
    <math|\<pi\><rsub|j>=\<pi\>> for all <math|j=1,\<ldots\>,p>.

    <item>The vector of coefficients associated with the random effects is
    assumed to be Gaussian and centered:

    <\equation*>
      U\<mid\>D\<sim\>\<cal-N\><rsub|q><around|(|0,D|)>.
    </equation*>

    This definition allows three cases to be distinguished:<next-line>

    <em|General case:> No structure is assumed for the variance-covariance
    matrix <math|D>, its prior distribution is an Inverse-Wishart
    <math|\<cal-W\><rsup|-1><around|(|\<Psi\>,m|)>>.<next-line>

    <em|Case of a block-diagonal matrix <math|D>:> The different random
    effects are assumed independent. The vectors of coefficients associated
    with each random effect have Gaussian prior distributions:

    <\equation*>
      U<rsub|l>\<mid\>A<rsub|l>\<sim\>\<cal-N\><rsub|q<rsub|l>><around|(|0,A<rsub|l>|)>,<space|1em>l=1,\<ldots\>,K,
    </equation*>

    where the <math|A<rsub|l>> are symmetric positive definite covariance matrices of dimension
    <math|q<rsub|l>>. <math|D> is a block-diagonal matrix denoted by
    <math|d*i*a*g<around|(|A<rsub|1>,\<ldots\>,A<rsub|K>|)>>. The prior
    distributions for each <math|A<rsub|l>> are Inverse-Wishart
    <math|\<cal-W\><rsup|-1><around|(|\<Psi\>,m|)>>.<next-line>

    <em|Case of a diagonal matrix <math|D>:>
    <math|D=d*i*a*g<around|(|A<rsub|1>,\<ldots\>,A<rsub|K>|)>> where
    <math|A<rsub|l>=\<sigma\><rsub|l><rsup|2>*I<rsub|q<rsub|l>>>,
    <math|l=1,\<ldots\>,K> and <math|I<rsub|q<rsub|l>>> the identity matrix.
    The prior distributions for the <math|\<sigma\><rsub|l><rsup|2>> are then
    Inverse Gamma <math|<with|math-font|cal|I*G><around|(|a,b|)>> (<math|b>
    denotes the scale).
  </itemize>
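
  To make the prior specification concrete, here is a hedged Python sketch that draws once from these priors in the diagonal case with a single random effect. The function name `draw_from_priors` and its arguments are hypothetical, and the sketch assumes that the selected columns of `X` have full rank, which requires fewer selected variables than observations.

```python
import numpy as np

def draw_from_priors(X, Z, pi, c, a, b, rng):
    """Draw (gamma, beta, sigma2, U) from the priors, diagonal case with K = 1."""
    p = X.shape[1]
    q = Z.shape[1]
    gamma = rng.binomial(1, pi, size=p)               # independent Bernoulli(pi) indicators
    X_g = X[:, gamma == 1]
    d = X_g.shape[1]
    beta = np.zeros(p)
    if d:                                             # g-prior N_d(0, c (X_g'X_g)^{-1})
        cov = c * np.linalg.inv(X_g.T @ X_g)
        beta[gamma == 1] = rng.multivariate_normal(np.zeros(d), cov)
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b)  # Inverse-Gamma(a, b), b being the scale
    U = rng.normal(0.0, np.sqrt(sigma2), size=q)      # U given D ~ N_q(0, sigma2 I_q)
    return gamma, beta, sigma2, U
```

  The Inverse-Gamma draw uses the fact that the reciprocal of a Gamma variable with shape `a` and rate `b` is Inverse-Gamma with shape `a` and scale `b`.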

  <section|Bayesian sampler for variable selection>

  <subsection|The conditional distributions>

  The posterior distribution of <math|\<gamma\>> is of particular interest
  since it encapsulates the effectiveness of the different explanatory
  variables in explaining the variation in the responses <math|Y>. The number
  of possible explanatory variables is on the order of tens of thousands, so
  the number of possible <math|\<gamma\>>-vectors is extremely large. The
  idea is to use a Gibbs sampling algorithm to explore this posterior
  distribution and search for high probability <math|\<gamma\>>
  values.<next-line>

  In order to use the classical Gibbs sampler, we must be able to simulate
  from all of the full conditional distributions (simplified by the
  hierarchical structure): <math|f<around|(|L\<mid\>Y,\<beta\>,U|)>>,
  <math|f<around|(|\<beta\>\<mid\>L,U,\<gamma\>|)>>,
  <math|f<around|(|U\<mid\>L,\<beta\>,D|)>>,
  <math|f<around|(|\<gamma\>\<mid\>L,U,\<beta\>|)>> and
  <math|f<around|(|D\<mid\>U|)>>.

  <\itemize>
    <item>Full conditional distribution of <math|L>.

    <eqnarray|<tformat|<table|<row|<cell|<label|fullL>L<rsub|i>\<mid\>\<beta\>,U,Y<rsub|i>=1>|<cell|\<sim\>>|<cell|\<cal-N\>*<around|(|X<rsub|i><rprime|'>*\<beta\>+Z<rsub|i><rprime|'>*U,1|)>*<space|1em><text|left-truncated at 0>>>|<row|<cell|L<rsub|i>\<mid\>\<beta\>,U,Y<rsub|i>=0>|<cell|\<sim\>>|<cell|\<cal-N\>*<around|(|X<rsub|i><rprime|'>*\<beta\>+Z<rsub|i><rprime|'>*U,1|)>*<space|1em><text|right-truncated at 0>.<eq-number>>>>>>

    <item>Full conditional distribution of <math|\<beta\>>.<next-line>Given
    <math|\<gamma\>>, we know which elements of <math|\<beta\>> are not null.
    So we focus on the generation of the non null elements of
    <math|\<beta\><rsub|\<gamma\>>>. Letting
    <math|V<rsub|\<gamma\>>=<frac|c|1+c>*<around|(|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>|)><rsup|-1>>,
    we have

    <eqnarray|<tformat|<table|<row|<cell|\<beta\><rsub|\<gamma\>>\<mid\>L,U,\<gamma\>>|<cell|\<sim\>>|<cell|\<cal-N\><rsub|d>*<around|(|V<rsub|\<gamma\>>*X<rsub|\<gamma\>><rprime|'>*<around|(|L-Z*U|)>,V<rsub|\<gamma\>>|)><space|2em><text|with><space|2em>d=<big|sum><rsub|j=1><rsup|p>\<gamma\><rsub|j>.<eq-number><label|fullbeta>>>>>>

    <item>Full conditional distribution of <math|U>.<next-line>Defining
    <math|W=<around|(|Z<rprime|'>*Z+D<rsup|-1>|)><rsup|-1>>, we have

    <eqnarray|<tformat|<table|<row|<cell|U\<mid\>L,\<beta\>,D\<sim\>\<cal-N\><rsub|q>*<around|(|W*Z<rprime|'>*<around|(|L-X*\<beta\>|)>,W|)>.<eq-number><label|fullU>>>>>>

    <item>Full conditional distribution of <math|\<gamma\>>.

    <eqnarray|<tformat|<table|<row|<cell|<label|fullgamma>f<around|(|\<gamma\>\<mid\>\<beta\><rsub|\<gamma\>>,L,U|)>>|<cell|\<propto\>>|<cell|<around|(|2*\<pi\>|)><rsup|-<frac|d|2>>*exp
    <around*|[|-<frac|1|2>*<around*|(|-L<rprime|'>*X<rsub|\<gamma\>>*\<beta\><rsub|\<gamma\>>-\<beta\><rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>><rprime|'>*L+\<beta\><rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>><rprime|'>*Z*U+U<rprime|'>*Z<rprime|'>*X<rsub|\<gamma\>>*\<beta\><rsub|\<gamma\>>+\<beta\><rsub|\<gamma\>><rprime|'>*V<rsub|\<gamma\>><rsup|-1>*\<beta\><rsub|\<gamma\>>|)>|]>>>|<row|<cell|>|<cell|>|<cell|\<times\>\<mid\>c*<around|(|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>|)><rsup|-1>\<mid\><rsup|-<frac|1|2>><big|prod><rsub|j=1><rsup|p>\<pi\><rsub|j><rsup|\<gamma\><rsub|j>>*<around|(|1-\<pi\><rsub|j>|)><rsup|1-\<gamma\><rsub|j>>.<eq-number>>>>>>

    <item>Full conditional distribution of <math|D>.<next-line><em|General
    case:> The full conditional distribution of <math|D> is an
    Inverse-Wishart:

    <eqnarray|<tformat|<table|<row|<cell|D\<mid\>U>|<cell|\<sim\>>|<cell|\<cal-W\><rsup|-1>*<around|(|U*U<rprime|'>+\<Psi\>,m+1|)>.<eq-number><label|fullD1>>>>>>

    <em|Case of a block-diagonal matrix <math|D>:>
    <math|D=d*i*a*g<around|(|A<rsub|1>,\<ldots\>,A<rsub|K>|)>>. The full
    conditional distribution of <math|A<rsub|l>>
    (<math|\<forall\>l=1,\<ldots\>,K>) is an Inverse-Wishart:

    <eqnarray|<tformat|<table|<row|<cell|A<rsub|l>\<mid\>U<rsub|l>>|<cell|\<sim\>>|<cell|\<cal-W\><rsup|-1>*<around|(|U<rsub|l>*U<rsub|l><rprime|'>+\<Psi\>,m+1|)>.<eq-number><label|fullD2>>>>>>

    <em|Case of a diagonal matrix <math|D>:>
    <math|D=d*i*a*g<around|(|A<rsub|1>,\<ldots\>,A<rsub|K>|)>>, and
    <math|\<forall\>l=1,\<ldots\>,K>, <math|A<rsub|l>=\<sigma\><rsub|l><rsup|2>*I<rsub|q<rsub|l>>>.
    The full conditional distribution of <math|\<sigma\><rsub|l><rsup|2>> is
    an Inverse-Gamma:

    <eqnarray|<tformat|<table|<row|<cell|\<sigma\><rsub|l><rsup|2>\<mid\>U<rsub|l>>|<cell|\<sim\>>|<cell|<with|math-font|cal|I*G>*<around*|(|<frac|q<rsub|l>|2>+a,<around*|(|<frac|1|2>*U<rsub|l><rprime|'>*U<rsub|l>+b|)>|)>.<eq-number><label|fullsigma>>>>>>
  </itemize>
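
  The conditional distributions of the latent vector, of the selected coefficients, of the random effects and of a variance component can all be simulated directly. The following Python sketch shows one possible way to do so. The function names are hypothetical, `X_g` and `beta_g` stand for the selected columns and coefficients, and the truncated Gaussian draw uses inversion of the distribution function, which is one choice among several.

```python
import math
from statistics import NormalDist

import numpy as np

STD = NormalDist()

def sample_L(mean, Y, rng):
    """L_i ~ N(mean_i, 1), left-truncated at 0 if Y_i = 1, right-truncated at 0 otherwise."""
    s = 2 * Y - 1                       # +1 where Y_i = 1, -1 where Y_i = 0
    m = s * mean                        # s_i L_i is N(m_i, 1) restricted to positive values
    # By symmetry s_i L_i = m_i - z_i, where z_i is standard normal restricted to
    # values below m_i, drawn by inversion as Phi^{-1}(u_i Phi(m_i)).
    phi = np.array([0.5 * math.erfc(-v / math.sqrt(2.0)) for v in m])
    prob = np.clip(phi * (1.0 - rng.random(len(m))), 1e-300, 1.0 - 1e-16)
    z = np.array([STD.inv_cdf(float(v)) for v in prob])
    return s * (m - z)

def sample_beta_gamma(X_g, L, Z, U, c, rng):
    """beta_gamma ~ N_d(V X_g'(L - Z U), V) with V = c/(1+c) (X_g'X_g)^{-1}."""
    V = c / (1.0 + c) * np.linalg.inv(X_g.T @ X_g)
    return rng.multivariate_normal(V @ X_g.T @ (L - Z @ U), V)

def sample_U(X_g, beta_g, L, Z, D, rng):
    """U ~ N_q(W Z'(L - X beta), W) with W = (Z'Z + D^{-1})^{-1}."""
    W = np.linalg.inv(Z.T @ Z + np.linalg.inv(D))
    return rng.multivariate_normal(W @ Z.T @ (L - X_g @ beta_g), W)

def sample_sigma2(U_l, a, b, rng):
    """sigma_l^2 ~ IG(q_l/2 + a, U_l'U_l/2 + b), the second argument being a scale."""
    shape = 0.5 * len(U_l) + a
    scale = 0.5 * float(U_l @ U_l) + b
    return scale / rng.gamma(shape)     # scale / Gamma(shape, 1) is IG(shape, scale)
```

  Explicit matrix inverses keep the sketch close to the formulas. For large problems a Cholesky-based solve would usually be preferred.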

  <subsection|Use of the grouping technique>

  The classical Gibbs sampler cannot be used because the full conditional
  distribution of <math|\<gamma\>> cannot be directly simulated (see
  (<reference|fullgamma>)). However, this full conditional distribution can
  be simulated with a Metropolis-Hastings algorithm, and the complete
  algorithm would be a Metropolis-within-Gibbs algorithm.
  <cite-textual|RobertsRosenthal2006> have shown the Harris-recurrence of
  this algorithm, therefore its convergence is guaranteed. But even with a
  Metropolis-Hastings algorithm, the full conditional distribution of
  <math|\<gamma\>> is difficult to simulate, since it depends on the current
  value of <math|\<beta\><rsub|\<gamma\>>>. Thus the acceptance rate for a
  candidate <math|\<gamma\><rsup|\<ast\>>> in the Metropolis-Hastings
  algorithm will depend both on the current
  <math|\<gamma\><rsup|<around|(|t|)>>> and
  <math|\<beta\><rsub|\<gamma\><rsup|<around|(|t|)>>>>, and on the proposed
  <math|\<gamma\><rsup|\<ast\>>> and <math|\<beta\><rsub|\<gamma\><rsup|\<ast\>>>>.
  The problem is that <math|\<beta\><rsub|\<gamma\><rsup|\<ast\>>>> is
  unknown.<next-line>To get around this problem, we combine the
  Metropolis-within-Gibbs algorithm with the grouping (or blocking) technique
  of <cite-textual|Liu>. The idea is to group the parameters
  <math|\<beta\><rsub|\<gamma\>>> and <math|\<gamma\>>, so we will be
  interested in the full conditional distribution of
  <math|<around|(|\<beta\><rsub|\<gamma\>>,\<gamma\>|)>\<mid\>L,U>. This
  technique improves the algorithm and facilitates the convergence of the
  Markov chain, see <cite-textual|Liu> and <cite-textual|vanDyk>. We note
  that the sampler obtained is then a special case of a Partially Collapsed
  Gibbs Sampler, see <cite-textual|vanDyk>.<next-line>As we have

  <\equation*>
    f<around|(|\<beta\><rsub|\<gamma\>>,\<gamma\>\<mid\>L,U|)>\<propto\>f<around|(|\<gamma\>\<mid\>L,U|)>*f<around|(|\<beta\><rsub|\<gamma\>>\<mid\>\<gamma\>,L,U|)>,
  </equation*>

  we remark that simulating from the full conditional distribution
  <math|<around|(|\<beta\><rsub|\<gamma\>>,\<gamma\>|)>\<mid\>L,U> is
  equivalent to simulating <math|\<gamma\>> from its full conditional
  distribution integrated over <math|\<beta\><rsub|\<gamma\>>>, then simulating
  <math|\<beta\><rsub|\<gamma\>>> from its full conditional distribution. The
  \Pintegrated distribution\Q for <math|\<gamma\>> no longer depends on
  the nuisance parameter <math|\<beta\><rsub|\<gamma\>>> and can be easily
  simulated by a Metropolis-Hastings algorithm.<next-line>In each iteration
  of the algorithm, we will take care to simulate <math|\<gamma\>> before
  <math|\<beta\><rsub|\<gamma\>>>, to keep the dependence between
  <math|\<beta\><rsub|\<gamma\>>> and <math|\<gamma\>>, as noted by
  <cite-textual|vanDyk>.<next-line>

  We use <math|f<around|(|L\<mid\>\<gamma\>,U|)>> and Bayes' theorem to
  get the integrated distribution of <math|\<gamma\>\<mid\>L,U> (the target
  distribution):

  <eqnarray|<tformat|<table|<row|<cell|f<around|(|\<gamma\>\<mid\>L,U|)>>|<cell|\<propto\>>|<cell|<around|(|1+c|)><rsup|-<frac|<big|sum>\<gamma\><rsub|i>|2>>*exp
  <around*|[|-<frac|1|2>*<around*|{|<around|(|L-Z*U|)><rprime|'>*<around|(|L-Z*U|)>|\<nobracket\>>|\<nobracket\>><eq-number><label|marggamma>>>|<row|<cell|>|<cell|>|<cell|<around*|\<nobracket\>|<around*|\<nobracket\>|-<frac|c|1+c>*<around|(|L-Z*U|)><rprime|'>*X<rsub|\<gamma\>>*<around|(|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>|)><rsup|-1>*X<rsub|\<gamma\>><rprime|'>*<around|(|L-Z*U|)>|}>|]>\<times\><big|prod><rsub|j=1><rsup|p>\<pi\><rsub|j><rsup|\<gamma\><rsub|j>>*<around|(|1-\<pi\><rsub|j>|)><rsup|1-\<gamma\><rsub|j>>.>>>>>
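
  For numerical work it is convenient to evaluate the integrated distribution of the indicator vector on the log scale. The sketch below is one possible Python implementation, written under the assumption of a common prior inclusion probability `pi`. The function name `log_target_gamma` and the argument `resid`, which stands for the vector L - Z U, are hypothetical.

```python
import numpy as np

def log_target_gamma(gamma, X, resid, c, pi):
    """Log of f(gamma | L, U) up to an additive constant; resid stands for L - Z U."""
    X_g = X[:, gamma == 1]
    d = X_g.shape[1]
    p = len(gamma)
    # A least-squares fit gives X_g (X_g'X_g)^{-1} X_g' resid without an explicit inverse.
    fit = 0.0 if d == 0 else float(
        resid @ (X_g @ np.linalg.lstsq(X_g, resid, rcond=None)[0]))
    quad = float(resid @ resid) - c / (1.0 + c) * fit
    return (-0.5 * d * np.log1p(c) - 0.5 * quad
            + d * np.log(pi) + (p - d) * np.log1p(-pi))
```

  The difference of two such values is the logarithm of the ratio of target densities that enters a Metropolis-Hastings acceptance rate.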

  <subsection|The Metropolis-within-Gibbs sampler modified by the grouping
  technique>

  <subsubsection|A Metropolis-Hastings step to simulate
  <math|\<gamma\>>><label|MHgamma>

  At iteration <math|<around|(|i+1|)>> of the Metropolis-Hastings algorithm,
  a candidate <math|\<gamma\><rsup|\<ast\>>> will be proposed from
  <math|\<gamma\><rsup|<around|(|i|)>>>. We want a symmetric transition
  kernel, to simplify the acceptance rate of the algorithm. The simplest way
  to have a symmetric transition kernel is to propose a
  <math|\<gamma\><rsup|\<ast\>>> which corresponds to
  <math|\<gamma\><rsup|<around|(|i|)>>> in which <math|r> components have
  been randomly changed (see <cite-textual|ChipmanGeorge> and
  <cite-textual|GeorgeMcCulloch97>).<next-line>

  Given the target distribution (<reference|marggamma>), the acceptance rate
  <math|\<rho\>> is then:

  <eqnarray|<tformat|<table|<row|<cell|<label|acceptancerate>\<rho\><around|(|\<gamma\><rsup|<around|(|i|)>>,\<gamma\><rsup|\<ast\>>|)>>|<cell|=>|<cell|m*i*n<around*|{|exp
  <around*|[|<frac|c|2*<around|(|1+c|)>>*<around|(|L-Z*U|)><rprime|'>*<around*|(|X<rsub|\<gamma\><rsup|\<ast\>>>*<around|(|X<rsub|\<gamma\><rsup|\<ast\>>><rprime|'>*X<rsub|\<gamma\><rsup|\<ast\>>>|)><rsup|-1>*X<rsub|\<gamma\><rsup|\<ast\>>><rprime|'>-X<rsub|\<gamma\><rsup|<around|(|i|)>>>*<around|(|X<rsub|\<gamma\><rsup|<around|(|i|)>>><rprime|'>*X<rsub|\<gamma\><rsup|<around|(|i|)>>>|)><rsup|-1>*X<rsub|\<gamma\><rsup|<around|(|i|)>>><rprime|'>|)>|\<nobracket\>>|\<nobracket\>>>>|<row|<cell|>|<cell|>|<cell|<around*|\<nobracket\>|<around*|\<nobracket\>|\<times\><around|(|L-Z*U|)>|]>\<times\><around|(|1+c|)><rsup|<frac|<big|sum><around*|(|\<gamma\><rsub|j><rsup|<around|(|i|)>>-\<gamma\><rsub|j><rsup|\<ast\>>|)>|2>>\<times\><around*|(|<frac|\<pi\>|1-\<pi\>>|)><rsup|<big|sum><rsub|1><rsup|p><around*|(|\<gamma\><rsub|j><rsup|\<ast\>>-\<gamma\><rsub|j><rsup|<around|(|i|)>>|)>>,1|}>.<eq-number>>>>>>

  To facilitate the computation of the algorithm, the proposed
  <math|\<gamma\><rsup|\<ast\>>> still corresponds to
  <math|\<gamma\><rsup|<around|(|i|)>>> for which <math|r> components have
  been changed, but in such a way that the number of components whose values
  are 1 (and so the number of selected variables) is invariant. To do so,
  <math|r/2> components among the 1 values, and <math|r/2> components among
  the 0 values are chosen at random and switched. Proposing such a
  <math|\<gamma\><rsup|\<ast\>>> has several advantages:

  <\itemize>
    <item*|<math|\<bullet\>>>In an iteration of the algorithm, if the
    number of selected variables <math|d> were higher than the number of
    observations <math|n>, the <math|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>>
    matrix would be singular, and the prior distribution of
    <math|\<beta\><rsub|\<gamma\>>> could not be defined as in
    (<reference|priorbetagamma>). An advantage of fixing the number of
    variables to be selected at each iteration is that this number cannot
    increase during a run of the algorithm, and if <math|d> is chosen lower
    than <math|n> this singularity of
    <math|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>> is avoided.

    <item*|<math|\<bullet\>>>The acceptance rate is simplified, since
    <math|<big|sum><around*|(|\<gamma\><rsub|j><rsup|<around|(|i|)>>-\<gamma\><rsub|j><rsup|\<ast\>>|)>=0>.

    <item*|<math|\<bullet\>>>The choice of the value for the variable
    selection coefficient <math|c> used in the prior distribution of
    <math|\<beta\><rsub|\<gamma\>>> has less influence (see the discussion).
  </itemize>
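
  The proposal that keeps the number of selected variables invariant can be sketched in a few lines of Python. The function name `propose_swap` is hypothetical, and the sketch assumes that `r` is even and that enough ones and zeros are available to be switched.

```python
import random

def propose_swap(gamma, r, rng=random):
    """Return a copy of gamma in which r/2 ones and r/2 zeros have been switched."""
    if r % 2:
        raise ValueError("r must be even")
    ones = [j for j, g in enumerate(gamma) if g == 1]
    zeros = [j for j, g in enumerate(gamma) if g == 0]
    half = r // 2
    proposal = list(gamma)
    # Positions are drawn uniformly without replacement within each group.
    for j in rng.sample(ones, half) + rng.sample(zeros, half):
        proposal[j] = 1 - proposal[j]
    return proposal
```

  Because the numbers of ones and zeros are the same before and after the move, the reverse move is proposed with the same probability, so the transition kernel is symmetric.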

  <with|font-series|bold|Remark.> In the method of <cite-textual|LeeSha>, the
  <math|\<gamma\>> vector is generated component by component at each
  iteration, while in our method a Metropolis-Hastings algorithm is used to
  generate it. Using a Metropolis-Hastings algorithm has two advantages: it
  is computationally cheaper than a component-by-component generation when
  the number of variables is very large, and it enables
  us to easily generate a <math|\<gamma\>> vector with an invariant number of
  components whose values are 1.

  <subsubsection|The sampler>

  The Metropolis-within-Gibbs sampler modified by the grouping technique of
  Liu generates a sequence:

  <\equation*>
    \<gamma\><rsup|<around|(|1|)>>,\<beta\><rsub|\<gamma\>><rsup|<around|(|1|)>>,D<rsup|<around|(|1|)>>,L<rsup|<around|(|1|)>>,U<rsup|<around|(|1|)>>,\<ldots\>*\<ldots\>,\<gamma\><rsup|<around|(|b+m|)>>,\<beta\><rsub|\<gamma\>><rsup|<around|(|b+m|)>>,D<rsup|<around|(|b+m|)>>,L<rsup|<around|(|b+m|)>>,U<rsup|<around|(|b+m|)>>.
  </equation*>

  The sequence of the <math|\<gamma\><rsup|<around|(|t|)>>>, which is of
  interest for the variable selection problem, is embedded in this \PGibbs
  sequence\Q. To generate it, at each iteration <math|\<gamma\>> is simulated
  from its integrated distribution and <math|\<beta\><rsub|\<gamma\>>,L,U>
  and <math|D> are simulated from their full conditional distributions.

  <\quote-env>
    <with|font-series|bold|Algorithm:><next-line>Starting with initial values
    <math|\<gamma\><rsup|<around|(|0|)>>,\<beta\><rsup|<around|(|0|)>>,D<rsup|<around|(|0|)>>,L<rsup|<around|(|0|)>>,U<rsup|<around|(|0|)>>>.
    At iteration <math|t+1>:

    <\enumerate>
      <item>Simulate <math|\<gamma\><rsup|<around|(|t+1|)>>> from
      <math|f<around|(|\<gamma\>\<mid\>L<rsup|<around|(|t|)>>,U<rsup|<around|(|t|)>>|)>>
      (see (<reference|marggamma>)), using the Metropolis-Hastings step. Given
      <math|\<gamma\><rsup|<around|(|t|)>>,L<rsup|<around|(|t|)>>,U<rsup|<around|(|t|)>>>,
      <math|k> iterations of the Metropolis-Hastings algorithm are performed
      (<math|k> arbitrarily fixed). The Metropolis-Hastings step begins with
      <math|\<gamma\><rsup|<around|(|t|)>>> as an initial value. Then at each
      iteration <math|i+1>:

      <\enumerate>
        <item>Generate the <math|\<gamma\><rsup|\<ast\>>> candidate, by
        randomly switching <math|r/2> components among the 1 values, and
        <math|r/2> components among the 0 values.

        <item>Take

        <\equation*>
          \<gamma\><rsup|<around|(|i+1|)>>=<around*|{|<tabular*|<tformat|<cwith|1|-1|1|1|cell-halign|r>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|l>|<cwith|1|-1|2|2|cell-rborder|0ln>|<table|<row|<cell|\<gamma\><rsup|\<ast\>>>|<cell|<text|with
          probability><space|2em>\<rho\><around|(|\<gamma\><rsup|<around|(|i|)>>,\<gamma\><rsup|\<ast\>>|)><space|2em><text|see
          (<reference|acceptancerate>)>>>|<row|<cell|\<gamma\><rsup|<around|(|i|)>>>|<cell|<text|with
          probability><space|2em>1-\<rho\><around|(|\<gamma\><rsup|<around|(|i|)>>,\<gamma\><rsup|\<ast\>>|)>>>>>>|\<nobracket\>>
        </equation*>
      </enumerate>

      <math|\<gamma\><rsup|<around|(|t+1|)>>> will be the
      <math|\<gamma\><rsup|<around|(|k|)>>> obtained at the
      <math|k<rsup|t*h>> iteration of the Metropolis-Hastings algorithm.

      <item>Simulate <math|\<beta\><rsub|\<gamma\>><rsup|<around|(|t+1|)>>>
      from <math|f<around|(|\<beta\><rsub|\<gamma\>>\<mid\>L<rsup|<around|(|t|)>>,U<rsup|<around|(|t|)>>,\<gamma\><rsup|<around|(|t+1|)>>|)>>
      (see (<reference|fullbeta>)).

      <item>Simulate <math|D<rsup|<around|(|t+1|)>>> from
      <math|f<around|(|D\<mid\>U<rsup|<around|(|t|)>>|)>> (see
      (<reference|fullD1>), (<reference|fullD2>) or (<reference|fullsigma>)).

      <item>Simulate <math|L<rsup|<around|(|t+1|)>>> from
      <math|f<around|(|L\<mid\>Y,\<beta\><rsup|<around|(|t+1|)>>,U<rsup|<around|(|t|)>>|)>>
      (see (<reference|fullL>)).

      <item>Simulate <math|U<rsup|<around|(|t+1|)>>> from
      <math|f<around|(|U\<mid\>L<rsup|<around|(|t+1|)>>,\<beta\><rsup|<around|(|t+1|)>>,D<rsup|<around|(|t+1|)>>|)>>
      (see (<reference|fullU>)).
    </enumerate>
  </quote-env>
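
  As an illustration, step 1 of the algorithm above could be sketched as
  follows in Python. This is a minimal sketch under stated assumptions, not
  the implementation used for the results: the function names are
  hypothetical, <with|font-family|tt|log_target> is a user-supplied function
  standing in for the logarithm of the marginal density of
  <math|\<gamma\>> given <math|L> and <math|U>, and the acceptance
  probability is assumed to reduce to the usual ratio of target densities,
  which holds because the swap proposal is symmetric when the number of
  selected variables is fixed.

```python
import math
import random


def propose_swap(gamma, r, rng):
    """Switch r/2 components among the ones and r/2 among the zeros."""
    ones = [j for j, g in enumerate(gamma) if g == 1]
    zeros = [j for j, g in enumerate(gamma) if g == 0]
    half = r // 2
    candidate = list(gamma)
    for j in rng.sample(ones, half) + rng.sample(zeros, half):
        candidate[j] = 1 - candidate[j]
    return candidate


def mh_gamma(gamma, log_target, k, r, rng):
    """Run k Metropolis-Hastings iterations from gamma; return the k-th state."""
    current = list(gamma)
    current_lp = log_target(current)
    for _ in range(k):
        candidate = propose_swap(current, r, rng)
        candidate_lp = log_target(candidate)
        # Assumed acceptance probability for a symmetric proposal:
        # rho = min(1, target(candidate) / target(current)).
        rho = math.exp(min(0.0, candidate_lp - current_lp))
        if rng.random() < rho:
            current, current_lp = candidate, candidate_lp
    return current
```

  With the settings reported later for the real data, the call would take
  the form <with|font-family|tt|mh_gamma(gamma_t, log_target, k=500, r=10,
  rng=random.Random(1))>. Since every candidate keeps the same number of
  ones, the number of selected variables never changes along the chain.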

  We use the fact that <math|X*\<beta\>=X<rsub|\<gamma\>>*\<beta\><rsub|\<gamma\>>>
  and that <math|\<beta\>> can be obtained from <math|\<gamma\>> and
  <math|\<beta\><rsub|\<gamma\>>>. The number of iterations is <math|b+m>,
  where <math|b> corresponds to the burn-in period and <math|m> to the
  observations from the posterior distributions.

  In our application, we are not concerned with the strict convergence of the
  sampler. The aim is to find relevant variables explaining the
  response, and to obtain good predictions. Hence we only need to carry out
  stability and sensitivity studies to check that the training set and the
  choices of the hyperparameters are not too influential, and to check the
  biological relevance of the selected variables.

  <subsubsection|The selected probesets>

  For selection of variables, the sequence
  <math|<around|{|\<gamma\><rsup|<around|(|t|)>>=<around|(|\<gamma\><rsub|1><rsup|<around|(|t|)>>,\<ldots\>,\<gamma\><rsub|p><rsup|<around|(|t|)>>|)>,t=b+1,\<ldots\>,b+m|}>>
  is used. The most relevant variables for the regression model are those
  which are supported by the data and prior information. Thus they are those
  corresponding to the <math|\<gamma\>> components with higher posterior
  probabilities, and can be identified as the <math|\<gamma\>> components
  that are most often equal to 1.
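
  The counting described above can be sketched in a few lines of Python.
  This is only an illustration: the function names are hypothetical, and
  <with|font-family|tt|gamma_draws> is assumed to be a list holding one 0/1
  vector per iteration of the sampler, in order.

```python
def selection_counts(gamma_draws, burn_in):
    """Count, for each variable, how often it equals 1 after the burn-in."""
    kept = gamma_draws[burn_in:]
    p = len(kept[0])
    return [sum(draw[j] for draw in kept) for j in range(p)]


def rank_variables(counts):
    """Return variable indices sorted from most to least often selected."""
    return sorted(range(len(counts)), key=lambda j: -counts[j])
```

  Dividing the counts by <math|m> gives an empirical estimate of the
  posterior probability that each component of <math|\<gamma\>> equals 1.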

  <subsection|Classification and prediction>

  Once a set of relevant variables has been selected, it can be used to fit
  a probit mixed model in a classical way and to classify future
  observations. However, if more variables than necessary to fit a probit
  mixed model have been selected in the Bayesian selection step, a second
  selection has to be performed on them in order to build a reliable probit
  mixed model. This second selection is performed on the training set using
  standard selection tools such as AIC, BIC or Bayes factors. The final probit
  mixed model can be tested on the validation set. Moreover, the variables
  selected in the Bayesian selection step can be used in other classification
  methods, such as Support Vector Machines (but random effects are not taken
  into account).

  <section|Experimental results>

  <subsection|Application to the ER status of patients with breast cancer>

  <subsubsection|Description of the datasets>

  Three different datasets were used: one private dataset from the Institut
  Paoli Calmettes (Marseille, France), consisting of 151 samples, and two
  datasets freely available from the NCBI GEO public website
  <cite-parenthesized|GEO>: accession numbers GSE2109 (310 samples) and
  GSE5460 (124 samples). Each dataset was treated for background noise and
  normalized with respect to a reference distribution by the RMA procedure
  <cite-parenthesized|Irizarry>. Each dataset was split into a training set
  and a validation set having the same proportions of ER positive and ER
  negative observations. Then the three training datasets were merged on one
  side (497 patients) and the three validation sets were merged on the other
  side (88 patients).<next-line>For each patient, more than 54000 probesets
  were available. Two filters were applied to these probesets: only
  the probesets expressed strongly enough to be distinguished from noise,
  and which could not be considered invariant, were kept,
  resulting in 19384 probesets. The goal was to select only a few probesets
  which are related to the ER status of the patients, by taking into account
  the different experimental conditions between the different merged
  datasets. <vspace|0.4cm>
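
  For illustration, a split that preserves the proportions of ER positive
  and ER negative observations within one dataset could be sketched as
  follows. The function name and the
  <with|font-family|tt|validation_fraction> argument are hypothetical; the
  fraction actually used for each dataset is not restated here.

```python
import random


def stratified_split(labels, validation_fraction, rng):
    """Split indices so that each class keeps about the same proportion."""
    train, valid = [], []
    for value in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)
        n_valid = round(validation_fraction * len(idx))
        valid.extend(idx[:n_valid])
        train.extend(idx[n_valid:])
    return sorted(train), sorted(valid)
```

  Applying such a function to each dataset separately and then concatenating
  the three training parts and the three validation parts reproduces the
  merging scheme described above.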

  In this illustration, there are thousands of fixed regressors corresponding
  to the expression measurements of probesets, and only one random effect,
  which corresponds to the different datasets. <math|X<rsub|i*j>> corresponds
  to the measurement of the expression level of the <math|j>-th
  probeset for the <math|i>-th patient, and <math|Z<rsub|i*l>=1> if
  the <math|i>-th patient is from the <math|l>-th dataset, 0
  otherwise.
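
  As a small illustration, the indicator matrix <math|Z> can be built from
  the dataset label of each patient. The function name is hypothetical and
  the ordering of the columns (sorted labels) is an arbitrary choice of this
  sketch.

```python
def indicator_matrix(dataset_labels):
    """Return Z with Z[i][l] = 1 if patient i belongs to dataset l, else 0."""
    levels = sorted(set(dataset_labels))
    column = {level: l for l, level in enumerate(levels)}
    Z = [[0] * len(levels) for _ in dataset_labels]
    for i, label in enumerate(dataset_labels):
        Z[i][column[label]] = 1
    return Z, levels
```

  Each row of <math|Z> contains exactly one 1, so the product of a row with
  the vector <math|U> simply picks out the random effect of that patient's
  dataset.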

  <subsubsection|Prior settings for the algorithm>

  <\itemize>
    <item>Following the recommendations of <cite-textual|SmithKohn>, a value
    of <math|c=50> was chosen for the variable selection coefficient used in
    the prior distribution of <math|\<beta\>>.

    <item>Thirty probesets were selected at each iteration of the Gibbs
    sampler, when <math|\<gamma\>> is generated; <math|r=10> of them were
    changed at each iteration of the Metropolis-Hastings step (5 zeros and 5
    ones).

    <item>The random effect corresponds to the dataset, and the three
    datasets are considered independent: they were generated in different
    countries, by different teams, using different equipment and different
    patients. Therefore the variance-covariance matrix of the random effect
    <math|D> was a <math|3\<times\>3> diagonal matrix with
    <math|A<rsub|1>=\<sigma\><rsub|1><rsup|2>*I<rsub|3>>.
    <cite-textual|Gelman2006> noted that an inverse-gamma prior should not be
    too non-informative, otherwise serious problems can arise. Given our
    data, we knew that <math|\<sigma\><rsub|1><rsup|2>> is probably not too
    high, and a <math|<with|math-font|cal|I*G><around|(|2,3|)>> seemed
    reasonable for the prior distribution of
    <math|\<sigma\><rsub|1><rsup|2>>.

    <item>For the Metropolis-within-Gibbs sampler modified by the grouping
    technique, 60000 iterations were computed, among which 30000 were burn-in
    iterations. For the Metropolis-Hastings step in this sampler, 500
    iterations were computed, and the simulated <math|\<gamma\>> was the one
    corresponding to the 500th iteration.
  </itemize>

  <subsubsection|Results and predictions><label|ResultsPredictions>

  We performed a first selection of variables by keeping the top-ranked
  probesets, i.e. those selected most often. A boxplot can
  help, see Figure <reference|Boxplot:boxplotrealdata>. Forty probesets were
  selected at least once in the 30000 post-burn-in iterations of the
  simulated Markov chain for <math|\<gamma\>>. Twenty-three probesets were
  selected in all 30000 iterations, and thirty were selected in at least
  20000 iterations. There is a gap between the probesets selected in at least
  20000 iterations and the others, so the first selection consists of these
  thirty probesets.<next-line>A second
  selection was performed on the thirty probesets from the first selection,
  to build a reliable probit mixed model. This second selection was performed
  on the training set, using a stepwise selection (with AIC and BIC criteria)
  and the classification performance of the model on the validation set. Five
  probesets were kept: Affymetrix symbols <with|font-family|tt|228241_at>,
  <with|font-family|tt|205862_at>, <with|font-family|tt|202376_at>,
  <with|font-family|tt|216222_s_at> and <with|font-family|tt|1568760_at>. See
  Table <reference|Tab:tab1> for the associated gene symbols and
  coefficients. The estimated random effects of this final model were
  reasonable: <math|-0.284> for the dataset from the Institut Paoli
  Calmettes, <math|0.199> for the GSE2109 dataset and <math|0.087> for the
  GSE5460 dataset.

  <\big-figure>
    <image|boxplotrealdata.eps||||>

    <label|Boxplot:boxplotrealdata>
  </big-figure|Boxplot of the number of selections of a probeset after the
  burn-in period, for the real datasets example. Forty probesets were
  selected at least once, all of the other probesets were never selected. A
  point represents a probeset (or several probesets if they have been
  selected the same number of times).>

  <\big-table>
    <tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|1ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|7|7|1|-1|cell-bborder|1ln>|<table|<row|<cell|Probeset>|<cell|Gene>|<cell|Coefficient>|<cell|p-value>>|<row|<cell|Intercept>|<cell|>|<cell|-9.12074>|<cell|1.92e-05>>|<row|<cell|<with|font-family|tt|228241_at>>|<cell|AGR3>|<cell|0.45046>|<cell|1.12e-15>>|<row|<cell|<with|font-family|tt|205862_at>>|<cell|GREB1>|<cell|0.77639>|<cell|4.18e-08>>|<row|<cell|<with|font-family|tt|202376_at>>|<cell|SERPINA3>|<cell|0.37965>|<cell|0.000149>>|<row|<cell|<with|font-family|tt|216222_s_at>>|<cell|MYO10>|<cell|-0.63551>|<cell|0.004967>>|<row|<cell|<with|font-family|tt|1568760_at>>|<cell|MYH11>|<cell|0.42742>|<cell|0.050219>>>>><label|Tab:tab1>
  </big-table|Probesets selected in the final model and associated
  coefficients.>

  <FloatBarrier>

  Using this 5-probeset model, two methods were used to predict the ER status
  of the patients in the validation set:

  <\enumerate>
    <item>Using knowledge of the dataset to which each patient belonged and
    using the estimated random effects coefficients.

    <item>Ignoring the estimated random effects coefficients, in order to
    mimic a real-life scenario in which a patient comes from an
    unknown dataset.
  </enumerate>

  The patients were predicted positive if their probability of being positive
  was higher than 0.5 and negative if it was lower than 0.5. The two methods
  gave the same predictions, which were very good: a specificity of 1 and
  a sensitivity of 0.98 (1 wrong prediction among 88), see Figure
  <reference|Fig:histoPosNegEn>.<next-line>
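
  The two prediction methods and the resulting error rates can be sketched
  as follows in Python. This is a hedged illustration with hypothetical
  function names: the first method passes the estimated random effect of the
  patient's dataset as <with|font-family|tt|random_effect>, while the second
  is assumed here to correspond to leaving it at zero.

```python
import math


def probit_probability(x, beta, intercept, random_effect=0.0):
    """P(Y = 1) = Phi(intercept + x'beta + random_effect)."""
    eta = intercept + sum(b * v for b, v in zip(beta, x)) + random_effect
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def predict_status(probability, threshold=0.5):
    """Return 1 (ER positive) above the threshold, 0 (ER negative) otherwise."""
    return 1 if probability > threshold else 0


def sensitivity_specificity(y_true, y_pred):
    """Assumes both classes are present in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

  The standard normal distribution function is written with the error
  function so that the sketch needs only the standard library.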

  <\big-figure>
    <image|histoPosNegEn.eps||||>

    <label|Fig:histoPosNegEn>
  </big-figure|Histogram of probabilities to be ER positive given by the
  final model, for patients from the validation set.>

  <FloatBarrier> <with|font-series|bold|Remark.> In biomedical studies, where
  continuous variables are often reclassified as binary, it is common to
  define an \Pundetermined zone\Q of probabilities for which no prediction
  is given. Indeed, this is sometimes better than giving a wrong prediction,
  because these predictions determine treatments. With an \Pundetermined
  zone\Q between 10% and 90% probability of being positive, no false
  prediction remained, and 10 patients (11.4%) were left undetermined
  (estimated random effects coefficients not used).<next-line>
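
  A three-way decision rule of this kind can be written in a few lines. The
  function name is hypothetical, and the treatment of probabilities exactly
  equal to a bound is an arbitrary choice of this sketch.

```python
def classify_with_undetermined_zone(probability, lower=0.10, upper=0.90):
    """Return a prediction only when the probability is outside the zone."""
    if probability >= upper:
        return "positive"
    if probability <= lower:
        return "negative"
    return "undetermined"
```

  Widening the zone trades a larger share of undetermined patients for fewer
  wrong predictions.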

  As a final test of our model, two more independent datasets were brought in
  from the NCBI GEO website: the GEO series GSE6532 and the GEO series
  GSE12763. The random effects associated with these datasets were entirely
  unknown, simulating an even more realistic case of prediction for a patient
  coming from an unknown dataset. Once again the results were very good:
  only 1 wrong prediction among 29 for the GSE12763 dataset, and no wrong
  predictions among 86 for the GSE6532 dataset.

  <subsection|Sensitivity and stability studies>

  The sensitivity and the stability of the algorithm were assessed by using
  the relative weighted consistency measure of <cite-textual|Somol2008>,
  denoted by <math|C*W<rsub|r*e*l>>. It is a measure evaluating how much
  subsets of selected variables for several runs overlap, and it shows the
  relative amount of randomness inherent in the concrete variable selection
  process. It takes values between 0 and 1, where 0 represents the outcome of
  completely random occurrence of variables in the selected subsets and 1
  indicates the most stable variable selection outcome
  possible.<next-line>Stability is defined as sensitivity to variations in
  the training set. Referring to our breast cancer dataset, 4000 probesets
  were randomly chosen from among the 19384 originally available. Since the
  aim here was only to check the sensitivity and stability of the method,
  these 4000 were not chosen in relation to the ER status.<next-line>Several
  runs of the algorithm were performed, and are reported in Table
  <reference|Tab:tab2>. Concerning the stability, the algorithm was run on
  three different training sets of 497 microarrays (among 585), using the
  same prior values for the hyperparameters. Concerning the sensitivity, the
  algorithm was run on the same training set with different values of
  <math|c>, different prior distributions for
  <math|\<sigma\><rsub|1><rsup|2>>, different numbers of probesets to be
  selected at each iteration of the algorithm and different numbers of
  iterations. For the prior distributions for
  <math|\<sigma\><rsub|1><rsup|2>>, we chose a
  <math|<with|math-font|cal|I*G><around|(|2,3|)>> which seemed reasonable
  given our data, a <math|<with|math-font|cal|I*G><around|(|2,5|)>> to have a
  prior favoring higher values compared to the first one, a
  <math|<with|math-font|cal|I*G><around|(|3,1|)>> to favor lower values, and
  a <math|<with|math-font|cal|I*G><around|(|1,1|)>> to have a non-informative
  prior whose parameters are not too small, which avoids the problems noted by
  <cite-textual|Gelman2006>.<next-line>

  For each run, a reasonable number of probesets could be easily selected.
  Indeed, two to ten probesets were selected much more often than the others,
  see Figure <reference|Boxplot:boxplotsensitivity> (two to four probesets
  were selected for most of the runs). Hence there is no need to perform a
  second selection, as in section <reference|ResultsPredictions>. To compare
  the results of the different runs, the relative weighted consistency
  measure of Somol and Novovicova <math|C*W<rsub|r*e*l>> was used.
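
  To make the measure concrete, the following Python sketch computes a
  relative weighted consistency from the subsets selected in several runs.
  It follows our reading of the definition: the weighted consistency is the
  sum over selected variables of
  <math|<around|(|F<rsub|f>/N|)>*<around|(|F<rsub|f>-1|)>/<around|(|n-1|)>>,
  where <math|F<rsub|f>> is the number of runs selecting variable <math|f>,
  <math|N> the total number of selections and <math|n> the number of runs,
  and it is then rescaled between its minimal and maximal attainable values.
  The function name is hypothetical, and the fallback used when the two
  bounds coincide is an assumption of this sketch.

```python
from collections import Counter


def relative_weighted_consistency(subsets, n_features):
    """Rescaled weighted consistency of the subsets selected in several runs."""
    n = len(subsets)
    counts = Counter(f for subset in subsets for f in set(subset))
    N = sum(counts.values())
    if n < 2 or N == 0:
        raise ValueError("need at least two runs and one selected variable")
    cw = sum((F / N) * (F - 1) / (n - 1) for F in counts.values())
    # Bounds for N selections spread over n_features variables and n runs.
    D = N % n_features
    H = N % n
    cw_min = (N * N - n_features * (N - D) - D * D) / (n_features * N * (n - 1))
    cw_max = (H * H + N * (n - 1) - H * n) / (N * (n - 1))
    if cw_max == cw_min:
        return cw  # assumed fallback when no rescaling is possible
    return (cw - cw_min) / (cw_max - cw_min)
```

  With this definition, runs that always select the same variables give a
  value of 1, and runs that never share a variable give a value of 0.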

  <\big-figure>
    <image|boxplotsensitivity.eps||||>

    <label|Boxplot:boxplotsensitivity>
  </big-figure|Boxplot of the number of selections of a probeset after the
  burn-in period, for two runs of the sensitivity analysis. A point
  represents a probeset (or several probesets if they have been selected the
  same number of times). The left boxplot corresponds to the run with
  <math|c=1000>: there is a gap between the two probesets selected in more
  than 4000 iterations and the others, hence we selected these two probesets.
  The right boxplot corresponds to the run with
  <math|\<sigma\><rsub|1><rsup|2>\<sim\><with|math-font|cal|I*G><around|(|2,5|)>>:
  there is a gap between the four probesets selected in more than 1500
  iterations and the others, hence we selected these four probesets.>

  <big-table|<space|-2cm><tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|1ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|5|5|cell-halign|c>|<cwith|1|-1|5|5|cell-rborder|1ln>|<cwith|1|-1|6|6|cell-halign|c>|<cwith|1|-1|6|6|cell-rborder|1ln>|<cwith|1|-1|7|7|cell-halign|c>|<cwith|1|-1|7|7|cell-rborder|1ln>|<cwith|1|-1|8|8|cell-halign|c>|<cwith|1|-1|8|8|cell-rborder|1ln>|<cwith|1|-1|9|9|cell-halign|c>|<cwith|1|-1|9|9|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|6|6|3|3|cell-row-span|-3>|<cwith|6|6|3|3|cell-valign|c>|<cwith|6|6|4|4|cell-row-span|-3>|<cwith|6|6|4|4|cell-valign|c>|<cwith|6|6|5|5|cell-row-span|-3>|<cwith|6|6|5|5|cell-valign|c>|<cwith|6|6|6|6|cell-row-span|-3>|<cwith|6|6|6|6|cell-valign|c>|<cwith|6|6|7|7|cell-row-span|-3>|<cwith|6|6|7|7|cell-valign|c>|<cwith|6|6|8|8|cell-row-span|-3>|<cwith|6|6|8|8|cell-valign|c>|<cwith|6|6|1|-1|cell-bborder|1ln>|<cwith|10|10|2|2|cell-row-span|-4>|<cwith|10|10|2|2|cell-valign|c>|<cwith|10|10|4|4|cell-row-span|-4>|<cwith|10|10|4|4|cell-valign|c>|<cwith|10|10|5|5|cell-row-span|-4>|<cwith|10|10|5|5|cell-valign|c>|<cwith|10|10|6|6|cell-row-span|-4>|<cwith|10|10|6|6|cell-valign|c>|<cwith|10|10|7|7|cell-row-span|-4>|<cwith|10|10|7|7|cell-valign|c>|<cwith|10|10|8|8|cell-row-span|-4>|<cwith|10|10|8|8|cell-valign|c>|<cwith|10|10|1|-1|cell-bborder|1ln>|<cwith|14|14|2|2|cell-row-span|-4>|<cwith|14|14|2|2|cell-valign|c>|<cwith|14|14|3|3|cell-row-span|-4>|<cwith|14|14|3|3|cell-valign|c>|<cwith|14|14|5|5|cell-row-span|-4>|<cwith|14|14|5|5|cell-valign|c>|<cwith|14|14|6|6|cell-row-span|-4>|<cwith|14|14|6|6|cell-valign|c>|<cwith|14|14|7|7|cell-row-span|-4>|<cwith|14|14|7|7|cell-valign|c>|<cwith|14|14|8|8|cell-row-span|-4>|<cwith|14|14|8|8|
cell-valign|c>|<cwith|14|14|1|-1|cell-bborder|1ln>|<cwith|17|17|2|2|cell-row-span|-3>|<cwith|17|17|2|2|cell-valign|c>|<cwith|17|17|3|3|cell-row-span|-3>|<cwith|17|17|3|3|cell-valign|c>|<cwith|17|17|4|4|cell-row-span|-3>|<cwith|17|17|4|4|cell-valign|c>|<cwith|17|17|7|7|cell-row-span|-3>|<cwith|17|17|7|7|cell-valign|c>|<cwith|17|17|8|8|cell-row-span|-3>|<cwith|17|17|8|8|cell-valign|c>|<cwith|17|17|1|-1|cell-bborder|1ln>|<cwith|18|18|1|-1|cell-bborder|1ln>|<table|<row|<cell|<rowcolor|lightgray>>|<cell|>|<cell|Value>|<cell|Prior>|<cell|Nb
  probesets to be>|<cell|Nb probesets to be>|<cell|Iterations>|<cell|burn-in>|<cell|>>|<row|<cell|<rowcolor|lightgray>
  Simu>|<cell|Dataset>|<cell|of>|<cell|for>|<cell|selected at
  each>|<cell|changed at each>|<cell|for the>|<cell|for
  the>|<cell|<math|C*W<rsub|r*e*l>>>>|<row|<cell|<rowcolor|lightgray>>|<cell|>|<cell|<math|c>>|<cell|<math|\<sigma\><rsup|2>>>|<cell|iteration
  of the GS>|<cell|iteration of the MH>|<cell|algo>|<cell|algo>|<cell|>>|<row|<cell|1>|<cell|1>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|2>|<cell|2>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|3>|<cell|3>|<cell|50>|<cell|<math|I*G<around|(|2,3|)>>>|<cell|15>|<cell|6>|<cell|12000>|<cell|6000>|<cell|0.25>>|<row|<cell|4>|<cell|>|<cell|10>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|5>|<cell|>|<cell|50>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|6>|<cell|>|<cell|100>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|7>|<cell|1>|<cell|1000>|<cell|<math|I*G<around|(|2,3|)>>>|<cell|15>|<cell|6>|<cell|12000>|<cell|6000>|<cell|0.375>>|<row|<cell|8>|<cell|>|<cell|>|<cell|<math|I*G<around|(|2,3|)>>>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|9>|<cell|>|<cell|>|<cell|<math|I*G<around|(|1,1|)>>>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|10>|<cell|>|<cell|>|<cell|<math|I*G<around|(|1,1|)>>>|<cell|>|<cell|>|<cell|>|<cell|>|<cell|>>|<row|<cell|11>|<cell|1>|<cell|50>|<cell|<math|I*G<around|(|2,5|)>>>|<cell|15>|<cell|6>|<cell|12000>|<cell|6000>|<cell|0.292>>|<row|<cell|12>|<cell|>|<cell|>|<cell|>|<cell|15>|<cell|6>|<cell|>|<cell|>|<cell|>>|<row|<cell|13>|<cell|>|<cell|>|<cell|>|<cell|5>|<cell|2>|<cell|>|<cell|>|<cell|>>|<row|<cell|14>|<cell|1>|<cell|50>|<cell|<math|I*G<around|(|2,3|)>>>|<cell|30>|<cell|10>|<cell|12000>|<cell|6000>|<cell|0.5>>|<row|<cell|15>|<cell|1>|<cell|50>|<cell|<math|I*G<around|(|2,3|)>>>|<cell|15>|<cell|6>|<cell|30000>|<cell|15000>|<cell|>>>>><label|Tab:tab2>|Parameters
  of the runs for the stability and sensitivity study and associated relative
  weighted consistency measure of Somol and Novovicova
  <math|C*W<rsub|r*e*l>>. For the Metropolis-Hastings step, 500 iterations
  are always computed.>

  <FloatBarrier>

  Using the results of the 15 runs together, <math|C*W<rsub|r*e*l>=0.398>.
  Subsets of selected variables for the different runs overlapped: among the
  15 runs the probesets <with|font-family|tt|215552_s_at>,
  <with|font-family|tt|209603_at> and <with|font-family|tt|209602_s_at> were
  kept in 12, 12 and 6 runs respectively. Apparently the prior for
  <math|\<sigma\><rsup|2>> (<math|C*W<rsub|r*e*l>=0.292>) has more impact
  than the number of probesets to be selected at each iteration
  (<math|C*W<rsub|r*e*l>=0.5>). We note that the number of probesets to be
  selected at each iteration of the algorithm and the number of iterations do
  not seem to modify the number of probesets selected more often than the
  others during the run.<next-line>

  These results are satisfactory and the method appears relatively stable,
  for two reasons. First, the random selection of 4000 probesets carries a
  risk of destabilizing the results, since these 4000 are not necessarily
  those which are most indicative of ER status. Secondly, several probesets
  can represent the same gene, and different genes can be involved in the
  same biological pathway. Thus, it is possible that subsets of probesets are
  more similar than they appear, and therefore that <math|C*W<rsub|r*e*l>> is
  underestimated. For example, the probesets <with|font-family|tt|209603_at>
  and <with|font-family|tt|209602_s_at> mentioned above both represent the
  gene GATA3. Finally, these simulations indicate that no single parameter
  introduces much more sensitivity than the others.<next-line>

  <subsection|Comparison with other methods>

  We compared the performance of our method with the performances of other
  methods which do not take into account random effects: we considered the
  model of <cite-textual|LeeSha> and Support Vector Machine with Recursive
  Feature Elimination of the variables, with linear or non linear kernels
  <cite-parenthesized|GuyonWeston>. We used simulated data: 200 observations
  of 1000 variables following a uniform distribution
  <math|\<cal-U\><rsub|[-5,5]>> were generated. We assumed that 5 variables
  and a random effect <math|U> of size 4 explain a vector of binary variables
  <math|Y> through a probit mixed model:

  <eqnarray*|<tformat|<table|<row|<cell|p<rsub|i>>|<cell|=>|<cell|\<Phi\>*<around|(|X<rsub|i><rprime|'>*\<beta\>+Z<rsub|i><rprime|'>*U|)>,<space|2em>i=1,\<ldots\>,200>>|<row|<cell|Y<rsub|i>>|<cell|\<sim\>>|<cell|\<cal-B\><around|(|p<rsub|i>|)>,<space|2em>i=1,\<ldots\>,200>>>>>

  We took <math|\<beta\><rsub|\<gamma\>>=(-1,-1,1,1,2)> and we assume that 50
  observations are coming from each modality of the random effect. Different
  values of <math|U> were used: <math|U*1=<around|(|0,0,0,0|)>>,
  <math|U*2=(-3,-2,2,3)>, <math|U*3=(-5,-3,3,5)>, <math|U*4=(-10,-5,5,10)>
  and <math|U*5=(-30,-10,10,30)>. This set of 200 observations was split
  into training and validation sets, each of them of size 100, with 25
  observations coming from each modality of the random effect. For our method
  and the method of Lee et al. we took <math|c=50>, 5 probesets were selected
  at each iteration of the Gibbs sampler and <math|r=2> of them were changed
  at each iteration of the Metropolis-Hastings step (1 zero and 1 one),
  <math|D> was a <math|4\<times\>4> diagonal matrix with
  <math|A<rsub|1>=\<sigma\><rsub|1><rsup|2>*I<rsub|4>> and a prior
  <math|<with|math-font|cal|I*G><around|(|1,1|)>> was chosen for
  <math|\<sigma\><rsub|1><rsup|2>>, 500 iterations were performed for each
  Metropolis-Hastings step, and a total of either 3000 or 5000 iterations was
  performed for the whole algorithm.<next-line>Concerning our method and the
  method of <cite-textual|LeeSha>, the top-ranked variables (variables
  selected more often than others, box-plots were used) were used to perform
  predictions on the validation set. The RFE-SVM method gave us directly sets
  of "best variables" and associated models, and these models were used to
  perform the predictions. The results obtained are in Table
  <reference|Tab:tab3>.
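
  The simulation design above can be sketched as follows in Python. This is
  an illustration under stated assumptions rather than the code used for the
  comparison: the function name is hypothetical, the five explanatory
  variables are assumed to be the first five columns, and the binary
  response is drawn through the equivalent latent-variable form of the
  probit model (the response is 1 when the linear predictor plus a standard
  normal noise is positive).

```python
import random


def simulate_probit_mixed(u_values, beta_gamma=(-1, -1, 1, 1, 2),
                          n_per_level=50, p=1000, seed=0):
    """Simulate X, the random-effect level and Y for a probit mixed model."""
    rng = random.Random(seed)
    X, level, Y = [], [], []
    for l, u in enumerate(u_values):
        for _ in range(n_per_level):
            x = [rng.uniform(-5.0, 5.0) for _ in range(p)]
            # Only the first len(beta_gamma) variables enter the predictor.
            eta = sum(b * x[j] for j, b in enumerate(beta_gamma)) + u
            latent = eta + rng.gauss(0.0, 1.0)
            X.append(x)
            level.append(l)
            Y.append(1 if latent > 0 else 0)
    return X, level, Y
```

  For instance, <with|font-family|tt|simulate_probit_mixed((-3, -2, 2, 3))>
  corresponds to the second random-effect setting, with 50 observations for
  each of the four modalities.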

  <big-table|<space|-2cm><tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|1ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|5|5|cell-halign|c>|<cwith|1|-1|5|5|cell-rborder|1ln>|<cwith|1|-1|6|6|cell-halign|c>|<cwith|1|-1|6|6|cell-rborder|1ln>|<cwith|1|-1|7|7|cell-halign|c>|<cwith|1|-1|7|7|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|1|1|2|2|cell-col-span|2>|<cwith|1|1|2|2|cell-halign|c>|<cwith|1|1|2|2|cell-rborder|1ln>|<cwith|1|1|4|4|cell-col-span|2>|<cwith|1|1|4|4|cell-halign|c>|<cwith|1|1|4|4|cell-rborder|1ln>|<cwith|1|1|6|6|cell-col-span|2>|<cwith|1|1|6|6|cell-halign|c>|<cwith|1|1|6|6|cell-rborder|1ln>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|2|2|1|-1|cell-bborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|4|4|1|-1|cell-bborder|1ln>|<cwith|5|5|1|-1|cell-bborder|1ln>|<cwith|6|6|1|-1|cell-bborder|1ln>|<cwith|7|7|1|-1|cell-bborder|1ln>|<table|<row|<cell|<rowcolor|lightgray>
  Random effect>|<cell|<cellcolor|lightgray> Our
  method>|<cell|>|<cell|<cellcolor|lightgray> <cite-textual|LeeSha>
  method>|<cell|>|<cell|<cellcolor|lightgray>
  RFE-SVM>|<cell|>>|<row|<cell|<rowcolor|lightgray> <math|U>>|<cell|3000
  iterations>|<cell|5000 iterations>|<cell|3000 iterations>|<cell|5000
  iterations>|<cell|linear>|<cell|non linear>>|<row|<cell|<math|U*1>>|<cell|17>|<cell|26>|<cell|19>|<cell|22>|<cell|25>|<cell|23>>|<row|<cell|<math|U*2>>|<cell|19>|<cell|21>|<cell|19>|<cell|19>|<cell|20>|<cell|26>>|<row|<cell|<math|U*3>>|<cell|21>|<cell|23>|<cell|24>|<cell|24>|<cell|25>|<cell|26>>|<row|<cell|<math|U*4>>|<cell|19>|<cell|19>|<cell|35>|<cell|35>|<cell|29>|<cell|31>>|<row|<cell|<math|U*5>>|<cell|14>|<cell|11>|<cell|44>|<cell|44>|<cell|52>|<cell|56>>>>><label|Tab:tab3>|Number
  of misclassifications on the validation set, for different methods and
  different random effects.>

  <FloatBarrier>

  When there is no random effect or when the magnitude of the random effect
  is small, our method is comparable to the one of Lee et al., and the
  results of these two methods are better than or comparable with those
  obtained by RFE-SVM. But when the magnitude of the random effect is high,
  especially for <math|U*4> and <math|U*5>, it appears that our method
  outperforms the method of Lee et al. and the RFE-SVM method.

  <section|Discussion>

  In this article we have developed an approach for Bayesian gene selection
  for a probit mixed model, as an extension of previous works by
  <cite-textual|GeorgeMcCulloch> and <cite-textual|LeeSha>. An important
  contribution of our method is that it allows selection of variables in a
  mixed framework, taking into account the design of the data. It is
  particularly useful for gene selection, as it enables the use of merged
  datasets in order to introduce more observations and greater diversity.
  That may provide improved power, and we can avoid bias due to a particular
  dataset. The increased size of a merged dataset facilitates its
  re-splitting into training and validation sets, hence we do not need to
  evaluate the performance of a classification rule by a cross-validation
  procedure. It is advantageous compared to other methods which do not take
  into account random effects. Indeed, as these methods can use only one
  dataset which is usually of small size, they often need to perform
  leave-one-out-cross-validation, which can be time-consuming (see
  <cite-textual|LeeSha>, <cite-textual|YangSong>, <cite-textual|ShaVannucci>,
  <cite-textual|ZhouWang1> and <cite-textual|ZhouWang2> for instance). On the
  contrary, if several datasets are merged then a separate validation set can
  be used and the performance of a classifier can be directly obtained on it.
  Using simulations to make comparisons with other methods which do not take
  into account random effects, we showed that the proposed method is
  comparable to others when the magnitude of the random effects is low, but
  performs better than the others for classification when the magnitude of
  the random effects is high. This method should prove widely useful in
  microarray bioinformatics, since many diverse datasets are freely available
  on the Internet. But it can also be used for data obtained from
  high-throughput sequencing technologies, which will probably be widely used
  in a few years. Indeed, the method can be applied whenever we have a matrix
  with <math|n\<ll\>p> and an associated vector of random
  effects.<next-line>

  In practice, before running an analysis, one must decide how many variables
  will be selected at each iteration of the Gibbs sampler. We do not consider
  this to be a drawback, since in order to have a reliable selection, the
  number of probesets should be limited compared to the size of the training
  set. Besides, fixing the number of variables selected at each iteration is
  a computational advantage, as discussed in section <reference|MHgamma>. In
  particular, the singularity of the <math|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>>
  matrix is avoided.<next-line>In addition, one must choose a value for the
  hyperparameter <math|c> which is large enough to have a relatively
  non-informative prior. Only the simulations of
  <math|\<beta\><rsub|\<gamma\>>\<mid\>L,U,\<gamma\>> and of
  <math|\<gamma\>\<mid\>L,U> directly depend on <math|c>. Concerning
  <math|\<beta\><rsub|\<gamma\>>\<mid\>L,U,\<gamma\>>, we can see in
  (<reference|fullbeta>) that the density is proportional to
  <math|c/<around|(|1+c|)>>, which is relatively close to 1 if <math|c> is
  large. Concerning the density of <math|\<gamma\>\<mid\>L,U> (see
  (<reference|marggamma>)), it depends on <math|c/<around|(|1+c|)>> and on
  <math|<around|(|1+c|)><rsup|-<frac|<big|sum>\<gamma\><rsub|i>|2>>>. The
  factor <math|<around|(|1+c|)><rsup|-<frac|<big|sum>\<gamma\><rsub|i>|2>>>
  does not play a role in the simulation of <math|\<gamma\>>, because the
  number of variables to be selected at each iteration is fixed: this factor
  vanishes in the acceptance rate of the Metropolis-Hastings step of the
  algorithm. Therefore the value chosen should not be too influential, as long
  as it is large enough. We chose <math|c=50>, following the recommendations
  of <cite-textual|SmithKohn>. However, different authors have suggested
  different ranges, see <cite-textual|ChipmanGeorge>,
  <cite-textual|GeorgeFoster> and <cite-textual|ClydeGeorge> among others.
  For example <cite-textual|ZhouWang1> and <cite-textual|ZhouWang2> used
  <math|c=10>, and <cite-textual|LeeSha> used <math|c=100>. However, it is
  possible to include another level in our Bayesian hierarchical model and to
  put a prior distribution on <math|c>. <cite-textual|ZellnerSiow> for
  instance proposed a mixture of g-priors and an inverse-gamma prior on
  <math|c>. Recently <cite-textual|BottoloRichardson> considered putting a
  hyperprior on <math|c> and using a Metropolis-within-Gibbs with adaptive
  proposal for updating this coefficient. In our application this coefficient
  was held fixed for convenience, and good results were obtained. Besides, the
  sensitivity study showed us that the method is not overly sensitive to the
  value chosen for <math|c>, as expected.<next-line>More generally, it
  appeared that the algorithm is fairly stable to variations in the training
  set, and is robust to the prior values of the hyperparameters.<next-line>
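
  The dependence on <math|c> discussed above is easy to check numerically.
  In the short Python sketch below, both function names are hypothetical:
  the first evaluates <math|c/<around|(|1+c|)>> and the second the factor
  <math|<around|(|1+c|)><rsup|-q/2>>, where <math|q> denotes the number of
  selected variables.

```python
def c_ratio(c):
    """The ratio c / (1 + c), close to 1 when c is large."""
    return c / (1.0 + c)


def size_factor(c, n_selected):
    """The factor (1 + c) ** (-n_selected / 2) of the density of gamma."""
    return (1.0 + c) ** (-n_selected / 2.0)


for c in (10, 50, 100, 1000):
    # Candidate and current gamma select the same number of variables,
    # so the size factor cancels in the acceptance ratio.
    ratio = size_factor(c, 30) / size_factor(c, 30)
    print(c, round(c_ratio(c), 3), ratio)
```

  For <math|c=10>, 50, 100 and 1000 the ratio <math|c/<around|(|1+c|)>> is
  approximately 0.909, 0.980, 0.990 and 0.999, which illustrates why values
  in this range are expected to give similar results.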

  Convergence could not be verified because we did not have formal diagnostic
  tools to prove it, as the parameter vectors used in the proposed algorithm
  were not associated with the same variables from one iteration to the next.
  Besides, the different runs could have converged to a local mode of the
  posterior distribution of <math|\<gamma\>>, and not to a global one. But
  the results obtained in the stability and sensitivity analyses were
  satisfactory, as different runs with different starting points and
  different prior hyperparameters selected broadly the same variables, which
  means that these different chains had basically the same behavior. From our
  experience, it appeared that having a total number of iterations equal to
  three times the size of the set of predictors is sufficient, the results
  were not significantly different when more iterations were
  performed.<next-line>

  The probesets selected by our method to characterize the estrogen receptor
  status enabled us to fit a model with good predictive performance. Moreover, three
  genes among the five used in the model were also selected using a Support
  Vector Machine method (twenty-four genes were selected by SVM), and another
  group of three among those five is known to be associated with estrogen
  receptor pathways and breast cancer: GREB1
  <cite-parenthesized|Nagaraja|Towson|Rae>, SERPINA3
  <cite-parenthesized|Cimino> and MYH11 <cite-parenthesized|Singh>.
  Therefore, it seems that the probesets selected by our method are quite
  biologically relevant.<next-line>

  The algorithm developed is efficient and feasible, even for very large
  datasets with around 20000 variables. Therefore this approach has a clear
  advantage over other selection methods which handle fewer variables or
  which do not take random effects into account. However, Bayesian variable
  selection is an active research area, and it would be interesting to
  combine our method with recent proposals, for instance by studying the
  performance of the method with other prior distributions for
  <math|\<sigma\><rsup|2>>, such as half-Cauchy or
  folded-noncentral-<with|font-shape|italic|t> distributions (see
  <cite-textual|Gelman2006>), or by putting a prior distribution on
  <math|c>, as in <cite-textual|BottoloRichardson>. It would also be of interest to
  consider an alternative prior distribution for
  <math|\<beta\><rsub|\<gamma\>>> to handle a non-invertible
  <math|X<rsub|\<gamma\>><rprime|'>*X<rsub|\<gamma\>>> (when <math|\<gamma\>>
  is itself singular or when <math|n\<less\>d>), by combining our approach
  with the concept of ridge regression (work in progress).

  <section|Acknowledgements>

  We would like to thank an Associate Editor and two anonymous reviewers for
  careful reading of the paper and constructive comments which have led to an
  improvement of the manuscript. We are grateful to Dr Daniel Birnbaum and Pr
  François Bertucci from the Département d'Oncologie Moléculaire of the
  Institut PAOLI CALMETTES (Marseille, France) for permission to use their
  data. We also thank Pr Denys Pommeret and Rebecca Tagett for useful
  discussions and comments.

  <\bibliography|bib|plain|references>
    <bib-list|[99]|>
  </bibliography>
</body>