<TeXmacs|1.99.7>

<style|<tuple|amsart|cite-author-year|std-latex>>

<\body>
  <\left-aligned>
    <with|font-size|1.41|font-series|bold|GSGS: A Computational Framework to
    Reconstruct Signaling Pathways from Gene Sets> <vspace|>em Lipi
    Acharya<rsup|<math|1>>, Thair Judeh<rsup|<math|1>>, Zhansheng
    Duan<rsup|<math|1>>, Michael Rabbat<rsup|<math|2>> and Dongxiao
    Zhu<rsup|<math|1,3,4,\<ast\>>> <vspace|>em <rsup|<math|1>>Department of
    Computer Science, University of New Orleans, 2000 Lakeshore Drive, New
    Orleans, LA 70148, USA <vspace|>em <rsup|<math|2>>Department of
    Electrical and Computer Engineering, McGill University, 3480 University
    Street, Montréal, Québec H3A 2A7, Canada<next-line><vspace|>em
    <rsup|<math|3>>Research Institute for Children, Children's Hospital, New
    Orleans, LA 70118, USA <vspace|>em <rsup|<math|4>>Tulane Cancer Center,
    New Orleans, LA 70118, USA <vspace|>em <rsup|<math|\<ast\>>> to whom
    correspondence should be addressed
  </left-aligned>

  <section*|Abstract>

  We propose a novel two-stage Gene Set Gibbs Sampling (GSGS) framework to
  reverse engineer signaling pathways from gene sets inferred from molecular
  profiling data. We hypothesize that signaling pathways are structurally an
  ensemble of overlapping linear signal transduction events which we encode
  as Information Flow Gene Sets (IFGS's). We infer pathways from gene sets
  corresponding to these events subjected to a random permutation of genes
  within each set. In Stage I, we use a source separation algorithm to derive
  unordered and overlapping IFGS's from molecular profiling data, allowing
  cross talk among IFGS's. In Stage II, we develop a Gibbs sampling-like
  algorithm, Gene Set Gibbs Sampler, to reconstruct signaling pathways from
  the latent IFGS's derived in Stage I. The novelty of this framework lies in
  the seamless integration of the two stages and the hypothesis of IFGS's as
  the basic building blocks for signaling pathways. In the proof-of-concept
  studies, our approach is shown to outperform the existing Bayesian network
  approaches using both continuous and discrete data generated from benchmark
  networks in the DREAM initiative. We perform a comprehensive sensitivity
  analysis to assess the robustness of the approach. Finally, we apply
  the GSGS framework to reconstruct signaling pathways in breast cancer
  cells.

  <section|Introduction><label|sec:1>

  A central goal of computational systems biology is to decipher signal
  transduction pathways in living cells. Characterization of complicated
  interaction patterns in signaling pathways can provide insights into
  biomolecular interaction and regulation mechanisms. Consequently, there
  has been a large body of computational efforts addressing the problem of
  signaling pathway reconstruction by using Probabilistic Boolean Networks
  (PBNs) (<cite-raw|Shmulevich02>, <cite-raw|Shmulevich03>), Bayesian
  Networks (BNs) (<cite-raw|Friedman00>, <cite-raw|Segal03>,
  <cite-raw|Song09>), Relevance Networks (RNs) (<cite-raw|Butte03>),
  Graphical Gaussian Models (GGMs) (<cite-raw|Kishino00>, <cite-raw|Dobra04>,
  <cite-raw|Schaffer05>) and other approaches (<cite-raw|Gardner03>,
  <cite-raw|Tenger03>, <cite-raw|Altay10a>).

  Although the existing approaches are useful, they often represent a
  phenomenological graph of the observed data. For example, in BNs the parent
  set of each gene indicates statistically causal relationships. RNs, GGMs
  and PBNs are computationally tractable even for large signaling pathways;
  however, the co-expression criteria used in RNs and GGMs only model a
  possible functional relevancy, and the use of Boolean functions in PBNs may
  lead to an oversimplification of the underlying gene regulatory mechanisms.
  Moreover, the aforementioned approaches rely purely on molecular profiling
  data generated from high-throughput platforms, which are often noisy and
  costly to obtain. Consequently, the
  reconstructed networks may fail to represent the underlying signal
  transduction mechanisms.

  On the other hand, gene set based analysis has received much attention in
  recent years. An initial characterization of large-scale molecular
  profiling data often results in the identification of many pathway
  components, which we refer to as gene sets. The availability of several
  computational and experimental strategies has led to a rapid accumulation
  of gene sets in biomedical databases. A gene set compendium comprises
  a large number of overlapping gene sets, as each gene may
  simultaneously participate in many biological processes. This overlap
  reflects the interconnectedness among gene sets and should be exploited to
  infer the underlying gene regulatory network. Our motivation for considering
  a gene set based approach to network reconstruction rests on several
  further considerations. For instance, a gene set based approach can more
  naturally incorporate higher order interaction mechanisms than approaches
  based on individual genes. In comparison to molecular profiling data, gene
  sets are more robust
  to noise and facilitate data integration from multiple data acquisition
  platforms. A gene set based approach can allow us to explicitly consider
  signal transduction mechanisms underlying individual gene sets. Overall,
  gene sets provide a rich source of data to infer signaling pathways. The
  relative advantages of working with gene sets in bioinformatics analyses
  have been adequately demonstrated (<cite-raw|Subra05>, <cite-raw|Pang06>,
  <cite-raw|Pang08>, <cite-raw|Richards10>). However, signaling pathway
  reconstruction by sufficiently exploiting gene sets, a promising area of
  bioinformatics research, remains underdeveloped.

  With few exceptions, the existing network reconstruction approaches do not
  accommodate gene sets. The frequency method in (<cite-raw|Rabbat05>)
  assigns an order to a gene set by assuming a tree structure in the paths
  between pairs of nodes. However, the method is liable to fail in the
  presence of multiple paths between the same pair of nodes. To capture the
  underlying relations between nodes, the cGraph algorithm presented in
  (<cite-raw|Kubica03>) adds weighted edges between each pair of nodes that
  appear in some gene set. The networks inferred by this approach often
  contain a large number of false positives. It is also difficult to
  incorporate prior knowledge about regulator-target pairs in the approaches
  mentioned above. The EM approach in (<cite-raw|Zhu06>, <cite-raw|Rabbat08>)
  treats permutations of genes in a gene set as missing data and assumes a
  linear arrangement of genes in each set. Nevertheless, it is necessary to
  develop a systems biology framework integrating both the identification of
  significant gene sets and signaling pathway reconstruction from gene sets.

  A central aspect of developing such network reconstruction frameworks is to
  understand the structure of signaling pathways. Signaling pathways are an
  ensemble of several overlapping signal transduction events with a linear
  arrangement of genes in each event. We denote these events as Information
  Flows (IF's). Information Flow Gene Sets (IFGS's) stand for the gene sets
  obtained by randomly permuting the order of genes in each IF. Thus, an IF
  and an IFGS share the same set of genes; however, the latter lacks gene
  ordering information, i.e., it is <em|unordered>. We hypothesize that IF's form
  the building blocks for signaling pathways and uniquely determine their
  structures. One plausible way to retrieve the latent, unordered and
  overlapping IFGS's from molecular profiling data is to use source
  separation approaches, such as Singular Value Decomposition (SVD) (Stage
  I). The true signaling pathways can be reconstructed by inferring a
  distribution of more likely orders of the genes in each IFGS (Stage II).

  In this paper, we design a two-stage Gene Set Gibbs Sampling (GSGS)
  framework by seamlessly integrating deconvolution of IFGS's and signaling
  pathway reconstruction from IFGS's. In Stage I, we infer unordered and
  overlapping IFGS's corresponding to the latent signal transduction events.
  In Stage II, we develop a stochastic algorithm, Gene Set Gibbs Sampler, under
  the Gibbs sampling framework (<cite-raw|Gelman03>, <cite-raw|Givens05>) to
  reconstruct signaling pathways from IFGS's inferred in Stage I. The algorithm
  treats the ordering of genes in an IFGS as a random variable, and samples
  signaling pathways from the joint distribution of IFGS's. The two-stage
  GSGS framework is novel in several respects, such as the hypothesis of
  IFGS's as the basic building blocks for signaling pathways, the definition of
  gene orderings as a random variable to accommodate higher-order interaction
  as opposed to individual gene expression, and probabilistic network
  inference.

  We comprehensively examine the performance of our approach by using two
  gold standard networks from the DREAM (Dialogue for Reverse Engineering
  Assessments and Methods) initiative and compare it with the Bayesian
  network approaches K2 (<cite-raw|Cooper92>, <cite-raw|Murphy01b>) and MCMC
  (<cite-raw|Murphy01a>, <cite-raw|Murphy01b>). We also perform sensitivity
  analysis to assess the robustness of the framework to the under-sampling
  and over-sampling of gene sets. Finally, we use our framework to
  reconstruct signaling pathways in breast cancer cells.

  <section|Methods>

  <subsection|Our concepts>

  An <with|font-shape|italic|Information Flow (IF)> is a directed linear path
  from one node to another node in a signaling pathway which does not allow
  self-transition or transition to a previously visited node. An
  <with|font-shape|italic|Information Flow Gene Set (IFGS)> is the set of all
  genes in an IF with a random permutation of their ordering. The length of
  an IFGS is the number of genes present in the set. Therefore, there are
  <math|L>! putative information flows that are compatible with an IFGS of
  length <math|L>. We assume throughout that <math|L\<geq\>3>. An IF of
  length two serves as prior knowledge. Given a collection of <math|m>
  unordered IFGS's <math|X<rsub|1>,<nbsp>X<rsub|2>,\<ldots\>,X<rsub|m>>, we
  treat the order <math|\<Theta\><rsub|i>> associated with <math|X<rsub|i>>
  as a random variable. We write <math|<around|(|X<rsub|i>,\<Theta\><rsub|i>|)>>
  to represent this association. Let us assume that the length of
  <math|X<rsub|i>> is <math|L<rsub|i>>, for <math|i=1,\<ldots\>,m>. As the
  sampling space of <math|\<Theta\><rsub|i>> corresponding to
  <math|X<rsub|i>> is of size <math|L<rsub|i>>!, it follows that the sampling
  space of the joint distribution <math|P<around|(|<around|(|X<rsub|1>,\<Theta\><rsub|1>|)>,\<ldots\>,<around|(|X<rsub|m>,\<Theta\><rsub|m>|)>|)>>
  is the set of <math|<big|prod><rsub|i=1><rsup|m>L<rsub|i>!> permutations.
  A sampling space of size <math|<big|prod><rsub|i=1><rsup|m>L<rsub|i>!> can be
  computationally intractable even for moderate values of <math|L<rsub|i>>
  and <math|m>. As a result, our goal of signaling pathway reconstruction can
  be translated into drawing sample signaling pathways sequentially from the
  joint distribution <math|P<around|(|<around|(|X<rsub|1>,\<Theta\><rsub|1>|)>,\<ldots\>,<around|(|X<rsub|m>,\<Theta\><rsub|m>|)>|)>>
  (the true signaling pathway) of IFGS's and then estimating the most likely
  signaling pathway using sampled pathways.
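
  To give a feel for the sizes involved, the following minimal sketch compares
  the joint sampling space with the per-set space that a conditional sampler
  needs to enumerate. The function names and the example lengths are
  illustrative assumptions and do not come from the original implementation.

```python
from math import factorial, prod

def joint_space_size(lengths):
    """Number of joint orderings of IFGS's with the given lengths (product of L_i!)."""
    return prod(factorial(length) for length in lengths)

def conditional_space_size(lengths, i):
    """Number of orderings of the i-th IFGS alone (L_i!)."""
    return factorial(lengths[i])

# Hypothetical example: 20 gene sets, each of length 5.
lengths = [5] * 20
print(joint_space_size(lengths))           # 120**20, far too large to enumerate
print(conditional_space_size(lengths, 0))  # 120
```

  Even for this small hypothetical input the joint space cannot be enumerated,
  whereas each conditional space contains only 120 orderings.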

  Next, we present our two-stage GSGS framework. In Stage I, we derive IFGS's
  which form the building blocks of the signaling pathways. In Stage II, we
  develop a Gibbs sampling-like algorithm to sequentially sample permutation
  orders for each IFGS by conditioning on the remainder of the network
  structure.

  <big-figure|<with|par-mode|center|<image|fig1.eps||||><label|fig:f1>>|<with|font-size|0.84|Flow
  chart for the two-stage GSGS signaling pathway reconstruction framework.
  Stage I: Derivation of IFGS's using two common data resources. Stage II:
  Gene Set Gibbs Sampler successively draws sample signaling pathways of the
  underlying true signaling pathway from the joint distribution of IFGS's.>>

  <subsection|Stage I: Derivation of IFGS's>

  In Stage I, we derive unordered and overlapping IFGS's corresponding to
  latent information flows to serve as input for the pathway reconstruction
  algorithm presented in the next section (Fig. <reference|fig:f1>). We use
  Singular Value Decomposition (SVD) to identify candidate gene sets. To
  extract coherent gene sets, the algorithm combines knowledge from two
  complementary forms of data: gene sets available from databases and
  molecular profiling data from high-throughput platforms. We first select
  genes which appear most frequently in the gene set compendium under study.
  This frequency is referred to as <em|degree>. We identify significant genes
  by fitting a power law distribution (<math|y\<propto\>x<rsup|-\<alpha\>>,<nbsp>\<alpha\>\<gtr\>1>)
  on the degrees of distinct genes present in the compendium. An application
  of SVD on the gene expression data <math|D> corresponding to significant
  genes leads to a factorization of the form
  <math|D<rsub|p\<times\>q>=U<rsub|p\<times\>p>\<cdot\>S<rsub|p\<times\>q>\<cdot\>V<rsub|q\<times\>q><rsup|T>>,
  where <math|p> is the number of genes and <math|q> is the number of
  samples. We choose column vectors from <math|U> corresponding to the <math|k>
  largest singular values in SVD. In general, <math|k> is considerably
  smaller than the original dimension of the data. Following <cite-raw|Kim03>, we
  assume that <math|k> satisfies <math|k*<around|(|p+q|)>\<less\>p*q>. We let
  <math|k=max <around|{|r:r*<around|(|p+q|)>\<less\>p*q|}>> to derive the
  maximum number of gene sets satisfying the preceding criterion. It is
  well known that a single gene in a living cell may simultaneously
  participate in multiple biological processes. The chosen basis vectors
  represent <math|k> potential information flows. For a specified cut-off
  <math|\<beta\>>, we treat the top <math|\<beta\>%> of entries (in absolute
  value) across the <math|k> vectors as significant and set the other entries
  to zero. The non-zero locations in the <math|k> vectors correspond to <math|k>
  overlapping gene sets. We further perform gene set enrichment analysis on
  the gene sets derived using SVD. The enriched gene sets represent IFGS's.
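
  The SVD-based part of Stage I can be sketched as follows. This is a minimal
  illustration under stated assumptions: the expression matrix has genes in
  rows and samples in columns, the cut-off is applied to the pooled absolute
  entries of the chosen vectors, and the function names
  <verbatim|choose_k> and <verbatim|candidate_gene_sets> are hypothetical. The
  degree-based gene selection and the enrichment analysis are not included.

```python
import numpy as np

def choose_k(p, q):
    """Largest integer r satisfying r * (p + q) < p * q."""
    k = (p * q) // (p + q)
    if k * (p + q) == p * q:  # the inequality is strict
        k -= 1
    return k

def candidate_gene_sets(D, beta):
    """Return k sets of row indices, one per leading left singular vector.

    D    : array of shape (genes, samples)
    beta : percentage of pooled absolute entries kept as significant
    """
    p, q = D.shape
    # Singular values are returned in descending order, so the first
    # k columns of U correspond to the k largest singular values.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    k = min(choose_k(p, q), U.shape[1])
    Uk = np.abs(U[:, :k])
    # Threshold pooled over all k vectors: keep the top beta percent.
    cutoff = np.percentile(Uk, 100.0 - beta)
    return [set(np.flatnonzero(Uk[:, j] >= cutoff).tolist()) for j in range(k)]

# Hypothetical usage on random data (not the data analyzed in this paper).
rng = np.random.default_rng(0)
sets = candidate_gene_sets(rng.normal(size=(20, 10)), beta=10)
print(len(sets), [len(s) for s in sets])
```

  Some of the resulting sets may be empty or contain fewer than three genes;
  such sets would have to be filtered before Stage II.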

  <subsection|Stage II: Signaling pathway reconstruction from IFGS's>

  <with|font-shape|italic|Joint distribution and conditional distribution of
  gene sets>. As the number of gene sets increases, the size of the sampling
  space for the multivariate distribution
  <math|P<around|(|<around|(|X<rsub|1>,\<Theta\><rsub|1>|)>,\<ldots\>,<around|(|X<rsub|m>,\<Theta\><rsub|m>|)>|)>>
  is of the order of <math|<big|prod><rsub|i=1><rsup|m>L<rsub|i>!>. Such a
  space might be computationally intractable even for moderate values of
  <math|L<rsub|i>> and <math|m>. However, it is possible to theoretically
  describe this distribution under certain assumptions.

  From now on, we consider IFGS's as random samples from a first order Markov
  chain model, where the state of a node depends only on the state of
  its previous node. We compute the initial probability vector
  <math|\<pi\><rsub|0>> and transition probability matrix <math|\<Pi\>> from
  <math|m> IF's (ordered paths) as follows. If there are a total of <math|n>
  distinct genes across <math|m> IF's, then

  <\equation>
    \<pi\><rsub|0>=<around|(|<frac|c<rsub|1>|c>,\<ldots\>,<frac|c<rsub|n>|c>|)><label|eq1>
  </equation>

  <no-indent>where <math|c<rsub|l>> is the total number of times the
  <math|l<rsup|t*h>> gene appears as the first node among the <math|m> IF's, for
  each <math|l=1,\<ldots\>,n> and <math|c=<big|sum><rsub|l=1><rsup|n>c<rsub|l>>.
  If <math|c<rsub|r*s>> is the total number of times the <math|r<rsup|t*h>> gene
  transits to the <math|s<rsup|t*h>> gene (i.e. there is an edge from <math|r> to
  <math|s>) among the <math|m> ordered paths, then

  <\equation>
    \<Pi\>=<around|[|p<rsub|r*s>|]><rsub|n\<times\>n><label|eq2>
  </equation>

  <no-indent>where <math|p<rsub|r*s>=c<rsub|r*s>/<big|sum><rsub|t=1><rsup|n>c<rsub|r*t>>,
  <math|r,s=1,\<ldots\>,n>.
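
  The count-based estimates in Eq. <reference|eq1> and Eq. <reference|eq2> can
  be sketched in a few lines. The sketch assumes that each IF is given as a
  tuple of gene names; the function name <verbatim|estimate_markov> and the
  dictionary-based representation are illustrative choices, not part of the
  original implementation.

```python
from collections import Counter, defaultdict

def estimate_markov(paths):
    """Estimate the initial vector and transition matrix from ordered paths.

    Returns (pi0, Pi) where pi0[g] is the fraction of paths starting at g
    and Pi[r][s] is the fraction of transitions out of r that go to s.
    Genes or transitions that never occur are simply absent (probability 0).
    """
    first = Counter(path[0] for path in paths)
    total = sum(first.values())
    pi0 = {gene: count / total for gene, count in first.items()}

    counts = defaultdict(Counter)
    for path in paths:
        for r, s in zip(path, path[1:]):
            counts[r][s] += 1
    Pi = {
        r: {s: c / sum(row.values()) for s, c in row.items()}
        for r, row in counts.items()
    }
    return pi0, Pi

# Hypothetical ordered paths.
paths = [("a", "b", "c"), ("a", "c", "d"), ("b", "c", "d")]
pi0, Pi = estimate_markov(paths)
print(pi0)
print(Pi)
```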

  The computation of <math|\<pi\><rsub|0>> and <math|\<Pi\>> allows us to
  calculate the likelihood of each of the
  <math|<big|prod><rsub|i=1><rsup|m>L<rsub|i>!> collections of IF's. The
  likelihood of each collection is the product of the likelihoods of <math|m>
  individual IF's in the collection. As each IF is treated as a first order
  Markov chain, we can calculate its likelihood using <math|\<pi\><rsub|0>>
  and <math|\<Pi\>>. For example, we compute the likelihood of the
  information flow <math|z\<rightarrow\>y\<rightarrow\>x> as

  <\equation>
    \<cal-P\>*<around|(|z\<rightarrow\>y\<rightarrow\>x|)>=P<around|(|z|)>\<times\>P<around|(|y\|z|)>\<times\>P<around|(|x\|y|)>.<label|eq3>
  </equation>

  The likelihood values calculated for a total of
  <math|<big|prod><rsub|i=1><rsup|m>L<rsub|i>!> collections of IF's can be
  normalized to denote a distribution of permutation ordering probabilities.
  However, the computation of <math|<big|prod><rsub|i=1><rsup|m>L<rsub|i>!>
  likelihoods might be computationally intractable. This serves as motivation
  for the proposed Gibbs sampling-like approach. The computational
  tractability of our GSGS framework lies in sampling an order for each IFGS
  <math|X<rsub|i>> by conditioning on the orders of the remaining IFGS's,
  with a much reduced sample space of size <math|L<rsub|i>>!.
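
  As a small illustration of Eq. <reference|eq3>, the sketch below evaluates
  the likelihood of one ordered path under a first order Markov chain. The
  probabilities used are made-up numbers for illustration only, and the name
  <verbatim|path_likelihood> is hypothetical.

```python
def path_likelihood(path, pi0, Pi):
    """Likelihood of an ordered path: P(first) times the product of transitions.

    pi0 maps a gene to its initial probability; Pi maps a gene to a dict of
    transition probabilities. Missing entries are treated as probability 0.
    """
    prob = pi0.get(path[0], 0.0)
    for r, s in zip(path, path[1:]):
        prob *= Pi.get(r, {}).get(s, 0.0)
    return prob

# Hypothetical initial and transition probabilities.
pi0 = {"z": 0.5, "y": 0.5}
Pi = {"z": {"y": 1.0}, "y": {"x": 0.5, "z": 0.5}}

print(path_likelihood(("z", "y", "x"), pi0, Pi))  # 0.5 * 1.0 * 0.5 = 0.25
print(path_likelihood(("x", "y", "z"), pi0, Pi))  # 0.0, since x never starts a path
```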

  Let us write all IFGS's and their associated orderings together as
  <math|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)>>, where
  <math|<wide|X|\<bar\>>=<around|(|X<rsub|1>,\<ldots\>,X<rsub|m>|)>> and
  <math|<wide|\<Theta\>|\<bar\>>=<around|(|\<Theta\><rsub|1>,\<ldots\>,\<Theta\><rsub|m>|)>>.
  The notations are suffixed with <math|-i> to consider all but the
  <math|i<rsup|t*h>> component, e.g. <math|<wide|X|\<bar\>><rsub|-i>>,
  <math|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)><rsub|-i>>
  etc., for <math|i\<in\><around|{|1,\<ldots\>,m|}>>. We sample an order for
  the <math|i<rsup|t*h>> gene set <math|X<rsub|i>> by conditioning on the
  known orders of remaining <math|m-1> gene sets
  <math|X<rsub|1>,\<ldots\>,X<rsub|i-1>,X<rsub|i+1>,\<ldots\>,X<rsub|m>>. To
  sample an order for <math|X<rsub|i>> from the conditional distribution, we
  leave the <math|i<rsup|t*h>> gene set out, and compute the initial
  probability vector <math|\<pi\><rsub|-i>> and transition probability matrix
  <math|\<Pi\><rsub|-i>> by following the procedure described in Eq.
  <reference|eq1> and Eq. <reference|eq2>, from <math|m-1> IF's. Further, we
  calculate the likelihoods of all possible orders
  <math|\<Theta\><rsub|i><rsup|j>,<nbsp>j=1,\<ldots\>,L<rsub|i>!> for
  <math|X<rsub|i>> by conditioning on the orders of the remaining <math|m-1> gene
  sets. The conditional likelihood for the <math|j<rsup|t*h>> order for
  <math|X<rsub|i>> is given by

  <\equation>
    \<cal-L\><rsub|i><rsup|j>=<choice|<tformat|<table|<row|<cell|<frac|\<cal-P\><rsub|i><rsup|j>|<big|sum><rsub|k=1><rsup|L<rsub|i>!>\<cal-P\><rsub|i><rsup|k>>>|<cell|<text|if
    ><big|sum><rsub|k=1><rsup|L<rsub|i>!>\<cal-P\><rsub|i><rsup|k>\<neq\>0,>>|<row|<cell|<frac|1|L<rsub|i>!>>|<cell|<text|otherwise>>>>>><label|eq4>
  </equation>

  where

  <\equation>
    \<cal-P\><rsub|i><rsup|j>=P<around|(|<around|(|X<rsub|i>,\<Theta\><rsub|i>=\<Theta\><rsub|i><rsup|j>|)>\|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)><rsub|-i>|)>.<label|eq5>
  </equation>

  <no-indent>For a fixed value of <math|j>, <math|\<cal-P\><rsub|i><rsup|j>>
  is computed by decomposing it into the product of conditional probability
  terms. For example, we compute the likelihood of
  <math|z\<rightarrow\>y\<rightarrow\>x> corresponding to the gene set
  <math|X<rsub|i>=<around|{|x,y,z|}>> as

  <\equation>
    \<cal-P\><around|(|<around|(|X<rsub|i>,\<Theta\><rsub|i>=z\<rightarrow\>y\<rightarrow\>x|)>\|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)><rsub|-i>|)>=P<around|(|z|)>\<times\>P<around|(|y\|z|)>\<times\>P<around|(|x\|y|)>.
  </equation>

  Each term on the right is conditioned on
  <math|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)><rsub|-i>> and
  is available from <math|\<pi\><rsub|-i>> and <math|\<Pi\><rsub|-i>>. We now
  sample an order for <math|X<rsub|i>> from the conditional distribution
  using the inverse Cumulative Distribution Function (CDF) method
  (<cite-raw|Gelman03>). The
  CDF of the conditional distribution <math|P<around|(|<around|(|X<rsub|i>,\<Theta\><rsub|i>|)>\|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)><rsub|-i>|)>>
  is defined as

  <\equation>
    F<around|(|<around|(|X<rsub|i>,\<Theta\><rsub|i>=\<Theta\><rsub|i><rsup|j>|)>\|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)><rsub|-i>|)>=<big|sum><rsub|k=1><rsup|j>\<cal-L\><rsub|i><rsup|k><label|eq7>
  </equation>

  for each <math|j=1,\<ldots\>,L<rsub|i>!>. By sampling a number
  <math|u\<sim\>U<around|(|0,1|)>> and letting
  <math|F<rsup|-1><around|(|u|)>=v>, we get a randomly drawn order <math|v>
  for <math|X<rsub|i>> from the conditional distribution (Eq.
  <reference|eq7>).
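
  The normalization in Eq. <reference|eq4> and the inverse CDF draw can be
  sketched as follows. This is an illustrative sketch, not the original
  implementation: candidate orders are enumerated with
  <verbatim|itertools.permutations>, the helper names are hypothetical, and
  the uniform number is passed in explicitly so that the draw is reproducible.

```python
from bisect import bisect_right
from itertools import accumulate, permutations

def path_likelihood(path, pi0, Pi):
    """First order Markov chain likelihood; missing entries count as 0."""
    prob = pi0.get(path[0], 0.0)
    for r, s in zip(path, path[1:]):
        prob *= Pi.get(r, {}).get(s, 0.0)
    return prob

def conditional_likelihoods(gene_set, pi0, Pi):
    """Normalized likelihoods of all orders, uniform if every order has likelihood 0."""
    orders = list(permutations(sorted(gene_set)))
    raw = [path_likelihood(order, pi0, Pi) for order in orders]
    total = sum(raw)
    if total > 0:
        probs = [value / total for value in raw]
    else:
        probs = [1.0 / len(orders)] * len(orders)
    return orders, probs

def sample_order(orders, probs, u):
    """Inverse CDF draw: first order whose cumulative probability exceeds u."""
    cdf = list(accumulate(probs))
    j = bisect_right(cdf, u)
    return orders[min(j, len(orders) - 1)]  # guard against rounding in the last CDF value

# Hypothetical probabilities estimated from the other gene sets.
pi0 = {"z": 0.5, "y": 0.5}
Pi = {"z": {"y": 1.0}, "y": {"x": 0.5, "z": 0.5}}
orders, probs = conditional_likelihoods({"x", "y", "z"}, pi0, Pi)
print(sample_order(orders, probs, u=0.3))
```

  In practice <verbatim|u> would be drawn from a uniform random number
  generator, for example <verbatim|random.random()>.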

  <\specified-algorithm|Gene Set Gibbs Sampler>
    <label|algorithm1><algo-state|<with|font-series|bold|Input:> <math|m>
    IFGS's <math|X<rsub|i>,<nbsp>i=1,\<ldots\>,m>, prior knowledge
    (optional), burn-in state <math|B> and number of samples <math|N> to be
    collected after burn-in state>

    <algo-state|<with|font-series|bold|Output:> <math|m> information flows
    <math|<around|(|X<rsub|i>,<wide|\<Theta\>|^><rsub|i>|)>>,
    <math|i=1,\<ldots\>,m>>

    <algo-state|At <math|t=0>, make a random choice of order
    <math|\<Theta\><rsub|i><rsup|<around|(|0|)>>> from <math|L<rsub|i>>!
    permutations,<nbsp><math|i=1,\<ldots\>,m>>

    <\algo-for|<math|t=1,\<ldots\>,B+N>>
      <algo-state|<math|<wide|\<Theta\>|\<bar\>>=<around|(|\<Theta\><rsub|1><rsup|<around|(|t-1|)>>,\<ldots\>,\<Theta\><rsub|m><rsup|<around|(|t-1|)>>|)><rsup|T>>>

      <\algo-for|<math|i=1,\<ldots\>,m>>
        <algo-state|Compute <math|\<pi\><rsup|<around|(|t|)>><rsub|-i>> and
        <math|\<Pi\><rsup|<around|(|t|)>><rsub|-i>>>

        <algo-state|Calculate the conditional likelihoods
        <math|\<cal-L\><rsup|j><rsub|i>>'s (Eq. <reference|eq4>) of
        <math|L<rsub|i>>! permutations by treating <math|X<rsub|i>> as a
        first order Markov chain>

        <algo-state|Draw an order <math|\<Theta\><rsub|i><rsup|<around|(|t|)>>>
        for <math|X<rsub|i>> from the conditional distribution
        <math|P<around|(|<around|(|X<rsub|i>,\<Theta\><rsub|i>|)>\|<around|(|<wide|X|\<bar\>>,<wide|\<Theta\>|\<bar\>>|)><rsub|-i>|)>>>

        <algo-state|Update the order information for <math|X<rsub|i>>>
      </algo-for>
    </algo-for>

    <algo-state|Return <math|<wide|\<Theta\>|^><rsub|i>=<text|mode><around|(|\<Theta\><rsub|i><rsup|<around|(|B+1|)>>,\<ldots\>,\<Theta\><rsub|i><rsup|<around|(|B+N|)>>|)>>,<nbsp><math|i=1,\<ldots\>,m>.>
  </specified-algorithm>

  <next-line><next-line><no-indent><with|font-shape|italic|Gene Set Gibbs
  Sampler>. In Algorithm <reference|algorithm1>, we present Gene Set Gibbs
  Sampler, which leads to the reconstruction of signaling pathways from
  IFGS's derived in Stage I. When prior knowledge is available, we augment the
  unordered IFGS's with known edges as directed pairs, and keep the direction of
  these edges fixed during the execution of the algorithm. Algorithm
  <reference|algorithm1> outputs a list of IF's. To reconstruct signaling
  pathways, we start with an empty network of distinct genes present in the
  input list and reconstruct the most likely signaling pathway by joining
  IF's present in the output of Algorithm <reference|algorithm1>.
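
  A compact sketch of the sampling loop in Algorithm <reference|algorithm1> is
  given below. It is a simplified illustration under several assumptions: prior
  knowledge is omitted, all permutations of each gene set are enumerated
  explicitly, the draw uses <verbatim|random.choices> instead of an explicit
  inverse CDF, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import permutations

def estimate_markov(paths):
    """Initial probabilities and transition probabilities from ordered paths."""
    first = Counter(p[0] for p in paths)
    pi0 = {g: c / len(paths) for g, c in first.items()}
    counts = defaultdict(Counter)
    for p in paths:
        for r, s in zip(p, p[1:]):
            counts[r][s] += 1
    Pi = {r: {s: c / sum(row.values()) for s, c in row.items()}
          for r, row in counts.items()}
    return pi0, Pi

def path_likelihood(path, pi0, Pi):
    prob = pi0.get(path[0], 0.0)
    for r, s in zip(path, path[1:]):
        prob *= Pi.get(r, {}).get(s, 0.0)
    return prob

def gene_set_gibbs(gene_sets, burn_in, n_samples, seed=0):
    """Return the most frequently sampled order for each gene set."""
    rng = random.Random(seed)
    # t = 0: a random order for every gene set.
    orders = [tuple(rng.sample(sorted(gs), len(gs))) for gs in gene_sets]
    history = [Counter() for _ in gene_sets]
    for t in range(1, burn_in + n_samples + 1):
        for i, gs in enumerate(gene_sets):
            # Leave the i-th set out and re-estimate the chain from the rest.
            pi0, Pi = estimate_markov(orders[:i] + orders[i + 1:])
            candidates = list(permutations(sorted(gs)))
            weights = [path_likelihood(c, pi0, Pi) for c in candidates]
            if sum(weights) == 0:  # fall back to a uniform draw
                weights = [1.0] * len(candidates)
            orders[i] = rng.choices(candidates, weights=weights, k=1)[0]
        if t > burn_in:  # collect samples only after burn-in
            for i, order in enumerate(orders):
                history[i][order] += 1
    return [h.most_common(1)[0][0] for h in history]

# Hypothetical unordered gene sets.
gene_sets = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "d"}]
for order in gene_set_gibbs(gene_sets, burn_in=50, n_samples=100):
    print(" -> ".join(order))
```

  The returned orders can then be joined edge by edge to form the reconstructed
  network.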

  <subsection|Burn-in state>

  A burn-in state in Algorithm <reference|algorithm1> refers to the stage after
  which we start collecting sampled pathways. Samples collected after the
  burn-in state are assumed to be drawn from the joint distribution of IFGS's.
  To determine an appropriate burn-in state, we adapted the approach
  presented in (<cite-raw|Gelman03>, <cite-raw|Givens05>) to our framework to
  compute the ratio

  <\equation>
    R=<frac|<frac|N-1|N>*W<rsub|v>+<frac|1|N>*B<rsub|v>|W<rsub|v>><label|eq:r>
  </equation>

  for three quantities: sensitivity, specificity and positive predictive value
  (PPV). Here, <math|N> is
  the total number of pathways sampled after the burn-in state, <math|W<rsub|v>>
  is the averaged within-chain variance and <math|B<rsub|v>> is the between-chain
  variance. Moreover, Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP) and
  PPV = TP/(TP+FP), where TP = number of true positives, TN = number of true
  negatives, FP = number of false positives, and FN = number of false
  negatives. In practice, if <math|<sqrt|R>\<less\>1.2>, the choice of burn-in
  state and <math|N> is acceptable (also see <em|Supplementary Material>).
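
  The ratio in Eq. <reference|eq:r> can be computed from several independent
  runs as sketched below. The sketch assumes that each chain is a list of
  <math|N> values of one quantity (for example the sensitivity of each sampled
  pathway) and that the between-chain variance follows the usual
  Gelman-Rubin convention, i.e. <math|N> times the sample variance of the chain
  means. That convention and the function name are assumptions made for
  illustration.

```python
from math import sqrt
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Ratio R from equally long chains of a scalar summary statistic."""
    n = len(chains[0])
    within = mean([variance(chain) for chain in chains])      # W_v
    between = n * variance([mean(chain) for chain in chains])  # B_v (assumed convention)
    return ((n - 1) / n * within + between / n) / within

# Hypothetical sensitivity traces from two chains.
chains = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.5]]
r = potential_scale_reduction(chains)
print(sqrt(r) < 1.2)  # acceptance rule from the text
```

  At least two chains with non-constant values are needed, since the
  within-chain variance appears in the denominator.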

  <subsection|Computational complexity>

  The worst case time complexity of Gene Set Gibbs Sampler is
  <math|O<around|(|N*m*<around|(|m+n+F*L|)>|)>>, where <math|N> is the number of sampled
  pathways, <math|m> is the number of IFGS's, <math|n> is the number of
  distinct genes, <math|L> is the length of the longest gene set in the input
  and <math|F=L>!. As longer gene sets <math|<around|(|L\<geq\>10|)>> are
  less likely to correspond to information flows, the complexity arising from
  <math|F*L> could be managed by appropriately selecting the length of gene
  sets in Stage I. Thus, the computational complexity of our algorithm
  increases quadratically with increase in the number of IFGS's, which
  compares very favorably with the Bayesian network approaches.

  <\specified-algorithm|Network2GeneSets>
    <label|algorithm2><algo-state|<with|font-series|bold|Input:> A directed
    acyclic graph with <math|n> nodes>

    <algo-state|<with|font-series|bold|Output:> All IFGS's>

    <\algo-for|<math|i=1,\<ldots\>,n>>
      <\algo-if-else-if|node <math|i> has no children|<algo-state|continue>>
        <algo-if-else-if|node <math|i> has children|<algo-state|add to Queue
        <math|Q> and the Linked List <math|L> all the directed pairs
        consisting of <math|i> and a child of <math|i>>>

        <\algo-while|<math|Q> is not empty>
          <algo-state|Pop an information flow <math|P> from <math|Q>>

          <algo-if-else-if|the last node in <math|P>, say <math|k>, has no
          children|<algo-state|continue>>

          <algo-state|add to <math|Q> and <math|L>, all information flows
          obtained by appending each child of <math|k> to <math|P>>
        </algo-while>
      </algo-if-else-if>
    </algo-for>

    <algo-state|Prune information flows in <math|L> of length 2 (prior
    knowledge)>

    <algo-state|Randomly permute orders of information flows in <math|L> and
    order of genes in each information flow>

    <algo-state|Return all IFGS's of length <math|\<geq\>3>.>
  </specified-algorithm>
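
  Algorithm <reference|algorithm2> can be sketched in Python as below. The
  sketch assumes that the directed acyclic graph is given as a dictionary
  mapping each node to a list of its children; this representation and the
  function name are illustrative assumptions.

```python
import random
from collections import deque

def network_to_gene_sets(children, seed=0):
    """Enumerate all directed paths with at least 3 nodes and shuffle each one.

    children : dict mapping a node to a list of its children (must be acyclic,
               otherwise the enumeration does not terminate)
    """
    rng = random.Random(seed)
    flows = []
    for node in children:
        # Start from every directed pair (node, child).
        queue = deque((node, child) for child in children.get(node, ()))
        while queue:
            path = queue.popleft()
            flows.append(path)
            # Extend the path by each child of its last node.
            for child in children.get(path[-1], ()):
                queue.append(path + (child,))
    # Paths with two nodes are treated as prior knowledge and pruned.
    flows = [flow for flow in flows if len(flow) >= 3]
    rng.shuffle(flows)
    gene_sets = []
    for flow in flows:
        genes = list(flow)
        rng.shuffle(genes)  # discard the ordering information
        gene_sets.append(genes)
    return gene_sets

# Hypothetical four-node network.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(network_to_gene_sets(dag))
```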

  <section|Data Analysis>

  We analyzed the performance of our proposed network inference framework by
  reconstructing three different gene regulatory networks. We obtained two
  gold standard directed networks from the <with|font-shape|italic|In Silico>
  Network Challenge in the DREAM initiative. The two networks are the
  <with|font-shape|italic|In Silico> network
  (<cite-raw|Mendes09|Stolovitzky09>) from the DREAM2 and the
  <with|font-shape|italic|E. coli> network (<cite-raw|Marback09>,
  <cite-raw|Marback10>, <cite-raw|Prill10>) from the DREAM3 network challenges.
  The <with|font-shape|italic|E. coli> and <with|font-shape|italic|In Silico>
  networks each consist of <math|50> nodes, with <math|62> and <math|37> true
  edges, respectively. Availability of gold standard networks allows us to
  assess the performance of the proposed approach. In addition, we also
  implemented our two-stage GSGS framework to reconstruct signaling pathways
  in breast cancer cells.

  <subsection|Derivation of IFGS's>

  From the <em|E. coli> and <em|In Silico> networks, two collections of
  IFGS's were derived by a direct application of Algorithm
  <reference|algorithm2>. Indeed, Algorithm <reference|algorithm2> finds all
  unordered gene sets from a given network. The algorithm first finds all
  IF's (linear paths) in the network and then randomly permutes the ordering
  of genes in each IF. We may note that Algorithm <reference|algorithm2> is
  more general than the standard Depth First Search (DFS) algorithm in that
  the latter does not find all the linear paths. There were a total of
  <math|125> and <math|57> IFGS's of length <math|\<geq\>3> for the <em|E.
  coli> and <em|In Silico> networks, respectively. These collections of
  IFGS's serve as input for Gene Set Gibbs Sampler (Algorithm
  <reference|algorithm1>).

  We also derived IFGS's using the C4 gene set compendium (computational gene
  sets) from MSigDB (<cite-raw|Subra05>). There are a total of <math|883>
  overlapping cancer gene sets and <math|10,124> distinct genes in the
  compendium. We identified significant genes
  <math|<around|(|P*<around|(|X\<geq\>x|)>\<geq\>0.95|)>> by fitting a power
  law distribution on the degrees of <math|10,124> genes (Fig. 6,
  <with|font-shape|italic|Supplementary Material>). We obtained a total of
  <math|289> genes using this selection procedure. We also collected
  <math|299> samples from breast cancer patients profiled on the Affymetrix HG-U133 plus
  2.0 platform. A total of <math|267> out of <math|289> selected genes could
  be mapped to the annotation table for the Affymetrix HG-U133 plus 2.0
  platform. For each of the <math|267> genes, we selected the gene expression
  levels of the single probe set with the highest average measurement
  across the <math|299> samples. The resulting data set contained
  <math|267> rows (genes) and <math|299> columns (samples). We performed SVD
  on the breast cancer gene expression data of size
  <math|267\<times\>299<nbsp><around|(|p\<times\>q|)>> and considered
  <math|k> basis vectors corresponding to the <math|k> largest singular values. As
  mentioned in Section 2, we chose <math|k> by setting <math|k=max
  <around|{|r:r*<around|(|p+q|)>\<less\>p*q|}>=141>. To identify the most
  significant candidates for IFGS's, the top <math|2%> of the entries across
  the <math|k> basis vectors were declared as non-zero and the remaining entries
  were set to zero. We derived a total of <math|138> candidate gene sets by
  identifying genes corresponding to non-zero entries among <math|k> basis
  vectors. We discarded 3 gene sets by requiring each gene set to contain at least
  3 genes. To measure the enrichment of gene sets, we further performed gene
  set enrichment analysis using the functional annotation tool in DAVID
  (<cite-raw|Dennis03>, <cite-raw|Huang09>). DAVID performs gene set
  enrichment analysis using a modified Fisher Exact Test. We used Affymetrix
  Human Genome U133 Plus 2.0 Array as background to test the enrichment of
  gene sets. With the other parameters in DAVID set to their defaults, <math|106>
  enriched gene sets containing a total of <math|212> distinct genes were
  derived. The enriched gene sets serve as IFGS's.

  <big-figure|<with|par-mode|center|<image|fig3.eps||||><label|fig:
  f3>>|<with|font-size|0.84|Sensitivity analysis for the GSGS approach with
  increasing percentage of prior knowledge. Network: <em|E. coli>. In blocks
  (a)-(f), <math|x>-axis represents the percentage of gene sets present in
  the input and <math|y>-axis plots the total number of edges predicted by
  GSGS (solid line). The dashed line plots correspond to the ground truth.
  Here, we have considered only those genes which were present among IFGS's
  after pruning all gene pairs.>>

  <subsection|Performance evaluation using <em|E. coli> network>

  We now analyze the performance of Gene Set Gibbs Sampler using the <em|E. coli>
  network. Analogous results for the <with|font-shape|italic|In Silico> network
  are presented in the <with|font-shape|italic|Supplementary Material>. Using
  Gene Set Gibbs Sampler (Algorithm <reference|algorithm1>), we collected a
  total of <math|500> networks after a burn-in period, which we fixed at
  <math|500>. As all gene pairs are pruned by Algorithm
  <reference|algorithm2>, some genes might be lost and never appear in the
  input list of IFGS's. We compare the network predicted by Algorithm
  <reference|algorithm1> with the subnetwork formed by genes present in the
  input. A detailed list of settings is presented in the <em|Supplementary
  Material>. With the chosen set of parameters, <math|<sqrt|R>> in Eq.
  <reference|eq:r> was found approximately equal to one, for each of the
  three quantities sensitivity, specificity and PPV. We used the total number
  of predicted true edges and F-score to assess the performance of Algorithm
  <reference|algorithm1>. The F-score is defined as
  <math|F=2*p*r/<around|(|p+r|)>>. Here, <math|r> is the sensitivity and
  <math|p> is the PPV. <vspace|0.20cm>
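
  As a small worked illustration of this definition, the F-score can be
  computed from a set of true edges and a set of predicted edges. The function
  name <verbatim|f_score> and the representation of edges as ordered pairs are
  assumptions made for this sketch only.

```python
def f_score(true_edges, predicted_edges):
    """F = 2*p*r / (p + r), with p the PPV and r the sensitivity."""
    true_edges = set(true_edges)
    predicted_edges = set(predicted_edges)
    tp = len(true_edges & predicted_edges)  # correctly predicted directed edges
    if tp == 0:
        return 0.0
    p = tp / len(predicted_edges)  # PPV (precision)
    r = tp / len(true_edges)       # sensitivity (recall)
    return 2 * p * r / (p + r)
```

  For example, with three true edges and four predicted edges of which two are
  correct, <math|p=1/2>, <math|r=2/3> and <math|F=4/7>.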

  <\big-table>
    <tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|1ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|5|5|cell-halign|c>|<cwith|1|-1|5|5|cell-rborder|1ln>|<cwith|1|-1|6|6|cell-halign|c>|<cwith|1|-1|6|6|cell-rborder|1ln>|<cwith|1|-1|7|7|cell-halign|c>|<cwith|1|-1|7|7|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|11|11|1|-1|cell-bborder|1ln>|<table|<row|<cell|>|<cell|0%>|<cell|20%>|<cell|40%>|<cell|60%>|<cell|80%>|<cell|100%>>|<row|<cell|20%>|<cell|0.430>|<cell|0.648>|<cell|0.748>|<cell|0.844>|<cell|0.926>|<cell|1>>|<row|<cell|40%>|<cell|0.496>|<cell|0.680>|<cell|0.792>|<cell|0.865>|<cell|0.937>|<cell|1>>|<row|<cell|60%>|<cell|0.513>|<cell|0.677>|<cell|0.790>|<cell|0.883>|<cell|0.943>|<cell|1>>|<row|<cell|80%>|<cell|0.468>|<cell|0.665>|<cell|0.780>|<cell|0.860>|<cell|0.947>|<cell|0.999>>|<row|<cell|100%>|<cell|0.457>|<cell|0.595>|<cell|0.719>|<cell|0.824>|<cell|0.923>|<cell|0.999>>|<row|<cell|120%>|<cell|0.459>|<cell|0.590>|<cell|0.704>|<cell|0.825>|<cell|0.913>|<cell|0.996>>|<row|<cell|140%>|<cell|0.450>|<cell|0.579>|<cell|0.722>|<cell|0.805>|<cell|0.909>|<cell|0.999>>|<row|<cell|160%>|<cell|0.422>|<cell|0.564>|<cell|0.691>|<cell|0.803>|<cell|0.913>|<cell|0.991>>|<row|<cell|180%>|<cell|0.434>|<cell|0.550>|<cell|0.679>|<cell|0.786>|<cell|0.897>|<cell|0.984>>|<row|<cell|200%>|<cell|0.425>|<cell|0.546>|<cell|0.676>|<cell|0.778>|<cell|0.877>|<cell|0.974>>>>>

    <vspace|0.20cm>

    <label|table1>
  </big-table|<with|font-size|0.84|F-scores calculated for the GSGS approach
  with increasing percentage of gene sets in the input (row) and prior
  knowledge (column). Network: <em|E. coli>. We observe a clear increasing
  trend in the F-scores in each row, indicating the positive impact of
  incorporating prior knowledge, while a clear trend of similarity is
  observed within each column, indicating a marked robustness of the
  performance of GSGS to the over-sampling and under-sampling of gene sets.>>

  To account for real-world under-sampling and over-sampling
  situations, we first performed a sensitivity analysis of the GSGS approach
  using the <em|E. coli> network. Fig. <reference|fig: f3> demonstrates the
  effect of removing and adding unordered gene sets to the input list of
  IFGS's in Algorithm <reference|algorithm1>. In Fig. <reference|fig: f3>,
  <math|x>-axis represents the percentage of gene sets present in the input
  list, where <math|20%> means that <math|80%> of the gene sets were randomly
  removed from the list of all IFGS's, and <math|120%> means that <math|20%>
  of randomly sampled gene sets were added to the original list of all
  IFGS's. In Fig. <reference|fig: f3>, we present the performance of our
  approach in terms of the total number of predicted true edges. In blocks
  (a)-(f), the number of edges identified by the GSGS approach remains close
  to the ground truth. We also observe the positive effect of incorporating
  prior knowledge. As the percentage of prior knowledge increases (block (a)
  to block (f)), the difference between the ground truth and the prediction
  decreases. In particular, our approach does not produce a large number of
  false positives in the presence of redundant gene sets.
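
  The under-sampling and over-sampling of the input list can be sketched as
  below. The function name <verbatim|resample_gene_sets> is hypothetical, and
  we assume for illustration that the additional gene sets in the over-sampled
  case are drawn with replacement from the original list, so that they are
  redundant copies; the exact sampling scheme used in the study is given in
  the settings of the Supplementary Material.

```python
import random

def resample_gene_sets(gene_sets, percent, seed=0):
    """Return a list containing `percent` % of the original number of gene sets."""
    rng = random.Random(seed)
    n = len(gene_sets)
    target = round(n * percent / 100)
    if target <= n:
        # Under-sampling: keep a random subset, e.g. 20% keeps one set in five.
        return rng.sample(list(gene_sets), target)
    # Over-sampling (assumption): append redundant copies of randomly chosen sets.
    extra = [rng.choice(gene_sets) for _ in range(target - n)]
    return list(gene_sets) + extra
```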

  In Table <reference|table1>, we present the F-scores for the GSGS approach
  with increasing percentage of gene sets (rows) and prior knowledge
  (columns). We observe that the F-scores increase with an increase in the
  percentage of prior knowledge (values in a row), and these scores remain
  close on removal or addition of gene sets (values in a column),
  demonstrating a marked robustness to under-sampling and over-sampling.
  This observation strongly supports the applicability of our GSGS framework
  in real-world scenarios, where we often do not observe all gene sets or
  the observed gene sets are redundant.

  <big-figure|<with|par-mode|center|<image|fig2.eps||||><label|fig:f2>>|<with|font-size|0.84|A
  sketch of the idea behind comparing the GSGS approach with Bayesian network
  approaches. Note that the underlying network from which gene sets are
  derived is a directed network. Moreover, gene sets can equivalently be
  represented as a matrix of binary discrete values. Bayesian networks are
  the best choice in this case to fairly assess the performance of GSGS.
  Bayesian network approaches accommodate both discrete and continuous data,
  and reconstruct a directed network.>>

  We also compare the performance of our approach with a number of popular
  network inference approaches (<cite-raw|Margolin06>, <cite-raw|Meyer08>)
  with a primary emphasis on the two Bayesian network approaches, K2 and MCMC
  (Metropolis-Hastings or MH) implemented in the Bayes Net Tool Box (BNT)
  (<cite-raw|Murphy01b>b, <slink|http://sourceforge.net/projects/bnt/files/>).
  The main reasons are the following: (1). From a methodological point of view, our
  method infers the most probable linear structure(s) using likelihood scores
  calculated from the products of conditional probabilities. It is
  essentially in the same spirit as Bayesian network approaches, while
  fundamentally different from other approaches based on the calculation of
  pair-wise similarity. (2). Both our approach and Bayesian network
  approaches naturally take discrete data in that a collection of gene sets
  can equivalently be represented as a matrix of binary discrete values.
  Indeed, each IFGS naturally corresponds to a binary sample derived by
  considering the presence and absence of a gene in the set. Most of the
  existing network reconstruction algorithms are more suitable for inferring
  an undirected network from continuous data sets.

  In Fig. <reference|fig:f2>, we sketch the idea behind comparing our
  approach with the Bayesian network approaches. Our goal in this paper is to
  infer the underlying directed network. Also note that a collection of gene
  sets can be represented as a matrix of binary discrete values. A binary
  sample corresponding to an IFGS can be derived by assigning a value
  <math|0> to the genes not present in the IFGS and <math|1> otherwise.
  Bayesian network approaches can accommodate both discrete and continuous
  data sets and reconstruct a directed network. The equivalent representation
  of gene sets as binary discrete data makes the comparison between our gene
  set based approach and the Bayesian network approaches fair. In
  addition, we also generated continuous data to serve as input for the
  Bayesian network and other approaches (<cite-raw|Margolin06>,
  <cite-raw|Meyer08>). Thus, using the <with|font-series|bold|same underlying
  network>, e.g. the <em|E. coli> network, as the sole input (Fig.
  <reference|fig:f2>): (1). We generate discrete data inputs for Gene Set
  Gibbs Sampler (Algorithm <reference|algorithm1>) by collecting IFGS's in
  the output of Algorithm <reference|algorithm2>. (2). We generate discrete data inputs for K2 and
  MH by considering the absence (<math|0>) or presence (<math|1>) of a gene
  in each IFGS in the output of Algorithm <reference|algorithm2>. (3). We
  generate continuous data inputs for K2, MH and MINET using BNT.
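
  The equivalence between a collection of gene sets and a binary data matrix
  can be illustrated with a short sketch. The function name
  <verbatim|gene_sets_to_binary> and the alphabetical ordering of the columns
  are choices made for this example only.

```python
def gene_sets_to_binary(gene_sets, genes=None):
    """Encode each gene set as a 0/1 row: 1 if the gene is present, 0 otherwise."""
    if genes is None:
        genes = sorted(set().union(*gene_sets))
    index = {g: j for j, g in enumerate(genes)}
    rows = []
    for gene_set in gene_sets:
        row = [0] * len(genes)
        for g in gene_set:
            row[index[g]] = 1
        rows.append(row)
    return genes, rows
```

  Each row of the result is one binary sample, so the matrix has one row per
  IFGS and one column per gene.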

  <big-figure|<image|fig4_ecoli.eps||||> <image|fig5_ecoli.eps||||>
  <label|fig: f4>|<with|font-size|0.84|Network: <em|E. coli>. (Upper Panel)
  Comparison of the GSGS approach with K2 and MH in terms of Total Number of
  Predicted Edges with increasing percentage of prior knowledge. In each
  panel ``Method-N" stands for a Bayesian network method applied to
  continuous data of sample size N, and ``Method-DIS" corresponds to using
  binary discrete data. Bayesian Information Criterion (BIC) and Bayesian
  scoring were used on the corresponding data sets. The dashed line
  represents ground truth. (Lower Panel) Comparison of the GSGS approach with
  K2 and MH in terms of F-score. Here <math|x>-axis represents the percentage
  of prior knowledge and <math|y>-axis plots F-scores from three
  approaches.>>

  <with|par-columns|1|<big-figure|<with|par-mode|center|<image|ecoli_true.eps||||>
  <image|ecoli_pred.eps||||> <image|insilico_true.eps||||>
  <image|insilico_pred.eps||||><label|fig: f5>>|<with|font-size|0.84|A proof
  of principle study. Left panels show two gold standard networks,
  <with|font-shape|italic|E. coli> (Upper) and <with|font-shape|italic|In
  Silico> (Lower); Right panels show the corresponding predicted networks by
  GSGS, <with|font-shape|italic|E. coli> (Upper) and
  <with|font-shape|italic|In Silico> (Lower). For a fair comparison, all
  stand-alone linear paths of length <math|2> are removed from both networks.
  On the right panels, the blue edges correspond to true positives and gray
  edges represent false positives. Figures were generated using Cytoscape
  (<cite-raw|Shannon03>).>>>

  The K2 approach (<cite-raw|Cooper92>) requires a pre-specified
  ordering of the nodes in the underlying network. Initially, each
  node has no parents. For each node, the algorithm greedily adds the parent
  whose addition increases the score of the resulting structure the most. For
  the <math|i<rsup|t*h>> node, parents are chosen from the set of nodes with
  index <math|1,\<ldots\>,i-1>. On the other hand, the MH algorithm
  (<cite-raw|Murphy01a>a) starts from an initial directed acyclic network
  <math|G<rsub|0>> and selects a network <math|G<rsub|1>> uniformly from the
  neighborhood of <math|G<rsub|0>>. The neighborhood of a network <math|G> is
  the collection of all directed acyclic networks which differ from <math|G>
  by addition, deletion or reversal of a single edge. The algorithm accepts
  or rejects the move from <math|G<rsub|0>> to <math|G<rsub|1>> by computing
  an acceptance ratio defined in terms of marginal likelihood ratio
  <math|P<around|(|D\|G<rsub|1>|)>/P<around|(|D\|G<rsub|0>|)>>. Here <math|D>
  represents the given data. This procedure is iterated starting from the
  most recent network. A specified number of networks are collected after
  the burn-in period. For scoring a structure, BNT implements the Bayesian Information
  Criterion (<cite-raw|Schwartz78>) and Bayesian score functions
  (<cite-raw|Cooper92>).
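
  The accept or reject step of the MH algorithm described above can be
  sketched as follows. This is a simplified illustration and not the BNT
  implementation: the function name <verbatim|mh_accept> is hypothetical, and
  we assume a uniform prior over structures and a symmetric proposal, so that
  the acceptance probability reduces to the minimum of one and the marginal
  likelihood ratio. Log marginal likelihoods are used for numerical stability.

```python
import math
import random

def mh_accept(log_ml_current, log_ml_proposed, rng=random):
    """Accept the move G0 -> G1 with probability min(1, P(D|G1) / P(D|G0))."""
    log_ratio = log_ml_proposed - log_ml_current
    if log_ratio >= 0:
        # The proposed network scores at least as well: always accept.
        return True
    # Otherwise accept with probability equal to the likelihood ratio.
    return rng.random() < math.exp(log_ratio)
```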

  <\big-table>
    <tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|1ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|5|5|cell-halign|c>|<cwith|1|-1|5|5|cell-rborder|1ln>|<cwith|1|-1|6|6|cell-halign|c>|<cwith|1|-1|6|6|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|2|2|1|-1|cell-bborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<table|<row|<cell|>|<cell|GSGS>|<cell|CLR>|<cell|ARACNE>|<cell|MRNET>|<cell|MRNETB>>|<row|<cell|<with|font-shape|italic|E.coli>>|<cell|0.457>|<cell|0.230>|<cell|0.377>|<cell|0.303>|<cell|0.228>>|<row|<cell|<with|font-shape|italic|In
    Silico>>|<cell|0.431>|<cell|0.238>|<cell|0.425>|<cell|0.389>|<cell|0.327>>>>>

    <vspace|0.20cm>

    <label|table2>
  </big-table|<with|font-size|0.84|Performance comparison of GSGS with four
  other pair-wise similarity based network reconstruction approaches using
  F-scores. The sample size is <math|50>. >>

  In the upper panel of Fig. <reference|fig: f4>, we plot the results from a
  comparative study in terms of total number of predicted edges. It is clear
  that K2 and MH predict many false positives. In the lower panel of Fig.
  <reference|fig: f4>, we have plotted the F-scores for different approaches
  with increasing percentage of prior knowledge. We observe that the F-scores for
  the GSGS approach are significantly higher than those for K2 and MH. Further, the
  impact of incorporating prior knowledge on the F-score is more prominent for
  GSGS than for K2 and MH. F-scores for both K2 and MH remain much lower
  than those of the GSGS approach even in the presence of a large amount of prior
  knowledge. For similar results using the <em|In Silico> network, we refer to
  the <em|Supplementary Material>. We also compare GSGS with four other
  approaches without using prior knowledge. The F-score results are presented
  in Table <reference|table2>. In Figure <reference|fig: f5>, we provide more
  detailed evidence of the superior performance of our method using both the
  <em|In Silico> and <em|E. coli> networks. In Figure <reference|fig: f5>, the two
  left panels represent the true topologies of both networks, and the two right
  panels represent the reconstructed network topologies using GSGS. In each
  reconstructed network, blue edges represent true positives and gray edges
  represent false positives. A high level of accuracy is observed in both the
  reconstructed networks.

  <subsection|Pathway Reconstruction in Breast Cancer Cells>

  Before using the IFGS's for signaling pathway reconstruction, we validated
  our underlying assumption that a large network is built from unordered and
  overlapping IFGS's. We measured the amount of overlap among IFGS's by
  computing the number of genes shared by different numbers of gene
  sets (Fig. 7, <with|font-shape|italic|Supplementary Material>). A minimum
  of <math|75%> of all genes were found to be shared by at least two
  IFGS's. An exponentially truncated power law distribution
  (<math|y\<propto\>x<rsup|-\<alpha\>>*e<rsup|-\<beta\>*x>>) was fitted on
  the degrees of genes (Fig. 8, <with|font-shape|italic|Supplementary
  Material>). Such networks naturally occur in biology (<cite-raw|Ghaz06>).

  <big-figure|<with|par-mode|center|<image|fig6.eps||||><label|fig:
  f6>>|<with|font-size|0.84|A partial view of the subnetwork formed by nodes
  with a minimum of five first order neighbors in the network reconstructed
  from the genes related to breast cancer. Figure was generated using
  Cytoscape (<cite-raw|Shannon03>).>>

  A total of <math|20> candidate signaling pathways were predicted from <math|20>
  independent runs of Algorithm <reference|algorithm1>. To
  summarize them as a single network, we declared all the edges appearing in at least
  <math|5> networks as true edges, for a fair compromise between sensitivity
  and PPV (Fig. 9, <with|font-shape|italic|Supplementary Material>). In Fig.
  <reference|fig: f6>, we present a subnetwork formed by nodes with at least
  5 first order neighbors in the reconstructed network. Indeed, nodes with
  high connectivity are likely to participate in many signal transduction
  events. We made use of GeneCards (<cite-raw|Safran10>) to verify the
  relevance of genes in the subnetwork with breast cancer and signaling
  events. We found that many genes, e.g. BMP10, CCL2, CCR1, COL19A1, CXCR4,
  EPHB2, FLT1, FOS, GNG4, ITGB5 and MDM2 shown in Fig. <reference|fig: f6>,
  are involved in the molecular mechanisms of cancer. In addition, MDM2 is
  involved in HER-2 signaling in breast cancer, POLR2I in hereditary
  signaling in breast cancer and ATF2 in Estrogen-dependent breast cancer
  signaling (Sigma-Aldrich <slink|www.sigmaaldrich.com>). CXCR4 is highly
  expressed in breast cancer cells (<cite-raw|Muller01>, RefSeq
  <slink|www.ncbi.nlm.nih.gov/refseq>) whereas GJA1 is a marker for detecting
  early oncogenesis in the breast (Genatlas
  <slink|http://genatlas.medecine.univ-paris5.fr>). RRM1 is located in the
  imprinted gene domain of 11p15.5 (an important tumor-suppressor gene
  region). Alterations in this region are associated with breast cancer
  (RefSeq). ATF2 and its two direct neighbors WAS and ITGB5 participate in
  CDC42 pathway (Applied Biosystems Pathway
  <slink|www.appliedbiosystems.com>). Similarly, BMP10 and HMGN1 are involved
  in ERK signaling, and EPHB2 and KCNA5 in PI3K signaling. Genes appearing in
  the directed path from FLT1 to EPHB2 via BMP10 and ATF2, and genes in the
  path from GNG4 to EPHB2 via BMP10 and ATF2, are highly relevant to MAPK
  signaling and P38 signaling. For example, BMP10 is connected to ATF2 by a
  linear path. It has been reported that TAK1 and the SMAD pathways activated
  by BMPs activate several transcription factors like ATF2
  (<cite-raw|Monzen01>). Similarly, FLT1 and GNG4 which are closely situated
  and connected by a linear path, have been reported to participate in many
  signaling events, e.g. ERK signaling, PI3K Signaling, P38 signaling and
  MAPK signaling. This evidence further supports the use of the GSGS framework
  for signaling pathway reconstruction.
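
  The summarization of the <math|20> predicted networks into a single network,
  and the extraction of highly connected nodes, can be sketched as below. The
  function names <verbatim|consensus_edges> and <verbatim|hub_nodes> are
  hypothetical, and we assume for illustration that first order neighbors are
  counted regardless of edge direction.

```python
from collections import Counter, defaultdict

def consensus_edges(networks, min_count=5):
    """Keep the edges that appear in at least min_count of the given networks."""
    counts = Counter(edge for net in networks for edge in set(net))
    return {edge for edge, c in counts.items() if c >= min_count}

def hub_nodes(edges, min_neighbors=5):
    """Nodes with at least min_neighbors distinct first order neighbors."""
    neighbors = defaultdict(set)
    for u, v in edges:
        # Assumption: direction is ignored when counting neighbors.
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node for node, nb in neighbors.items() if len(nb) >= min_neighbors}
```

  Here each network is a collection of directed edges given as ordered pairs
  of gene names.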

  <section|Conclusion>

  In this paper, we proposed a novel computational framework, GSGS, to
  reconstruct signaling pathways from gene sets. As far as we know, the
  proposed framework is original in the following aspects: (1). It offers a
  unique two-stage framework for network reconstruction by combining
  knowledge from existing gene sets and molecular profiling data from
  high-throughput platforms. (2). The ordering of genes in each gene set is
  treated as a random variable to capture the higher order interactions among
  genes participating in signal transduction events. In most of the existing
  approaches, individual genes are treated as variables. (3). The problem of
  signaling pathway reconstruction is cast into the framework of parameter
  estimation for a multivariate distribution. (4). The true signaling
  pathways are modeled as a probability distribution of sample signaling
  pathways.

  We first assessed the performance of our network inference algorithm by
  using two gold standard networks: <with|font-shape|italic|E. coli> and
  <with|font-shape|italic|In Silico>. Our approach was shown to have
  significantly better performance in terms of F-score and total number of
  predicted edges than the Bayesian network and other pairwise similarity
  based approaches (<cite-raw|Margolin06>, <cite-raw|Meyer08>). Robustness of
  our approach against under-sampling or over-sampling of gene sets was
  demonstrated by performing sensitivity analysis. We applied our GSGS framework to
  reconstruct a network in breast cancer cells, and verified it using
  existing database knowledge. Overall, our analyses favor the use of our
  two-stage GSGS framework in the inference of complicated signaling
  pathways.

  The advent of systems biology has been accompanied by the blooming of
  network construction algorithms, many of which treat gene pairs as the
  basic building block of the signaling pathways and reconstruct signaling
  pathways by simultaneously detecting co-expressed gene pairs using
  molecular profiling data (e.g. <cite-raw|Butte03>, <cite-raw|Zhu05>,
  <cite-raw|Margolin06>, <cite-raw|Meyer08>). These approaches enjoy
  simplicity and a much lighter computational load, but gene pairs do not
  represent entire signal transduction pathways. Other approaches, such as
  Bayesian networks (e.g. <cite-raw|Cooper92>, <cite-raw|Song09>),
  heuristically search for the higher scored network structure(s). Many
  network structures may be found to be statistically plausible, but, like
  gene pairs, they do not necessarily represent the real signal
  transduction mechanisms. Moreover, the computational load of searching for a
  higher scored network is prohibitively high, and a number of assumptions on
  the network structures have to be made, such as small size of the parent
  sets. Our GSGS framework infers the most likely signaling pathway(s) from a
  probability distribution of sampled signaling pathways using overlapping
  gene sets inferred from molecular profiling data. The reconstructed
  information flows are faithful representations of the real-world signal
  transduction mechanisms. The advantages of gene set based computational
  approaches have been adequately demonstrated in many bioinformatics
  research areas, for example, disease classification and enrichment
  analysis. We expect our gene set based GSGS framework to open a new avenue
  in methodology research of signal transduction.

  <section*|Acknowledgments>

  This work was supported by NIH grant R21LM010137 to D.Z.

  <\thebibliography|42>
    <bibitem-with-key|Altay and Emmert-Streib, 2010|Altay10a> Altay, G. and
    Emmert-Streib, F. (2010) Revealing differences in gene network inference
    algorithms on the network-level by ensemble methods.
    <with|font-shape|italic|Bioinformatics>, <with|font-series|bold|26>(14),
    1738-1744.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Butte and Kohane, 2003|Butte03> Butte, A.S. and Kohane,
    I.S. (2003) Relevance networks: a first step toward finding genetic
    regulatory networks within microarray data. In Parmigiani, G.,
    Garett,E.S., Irizarry,R.A. and Zeger,S.L. (eds),
    <with|font-shape|italic|The Analysis of Gene Expression Data>, Springer,
    New York, 428-446.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Cooper and Herskovits, 1992|Cooper92> Cooper, G.F. and
    Herskovits E. (1992) A Bayesian Method for the Induction of Probabilistic
    Networks from Data. <with|font-shape|italic|Machine Learning>,
    <with|font-series|bold|9>(4), 309-347.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Dennis et al., 2003|Dennis03> Dennis, G.Jr., Sherman,
    B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., Lempicki, R.A. (2003)
    DAVID: Database for Annotation, Visualization and Integrated Discovery.
    <with|font-shape|italic|Genome Biol.>,
    <with|font-series|bold|4>(5):P3.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Dobra et al., 2004|Dobra04> Dobra, A., Hans, C., Jones,
    B., Nevins, J.R. and West, M. (2004) Sparse graphical models for
    exploring gene expression data. <with|font-shape|italic|J. Multiv.
    Anal.>, <with|font-series|bold|90>, 196-212.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Friedman et al., 2000|Friedman00> Friedman N., Linial,
    M., Nachman, I. and Peer, D. (2000) Using Bayesian networks to analyze
    expression data, <with|font-shape|italic|Journal of Computational
    Biology>, <with|font-series|bold|7>, 601-620.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Gardner et al., 2003|Gardner03> Gardner, T.S., di
    Bernardo, D., Lorenz D. and Collins J.J. (2003) Inferring genetic
    networks and identifying compound mode of action via expression
    profiling. <with|font-shape|italic|Science>, <with|font-series|bold|301>
    (5629), 102-105.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Gelman et al., 2003|Gelman03> Gelman, A., Carlin, J.B.,
    Stern, H.S. and Rubin, D.B. (2003) Bayesian Data Analysis.
    <with|font-shape|italic|Chapman & Hall>.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Ghazalpour et al., 2006|Ghaz06> Ghazalpour, A., Doss,
    S., Zhang, B., Wang, S., Plaisier, C., Castellanos, R., Brozell, A.,
    Schadt, E.E., Drake, T.A., Lusis, A.J. and Horvath S. (2006) Integrating
    genetic and network analysis to characterize genes related to mouse
    weight. <with|font-shape|italic|PLoS Genet.>,
    <with|font-series|bold|2>(8):e130.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Givens and Hoeting, 2005|Givens05> Givens, G.H. and
    Hoeting, J.A. (2005) Computational Statistics.
    <with|font-shape|italic|Wiley Series in Probability and
    Statistics>.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Huang et al., 2009|Huang09> Huang, D.W., Sherman, B.T.,
    Lempicki, R.A. (2009) Systematic and integrative analysis of large gene
    lists using DAVID Bioinformatics Resources.
    <with|font-shape|italic|Nature Protoc.>, <with|font-series|bold|4>(1),
    44-57.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Kim and Tidor, 2003|Kim03> Kim, P.M. and Tidor, B.
    (2003) Subsystem identification through dimensionality reduction of
    large-scale gene expression data. <with|font-shape|italic|Genome
    Research>, <with|font-series|bold|13>(7),
    1706-1718.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Kishino and Waddell, 2000|Kishino00> Kishino, H. and
    Waddell, P.J. (2000) Correspondence analysis of genes and tissue types
    and finding genetic links from microarray data.
    <with|font-shape|italic|Genome Informatics>, <with|font-series|bold|11>,
    83-95.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Kubica et al., 2003|Kubica03> Kubica J., Moore A., Cohn
    D. and Schneider J. (2003) cGraph: A fast graph-based method for link
    analysis and queries. <with|font-shape|italic|Proc. IJCAI Text-Mining and
    Link-Analysis Workshop>, Acapulco, Mexico.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Marbach et al., 2009|Marback09> Marbach D., Schaffter
    T., Mattiussi C. and Floreano D. (2009) Generating Realistic in silico
    Gene Networks for Performance Assessment of Reverse Engineering Methods.
    <with|font-shape|italic|Journal of Computational Biology>,
    <with|font-series|bold|16>(2), 229-239.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Marbach et al., 2010|Marback10> Marbach D., Prill R.J.,
    Schaffter T., Mattiussi C., Floreano D. and Stolovitzky G. (2010)
    Revealing strengths and weaknesses of methods for gene network inference.
    <with|font-shape|italic|PNAS>, <with|font-series|bold|107>(14),
    6286-6291.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Margolin et al., 2006|Margolin06> Margolin A., Nemenman
    I., Basso K., Wiggins C., Stolovitzky G., Dalla Favera R. and Califano A.
    (2006) ARACNE: An Algorithm for the Reconstruction of Gene Regulatory
    Networks in a Mammalian Cellular Context. <with|font-shape|italic|BMC
    Bioinformatics>, <with|font-series|bold|7>(Suppl 1), S7.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Mendes, 2009|Mendes09> Mendes, P. (2009) Framework for
    Comparative Assessment of Parameter Estimation and Inference Methods in
    Systems Biology. <with|font-shape|italic|Learning and Inference in
    Computational Systems Biology> (Lawrence, N.D., Girolami, M., Rattray,
    M., Sanguinetti, G. eds.), MIT Press, Cambridge, MA,
    33-58.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Meyer et al., 2008|Meyer08> Meyer, PE, Lafitte, F. and
    Bontempi, G. (2008) Minet: An open source R/Bioconductor package for
    mutual information based network inference. <with|font-shape|italic|BMC
    Bioinformatics>, 9:461.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Monzen et al., 2001|Monzen01> Monzen K., Hiroi, Y.,
    Kudoh, S., Akazawa, H., Oka, T., Takimoto, E., Hayashi, D., Hosoda, T.,
    Kawabata, M., Miyazono, K., Ishii, S., Yazaki, Y., Nagai, R. and
    Komuro I. (2001) Smads, TAK1, and their common target ATF-2 play a
    critical role in cardiomyocyte differentiation.
    <with|font-shape|italic|J. Cell Biol.>, 153,
    687-698.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Muller et al., 2001|Muller01> Muller A., Homey B., Soto
    H., Ge N., Catron D., Buchanan M.E., McClanahan T., Murphy E., Yuan W.,
    Wagner S.N., Barrera J.L., Mohar A., Verastegui E., Zlotnik A. (2001)
    Involvement of chemokine receptors in breast cancer metastasis.
    <with|font-shape|italic|Nature>, 410:50-56.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Murphy, 2001|Murphy01a> Murphy K. (2001) Active
    learning of causal bayes net structure. Technical Report, UC
    Berkeley.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Murphy, 2001|Murphy01b> Murphy K. (2001) The Bayes net
    toolbox for MATLAB. <with|font-shape|italic|Computing Science and
    Statistics: Proceedings of Interface>, 33.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Pang et al., 2006|Pang06> Pang, H., Lin, A., Holford,
    M., Enerson, B.E., Lu, B., Lawton, M.P., Floyd, E., Zhao H. (2006)
    Pathway analysis using random forests classification and regression.
    <with|font-shape|italic|Bioinformatics>, <with|font-series|bold|22>,
    2028-2036.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Pang and Zhao, 2008|Pang08> Pang, H. and H. Zhao (2008)
    Building pathway clusters from Random Forests classification using class
    votes. <with|font-shape|italic|BMC Bioinformatics>,
    <with|font-series|bold|9>(87).<next-line><vspace*|0.25pt>

    <bibitem-with-key|Prill et al., 2010|Prill10> Prill R.J., Marbach D.,
    Saez-Rodriguez J., Sorger P.K., Alexopoulos L.G., Xue X., Clarke N.D.,
    Altan-Bonnet G. and Stolovitzky G. (2010) Towards a rigorous assessment
    of systems biology models: the DREAM3 challenges.
    <with|font-shape|italic|PLoS ONE>, <with|font-series|bold|5>(2):e9202.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Rabbat et al., 2005|Rabbat05> Rabbat, M.G., Treichler,
    J.R., Wood, S.L. and Larimore, M.G. (2005) Understanding the topology of
    a telephone network via internally-sensed network tomography.
    <with|font-shape|italic|Proc. IEEE International Conference on Acoustics,
    Speech, and Signal Processing>, <with|font-series|bold|3>, Philadelphia,
    PA, 977-980.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Rabbat et al., 2008|Rabbat08> Rabbat, M.G., Figueiredo,
    M.A.T. and Nowak, R.D. (2008) Network inference from co-occurrences.
    <with|font-shape|italic|IEEE Transactions on Information Theory>,
    <with|font-series|bold|54>(9), 4053-4068.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Richards et al., 2010|Richards10> Richards, A.J.,
    Muller, B., Shotwell, M., Cowart, L.A., Baerbel, R., and Lu, X. (2010)
    Assessing the functional coherence of gene sets with metrics based on the
    Gene Ontology graph. <with|font-shape|italic|Bioinformatics>,
    26(12):i79-i87.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Safran et al., 2010|Safran10> Safran M., Dalah I.,
    Alexander J., Rosen N., Iny Stein T., Shmoish M., Nativ N., Bahir I.,
    Doniger T., Krug H., Sirota-Madi A., Olender T., Golan Y., Stelzer G.,
    Harel A. and Lancet D. (2010) GeneCards Version 3: the human gene
    integrator. <with|font-shape|italic|Database>, Vol. 2010, No. 0. (5
    August 2010), baq020.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Schäfer and Strimmer, 2005|Schaffer05> Schäfer, J. and
    Strimmer K. (2005) An empirical Bayes approach to inferring large-scale
    gene association networks. <with|font-shape|italic|Bioinformatics>,
    <with|font-series|bold|21>, 754-764.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Schwarz, 1978|Schwartz78> Schwarz, G. (1978)
    Estimating the dimension of a model. <with|font-shape|italic|The Annals
    of Statistics>, <with|font-series|bold|6>(2),
    461-464.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Segal et al., 2003|Segal03> Segal, E., Shapira, M.,
    Regev, A., Peer, D., Botstein, D., Koller, D. and Friedman, N. (2003)
    Module networks: identifying regulatory modules and their
    condition-specific regulators from gene expression data.
    <with|font-shape|italic|Nat. Genet.>, <with|font-series|bold|34>,
    166-176.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Shannon et al., 2003|Shannon03> Shannon, P., Markiel,
    A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N.,
    Schwikowski, B. and Ideker, T. (2003) Cytoscape: a software environment
    for integrated models of biomolecular interaction networks.
    <with|font-shape|italic|Genome Research>, <with|font-series|bold|13>(11),
    2498-2504.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Shmulevich et al., 2002|Shmulevich02> Shmulevich, I.,
    Dougherty, E.R., Kim, S. and Zhang, W. (2002) Probabilistic Boolean
    Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks.
    <with|font-shape|italic|Bioinformatics>, <with|font-series|bold|18>(2),
    261-274.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Shmulevich et al., 2003|Shmulevich03> Shmulevich, I.,
    Gluhovsky, I., Hashimoto, R., Dougherty, E.R. and Zhang, W. (2003)
    Steady-state analysis of genetic regulatory networks modelled by
    probabilistic Boolean networks. <with|font-shape|italic|Comparative and
    Functional Genomics>, <with|font-series|bold|4>(6),
    601-608.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Song et al., 2009|Song09> Song, L., Kolar, M. and Xing,
    E.P. (2009) Time-Varying Dynamic Bayesian Networks.
    <with|font-shape|italic|In Proceedings of the 23rd Annual Conference on
    Neural Information Processing Systems, NIPS'09>.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Stolovitzky et al., 2009|Stolovitzky09> Stolovitzky, G.,
    Prill, R.J. and Califano, A. (2009) Lessons from the DREAM2 Challenges.
    <with|font-shape|italic|In Stolovitzky, G., Kahlem, P. and Califano, A.
    (Eds), Annals of the New York Academy of Sciences>,
    <with|font-series|bold|1158>, 159-195.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Subramanian et al., 2005|Subra05> Subramanian, A.,
    Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A.,
    Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P.
    (2005) Gene set enrichment analysis: A knowledge-based approach for
    interpreting genome-wide expression profiles.
    <with|font-shape|italic|Proc. Natl. Acad. Sci. USA>,
    <with|font-series|bold|102>, 15545-15550.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Tegner et al., 2003|Tenger03> Tegner, J., Yeung,
    M.K.S., Hasty, J. and Collins, J.J. (2003) Reverse engineering gene
    networks: integrating genetic perturbations with dynamical modeling.
    <with|font-shape|italic|Proc. Natl. Acad. Sci. USA>,
    <with|font-series|bold|100>(10), 5944-5949.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Zhu et al., 2005|Zhu05> Zhu, D., Hero, A.O., Qin, Z.S.
    and Swaroop, A. (2005) High throughput screening of co-expressed gene
    pairs with controlled false discovery rate (FDR) and minimum acceptable
    strength (MAS). <with|font-shape|italic|J. Comp. Biol.>,
    <with|font-series|bold|12>(7), 1029-1045.<next-line><vspace*|0.25pt>

    <bibitem-with-key|Zhu et al., 2006|Zhu06> Zhu, D., Rabbat, M.G., Hero,
    A.O., Nowak, R. and Figueiredo, M.A.T. (2006) <with|font-shape|italic|De
    Novo> Reconstructing Signaling Pathways from Multiple Data Sources. In
    <with|font-shape|italic|New Research in Signaling Transduction>, Nova
    Publisher, New York.
  </thebibliography>
</body>