<TeXmacs|1.99.7>

<style|<tuple|article|std-latex>>

<\body>
  <\hide-preamble>
    <new-theorem|lemma|Lemma>

    <new-theorem|proposition|Proposition>

    <new-theorem|theorem|Theorem>

    <new-theorem|definition|Definition>

    <new-theorem|example|Example>

    <new-theorem|corollary|Corollary>
  </hide-preamble>

  <doc-data|<doc-title|Topological Entropy>|<doc-author|<author-data|<author-name|David
  Koslicki>>>|<doc-date|<date|>>>

  <abstract-data|<\abstract>
    Topological entropy has been one of the most difficult to implement of
    all the entropy-theoretic notions. This is primarily due to finite sample
    effects and high-dimensionality problems. In particular, topological
    entropy has been implemented in previous literature to conclude that
    entropy of exons is higher than of introns, thus implying that exons are
    more ``random'' than introns. We define a new approximation to topological
    entropy free from the aforementioned difficulties. We compute its
    expected value and apply this definition to the intron and exon regions
    of the human genome to observe that, as expected, the entropy of introns
    is significantly higher than that of exons. Surprisingly, though, we find
    that introns are less random than expected: their entropy is lower than
    the computed expected value. We also observe the perplexing phenomenon that
    chromosome Y has atypically low and bi-modal entropy, possibly
    corresponding to random sequences (high entropy) and sequences that
    possess hidden structure or function (low entropy). A Mathematica
    implementation is available at: http://www.math.psu.edu/koslicki/entropy.nb
  </abstract>>

  <section|Introduction>

  Entropy, as a measure of information content and complexity, was first
  introduced by Shannon (1948). Since then entropy has taken on many forms,
  namely topological, metric (due to Shannon), Kolmogorov-Sinai, and Rényi
  entropy. These entropies were defined for the purpose of classifying a
  system via some measure of complexity or simplicity. These definitions of
  entropy have been applied to DNA sequences with varying levels of
  success. Topological entropy in particular is infrequently used due to
  high-dimensionality problems and finite sample effects. These issues stem
  from the fact that the mathematical concept of topological entropy was
  introduced to study <with|font-shape|italic|infinite> length sequences. It
  is widely recognized that the most difficult issue in implementing
  entropy techniques is the convergence problem caused by finite sample effects
  (Vinga and Almeida 2004; Kirillova 2000). A few different approaches to
  circumvent these problems with topological entropy and adapt it to
  <with|font-shape|italic|finite> length sequences have been attempted
  before. For example, in Troyanskaya <with|font-shape|italic|et al.>
  (2002), linguistic complexity (the fraction of total subwords to total
  possible subwords) is utilized to circumvent finite sample problems. This,
  however, leads to the observation that the complexity/randomness of intron
  regions is <with|font-shape|italic|lower> than the complexity/randomness of
  exon regions. However, in Colosimo and de Luca (2000) it is found that the
  complexity of randomly produced sequences is
  <with|font-shape|italic|higher> than that of DNA sequences, a result one
  would expect given the commonly held notion that intron regions of DNA are
  free from selective pressure and so evolve more randomly than do exon
  regions. Also, little has been done in the way of mathematically analyzing
  other finitary implementations of entropy, because most previous
  implementations use an entire function instead of a single value to
  represent entropy (so the expected value would be very difficult to
  calculate).

  In this paper we focus on topological entropy, introducing a new definition
  that has all the desired properties of an entropy and still retains
  connections to information theory. Unlike previous implementations, this
  approximation is a <with|font-shape|italic|single> number rather than
  an entire function, thus greatly speeding up the calculation
  time and removing high-dimensionality problems while allowing more
  mathematical analysis. This definition will allow the comparison of
  entropies of sequences of differing length, a property no other
  implementation of topological entropy has been able to incorporate. We will
  also calculate the expected value of the topological entropy to precisely
  draw out the connections between topological entropy and information
  content. We will then apply this definition to the human genome to observe
  that the entropy of intron regions is in fact higher than that of exon
  regions, as one would expect. We then provide evidence
  indicating that this definition of topological entropy can be used to
  detect sequences that are under selective pressure.

  <section|Methods>

  <subsection|Definitions and Preliminaries>

  We restrict our attention to the alphabet
  <math|<with|math-font|cal*|A>=<around|{|A,C,T,G|}>>. For a finite sequence
  <math|w> over the alphabet <math|<with|math-font|cal*|A>>, we use
  <math|<around|\||w|\|>> to denote the length of <math|w>. Of primary
  importance in the study of topological entropy is the complexity function
  of a sequence <math|w> (finite or infinite) formed over the alphabet
  <math|<with|math-font|cal*|A>>.

  <\definition>
    [Complexity function] For a given sequence <math|w>, the complexity
    function <math|p<rsub|w>:\<bbb-N\>\<rightarrow\>\<bbb-N\>> is defined as

    <\equation*>
      p<rsub|w><around|(|n|)>=<around|\||<around|{|u:<around|\||u|\|>=n<math-up|and
      u appears as a subword of>w|}>|\|>
    </equation*>
  </definition>

  That is, <math|p<rsub|w><around|(|n|)>> represents the number of different
  <math|n>-length subwords (overlaps allowed) that appear in <math|w>.
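
  As an illustration only, the complexity function can be sketched in a few
  lines of Python. The function name used here is our own choice and is not
  taken from the Mathematica implementation; the sketch simply collects the
  distinct length-<math|n> windows of <math|w> in a set.

```python
def complexity(w, n):
    """Number of distinct length-n subwords of w (overlapping windows allowed)."""
    # Every start position i yields one window w[i:i+n]; the set keeps each
    # distinct window once. If n exceeds len(w), no window exists and the count is 0.
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


if __name__ == "__main__":
    w = "ACGTACGT"
    print([complexity(w, n) for n in range(1, len(w) + 2)])
```

  For the short word ACGTACGT the count stays at 4 for small <math|n> and
  then decreases by one per step until it reaches zero beyond the length of
  the word.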

  Now the traditional definition of topological entropy of an
  <with|font-shape|italic|infinite> word <math|w> is the asymptotic
  exponential growth rate of the number of different subwords:

  <\definition>
    For an infinite sequence <math|w> formed over the alphabet
    <math|<with|math-font|cal*|A>>, the topological entropy is defined as

    <\equation*>
      lim<rsub|n\<rightarrow\>\<infty\>> <frac|log<rsub|4>
      p<rsub|w><around|(|n|)>|n>
    </equation*>
  </definition>

  Due to the limit in the above definition, it is easily observed that this
  definition will always lead to an answer of zero if applied directly to
  finite length sequences. This is due to the fact that the complexity
  function of infinite length sequences is non-decreasing, while of finite
  length sequences it is eventually zero. We include in figures
  <reference|complexity functions1> and <reference|complexity functions2>
  log-linear plots of the complexity functions for the gene ACSL4 found on
  ChrX:108906440-108976621 (hg19) as well as for an infinite string generated
  by a Markov chain on four states with equal transition probabilities.

  <\big-figure>
    <image|LogLinearACSL45.eps|4.5in|||><label|complexity functions1>
  </big-figure|Log-Linear Plot of the Complexity Function of the Gene ACSL4>

  <\big-figure>
    <image|LogLinearRandom6.eps|4.5in|||><label|complexity functions2>
  </big-figure|Log-Linear Plot of the Complexity Function of a Random
  Infinite Sequence.>

  The graph of the complexity function of the gene found in figure
  <reference|complexity functions1> is entirely typical of the graph of a
  complexity function for a finite sequence as can be seen by the following
  proposition. The proof can be found in the nice summary by Colosimo and de
  Luca (2000). Note that in the following <math|m> and <math|M> are numbers
  whose calculation is straightforward.

  <\proposition>
    [Shape of Complexity Function]<label|complexity function prop> For a
    finite sequence <math|w>, there are integers <math|m,M>, and
    <math|N=<around|\||w|\|>>, such that the complexity function
    <math|p<rsub|w><around|(|n|)>> is strictly increasing in the interval
    <math|<around|[|0,m|]>>, non-decreasing in the interval
    <math|<around|[|m,M|]>> and strictly decreasing in the interval
    <math|<around|[|M,N|]>>. In fact, for <math|n> in the interval
    <math|<around|[|M,N|]>> we have <math|p<rsub|w><around|(|n+1|)>-p<rsub|w><around|(|n|)>=-1>.
  </proposition>

  Now for a finite sequence <math|w> we desire that an approximation of
  topological entropy <math|H<rsub|t*o*p><around|(|w|)>> should have the
  following properties:

  <\enumerate>
    <item><math|0\<leq\>H<rsub|t*o*p><around|(|w|)>\<leq\>1>

    <item><math|H<rsub|t*o*p><around|(|w|)>\<approx\>0> if and only if
    <math|w> is highly repetitive (contains few subwords)

    <item><math|H<rsub|t*o*p><around|(|w|)>\<approx\>1> if and only if
    <math|w> is highly complex (contains many subwords)

    <item>For different length sequences <math|v,w>,
    <math|H<rsub|t*o*p><around|(|w|)>> and <math|H<rsub|t*o*p><around|(|v|)>>
    should be comparable
  </enumerate>

  Item 4 on this list is of utmost importance when
  implementing topological entropy. One must normalize with
  respect to length, since otherwise, when counting the number of subwords,
  longer sequences will appear artificially more complex simply because
  a longer sequence offers more chances for subwords
  to show up. This explains the ``linear correlation'' between sequence length
  and the implementations of topological entropy used in Karamanos
  <with|font-shape|italic|et al.> (2006) and Kirillova (2000). This also
  hints at the incomparability of the notions of entropy contained in
  Karamanos <with|font-shape|italic|et al.> (2006), Colosimo and de Luca
  (2000), Kirillova (2000), and Schmitt and Herzel (1997).

  Recall that an approximation of topological entropy should give an
  approximate asymptotic exponential growth rate of the number of subwords.
  With this and the above properties in mind, it is immediately concluded
  that we can disregard the values of <math|p<rsub|w><around|(|n|)>> for
  <math|n> in the interval <math|<around|[|m,N|]>> mentioned in proposition
  <reference|complexity function prop>. In fact, as noted in Colosimo and de Luca
  (2000), the only information gained by considering
  <math|p<rsub|w><around|(|n|)>> for <math|n> in the interval
  <math|<around|[|m,N|]>> has to do with the specific combinatorial
  arrangement of ``special factors'' and has little to do with the complexity
  of a sequence.

  We define the approximation to topological entropy as follows:

  <\definition>
    [Topological Entropy]<label|topological entropy> Let <math|w> be a finite
    sequence of length <math|<around|\||w|\|>>, let <math|n> be the unique
    integer such that

    <\equation*>
      4<rsup|n>+n-1\<leq\><around|\||w|\|>\<less\>4<rsup|n+1>+<around|(|n+1|)>-1
    </equation*>

    Then for <math|w<rsub|1><rsup|4<rsup|n>+n-1>> the first
    <math|4<rsup|n>+n-1> letters of <math|w>,

    <\equation*>
      H<rsub|t*o*p><around|(|w|)>\<assign\><frac|log<rsub|4><around|(|p<rsub|w<rsub|1><rsup|4<rsup|n>+n-1>><around|(|n|)>|)>|n>
    </equation*>
  </definition>
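
  The definition above can be sketched in Python as follows. This is a
  minimal illustration under our own naming (the helper
  <verbatim|window_exponent> and the constant <verbatim|ALPHABET_SIZE> are
  hypothetical names), not the Mathematica implementation referred to in the
  abstract. It finds the integer <math|n>, keeps the first
  <math|4<rsup|n>+n-1> letters, and counts distinct subwords of length
  <math|n>.

```python
import math

ALPHABET_SIZE = 4  # the alphabet {A, C, T, G}


def window_exponent(length):
    """Return the unique n with 4**n + n - 1 <= length < 4**(n + 1) + n."""
    if length < ALPHABET_SIZE:
        raise ValueError("need at least 4 letters, the window size for n = 1")
    n = 1
    while ALPHABET_SIZE ** (n + 1) + n <= length:
        n += 1
    return n


def topological_entropy(w):
    """Approximate topological entropy of a finite sequence w over {A, C, T, G}."""
    n = window_exponent(len(w))
    positions = ALPHABET_SIZE ** n      # number of length-n windows in the prefix
    prefix = w[:positions + n - 1]      # first 4**n + n - 1 letters of w
    distinct = {prefix[i:i + n] for i in range(positions)}
    # log base 4 written as log2(x) / 2, which is exact when x is a power of 4
    return math.log2(len(distinct)) / (2 * n)
```

  For instance, the 17-letter word AACAGATCCGCTGGTTA contains all 16
  subwords of length 2 and so receives entropy 1, while a run of 17 copies
  of the letter A receives entropy 0.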

  The reason for truncating <math|w> to its first <math|4<rsup|n>+n-1>
  letters lies in the following two facts, whose proofs are omitted.

  <\lemma>
    <label|word containment>A sequence <math|w> over the alphabet
    <math|<around|{|A,C,T,G|}>> of length <math|4<rsup|n>+n-1> can contain at
    most <math|4<rsup|n>> distinct subwords of length <math|n>. Conversely, if a word
    <math|w> is to contain <math|4<rsup|n>> distinct subwords of length <math|n>, it must have length at
    least <math|4<rsup|n>+n-1>.
  </lemma>

  Thus if we had taken an integer <math|m\<gtr\>n> in the above definitions
  and instead utilized <math|<frac|log<rsub|4><around|(|p<rsub|w><around|(|m|)>|)>|m>>,
  <math|w> would not be long enough to contain all different possible
  subwords.

  <\lemma>
    <label|max entropy>Suppose a sequence <math|w> has length
    <math|4<rsup|n>+n-1> for some integer <math|n>. If <math|w> contains
    all possible subwords of length <math|n> formed on the alphabet
    <math|<around|{|A,C,T,G|}>>, then <math|H<rsub|t*o*p><around|(|w|)>=1>.
  </lemma>

  Thus if a sequence of length <math|4<rsup|n>+n-1> is ``as random as
  possible'' (i.e. contains every possible subword), its topological entropy
  is 1, just as we would expect in the infinite sequence case. Similarly, if
  <math|w> is ``as nonrandom as possible'', that is, if <math|w> is simply the
  repetition of a single letter <math|4<rsup|n>+n-1> times, then
  <math|H<rsub|t*o*p><around|(|w|)>=0>.

  Furthermore, if we had not truncated in definition
  <reference|topological entropy>, then for a sequence <math|v> such that
  <math|<around|\||v|\|>\<gtr\><around|\||w|\|>>, the topological entropy of
  <math|v> would on average be artificially higher, since a
  longer sequence has more opportunity for the appearance of
  subwords. Thus, by truncating we have allowed sequences of different
  lengths to have comparable topological entropies.

  This definition of topological entropy serves as a measure of the
  randomness of a sequence: the higher the entropy, the more random the
  sequence. The justification for this finite implementation giving an
  approximate characterization of randomness is given in Ornstein and Weiss
  (2007) in which it is shown that functions of entropy are the only finitely
  observable invariants of a process.

  <subsection|Expected Value>

  While topological entropy has been well studied for infinite sequences,
  very little has been done by way of mathematically analyzing topological
  entropy for finite sequences. This lack of analysis is most likely because
  the literature (Kirillova 2000; Crochemore and
  Renaud 1999; Schmitt and Herzel 1997) treats topological entropy not as a single
  number to be associated to a DNA sequence, but rather as the entire function
  <math|<frac|log<rsub|4> p<rsub|w><around|(|n|)>|n>>, considered for
  <with|font-shape|italic|every> <math|n>. This approach turns topological
  entropy (which should be just a single number associated to a DNA
  sequence) into a very high dimensional problem: in fact, one with as many
  dimensions as the length of the DNA sequence under consideration. Our
  definition given above (definition <reference|topological entropy>) does in
  fact associate just a single number (instead of an entire function) to a
  sequence, and so is much more analytically tractable.

  We now utilize the results of Gheorghiciuc and Ward (2007) to compute the
  expected value of the above topological entropy. This will assist us in
  determining what constitutes ``high'' or ``low'' entropy. First, we calculate
  the expected value of the complexity function
  <math|p<rsub|w><around|(|n|)>>. As is commonly assumed (Lio and Goldman
  1998; Hasegawa <with|font-shape|italic|et al.> 1985; Jukes and Cantor
  1969), we assume that DNA sequences evolve in the following way: each
  site evolves in a Markov fashion, independently of neighboring sites. We do not
  assume a single model of molecular evolution, but rather just assume that
  there is some set of probabilities <math|<around|{|\<pi\><rsub|A>,\<pi\><rsub|C>,\<pi\><rsub|T>,\<pi\><rsub|G>|}>>
  such that the probability of appearance of a sequence <math|w> is given by
  the following: for <math|n<rsub|A>> the number of occurrences of the letter
  <math|A> in <math|w>, <math|n<rsub|C>> the number of occurrences of the
  letter <math|C> in <math|w>, etc., the probability of the sequence <math|w>
  appearing is given by:

  <\equation*>
    \<bbb-P\><around|(|w|)>=\<pi\><rsub|A><rsup|n<rsub|A>>*\<pi\><rsub|C><rsup|n<rsub|C>>*\<pi\><rsub|T><rsup|n<rsub|T>>*\<pi\><rsub|G><rsup|n<rsub|G>>
  </equation*>

  This assumption regarding the probability of appearance of a DNA sequence
  is used only to procure a distribution against which we may calculate the
  expected number of subwords. The actual calculation of topological entropy
  as in definition <reference|topological entropy> does not make any such
  assumption about the probability of appearance.

  <\theorem>
    [Expected Value of the Complexity Function]<label|expected value of the
    complexity function> The expected value of the complexity function
    <math|p<rsub|w><around|(|k|)>> taken over sequences of length
    <math|<around|\||w|\|>=n+k-1> is given by

    <align|<tformat|<table|<row|<cell|<label|general expected
    value>\<bbb-E\><around|[|p<rsub|w><around|(|k|)>|]>>|<cell|=4<rsup|k>-<big|sum><rsub|u><around|(|1-\<bbb-P\><around|(|u|)>|)><rsup|n>+<with|math-font|cal*|O><around|(|n<rsup|-\<epsilon\>>*\<mu\><rsup|k>|)>>>>>>

    where the summation is over all words <math|u> of length <math|k>,
    and <math|0\<less\>\<epsilon\>\<less\>1>, <math|\<mu\>\<less\>1> (these
    are explicitly computed constants based on the <math|\<pi\><rsub|i>>
    defined above, see (Gheorghiciuc and Ward 2007)).
  </theorem>

  <\proof>
    See (Gheorghiciuc and Ward 2007).
  </proof>
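
  To make the main term of this formula concrete, the following Python
  sketch evaluates it for arbitrary letter probabilities. It rests on two
  assumptions of ours: the formula is read with <math|k> the subword length
  and <math|n> the number of window positions (so the sequence has length
  <math|n+k-1>), and the error term is dropped. Words with the same letter
  counts share the same probability, so the sum is grouped by letter counts
  rather than enumerated word by word. The function name is hypothetical.

```python
from math import factorial


def expected_distinct_subwords(k, n, pi):
    """Main term 4**k - sum_u (1 - P(u))**n over all words u of length k.

    pi = (pi_A, pi_C, pi_T, pi_G); the error term of the theorem is ignored.
    """
    p_a, p_c, p_t, p_g = pi
    total = 0.0
    for a in range(k + 1):
        for c in range(k + 1 - a):
            for t in range(k + 1 - a - c):
                g = k - a - c - t
                # number of words with exactly a A's, c C's, t T's and g G's
                count = factorial(k) // (
                    factorial(a) * factorial(c) * factorial(t) * factorial(g)
                )
                prob = p_a ** a * p_c ** c * p_t ** t * p_g ** g
                total += count * (1.0 - prob) ** n
    return 4 ** k - total


if __name__ == "__main__":
    print(expected_distinct_subwords(2, 16, (0.25, 0.25, 0.25, 0.25)))
    print(expected_distinct_subwords(2, 16, (0.3, 0.2, 0.3, 0.2)))
```

  With equal probabilities every word has probability <math|4<rsup|-k>>, and
  the grouped sum collapses to a single closed-form term.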

  This theorem has a particularly nice reduction when one assumes that the
  probability of appearance of each letter is the same (equivalent to
  the expected value being computed with a uniform distribution on the set of
  all sequences of a certain length).

  <\corollary>
    <label|nice expected value>Assuming that
    <math|\<pi\><rsub|A>=\<pi\><rsub|C>=\<pi\><rsub|T>=\<pi\><rsub|G>=1/4>,
    the expected value of the complexity function
    <math|p<rsub|w><around|(|k|)>> taken over sequences of length
    <math|<around|\||w|\|>=n+k-1> is given by

    <align|<tformat|<table|<row|<cell|<label|nice expected value
    formula>\<bbb-E\><around|[|p<rsub|w><around|(|k|)>|]>=4<rsup|k>-4<rsup|k>*<around|(|1-<around|(|<frac|1|4>|)><rsup|k>|)><rsup|n>+<with|math-font|cal*|O><around|(|n<rsup|-\<epsilon\>>*\<mu\><rsup|k>|)>>>>>>
  </corollary>

  While clearly there <with|font-shape|italic|is> a mononucleotide bias for
  different genomic regions and DNA sequences do not occur uniformly
  randomly, we nevertheless assume equal probability of appearance of each nucleotide,
  since the calculation of the expected number of subwords then reduces in
  computational complexity from exponential to linear in the length of the
  sequence.

  It is a straightforward calculation to combine formula <reference|nice
  expected value formula> with definition <reference|topological entropy> and
  compute the constants <math|\<epsilon\>> and <math|\<mu\>> as set forth in
  Gheorghiciuc and Ward (2007). Doing so, we obtain the following expected
  value for the topological entropy.

  <\theorem>
    [Expected Value of Topological Entropy]<label|expected value of
    topological entropy> The expected value of topological entropy taken over
    sequences of length <math|<around|\||w|\|>=4<rsup|n>+n-1> is given by

    <align|<tformat|<table|<row|<cell|<label|expected value of topological
    entropy formula>\<bbb-E\><around|[|H<rsub|t*o*p>|]>=<frac|log<rsub|4><around|(|4<rsup|n>-4<rsup|n>*<around|(|1-1/4<rsup|n>|)><rsup|4<rsup|n>>+<with|math-font|cal*|O><around|(|<around|(|<frac|1|<sqrt|2>>|)><rsup|n>|)>|)>|n>>>>>>
  </theorem>

  We now present in table <reference|Calculated Expected Value Table> the
  calculated estimate of the expected value of <math|H<rsub|<math-up|top>>>
  using the above formula. Keep in mind that this
  estimate converges to the actual expected value exponentially quickly (the term
  <math|<with|math-font|cal*|O><around|(|<around|(|<frac|1|<sqrt|2>>|)><rsup|n>|)>>)
  as <math|n> increases (and so also the length of the sequence). We thus
  ignore the <math|<with|math-font|cal*|O><around|(|<around|(|<frac|1|<sqrt|2>>|)><rsup|n>|)>>
  term in the following calculation.

  <\big-table>
    <label|Calculated Expected Value Table>

    <with|font-size|0.84|<tabular*|<tformat|<cwith|1|-1|1|1|cell-halign|l>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|l>|<cwith|1|-1|3|3|cell-hyphen|t>|<cwith|1|-1|3|3|cell-hmode|exact>|<cwith|1|-1|3|3|cell-width|25ex>|<cwith|1|-1|3|3|cell-rborder|0ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-bborder|1ln>|<table|<row|<cell|<math|n>>|<cell|<math|<with|font-size|0.84|4<rsup|n>+n-1>>>|<cell|Calculated
    Expected Value of <math|H<rsub|<math-up|top>>>>>|<row|<cell|1>|<cell|4>|<cell|.725606>>|<row|<cell|2>|<cell|17>|<cell|.841242>>|<row|<cell|3>|<cell|66>|<cell|.890810>>|<row|<cell|4>|<cell|259>|<cell|.917489>>|<row|<cell|5>|<cell|1028>|<cell|.933868>>|<row|<cell|6>|<cell|4101>|<cell|.944865>>|<row|<cell|7>|<cell|16390>|<cell|.952736>>|<row|<cell|8>|<cell|65543>|<cell|.958642>>|<row|<cell|9>|<cell|262152>|<cell|.963237>>|<row|<cell|10>|<cell|1048585>|<cell|.966914>>|<row|<cell|11>|<cell|4194314>|<cell|.969921>>|<row|<cell|12>|<cell|16777227>|<cell|.972428>>>>>>
  </big-table|Calculated Expected Value of Topological Entropy>
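
  The entries of this table can be reproduced, up to rounding, with a short
  Python sketch. It evaluates the formula of the preceding theorem with the
  error term dropped, as described above; the function name is our own and
  the sketch is only meant to illustrate the calculation.

```python
import math


def expected_entropy(n):
    """Estimate of E[H_top] for sequences of length 4**n + n - 1 (error term dropped)."""
    positions = 4 ** n
    # (1 - 4**-n) ** (4**n): the probability that a fixed word of length n
    # fails to appear at any of the 4**n window positions, treating the
    # positions as independent as the leading term of the formula does
    missing = (1.0 - 1.0 / positions) ** positions
    return math.log(positions * (1.0 - missing), 4) / n


if __name__ == "__main__":
    for n in range(1, 13):
        print(n, 4 ** n + n - 1, round(expected_entropy(n), 6))
```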

  For comparison's sake, we present in table <reference|Sampled Expected
  Value Table> the sampled expected values for <math|n=1,\<ldots\>,9> along
  with sampled standard deviations (the calculations were made by explicitly
  computing the topological entropy of uniformly randomly selected
  sequences).

  <\big-table>
    <label|Sampled Expected Value Table>

    <with|font-size|0.84|<tabular*|<tformat|<cwith|1|-1|1|1|cell-halign|l>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|l>|<cwith|1|-1|3|3|cell-hyphen|t>|<cwith|1|-1|3|3|cell-hmode|exact>|<cwith|1|-1|3|3|cell-width|10ex>|<cwith|1|-1|3|3|cell-rborder|0ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-bborder|1ln>|<table|<row|<cell|<math|n>>|<cell|<math|<with|font-size|0.84|4<rsup|n>+n-1>>>|<cell|Sampled
    Expected Value of <math|H<rsub|<math-up|top>>>>|<cell|Sampled Standard
    Deviation>|<cell|Sample Size>>|<row|<cell|1>|<cell|4>|<cell|.703583>|<cell|.184798>|<cell|256>>|<row|<cell|2>|<cell|17>|<cell|.838956>|<cell|.0508640>|<cell|300000>>|<row|<cell|3>|<cell|66>|<cell|.890576>|<cell|.0176785>|<cell|300000>>|<row|<cell|4>|<cell|259>|<cell|.917457>|<cell|.00674325>|<cell|300000>>|<row|<cell|5>|<cell|1028>|<cell|.933869>|<cell|.0027160>|<cell|300000>>|<row|<cell|6>|<cell|4101>|<cell|.944861>|<cell|.00113176>|<cell|300000>>|<row|<cell|7>|<cell|16390>|<cell|.952733>|<cell|.000486368>|<cell|300000>>|<row|<cell|8>|<cell|65543>|<cell|.958642>|<cell|.000212283>|<cell|300000>>|<row|<cell|9>|<cell|262152>|<cell|.963237>|<cell|.0000944814>|<cell|300000>>>>>>
  </big-table|Sampled Expected Value and Standard Deviation of Topological
  Entropy>
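
  A sampling experiment of this kind can be sketched in Python as follows.
  This is an illustration under our own assumptions: the function names,
  the fixed seed, and the much smaller sample size are ours, and the sketch
  is not the code used to produce the table. Uniformly random sequences of
  length <math|4<rsup|n>+n-1> are drawn and the sample mean and standard
  deviation of their topological entropy are computed.

```python
import math
import random
import statistics


def entropy_exact_length(w, n):
    """H_top for a sequence whose length is exactly 4**n + n - 1."""
    distinct = {w[i:i + n] for i in range(4 ** n)}
    return math.log2(len(distinct)) / (2 * n)  # log_4(x) = log_2(x) / 2


def sample_entropy(n, samples, seed=0):
    """Sample mean and standard deviation of H_top over uniform random sequences."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    length = 4 ** n + n - 1
    values = [
        entropy_exact_length("".join(rng.choices("ACGT", k=length)), n)
        for _ in range(samples)
    ]
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    print(sample_entropy(2, 5000))
```

  With a few thousand samples the estimates for small <math|n> should
  already be close to the corresponding rows of the table, though they will
  fluctuate more than the values obtained from 300000 samples.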

  Summarizing this table, the topological entropy of randomly selected
  sequences is tightly centered around the expected value which itself is
  close to one. Furthermore, the distribution of topological entropy is very
  close to a normal distribution as can be observed from the histogram of
  topological entropy for sequences of length <math|4<rsup|9>+9-1> included
  in figure <reference|histogram of topological entropy>. The skewness and
  kurtosis are .0001996 and 2.99642 respectively.

  <\big-figure>
    <label|histogram of topological entropy><image|Histogram92.eps|4.5in|||>
  </big-figure|Histogram of Topological Entropy of Randomly Selected
  Sequences of Length <math|4<rsup|9>+9-1=262152>>

  <section|Algorithm>

  An implementation of this approximation to topological entropy is available
  at:<next-line>http://www.math.psu.edu/koslicki/entropy.nb<next-line>We
  make a few remarks regarding this estimation of topological entropy.
  First, if a sequence <math|w> in consideration has a length such that for
  some <math|n>, <math|4<rsup|n>+n-1\<less\><around|\||w|\|>\<less\>4<rsup|n+1>+n>,
  it will be more accurate to use a sliding window to compute the topological
  entropy. For example, if <math|<around|\||w|\|>=16000>, we would normally
  truncate this sequence to its first 4101 letters. This might
  misrepresent the actual topological entropy of the sequence. Accordingly,
  we could instead compute the average of the topological entropy of the
  following sequences (where <math|w<rsub|n><rsup|m>> means the subsequence
  of <math|w> consisting of the <math|n<rsup|<math-up|th>>> to
  <math|m<rsup|<math-up|th>>> letters of <math|w>):

  <\equation*>
    w<rsub|1><rsup|4101>,w<rsub|2><rsup|4102>,w<rsub|3><rsup|4103>,\<ldots\>,w<rsub|11900><rsup|16000>
  </equation*>

  This is computationally intensive, so for longer sequences one might
  instead choose to take non-overlapping windows, averaging
  the topological entropy of the sequences

  <\equation*>
    w<rsub|1><rsup|4101>,w<rsub|4102><rsup|8202>,w<rsub|8203><rsup|12303>,\<ldots\>
  </equation*>
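
  Both windowing schemes can be sketched with a single Python helper that
  takes the step between window starts as a parameter. The names below are
  hypothetical and the sketch is not the serial or parallel code mentioned
  in this section. A step of 1 gives the sliding window, and a step equal to
  the window size <math|4<rsup|n>+n-1> gives non-overlapping windows.

```python
import math


def window_starts(length, n, step):
    """0-based start indices of windows of size 4**n + n - 1, taken every `step` letters."""
    size = 4 ** n + n - 1
    if length < size:
        raise ValueError("sequence is shorter than one window")
    return range(0, length - size + 1, step)


def window_entropy(window, n):
    """log_4 of the number of distinct length-n subwords of the window, divided by n."""
    distinct = {window[i:i + n] for i in range(len(window) - n + 1)}
    return math.log2(len(distinct)) / (2 * n)  # log_4(x) = log_2(x) / 2


def averaged_entropy(w, n, step=1):
    """Average entropy over windows of w; only complete windows are used."""
    size = 4 ** n + n - 1
    values = [window_entropy(w[s:s + size], n) for s in window_starts(len(w), n, step)]
    return sum(values) / len(values)
```

  For a sequence of length 16000 and <math|n=6>, a step of 1 yields 11900
  windows and a step of 4101 yields three complete windows; in this sketch
  the leftover letters at the end are simply ignored.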

  The above website includes serial and parallel versions of the algorithm.
  The fastest version utilizes Nvidia CUDA GPU computing, has complexity
  <math|<with|math-font|cal*|O><around|(|n|)>> for a sequence of length
  <math|n>, and takes an average of 5.2 seconds to evaluate on a DNA sequence
  of length 16,777,227 when using an Intel i7-950 3.6 GHz CPU and an Nvidia
  GTX 460 GPU.

  <subsection|Comparison to Traditional Measures of Complexity>

  Other measures of DNA sequence complexity similar to this approximation of
  topological entropy include: previous implementations of topological
  entropy (Kirillova, 2000), special factors (Colosimo and de Luca, 2000),
  Shannon's metric entropy (Kirillova, 2000; Farach
  <with|font-shape|italic|et al.>, 1995), Rényi continuous entropy (Vinga and
  Almeida, 2004; Rényi, 1961), and linguistic complexity (LC) (Troyanskaya
  <with|font-shape|italic|et al.>, 2002; Gabrielian and Bolshoy, 1999).

  The implementation of topological entropy in Kirillova (2000) does not
  produce a single number representing entropy, but rather an entire sequence
  of values. Thus while the implementation of Kirillova (2000) does
  distinguish between artificial and actual DNA sequences, Kirillova notes
  that the implementation is hampered by high-dimensionality and finiteness
  problems.

  In Colosimo and de Luca (2000), it is noted that the special factors
  approach does not differentiate between introns and exons.

  Note also that the convergence of our approximation of topological entropy
  is even faster than that of Shannon's metric entropy. Shannon's metric
  entropy of the sequence <math|u> for the value <math|n> is defined as

  <\equation*>
    H<rsub|m*e*t><around|(|u,n|)>=<frac|-1|n>*<big|sum><rsub|w>\<mu\><rsub|u><around|(|w|)>*log
    <around|(|\<mu\><rsub|u><around|(|w|)>|)>
  </equation*>

  where the summation is over all words of length <math|n> and
  <math|\<mu\><rsub|u><around|(|w|)>> is the probability (frequency) of the
  word <math|w> appearing in the given sequence <math|u>. Thus Shannon's
  metric entropy requires not only that the subwords appear, but also that
  their actual frequencies of appearance converge. As can
  be seen from definition <reference|topological entropy>, our notion of
  topological entropy does not require the use of the actual subword
  frequencies. So topological entropy will in general be more accurate than
  Shannon's metric entropy for shorter sequences. Accordingly, the
  convergence issues mentioned in Farach <with|font-shape|italic|et al.>
  (1995) (even with the clever Lempel-Ziv estimator) can be circumvented.

  Furthermore, it is not difficult to show (as in Blanchard
  <with|font-shape|italic|et al.> (2000), Proposition 1.2.5) what is known as
  the <with|font-shape|italic|Variational Principle>, that is, topological
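  entropy dominates metric entropy: for any sequence <math|u> (finite or not)
  and integer <math|n>, with <math|H<rsub|t*o*p><around|(|u,n|)>> denoting
  <math|<frac|log p<rsub|u><around|(|n|)>|n>> (same logarithm base as in
  <math|H<rsub|m*e*t>>),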

  <align|<tformat|<table|<row|<cell|H<rsub|m*e*t><around|(|u,n|)>\<leq\>H<rsub|t*o*p><around|(|u,n|)>>>>>>

  Thus topological entropy retains connections to the information theoretic
  interpretation of metric entropy as set forth by Shannon (1948). Since
  topological entropy bounds metric entropy from above:

  <\quote-env>
    Low topological entropy of a sequence implies that it is ``less chaotic''
    and is ``more structured.''
  </quote-env>
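
  The domination of metric entropy by topological entropy can be checked
  numerically on small examples. The Python sketch below is an illustration
  only: the function names are ours, logarithms are taken in base 4 (an
  assumption, since the base is not fixed in the formula for metric entropy
  above), and the topological quantity is taken to be
  <math|<frac|log<rsub|4> p<rsub|u><around|(|n|)>|n>> for the same <math|n>.

```python
import math
from collections import Counter


def subword_counts(u, n):
    """Occurrence counts of every length-n subword of u (overlaps allowed)."""
    return Counter(u[i:i + n] for i in range(len(u) - n + 1))


def metric_entropy(u, n):
    """Shannon metric entropy of u for word length n, with logarithms in base 4."""
    counts = subword_counts(u, n)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total, 4) for c in counts.values()) / n


def subword_growth(u, n):
    """log_4 of the number of distinct length-n subwords of u, divided by n."""
    return math.log(len(subword_counts(u, n)), 4) / n


if __name__ == "__main__":
    u = "AAACAAGAAATAAACA"
    for n in (1, 2, 3):
        print(n, metric_entropy(u, n), subword_growth(u, n))
```

  The two quantities agree exactly when every subword that appears does so
  equally often; otherwise the metric entropy is strictly smaller.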

  This connection to information theory is also an argument for the use of
  topological entropy over Rényi continuous entropy of order <math|\<alpha\>>
  (see Vinga and Almeida (2004) for more details). Rényi (1961) showed that
  for <math|\<alpha\>\<neq\>1>, one cannot define conditional and mutual
  information functions and hence Rényi continuous entropy does not measure
  ``information content'' in the usual sense. So while Rényi entropy does
  allow for the identification of statistically significant motifs (Vinga and
  Almeida, 2004), one cannot conclude that higher/lower Rényi continuous
  entropy for <math|\<alpha\>\<neq\>1> implies more/less information content
  or complexity in the usual sense.

  Thus LC is the only other similar measurement of sequence complexity that
  produces a single number representing the complexity of a sequence. Like
  our implementation of topological entropy, the implementation of LC
  contained in Troyanskaya <with|font-shape|italic|et al.> (2002) also runs in linear time. A
  comparison of our implementation of topological entropy and LC is contained
  in section 4.4.

  <section|Application to Exons/Introns of the Human Genome>

  <subsection|Method>

  We now apply our definition of topological entropy to the intron and exon
  regions of the human genome.

  We retrieved the February 2009 GRCh37/hg19 human genome assembly from the
  UCSC database and utilized Galaxy (Blankenberg <with|font-shape|italic|et
  al.> 2010; Blankenberg <with|font-shape|italic|et al.> 2007) to extract the
  nucleotide sequences corresponding to the introns and exons of each
  chromosome (including ChrX and ChrY). Even though, as argued above,
  topological entropy converges more quickly than metric entropy, one must be
  careful not to use this definition of topological entropy on sequences that
  are too short, as this would lead to significant noise. For example, the
  UCSC database contains exons that consist of a single base and it is
  meaningless to attempt to measure topological entropy of such sequences.
  Hence we selected the longest 100 different intron and exon sequences from
  each chromosome.

  After ensuring that each sequence consisted only of letters from
  <math|<around|{|A,C,T,G|}>>, we then applied the approximation of
  topological entropy found in definition <reference|topological entropy> to
  the resulting sequences. For comparison's sake we also applied the
  approximation of topological entropy to the longest 50, 200, and 400
  sequences, as well as to <with|font-shape|italic|all> the intron and exon
  sequences. The salient observed features persist throughout, though, as
  expected, the results become noisier when shorter sequences are allowed.

  To investigate in more detail the relationship between regions under
  selective pressure and the value of topological entropy, we also selected
  each 5' and 3' UTR on chromosome Y that consisted of more than
  <math|4<rsup|3>+3-1=66> bp.

  <subsection|Data>

  Figure <reference|ave bar chart 100> displays the error bar plot for the
  longest 100 exons and introns. The error bar plots for the longest 50, 200,
  and 400 sequences, as well as the plot for all the intron and exon
  sequences are, for brevity's sake, not shown. Figure <reference|UTRerror>
  displays the error bar plot for chromosome Y 5' and 3' UTRs that are
  longer than 66 bp.

  <big-figure|<label|ave bar chart 100><image|AVeTopEnt1003.eps|6in|||>
  |Error bar plot of average topological entropy for the longest 100 introns
  and exons in each chromosome>

  <big-figure|<label|UTRerror><image|UTRerror.eps|6in|||> |Error bar plot of
  chromosome Y 5' and 3' UTRs longer than 66 bp>

  <subsection|Analysis and Discussion>

  We first discuss the results regarding intron and exon regions. As figure
  <reference|ave bar chart 100> demonstrates, the topological entropies of
  intron regions of the human genome are larger than the topological
  entropies of the exon regions. For example, the mean of the entropies of
  the introns on chromosome 21 is more than 11 standard deviations away from
  the mean of the entropy of the exons on the same chromosome. This result
  supports the commonly held notion that intron regions of DNA are mostly
  free from selective pressure and so evolve more randomly than do exon
  regions. We thus suggest that the observation of Karamanos
  <with|font-shape|italic|et al.> (2006), Troyanskaya
  <with|font-shape|italic|et al.> (2002), Mantegna <with|font-shape|italic|et
  al.> (1995), and Stanley <with|font-shape|italic|et al.> (1999) that intron
  entropy is <with|font-shape|italic|smaller> than exon entropy is due to the
  aforementioned finite sample effects and high-dimensionality problems
  related to previous implementations of entropy.
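  The separation quoted above for chromosome 21 can be checked against the
  values in table <reference|top table>. The text does not state which
  standard deviation serves as the unit; the Python sketch below assumes it
  is the intron standard deviation, which is the reading that reproduces a
  figure slightly above 11. The helper name <verbatim|separation_in_sd> is
  hypothetical.

```python
def separation_in_sd(mean_a: float, mean_b: float, sd_ref: float) -> float:
    """Distance between two means in units of a reference standard deviation."""
    return abs(mean_a - mean_b) / sd_ref


# Chromosome 21 values for the longest 100 sequences, copied from the table
# of means and standard deviations in the supplementary material.
intron_mean, intron_sd = 0.931734, 0.00441368
exon_mean, exon_sd = 0.881925, 0.026025

# Assumption: the intron SD is the unit used in the text.
in_intron_sd = separation_in_sd(intron_mean, exon_mean, intron_sd)
# For contrast, the same gap measured in (much larger) exon SDs.
in_exon_sd = separation_in_sd(intron_mean, exon_mean, exon_sd)

print(f"{in_intron_sd:.2f} intron SDs, {in_exon_sd:.2f} exon SDs")
```

  Measured in exon standard deviations the same gap is only about 1.9, so
  the choice of reference deviation matters when comparing regions whose
  spreads differ this much.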

  Interestingly, even though we observe that intron entropy is larger than
  exon entropy, the entropies of <with|font-shape|italic|both> regions are
  much lower than expected (here expectation is as calculated in table
  <reference|Calculated Expected Value Table>). Indeed, of the longest 100
  sequences, the average intron length is 180880 and the average exon length
  is 2059, so according to tables <reference|Calculated Expected Value Table>
  and <reference|Sampled Expected Value Table>, we would expect the entropies
  to be .966914 and .933853 respectively. We find, though, that the average
  entropy for introns is .9323166 and for exons is .897451. Note that the
  largest intron sequence entropy (<math|H<rsub|t*o*p>=.*943627> for an
  intron of length 1.1Mbp found on chromosome 16) is significantly lower than
  the expected value of .969921 (at least 60 standard deviations from the
  expectation). This is not too surprising considering that the expectation as
  calculated in theorem <reference|expected value of topological entropy>
  uses the uniform distribution. This supports the conclusion that while
  intron regions do evolve more randomly than exon regions, introns do not
  evolve uniformly randomly.

  Note the disparity between the entropies of the sex chromosomes: The
  entropy of chromosome X in both intron and exon regions is significantly
  higher than in chromosome Y. In fact, the mean of chromosome X intron
  entropies is 3.5 standard deviations higher than the mean of chromosome Y
  intron entropies; the mean of chromosome X exon entropies is 1 standard
  deviation higher than the mean of chromosome Y exon entropies. Thus the X
  chromosome has intron and exon entropy similar to that of the autosomes,
  but chromosome Y has significantly differing exon and intron entropy. This
  is a particularly puzzling result considering that chromosome Y is known to
  have a high mutation rate and a special selection regime (Wilson and Makova
  2009a; Wilson and Makova 2009b; Graves 2006), and so one would expect the
  entropy of chromosome Y (both intron and exon regions) to be much higher
  than it is. In fact, the chromosome Y introns have the lowest mean
  topological entropy of any intron region across the entire genome. This
  would suggest that the accumulation of ``junk" DNA and the massive
  accumulation of retrotransposable elements mentioned in Graves (2006) have
  some underlying function or structure. More specifically, it appears that
  the intron regions in chromosome Y might fall into two categories: the
  truly ``junk" DNA consisting of the introns with topological entropy
  greater than .910, and the introns that have hidden structure consisting of
  those sequences with entropy less than .910. We present in figure
  <reference|ChrYHistogram> a histogram of the topological entropy on
  chromosome Y demonstrating the distinction between the two categories.

  <\big-figure>
    <label|ChrYHistogram> <image|ChrYHistogram2.eps|4.5in|||>
  </big-figure|Histogram of topological entropy of introns in chromosome Y>

  Remaining on chromosome Y, we now present evidence that topological entropy
  can be used to detect sequences that are under selective pressure. Note
  that Siepel <with|font-shape|italic|et al.> (2005) showed that both 5' and
  3' UTRs are among the most conserved elements in vertebrate genomes. Thus
  one would expect that the topological entropy of these regions would be
  very low (as this is indicative of a high degree of structure). As
  indicated in figure <reference|UTRerror>, the entropies of both the 5' and
  3' UTRs are low in comparison to the entropies of the intron and exon
  regions across the autosomes. In fact, the mean topological entropies of
  the 5' and 3' UTRs (<math|.*871545\<pm\>.*0290619> and
  <math|.*879163\<pm\>.*0219371>, respectively) are lower than the mean
  entropy of <with|font-shape|italic|any> intron or exon region across every
  chromosome. The lowest mean intron topological entropy for an autosome is
  <math|.*927802\<pm\>.*00539> on chromosome 19; this is more than nine
  standard deviations <with|font-shape|italic|higher> than the mean
  topological entropy of either the 3' or 5' UTRs. This lends support to the
  assertion that topological entropy can be used to detect functional regions
  and regions under selective constraint.

  <\big-figure>
    <label|UTRhistogram> <image|UTRhistogram.eps|4.5in|||>
  </big-figure|Histogram of topological entropy for 5' and 3' UTRs in
  chromosome Y>

  <subsection|Comparison to Linguistic Complexity>

  As mentioned in section 3.1, LC is the only other similar measurement of
  sequence complexity that produces a single number to represent the
  complexity of a sequence. We applied the algorithm described in Troyanskaya
  <with|font-shape|italic|et al.> (2002) and implemented by Larsson (1999) to
  the same data set described in section 4.1 of this paper. To obtain
  directly comparable results, we used a window size equal to the length of
  the given sequence. As can be seen in
  figure <reference|LC>, LC does distinguish between introns and exons to an
  extent, though not to the same quality of resolution as that of topological
  entropy (compare to figure <reference|ave bar chart 100>). For example,
  while topological entropy consistently measures introns as more random than
  exons, LC does not. This discrepancy is most likely due to linguistic
  complexity being effectively utilized (Troyanskaya
  <with|font-shape|italic|et al.>, 2002) as a sliding window method to detect
  repetitive motifs, not as a holistic measure of sequence information
  content. So we also applied LC using a sliding window of 2000bp, taking the
  average value of LC on a given sequence, and then averaging on a given
  chromosome (see figure <reference|LC2000>). Using the sliding window, LC
  does give a higher value to introns than to exons (except on chromosome 5).
  While the separation between the LC of introns and exons becomes more
  pronounced, the resolution is still not nearly as clear as with topological
  entropy, since a large amount of error persists: the mean LC values of
  introns and exons are well within one standard deviation of each other
  across the entire genome.

  <big-figure|<label|LC><image|LC.eps|6in|||> |Error bar plot of linguistic
  complexity on introns and exons using window as long as the sequence.>

  <big-figure|<label|LC2000><image|LC2000.eps|6in|||> |Error bar plot of
  linguistic complexity on introns and exons using 2000bp windows.>

  <section|Conclusion>

  This implementation of topological entropy is free from issues that other
  implementations have encountered. Namely, this definition allows for the
  comparison of sequences of different lengths and does not suffer from
  multi-dimensionality complications. Since this definition supplies a single
  value to characterize the complexity of a sequence, it is much more
  amenable to mathematical analysis. Beyond measuring the complexity or
  simplicity of a sequence, we presented evidence that our approximation to
  topological entropy might detect functional regions and sequences free from
  or under selective constraint. The speed and simplicity of this
  implementation of topological entropy make it well suited to detecting
  regions of high or low complexity. For example, we observe the novel
  phenomenon that the introns on chromosome Y have atypically low and
  bi-modal entropy, possibly corresponding to random sequences and sequences
  that possess hidden structure or function.

  <section*|Acknowledgments>

  The author would like to thank Manfred Denker, Kateryna Makova, and
  Francesca Chiaromonte for their assistance and fruitful discussion
  regarding this paper. This work was supported by the National Science
  Foundation [grant number DMS-1008538].

  <\thebibliography|>
    <bibitem|Bla>Blanchard, F., Maass, A., Nogueira, A., eds. 2000. Topics in
    symbolic dynamics and applications. London Math. Soc. Lecture Note Ser.
    279, Cambridge Univ. Press.

    <bibitem|Blan1>Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G.,
    Lazarus, R., Mangan, M., Nekrutenko, A., Taylor, J. 2010. Galaxy: a
    web-based genome analysis tool for experimentalists. Current Protocols in
    Molec. Biol. 19:1--21.

    <bibitem|Blan2>Blankenberg, D., Taylor, J., Schenk, I., He, J., Zhang, Y.,
    Ghent, M., Veeraraghavan, N., Albert, I., Miller, W., Makova, K.,
    Hardison, R.C., Nekrutenko, A. 2007. A framework for collaborative
    analysis of ENCODE data: Making large-scale analyses biologist-friendly,
    Genome Research. 17:6:960--964.

    <bibitem|Col>Colosimo, A., de Luca, A. 2000. Special factors in
    biological strings. J. Theor. Biol. 204:29--46.

    <bibitem|Cro>Crochemore, M., Vérin, R. 1999. Zones of low entropy in
    genomic sequences. Computers & Chemistry. 23:275--282.

    <bibitem|Far>Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner,
    A., Ziv, J. 1995. On the entropy of DNA: algorithms and measurements
    based on memory and rapid convergence. Proceedings of the sixth annual
    ACM-SIAM symposium on discrete algorithms, SIAM, Philadelphia, PA. pp.
    48--57.

    <bibitem|Gab>Gabrielian, A., Bolshoy, A. 1999. Sequence complexity and
    DNA curvature. Computers & Chemistry. 23:263--274.

    <bibitem|Ghe>Gheorghiciuc, I., Ward, M.D. 2007. On correlation
    polynomials and subword complexity, Conf. on Analysis of Alg., DMTCS Proc
    AH. pp. 1--18.

    <bibitem|Gra>Graves, J. A. M. 2006. Sex chromosome specialization and
    degeneration in mammals. Cell. 124:5:901--914.

    <bibitem|Has>Hasegawa, M., Kishino, H., Yano, T. 1985. Dating of the
    human-ape splitting by a molecular clock of mitochondrial DNA, J. Mol.
    Evol., 22:160--174.

    <bibitem|Juk>Jukes, T.H., Cantor, C.R. 1969. Evolution of protein
    molecules, In Mammalian Protein Metabolism (ed. H.N. Munro), Academic
    Press, New York, New York, pp. 21--132.

    <bibitem|Kar>Karamanos, K., Kotsireas, I., Peratzakis, A., Eftaxias, K.
    2006. Statistical compressibility analysis of DNA sequences by
    generalized entropy-like quantities: Towards algorithmic laws for
    Biology?, Proc. of the 6th WSEAS Int. Conf. on Applied Informatics and
    Communications, 18:481--491.

    <bibitem|Kir>Kirillova, O.V. 2000. Entropy concepts and DNA
    investigations, Phys. Letters A, 274:247--253.

    <bibitem|Lar>Larsson, N.J. 1999. Structures of String Matching and Data
    Compression, PhD. Thesis, Lund University, Sweden.

    <bibitem|Lio>Lio, P., Goldman, N. 1998. Models of molecular evolution and
    phylogeny, Genome Res., 8:1233--1244.

    <bibitem|Man>Mantegna, R.N., Buldyrev, S.V., Goldberger, A.L., et al.
    1995. Systematic analysis of coding and noncoding DNA sequences using
    methods of statistical linguistics. Phys. Rev E, 52:3:2939--2950.

    <bibitem|Orn>Ornstein, D., Weiss, B. 2007. Entropy is the only finitely
    observable invariant, J. Mod. Dyn., 1:93--107.

    <bibitem|Ren>Rényi, A. 1961. On measures of information and entropy,
    Proc. of the 4th Berkeley Symp. on Math. Stat. and Prob., Vol. I, Univ.
    California Press, Berkeley, Calif. pp. 547--561.

    <bibitem|Sch>Schmitt, A.O., Herzel, H. 1997. Estimating the entropy of
    DNA sequences, J. Theor. Biol. 188:369--377.

    <bibitem|Schm>Schmutz, J., Martin, J., Terry, A., et al. 2004. The DNA
    sequence and comparative analysis of human chromosome 5, Nature,
    431:268--274.

    <bibitem|Sha>Shannon, C.E. 1948. A Mathematical theory of communication,
    Bell Sys. Tech. J., 27:379--423.

    <bibitem|Sie>Siepel, A., Bejerano, G., Pedersen, J.S., et al. 2005.
    Evolutionarily conserved elements in vertebrate, insect, worm, and yeast
    genomes. Genome Res., 15:8:1034--1050.

    <bibitem|Sta>Stanley, H.E., Buldyrev, S.V., Goldberger, A.L., Havlin, S.,
    Peng, C.-K., Simons, M. 1999. Scaling features of noncoding DNA, Physica
    A, 273:1--18.

    <bibitem|Tro>Troyanskaya, O.G., Arbell, O., Koren, Y., Landau, G.M.,
    Bolshoy, A. 2002. Sequence complexity profiles of prokaryotic genomic
    sequences: A fast algorithm for calculating linguistic complexity,
    Bioinformatics, 18:5:679--688.

    <bibitem|Vin>Vinga, S., Almeida, J.S. 2004. Rényi continuous entropy of
    DNA sequences, J. Theor. Biol. 231:377--388.

    <bibitem|Wil1>Wilson, M. A., Makova, K. D. 2009. Genomic analyses of sex
    chromosome evolution, Annu. Rev. Genom. Human Genet., 10:333--354.

    <bibitem|Wil2>Wilson, M. A., Makova, K. D. 2009. Evolution and survival
    on eutherian sex chromosomes, PLoS Genet. 5:7:e1000568.
  </thebibliography>

  <section|Supplementary Material >

  <\proof>
    <dueto|Proof of lemma <reference|word containment>>First observe that any
    sequence <math|w> has <math|<around|\||w|\|>-N+1> subwords of length
    <math|N> for each <math|N>. So if <math|w> is a sequence of length
    <math|<around|\||w|\|>=4<rsup|n>+n-1>, it contains <math|4<rsup|n>> (a
    priori non-unique) subwords of length <math|n>. We will show that there
    exists such a word where each of these subwords is unique. Arguing as in
    Gheorghiciuc and Ward (2007), define the de Bruijn graph
    <math|B<rsub|n>> to be the directed graph whose vertices are the words of
    length <math|n> over the alphabet <math|<around|{|A,C,T,G|}>>. The edges
    are defined as follows: there is a directed edge labeled with
    <math|w<rsub|1>,\<ldots\>,w<rsub|n+1>> which points from the vertex
    <math|w<rsub|1>,\<ldots\>,w<rsub|n>> to the vertex
    <math|w<rsub|2>,\<ldots\>,w<rsub|n+1>>. Since every de Bruijn graph
    <math|B<rsub|m>> is strongly connected and every vertex has in-degree and
    out-degree 4, each <math|B<rsub|m>> is Eulerian. Now it is easily
    observed that <math|B<rsub|n>> is the line graph of <math|B<rsub|n-1>>,
    so an Eulerian circuit in <math|B<rsub|n-1>> yields a Hamiltonian cycle
    in <math|B<rsub|n>>. Thus <math|B<rsub|n>> is Hamiltonian as well.
    Reading off the vertices along such a cycle, where consecutive vertices
    overlap in <math|n-1> letters, shows that there exists a sequence
    <math|w> of length <math|4<rsup|n>+n-1> that contains exactly
    <math|4<rsup|n>> <with|font-shape|italic|different> subwords of length
    <math|n>.
  </proof>
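  The existence argument above is constructive. As a hedged illustration
  (not the author's code), the following Python sketch finds an Eulerian
  circuit in <math|B<rsub|n-1>> with Hierholzer's algorithm and reads off the
  edge labels to obtain a sequence of length <math|4<rsup|n>+n-1>. The
  function name <verbatim|de_bruijn_sequence> and the iterative stack
  formulation are choices made for this sketch.

```python
def de_bruijn_sequence(n: int, alphabet: str = "ACTG") -> str:
    """Return a word of length k**n + n - 1 containing every length-n
    word over the alphabet exactly once (k = alphabet size)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    start = alphabet[0] * (n - 1)      # a vertex of B_{n-1}
    unused = {}                        # vertex -> letters of unused out-edges
    stack = [(start, "")]              # (vertex, letter of the edge into it)
    letters = []
    while stack:
        vertex, arrived_by = stack[-1]
        remaining = unused.setdefault(vertex, list(alphabet))
        if remaining:
            # Follow an unused edge: append a letter, drop the first one.
            a = remaining.pop()
            stack.append(((vertex + a)[1:], a))
        else:
            # Dead end: this edge is placed last among those still pending.
            stack.pop()
            letters.append(arrived_by)
    letters.reverse()
    return start + "".join(letters)


def count_distinct_subwords(w: str, n: int) -> int:
    """Number of distinct subwords of length n in w."""
    return len({w[i:i + n] for i in range(len(w) - n + 1)})
```

  For small <math|n>, one can check that the returned word has length
  <math|4<rsup|n>+n-1> and that <verbatim|count_distinct_subwords> returns
  <math|4<rsup|n>>, in agreement with the lemma.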

  <\proof>
    <dueto|Proof of lemma <reference|max entropy>>Suppose a sequence <math|w>
    has length <math|<around|\||w|\|>=4<rsup|n>+n-1> for some <math|n> and
    contains all possible subwords of length <math|n> (all <math|4<rsup|n>>
    of them). Then, applying definition <reference|topological entropy>, we
    have that

    <align|<tformat|<table|<row|<cell|H<rsub|t*o*p><around|(|w|)>>|<cell|=<frac|log<rsub|4><around|(|p<rsub|w<rsub|1><rsup|4<rsup|n>+n-1>><around|(|n|)>|)>|n>>>|<row|<cell|>|<cell|=<frac|log<rsub|4><around|(|p<rsub|w><around|(|n|)>|)>|n>>>|<row|<cell|>|<cell|=<frac|log<rsub|4><around|(|4<rsup|n>|)>|n>>>|<row|<cell|>|<cell|=1>>>>>
  </proof>

  <section|Supplementary Material >

  Figure <reference|MeansAndDevsMult> shows the error bar plots of
  topological entropy of introns and exons across each chromosome,
  justifying the selection of the longest 100 sequences from each region.

  <big-figure|<label|MeansAndDevsMult><image|MeansAndDevsMult7.eps|6.5in|||>
  |a) Error bar plot of average topological entropy for the longest 50
  introns and exons in each chromosome, b) Error bar plot of average
  topological entropy for the longest 200 introns and exons in each
  chromosome, c) Error bar plot of average topological entropy for the
  longest 400 introns and exons in each chromosome, and d) Error bar plot of
  average topological entropy for all introns and exons in each chromosome.>

  <section|Supplementary Material >

  We include in this section a simple implementation of the calculation of
  topological entropy utilizing Mathematica 7.0. The input is any sequence on
  a four character alphabet, and the output is a single number representing
  the topological entropy. This algorithm is automatically parallelized and
  scales well as more processors become available. As for computational
  intensity, the computation of topological entropy on a randomly chosen
  sequence of length <math|4<rsup|8>+8-1=65543> using this most
  straightforwardly coded algorithm on an Intel i7-950 3.6 GHz quad-core
  processor with 12 GB of RAM (though the algorithm itself uses very little
  RAM) takes an average of 5.02 seconds.

  <\with|font-size|0.71>
    <\verbatim>
      \;

      TopEntropyP::not4char =

      \ "The string has more than 4 unique characters";

      TopEntropyP[w_] /;

      If[Length[DeleteDuplicates[Characters[w]]] \<less\>= 4,

      True, Message[TopEntropyP::not4char]; False]:=

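      \ Module[{len = StringLength[w],

      \ \ logg = Floor[Log[4, StringLength[w]]]},

      \ \ (* after this step logg is the largest n with 4^n+n-1 at most len *)

      \ \ If[len\<less\>4^logg+logg-1,logg=logg-1];

      \ \ (* keep only the first 4^logg+logg-1 characters *)

      \ \ neww=ToLowerCase[StringTake[w, 4^logg+logg-1]];

      \ \ (* make the values available to the parallel kernels *)

      \ \ DistributeDefinitions[neww, logg, len];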

      \ \ N[

      \ \ \ Log[4,

      \ \ \ \ Length[

      \ \ \ \ \ DeleteDuplicates[

      \ \ \ \ \ \ ParallelTable[

      \ \ \ \ \ \ \ StringTake[neww,{i,i+logg-1}],{i,1,4^logg}]

      \ \ \ \ \ \ ]

      \ \ \ \ \ ]

      \ \ \ \ ]/logg,10

      \ \ \ ]

      \ \ ]

      \;
    </verbatim>
  </with>
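  For readers without access to Mathematica, the same calculation can be
  sketched in Python. This is a serial illustration under stated assumptions
  rather than a port of the code above: the name
  <verbatim|topological_entropy> is hypothetical, no parallelization is
  attempted, and raising an error for sequences shorter than four characters
  is a choice of this sketch.

```python
import math


def topological_entropy(w: str) -> float:
    """Approximate topological entropy of a sequence over a 4-letter alphabet."""
    seq = w.lower()
    if len(set(seq)) > 4:
        raise ValueError("The string has more than 4 unique characters")
    # Largest n with 4**n + n - 1 <= len(seq).
    n = 0
    while 4 ** (n + 1) + (n + 1) - 1 <= len(seq):
        n += 1
    if n == 0:
        raise ValueError("The string must contain at least 4 characters")
    # Only the first 4**n + n - 1 characters are used.
    prefix = seq[: 4 ** n + n - 1]
    # Distinct subwords of length n among the 4**n windows of the prefix.
    distinct = {prefix[i:i + n] for i in range(4 ** n)}
    return math.log(len(distinct), 4) / n
```

  A constant sequence such as <verbatim|"AAAA"> gives entropy 0, while a
  sequence of length <math|4<rsup|n>+n-1> that contains every subword of
  length <math|n> gives entropy 1, as in lemma <reference|max entropy>.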

  In table <reference|top table> we present the means and standard deviations
  of the topological entropy of the exons and introns in each chromosome.

  <\big-table>
    <label|top table>

    <tabular*|<tformat|<cwith|1|-1|1|1|cell-hyphen|t>|<cwith|1|-1|1|1|cell-hmode|exact>|<cwith|1|-1|1|1|cell-width|15ex>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|l>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-hyphen|t>|<cwith|1|-1|3|3|cell-hmode|exact>|<cwith|1|-1|3|3|cell-width|20ex>|<cwith|1|-1|4|4|cell-hyphen|t>|<cwith|1|-1|4|4|cell-hmode|exact>|<cwith|1|-1|4|4|cell-width|20ex>|<cwith|1|-1|4|4|cell-rborder|0ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|5|5|1|-1|cell-bborder|1ln>|<cwith|7|7|1|-1|cell-bborder|1ln>|<cwith|9|9|1|-1|cell-bborder|1ln>|<cwith|11|11|1|-1|cell-bborder|1ln>|<cwith|13|13|1|-1|cell-bborder|1ln>|<cwith|15|15|1|-1|cell-bborder|1ln>|<cwith|17|17|1|-1|cell-bborder|1ln>|<cwith|19|19|1|-1|cell-bborder|1ln>|<cwith|21|21|1|-1|cell-bborder|1ln>|<cwith|23|23|1|-1|cell-bborder|1ln>|<cwith|25|25|1|-1|cell-bborder|1ln>|<cwith|27|27|1|-1|cell-bborder|1ln>|<cwith|29|29|1|-1|cell-bborder|1ln>|<cwith|31|31|1|-1|cell-bborder|1ln>|<cwith|33|33|1|-1|cell-bborder|1ln>|<cwith|35|35|1|-1|cell-bborder|1ln>|<cwith|37|37|1|-1|cell-bborder|1ln>|<cwith|39|39|1|-1|cell-bborder|1ln>|<cwith|41|41|1|-1|cell-bborder|1ln>|<cwith|43|43|1|-1|cell-bborder|1ln>|<cwith|45|45|1|-1|cell-bborder|1ln>|<cwith|47|47|1|-1|cell-bborder|1ln>|<table|<row|<cell|Chromosome>|<cell|>|<cell|Mean
    of Topological Entropy>|<cell|Standard
    Deviation>>|<row|<cell|<text|Chr1>>|<cell|<text|Exons>>|<cell|0.898485>|<cell|0.0204572>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.934247>|<cell|0.00362997>>|<row|<cell|<text|Chr2>>|<cell|<text|Exons>>|<cell|0.904304>|<cell|0.013386>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.932893>|<cell|0.00302547>>|<row|<cell|<text|Chr3>>|<cell|<text|Exons>>|<cell|0.8978>|<cell|0.0233471>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933303>|<cell|0.00315674>>|<row|<cell|<text|Chr4>>|<cell|<text|Exons>>|<cell|0.905006>|<cell|0.0173517>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.932147>|<cell|0.0028979>>|<row|<cell|<text|Chr5>>|<cell|<text|Exons>>|<cell|0.918945>|<cell|0.00970226>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.93309>|<cell|0.00306843>>|<row|<cell|<text|Chr6>>|<cell|<text|Exons>>|<cell|0.902393>|<cell|0.0207456>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933356>|<cell|0.00323593>>|<row|<cell|<text|Chr7>>|<cell|<text|Exons>>|<cell|0.895162>|<cell|0.0246499>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.932822>|<cell|0.00351942>>|<row|<cell|<text|Chr8>>|<cell|<text|Exons>>|<cell|0.899962>|<cell|0.0182706>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933313>|<cell|0.00313769>>|<row|<cell|<text|Chr9>>|<cell|<text|Exons>>|<cell|0.904716>|<cell|0.0117213>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.932795>|<cell|0.00288724>>|<row|<cell|<text|Chr10>>|<cell|<text|Exons>>|<cell|0.903538>|<cell|0.013383>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933332>|<cell|0.00344157>>|<row|<cell|<text|Chr11>>|<cell|<text|Exons>>|<cell|0.899676>|<cell|0.0147793>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.932999>|<cell|0.0038576>>|<row|<cell|<text|Chr12>>|<cell|<text|Exons>>|<cell|0.893752>|<cell|0.027089>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.932167>|<cell|0.00364814>>|<row|<cell|<text|Chr13>>|<cell|<text|Exons>>|<cell|0.899857>|<cell|0.0183225>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933226>|<cell|0.00308836>>|<row|<cell|<text|Chr14>>|<cell|<text|Exons>>|<cell|0.900326>|<cell|0.0143402>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933549>|<cell|0.00392597>>|<row|<cell|<text|Chr15>>|<cell|<text|Exons>>|<cell|0.898687>|<cell|0.0242099>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.934057>|<cell|0.00321531>>|<row|<cell|<text|Chr16>>|<cell|<text|Exons>>|<cell|0.890238>|<cell|0.0303307>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.934175>|<cell|0.00432015>>|<row|<cell|<text|Chr17>>|<cell|<text|Exons>>|<cell|0.901288>|<cell|0.0130169>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.931886>|<cell|0.00397499>>|<row|<cell|<text|Chr18>>|<cell|<text|Exons>>|<cell|0.891662>|<cell|0.0269399>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933567>|<cell|0.00296836>>|<row|<cell|<text|Chr19>>|<cell|<text|Exons>>|<cell|0.889203>|<cell|0.0151235>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.927802>|<cell|0.00539303>>|<row|<cell|<text|Chr20>>|<cell|<text|Exons>>|<cell|0.896028>|<cell|0.0171093>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.932347>|<cell|0.00462207>>|<row|<cell|<text|Chr21>>|<cell|<text|Exons>>|<cell|0.881925>|<cell|0.026025>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.931734>|<cell|0.00441368>>|<row|<cell|<text|Chr22>>|<cell|<text|Exons>>|<cell|0.886374>|<cell|0.0220634>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.931269>|<cell|0.00484108>>|<row|<cell|<text|ChrX>>|<cell|<text|Exons>>|<cell|0.899906>|<cell|0.021806>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.933126>|<cell|0.00288883>>|<row|<cell|<text|ChrY>>|<cell|<text|Exons>>|<cell|0.879597>|<cell|0.0273487>>|<row|<cell|>|<cell|<text|Introns>>|<cell|0.922966>|<cell|0.0142828>>>>>
  </big-table|Topological Entropy for Introns and Exons in each Chromosome>

  Figure <reference|ave bar chart 100> is a visual representation of table
  <reference|top table>. The figure shows the mean topological entropies with
  the error bars representing the associated standard deviations.

  <\big-table>
    <label|length table>

    <tabular*|<tformat|<cwith|1|-1|1|1|cell-hyphen|t>|<cwith|1|-1|1|1|cell-hmode|exact>|<cwith|1|-1|1|1|cell-width|15ex>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-hyphen|t>|<cwith|1|-1|2|2|cell-hmode|exact>|<cwith|1|-1|2|2|cell-width|10ex>|<cwith|1|-1|3|3|cell-hyphen|t>|<cwith|1|-1|3|3|cell-hmode|exact>|<cwith|1|-1|3|3|cell-width|20ex>|<cwith|1|-1|4|4|cell-hyphen|t>|<cwith|1|-1|4|4|cell-hmode|exact>|<cwith|1|-1|4|4|cell-width|20ex>|<cwith|1|-1|5|5|cell-hyphen|t>|<cwith|1|-1|5|5|cell-hmode|exact>|<cwith|1|-1|5|5|cell-width|20ex>|<cwith|1|-1|6|6|cell-hyphen|t>|<cwith|1|-1|6|6|cell-hmode|exact>|<cwith|1|-1|6|6|cell-width|20ex>|<cwith|1|-1|6|6|cell-rborder|0ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|5|5|1|-1|cell-bborder|1ln>|<cwith|7|7|1|-1|cell-bborder|1ln>|<cwith|9|9|1|-1|cell-bborder|1ln>|<cwith|11|11|1|-1|cell-bborder|1ln>|<cwith|13|13|1|-1|cell-bborder|1ln>|<cwith|15|15|1|-1|cell-bborder|1ln>|<cwith|17|17|1|-1|cell-bborder|1ln>|<cwith|19|19|1|-1|cell-bborder|1ln>|<cwith|21|21|1|-1|cell-bborder|1ln>|<cwith|23|23|1|-1|cell-bborder|1ln>|<cwith|25|25|1|-1|cell-bborder|1ln>|<cwith|27|27|1|-1|cell-bborder|1ln>|<cwith|29|29|1|-1|cell-bborder|1ln>|<cwith|31|31|1|-1|cell-bborder|1ln>|<cwith|33|33|1|-1|cell-bborder|1ln>|<cwith|35|35|1|-1|cell-bborder|1ln>|<cwith|37|37|1|-1|cell-bborder|1ln>|<cwith|39|39|1|-1|cell-bborder|1ln>|<cwith|41|41|1|-1|cell-bborder|1ln>|<cwith|43|43|1|-1|cell-bborder|1ln>|<cwith|45|45|1|-1|cell-bborder|1ln>|<cwith|47|47|1|-1|cell-bborder|1ln>|<table|<row|<cell|Chromosome>|<cell|>|<cell|Maximum
    Length>|<cell|Minimum Length>|<cell|Mean
    Length>>|<row|<cell|<text|Chr1>>|<cell|<text|Exons>>|<cell|7812>|<cell|1300>|<cell|2111>>|<row|<cell|>|<cell|<text|Introns>>|<cell|510519>|<cell|138073>|<cell|229891>>|<row|<cell|<text|Chr2>>|<cell|<text|Exons>>|<cell|17331>|<cell|1428>|<cell|2402>>|<row|<cell|>|<cell|<text|Introns>>|<cell|811152>|<cell|116831>|<cell|210036>>|<row|<cell|<text|Chr3>>|<cell|<text|Exons>>|<cell|5920>|<cell|1151>|<cell|1764>>|<row|<cell|>|<cell|<text|Introns>>|<cell|427499>|<cell|102212>|<cell|168762>>|<row|<cell|<text|Chr4>>|<cell|<text|Exons>>|<cell|20938>|<cell|810>|<cell|1892>>|<row|<cell|>|<cell|<text|Introns>>|<cell|740920>|<cell|85373>|<cell|184905>>|<row|<cell|<text|Chr5>>|<cell|<text|Exons>>|<cell|16737>|<cell|1062>|<cell|1867>>|<row|<cell|>|<cell|<text|Introns>>|<cell|700383>|<cell|104051>|<cell|236844>>|<row|<cell|<text|Chr6>>|<cell|<text|Exons>>|<cell|8035>|<cell|1097>|<cell|1961>>|<row|<cell|>|<cell|<text|Introns>>|<cell|550366>|<cell|77389>|<cell|133592>>|<row|<cell|<text|Chr7>>|<cell|<text|Exons>>|<cell|8599>|<cell|1166>|<cell|2119>>|<row|<cell|>|<cell|<text|Introns>>|<cell|1096453>|<cell|62130>|<cell|185100>>|<row|<cell|<text|Chr8>>|<cell|<text|Exons>>|<cell|5565>|<cell|1253>|<cell|1995>>|<row|<cell|>|<cell|<text|Introns>>|<cell|1043911>|<cell|64153>|<cell|121759>>|<row|<cell|<text|Chr9>>|<cell|<text|Exons>>|<cell|4721>|<cell|678>|<cell|1489>>|<row|<cell|>|<cell|<text|Introns>>|<cell|411175>|<cell|91189>|<cell|146103>>|<row|<cell|<text|Chr10>>|<cell|<text|Exons>>|<cell|21693>|<cell|1823>|<cell|2632>>|<row|<cell|>|<cell|<text|Introns>>|<cell|320871>|<cell|41878>|<cell|82882.1>>|<row|<cell|<text|Chr11>>|<cell|<text|Exons>>|<cell|12048>|<cell|1615>|<cell|2767>>|<row|<cell|>|<cell|<text|Introns>>|<cell|1096470>|<cell|158119>|<cell|270878>>|<row|<cell|<text|Chr12>>|<cell|<text|Exons>>|<cell|3738>|<cell|956>|<cell|1573>>|<row|<cell|>|<cell|<text|Introns>>|<cell|544980>|<cell|56451>|<cell|118739>>|<row|<cell|<text|Chr13>>|<cell|<text|Exons>>|<cell|5916>|<cell|476>|<cell|936>>|<row|<cell|>|<cell|<text|Introns>>|<cell|396889>|<cell|46112>|<cell|106038>>|<row|<cell|<text|Chr14>>|<cell|<text|Exons>>|<cell|6762>|<cell|846>|<cell|1738>>|<row|<cell|>|<cell|<text|Introns>>|<cell|322908>|<cell|44005>|<cell|84671>>|<row|<cell|<text|Chr15>>|<cell|<text|Exons>>|<cell|17106>|<cell|1576>|<cell|2873>>|<row|<cell|>|<cell|<text|Introns>>|<cell|866399>|<cell|141832>|<cell|235972>>|<row|<cell|<text|Chr16>>|<cell|<text|Exons>>|<cell|7748>|<cell|1398>|<cell|2461>>|<row|<cell|>|<cell|<text|Introns>>|<cell|842377>|<cell|160862>|<cell|254995>>|<row|<cell|<text|Chr17>>|<cell|<text|Exons>>|<cell|6255>|<cell|1394>|<cell|2401>>|<row|<cell|>|<cell|<text|Introns>>|<cell|912253>|<cell|133506>|<cell|237662>>|<row|<cell|<text|Chr18>>|<cell|<text|Exons>>|<cell|10489>|<cell|2388>|<cell|2861>>|<row|<cell|>|<cell|<text|Introns>>|<cell|869182>|<cell|143530>|<cell|242207>>|<row|<cell|<text|Chr19>>|<cell|<text|Exons>>|<cell|7152>|<cell|1401>|<cell|2208>>|<row|<cell|>|<cell|<text|Introns>>|<cell|478751>|<cell|128419>|<cell|203055>>|<row|<cell|<text|Chr20>>|<cell|<text|Exons>>|<cell|12219>|<cell|1395>|<cell|2180>>|<row|<cell|>|<cell|<text|Introns>>|<cell|657297>|<cell|148214>|<cell|238464>>|<row|<cell|<text|Chr21>>|<cell|<text|Exons>>|<cell|7263>|<cell|1062>|<cell|1935>>|<row|<cell|>|<cell|<text|Introns>>|<cell|973451>|<cell|111063>|<cell|207047>>|<row|<cell|<text|Chr22>>|<cell|<text|Exons>>|<cell|6598>|<cell|1299>|<cell|2387>>|<row|<cell|>|<cell|<text|Introns>>|<cell|582141>|<cell|102459>|<cell|165488>>|<row|<cell|<text|ChrX>>|<cell|<text|Exons>>|<cell|6042>|<cell|1428>|<cell|2369>>|<row|<cell|>|<cell|<text|Introns>>|<cell|803075>|<cell|113812>|<cell|225649>>|<row|<cell|<text|ChrY>>|<cell|<text|Exons>>|<cell|2493>|<cell|188>|<cell|489>>|<row|<cell|>|<cell|<text|Introns>>|<cell|400349>|<cell|14490>|<cell|50375>>>>>
  </big-table|Length of 100 Longest Introns and Exons in each Chromosome>

  <\tabular*>
    <\tformat|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|0ln>|<cwith|1|-1|1|-1|cell-valign|c>>
      <\table>
        <\row>
          <\cell>
            <big-figure|<label|ave bar chart
            50><image|AveTopEnt50.eps|6.5in|||> |Plot of Average Topological
            Entropy for Longest 50 Introns and Exons in each Chromosome>
          </cell>
        <|row>
          <\cell>
            <big-figure|<label|ave bar chart
            200><image|AveTopEnt200.eps|6.5in|||> |Plot of Average
            Topological Entropy for Longest 200 Introns and Exons in each
            Chromosome>
          </cell>
        <|row>
          <\cell>
            <big-figure|<label|ave bar chart
            400><image|AveTopEnt400.eps|6.5in|||> |Plot of Average
            Topological Entropy for Longest 400 Introns and Exons in each
            Chromosome>
          </cell>
        <|row>
          <\cell>
            <big-figure|<label|ave bar chart
            all><image|AveTopEntAll.eps|6.5in|||> |Plot of Average
            Topological Entropy for All Introns and Exons in each Chromosome>
          </cell>
        </row>
      </table>
    </tformat>
  </tabular*>
</body>