<TeXmacs|1.0.6.9>

<style|tmarticle>

<\body>
  <doc-data|<doc-title|New algorithms for relaxed
  multiplication>|<doc-author-data|<author-name|Joris van der
  Hoeven>|<\author-address>
    <abbr|Dépt.> de Mathématiques (<abbr|Bât.> 425)

    CNRS, Université Paris-Sud

    91405 Orsay Cedex

    France

    Email: <verbatim|joris@texmacs.org>
  </author-address>>|<doc-date|<date>>|<doc-keywords|power
  series|multiplication|algorithm|FFT|computer
  algebra>|<doc-AMS-class|68W25|42-04|68W30|30B10|33F05|11Y55>>

  <\abstract>
    In previous work, we have introduced the technique of relaxed power
    series computations. With this technique, it is possible to solve
    implicit equations almost as quickly as doing the operations which occur
    in the implicit equation. Here ``almost as quickly'' means that we need
    to pay a logarithmic overhead. In this paper, we will show how to reduce
    this logarithmic factor in the case when the constant ring has
    sufficiently many <with|mode|math|2<rsup|p>>-th roots of unity.
  </abstract>

  <assign|fun|<macro|f|<with|mode|text|<with|font-family|ss|<arg|f>>>>><assign|type|<macro|f|<with|mode|text|<with|font-family|ss|<arg|f>>>>><with|mode|math|<assign|C|\<cal-C\>><assign|R|\<cal-R\>><assign|K|\<cal-K\>>><section|Introduction>

  Let <with|mode|math|<type|C>\<ni\>{<frac|1|2>}> be an effective ring and
  consider two power series <with|mode|math|f=f<rsub|0>+f<rsub|1>*z+\<cdots\>>
  and <with|mode|math|g=g<rsub|0>+g<rsub|1>*z+\<cdots\>> in
  <no-break><with|mode|math|<type|C>[[z]]>. In this paper we will be
  concerned with the efficient computation of the first <with|mode|math|n>
  coefficients of the product <with|mode|math|h=f*g=h<rsub|0>+h<rsub|1>*z+\<cdots\>>.

  If the first <with|mode|math|n> coefficients of <with|mode|math|f> and
  <with|mode|math|g> are known beforehand, then we may use any fast
  multiplication for polynomials in order to achieve this goal, such as
  divide and conquer multiplication <cite|Kar63|Kn97>, which has a time
  complexity <with|mode|math|K(n)=O(n<rsup|log 3/log 2>)>, or <abbr|F.F.T.>
  multiplication <cite|CT65|SS71|CK91|vdH:relax>, which has a time complexity
  <with|mode|math|M(n)=O(n*log <no-break>n*log <no-break>log n)>.

  <yes-indent>For certain computations, and most importantly the resolution
  of implicit equations, it is interesting to use so-called ``relaxed
  algorithms'' which output the first <with|mode|math|i> coefficients of
  <with|mode|math|h> as soon as the first <with|mode|math|i> coefficients of
  <with|mode|math|f> and <with|mode|math|g> are known for each
  <with|mode|math|i\<leqslant\><no-break>n>. This allows for instance the
  computation of the exponential <with|mode|math|g=exp f> of a series
  <with|mode|math|f> with <with|mode|math|f<rsub|0>=0> using the formula

  <\equation>
    <label|exp-form>g=<big|int>f<rprime|'>*g.
  </equation>

  More precisely, this formula shows that the computation of
  <with|mode|math|exp f> reduces to one differentiation, one relaxed product
  and one relaxed integration. Differentiation and relaxed integration being
  linear in time, it follows that <with|mode|math|n> terms of
  <with|mode|math|exp f> can be computed in time <with|mode|math|R(n)+O(n)>,
  where <with|mode|math|R(n)> denotes the time complexity of relaxed
  multiplication. In <cite|vdH:issac97|vdH:relax>, we proved the following
  theorem:

  <\theorem>
    <label|old-relax-th>There exists a relaxed multiplication algorithm of
    time complexity

    <\equation*>
      R(n)=O(M(n)*log n)
    </equation*>

    <no-page-break*>and space complexity <with|mode|math|O(n)>.
  </theorem>
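
  As a small illustration of formula (<reference|exp-form>), the following
  Python sketch computes the coefficients of the exponential one at a time.
  It is only meant to show the data dependencies: the product is evaluated
  naively, in quadratic time, and the name <verbatim|exp_series> as well as
  the use of exact rationals are our own assumptions for this example, not
  part of the algorithms discussed in this paper.

```python
from fractions import Fraction

def exp_series(f, n):
    """First n coefficients of g = exp(f), assuming f[0] == 0.

    Uses g = int(f' * g): the coefficient g[k] only depends on
    g[0..k-1] and f[1..k], which is why a relaxed product is needed.
    """
    if f[0] != 0:
        raise ValueError("exp_series assumes f[0] == 0")
    # Pad f with zeros so that f[i] exists for every i < n.
    f = [Fraction(c) for c in f] + [Fraction(0)] * n
    g = [Fraction(1)]
    for k in range(1, n):
        # (f' * g)[k-1] = sum_{i=1..k} i * f[i] * g[k-i]; integration divides by k.
        s = sum(i * f[i] * g[k - i] for i in range(1, k + 1))
        g.append(s / k)
    return g
```

  Replacing the inner sum by a relaxed multiplication is what brings the
  cost down from quadratic to the bound of the theorem, up to the linear
  cost of differentiation and integration.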

  In this paper, we will improve the time complexity bound in this theorem in
  the case when <with|mode|math|<type|C>> admits
  <with|mode|math|2<rsup|p>>-th roots of unity for any
  <with|mode|math|p\<in\>\<bbb-N\>>. In section <reference|semi-relaxed-sec>,
  we first reduce this problem to the case of ``semi-relaxed
  multiplication'', when one of the arguments is fixed and the other one
  relaxed. More precisely, let <with|mode|math|f> and <with|mode|math|g> be
  power series, such that <with|mode|math|g> is known up to order
  <no-break><with|mode|math|n>. Then a semi-relaxed multiplication algorithm
  computes the product <with|mode|math|h=f*g> up to order <with|mode|math|n>
  and outputs <with|mode|math|(f*g)<rsub|i>> as soon as
  <with|mode|math|f<rsub|0>,\<ldots\>,f<rsub|i>> are known, for all
  <no-break><with|mode|math|i\<less\>n>. In section
  <reference|impr-relaxed-sec>, we show that the <with|mode|math|log n>
  overhead in theorem <reference|old-relax-th> can be reduced to
  <with|mode|math|O((log n)<rsup|log 3/log 2>)>. In section
  <no-break><reference|new-relaxed-sec>, the technique of section
  <reference|impr-relaxed-sec> is further improved so as to yield an
  <with|mode|math|O(\<mathe\><rsup|2*<sqrt|log 2*log log n>>)> overhead.

  In the sequel, we will use the following notations from <cite|vdH:relax>:
  we denote by <with|mode|math|<type|C>[[z]]<rsub|n>\<subseteq\><type|C>[z]\<subseteq\><type|C>[[z]]>
  the set of truncated power series of order <with|mode|math|n>, like
  <with|mode|math|f=f<rsub|0>+\<cdots\>+f<rsub|n-1>*z<rsup|n-1>>. Given
  <with|mode|math|f\<in\><type|C>[[z]]<rsub|n>> and
  <with|mode|math|0\<leqslant\>i\<less\>j\<leqslant\>n>, we will denote
  <with|mode|math|f<rsub|i\<ldots\>j>=f<rsub|i>+\<cdots\>+f<rsub|j-1>*z<rsup|j-i-1>\<in\><type|C>[[z]]<rsub|j-i>>.

  <\remark>
    A preprint of the present paper was published a few years
    ago<nbsp><cite|vdH:newrelax:pre>. The current version includes a new
    section<nbsp><reference|impl-sec> with implementation details, benchmarks
    and a<nbsp>few notes on how to apply similar ideas in the Karatsuba and
    Toom-Cook models. Another algorithm for semi-relaxed multiplication,
    based on the middle product <cite|HQZ04>, was also published before
    <cite|vdH:issac03>.
  </remark>

  <\remark>
    The exotic form <with|mode|math|O(n*log n*\<mathe\><rsup|2*<sqrt|log
    2*log log n>>)> of the new complexity for relaxed multiplication might
    surprise the reader. Note, however, that Toom-Cook's algorithm for
    polynomial multiplication <cite|Toom63b|Cook66> has a time complexity
    of a similar form, <with|mode|math|O(n*log n*2<rsup|<sqrt|2*log
    n>>)> <cite-detail|Kn97|Section 4.3, p. 286 and exercise 5, p. 300>.
    Indeed, whereas our algorithm from section <reference|impr-relaxed-sec>
    has a Karatsuba-like flavour, the algorithm from
    section<nbsp><reference|new-relaxed-sec> uses a generalized subdivision
    which is similar to the one used by Toom and Cook.

    An interesting question is whether even better time complexities can be
    obtained (in analogy with FFT-multiplication). However, we have not
    managed so far to reduce the cost of relaxed multiplication to
    <with|mode|math|O(M(n))> or <with|mode|math|O(M(n)*log log log n)>.
    Nevertheless, it should be noticed that the function
    <with|mode|math|\<mathe\><rsup|2*<sqrt|log 2*log log n>>> grows very
    slowly; in practice, it behaves very much like a constant (see
    section<nbsp><reference|impl-sec>).
  </remark>

  <\remark>
    The reader may wonder whether further improvements in the complexity of
    relaxed multiplication are really useful, since the algorithms from
    <cite|vdH:issac97|vdH:relax> are already optimal up to a factor
    <with|mode|math|O(log n)>. In fact, we expect fast algorithms for formal
    power series to be one of the building blocks for effective analysis
    <cite|vdH:riemann>. Therefore, even small improvements in the complexity
    of relaxed multiplication should lead to global speed-ups for this kind
    of software.
  </remark>

  <section|Full and semi-relaxed multiplication>

  <label|semi-relaxed-sec>In <cite|vdH:issac97|vdH:relax>, we have stated
  several fast algorithms for relaxed multiplication. Let us briefly recall
  some of the main concepts and ideas. For details, we refer
  to<nbsp><cite|vdH:relax>. Throughout this section, <with|mode|math|f> and
  <with|mode|math|g> are two power series in <with|mode|math|<type|C>[[z]]>.

  <\definition>
    We call

    <\equation>
      <label|full-prod>P=f<rsub|0\<ldots\>n>*g<rsub|0\<ldots\>n>
    </equation>

    the full product of <with|mode|math|f> and <with|mode|math|g> at order
    <with|mode|math|n>.
  </definition>

  <\definition>
    We call

    <\equation>
      <label|trunc-prod>P=<big|sum><rsub|i+j\<less\>n>(f<rsub|i>*g<rsub|j>)*z<rsup|i+j>
    </equation>

    <no-page-break*>the truncated product of <with|mode|math|f> and
    <with|mode|math|g> at order <with|mode|math|n>.
  </definition>

  <\definition>
    A full (or truncated) zealous multiplication algorithm of
    <with|mode|math|f> and <with|mode|math|g> at
    order<nbsp><with|mode|math|n> takes <with|mode|math|f<rsub|0>,\<ldots\>,f<rsub|n-1>>
    and <with|mode|math|g<rsub|0>,\<ldots\>,g<rsub|n-1>> on input and
    computes <with|mode|math|P> as in (<reference|full-prod>) (<abbr|resp.>
    (<reference|trunc-prod>)).
  </definition>

  <\definition>
    A full (or truncated) relaxed multiplication algorithm of
    <with|mode|math|f> and <with|mode|math|g> at
    order<nbsp><with|mode|math|n> successively takes the pairs
    <with|mode|math|(f<rsub|0>,g<rsub|0>),\<ldots\>,(f<rsub|n-1>,g<rsub|n-1>)>
    on input and successively computes <with|mode|math|P<rsub|0>,\<ldots\>,P<rsub|2*n-2>>
    (resp. <with|mode|math|P<rsub|0>,\<ldots\>,P<rsub|n-1>>). Here it is
    understood that <with|mode|math|P<rsub|i>> is output as soon as
    <with|mode|math|(f<rsub|0>,<no-break>g<rsub|0>),\<ldots\>,(f<rsub|i>,g<rsub|i>)>
    are known.
  </definition>

  <\definition>
    A full (or truncated) semi-relaxed multiplication algorithm of
    <with|mode|math|f> and <with|mode|math|g> takes
    <with|mode|math|g<rsub|0>,\<ldots\>,g<rsub|n-1>> and the successive
    values <with|mode|math|f<rsub|0>,\<ldots\>,f<rsub|n-1>> on input and
    successively computes <with|mode|math|P<rsub|0>,\<ldots\>,<no-break>P<rsub|2*n-2>>
    (resp. <with|mode|math|P<rsub|0>,\<ldots\>,P<rsub|n-1>>). Here it is
    understood that <with|mode|math|P<rsub|i>> is output as soon as
    <with|mode|math|f<rsub|0>,\<ldots\>,<no-break>f<rsub|i>> are known.
  </definition>

  We will denote by <with|mode|math|M(n)>, <with|mode|math|R(n)> and
  <with|mode|math|Q(n)> the time complexities of full zealous, relaxed and
  semi-relaxed multiplication at order <with|mode|math|n>, where it is
  understood that the ring operations in <with|mode|math|<type|C>> can be
  performed in time <with|mode|math|O(1)>. We notice that full zealous
  multiplication is equivalent to polynomial multiplication. Hence, classical
  fast multiplication algorithms can be applied in this case
  <cite|Kar63|Toom63b|Cook66|CT65|SS71|CK91|vdH:relax>.

  The main idea behind efficient algorithms for relaxed multiplication is to
  anticipate future computations. More precisely, the computation of a
  full product (<reference|full-prod>) can be represented by an
  <with|mode|math|n\<times\>n> square with entries
  <with|mode|math|f<rsub|i>*g<rsub|j>>, <with|mode|math|0\<leqslant\>i,j\<less\>n>.
  As soon as <with|mode|math|f<rsub|0>,\<ldots\>,f<rsub|i>> and
  <with|mode|math|g<rsub|0>,\<ldots\>,<no-break>g<rsub|i>> are known, it
  becomes possible to compute the contributions of the products
  <with|mode|math|f<rsub|j>*g<rsub|k>> with
  <with|mode|math|0\<leqslant\>j,k\<leqslant\>i> to <with|mode|math|P>, even
  though the contributions of <with|mode|math|f<rsub|j>*g<rsub|k>> with
  <with|mode|math|j+k\<gtr\>i> are not yet needed. The next idea is to
  subdivide the <with|mode|math|n\<times\>n> square into smaller squares, in
  such a way that the contribution of each small square to <with|mode|math|P>
  can be computed using a zealous algorithm. Now the contribution of such a
  small square is of the form <with|mode|math|f<rsub|i<rsub|1>\<ldots\>i<rsub|2>>*g<rsub|j<rsub|1>\<ldots\>j<rsub|2>>*z<rsup|i<rsub|1>+j<rsub|1>>>.
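  Such a block can be computed as soon as <with|mode|math|f<rsub|i<rsub|2>-1>>
  and <with|mode|math|g<rsub|j<rsub|2>-1>> are known, and it only affects the
  coefficients <with|mode|math|P<rsub|k>> with <with|mode|math|k\<geqslant\>i<rsub|1>+j<rsub|1>>.
  Therefore, the requirement <with|mode|math|i<rsub|1>+j<rsub|1>\<geqslant\>max(i<rsub|2>,j<rsub|2>)-1>
  suffices to ensure that the resulting algorithm will be relaxed. In the
  left-hand image of figure <reference|dichrelax-fig>, we have shown the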
  subdivision from the main algorithm of <cite|vdH:issac97|vdH:relax>, which
  has time complexity <with|mode|math|R(n)=O(M(n)*log n)>.

  <\big-figure|<postscript|nonlin-relax.ps|*.5|*.5||||><space|2fn><postscript|lin-relax.ps|*.5|*.5||||>>
    <label|dichrelax-fig>Illustration of the facts that (1) a full relaxed
    <with|mode|math|2*n\<times\>2*n> multiplication reduces to one full
    relaxed <with|mode|math|n\<times\>n> multiplication, two semi-relaxed
    <with|mode|math|n\<times\>n> multiplications and one zealous
    <with|mode|math|n\<times\>n> multiplication, and (2) a semi-relaxed
    <with|mode|math|2*n\<times\>2*n> multiplication reduces to two
    semi-relaxed <with|mode|math|n\<times\>n> multiplications and two zealous
    <with|mode|math|n\<times\>n> multiplications.
  </big-figure>
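
  The idea of anticipating future computations can be sketched in Python as
  follows. The class below feeds the pairs of coefficients one at a time and
  multiplies dyadic blocks zealously as soon as all their input coefficients
  are available, which is never later than the moment their first output
  coefficient is needed. This is one possible dyadic subdivision; we do not
  claim that it matches the figure cell by cell. The names
  <verbatim|RelaxedProduct>, <verbatim|push> and <verbatim|naive_mul> are
  assumptions made for this example, and the quadratic
  <verbatim|naive_mul> is only a stand-in for a fast zealous product.

```python
def naive_mul(a, b):
    """Zealous product of two coefficient lists (quadratic stand-in for M(n))."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return c

class RelaxedProduct:
    """Relaxed product h = f*g: h_i is returned as soon as f_i and g_i are fed."""

    def __init__(self, mul=naive_mul):
        self.f, self.g, self.h = [], [], []
        self.mul = mul

    def _add(self, a, b, shift):
        # Add the zealous product a*b, shifted by z^shift, to the accumulator h.
        c = self.mul(a, b)
        missing = shift + len(c) - len(self.h)
        if missing > 0:
            self.h.extend([0] * missing)
        for t, x in enumerate(c):
            self.h[shift + t] += x

    def push(self, fi, gi):
        f, g = self.f, self.g
        f.append(fi)
        g.append(gi)
        i = len(f) - 1
        m, q = i + 2, 1
        # Visit every power of two q dividing i + 2.
        while m % q == 0:
            if m == 2 * q:
                # Diagonal block f[q-1..2q-1) * g[q-1..2q-1).
                self._add(f[q - 1:2 * q - 1], g[q - 1:2 * q - 1], i)
                break
            # Two symmetric off-diagonal blocks of size q.
            self._add(f[q - 1:2 * q - 1], g[i - q + 1:i + 1], i)
            self._add(g[q - 1:2 * q - 1], f[i - q + 1:i + 1], i)
            q *= 2
        return self.h[i]
```

  Each call to <verbatim|push> only reads coefficients that have already
  been supplied, so the object may be used inside a recursive definition
  such as the one for the exponential.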

  There is an alternative interpretation of the left-hand image in figure
  <reference|dichrelax-fig>: when interpreting the big square as a
  <with|mode|math|2*n\<times\>2*n> multiplication

  <\equation*>
    P=f<rsub|0\<ldots\>2*n>*g<rsub|0\<ldots\>2*n>,
  </equation*>

  we may regard it as the sum

  <\equation*>
    P=P<rsub|0,0>+P<rsub|0,1>*z<rsup|n>+P<rsub|1,0>*z<rsup|n>+P<rsub|1,1>*z<rsup|2*n>
  </equation*>

  of four <with|mode|math|n\<times\>n> multiplications

  <\eqnarray*>
    <tformat|<table|<row|<cell|P<rsub|0,0>>|<cell|=>|<cell|f<rsub|0\<ldots\>n>*g<rsub|0\<ldots\>n>>>|<row|<cell|P<rsub|0,1>>|<cell|=>|<cell|f<rsub|0\<ldots\>n>*g<rsub|n\<ldots\>2*n>>>|<row|<cell|P<rsub|1,0>>|<cell|=>|<cell|f<rsub|n\<ldots\>2*n>*g<rsub|0\<ldots\>n>>>|<row|<cell|P<rsub|1,1>>|<cell|=>|<cell|f<rsub|n\<ldots\>2*n>*g<rsub|n\<ldots\>2*n>.>>>>
  </eqnarray*>

  Now <with|mode|math|P<rsub|0,0>> is a relaxed multiplication at order
  <with|mode|math|n>, but <with|mode|math|P<rsub|0,1>> is even semi-relaxed,
  since <with|mode|math|g<rsub|0>,\<ldots\>,<no-break>g<rsub|n-1>> are
  already known by the time that we need <with|mode|math|(P<rsub|0,1>)<rsub|0>>.
  Similarly, <with|mode|math|P<rsub|1,0>> corresponds to a semi-relaxed
  product and <with|mode|math|P<rsub|1,1>> to a zealous product. This shows
  that

  <\equation*>
    R(2*n)\<leqslant\>R(n)+2*Q(n)+M(n).
  </equation*>

  Similarly, we have

  <\equation*>
    Q(2*n)\<leqslant\>2*Q(n)+2*M(n),
  </equation*>

  as illustrated in the right-hand image of figure <reference|dichrelax-fig>.
  Under suitable regularity hypotheses for <with|mode|math|M(n)> and
  <with|mode|math|Q(n)>, the above relations imply:

  <\theorem>
    <label|rel-srel-th>

    <\enumerate-alpha>
      <item>If <with|mode|math|<frac|M(n)|n>> is increasing, then
      <with|mode|math|Q(n)=O(M(n)*log n)>.

      <item>If <with|mode|math|<frac|Q(n)|n>> is increasing, then
      <with|mode|math|R(n)=O(Q(n))>.
    </enumerate-alpha>
  </theorem>
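
  For the reader's convenience, the two parts of the theorem can be
  recovered by unrolling the above recurrences for a power of two. The
  following lines are only a sketch, written in LaTeX notation; they
  additionally assume that the zealous cost is at least linear and bounded
  by the semi-relaxed cost, which are our own (harmless) assumptions.

```latex
% n = 2^p throughout.
% (a) M(n)/n increasing gives 2^i M(n/2^i) \le M(n); we also assume n = O(M(n)).
Q(n) \le 2\,Q(n/2) + 2\,M(n/2)
     \le n\,Q(1) + \sum_{i=1}^{p} 2^{i}\,M(n/2^{i})
     \le n\,Q(1) + p\,M(n)
     = O(M(n)\log n).

% (b) Q(n)/n increasing gives Q(n/2^i) \le 2^{-i} Q(n); we also assume M \le Q.
R(n) \le R(n/2) + 2\,Q(n/2) + M(n/2)
     \le R(1) + \sum_{i=1}^{p} \bigl( 2\,Q(n/2^{i}) + M(n/2^{i}) \bigr)
     \le R(1) + 3\,Q(n) \sum_{i=1}^{p} 2^{-i}
     \le R(1) + 3\,Q(n)
     = O(Q(n)).
```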

  <yes-indent>A consequence of part (<em|b>) of the theorem is that it
  suffices to design fast algorithms for semi-relaxed multiplication in order
  to obtain fast algorithms for relaxed multiplication. This fact may be
  reinterpreted by observing that the fast relaxed multiplication algorithm
  actually applies Newton's method in a hidden way. Indeed, since Brent and
  Kung <cite|BK78>, it is well known that Newton's method can also be used in
  the context of formal power series in order to solve differential or
  functional equations. One step of Newton's method at order
  <with|mode|math|n> involves the recursive application of the method at
  order <with|mode|math|\<lceil\>n/2\<rceil\>> and the resolution of a linear
  equation at order <with|mode|math|\<lfloor\>n/2\<rfloor\>>. The resolution
  of the linear equation corresponds to the computation of the two
  semi-relaxed products.

  <section|A new algorithm for fast relaxed multiplication>

  <label|impr-relaxed-sec>Assume from now on that <with|mode|math|<type|C>>
  admits an <with|mode|math|n>-th root of unity
  <with|mode|math|\<omega\><rsub|n>> for every power of two
  <no-break><with|mode|math|n\<in\>2<rsup|\<bbb-N\>>>. Given an element
  <with|mode|math|f\<in\><type|C>[[z]]<rsub|n>>, let
  <with|mode|math|FFT<rsub|n>(f)\<in\><type|C><rsup|n>> denote its Fourier
  transform

  <\equation*>
    FFT<rsub|n>(f)=(f(1),f(\<omega\><rsub|n>),\<ldots\>,f(\<omega\><rsub|n><rsup|n-1>))
  </equation*>

  and let <with|mode|math|FFT<rsub|n><rsup|-1>:<type|C><rsup|n>\<rightarrow\><type|C>[[z]]<rsub|n>>
  be the inverse mapping of <with|mode|math|FFT<rsub|n>>. It is well known
  that both <with|mode|math|FFT<rsub|n>> and
  <with|mode|math|FFT<rsub|n><rsup|-1>> can be computed in time
  <with|mode|math|O(n*log n)>. Furthermore, if
  <with|mode|math|f,g\<in\><type|C>[[z]]<rsub|n>> are such that
  <with|mode|math|f*g\<in\><type|C>[[z]]<rsub|n>>, then

  <\equation*>
    f*g=FFT<rsub|n><rsup|-1>(FFT<rsub|n>(f)*FFT<rsub|n>(g)),
  </equation*>

  where the product in <with|mode|math|<type|C><rsup|n>> is componentwise
  multiplication <with|mode|math|(a<rsub|0>,\<ldots\>,a<rsub|n-1>)*(b<rsub|0>,\<ldots\>,b<rsub|n-1>)=(a<rsub|0>*b<rsub|0>,\<ldots\>,a<rsub|n-1>*b<rsub|n-1>)>.
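
  The product formula can be checked numerically with the following Python
  sketch over the complex numbers. It is a textbook radix-2 transform with
  floating point roots of unity, not the implementation used for our
  benchmarks; the names <verbatim|fft> and <verbatim|fft_product> are
  assumptions made for this example.

```python
import cmath

def fft(a, inverse=False):
    """Values of the polynomial a at 1, w, ..., w^(n-1), for n = len(a) a power of two.

    Here w = exp(2*pi*i/n), or its inverse when inverse=True; the inverse
    transform is returned without the 1/n normalisation.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = -1 if inverse else 1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_product(f, g):
    """Product f*g for f, g of length n with deg(f*g) < n (n a power of two)."""
    n = len(f)
    F, G = fft(f), fft(g)
    H = [x * y for x, y in zip(F, G)]  # componentwise product in C^n
    return [c / n for c in fft(H, inverse=True)]
```

  The inputs must be padded with zeros so that the product fits in length
  <with|mode|math|n>; otherwise the result is only correct modulo
  <with|mode|math|z<rsup|n>-1>.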

  Now consider a decomposition <with|mode|math|n=n<rsub|1>*n<rsub|2>> with
  <with|mode|math|n<rsub|1>=2<rsup|p<rsub|1>>> and
  <with|mode|math|n<rsub|2>=2<rsup|p<rsub|2>>>. Then a truncated power series
  <with|mode|math|f\<in\><type|C>[z]<rsub|n>> can be rewritten as a series

  <\equation*>
    f<rsub|0\<ldots\>n<rsub|1>>+f<rsub|n<rsub|1>\<ldots\>2*n<rsub|1>>*y+\<cdots\>+f<rsub|(n<rsub|2>-1)*n<rsub|1>\<ldots\>n<rsub|2>*n<rsub|1>>*y<rsup|n<rsub|2>-1>
  </equation*>

  in <with|mode|math|<type|C>[z]<rsub|n<rsub|1>>[y]<rsub|n<rsub|2>>>, where
  <with|mode|math|y=z<rsup|n<rsub|1>>>. This series may again be
  reinterpreted as a series <with|mode|math|\<Nu\>(f)\<in\><type|C>[z]<rsub|2*n<rsub|1>>[y]<rsub|n<rsub|2>>>,
  and we have

  <\equation*>
    f*g=\<Nu\><rsup|-1>(\<Nu\>(f)*\<Nu\>(g)),
  </equation*>

  where <with|mode|math|\<Nu\><rsup|-1>:<type|C>[z]<rsub|2*n<rsub|1>>[y]\<rightarrow\><type|C>[z]>
  is the mapping which substitutes <with|mode|math|z<rsup|n<rsub|1>>> for
  <with|mode|math|y>. Also, the FFT-transform
  <with|mode|math|FFT<rsub|2*n<rsub|1>>:<type|C>[z]<rsub|2*n<rsub|1>>\<rightarrow\><type|C><rsup|2*n<rsub|1>>>
  may be extended to a mapping

  <\eqnarray*>
    <tformat|<table|<row|<cell|<type|C>[z]<rsub|2*n<rsub|1>>[y]<rsub|l>>|<cell|\<longrightarrow\>>|<cell|<type|C><rsup|2*n<rsub|1>>[y]<rsub|l>>>|<row|<cell|c<rsub|0>+\<cdots\>+c<rsub|l-1>*y<rsup|l-1>>|<cell|\<longmapsto\>>|<cell|FFT<rsub|2*n<rsub|1>>(c<rsub|0>)+\<cdots\>+FFT<rsub|2*n<rsub|1>>(c<rsub|l-1>)*y<rsup|l-1>>>>>
  </eqnarray*>

  for each <with|mode|math|l>, and similarly for its inverse
  <with|mode|math|FFT<rsub|2*n<rsub|1>><rsup|-1>>. Now the formula

  <\equation*>
    f*g=\<Nu\><rsup|-1>(FFT<rsub|2*n<rsub|1>><rsup|-1>(FFT<rsub|2*n<rsub|1>>(\<Nu\>(f))*FFT<rsub|2*n<rsub|1>>(\<Nu\>(g))))
  </equation*>

  yields a way to compute <with|mode|math|f*g> by reusing the Fourier
  transforms of the ``bunches of coefficients''
  <with|mode|math|f<rsub|k*n<rsub|1>\<ldots\>(k+1)*n<rsub|1>>> and
  <with|mode|math|g<rsub|l*n<rsub|1>\<ldots\>(l+1)*n<rsub|1>>> many times.
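
  The reuse of the transformed bunches of coefficients can be made concrete
  with the following Python sketch of a zealous product. Each block of
  <verbatim|n1> coefficients is padded to length <verbatim|2*n1> and
  transformed once, the product in the variable <with|mode|math|y> is done
  naively on transformed blocks, and the result is transformed back and
  recombined. The function names and the quadratic <verbatim|dft> helper are
  assumptions made to keep the example short.

```python
import cmath

def dft(a, inverse=False):
    """Naive transform: values of a at the powers of w = exp(2*pi*i/len(a))."""
    n = len(a)
    s = -1 if inverse else 1
    out = [sum(a[t] * cmath.exp(s * 2j * cmath.pi * t * k / n) for t in range(n))
           for k in range(n)]
    return [x / n for x in out] if inverse else out

def blockwise_product(f, g, n1):
    """Full product of f and g (same length n = n1*n2) via transformed blocks."""
    n2 = len(f) // n1
    # Blocks of n1 coefficients, zero-padded to length 2*n1, transformed once each.
    F = [dft(f[k * n1:(k + 1) * n1] + [0] * n1) for k in range(n2)]
    G = [dft(g[l * n1:(l + 1) * n1] + [0] * n1) for l in range(n2)]
    # Product in C^(2*n1)[y]: naive in y, componentwise in C^(2*n1).
    H = [[0] * (2 * n1) for _ in range(2 * n2 - 1)]
    for k in range(n2):
        for l in range(n2):
            for t in range(2 * n1):
                H[k + l][t] += F[k][t] * G[l][t]
    # Transform back and substitute y = z^n1 (overlapping blocks are added).
    h = [0] * (2 * n1 * n2)
    for m, Hm in enumerate(H):
        for t, c in enumerate(dft(Hm, inverse=True)):
            h[m * n1 + t] += c
    return h
```

  Only <verbatim|2*n2> forward transforms are needed for the
  <verbatim|n2*n2> block products, which is the saving exploited in the
  sequel.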

  In the context of a semi-relaxed multiplication <with|mode|math|f*g> with
  fixed argument <with|mode|math|g>, the above scheme almost reduces the
  computation of an <with|mode|math|n\<times\>n> product with coefficients in
  <with|mode|math|<type|C>> to the computation of an
  <with|mode|math|n<rsub|2>\<times\>n<rsub|2>> product with coefficients in
  <with|mode|math|<type|C><rsup|2*n<rsub|1>>>. The only problem which remains
  is that we can only compute <with|mode|math|FFT<rsub|2*n<rsub|1>>(f<rsub|k*n<rsub|1>\<ldots\>(k+1)*n<rsub|1>>)>
  when <with|mode|math|f<rsub|k*n<rsub|1>>,\<ldots\>,f<rsub|(k+1)*n<rsub|1>-1>>
  are all known. Consequently, the products
  <with|mode|math|f<rsub|k*n<rsub|1>\<ldots\>(k+1)*n<rsub|1>>*g<rsub|0\<ldots\>n<rsub|1>>>
  should be computed apart, using a traditional semi-relaxed multiplication.
  In other words, we have reduced the computation of a semi-relaxed
  <with|mode|math|n\<times\>n> product with coefficients in
  <with|mode|math|<type|C>> to the computation of <with|mode|math|n<rsub|2>>
  semi-relaxed <with|mode|math|n<rsub|1>\<times\>n<rsub|1>> products with
  coefficients in <with|mode|math|<type|C>>, one semi-relaxed
  <with|mode|math|n<rsub|2>\<times\>(n<rsub|2>-1)> product with coefficients
  in <with|mode|math|<type|C><rsup|2*n<rsub|1>>> and
  <with|mode|math|4*n<rsub|2>-3> FFT-transforms of length
  <with|mode|math|2*n<rsub|1>>. This has been illustrated in figure
  <reference|fftrelax-fig>.

  <\big-figure|<postscript|fftrelax.ps|*.5|*.5||||>>
    <label|fftrelax-fig>New decomposition of a semi-relaxed
    <with|mode|math|n\<times\>n> multiplication into
    <with|mode|math|n/n<rsub|1>> semi-relaxed
    <with|mode|math|n<rsub|1>\<times\>n<rsub|1>> multiplications (the light
    regions) and one semi-relaxed <with|mode|math|n<rsub|2>\<times\>(n<rsub|2>-1)>
    multiplication (the dark region) with FFT-ed coefficients in
    <with|mode|math|<type|C><rsup|2*n<rsub|1>>>.
  </big-figure>
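
  A minimal Python sketch of this decomposition, for a truncated product, is
  given below. It makes two simplifying assumptions of our own: the light
  regions are handled by a naive semi-relaxed product instead of a recursive
  one, and the product on transformed blocks (the dark region) is naive as
  well. The class name <verbatim|SemiRelaxedBlockProduct> and the quadratic
  <verbatim|dft> helper are hypothetical.

```python
import cmath

def dft(a, inverse=False):
    """Naive transform: values of a at the powers of w = exp(2*pi*i/len(a))."""
    n = len(a)
    s = -1 if inverse else 1
    out = [sum(a[t] * cmath.exp(s * 2j * cmath.pi * t * k / n) for t in range(n))
           for k in range(n)]
    return [x / n for x in out] if inverse else out

class SemiRelaxedBlockProduct:
    """Truncated product h = f*g at order n = len(g) = n1*n2; g is fixed, f is fed."""

    def __init__(self, g, n1):
        self.g, self.n1 = list(g), n1
        self.n2 = len(g) // n1
        # Transformed blocks of the fixed argument g, zero-padded to length 2*n1.
        self.G = [dft(self.g[l * n1:(l + 1) * n1] + [0] * n1) for l in range(self.n2)]
        # Accumulators for the coefficients of y^m, as vectors in C^(2*n1).
        self.H = [[0] * (2 * n1) for _ in range(self.n2)]
        self.f = []
        self.h = [0] * (len(g) + n1)

    def push(self, fi):
        """Feed f_i (at most n times in total); return h_i."""
        n1, n2 = self.n1, self.n2
        self.f.append(fi)
        i = len(self.f) - 1
        k, r = divmod(i, n1)
        # Light region: f_i times the first block of g, done naively.
        for j in range(n1):
            self.h[i + j] += fi * self.g[j]
        if r == n1 - 1:
            # Block k of f is complete: transform it once and reuse it.
            Fk = dft(self.f[k * n1:] + [0] * n1)
            for l in range(1, n2 - k):
                Hm = self.H[k + l]
                for t in range(2 * n1):
                    Hm[t] += Fk[t] * self.G[l][t]
            if k + 1 < n2:
                # The coefficient of y^(k+1) is final: transform back and add.
                for t, c in enumerate(dft(self.H[k + 1], inverse=True)):
                    self.h[(k + 1) * n1 + t] += c
        return self.h[i]
```

  The transform of a block of <with|mode|math|f> is only computed once the
  whole block is known, and its contributions start at an order that has
  not been output yet, so the scheme remains semi-relaxed.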

  In order to obtain an efficient algorithm, we may choose
  <with|mode|math|p<rsub|1>=\<lfloor\>p/2\<rfloor\>> and
  <with|mode|math|p<rsub|2>=\<lceil\>p/2\<rceil\>>, where <with|mode|math|n=2<rsup|p>>:

  <\theorem>
    Assume that <with|font-shape|right|<with|mode|math|<type|C>>> admits an
    <with|mode|math|n>-th root of unity for each
    <with|mode|math|n\<in\>2<rsup|\<bbb-N\>>>. Then there exists a relaxed
    multiplication algorithm of time complexity <with|mode|math|O(n*(log
    n)<rsup|log 3/log 2>)> and space complexity <with|mode|math|O(n*log n)>.
  </theorem>

  <\proof>
    In view of section <reference|semi-relaxed-sec>, it suffices to consider
    the case of a semi-relaxed product. Let <with|mode|math|T(n)> denote the
    time complexity of the above method. Then we observe that

    <\eqnarray*>
      <tformat|<table|<row|<cell|T(n)>|<cell|\<leqslant\>>|<cell|n<rsub|2>*T(n<rsub|1>)+2*n<rsub|1>*T(n<rsub|2>)+O(n<rsub|2>*n<rsub|1>*log
      n<rsub|1>)>>|<row|<cell|>|<cell|\<leqslant\>>|<cell|n<rsub|2>*T(n<rsub|1>)+2*n<rsub|1>*T(n<rsub|2>)+O(n*log
      n).>>>>
    </eqnarray*>

    Taking <with|mode|math|p<rsub|1>=\<lfloor\>p/2\<rfloor\>>,
    <with|mode|math|p<rsub|2>=\<lceil\>p/2\<rceil\>> and
    <with|mode|math|U(p)=T(2<rsup|p>)/2<rsup|p>>, we obtain

    <\equation*>
      U(p)\<leqslant\>U(\<lfloor\>p/2\<rfloor\>)+2*U(\<lceil\>p/2\<rceil\>)+O(p),
    </equation*>

    from which we deduce that <with|mode|math|U(p)=O(p<rsup|log 3/log 2>)>
    and <with|mode|math|T(n)=O(n*(log n)<rsup|log 3/log 2>)>. Similarly, the
    space complexity <with|mode|math|S(n)> satisfies the bound

    <\equation*>
      S(n)\<leqslant\>S(n<rsub|1>)+2*n<rsub|1>*S(n<rsub|2>)+O(n)\<leqslant\>(2*n<rsub|1>+1)*S(n<rsub|2>)+O(n).
    </equation*>

    Setting <with|mode|math|R(p)=S(2<rsup|p>)/2<rsup|p>>, it follows that

    <\equation*>
      R(p)\<leqslant\>(2+<with|math-display|false|<frac|1|2<rsup|\<lfloor\>p/2\<rfloor\>>>>)*R(\<lceil\>p/2\<rceil\>)+O(1)
    </equation*>

    Consequently, <with|mode|math|R(p)=O(p)> and
    <with|mode|math|S(n)=O(n*p)=O(n*log n)>.
  </proof>
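
  The growth rate of the time recurrence from the proof can be checked
  numerically. The Python sketch below iterates a slightly cruder
  recurrence, in which both half-size terms are rounded up, the linear term
  is taken to be exactly <with|mode|math|p> and the initial value is
  <with|mode|math|1>; these normalisations are assumptions of ours and only
  affect constant factors.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def U(p):
    """Worst case of U(p) <= 3*U(ceil(p/2)) + p with U(1) = 1."""
    if p <= 1:
        return 1
    return 3 * U((p + 1) // 2) + p

def ratio(p):
    """U(p) divided by p^(log 3 / log 2); stays bounded as p grows."""
    return U(p) / p ** math.log2(3)
```

  For powers of two, this recurrence has the closed form
  <with|mode|math|U(2<rsup|k>)=3<rsup|k+1>-2<rsup|k+1>>, which makes the
  exponent <with|mode|math|log 3/log 2> explicit.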

  <section|Further improvements of the algorithm>

  <label|new-relaxed-sec>More generally, if
  <with|mode|math|n=n<rsub|1>*\<cdots\>*n<rsub|l>> with
  <with|mode|math|n<rsub|1>=2<rsup|p<rsub|1>>,\<ldots\>,n<rsub|l>=2<rsup|p<rsub|l>>>,
  then we may reduce the computation of a semi-relaxed
  <with|mode|math|n\<times\>n> product with coefficients in
  <with|mode|math|<type|C>> into the computation of

  <\itemize>
    <item><with|mode|math|<frac|n|n<rsub|1>>> semi-relaxed
    <with|mode|math|n<rsub|1>\<times\>n<rsub|1>> products over
    <with|mode|math|<type|C>> of the form
    <with|mode|math|f<rsub|k*n<rsub|1>\<ldots\>(k+1)*n<rsub|1>>*g<rsub|0\<ldots\>n<rsub|1>>>;

    <item><with|mode|math|2*(<frac|n|n<rsub|1>>+n<rsub|2>-1)-1>
    FFT-transforms of length <with|mode|math|2*n<rsub|1>>;

    <item><with|mode|math|<frac|n|n<rsub|1>*n<rsub|2>>> semi-relaxed
    <with|mode|math|n<rsub|2>\<times\>(n<rsub|2>-1)> products over
    <with|mode|math|<type|C><rsup|2*n<rsub|1>>>;

    <item><with|mode|math|2*(<frac|n|n<rsub|1>*n<rsub|2>>+n<rsub|3>-1)-1>
    FFT-transforms of length <with|mode|math|2*n<rsub|1>*n<rsub|2>>;

    <item><with|mode|math|<frac|n|n<rsub|1>*n<rsub|2>*n<rsub|3>>>
    semi-relaxed <with|mode|math|n<rsub|3>\<times\>(n<rsub|3>-1)> products
    over <with|mode|math|<type|C><rsup|2*n<rsub|1>*n<rsub|2>>>;

    <item><with|mode|math|\<vdots\>>

    <item><with|mode|math|4*n<rsub|l>-3> FFT-transforms of length
    <with|mode|math|2*<frac|n|n<rsub|l>>>;

    <item>one semi-relaxed <with|mode|math|n<rsub|l>\<times\>(n<rsub|l>-1)>
    product over <with|mode|math|<type|C><rsup|2*n<rsub|1>*\<cdots\>*n<rsub|l-1>>>.
  </itemize>
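
  The bookkeeping in the above list can be sketched as a small Python
  helper, which enumerates the products and transforms for a given choice of
  layers. The function name <verbatim|decomposition> and the tuple layout
  are assumptions made for this example; the counts are copied from the
  list.

```python
from math import prod

def decomposition(layers):
    """Work items of the l-layer scheme for n = n_1 * ... * n_l.

    Returns (products, ffts) where
      products = [(count, rows, cols, coefficient_dimension), ...]
      ffts     = [(count, length), ...]
    """
    n = prod(layers)
    n1 = layers[0]
    products = [(n // n1, n1, n1, 1)]  # semi-relaxed n1 x n1 products over C
    ffts = []
    block = 1
    for j in range(1, len(layers)):
        block *= layers[j - 1]  # block = n_1 * ... * n_j
        nj = layers[j]          # n_(j+1) in the notation of the text
        ffts.append((2 * (n // block + nj - 1) - 1, 2 * block))
        products.append((n // (block * nj), nj, nj - 1, 2 * block))
    return products, ffts
```

  For two layers of equal size this gives back the counts of the previous
  section, with <with|mode|math|4*n<rsub|2>-3> transforms of length
  <with|mode|math|2*n<rsub|1>>.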

  This computation is illustrated in figure <reference|fftrelax2-fig>. From the
  complexity point of view, it leads to the following theorem:

  <\big-figure|<postscript|fftrelax2.ps|*.5|*.5||||>>
    <label|fftrelax2-fig>Generalized decomposition of a semi-relaxed
    <with|mode|math|n\<times\>n> multiplication into <with|mode|math|l=3>
    layers.
  </big-figure>

  <\theorem>
    <label|main-th>Assume that <with|font-shape|right|<with|mode|math|<type|C>>>
    admits an <with|mode|math|n>-th root of unity for each
    <with|mode|math|n\<in\>2<rsup|\<bbb-N\>>>. Then there exists a relaxed
    multiplication algorithm of time complexity <with|mode|math|O(n*log
    n*\<mathe\><rsup|2*<sqrt|log 2*log log n>>)> and space complexity
    <with|mode|math|O(n*\<mathe\><rsup|<sqrt|log 2*log log n>>)>.
  </theorem>

  <\proof>
    In view of theorem<nbsp><reference|rel-srel-th>(<em|b>), it suffices to
    consider the case of a semi-relaxed product. Denoting by
    <with|mode|math|T(n)> the time complexity of the above method, we have

    <\equation>
      <label|preq-0>T(n)\<leqslant\><frac|n|n<rsub|1>>*T(n<rsub|1>)+<frac|2*n|n<rsub|2>>*T(n<rsub|2>)+\<cdots\>+<frac|2*n|n<rsub|l>>*T(n<rsub|l>)+O(l*n*log
      n).
    </equation>

    Let

    <\equation*>
      U(p)=<frac|T(2<rsup|p>)|p*2<rsup|p>>.
    </equation*>

    Taking <with|mode|math|n<rsub|1>=\<cdots\>=n<rsub|l>=2<rsup|p>> in
    (<reference|preq-0>), it follows for any <with|mode|math|l> that

    <\equation>
      <label|preq-1>U(l*p)\<leqslant\>2*U(p) + O(l).
    </equation>

    Applying this relation <with|mode|math|k> times, we obtain

    <\equation>
      <label|preq-2>U(l<rsup|k>)\<leqslant\>2<rsup|k>*U(1) +
      O(2<rsup|k>*l)=O(2<rsup|k>*l).
    </equation>

    For a fixed <with|mode|math|p> such that <with|mode|math|k=log p/log l>
    is an integer, we obtain

    <\equation>
      <label|preq-3>U(p)=O(2<rsup|log p/log l>*l).
    </equation>

    The minimum of <with|mode|math|2<rsup|log p/log l>*l> is reached when its
    derivative <abbr|w.r.t.> <with|mode|math|l> cancels. This happens for

    <\equation*>
      l<rsub|p> =\<mathe\><rsup|<sqrt|log 2*log p>>
    </equation*>

    Plugging this value into (<reference|preq-3>), we obtain

    <\equation*>
      U(p)=O(\<mathe\><rsup|2*<sqrt|log 2*log p>>).
    </equation*>

    Substitution of <with|mode|math|p=log n/log 2> finally gives the desired
    estimate

    <\equation>
      <label|T-th-bnd>T(n) = O(n*log n*\<mathe\><rsup|2*<sqrt|log 2*log log
      n>>).
    </equation>

    In order to be painstakingly correct, we notice that we really proved
    (<reference|preq-3>) for <with|mode|math|p> of the form
    <with|mode|math|p=l<rsup|\<lceil\>log p/log l\<rceil\>>> and
    (<reference|T-th-bnd>) for <with|mode|math|n> of the form
    <with|mode|math|n=2<rsup|p>>. Of course, we may always replace
    <with|mode|math|p> and <with|mode|math|n> by larger values which do have
    this form. Since these replacements only introduce additional constant
    factors in the complexity bounds, the bound (<reference|T-th-bnd>) holds
    for general <with|mode|math|n>.

    As to the space complexity <with|mode|math|S(n)>, we have

    <\equation*>
      S(n)\<leqslant\>S(n<rsub|1>)+2*n<rsub|1>*S(n<rsub|2>)+\<cdots\>+2*n<rsub|1>*\<cdots\>*n<rsub|l-1>*S(n<rsub|l>)+O(n).
    </equation*>

    Let

    <\equation*>
      R(p)=<frac|S(2<rsup|p>)|2<rsup|p>>.
    </equation*>

    Taking <with|mode|math|n<rsub|1>=\<cdots\>=n<rsub|l>=2<rsup|p>>, it
    follows for any <with|mode|math|l> that

    <\equation*>
      R(l*p)\<leqslant\>(2+C/2<rsup|p>)*R(p) + O(1),
    </equation*>

    for some fixed constant <with|mode|math|C>. Applying this bound
    <with|mode|math|k> times, we obtain

    <\equation*>
      R(l<rsup|k>)\<leqslant\><left|(><big|prod><rsub|i=1><rsup|k>2+<frac|C|2<rsup|i*l>><big|.><right|)>*(R(1)+O(1)).
    </equation*>

    For <with|mode|math|l\<rightarrow\>\<infty\>>, this bound simplifies to

    <\equation*>
      R(l<rsup|k>)=O(2<rsup|k>).
    </equation*>

    Taking <with|mode|math|k=log p/log l> and <with|mode|math|l
    =\<mathe\><rsup|<sqrt|log 2*log p>>> as above, it follows that

    <\equation*>
      R(p)=O(2<rsup|<sqrt|log p/log 2>>)=O(\<mathe\><rsup|<sqrt|log 2*log
      p>>).
    </equation*>

    Substitution of <with|mode|math|p=log n/log 2> finally gives us the
    desired estimate

    <\equation*>
      S(n)=O(n*\<mathe\><rsup|<sqrt|log 2*log log n>>)
    </equation*>

    for the space complexity. For similar reasons as above, the bound holds
    for general <with|mode|math|n>.
  </proof>
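
  The choice of the number of layers in the proof can be verified
  numerically with the following Python sketch. It treats
  <with|mode|math|l> as a real parameter, whereas an implementation would
  have to round it to an integer; the function names are assumptions made
  for this example.

```python
import math

def overhead(p, l):
    """The quantity 2^(log p / log l) * l from the proof (p > 1, l > 1)."""
    return 2 ** (math.log(p) / math.log(l)) * l

def best_l(p):
    """Real-valued minimiser l_p = exp(sqrt(log 2 * log p))."""
    return math.exp(math.sqrt(math.log(2) * math.log(p)))

def closed_form(p):
    """Value of the minimum, exp(2 * sqrt(log 2 * log p))."""
    return math.exp(2 * math.sqrt(math.log(2) * math.log(p)))
```

  Evaluating <verbatim|overhead(p, best_l(p))> for a few values of
  <with|mode|math|p> reproduces <verbatim|closed_form(p)> up to rounding
  errors, and moving <with|mode|math|l> away from
  <verbatim|best_l(p)> in either direction increases the overhead.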

  <section|Implementation details and benchmarks>

  <label|impl-sec>We implemented the algorithm from section
  <reference|impr-relaxed-sec> in the <name|C++> library <name|Mmxlib>
  <cite|vdH:mml>. Instead of taking <with|mode|math|n<rsub|1>\<approx\>n<rsub|2>>,
  we took <with|mode|math|n<rsub|2>> small (with
  <with|mode|math|n<rsub|2>\<in\>{4,8,16,32}> in the FFT range up to
  <with|mode|math|n=2<rsup|24>>), and used a naive multiplication algorithm
  on the FFT-ed blocks. The reason behind this change is that
  <with|mode|math|n<rsub|1>> needs to be reasonably large in order to profit
  from the better asymptotic complexity of relaxed multiplication. In
  practice, the optimal choice of <with|mode|math|(n<rsub|1>,<no-break>n<rsub|2>)>
  is obtained by taking <with|mode|math|n<rsub|2>> quite small.

  Moreover, our implementation uses a<nbsp>truncated version of relaxed
  multiplication <cite-detail|vdH:relax|Section<nbsp>4.4.2>. In particular,
  the use of naive multiplication on the FFT-ed blocks allows us to gain
  a<nbsp>factor<nbsp><with|mode|math|2> at the top-level. For small values of
  <with|mode|math|n=2<rsup|p>>, we also replaced FFT transforms by
  ``Karatsuba transforms'': given a polynomial
  <with|mode|math|f=f<rsub|0>+\<cdots\>+f<rsub|2<rsup|p>-1>*Z<rsup|2<rsup|p>-1>>,
  we may form a polynomial <with|mode|math|F(Z<rsub|0>,\<ldots\>,Z<rsub|p-1>)>
  in <with|mode|math|p> variables with coefficients
  <with|mode|math|F<rsub|i<rsub|0>,\<ldots\>,i<rsub|p-1>>=f<rsub|i<rsub|0>+\<cdots\>+i<rsub|p-1>*2<rsup|p-1>>>
  for <with|mode|math|i<rsub|0>,\<ldots\>,i<rsub|p-1>\<in\>{0,1}>. Then the
  Karatsuba transform of <with|mode|math|f> is the vector
  <with|mode|math|(F(z<rsub|0>,\<ldots\>,z<rsub|p-1>))<rsub|z<rsub|i>\<in\>{0,1,\<Omega\>}>>
  of size <with|mode|math|3<rsup|p>>, where
  <with|mode|math|(a+b*Z)(\<Omega\>)=b>.
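
  The Karatsuba transform and its use for multiplication can be sketched in
  Python as follows. The ordering of the entries of the transform (by the
  value assigned to the last variable first) and the function names are
  choices made for this example; they are not necessarily those of our
  implementation.

```python
def karatsuba_transform(f):
    """Transform of a coefficient list of length 2^p into a vector of size 3^p."""
    n = len(f)
    if n == 1:
        return list(f)
    a, b = f[:n // 2], f[n // 2:]        # F = A + B*Z for the last variable Z
    s = [x + y for x, y in zip(a, b)]    # value at Z = 1
    # Values at Z = 0, Z = 1 and Z = Omega (leading coefficient), recursively.
    return karatsuba_transform(a) + karatsuba_transform(s) + karatsuba_transform(b)

def inverse_karatsuba(v, n):
    """Recover the 2n-1 coefficients of a product of two length-n polynomials
    from the componentwise product v of their transforms."""
    if n == 1:
        return [v[0]]
    third, half = len(v) // 3, n // 2
    p0 = inverse_karatsuba(v[:third], half)             # A*C
    p1 = inverse_karatsuba(v[third:2 * third], half)    # (A+B)*(C+D)
    pinf = inverse_karatsuba(v[2 * third:], half)       # B*D
    res = [0] * (2 * n - 1)
    for t in range(n - 1):
        res[t] += p0[t]
        res[t + half] += p1[t] - p0[t] - pinf[t]        # A*D + B*C
        res[t + n] += pinf[t]
    return res

def karatsuba_product(f, g):
    """Product of two lists of the same length 2^p via Karatsuba transforms."""
    F, G = karatsuba_transform(f), karatsuba_transform(g)
    return inverse_karatsuba([x * y for x, y in zip(F, G)], len(f))
```

  As with FFT-ed blocks, the transform of a block can be computed once and
  reused in several componentwise products.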

  We have tested both (truncated) relaxed and semi-relaxed multiplication for
  different types of coefficients on an Intel Xeon processor at
  <with|mode|math|3.2 GHz> with <with|mode|math|1 GB> of memory. The results
  of our benchmarks can be found in tables<nbsp><reference|cd-tab>
  and<nbsp><reference|other-coeff-tab> below. Our benchmarks start at the
  order<nbsp><with|mode|math|n> where FFT multiplication becomes useful.
  Notice that working with orders in<nbsp><with|mode|math|2<rsup|\<bbb-N\>>>
  does not give us any significant advantage, because the top-level product
  on FFT-<no-break>ed blocks is naive. In table<nbsp><reference|cd-tab>, the
  choice of <with|mode|math|n<rsub|2>> as a function of <with|mode|math|n>
  has been optimized for complex double coefficients. No particular
  optimization effort was made for the coefficient types in
  table<nbsp><reference|other-coeff-tab>, and it might be possible to gain
  about <with|mode|math|10%> on our timings.

  <\big-table|<block|<tformat|<cwith|1|-1|2|2|cell-halign|r>|<cwith|1|-1|3|3|cell-halign|r>|<cwith|1|-1|3|3|cell-width|2cm>|<cwith|1|-1|2|2|cell-width|2cm>|<cwith|1|-1|5|5|cell-width|2cm>|<cwith|1|-1|4|4|cell-width|2cm>|<cwith|1|-1|4|4|cell-halign|r>|<cwith|1|-1|5|5|cell-halign|r>|<cwith|2|16|4|4|cell-halign|r>|<cwith|2|16|5|5|cell-halign|r>|<cwith|2|16|5|5|cell-width|2cm>|<cwith|2|16|4|4|cell-width|2cm>|<cwith|1|-1|1|1|cell-halign|c>|<table|<row|<cell|<with|mode|math|n>>|<cell|<with|mode|math|Q(n)>>|<cell|<with|mode|math|<frac|Q(n)|M(n)>>>|<cell|<with|mode|math|R(n)>>|<cell|<with|mode|math|<frac|R(n)|M(n)>>>>|<row|<cell|<with|mode|math|2<rsup|8>>>|<cell|0.001>|<cell|1.844>|<cell|0.001>|<cell|1.923>>|<row|<cell|<with|mode|math|2<rsup|9>>>|<cell|0.003>|<cell|2.266>|<cell|0.003>|<cell|2.633>>|<row|<cell|<with|mode|math|2<rsup|10>>>|<cell|0.007>|<cell|2.426>|<cell|0.008>|<cell|2.879>>|<row|<cell|<with|mode|math|2<rsup|11>>>|<cell|0.014>|<cell|2.377>|<cell|0.017>|<cell|2.878>>|<row|<cell|<with|mode|math|2<rsup|12>>>|<cell|0.031>|<cell|2.537>|<cell|0.037>|<cell|3.037>>|<row|<cell|<with|mode|math|2<rsup|13>>>|<cell|0.068>|<cell|2.659>|<cell|0.088>|<cell|3.385>>|<row|<cell|<with|mode|math|2<rsup|14>>>|<cell|0.158>|<cell|2.844>|<cell|0.190>|<cell|3.420>>|<row|<cell|<with|mode|math|2<rsup|15>>>|<cell|0.341>|<cell|2.893>|<cell|0.437>|<cell|3.701>>|<row|<cell|<with|mode|math|2<rsup|16>>>|<cell|0.767>|<cell|3.038>|<cell|1.018>|<cell|4.032>>|<row|<cell|<with|mode|math|2<rsup|17>>>|<cell|1.703>|<cell|3.151>|<cell|2.195>|<cell|4.061>>|<row|<cell|<with|mode|math|2<rsup|18>>>|<cell|3.618>|<cell|2.968>|<cell|4.618>|<cell|3.770>>|<row|<cell|<with|mode|math|2<rsup|19>>>|<cell|8.097>|<cell|3.001>|<cell|10.319>|<cell|3.820>>|<row|<cell|<with|mode|math|2<rsup|20>>>|<cell|17.307>|<cell|2.921>|<cell|22.149>|<cell|3.723>>|<row|<cell|<with|mode|math|2<rsup|21>>>|<cell|37.804>|<cell|2.916>|<cell|49.347>|<cell|3.856>>|<row|<cell|<with|mode|math|2<rsup|22>>>|<cell|80.298>|<cell|2.881>|<cell|104.159>|<cell|3.746>>>>>>
    <label|cd-tab>Timings in seconds for the computation of
    <with|mode|math|n> terms of the exponential of a given series using
    complex double coefficients. We computed the exponential using both a
    semi-relaxed and a<nbsp>relaxed product, corresponding to
    <with|mode|math|Q(n)> and <with|mode|math|R(n)> respectively. We also
    report the ratios of these timings to the timings<nbsp><with|mode|math|M(n)>
    for a full FFT-product of two polynomials of degree
    <with|mode|math|<op|\<less\>> n>.
  </big-table>

  <\big-table|<block|<tformat|<cwith|1|-1|2|2|cell-width|2cm>|<cwith|1|-1|3|3|cell-width|2cm>|<cwith|1|-1|4|4|cell-width|2cm>|<cwith|1|-1|5|5|cell-width|2cm>|<cwith|1|-1|5|5|cell-halign|r>|<cwith|1|-1|4|4|cell-halign|r>|<cwith|1|-1|3|3|cell-halign|r>|<cwith|1|-1|2|2|cell-halign|r>|<cwith|1|-1|1|1|cell-halign|c>|<table|<row|<cell|<with|mode|math|n>>|<cell|semi,
  <with|mode|math|\<bbb-F\><rsub|p>>>|<cell|both,
  <with|mode|math|\<bbb-F\><rsub|p>>>|<cell|semi,
  <with|mode|math|\<bbb-C\><rsub|256>>>|<cell|both,
  <with|mode|math|\<bbb-C\><rsub|256>>>>|<row|<cell|<with|mode|math|2<rsup|8>>>|<cell|2.552>|<cell|2.793>|<cell|1.481>|<cell|1.627>>|<row|<cell|<with|mode|math|2<rsup|10>>>|<cell|2.794>|<cell|3.423>|<cell|1.851>|<cell|2.168>>|<row|<cell|<with|mode|math|2<rsup|12>>>|<cell|3.486>|<cell|4.250>|<cell|2.484>|<cell|2.987>>|<row|<cell|<with|mode|math|2<rsup|14>>>|<cell|3.576>|<cell|4.584>|<cell|2.757>|<cell|3.683>>|<row|<cell|<with|mode|math|2<rsup|16>>>|<cell|3.940>|<cell|5.135>|<cell|3.429>|<cell|4.604>>|<row|<cell|<with|mode|math|2<rsup|18>>>|<cell|4.293>|<cell|5.490>|<cell|3.842>|<cell|5.418>>|<row|<cell|<with|mode|math|2<rsup|20>>>|<cell|4.329>|<cell|5.839>|<cell|>|<cell|>>|<row|<cell|<with|mode|math|2<rsup|22>>>|<cell|4.509>|<cell|6.006>|<cell|>|<cell|>>>>>>
    <label|other-coeff-tab>Ratios for the computation of <with|mode|math|n>
    terms of the exponential of a given series using different types of
    coefficients. In the first two columns, we use
    <with|mode|math|\<bbb-F\><rsub|p>> as our ground field, with
    <with|mode|math|p=3*2<rsup|30>+1>. In the last two columns, we compute
    with <with|mode|math|256>-bit complex floats from the <name|Mpfr>
    library.
  </big-table>
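
  The prime <with|mode|math|p=3*2<rsup|30>+1> used above is convenient
  because <with|mode|math|p-1> is divisible by <with|mode|math|2<rsup|30>>,
  so that <with|mode|math|\<bbb-F\><rsub|p>> contains
  <with|mode|math|2<rsup|k>>-th roots of unity for all
  <with|mode|math|k> up to <with|mode|math|30>. The following Python sketch
  shows one way to find such a root. The helper name
  <verbatim|root_of_unity> is hypothetical and the search strategy is an
  assumption of this sketch, not a description of the implementation that
  was benchmarked.

```python
def root_of_unity(p, k):
    """Return an element of order exactly 2^k modulo the prime p.

    Assumes that p is prime and that 2^k divides p - 1.
    """
    if (p - 1) % (2 ** k) != 0:
        raise ValueError("2^k must divide p - 1")
    cofactor = (p - 1) // (2 ** k)
    for a in range(2, p):
        w = pow(a, cofactor, p)   # the order of w divides 2^k
        # The order is exactly 2^k if and only if w^(2^(k-1)) = -1.
        if k == 0 or pow(w, 2 ** (k - 1), p) == p - 1:
            return w
    raise ValueError("no root of unity found")


p = 3 * 2 ** 30 + 1
w = root_of_unity(p, 30)
print(pow(w, 2 ** 30, p) == 1)       # True
print(pow(w, 2 ** 29, p) == p - 1)   # True
```

  Roots of smaller order <with|mode|math|2<rsup|j>> are obtained by
  repeated squaring of <with|mode|math|w>.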

  <\remark>
    It is instructive to compare the efficiencies of relaxed evaluation and
    Newton's method. For instance, the exponentiation algorithm
    from<nbsp><cite|BK78> has a time complexity<nbsp><group|<with|mode|math|<op|\<sim\>>
    4*M(n)>>. Although this is better from an asymptotic point of view, the
    ratio<nbsp><with|mode|math|Q(n)/<no-break>M(n)> rarely reaches
    <with|mode|math|4> in our tables. Consequently, relaxed algorithms are
    often better. A similar phenomenon was already observed
    in<nbsp><cite-detail|vdH:relax|Tables<nbsp>4 and<nbsp>5>. It would be
    interesting to pursue the comparisons in view of some recent advances
    concerning Newton's method <cite|BCOSSS06|vdH:fnewton>; see also
    <cite-detail|Sedo01|Section<nbsp>5.2.1>.
  </remark>
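
  To make the benchmark problem concrete, the following Python sketch
  computes the exponential <with|mode|math|h=exp f> coefficient by
  coefficient from the equation <with|mode|math|h<rprime|'>=f<rprime|'>*h>.
  It uses a naive quadratic product as a stand-in for the fast relaxed
  product, so it only illustrates the relaxed evaluation order (each
  coefficient of <with|mode|math|h> depends only on earlier ones) and not
  the complexity. The function name <verbatim|exp_series> and the use of
  exact rational coefficients are assumptions of this sketch.

```python
from fractions import Fraction


def exp_series(f, n):
    """Return the first n coefficients of exp(f), assuming f[0] == 0.

    The coefficient h[m] is obtained from h[0], ..., h[m-1] through
    m * h[m] = sum over k of k * f[k] * h[m-k], i.e. from h' = f' * h.
    """
    if n == 0:
        return []
    if f and f[0] != 0:
        raise ValueError("constant term of f must vanish")
    h = [Fraction(1)]
    for m in range(1, n):
        acc = Fraction(0)
        # Only the already known coefficients h[0..m-1] are used here.
        for k in range(1, min(m, len(f) - 1) + 1):
            acc += k * f[k] * h[m - k]
        h.append(acc / m)
    return h


print(exp_series([0, 1], 5))
# [Fraction(1, 1), Fraction(1, 1), Fraction(1, 2), Fraction(1, 6), Fraction(1, 24)]
```

  Replacing the inner loop by a relaxed product of <with|mode|math|f<rprime|'>>
  and <with|mode|math|h> leaves the order in which coefficients become
  available unchanged.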

  <\remark>
    Although the emphasis of this paper is on asymptotic complexity, the idea
    behind the new algorithms also applies in the Karatsuba and Toom-Cook
    models. In the latter case, we take <with|mode|math|n<rsub|1>> small
    (typically <with|mode|math|n<rsub|1>\<in\>{2,3,4}>) and use evaluation
    (interpolation) for polynomials of degree <with|mode|math|n<rsub|1>-1>
    (<with|mode|math|2*n<rsub|1>-2>) at <with|mode|math|2*n<rsub|1>-1>
    points. From an asymptotic point of view, this yields
    <with|mode|math|R(n)\<sim\>M(n)> for relaxed multiplication. Moreover,
    the approach naturally combines with the generalization of even/odd
    decompositions<nbsp><cite|HaZi02>, which also yields an optimal bound for
    truncated multiplications. In fact, we notice that truncated even/odd
    Karatsuba multiplication is ``essentially relaxed''
    <cite-detail|vdH:relax|Section 4.2>.

    On the negative side, these theoretically fast algorithms have bad space
    complexities and they are difficult to implement. In order to obtain good
    timings, it seems to be necessary to use dedicated code generation at
    different (ranges of) orders <with|mode|math|n>, which can be done using
    the <name|C++> template mechanism. The current implementation in
    <name|Mmxlib> falls far short of the theoretical time complexity,
    because the recursive function calls suffer from too much overhead.
  </remark>
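
  The evaluation and interpolation steps of the Toom-Cook model mentioned in
  the remark above can be sketched in Python as follows: two blocks of
  <with|mode|math|n<rsub|1>> coefficients are evaluated at
  <with|mode|math|2*n<rsub|1>-1> points, multiplied pointwise, and the product
  of degree <with|mode|math|2*n<rsub|1>-2> is recovered by interpolation.
  The choice of the points <with|mode|math|0,1,\<ldots\>,2*n<rsub|1>-2>, the
  use of exact rationals and the function names are assumptions of this
  sketch. It shows a single zealous block product, not the relaxed algorithm
  itself.

```python
from fractions import Fraction


def evaluate(poly, x):
    """Evaluate a polynomial (list of coefficients, low degree first) at x."""
    acc = 0
    for c in reversed(poly):
        acc = acc * x + c
    return acc


def interpolate(points, values):
    """Lagrange interpolation: coefficients of the polynomial of degree
    below len(points) that takes the given values at the given points."""
    n = len(points)
    result = [Fraction(0)] * n
    for i in range(n):
        basis = [Fraction(1)]    # product of (Z - points[j]) for j != i
        denom = Fraction(1)
        for j in range(n):
            if j == i:
                continue
            new = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):
                new[k] -= c * points[j]
                new[k + 1] += c
            basis = new
            denom *= points[i] - points[j]
        scale = Fraction(values[i]) / denom
        for k, c in enumerate(basis):
            result[k] += c * scale
    return result


def toom_multiply(a, b):
    """Multiply two blocks of n1 coefficients using 2*n1 - 1 evaluations."""
    n1 = len(a)
    if len(b) != n1:
        raise ValueError("both blocks must have the same length")
    points = list(range(2 * n1 - 1))   # hypothetical choice of points
    values = [evaluate(a, x) * evaluate(b, x) for x in points]
    return interpolate(points, values)


print(toom_multiply([1, 2, 3], [4, 5, 6]) == [4, 13, 28, 27, 18])   # True
```

  The interpolation divides by differences of the chosen points, so the
  coefficient ring must allow these divisions.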

  <section|Conclusion>

  We have shown how to improve the complexity of relaxed multiplication in
  the case when the coefficient ring admits sufficiently many
  <with|mode|math|2<rsup|p>>-th roots of unity. The improvement is based on
  reusing FFT-transforms of pieces of the multiplicands at different levels
  of the underlying binary splitting algorithm. The new approach has proved
  to be efficient in practice (see tables<nbsp><reference|cd-tab>
  and<nbsp><reference|other-coeff-tab>).

  In future work, it would be interesting to study the cost of
  artificially adding <with|mode|math|2<rsup|p>>-<no-break>th roots of unity,
  as in Schönhage-Strassen's algorithm. In practice, we notice that it is
  often possible, and better, to ``cut the coefficients into pieces'' and to
  replace them by polynomials over the complexified doubles
  <with|mode|math|\<bbb-C\><rsub|52>> or <with|mode|math|\<bbb-F\><rsub|p>>
  with <with|mode|math|p=3*2<rsup|20>+1>. However, this approach requires
  more implementation effort.

  <paragraph*|Acknowledgement>We would like to thank the third referee for
  his detailed comments on the proof of theorem<nbsp><reference|main-th>,
  which also resulted in slightly sharper bounds.

  <\bibliography|bib|alpha|all>
    <\bib-list|BCO+06>
      <bibitem*|BCO+06><label|bib-BCOSSS06>A.<nbsp>Bostan, F.<nbsp>Chyzak,
      F.<nbsp>Ollivier, B.<nbsp>Salvy, É.<nbsp>Schost, and A.<nbsp>Sedoglavic.
      <newblock>Fast computation of power series solutions of systems of
      differential equations. <newblock>Preprint, April 2006.
      <newblock>Submitted, 13 pages.

      <bibitem*|BK78><label|bib-BK78>R.P. Brent and H.T. Kung. <newblock>Fast
      algorithms for manipulating formal power series.
      <newblock><with|font-shape|italic|Journal of the ACM>, 25:581--595,
      1978.

      <bibitem*|CK91><label|bib-CK91>D.G. Cantor and E.<nbsp>Kaltofen.
      <newblock>On fast multiplication of polynomials over arbitrary
      algebras. <newblock><with|font-shape|italic|Acta Informatica>,
      28:693--701, 1991.

      <bibitem*|Coo66><label|bib-Cook66>S.A. Cook.
      <newblock><with|font-shape|italic|On the minimum computation time of
      functions>. <newblock>PhD thesis, Harvard University, 1966.

      <bibitem*|CT65><label|bib-CT65>J.W. Cooley and J.W. Tukey. <newblock>An
      algorithm for the machine calculation of complex Fourier series.
      <newblock><with|font-shape|italic|Math. Computat.>, 19:297--301, 1965.

      <bibitem*|HQZ04><label|bib-HQZ04>Guillaume Hanrot, Michel Quercia, and
      Paul Zimmermann. <newblock>The middle product algorithm I. Speeding up
      the division and square root of power series.
      <newblock><with|font-shape|italic|AAECC>, 14(6):415--438, 2004.

      <bibitem*|HZ02><label|bib-HaZi02>Guillaume Hanrot and Paul Zimmermann.
      <newblock>A long note on Mulders' short product. <newblock>Research
      Report 4654, INRIA, December 2002. <newblock>Available from
      <with|font-family|tt|http://www.loria.fr/~hanrot/Papers/mulders.ps>.

      <bibitem*|Knu97><label|bib-Kn97>D.E. Knuth.
      <newblock><with|font-shape|italic|The Art of Computer Programming>,
      volume 2: Seminumerical Algorithms. <newblock>Addison-Wesley, 3rd
      edition, 1997.

      <bibitem*|KO63><label|bib-Kar63>A.<nbsp>Karatsuba and J.<nbsp>Ofman.
      <newblock>Multiplication of multidigit numbers on automata.
      <newblock><with|font-shape|italic|Soviet Physics Doklady>, 7:595--596,
      1963.

      <bibitem*|Sed01><label|bib-Sedo01>Alexandre Sedoglavic.
      <newblock><with|font-shape|italic|Méthodes seminumériques en algèbre
      différentielle<nbsp>; applications à l'étude des propriétés
      structurelles de systèmes différentiels algébriques en automatique>.
      <newblock>PhD thesis, École polytechnique, 2001.

      <bibitem*|SS71><label|bib-SS71>A.<nbsp>Schönhage and V.<nbsp>Strassen.
      <newblock>Schnelle Multiplikation grosser Zahlen.
      <newblock><with|font-shape|italic|Computing>, 7:281--292, 1971.

      <bibitem*|Too63><label|bib-Toom63b>A.L. Toom. <newblock>The complexity
      of a scheme of functional elements realizing the multiplication of
      integers. <newblock><with|font-shape|italic|Soviet Mathematics>,
      4(2):714--716, 1963.

      <bibitem*|vdH97><label|bib-vdH:issac97>J.<nbsp>van<nbsp>der Hoeven.
      <newblock>Lazy multiplication of formal power series. <newblock>In
      W.<nbsp>W. Küchlin, editor, <with|font-shape|italic|Proc. ISSAC '97>,
      pages 17--20, Maui, Hawaii, July 1997.

      <bibitem*|vdH02a><label|bib-vdH:relax>J.<nbsp>van<nbsp>der Hoeven.
      <newblock>Relax, but don't be too lazy.
      <newblock><with|font-shape|italic|JSC>, 34:479--542, 2002.

      <bibitem*|vdH02b><label|bib-vdH:mml>J.<nbsp>van<nbsp>der Hoeven et al.
      <newblock>Mmxlib: the standard library for Mathemagix, 2002.
      <newblock><with|font-family|tt|http://www.mathemagix.org/mml.html>.

      <bibitem*|vdH03a><label|bib-vdH:newrelax:pre>J.<nbsp>van<nbsp>der
      Hoeven. <newblock>New algorithms for relaxed multiplication.
      <newblock>Technical Report 2003-44, Université Paris-Sud, Orsay,
      France, 2003.

      <bibitem*|vdH03b><label|bib-vdH:issac03>J.<nbsp>van<nbsp>der Hoeven.
      <newblock>Relaxed multiplication using the middle product. <newblock>In
      Manuel Bronstein, editor, <with|font-shape|italic|Proc. ISSAC '03>,
      pages 143--147, Philadelphia, USA, August 2003.

      <bibitem*|vdH06a><label|bib-vdH:fnewton>J.<nbsp>van<nbsp>der Hoeven.
      <newblock>Newton's method and FFT trading. <newblock>Technical Report
      2006-17, Univ. Paris-Sud, 2006. <newblock>Submitted to JSC.

      <bibitem*|vdH06b><label|bib-vdH:riemann>J.<nbsp>van<nbsp>der Hoeven.
      <newblock>On effective analytic continuation. <newblock>Technical
      Report 2006-15, Univ. Paris-Sud, 2006.
    </bib-list>
  </bibliography>
</body>

<\initial>
  <\collection>
    <associate|font-base-size|11>
    <associate|language|english>
    <associate|page-medium|paper>
    <associate|page-show-hf|true>
    <associate|par-hyphen|professional>
  </collection>
</initial>

