<TeXmacs|1.99.2>

<style|<tuple|tmarticle|number-long-article>>

<\body>
  <\hide-preamble>
    \;

    <assign|proof-text|<\macro>
      <localize|Proof>
    </macro>>
  </hide-preamble>

  <doc-data|<doc-title|Even faster integer
  multiplication>|<doc-date|<date>>|<doc-title-options|cluster-by-affiliation>|<doc-author|<author-data|<author-name|David
  Harvey>|<\author-affiliation>
    School of Mathematics and Statistics

    University of New South Wales

    Sydney NSW 2052

    Australia
  </author-affiliation>|<author-email|d.harvey@unsw.edu.au>>>|<doc-author|<author-data|<author-name|Joris
  van der Hoeven>|<\author-affiliation>
    CNRS, Laboratoire d'informatique

    École polytechnique

    91128 Palaiseau Cedex

    France
  </author-affiliation>|<author-email|vdhoeven@lix.polytechnique.fr>>>|<doc-author|<author-data|<author-name|Grégoire
  Lecerf>|<\author-affiliation>
    CNRS, Laboratoire d'informatique

    École polytechnique

    91128 Palaiseau Cedex

    France
  </author-affiliation>|<author-email|lecerf@lix.polytechnique.fr>>>>

  <abstract-data|<\abstract>
    We give a new proof of Fürer's bound for the cost of multiplying
    <math|n>-bit integers in the bit complexity model. Unlike Fürer, our
    method does not require constructing special coefficient rings with
    ``fast'' roots of unity. Moreover, we establish the improved bound
    <math|O<around|(|n*log n*K<rsup|log<rsup|\<ast\>> n>|)>> with <math|K=8>.
    We show that an optimised variant of Fürer's algorithm achieves only
    <math|K=16>, suggesting that the new algorithm is faster than Fürer's by
    a<nbsp>factor of <math|2<rsup|log<rsup|\<ast\>> n>>. Assuming standard
    conjectures about the distribution of Mersenne primes, we give yet
    another algorithm that achieves <math|K=4>.
  </abstract>|<abstract-acm|G.1.0 Computer-arithmetic|F.2.1 Number-theoretic
  computations>|<abstract-keywords|Integer
  multiplication|algorithm|complexity bound|FFT>|<abstract-msc|68W30|68Q17|68W40>>

  <section|Introduction><label|intro-sec>

  Let <math|<math-ss|I><around*|(|n|)>> denote the cost of multiplying two
  <math|n>-bit integers in the deterministic multitape Turing
  model<nbsp><cite|Pap94> (commonly called ``bit complexity''). Previously,
  the best known asymptotic bound for <math|<math-ss|I><around*|(|n|)>> was
  due to Fürer<nbsp><cite|Furer2007|Furer2009>. He proved that there is a
  constant <math|K\<gtr\>1> such that

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|I><around*|(|n|)>>|<cell|=>|<cell|O<around*|(|n*log
    n*K<rsup|log<rsup|\<ast\>> n>|)>,<eq-number><label|K-bound>>>>>
  </eqnarray*>

  where <math|log<rsup|\<ast\>> x>, for <math|x\<in\>\<bbb-R\>>, denotes
  the iterated logarithm, i.e.,

  <\eqnarray*>
    <tformat|<table|<row|<cell|log<rsup|\<ast\>>
    x>|<cell|\<assign\>>|<cell|min <around*|{|k\<in\>\<bbb-N\>:log<rsup|\<circ\>k>
    x\<leqslant\>1|}>,<eq-number><label|it-log>>>|<row|<cell|<op|log<rsup|\<circ\>k>>>|<cell|\<assign\>>|<cell|<op|log>\<circ\><below|\<cdots\>|k\<times\>>\<circ\><op|log>.>>>>
  </eqnarray*>
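
  As an illustration only, the definition<nbsp><eqref|it-log> can be
  evaluated directly. The following Python sketch uses floating-point natural
  logarithms; the function name <verbatim|log_star> is hypothetical and is not
  part of the notation of this paper.

```python
import math

def log_star(x):
    """Return the smallest k in N such that applying log k times to x gives a value at most 1."""
    k = 0
    while x > 1:
        x = math.log(x)  # one more application of the natural logarithm
        k += 1
    return k

if __name__ == "__main__":
    for x in (1, 2, 10, 16, 2**64):
        print(x, log_star(x))
```

  For instance, <verbatim|log_star(2**64)> evaluates to 4 under this sketch,
  which illustrates how slowly the iterated logarithm grows.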

  The main contribution of this paper is a new algorithm that yields the
  following improvement.

  <\theorem>
    <label|main-thm>For <math|n\<rightarrow\>\<infty\>> we have

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|I><around*|(|n|)>>|<cell|=>|<cell|O<around*|(|n*log
      n*8<rsup|log<rsup|\<ast\>> n>|)>.>>>>
    </eqnarray*>
  </theorem>

  Fürer suggested several methods to minimise the value of <math|K> in his
  algorithm, but did not give an explicit bound for <math|K>. In
  section<nbsp><reference|Furer-sec> of this paper, we outline an optimised
  variant of Fürer's algorithm that achieves <math|K=16>. We do not know how
  to obtain <math|K\<less\>16> using Fürer's approach. This suggests that the
  new algorithm is faster than Fürer's by a factor of
  <math|2<rsup|log<rsup|\<ast\>> n>>.

  The idea of the new algorithm is remarkably simple. Given two <math|n>-bit
  integers, we split them into chunks of exponentially smaller size, say
  around <math|log n> bits, and thus reduce to the problem of multiplying
  integer polynomials of degree <math|O<around*|(|n/log n|)>> with
  coefficients of bit size <math|O<around*|(|log n|)>>. We multiply the
  polynomials using discrete Fourier transforms (DFTs) over <math|\<bbb-C\>>,
  with a working precision of <math|O<around*|(|log n|)>> bits. To compute
  the DFTs, we decompose them into ``short transforms'' of exponentially
  smaller length, say length around <math|log n>, using the Cooley--Tukey
  method. We then use Bluestein's chirp transform to convert each short
  transform into a polynomial multiplication problem over <math|\<bbb-C\>>,
  and finally convert back to integer multiplication via Kronecker
  substitution. These much smaller integer multiplications are handled
  recursively.

  The algorithm just sketched leads immediately to a bound of the
  form<nbsp><eqref|K-bound>. A detailed proof is given in
  section<nbsp><reference|simple-algo-sec>. We emphasise that the new method
  works directly over <math|\<bbb-C\>>, and does not need special coefficient
  rings with ``fast'' roots of unity, of the type constructed by Fürer.
  Optimising parameters and keeping careful track of constants leads to
  Theorem<nbsp><reference|main-thm>, which is proved in
  section<nbsp><reference|even-faster-sec>. We also prove the following
  conditional result in section<nbsp><reference|yet-faster-sec>.

  <\theorem>
    <label|mersenne-thm>Assume Conjecture<nbsp><reference|Mersenne-conj>.
    Then

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|I><around*|(|n|)>>|<cell|=>|<cell|O<around*|(|n*log
      n*4<rsup|log<rsup|\<ast\>> n>|)>.>>>>
    </eqnarray*>
  </theorem>

  Conjecture<nbsp><reference|Mersenne-conj> is a slight weakening of the
  Lenstra--Pomerance--Wagstaff conjecture on the distribution of Mersenne
  primes, i.e., primes of the form <math|p=2<rsup|q>-1>. The idea of the
  algorithm is to replace the coefficient ring <math|\<bbb-C\>> by the finite
  field <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>>; we are then able to
  exploit fast algorithms for multiplication modulo numbers of the form
  <math|2<rsup|q>-1>.

  An important feature of the new algorithms is that the same techniques are
  applicable in other contexts, such as polynomial multiplication over finite
  fields. Previously, no Fürer-type complexity bounds were known for the
  latter problem. The details are presented in the companion
  paper<nbsp><cite|vdH:ffmul>.

  In the remainder of this section, we present a brief history of complexity
  bounds for integer multiplication, and we give an overview of the paper and
  of our contribution. More historical details can be found in books such
  as<nbsp><cite-detail|GaGe2002|Chapter<nbsp>8>.

  <subsection|Brief history and related work>

  Multiplication algorithms of complexity <math|O<around*|(|n<rsup|2>|)>> in
  the number of digits <math|n> were already known in ancient civilisations.
  The Egyptians used an algorithm based on repeated doublings and additions.
  The Babylonians invented the positional numbering system, while performing
  their computations in<nbsp>base <math|60> instead of <math|10>. Precise
  descriptions of multiplication methods close to the ones that we learn at
  school appeared in Europe during the late Middle Ages. For historical
  references, we refer to<nbsp><cite-detail|Smi58|Section II.5>
  and<nbsp><cite|Neu57|Boy85>.

  The first subquadratic algorithm for integer multiplication, with
  complexity <math|O<around|(|n<rsup|log 3/log 2>|)>>, was discovered by
  Karatsuba<nbsp><cite|Kar62|Kar63>. From a modern viewpoint, Karatsuba's
  algorithm utilises an evaluation-interpolation scheme. The input integers
  are cut into smaller chunks, which are taken to be the coefficients of two
  integer polynomials; the polynomials are evaluated at several well-chosen
  points; their values at those points are (recursively) multiplied;
  interpolating the results at those points yields the product polynomial;
  finally, the integer product is recovered by pasting together the
  coefficients of the product polynomial. This cutting-and-pasting procedure
  is sometimes known as Kronecker segmentation (see
  section<nbsp><reference|Kronecker-sec>).
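
  The evaluation-interpolation scheme just described can be sketched in a few
  lines of Python, assuming the usual choice of evaluation points
  <math|0>, <math|1> and <math|\<infty\>> and a cut into two chunks. The
  function name <verbatim|karatsuba> and the parameter
  <verbatim|cutoff_bits> are illustrative assumptions; below the cutoff the
  sketch simply falls back to the built-in product.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply non-negative integers using three recursive half-size products."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y  # base case: built-in multiplication
    h = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << h) - 1
    # Cut the inputs into chunks: x = x0 + x1 * 2^h and y = y0 + y1 * 2^h.
    x0, x1 = x & mask, x >> h
    y0, y1 = y & mask, y >> h
    # Multiply the values of the two linear polynomials at 0, infinity and 1.
    p0 = karatsuba(x0, y0, cutoff_bits)
    p_inf = karatsuba(x1, y1, cutoff_bits)
    p1 = karatsuba(x0 + x1, y0 + y1, cutoff_bits)
    # Interpolate the middle coefficient, then paste the coefficients together.
    mid = p1 - p0 - p_inf
    return p0 + (mid << h) + (p_inf << (2 * h))

if __name__ == "__main__":
    x, y = 3**500, 7**400
    print(karatsuba(x, y) == x * y)
```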

  Shortly after the discovery of Karatsuba's algorithm, which uses three
  evaluation points, Toom generalised it so as to use <math|2*r-1> evaluation
  points instead<nbsp><cite|Toom63a|Toom63b>, for any <math|r\<geqslant\>2>.
  This leads to the bound <math|<math-ss|I><around*|(|n|)>=O<around|(|n<rsup|log
  <around*|(|2*r-1|)>/log r>|)>> for fixed <math|r>. Letting<nbsp><math|r>
  grow slowly with <math|n>, he also showed that
  <math|<math-ss|I><around*|(|n|)>=O<around|(|n*2<rsup|5*<sqrt|log n/log
  2>>|)>>. The algorithm was adapted to the Turing model by
  Cook<nbsp><cite|Cook66> and is now known as Toom--Cook multiplication.
  Schönhage obtained a slightly better bound<nbsp><cite|Sch66> by working
  modulo several numbers of the form <math|2<rsup|k>-1> instead of using
  several polynomial evaluation points. Knuth proved that an even better
  complexity bound could be achieved by suitably adapting Toom's
  method<nbsp><cite|Kn69>.

  The next step towards even faster integer multiplication was the
  rediscovery of the fast Fourier transform (FFT) by Cooley and
  Tukey<nbsp><cite|CT65> (essentially the same algorithm was already known to
  Gauss <cite|HJB-gauss-fft>). The FFT yields particularly efficient
  algorithms for evaluating and interpolating polynomials on certain special
  sets of evaluation points. For example, if <math|R> is a ring in which
  <math|2> is invertible, and if <math|\<omega\>\<in\>R> is a principal
  <math|2<rsup|k>>-th root of unity (see section<nbsp><reference|DFT-sec> for
  detailed definitions), then the FFT permits evaluation and interpolation at
  the points <math|1,\<omega\>,\<ldots\>,\<omega\><rsup|2<rsup|k>-1>> using
  only <math|O<around*|(|k*2<rsup|k>|)>> ring operations in <math|R>.
  Consequently, if <math|P> and <math|Q> are polynomials in
  <math|R<around*|[|X|]>> whose product has degree less
  than<nbsp><math|2<rsup|k>>, then the product <math|P*Q> can be computed
  using <math|O<around*|(|k*2<rsup|k>|)>> ring operations as well.

  In<nbsp><cite|SS71>, Schönhage and Strassen presented two FFT-based
  algorithms for integer multiplication. In both algorithms, they first use
  Kronecker segmentation to convert the problem to multiplication of integer
  polynomials. They then embed these polynomials into <math|R<around*|[|X|]>>
  for a suitable ring <math|R> and multiply the polynomials by using FFTs
  over <math|R>. The first algorithm takes <math|R=\<bbb-C\>\<nocomma\>> and
  <math|\<omega\>=exp<around*|(|2*\<mathpi\>*\<mathi\>/2<rsup|k>|)>>, and
  works with finite-precision approximations to elements of <math|\<bbb-C\>>.
  Multiplications in <math|\<bbb-C\>> itself are handled recursively, by
  treating them as integer multiplications (after appropriate scaling). The
  second algorithm, popularly known as <with|font-shape|italic|the>
  Schönhage--Strassen algorithm, takes <math|R=\<bbb-Z\>/<around*|\<nobracket\>|m*\<bbb-Z\>|\<nobracket\>>>
  where <math|m=2<rsup|2<rsup|k>>+1> is a<nbsp>Fermat<nbsp>number. This
  algorithm is the faster of the two, achieving the bound
  <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log n*log log n|)>>. It
  benefits from the fact that <math|\<omega\>=2> is a principal
  <math|2<rsup|k+1>>-th root of unity in<nbsp><math|R>, and that
  multiplications by powers of<nbsp><math|\<omega\>> can be carried out
  efficiently, as they correspond to simple shifts and negations. At around
  the same time, Pollard pointed out that one can also work with
  <math|R=\<bbb-Z\>/<around*|\<nobracket\>|m*\<bbb-Z\>|\<nobracket\>>> where
  <math|m> is a prime of the form <math|m=a*2<rsup|k>+1>, since then
  <math|R<rsup|\<ast\>>> contains primitive <math|2<rsup|k>>-th roots of
  unity<nbsp><cite|Pol71> (although he did not give a bound
  for<nbsp><math|<math-ss|I><around*|(|n|)>>).

  Schönhage and Strassen's algorithm remained the champion for more than
  thirty years, but was recently superseded by Fürer's
  algorithm<nbsp><cite|Furer2007>. In short, Fürer managed to combine the
  advantages of the two algorithms from<nbsp><cite|SS71>, to achieve the
  bound <math|<math-ss|I><around*|(|n|)>=O<around|(|n*log
  n*2<rsup|O<around*|(|log<rsup|\<ast\>> n|)>>|)>>. Fürer's algorithm is
  based on the ingenious observation that the ring
  <math|R=\<bbb-C\><around*|[|X|]>/<around|(|X<rsup|2<rsup|r-1>>+1|)>>
  contains a small number of ``fast'' principal <math|2<rsup|r>>-th roots of
  unity, namely the powers of <math|X>, but also a large supply of much
  higher-order roots of unity inherited from <math|\<bbb-C\>>. To evaluate an
  FFT over <math|R>, he decomposes it into many ``short'' transforms of
  length at most <math|2<rsup|r>>, using the Cooley--Tukey method. He
  evaluates the short transforms with the fast roots of unity, pausing
  occasionally to perform ``slow'' multiplications by higher-order roots of
  unity (``twiddle factors''). A slightly subtle point of the construction is
  that we really need, for large <math|k>, a principal <math|2<rsup|k>>-th
  root of unity <rigid|<math|\<omega\>\<in\>R>> such that
  <math|\<omega\><rsup|2<rsup|k-r>>=X>.

  In<nbsp><cite|DeKuSaSa2013> it was shown that the technique
  from<nbsp><cite|Pol71> to compute modulo suitable prime numbers of the form
  <math|m=a*2<rsup|k>+1> can be adapted to Fürer's algorithm. Although the
  complexity of this algorithm is essentially the same as that of Fürer's
  algorithm, this method has the advantage that it does not require any error
  analysis for approximate numerical operations in <math|\<bbb-C\>>.

  <big-table|<block|<tformat|<cwith|1|-1|1|-1|cell-tsep|0.5spc>|<cwith|1|-1|1|-1|cell-bsep|0.5spc>|<cwith|6|6|1|3|cell-tsep|0.5spc>|<cwith|6|6|1|3|cell-bsep|0.5spc>|<table|<row|<cell|Date>|<cell|Authors>|<cell|Time
  complexity>>|<row|<cell|<math|\<less\>>3000
  BC>|<cell|Unknown<nbsp><cite|Neu57>>|<cell|<math|O<around*|(|n<rsup|2>|)>>>>|<row|<cell|1962>|<cell|Karatsuba<nbsp><cite|Kar62|Kar63>>|<cell|<math|O<around|(|n<rsup|log
  3/log 2>|)>>>>|<row|<cell|1963>|<cell|Toom<nbsp><cite|Toom63a|Toom63b>>|<cell|<math|O<around|(|<with|font-shape|italic|n*>2<rsup|5*<sqrt|log
  n/log 2>>|)>>>>|<row|<cell|1966>|<cell|Schönhage<nbsp><cite|Sch66>>|<cell|<math|O<around|(|n*2<rsup|<sqrt|2*log
  n/log 2>>*<around*|(|log n|)><rsup|3/2>|)>>>>|<row|<cell|1969>|<cell|Knuth<nbsp><cite|Kn69>>|<cell|<math|O<around|(|n*2<rsup|<sqrt|2*log
  n/log 2>>*log n|)>>>>|<row|<cell|1971>|<cell|Schönhage--Strassen<nbsp><cite|SS71>>|<cell|<math|O<around*|(|n*log
  n*log log n|)>>>>|<row|<cell|2007>|<cell|Fürer<nbsp><cite|Furer2007>>|<cell|<math|O<around*|(|n*log
  n*2<rsup|O<around*|(|log<rsup|\<ast\>> n|)>>|)>>>>|<row|<cell|2014>|<cell|This
  paper>|<cell|<math|O<around*|(|n*log n*8<rsup|log<rsup|\<ast\>>
  n>|)>>>>>>>|Historical overview of known complexity bounds for <math|n>-bit
  integer multiplication.>

  <subsection|Our contributions and outline of the paper>

  Throughout the paper, integers are assumed to be handled in the standard
  binary representation. For our computational complexity results, we assume
  that we work on a Turing machine with a<nbsp>finite but sufficiently large
  number of tapes<nbsp><cite|Pap94>. The Turing machine model is very
  conservative with respect to the cost of memory access, which is pertinent
  from a practical point of view for implementations of FFT algorithms.
  Nevertheless, other models for sequential computations could be
  considered<nbsp><cite|Sch80|Furer2014>. For practical purposes, parallel
  models might be more appropriate, but we will not consider these in this
  paper. Occasionally, for polynomial arithmetic over abstract rings, we will
  also consider algebraic complexity measures<nbsp><cite-detail|BuClSh1997|Chapter<nbsp>4>.

  In section<nbsp><reference|survey-sec>, we start by recalling several
  classical techniques for completeness and later use: sorting and array
  transposition algorithms, discrete Fourier transforms (DFTs), the
  Cooley--Tukey algorithm, FFT multiplication and convolution, Bluestein's
  chirp transform, and Kronecker substitution and segmentation. In
  section<nbsp><reference|err-sec>, we also provide the necessary tools for
  the error analysis of complex Fourier transforms. Most of these tools are
  standard, although our presentation is somewhat <em|ad hoc>, being based on
  fixed point arithmetic.

  In section<nbsp><reference|simple-algo-sec>, we describe a simplified
  version of the new integer multiplication algorithm, without any attempt to
  minimise the aforementioned constant <math|K>. As mentioned in the sketch
  above, the key idea is to reduce a given DFT over <math|\<bbb-C\>> to a
  collection of ``short'' transforms, and then to convert these short
  transforms back to integer multiplication by a combination of Bluestein's
  chirp transform and Kronecker substitution.

  The complexity analysis of Fürer's algorithm and the algorithm from
  section<nbsp><reference|simple-algo-sec> involves functional inequalities
  which contain post-compositions with logarithms and other slowly growing
  functions. In section<nbsp><reference|iter-sec>, we present a few
  systematic tools for analysing these types of inequalities. For more
  information on this quite particular kind of asymptotic analysis, we refer
  the reader to<nbsp><cite|Schm01|Ec92>.

  In section<nbsp><reference|even-faster-sec>, we present an optimised
  version of the algorithm from section<nbsp><reference|simple-algo-sec>,
  proving in particular the bound <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*8<rsup|log<rsup|\<ast\>> n>|)>> (Theorem<nbsp><reference|main-thm>),
  which constitutes the main result of this paper. In
  section<nbsp><reference|Furer-sec>, we outline a similar complexity
  analysis for Fürer's algorithm. Even after several optimisations of the
  original algorithm, we were unable to attain a bound better than
  <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*16<rsup|log<rsup|\<ast\>> n>|)>>. This suggests that the new algorithm
  outperforms Fürer's algorithm by a factor of <math|2<rsup|log<rsup|\<ast\>>
  n>>.

  This speedup is surprising, given that the short transforms in Fürer's
  algorithm involve only shifts, additions and subtractions. The solution to
  the paradox is that Fürer has made the short transforms <em|too fast>.
  Indeed, they are so fast that they make a negligible contribution to the
  overall complexity, and his computation is dominated by the ``slow''
  twiddle factor multiplications. In the new algorithm, we push more work
  into the short transforms, allowing them to get slightly slower; the
  <em|quid pro quo> is that we avoid the factor of two in zero-padding caused
  by Fürer's introduction of artificial ``fast'' roots of unity. The optimal
  strategy is actually to let the short transforms dominate the computation,
  by increasing the short transform length relative to the coefficient size.
  Fürer is unable to do this, because in his algorithm these two parameters
  are too closely linked. To underscore just how far the situation has been
  inverted relative to Fürer's algorithm, we point out that in our
  presentation we can get away with using Schönhage--Strassen for the twiddle
  factor multiplications, without any detrimental effect on the overall
  complexity.

  We have chosen to base most of our algorithms on approximate complex
  arithmetic. Instead, following<nbsp><cite|Pol71>
  and<nbsp><cite|DeKuSaSa2013>, we might have chosen to use modular
  arithmetic. In section<nbsp><reference|param-sec>, we will briefly indicate
  how our main algorithm can be adapted to this setting. This variant of our
  algorithm presents several analogies with its adaptation to polynomial
  multiplication over finite fields<nbsp><cite|vdH:ffmul>.

  The question remains whether there exists an even faster algorithm than the
  algorithm of section<nbsp><reference|even-faster-sec>. In an earlier
  paper<nbsp><cite|Fur89>, Fürer gave another algorithm of complexity
  <math|O<around|(|n*log n*2<rsup|O<around*|(|log<rsup|\<ast\>> n|)>>|)>>
  under the assumption that there exist sufficiently many Fermat primes,
  i.e., primes of the form <math|F<rsub|m>=2<rsup|2<rsup|m>>+1>. It can be
  shown that a careful optimisation of this algorithm yields the bound
  <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*4<rsup|log<rsup|\<ast\>> n>|)>>. Unfortunately, odds are high that
  <math|F<rsub|4>> is the largest Fermat prime. In
  section<nbsp><reference|yet-faster-sec>, we present an algorithm that
  achieves the bound <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*4<rsup|log<rsup|\<ast\>> n>|)>> under the more plausible conjecture that
  there exist sufficiently many Mersenne primes
  (Theorem<nbsp><reference|mersenne-thm>). The main technical ingredient is a
  variant of an algorithm of Crandall and Fagin<nbsp><cite|CF94> that permits
  efficient multiplication modulo <math|2<rsup|q>-1>, despite<nbsp><math|q>
  not being divisible by a large power of two.

  It would be interesting to know whether the new algorithms could be useful
  in practice. We have implemented an unoptimised version of the algorithm
  from section<nbsp><reference|param-sec> in the <name|Mathemagix>
  system<nbsp><cite|vdH:mmx> and found our implementation to be an order of
  magnitude slower than the <name|Gmp> library<nbsp><cite|GMP>. There is
  certainly room for improvement, but we doubt that even a highly optimised
  implementation of the new algorithm will be competitive in the near future.
  Nevertheless, the variant for polynomial multiplication over finite fields
  presented in<nbsp><cite|vdH:ffmul> seems to be a promising avenue for
  achieving speedups in practical computations. This will be investigated in
  a forthcoming paper.

  <\render-remark|Notations>
    We use Hardy's notations <math|f\<prec\>g> for <math|f=o<around*|(|g|)>>,
    and <math|f\<asymp\>g> for <math|f=O<around*|(|g|)>> and
    <math|g=O<around*|(|f|)>>. The symbol <math|\<bbb-R\><rsup|\<geqslant\>>>
    denotes the set of non-negative real numbers, and <math|\<bbb-N\>>
    denotes <math|<around*|{|0,1,2,\<ldots\>|}>>. We will write <math|lg
    n\<assign\><around*|\<lceil\>|log n/log 2|\<rceil\>>>.
  </render-remark>

  <section|Survey of classical tools><label|survey-sec>

  This section recalls basic facts on Fourier transforms and related
  techniques used in subsequent sections. For more details and historical
  references we refer the reader to standard books on the subject such
  as<nbsp><cite|AhHoUl1974|BuClSh1997|GaGe2002|RaKiHw2010>.

  <subsection|Arrays and sorting><label|arrays-sec>

  In the Turing model, we have available a fixed number of linear tapes. An
  <math|n<rsub|1>\<times\>\<cdots\>\<times\>n<rsub|d>> array
  <math|M<rsub|i<rsub|1>,\<ldots\>,i<rsub|d>>> of <math|b>-bit elements is
  stored as a linear array of <math|n<rsub|1>*\<cdots\>*n<rsub|d>*b> bits. We
  generally assume that the elements are ordered lexicographically by
  <math|<around*|(|i<rsub|1>,\<ldots\>,i<rsub|d>|)>>, though this is just an
  implementation detail.

  What is significant from a complexity point of view is that occasionally we
  must switch representations, to access an array (say 2-dimensional) by
  ``rows'' or by ``columns''. In the Turing model, we may transpose an
  <math|n<rsub|1>\<times\>n<rsub|2>> matrix of <math|b>-bit elements in time
  <math|O<around*|(|b*n<rsub|1>*n<rsub|2>*lg
  min<around*|(|n<rsub|1>,n<rsub|2>|)>|)>>, using the algorithm
  of<nbsp><cite-detail|BGS07|Appendix>. Briefly, the idea is to split the
  matrix into two halves along the ``short'' dimension, and transpose each
  half recursively.
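
  The recursive splitting idea can be sketched as follows. This is only our
  reading of the one-sentence description above, written for Python lists
  rather than Turing tapes; the actual algorithm
  of<nbsp><cite-detail|BGS07|Appendix> may differ in its details, and the
  function name <verbatim|transpose> is an assumption. Each level of the
  recursion makes one linear pass over the data, and the recursion depth is
  about <math|lg min<around*|(|n<rsub|1>,n<rsub|2>|)>>.

```python
def transpose(flat, n1, n2):
    """Transpose an n1 x n2 matrix stored row by row in the list `flat`."""
    if n1 == 1 or n2 == 1:
        return list(flat)  # a single row or column has the same linear layout
    if n1 <= n2:
        # Rows are the short dimension: split into a top and a bottom half.
        h = n1 // 2
        top = transpose(flat[:h * n2], h, n2)          # n2 x h
        bottom = transpose(flat[h * n2:], n1 - h, n2)  # n2 x (n1 - h)
        out = []
        for j in range(n2):  # interleave the rows of the two transposed halves
            out.extend(top[j * h:(j + 1) * h])
            out.extend(bottom[j * (n1 - h):(j + 1) * (n1 - h)])
        return out
    # Columns are the short dimension: split into a left and a right half.
    h = n2 // 2
    left, right = [], []
    for i in range(n1):  # de-interleave each row in one linear pass
        row = flat[i * n2:(i + 1) * n2]
        left.extend(row[:h])
        right.extend(row[h:])
    # The transposed left half supplies the first h rows of the result.
    return transpose(left, n1, h) + transpose(right, n1, n2 - h)

if __name__ == "__main__":
    print(transpose(list(range(6)), 2, 3))
```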

  We will also require more complex rearrangements of data, for which we
  resort to sorting. Suppose that <math|X> is a totally ordered set, whose
  elements are represented by bit strings of length<nbsp><math|b>, and
  suppose that we can compare elements of <math|X> in time
  <math|O<around*|(|b|)>>. Then an array of <math|n> elements of <math|X> may
  be sorted in time <math|O<around*|(|b*n*lg n|)>> using merge sort
  <cite|Knu-vol3>, which can be implemented efficiently on a<nbsp>Turing
  machine.

  <subsection|Discrete Fourier transforms><label|DFT-sec>

  Let <math|R> be a commutative ring with identity and let
  <math|n\<geqslant\>1>. An element <math|\<omega\>\<in\>R> is said to be a
  <em|principal <math|n>-th root of unity> if <math|\<omega\><rsup|n>=1> and

  <\equation>
    <label|principal-root-unity><big|sum><rsub|k=0><rsup|n-1><around*|(|\<omega\><rsup|i>|)><rsup|k>=0
  </equation>

  for all <math|i\<in\><around*|{|1,\<ldots\>,n-1|}>>. In this case, we
  define the <em|discrete Fourier transform> (or DFT) of an
  <math|n><nbhyph>tuple <math|a=<around*|(|a<rsub|0>,\<ldots\>,a<rsub|n-1>|)>\<in\>R<rsup|n>>
  with respect to <math|\<omega\>> to be <math|DFT<rsub|\<omega\>><around*|(|a|)>=<wide|a|^>=<around*|(|<wide|a|^><rsub|0>,\<ldots\>,<wide|a|^><rsub|n-1>|)>\<in\>R<rsup|n>>
  where

  <\eqnarray*>
    <tformat|<table|<row|<cell|<wide|a|^><rsub|i>>|<cell|\<assign\>>|<cell|a<rsub|0>+a<rsub|1>*\<omega\><rsup|i>+\<cdots\>+a<rsub|n-1>*\<omega\><rsup|<around*|(|n-1|)>*i>.>>>>
  </eqnarray*>

  That is, <math|<wide|a|^><rsub|i>> is the evaluation of the polynomial
  <math|A<around*|(|X|)>\<assign\>a<rsub|0>+a<rsub|1>*X+\<cdots\>+a<rsub|n-1>*X<rsup|n-1>>
  at <math|\<omega\><rsup|i>>.

  If <math|\<omega\>> is a<nbsp>principal <math|n>-th root of unity, then so
  is its inverse <math|\<omega\><rsup|-1>=\<omega\><rsup|n-1>>, and we have

  <\eqnarray*>
    <tformat|<table|<row|<cell|DFT<rsub|\<omega\><rsup|-1>><around*|(|DFT<rsub|\<omega\>><around*|(|a|)>|)>>|<cell|=>|<cell|n*a.>>>>
  </eqnarray*>

  Indeed, writing <math|b\<assign\>DFT<rsub|\<omega\><rsup|-1>><around*|(|DFT<rsub|\<omega\>><around*|(|a|)>|)>>,
  the relation<nbsp>(<reference|principal-root-unity>) implies that

  <\equation*>
    b<rsub|i>=<big|sum><rsub|j=0><rsup|n-1><wide|a|^><rsub|j>*\<omega\><rsup|-j*i>=<big|sum><rsub|j=0><rsup|n-1><big|sum><rsub|k=0><rsup|n-1>a<rsub|k>*\<omega\><rsup|j*<around*|(|k-i|)>>=<big|sum><rsub|k=0><rsup|n-1>a<rsub|k>*<big|sum><rsub|j=0><rsup|n-1>\<omega\><rsup|j*<around*|(|k-i|)>>=<big|sum><rsub|k=0><rsup|n-1>a<rsub|k>*<around*|(|n*\<delta\><rsub|i,k>|)>=n*a<rsub|i>,
  </equation*>

  where <math|\<delta\><rsub|i,k>=1> if <math|i=k> and
  <math|\<delta\><rsub|i,k>=0> otherwise.
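
  The definitions above can be checked numerically on a small example. The
  sketch below assumes the ring <math|R=\<bbb-Z\>/17*\<bbb-Z\>> with
  <math|n=8> and <math|\<omega\>=2> (an illustrative choice, not one used
  elsewhere in this paper), and evaluates the DFT naively in quadratic time.
  The helper names <verbatim|dft> and <verbatim|is_principal_root> are
  hypothetical.

```python
def dft(a, omega, p):
    """Naive DFT of the tuple a over Z/pZ with respect to omega."""
    n = len(a)
    return [sum(a[k] * pow(omega, i * k, p) for k in range(n)) % p
            for i in range(n)]

def is_principal_root(omega, n, p):
    """Check omega^n = 1 and that sum_k (omega^i)^k vanishes for 0 < i < n."""
    if pow(omega, n, p) != 1:
        return False
    return all(sum(pow(omega, i * k, p) for k in range(n)) % p == 0
               for i in range(1, n))

if __name__ == "__main__":
    p, n, omega = 17, 8, 2
    a = [3, 1, 4, 1, 5, 9, 2, 6]
    omega_inv = pow(omega, n - 1, p)  # the inverse of omega is omega^(n-1)
    b = dft(dft(a, omega, p), omega_inv, p)
    print(is_principal_root(omega, n, p))
    print(b == [(n * x) % p for x in a])  # DFT with omega^(-1) undoes the DFT up to the factor n
```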

  <\remark>
    In all of the new algorithms introduced in this paper, we actually work
    over a<nbsp>field, whose characteristic does not divide <math|n>. In this
    setting, the concept of principal root of unity coincides with the more
    familiar <em|primitive root of unity>. The more general ``principal
    root'' concept is only needed for discussions of other algorithms, such
    as the Schönhage--Strassen algorithm or Fürer's algorithm.
  </remark>

  <subsection|The Cooley--Tukey FFT><label|FFT-sec>

  Let <math|\<omega\>> be a principal <math|n>-th root of unity and let
  <math|n=n<rsub|1>*n<rsub|2>> where <math|1\<less\>n<rsub|1>\<less\>n>. Then
  <math|\<omega\><rsup|n<rsub|1>>> is a<nbsp>principal <math|n<rsub|2>>-th
  root of unity and <math|\<omega\><rsup|n<rsub|2>>> is a principal
  <math|n<rsub|1>>-th root of unity. Moreover, for any
  <math|i<rsub|1>\<in\><around*|{|0,\<ldots\>,n<rsub|1>-1|}>> and
  <math|i<rsub|2>\<in\><around*|{|0,\<ldots\>,n<rsub|2>-1|}>>, we have

  <\eqnarray*>
    <tformat|<table|<row|<cell|<wide|a|^><rsub|i<rsub|1>*n<rsub|2>+i<rsub|2>>>|<cell|=>|<cell|<big|sum><rsub|k<rsub|1>=0><rsup|n<rsub|1>-1><big|sum><rsub|k<rsub|2>=0><rsup|n<rsub|2>-1>a<rsub|k<rsub|2>*n<rsub|1>+k<rsub|1>>*\<omega\><rsup|<around*|(|k<rsub|2>*n<rsub|1>+k<rsub|1>|)>*<around*|(|i<rsub|1>*n<rsub|2>+i<rsub|2>|)>>>>|<row|<cell|>|<cell|=>|<cell|<big|sum><rsub|k<rsub|1>=0><rsup|n<rsub|1>-1>\<omega\><rsup|k<rsub|1>*i<rsub|2>>*<around*|(|<big|sum><rsub|k<rsub|2>=0><rsup|n<rsub|2>-1>a<rsub|k<rsub|2>*n<rsub|1>+k<rsub|1>>*<around*|(|\<omega\><rsup|n<rsub|1>>|)><rsup|k<rsub|2>*i<rsub|2>>|)>*<around*|(|\<omega\><rsup|n<rsub|2>>|)><rsup|k<rsub|1>*i<rsub|1>>.<eq-number><label|FFT-dec>>>>>
  </eqnarray*>

  If <math|\<cal-A\><rsub|1>> and <math|\<cal-A\><rsub|2>> are algorithms for
  computing DFTs of length <math|n<rsub|1>> and <math|n<rsub|2>>, we may use
  <eqref|FFT-dec> to construct an algorithm
  <math|\<cal-A\><rsub|1>\<odot\>\<cal-A\><rsub|2>> for computing DFTs of
  length <math|n> as follows.

  For each <math|k<rsub|1>\<in\><around*|{|0,\<ldots\>,n<rsub|1>-1|}>>, the
  sum inside the brackets corresponds to the <math|i<rsub|2>>-th coefficient
  of a<nbsp>DFT of the <math|n<rsub|2>>-tuple
  <math|<around*|(|a<rsub|0*n<rsub|1>+k<rsub|1>>,\<ldots\>,a<rsub|<around*|(|n<rsub|2>-1|)>*n<rsub|1>+k<rsub|1>>|)>\<in\>R<rsup|n<rsub|2>>>
  with respect to <math|\<omega\><rsup|n<rsub|1>>>. Evaluating these
  <with|font-shape|italic|inner DFTs> requires <math|n<rsub|1>> calls to
  <math|\<cal-A\><rsub|2>>. Next, we multiply by the <em|twiddle factors>
  <math|\<omega\><rsup|k<rsub|1>*i<rsub|2>>\<nocomma\>>, at a cost
  of<nbsp><math|n> operations in <math|R>. (Actually, fewer than <math|n>
  multiplications are required, as some of the twiddle factors are equal to
  <math|1>. This optimisation, while important in practice, has no asymptotic
  effect on the algorithms discussed in this paper.) Finally, for each
  <math|i<rsub|2>\<in\><around*|{|0,\<ldots\>,n<rsub|2>-1|}>>, the outer sum
  corresponds to the <math|i<rsub|1>>-th coefficient of a DFT of an
  <math|n<rsub|1>>-tuple in <math|R<rsup|n<rsub|1>>> with respect to
  <math|\<omega\><rsup|n<rsub|2>>>. These <with|font-shape|italic|outer DFTs>
  require <math|n<rsub|2>> calls to <math|\<cal-A\><rsub|1>>.
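
  The three stages of <math|\<cal-A\><rsub|1>\<odot\>\<cal-A\><rsub|2>>
  (inner DFTs, twiddle factors, outer DFTs) can be sketched in Python as
  follows, again assuming <math|R=\<bbb-Z\>/p*\<bbb-Z\>> for illustration. The
  names <verbatim|cooley_tukey>, <verbatim|alg1> and <verbatim|alg2> are
  hypothetical; by default both subalgorithms are the naive quadratic DFT, and
  the twiddle factors are recomputed instead of being read from a precomputed
  table.

```python
def dft(a, omega, p):
    """Naive DFT of the tuple a over Z/pZ with respect to omega."""
    n = len(a)
    return [sum(a[k] * pow(omega, i * k, p) for k in range(n)) % p
            for i in range(n)]

def cooley_tukey(a, omega, p, n1, n2, alg1=dft, alg2=dft):
    """DFT of length n1*n2, using alg1 for length n1 and alg2 for length n2."""
    n = n1 * n2
    assert len(a) == n
    # Inner DFTs: n1 calls to alg2, one per k1, on (a[k1], a[n1 + k1], ...),
    # with respect to omega^n1.
    inner = [alg2(a[k1::n1], pow(omega, n1, p), p) for k1 in range(n1)]
    # Twiddle factors: multiply the entry (k1, i2) by omega^(k1 * i2).
    for k1 in range(n1):
        for i2 in range(n2):
            inner[k1][i2] = inner[k1][i2] * pow(omega, k1 * i2, p) % p
    # Outer DFTs: n2 calls to alg1, one per i2, with respect to omega^n2.
    out = [0] * n
    for i2 in range(n2):
        column = [inner[k1][i2] for k1 in range(n1)]
        outer = alg1(column, pow(omega, n2, p), p)
        for i1 in range(n1):
            out[i1 * n2 + i2] = outer[i1]
    return out

if __name__ == "__main__":
    a = [3, 1, 4, 1, 5, 9, 2, 6]
    print(cooley_tukey(a, 2, 17, 2, 4) == dft(a, 2, 17))
```

  The output index <math|i<rsub|1>*n<rsub|2>+i<rsub|2>> and the input index
  <math|k<rsub|2>*n<rsub|1>+k<rsub|1>> follow<nbsp><eqref|FFT-dec> directly;
  on a Turing machine the column accesses would additionally require the
  transpositions of section<nbsp><reference|arrays-sec>.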

  Denoting by <math|<math-ss|F><rsub|R><around*|(|n|)>> the number of ring
  operations needed to compute a DFT of length <math|n>, and assuming that we
  have available a precomputed table of twiddle factors, we obtain

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|F><rsub|R><around*|(|n<rsub|1>*n<rsub|2>|)>>|<cell|\<leqslant\>>|<cell|n<rsub|1>*<math-ss|F><rsub|R><around*|(|n<rsub|2>|)>+n<rsub|2>*<math-ss|F><rsub|R><around*|(|n<rsub|1>|)>+n.>>>>
  </eqnarray*>

  For a factorisation <math|n=n<rsub|1>*\<cdots\>*n<rsub|d>>, this yields
  recursively

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|F><rsub|R><around*|(|n|)>>|<cell|\<leqslant\>>|<cell|<big|sum><rsup|d><rsub|i=1><frac|n|n<rsub|i>>*<math-ss|F><rsub|R><around*|(|n<rsub|i>|)>+<around*|(|d-1|)>*n.<eq-number><label|fft-rec-bound>>>>>
  </eqnarray*>

  The corresponding algorithm is denoted <math|\<cal-A\><rsub|1>\<odot\>\<cdots\>\<odot\>\<cal-A\><rsub|d>>.
  The <math|\<odot\>> operation is neither commutative nor associative; the
  above expression will always be taken to mean
  <math|<around*|(|\<cdots\><around*|(|<around*|(|\<cal-A\><rsub|1>\<odot\>\<cal-A\><rsub|2>|)>\<odot\>\<cal-A\><rsub|3>|)>\<odot\>\<cdots\>|)>\<odot\>\<cal-A\><rsub|d>>.

  Let <math|\<cal-B\>> be the butterfly algorithm that computes a DFT of
  length 2 by the formula <math|<around*|(|a<rsub|0>,a<rsub|1>|)>\<mapsto\><around*|(|a<rsub|0>+a<rsub|1>,a<rsub|0>-a<rsub|1>|)>>.
  Then <math|\<cal-B\><rsup|\<odot\>k>\<assign\>\<cal-B\>\<odot\>\<cdots\>\<odot\>\<cal-B\>>
  computes a DFT of length <math|n\<assign\>2<rsup|k>> in time
  <math|<math-ss|F><rsub|R><around*|(|2<rsup|k>|)>=O<around*|(|k*n|)>>.
  Algorithms of this type are called <em|fast Fourier transforms> (or FFTs).
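
  A minimal recursive sketch of <math|\<cal-B\><rsup|\<odot\>k>> over
  <math|\<bbb-Z\>/p*\<bbb-Z\>> is given below, under the assumption that
  <math|\<omega\>> is a principal <math|2<rsup|k>>-th root of unity modulo an
  odd prime <math|p>, so that <math|\<omega\><rsup|n/2>=-1>. The function name
  <verbatim|fft> is hypothetical, and for convenience the recursion bottoms
  out at length 1 rather than at a single butterfly.

```python
def fft(a, omega, p):
    """DFT of length 2^k over Z/pZ, built from butterflies of length 2."""
    n = len(a)
    if n == 1:
        return list(a)  # a DFT of length 1 is the identity
    half = n // 2
    # Inner DFTs of length 2 (butterflies) on the pairs (a[k], a[k + n/2]),
    # followed by the twiddle factors omega^(k * i2) for i2 = 0 and i2 = 1.
    plus = [(a[k] + a[k + half]) % p for k in range(half)]
    minus = [(a[k] - a[k + half]) * pow(omega, k, p) % p for k in range(half)]
    # Two outer DFTs of length n/2 with respect to omega^2.
    w2 = omega * omega % p
    out = [0] * n
    out[0::2] = fft(plus, w2, p)   # output indices 2 * i1
    out[1::2] = fft(minus, w2, p)  # output indices 2 * i1 + 1
    return out

if __name__ == "__main__":
    print(fft([3, 1, 4, 1, 5, 9, 2, 6], 2, 17))
```

  Each level of the recursion performs <math|O<around*|(|n|)>> ring
  operations and there are <math|k> levels, in agreement with the bound
  <math|O<around*|(|k*n|)>> stated above.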

  The above discussion requires several modifications in the Turing model.
  Assume that elements of <math|R> are represented by <math|b> bits.

  First, for <math|\<cal-A\><rsub|1>\<odot\>\<cal-A\><rsub|2>>, we must add a
  rearrangement cost of <math|O<around|(|b*n*lg
  <rigid|min<around|(|n<rsub|1>,n<rsub|2>|)>>|)>> to efficiently access the
  rows and columns for the recursive subtransforms (see
  section<nbsp><reference|arrays-sec>). For the general case
  <math|\<cal-A\><rsub|1>\<odot\>\<cdots\>\<odot\>\<cal-A\><rsub|d>\<nocomma\>>,
  the total rearrangement cost is bounded by
  <math|O<around*|(|<big|sum><rsub|i>b*n*lg n<rsub|i>|)>=O<around*|(|b*n*lg
  n|)>>.

  Second, we will sometimes use <with|font-shape|italic|non-algebraic>
  algorithms to compute the subtransforms, so it may not make sense to
  express their cost in terms of <math|<math-ss|F><rsub|R>>. The relation
  <eqref|fft-rec-bound> therefore becomes

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|F><around*|(|n|)>>|<cell|\<leqslant\>>|<cell|<big|sum><rsup|d><rsub|i=1><frac|n|n<rsub|i>>*<math-ss|F><around*|(|n<rsub|i>|)>+<around*|(|d-1|)>*n*<math-ss|m><rsub|R>+O<around*|(|b*n*lg
    n|)>,<eq-number><label|fft-rec-bound2>>>>>
  </eqnarray*>

  where <math|<math-ss|F><around*|(|n|)>> is the (Turing) cost of a transform
  of length <math|n> over <math|R>, and where <math|<math-ss|m><rsub|R>> is
  the cost of a<nbsp>single multiplication in <math|R>.

  Finally, we point out that <math|\<cal-A\><rsub|1>\<odot\>\<cal-A\><rsub|2>>
  requires access to a table of twiddle factors
  <math|\<omega\><rsup|i<rsub|1>*i<rsub|2>>>, ordered lexicographically by
  <math|<around*|(|i<rsub|1>,i<rsub|2>|)>>, for
  <math|0\<leqslant\>i<rsub|1>\<less\>n<rsub|1>>,
  <math|0\<leqslant\>i<rsub|2>\<less\>n<rsub|2>\<nocomma\>>. Assuming that we
  are given as input a<nbsp>precomputed table of the form
  <math|1,\<omega\>,\<ldots\>,\<omega\><rsup|n-1>>, we must show how to
  extract the required twiddle factor table in the correct order. We first
  construct a list of triples <math|<around*|(|i<rsub|1>,i<rsub|2>,i<rsub|1>*i<rsub|2>|)>>,
  ordered by <math|<around*|(|i<rsub|1>,i<rsub|2>|)>>, in time
  <math|O<around*|(|n*lg n|)>>; then sort by <math|i<rsub|1>*i<rsub|2>> in
  time <math|O<around*|(|n*lg<rsup|2> n|)>> (see
  section<nbsp><reference|arrays-sec>); then merge with the given root table
  to obtain a table <math|<around*|(|i<rsub|1>,i<rsub|2>,\<omega\><rsup|i<rsub|1>*i<rsub|2>>|)>>,
  ordered by <math|i<rsub|1>*i<rsub|2>>, in time
  <math|O<around*|(|n*<around*|(|b+lg n|)>|)>>; and finally sort by
  <math|<around*|(|i<rsub|1>,i<rsub|2>|)>> in time
  <math|O<around*|(|n*lg*n*<around*|(|b+lg n|)>|)>>. The total cost of the
  extraction is thus <math|O<around*|(|n*lg*n*<around*|(|b+lg n|)>|)>>.
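
  The sort, merge, sort strategy just described can be sketched in Python as follows. The function name and the list-based data layout are assumptions made for illustration, and Python's built-in sort stands in for the Turing-machine sorting procedure, so the sketch shows the data flow rather than the stated bit complexity.

```python
def extract_twiddles(roots, n1, n2):
    """Reorder 1, w, ..., w^(n-1) into w^(i1*i2), lexicographic in (i1, i2)."""
    # Step 1: triples (i1, i2, i1*i2), generated in lexicographic order.
    triples = [(i1, i2, i1 * i2) for i1 in range(n1) for i2 in range(n2)]
    # Step 2: sort by the exponent i1*i2.
    triples.sort(key=lambda t: t[2])
    # Step 3: merge with the root table. One left-to-right pass suffices,
    # because the exponents are now non-decreasing.
    merged, pos = [], 0
    for i1, i2, e in triples:
        while pos != e:
            pos += 1
        merged.append((i1, i2, roots[pos]))
    # Step 4: sort back into lexicographic order of (i1, i2).
    merged.sort(key=lambda t: (t[0], t[1]))
    return [w for _, _, w in merged]
```

  The merge pass only ever moves forward through the root table, which is what makes it suitable for sequential access.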

  The corresponding cost for <math|\<cal-A\><rsub|1>\<odot\>\<cdots\>\<odot\>\<cal-A\><rsub|d>>
  is determined as follows. Assuming that the table
  <math|1,\<omega\>,\<ldots\>,\<omega\><rsup|n-1>> is given as input, we
  first extract the subtables of <around*|(|<math|n<rsub|1>*\<cdots\>*n<rsub|i>>|)>-th
  roots of unity for <math|i=d-1,\<ldots\>,2> in time
  <math|O<around*|(|<around*|(|n<rsub|1>*\<cdots\>*n<rsub|d>+\<cdots\>+n<rsub|1>*n<rsub|2>|)>*<around*|(|b+lg
  n|)>|)>=O<around*|(|n*<around*|(|b+lg n|)>|)>>. Extracting the twiddle
  factor table for the decomposition <math|<around*|(|n<rsub|1>*\<cdots\>*n<rsub|i-1>|)>\<times\>n<rsub|i>>
  then costs <math|O<around*|(|n<rsub|1>*\<cdots\>*n<rsub|i>*lg
  n*<around*|(|b+lg n|)>|)>>; the total over all <math|i> is again
  <math|O<around*|(|n*lg n*<around*|(|b+lg n|)>|)>>.

  <\remark>
    An alternative approach is to compute the twiddle factors directly in the
    correct order. When working over <math|\<bbb-C\>>, as in
    section<nbsp><reference|err-sec>, this requires a slight increase in the
    working precision. Similar comments apply to the root tables used in
    Bluestein's algorithm in section<nbsp><reference|Bluestein-sec>.
  </remark>

  <subsection|Fast Fourier multiplication>

  Let <math|\<omega\>> be a principal <math|n>-th root of unity in <math|R>
  and assume that <math|n> is invertible in <math|R>. Consider two
  polynomials <math|A=a<rsub|0>+\<cdots\>+a<rsub|n-1>*X<rsup|n-1>> and
  <math|B=b<rsub|0>+\<cdots\>+b<rsub|n-1>*X<rsup|n-1>>
  in<nbsp><math|R<around*|[|X|]>>. Let <math|C=c<rsub|0>+\<cdots\>+c<rsub|n-1>*X<rsup|n-1>>
  be the polynomial defined by

  <\eqnarray*>
    <tformat|<table|<row|<cell|c>|<cell|\<assign\>>|<cell|<tfrac|1|n>*DFT<rsub|\<omega\><rsup|-1>><around*|(|DFT<rsub|\<omega\>><around*|(|a|)>*DFT<rsub|\<omega\>><around*|(|b|)>|)>,>>>>
  </eqnarray*>

  where the product of the DFTs is taken pointwise. By construction, we have
  <math|<wide|c|^>=<wide|a|^>*<wide|b|^>>, which means that
  <math|C<around*|(|\<omega\><rsup|i>|)>=A<around*|(|\<omega\><rsup|i>|)>*B<around*|(|\<omega\><rsup|i>|)>>
  for all <math|i\<in\><around*|{|0,\<ldots\>,n-1|}>>. The product
  <math|S=s<rsub|0>+\<cdots\>+s<rsub|n-1>*X<rsup|n-1>> of <math|A> and
  <math|B> modulo <math|X<rsup|n>-1> also satisfies
  <math|S<around*|(|\<omega\><rsup|i>|)>=A<around*|(|\<omega\><rsup|i>|)>*B<around*|(|\<omega\><rsup|i>|)>>
  for all<nbsp><math|i>. Consequently, <math|<wide|s|^>=<wide|a|^>*<wide|b|^>>,
  <math|s=DFT<rsub|\<omega\><rsup|-1>><around*|(|<wide|s|^>|)>/n=c>, whence
  <math|C=S>.

  For polynomials <math|A,B\<in\>R<around*|[|X|]>> with <math|deg A\<less\>n>
  and <math|deg B\<less\>n>, we thus obtain an algorithm for the computation
  of <math|A*B> modulo <math|X<rsup|n>-1> using at most
  <math|3*<math-ss|F><rsub|R><around*|(|n|)>+O<around*|(|n|)>> operations in
  <math|R>. Modular products of this type are also called <em|cyclic
  convolutions>. If <math|deg <around*|(|A*B|)>\<less\>n>, then we may
  recover the product <math|A*B> from its reduction modulo
  <math|X<rsup|n>-1>. This multiplication method is called <em|FFT
  multiplication>.

  If one of the arguments (say <math|B>) is fixed and we want to compute many
  products <math|A*B> (or cyclic convolutions) for different <math|A>, then
  we may precompute <math|DFT<rsub|\<omega\>><around*|(|b|)>>, after which
  each new product <math|A*B> can be computed using only
  <math|2*<math-ss|F><rsub|R><around*|(|n|)>+O<around*|(|n|)>>
  operations<nbsp>in <math|R>.
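
  The following Python sketch illustrates FFT multiplication over <math|\<bbb-Z\>/257*\<bbb-Z\>>, which is an assumed stand-in for <math|R>; the length <math|n> must then divide 256. All function names are hypothetical, and a quadratic-time DFT is used for brevity where a fast transform would be used in practice.

```python
P = 257  # Z/257Z plays the role of R (assumption); 3 generates its units

def dft(a, omega):
    """Quadratic-time DFT over Z/PZ, used here in place of an FFT."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, P) for j in range(n)) % P
            for i in range(n)]

def cyclic_convolution(a, b):
    """Coefficients of A*B mod X^n - 1 over Z/PZ, via three DFTs."""
    n = len(a)
    omega = pow(3, (P - 1) // n, P)      # principal n-th root of unity
    omega_inv = pow(omega, P - 2, P)     # inverse by Fermat's little theorem
    n_inv = pow(n, P - 2, P)             # n is invertible in Z/PZ
    a_hat, b_hat = dft(a, omega), dft(b, omega)
    c_hat = [x * y % P for x, y in zip(a_hat, b_hat)]  # pointwise product
    return [n_inv * x % P for x in dft(c_hat, omega_inv)]

def naive_cyclic(a, b):
    """Reference cyclic convolution, directly from the definition."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % P
    return c
```

  When <math|b> is fixed, the list <verbatim|b_hat> can be computed once and reused, which leaves two transforms per product.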

  <subsection|Bluestein's chirp transform><label|Bluestein-sec>

  We have shown above how to multiply polynomials using DFTs. Conversely, it
  is possible to reduce the computation of DFTs <emdash> of arbitrary length,
  not necessarily a power of two <emdash> to polynomial multiplication
  <cite|Bluestein1970>, as follows.

  Let <math|\<omega\>> be a principal <math|n>-th root of unity. For
  simplicity we assume that <math|n> is even, and that there exists some
  <math|\<eta\>\<in\>R> with <math|\<eta\><rsup|2>=\<omega\>>. Consider the
  sequences

  <\equation*>
    f<rsub|i>\<assign\>\<eta\><rsup|i<rsup|2>>,<space|1em>g<rsub|i>\<assign\>\<eta\><rsup|-i<rsup|2>>.
  </equation*>

  Then <math|\<omega\><rsup|i*j>=f<rsub|i>*f<rsub|j>*g<rsub|i-j>>, so for any
  <math|a\<in\>R<rsup|n>> we have

  <\equation>
    <wide|a|^><rsub|i>=<big|sum><rsub|j=0><rsup|n-1>a<rsub|j>*\<omega\><rsup|i*j>=f<rsub|i>*<big|sum><rsub|j=0><rsup|n-1><around*|(|a<rsub|j>*f<rsub|j>|)>*g<rsub|i-j>.<label|chirp-formula>
  </equation>

  Also, since <math|n> is even,

  <\equation*>
    g<rsub|i+n>=\<eta\><rsup|-<around*|(|i+n|)><rsup|2>>=\<eta\><rsup|-i<rsup|2>-n<rsup|2>-2*n*i>=\<eta\><rsup|-i<rsup|2>>*\<omega\><rsup|-<around*|(|<frac|n|2>+i|)>*n>=g<rsub|i>.
  </equation*>

  Now let <math|F\<assign\>f<rsub|0>*a<rsub|0>+\<cdots\>+f<rsub|n-1>*a<rsub|n-1>*X<rsup|n-1>>,
  <math|G\<assign\>g<rsub|0>+\<cdots\>+g<rsub|n-1>*X<rsup|n-1>> and
  <math|C\<assign\>c<rsub|0>+\<cdots\>+c<rsub|n-1>*X<rsup|n-1>\<equiv\>F*G>
  modulo <math|X<rsup|n>-1>. Then<nbsp>(<reference|chirp-formula>) implies
  that <math|<wide|a|^><rsub|i>=f<rsub|i>*c<rsub|i>> for all
  <math|i\<in\><around*|{|0,\<ldots\>,n-1|}>>. In other words, the
  computation of a DFT of even length<nbsp><math|n> reduces to a<nbsp>cyclic
  convolution product of the same length, together with
  <math|O<around*|(|n|)>> additional operations in <math|R>. Notice that the
  polynomial <math|G> is fixed and independent of <math|a> in this product.

  The only complication in the Turing model is the cost of extracting the
  <math|f<rsub|i>> in the correct order, i.e., in the order
  <math|1,\<eta\>,\<eta\><rsup|4>,\<eta\><rsup|9>,\<ldots\>,\<eta\><rsup|<around*|(|n-1|)><rsup|2>>>,
  given as input a precomputed table <math|1,\<eta\>,\<eta\><rsup|2>,\<ldots\>,\<eta\><rsup|2*n-1>>.
  We may do this in time <math|O<around*|(|n*lg*n*<around*|(|b+lg n|)>|)>> by
  applying the strategy from section<nbsp><reference|FFT-sec> to the pairs
  <math|<around*|(|i,i<rsup|2> mod 2*n|)>> for
  <math|0\<leqslant\>i\<less\>n>. Similar remarks apply to the
  <math|g<rsub|i>>.
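
  As a sanity check of the reduction, the following Python sketch evaluates the chirp formula over <math|\<bbb-Z\>/257*\<bbb-Z\>>. The modulus is an assumption chosen so that a suitable <math|\<eta\>> exists, which restricts <math|n> to even divisors of 128; the function name is hypothetical, and the cyclic convolution is computed naively since only the reduction itself is being illustrated.

```python
P = 257  # Z/257Z stands in for R (assumption); 3 generates its unit group

def bluestein_dft(a):
    """DFT of even length n (n dividing 128) with respect to omega = eta^2.

    Returns the pair (transform, omega)."""
    n = len(a)
    eta = pow(3, (P - 1) // (2 * n), P)   # eta has order 2n
    eta_inv = pow(eta, P - 2, P)
    f = [pow(eta, i * i, P) for i in range(n)]      # f_i = eta^(i^2)
    g = [pow(eta_inv, i * i, P) for i in range(n)]  # g_i = eta^(-i^2)
    fa = [f[j] * a[j] % P for j in range(n)]        # coefficients of F
    # c = F * G mod X^n - 1; indices of g wrap because g_(i+n) = g_i.
    c = [sum(fa[j] * g[(i - j) % n] for j in range(n)) % P
         for i in range(n)]
    return [f[i] * c[i] % P for i in range(n)], eta * eta % P
```

  The list <verbatim|g> does not depend on the input, in line with the observation that <math|G> is fixed.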

  <\remark>
    It is also possible to give variants of the new multiplication algorithms
    in which Bluestein's transform is replaced by a different method for
    converting DFTs to convolutions, such as Rader's algorithm
    <cite|Rad-prime>.
  </remark>

  <subsection|Kronecker substitution and segmentation><label|Kronecker-sec>

  Multiplication in <math|\<bbb-Z\><around*|[|X|]>> may be reduced to
  multiplication in <math|\<bbb-Z\>> using the classical technique of
  <em|Kronecker substitution><nbsp><cite-detail|GaGe2002|Corollary<nbsp>8.27>.
  More precisely, let <math|d\<gtr\>0> and <math|n\<gtr\>0>, and suppose that
  we are given two polynomials <math|A,B\<in\>\<bbb-Z\><around*|[|X|]>> of
  degree less than <math|d>, with coefficients <math|A<rsub|i>> and
  <math|B<rsub|i>> satisfying <math|<around*|\||A<rsub|i>|\|>\<leqslant\>2<rsup|n>>
  and <math|<around*|\||B<rsub|i>|\|>\<leqslant\>2<rsup|n>>. Then for the
  product <math|C=A*B> we have <math|<around*|\||C<rsub|i>|\|>\<leqslant\>2<rsup|2*n+lg
  d>>. Consequently, the coefficients of <math|C> may be read off the integer
  product <math|C<around*|(|2<rsup|N>|)>=A<around*|(|2<rsup|N>|)>*B<around*|(|2<rsup|N>|)>>
  where <math|N\<assign\>2*n+lg d+2>. Notice that the integers
  <math|<around*|\||A<around*|(|2<rsup|N>|)>|\|>> and
  <math|<around*|\||B<around*|(|2<rsup|N>|)>|\|>> have bit length at most
  <math|d*N>, and the encoding and decoding processes have complexity
  <math|O<around*|(|d*N|)>>.
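
  A minimal Python sketch of Kronecker substitution follows, using Python's arbitrary-precision integers for the packed values. The function name is hypothetical, <math|lg> is assumed to denote the ceiling of the base-2 logarithm, and the packing is written with powers rather than shifts for readability, so it does not reflect the linear encoding cost.

```python
def kronecker_multiply(A, B, n):
    """Multiply integer polynomials whose coefficients have absolute value
    at most 2^n, using a single big-integer product."""
    d = max(len(A), len(B))
    lg_d = (d - 1).bit_length()        # assumption: lg means ceil(log2)
    N = 2 * n + lg_d + 2               # slot width N = 2n + lg d + 2
    base = 2 ** N
    a = sum(c * base ** i for i, c in enumerate(A))   # A(2^N)
    b = sum(c * base ** i for i, c in enumerate(B))   # B(2^N)
    c = a * b                                         # C(2^N)
    C = []
    for _ in range(len(A) + len(B) - 1):
        r = c % base                   # low slot as a residue in [0, 2^N)
        if r // (base // 2) == 1:      # upper half encodes a negative value
            r -= base
        C.append(r)
        c = (c - r) // base            # exact division; drops the slot
    return C
```

  The two extra bits in <math|N> leave room for the sign, so each coefficient of <math|C> can be decoded unambiguously from its slot.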

  The inverse procedure is <with|font-shape|italic|Kronecker segmentation>.
  Given <math|n\<gtr\>0> and <math|d\<gtr\>0>, and non-negative integers
  <math|a\<less\>2<rsup|n>> and <math|b\<less\>2<rsup|n>>, we may reduce the
  computation of <math|c\<assign\>a*b> to the computation of a<nbsp>product
  <math|C\<assign\>A*B> of two polynomials
  <math|A,B\<in\>\<bbb-Z\><around*|[|X|]>> of degree less than <math|d>, and
  with <math|<around*|\||A<rsub|i>|\|>\<less\>2<rsup|k>> and
  <math|<around*|\||B<rsub|i>|\|>\<less\>2<rsup|k>> where
  <math|k\<assign\><around*|\<lceil\>|n/d|\<rceil\>>>. Indeed, we may cut the
  integers into chunks of <math|k> bits each, so that
  <math|a=A<around*|(|2<rsup|k>|)>>, <math|b=B<around*|(|2<rsup|k>|)>> and
  <math|c=C<around*|(|2<rsup|k>|)>>. Notice that we may recover <math|c> from
  <math|C> using an overlap-add procedure in time
  <math|O<around*|(|d*<around*|(|k+lg d|)>|)>=O<around*|(|n+d*lg d|)>>. In
  our applications, we will always have <math|d=O<around*|(|n/lg n|)>>, so
  that <math|O<around*|(|n+d*lg d|)>=O<around*|(|n|)>>.
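
  Kronecker segmentation can be sketched in Python as follows. The function names are hypothetical, and a schoolbook polynomial product stands in for whatever polynomial multiplication is actually used; the final sum plays the role of the overlap-add step.

```python
def segment(x, k, d):
    """Cut a non-negative integer into d chunks of k bits, low chunk first."""
    return [(x // 2 ** (k * i)) % 2 ** k for i in range(d)]

def multiply_by_segmentation(a, b, n, d):
    """Compute a*b for non-negative a, b below 2^n via a polynomial product."""
    k = -(-n // d)                     # k = ceil(n / d)
    A, B = segment(a, k, d), segment(b, k, d)
    C = [0] * (2 * d - 1)
    for i, x in enumerate(A):          # schoolbook product, standing in for
        for j, y in enumerate(B):      # any polynomial multiplication
            C[i + j] += x * y
    # Overlap-add: evaluate C at 2^k; the coefficients of C are wider than
    # k bits, so neighbouring terms overlap and carries propagate.
    return sum(c * 2 ** (k * i) for i, c in enumerate(C))
```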

  Kronecker substitution and segmentation can also be used to handle Gaussian
  integers (and Gaussian integer polynomials), and to compute cyclic
  convolutions. For example, if
  <math|A,B\<in\>\<bbb-Z\><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|d>-1|)>>
  satisfy <math|<around*|\||A<rsub|i>|\|>,<around*|\||B<rsub|i>|\|>\<leqslant\>2<rsup|n>>,
  then <math|C=A*B> satisfies <math|<around*|\||C<rsub|i>|\|>\<leqslant\>2<rsup|2*n+lg
  d>>, so we may recover <math|C> from the cyclic Gaussian integer product
  <math|C<around*|(|2<rsup|N>|)>=A<around*|(|2<rsup|N>|)>*B<around*|(|2<rsup|N>|)>\<in\><around*|(|\<bbb-Z\>/<around*|(|2<rsup|d*N>-1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>,
  where <math|N\<assign\>2*n+lg d+2>. In the other direction, suppose that we
  wish to compute <math|a*b> for some <math|a,b\<in\><around*|(|\<bbb-Z\>/<around*|(|2<rsup|d*n>-1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>.
  We may assume that the ``real'' and ``imaginary'' parts of <math|a> and
  <math|b> are non-negative, and so reduce to the problem of multiplying
  <math|A,B\<in\>\<bbb-Z\><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|d>-1|)>>,
  where <math|a=A<around*|(|2<rsup|n>|)>> and
  <math|b=B<around*|(|2<rsup|n>|)>>, and where the real and imaginary parts
  of <math|A<rsub|i>,B<rsub|i>\<in\>\<bbb-Z\><around*|[|\<mathi\>|]>> are
  non-negative and have at most <math|n> bits.

  <section|Fixed point computations and error bounds><label|err-sec>

  In this section, we consider the computation of DFTs over <math|\<bbb-C\>>
  in the Turing model. Elements of <math|\<bbb-C\>> can only be represented
  approximately on a Turing machine. We describe algorithms that compute DFTs
  approximately, using a fixed-point representation for <math|\<bbb-C\>>, and
  we give complexity bounds and a detailed error analysis for these
  algorithms. We refer the reader to<nbsp><cite|BZ10> for more details about
  multiple precision arithmetic.

  For our complexity estimates we will freely use the standard observation
  that <math|<math-ss|I><around*|(|O<around*|(|n|)>|)>=O<around*|(|<math-ss|I><around*|(|n|)>|)>>,
  since the multiplication of two integers of bit length
  <math|\<leqslant\>k*n> reduces to <math|k<rsup|2>> multiplications of
  integers of bit length <math|\<leqslant\>n>, for any fixed
  <math|k\<geqslant\>1>.

  <subsection|Fixed point numbers>

  We will represent fixed point numbers by a signed mantissa and a fixed
  exponent. More precisely, given a precision parameter
  <math|p\<geqslant\>4>, we denote by <math|\<bbb-C\><rsub|p>> the set of
  complex numbers of the form <math|z=m<rsub|z>*2<rsup|-p>>, where
  <math|m<rsub|z>=u+v*\<mathi\>> for integers <math|u> and <math|v>
  satisfying <math|u<rsup|2>+v<rsup|2>\<leqslant\>2<rsup|2*p>>, i.e.,
  <math|<around*|\||z|\|>\<leqslant\>1>. We write
  <math|\<bbb-C\><rsub|p>*2<rsup|e>> for the set of complex numbers of the
  form <math|u*2<rsup|e>\<nocomma\>>, where <math|u\<in\>\<bbb-C\><rsub|p>>
  and <math|e\<in\>\<bbb-Z\>>; in particular, for
  <math|z\<in\>\<bbb-C\><rsub|p>*2<rsup|e>> we always have
  <math|<around*|\||z|\|>\<leqslant\>2<rsup|e>>. At every stage of our
  algorithms, the exponent <math|e> will be determined implicitly by context,
  and in particular, the exponents do not have to be explicitly stored or
  manipulated.

  In our error analysis of numerical algorithms, each
  <math|z\<in\>\<bbb-C\><rsub|p>*2<rsup|e>> is really the approximation of
  some genuine complex number <math|<wide|z|~>\<in\>\<bbb-C\>>. Each such
  <math|z> comes with an implicit error bound
  <math|\<varepsilon\><rsub|z>\<geqslant\>0>; this is a<nbsp>real number for
  which we can guarantee that <math|<around*|\||z-<wide|z|~>|\|>\<leqslant\>\<varepsilon\><rsub|z>>.
  We also define the relative error bound for <math|z> by
  <math|\<rho\><rsub|z>\<assign\>\<varepsilon\><rsub|z>/2<rsup|e>>. We
  finally denote by <math|\<epsilon\>\<assign\>2<rsup|1-p>\<leqslant\>1/8>
  the ``machine accuracy''.

  <\remark>
    Interval arithmetic<nbsp><cite|Moo66> (or ball
    arithmetic<nbsp><cite-detail|vdH:jncf|Chapter 3>) provides
    a<nbsp>systematic method for tracking error bounds by storing the bounds
    along with <math|z>. We will use similar formulas for the computation of
    <math|\<varepsilon\><rsub|z>> and <math|\<rho\><rsub|z>>, but we will not
    actually store the bounds during computations.
  </remark>

  <subsection|Basic arithmetic>

  In this section we give error bounds and complexity estimates for fixed
  point addition, subtraction and multiplication, under certain simplifying
  assumptions. In particular, in our DFTs, we only ever need to add and
  subtract numbers with the same exponent. We also give error bounds for
  fixed point convolution of vectors; the complexity of this important
  operation is considered later.

  For <math|x\<in\>\<bbb-R\>>, we define the ``round towards zero'' function
  <math|<around*|\<lfloor\>|x|\<rceil\>>> by
  <math|<around*|\<lfloor\>|x|\<rceil\>>\<assign\><around*|\<lfloor\>|x|\<rfloor\>>>
  if <math|x\<geqslant\>0> and <math|<around*|\<lfloor\>|x|\<rceil\>>\<assign\><around*|\<lceil\>|x|\<rceil\>>>
  if <math|x\<leqslant\>0>. For <math|x,y\<in\>\<bbb-R\>>, we define
  <math|<around*|\<lfloor\>|x+y*\<mathi\>|\<rceil\>>\<assign\><around*|\<lfloor\>|x|\<rceil\>>+<around*|\<lfloor\>|y|\<rceil\>>*\<mathi\>>.
  Notice that <math|<around*|\||<around*|\<lfloor\>|z|\<rceil\>>|\|>\<leqslant\><around*|\||z|\|>>
  and <math|<around*|\||<around*|\<lfloor\>|z|\<rceil\>>-z|\|>\<leqslant\><sqrt|2>>
  for any <math|z\<in\>\<bbb-C\>>.

  <\proposition>
    <label|add-prop>Let <math|z,u\<in\>\<bbb-C\><rsub|p>*2<rsup|e>>. Define
    the fixed point sum and difference <math|z\<dotplus\>u,z\<dotminus\>u\<in\>\<bbb-C\><rsub|p>*2<rsup|e+1>>
    by <math|m<rsub|z\<dotpm\>u>\<assign\><around*|\<lfloor\>|<around*|(|m<rsub|z>\<pm\>m<rsub|u>|)>/2|\<rceil\>>>.
    Then <math|z\<dotplus\>u> and <math|z\<dotminus\>u> can be computed in
    time<nbsp><math|O<around*|(|p|)>>, and

    <\eqnarray*>
      <tformat|<table|<row|<cell|\<rho\><rsub|z\<dotpm\>u>>|<cell|\<leqslant\>>|<cell|<frac|\<rho\><rsub|z>+\<rho\><rsub|u>|2>+\<epsilon\>.>>>>
    </eqnarray*>
  </proposition>

  <\proof>
    We have

    <\equation*>
      <frac|<around*|\||<around*|(|z\<dotpm\>u|)>-<around*|(|z\<pm\>u|)>|\|>|2<rsup|e+1>>=<around*|\||<around*|\<lfloor\>|<frac|m<rsub|z>\<pm\>m<rsub|u>|2>|\<rceil\>>-<frac|m<rsub|z>\<pm\>m<rsub|u>|2>|\|>*2<rsup|-p>\<leqslant\><sqrt|2>\<cdot\>2<rsup|-p>\<leqslant\>\<epsilon\>
    </equation*>

    and

    <\equation*>
      <frac|<around*|\||<around*|(|z\<pm\>u|)>-<around*|(|<wide|z|~>\<pm\><wide|u|~>|)>|\|>|2<rsup|e+1>>\<leqslant\><frac|\<varepsilon\><rsub|z>+\<varepsilon\><rsub|u>|2<rsup|e+1>>=<frac|\<rho\><rsub|z>+\<rho\><rsub|u>|2>,
    </equation*>

    whence <math|<around*|\||<around*|(|z\<dotpm\>u|)>-<around*|(|<wide|z|~>\<pm\><wide|u|~>|)>|\|>/2<rsup|e+1>\<leqslant\><around*|(|\<rho\><rsub|z>+\<rho\><rsub|u>|)>/2+\<epsilon\>>.
  </proof>

  <\proposition>
    <label|err-mul>Let <math|z\<in\>\<bbb-C\><rsub|p>*2<rsup|e<rsub|z>>> and
    <math|u\<in\>\<bbb-C\><rsub|p>*2<rsup|e<rsub|u>>>. Define the fixed point
    product <math|z\<dottimes\>u\<in\>\<bbb-C\><rsub|p>*2<rsup|e<rsub|z>+e<rsub|u>>>
    by <math|m<rsub|z\<dottimes\>u>\<assign\><around*|\<lfloor\>|2<rsup|-p>*m<rsub|z>*m<rsub|u>|\<rceil\>>>.
    Then <math|z\<dottimes\>u> can be computed in time
    <math|O<around*|(|<math-ss|I><around*|(|p|)>|)>>, and

    <\eqnarray*>
      <tformat|<table|<row|<cell|1+\<rho\><rsub|z\<dottimes\>u>>|<cell|\<leqslant\>>|<cell|<around*|(|1+\<rho\><rsub|z>|)>*<around*|(|1+\<rho\><rsub|u>|)>*<around*|(|1+\<epsilon\>|)>.>>>>
    </eqnarray*>
  </proposition>

  <\proof>
    We have

    <\equation*>
      <around*|\||z\<dottimes\>u-z*u|\|>/2<rsup|e<rsub|z>+e<rsub|u>>=<around*|\||<around*|\<lfloor\>|2<rsup|-p>*m<rsub|z>*m<rsub|u>|\<rceil\>>-2<rsup|-p>*m<rsub|z>*m<rsub|u>|\|>*2<rsup|-p>\<leqslant\><sqrt|2>\<cdot\>2<rsup|-p>\<leqslant\>\<epsilon\>
    </equation*>

    and

    <\eqnarray*>
      <tformat|<table|<row|<cell|<around*|\||z*u-<wide|z|~>*<wide|u|~>|\|>>|<cell|\<leqslant\>>|<cell|<around*|\||z|\|>*<around*|\||u-<wide|u|~>|\|>+<around*|\||z-<wide|z|~>|\|>*<around*|(|<around*|\||u|\|>+<around*|\||<wide|u|~>-u|\|>|)>>>|<row|<cell|>|<cell|\<leqslant\>>|<cell|2<rsup|e<rsub|z>>*\<varepsilon\><rsub|u>+2<rsup|e<rsub|u>>*\<varepsilon\><rsub|z>+\<varepsilon\><rsub|z>*\<varepsilon\><rsub|u>>>|<row|<cell|>|<cell|=>|<cell|<around*|(|\<rho\><rsub|u>+\<rho\><rsub|z>+\<rho\><rsub|z>*\<rho\><rsub|u>|)>*2<rsup|e<rsub|z>+e<rsub|u>>.>>>>
    </eqnarray*>

    Consequently, <math|<around*|\||z\<dottimes\>u-<wide|z|~>*<wide|u|~>|\|>/2<rsup|e<rsub|z>+e<rsub|u>>\<leqslant\>\<rho\><rsub|z>+\<rho\><rsub|u>+\<rho\><rsub|z>*\<rho\><rsub|u>+\<epsilon\>\<leqslant\><around*|(|1+\<rho\><rsub|z>|)>*<around*|(|1+\<rho\><rsub|u>|)>*<around*|(|1+\<epsilon\>|)>-1>.
  </proof>
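
  The mantissa operations of Propositions<nbsp><reference|add-prop> and<nbsp><reference|err-mul> can be sketched in Python as follows. Mantissas are represented as pairs of integers (real part, imaginary part); this representation and all function names are assumptions made for illustration.

```python
def trunc_div(x, q):
    """Round x / q towards zero, for an integer x and a positive integer q."""
    sign = 1 if x == abs(x) else -1
    return sign * (abs(x) // q)

def fixed_add(mz, mu):
    """Mantissa of the fixed point sum; the exponent grows from e to e + 1."""
    return (trunc_div(mz[0] + mu[0], 2), trunc_div(mz[1] + mu[1], 2))

def fixed_sub(mz, mu):
    """Mantissa of the fixed point difference, also with exponent e + 1."""
    return (trunc_div(mz[0] - mu[0], 2), trunc_div(mz[1] - mu[1], 2))

def fixed_mul(mz, mu, p):
    """Mantissa of the fixed point product: round 2^-p * mz * mu to zero."""
    (a, b), (c, d) = mz, mu
    re = a * c - b * d                 # real part of the exact product
    im = a * d + b * c                 # imaginary part of the exact product
    return (trunc_div(re, 2 ** p), trunc_div(im, 2 ** p))
```

  Rounding towards zero never increases the absolute value of either component, so the results remain valid mantissas for the stated exponents.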

  Proposition<nbsp><reference|err-mul> may be generalised to numerical cyclic
  convolution of vectors as follows.

  <\proposition>
    <label|err-conv>Let <math|k\<geqslant\>1> and
    <math|n\<assign\>2<rsup|k>>. Let <math|z\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e<rsub|z>>|)><rsup|n>>
    and <math|u\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e<rsub|u>>|)><rsup|n>>.
    Define the fixed point convolution <math|z\<dotast\>u\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e<rsub|z>+e<rsub|u>+k>|)><rsup|n>>
    by

    <\equation*>
      m<rsub|<around*|(|z\<dotast\>u|)><rsub|i>>\<assign\><around*|\<lfloor\>|2<rsup|-p-k>*<big|sum><rsub|i<rsub|1>+i<rsub|2>=i
      <pmod|n>>m<rsub|z<rsub|i<rsub|1>>>*m<rsub|u<rsub|i<rsub|2>>>|\<rceil\>>,<space|2em>0\<leqslant\>i\<less\>n.
    </equation*>

    Then

    <\eqnarray*>
      <tformat|<table|<row|<cell|max<rsub|i>
      <around*|(|1+\<rho\><rsub|<around*|(|z\<dotast\>u|)><rsub|i>>|)>>|<cell|\<leqslant\>>|<cell|max<rsub|i>
      <around*|(|1+\<rho\><rsub|z<rsub|i>>|)>*max<rsub|i>
      <around*|(|1+\<rho\><rsub|u<rsub|i>>|)>*<around*|(|1+\<epsilon\>|)>.>>>>
    </eqnarray*>
  </proposition>

  <\proof>
    Let <math|\<ast\>> denote the exact convolution, and write
    <math|\<rho\><rsub|z>\<assign\>max<rsub|j> \<rho\><rsub|z<rsub|j>>> and
    <math|\<rho\><rsub|u>\<assign\>max<rsub|j> \<rho\><rsub|u<rsub|j>>>. As
    in the proof of Proposition<nbsp><reference|err-mul>, we obtain
    <math|<around*|\||<around*|(|z\<dotast\>u|)><rsub|i>-<around*|(|z\<ast\>u|)><rsub|i>|\|>/2<rsup|e<rsub|z>+e<rsub|u>+k>\<leqslant\><sqrt|2>\<cdot\>2<rsup|-p>\<leqslant\>\<epsilon\>>
    and

    <\eqnarray*>
      <tformat|<table|<row|<cell|<around*|\||<around*|(|z\<ast\>u|)><rsub|i>-<around*|(|<wide|z|~>\<ast\><wide|u|~>|)><rsub|i>|\|>>|<cell|\<leqslant\>>|<cell|<big|sum><rsub|i<rsub|1>+i<rsub|2>=i
      <pmod|n>><around*|\||z<rsub|i<rsub|1>>*u<rsub|i<rsub|2>>-<wide|z|~><rsub|i<rsub|1>>*<wide|u|~><rsub|i<rsub|2>>|\|>>>|<row|<cell|>|<cell|\<leqslant\>>|<cell|<around*|(|\<rho\><rsub|z>+\<rho\><rsub|u>+\<rho\><rsub|z>*\<rho\><rsub|u>|)>*2<rsup|e<rsub|z>+e<rsub|u>+k>.>>>>
    </eqnarray*>

    The proof is concluded in the same way as
    Proposition<nbsp><reference|err-mul>.
  </proof>

  <subsection|Precomputing roots of unity>

  Let <math|\<bbb-H\>\<assign\><around*|{|x+y*\<mathi\>\<in\>\<bbb-C\>:y\<geqslant\>0|}>>
  and <math|\<bbb-H\><rsub|p>\<assign\><around*|{|x+y*\<mathi\>\<in\>\<bbb-C\><rsub|p>:y\<geqslant\>0|}>>.
  Let <math|<sqrt|<math-ordinary|<space|1spc>>>:\<bbb-H\>\<rightarrow\>\<bbb-H\>>
  be the branch of the square root function such that
  <math|<sqrt|\<mathe\><rsup|\<mathi\>*\<theta\>>>\<assign\>\<mathe\><rsup|\<mathi\>*\<theta\>/2>>
  for <math|0\<leqslant\>\<theta\>\<leqslant\>\<pi\>>. Using Newton's
  method<nbsp><cite-detail|BZ10|Section 3.5> and Schönhage--Strassen
  multiplication<nbsp><cite|SS71>, we may construct a fixed point square root
  function <math|<sqrt|<math-ordinary|<space|1spc>>|<resize|<with|math-level|0|\<cdummy\>>||0.25ex||>>:\<bbb-H\><rsub|p>\<rightarrow\>\<bbb-H\><rsub|p>>,
  which may be evaluated in time <math|O<around*|(|p*log p*log log p|)>>,
  such that <math|<around*|\||<sqrt|z|<resize|<with|math-level|0|\<cdummy\>>||0.25ex||>>-<sqrt|z>|\|>\<leqslant\>\<epsilon\>>
  for all <math|z\<in\>\<bbb-H\><rsub|p>\<nocomma\>>. For example, we may
  first compute some <math|u\<in\>\<bbb-H\>> such that
  <math|<around*|\||u-<sqrt|z>|\|>\<leqslant\>\<epsilon\>/4> and
  <math|<around*|\||u|\|>\<leqslant\>1>, and then take
  <math|<sqrt|z|<resize|<with|math-level|0|\<cdummy\>>||0.25ex||>>\<assign\><around*|\<lfloor\>|2<rsup|p>*u|\<rceil\>>*2<rsup|-p>>;
  the desired bound follows since <math|\<epsilon\>/4+<sqrt|2>\<cdot\>2<rsup|-p>\<leqslant\>\<epsilon\>>.

  <\lemma>
    <label|sqrt-prop>Let <math|z\<in\>\<bbb-H\><rsub|p>>, and assume that
    <math|<around*|\||<wide|z|~>|\|>=1> and
    <math|\<rho\><rsub|z>\<leqslant\>3/8>. Then
    <math|\<rho\><rsub|<sqrt|z|<resize|<with|math-level|0|\<cdummy\>>||0.25ex||>>>\<leqslant\>\<rho\><rsub|z>+\<epsilon\>>.
  </lemma>

  <\proof>
    The mean value theorem implies that <math|<around*|\||<sqrt|<wide|z|~>>-<sqrt|z>|\|>\<leqslant\>\<varepsilon\><rsub|z>*max<rsub|w\<in\>D><around*|\||1/<around*|(|2*<sqrt|w>|)>|\|>>
    where <math|D\<assign\><around*|{|w\<in\>\<bbb-H\>:<around*|\||w-z|\|>\<leqslant\>\<varepsilon\><rsub|z>|}>>.
    For <math|w\<in\>D> we have <math|<around*|\||w|\|>\<geqslant\><around*|\||<wide|z|~>|\|>-<around*|\||<wide|z|~>-z|\|>-<around*|\||z-w|\|>\<geqslant\>1-3/8-3/8\<geqslant\>1/4>;
    hence <math|<around*|\||<sqrt|<wide|z|~>>-<sqrt|z>|\|>\<leqslant\>\<varepsilon\><rsub|z>=\<rho\><rsub|z>>.
    By construction <math|<around*|\||<sqrt|z|<resize|<with|math-level|0|\<cdummy\>>||0.25ex||>>-<sqrt|z>|\|>\<leqslant\>\<epsilon\>>.
    We conclude that <math|<around*|\||<sqrt|z|<resize|<with|math-level|0|\<cdummy\>>||0.25ex||>>-<sqrt|<wide|z|~>>|\|>\<leqslant\>\<rho\><rsub|z>+\<epsilon\>>.
  </proof>

  <\proposition>
    <label|roots-prop>Let <math|k\<in\>\<bbb-N\>> and <math|p\<geqslant\>k>,
    and let <math|\<omega\>\<assign\>\<mathe\><rsup|2*\<pi\>*\<mathi\>/2<rsup|k>>>.
    We may compute <math|1,\<omega\>,\<omega\><rsup|2>,\<ldots\>,\<omega\><rsup|2<rsup|k>-1>\<in\>\<bbb-C\><rsub|p>>,
    with <math|\<rho\><rsub|\<omega\><rsup|i>>\<leqslant\>\<epsilon\>> for
    all <math|i>, in time <math|O<around*|(|2<rsup|k>*p*log p*log log p|)>>.
  </proposition>

  <\proof>
    It suffices to compute <math|1,\<omega\>,\<ldots\>,\<omega\><rsup|2<rsup|k-1>-1>\<in\>\<bbb-H\><rsub|p>>.
    Starting from <math|\<omega\><rsup|0>=1> and
    <math|\<omega\><rsup|2<rsup|k-2>>=\<mathi\>>, for each
    <math|\<ell\>=k-3,k-4,\<ldots\>,0>, we compute
    <math|\<omega\><rsup|i*2<rsup|\<ell\>>>> for
    <math|i=1,3,\<ldots\>,2<rsup|k-\<ell\>-1>-1> using
    <math|\<omega\><rsup|i*2<rsup|\<ell\>>>\<assign\><sqrt|\<omega\><rsup|i*2<rsup|\<ell\>+1>>|<resize|<with|math-level|0|\<cdummy\>>||0.25ex||>>>
    if <math|i\<less\>2<rsup|k-\<ell\>-2>> and
    <math|\<omega\><rsup|i*2<rsup|\<ell\>>>\<assign\>\<mathi\>*\<omega\><rsup|i*2<rsup|\<ell\>>-2<rsup|k-2>>>
    otherwise. Performing all computations with temporarily increased
    precision <math|p<rprime|'>\<assign\>p+lg p+2> and corresponding
    <math|\<epsilon\><rprime|'>\<assign\>2<rsup|1-p<rprime|'>>>,
    Lemma<nbsp><reference|sqrt-prop> yields
    <math|\<rho\><rsub|\<omega\><rsup|i>>\<leqslant\>k*\<epsilon\><rprime|'>\<leqslant\>\<epsilon\>/4>.
    This also shows that the hypothesis <math|\<rho\><rsub|\<omega\><rsup|i>>\<leqslant\>3/8>
    is always satisfied, since <math|\<epsilon\>/4\<leqslant\>1/32\<leqslant\>3/8>.
    After rounding to <math|p> bits, the relative error is at most
    <math|\<epsilon\>/4+<sqrt|2>\<cdot\>2<rsup|-p>\<leqslant\>\<epsilon\>>.
  </proof>

  <subsection|Error analysis for fast Fourier transforms><label|err-FFT>

  A <em|tight> algorithm for computing DFTs of length
  <math|n=2<rsup|k>\<geqslant\>2> is a numerical algorithm that takes as
  input an <math|n><nbhyph>tuple <math|<rigid|a>\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e>|)><rsup|n>>
  and computes an approximation <math|<rigid|<wide|a|^>>\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e+k>|)><rsup|n>>
  to the DFT of <math|a> with respect to <math|\<omega\>=\<mathe\><rsup|2*\<mathpi\>*\<mathi\>/n>>
  (or <math|\<omega\>=\<mathe\><rsup|-2*\<mathpi\>*\<mathi\>/n>> in the case
  of an inverse transform), such that

  <\eqnarray*>
    <tformat|<table|<row|<cell|max<rsub|i>
    <around*|(|1+\<rho\><rsub|<wide|a|^><rsub|i>>|)>>|<cell|\<leqslant\>>|<cell|max<rsub|i>
    <around*|(|1+\<rho\><rsub|a<rsub|i>>|)>*<around*|(|1+\<epsilon\>|)><rsup|3*k-2>.>>>>
  </eqnarray*>

  We assume for the moment that any such algorithm has at its disposal all
  necessary root tables with relative error not exceeding <math|\<epsilon\>>.
  Propositions<nbsp><reference|add-prop> and<nbsp><reference|err-mul>
  directly imply the following:

  <\proposition>
    The butterfly algorithm <math|\<cal-B\>> that computes a DFT of length
    <math|2> using the formula <math|<around*|(|a<rsub|0>,a<rsub|1>|)>\<mapsto\><around*|(|a<rsub|0>\<dotplus\>a<rsub|1>,a<rsub|0>\<dotminus\>a<rsub|1>|)>>
    is tight.
  </proposition>

  <\proof>
    We have <math|\<rho\><rsub|<wide|a|^><rsub|i>>\<leqslant\><around*|(|\<rho\><rsub|a<rsub|0>>+\<rho\><rsub|a<rsub|1>>|)>/2+\<epsilon\>\<leqslant\>max<rsub|i>
    \<rho\><rsub|a<rsub|i>>+\<epsilon\>\<leqslant\><around*|(|1+max<rsub|i>
    \<rho\><rsub|a<rsub|i>>|)>*<around*|(|1+\<epsilon\>|)>-1>.
  </proof>

  <\proposition>
    <label|comp-FFT>Let <math|k<rsub|1>,k<rsub|2>\<geqslant\>1>, and let
    <math|\<cal-A\><rsub|1>> and <math|\<cal-A\><rsub|2>> be tight algorithms
    for computing DFTs of lengths <math|2<rsup|k<rsub|1>>> and
    <math|2<rsup|k<rsub|2>>>. Then <math|\<cal-A\><rsub|1>\<odot\>\<cal-A\><rsub|2>>
    is a tight algorithm for computing DFTs of length
    <math|2<rsup|k<rsub|1>+k<rsub|2>>>.
  </proposition>

  <\proof>
    The inner and outer DFTs contribute factors of
    <math|<around*|(|1+\<epsilon\>|)><rsup|3*k<rsub|1>-2>> and
    <math|<around*|(|1+\<epsilon\>|)><rsup|3*k<rsub|2>-2>>, and by
    Proposition<nbsp><reference|err-mul> the twiddle factor multiplications
    contribute a factor of <math|<around*|(|1+\<epsilon\>|)><rsup|2>>. Thus

    <\equation*>
      max<rsub|i> <around*|(|1+\<rho\><rsub|<wide|a|^><rsub|i>>|)>\<leqslant\>max<rsub|i>
      <around*|(|1+\<rho\><rsub|a<rsub|i>>|)>*<around*|(|1+\<epsilon\>|)><rsup|<around*|(|3*k<rsub|1>-2|)>+2+<around*|(|3*k<rsub|2>-2|)>>\<leqslant\>max<rsub|i>
      <around*|(|1+\<rho\><rsub|a<rsub|i>>|)>*<around*|(|1+\<epsilon\>|)><rsup|3*<around*|(|k<rsub|1>+k<rsub|2>|)>-2>.
    </equation*>
  </proof>

  <\corollary>
    <label|CT-cor>Let <math|k\<geqslant\>1>. Then
    <math|\<cal-B\><rsup|\<odot\>k>> is a tight algorithm for computing DFTs
    of length <math|2<rsup|k>> over<nbsp><math|\<bbb-C\><rsub|p>>, whose
    complexity is bounded by <math|O<around*|(|2<rsup|k>*k*<math-ss|I><around*|(|p|)>|)>>.
  </corollary>

  <section|A simple and fast multiplication algorithm><label|simple-algo-sec>

  In this section we give the simplest version of the new integer
  multiplication algorithm. The key innovation is an alternative method for
  computing DFTs of small length. This new method uses a<nbsp>combination of
  Bluestein's chirp transform and Kronecker substitution (see
  sections<nbsp><reference|Bluestein-sec> and<nbsp><reference|Kronecker-sec>)
  to convert the DFT to a cyclic integer product in
  <math|<around*|(|\<bbb-Z\>/<around*|(|2<rsup|n<rprime|'>>-1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>
  for suitable <math|n<rprime|'>>.

  <\proposition>
    <label|BK-prop>Let <math|1\<leqslant\>r\<leqslant\>p>. There exists a
    tight algorithm <math|\<cal-C\><rsub|r>> for computing DFTs of
    length<nbsp><math|2<rsup|r>> over <math|\<bbb-C\><rsub|p>>, whose
    complexity is bounded by <math|O<around*|(|<math-ss|I><around*|(|2<rsup|r>*p|)>+2<rsup|r>*<math-ss|I><around*|(|p|)>|)>>.
  </proposition>

  <\proof>
    Let <math|n\<assign\>2<rsup|r>>, and suppose that we wish to compute the
    DFT of <math|a\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e>|)><rsup|n>>.
    Using Bluestein's chirp transform (notation as in
    section<nbsp><reference|Bluestein-sec>), this reduces to computing a
    cyclic convolution of suitable <math|F\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e>|)><around*|[|X|]>/<around*|(|X<rsup|n>-1|)>>
    and <math|G\<in\>\<bbb-C\><rsub|p><around*|[|X|]>/<around*|(|X<rsup|n>-1|)>>.
    We assume that the <math|f<rsub|i>> and<nbsp><math|g<rsub|i>> have been
    precomputed with <math|\<rho\><rsub|f<rsub|i>>,\<rho\><rsub|g<rsub|i>>\<leqslant\>\<epsilon\>>.

    We may regard <math|F<rprime|'>\<assign\>2<rsup|p-e>*F> and
    <math|G<rprime|'>\<assign\>2<rsup|p>*G> as cyclic polynomials with
    complex <with|font-shape|italic|integer> coefficients, i.e., as elements
    of <math|\<bbb-Z\><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|n>-1|)>>.
    Write <math|F<rprime|'>=<big|sum><rsup|n-1><rsub|i=0>F<rprime|'><rsub|i>*X<rsup|i>>
    and <math|G<rprime|'>=<big|sum><rsup|n-1><rsub|i=0>G<rprime|'><rsub|i>*X<rsup|i>>,
    where <math|F<rprime|'><rsub|i>,G<rsub|i><rprime|'>\<in\>\<bbb-Z\><around*|[|\<mathi\>|]>>
    with <math|<around*|\||F<rprime|'><rsub|i>|\|>\<leqslant\>2<rsup|p>> and
    <math|<around*|\||G<rprime|'><rsub|i>|\|>\<leqslant\>2<rsup|p>>. Now we
    compute the <with|font-shape|italic|exact> product
    <math|H<rprime|'>\<assign\>F<rprime|'>*G<rprime|'>\<in\>\<bbb-Z\><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|n>-1|)>>
    using Kronecker substitution. More precisely, we have
    <math|<around*|\||H<rprime|'><rsub|i>|\|>\<leqslant\>2<rsup|2*p+r>\<nocomma\>>,
    so it suffices to compute the cyclic integer product
    <math|H<rprime|'><around*|(|2<rsup|b>|)>=F<rprime|'><around*|(|2<rsup|b>|)>*G<rprime|'><around*|(|2<rsup|b>|)>\<in\><around*|(|\<bbb-Z\>/<around*|(|2<rsup|n*b>-1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>,
    where <math|b\<assign\>2*p+r+2=O<around*|(|p|)>>. Then
    <math|H\<assign\>H<rprime|'>*2<rsup|e-2*p>> is the exact convolution of
    <math|F> and <math|G>, and rounding <math|H> to precision <math|p> yields
    <math|F\<dotast\>G\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|e+r>|)><around*|[|X|]>/<around*|(|X<rsup|n>-1|)>>
    in the sense of Proposition<nbsp><reference|err-conv>. A final
    multiplication by<nbsp><math|f<rsub|i>> yields the Fourier coefficients
    <math|<wide|a|^><rsub|i>\<in\>\<bbb-C\><rsub|p>*2<rsup|e+r>>.

    To establish tightness, observe that <math|1+\<rho\><rsub|F<rsub|i>>\<leqslant\><around*|(|1+\<rho\><rsub|a<rsub|i>>|)>*<around*|(|1+\<epsilon\>|)><rsup|2>>
    and <math|\<rho\><rsub|G<rsub|i>>\<leqslant\>\<epsilon\>>, so
    Proposition<nbsp><reference|err-conv> yields
    <math|1+\<rho\><rsub|<around*|(|F\<dotast\>G|)><rsub|i>>\<leqslant\><around*|(|1+\<rho\><rsub|a>|)>*<around*|(|1+\<epsilon\>|)><rsup|4>>
    where <math|\<rho\><rsub|a>\<assign\>max<rsub|i>
    \<rho\><rsub|a<rsub|i>>>; we conclude that
    <math|1+\<rho\><rsub|<wide|a|^><rsub|i>>\<leqslant\><around*|(|1+\<rho\><rsub|a>|)>*<around*|(|1+\<epsilon\>|)><rsup|6>>.
    For <math|r\<geqslant\>3>, this means that the algorithm is tight; for
    <math|r\<leqslant\>2>, we may take<nbsp><math|\<cal-C\><rsub|r>\<assign\>\<cal-B\><rsup|\<odot\>r>>.

    For the complexity, observe that the product in
    <math|<around*|(|\<bbb-Z\>/<around*|(|2<rsup|n*b>-1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>
    reduces to three integer products of size <math|O<around*|(|n*p|)>>.
    These have cost <math|O<around*|(|<math-ss|I><around*|(|n*p|)>|)>>, and
    the algorithm also performs <math|O<around*|(|n|)>> multiplications in
    <math|\<bbb-C\><rsub|p>>, contributing the
    <math|O<around*|(|n*<math-ss|I><around*|(|p|)>|)>> term.
  </proof>
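  The Kronecker step of the proof can be illustrated by a minimal sketch, under simplifying assumptions: the coefficients are non-negative ordinary integers rather than elements of the Gaussian integers, and the packing width is chosen with one spare bit instead of the exact value used in the proof. The function names are hypothetical. The second helper shows one standard way in which a product of Gaussian integers reduces to three integer products, as used in the complexity estimate.

```python
def cyclic_convolution_naive(f, g):
    """Reference cyclic convolution straight from the definition."""
    n = len(f)
    return [sum(f[i] * g[(k - i) % n] for i in range(n)) for k in range(n)]


def cyclic_convolution_kronecker(f, g):
    """Cyclic convolution of two equal-length lists of non-negative integers,
    computed with a single integer product modulo 2**(n*b) - 1."""
    n = len(f)
    assert len(g) == n and n >= 1
    # Every output coefficient is at most n * max(f) * max(g), which is
    # strictly below 2**(b - 1) for this choice of b (one spare bit of slack).
    b = (n * max(f) * max(g)).bit_length() + 1
    modulus = (1 << (n * b)) - 1
    # Kronecker substitution: evaluate both polynomials at X = 2**b.
    F = sum(c << (i * b) for i, c in enumerate(f))
    G = sum(c << (i * b) for i, c in enumerate(g))
    # Since 2**(n*b) is 1 modulo the modulus, reducing F*G wraps X**n to 1.
    H = (F * G) % modulus
    mask = (1 << b) - 1
    return [(H >> (i * b)) & mask for i in range(n)]


def gaussian_product_three_mults(a, b, c, d, modulus):
    """(a + b*i) * (c + d*i) modulo `modulus`, using three integer products."""
    ac = a * c
    bd = b * d
    cross = (a + b) * (c + d)  # equals ac + ad + bc + bd
    return (ac - bd) % modulus, (cross - ac - bd) % modulus
```

  The spare bit guarantees that the packed result is strictly smaller than the modulus, so the coefficients can be read off by shifting and masking.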

  <\remark>
    <label|faster-rem>A crucial observation is that, for suitable parameters,
    the DFT algorithm in Proposition<nbsp><reference|BK-prop> is actually
    faster than the conventional Cooley--Tukey algorithm of
    Corollary<nbsp><reference|CT-cor>. For example, if we assume that
    <math|<math-ss|I><around|(|m|)>=m*<around*|(|log
    m|)><rsup|1+o<around*|(|1|)>>>, then to compute a transform of length
    <math|n> over<nbsp><math|\<bbb-C\><rsub|p>> with <math|n\<sim\>p>, the
    Cooley--Tukey approach has complexity <math|n<rsup|2>*<around*|(|log
    n|)><rsup|2+o<around*|(|1|)>>\<nocomma\>>, whereas
    Proposition<nbsp><reference|BK-prop> yields
    <math|n<rsup|2>*<around*|(|log n|)><rsup|1+o<around*|(|1|)>>>, an
    improvement by a factor of roughly <math|log n>.
  </remark>

  <\theorem>
    <label|thm:simple>For <math|n\<rightarrow\>\<infty\>>, we have

    <\eqnarray*>
      <tformat|<table|<row|<cell|<frac|<math-ss|I><around|(|n|)>|n*lg
      n>>|<cell|=>|<cell|O<around*|(|<frac|<math-ss|I><around|(|lg<rsup|2>
      n|)>|lg<rsup|2> n*lg lg n>+<frac|<math-ss|I><around|(|lg n|)>|lg n*lg
      lg n>+1|)>.<eq-number><label|eq:simple>>>>>
    </eqnarray*>
  </theorem>

  <\proof>
    We first reduce our integer product to a polynomial product using
    Kronecker segmentation (section<nbsp><reference|Kronecker-sec>).
    Splitting the two <math|n>-bit inputs into chunks of <math|b\<assign\>lg
    n> bits, we need to compute a<nbsp>product of polynomials
    <math|u,v\<in\>\<bbb-Z\><around|[|X|]>> with non-negative <math|b>-bit
    coefficients and degrees less than <math|m\<assign\><around|\<lceil\>|n/b|\<rceil\>>=O<around|(|n/lg
    n|)>>. The coefficients of <math|h\<assign\>u*v> have <math|O<around|(|lg
    n|)>> bits, and we may deduce the desired integer product
    <math|h<around|(|2<rsup|b>|)>> in time <math|O<around|(|n|)>>.

    Let <math|k\<assign\>lg <around*|(|2*m|)>>. To compute <math|u*v>, we
    will use DFTs of length <math|2<rsup|k>=O<around*|(|n/lg n|)>> over
    <math|\<bbb-C\><rsub|p>>, where <math|p\<assign\>2*b+2*k+lg
    k+8=O<around*|(|lg n|)>>. Zero-padding <math|u> to obtain a sequence
    <math|<around*|(|u<rsub|0>,\<ldots\>,u<rsub|2<rsup|k>-1>|)>\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|b>|)><rsup|2<rsup|k>>>,
    and similarly for <math|v>, we compute the transforms
    <math|<wide|u|^>,<wide|v|^>\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|b+k>|)><rsup|2<rsup|k>>>
    with respect to <math|\<omega\>\<assign\>\<mathe\><rsup|2*\<mathpi\>*\<mathi\>/2<rsup|k>>>
    as follows.

    Let <math|r\<assign\>lg lg n> and <math|d\<assign\><around|\<lceil\>|k/r|\<rceil\>>=O<around|(|lg
    n/lg lg n|)>>. Write <math|k=r<rsub|1>+\<cdots\>+r<rsub|d>> with
    <math|r<rsub|i>\<assign\>r> for <math|i\<leqslant\>d-1> and
    <math|r<rsub|d>\<assign\>k-<around*|(|d-1|)>*r\<leqslant\>r>. We use the
    algorithm <math|\<cal-A\>\<assign\>\<cal-A\><rsub|1>\<odot\>\<cdots\>\<odot\>\<cal-A\><rsub|d>>
    (see section<nbsp><reference|FFT-sec>), where for
    <math|1\<leqslant\>i\<leqslant\>d-1> we take <math|\<cal-A\><rsub|i>> to
    be the tight algorithm <math|\<cal-C\><rsub|r>> for DFTs of length
    <math|2<rsup|r>\<asymp\>lg n> given by
    Proposition<nbsp><reference|BK-prop>, and where <math|\<cal-A\><rsub|d>>
    is <math|\<cal-B\><rsup|\<odot\>r<rsub|d>>> as in
    Corollary<nbsp><reference|CT-cor>. In other words, we split the <math|k>
    usual radix-2 layers of the FFT into groups of <math|r> layers, handling
    the transforms in each group with the Bluestein--Kronecker reduction, and
    then using ordinary Cooley--Tukey for the remaining <math|r<rsub|d>>
    layers.

    We next compute the pointwise products
    <math|<wide|h|^><rsub|i>\<assign\><wide|u|^><rsub|i>*<wide|v|^><rsub|i>\<in\>\<bbb-C\><rsub|p>*2<rsup|2*b+2*k>>,
    and then apply an inverse transform <math|\<cal-A\><rprime|'>> defined
    analogously to <math|\<cal-A\>>. A final division by <math|2<rsup|k>>
    (which is really just an implicit adjustment of exponents) yields
    approximations <math|h<rsub|i>\<in\>\<bbb-C\><rsub|p>*2<rsup|2*b+2*k>>.

    Since <math|\<cal-A\>> and <math|\<cal-A\><rprime|'>> are tight by
    Propositions<nbsp><reference|comp-FFT>,<nbsp><reference|BK-prop> and
    Corollary<nbsp><reference|CT-cor>, we have
    <math|1+\<rho\><rsub|<wide|u|^><rsub|i>>\<leqslant\><around*|(|1+\<epsilon\>|)><rsup|3*k-2>>,
    and similarly for <math|<wide|v|^>>. Thus
    <math|1+\<rho\><rsub|<wide|h|^><rsub|i>>\<leqslant\><around*|(|1+\<epsilon\>|)><rsup|6*k-3>>,
    so <math|1+\<rho\><rsub|h<rsub|i>>\<leqslant\><around*|(|1+\<epsilon\>|)><rsup|9*k-5>\<leqslant\>exp<around*|(|9*k*\<epsilon\>|)>\<leqslant\>exp<around*|(|2<rsup|5+lg
    k-p>|)>\<leqslant\>1+2<rsup|6+lg k-p>> after the inverse transform (since
    <math|exp x\<leqslant\>1+2*x> for <math|0\<leqslant\>x\<leqslant\>1>). In particular,
    <math|\<varepsilon\><rsub|h<rsub|i>>=2<rsup|2*b+2*k>*\<rho\><rsub|h<rsub|i>>\<leqslant\>2<rsup|2*b+2*k+lg
    k-p+6>\<leqslant\>1/4>, so we obtain the exact value of <math|h<rsub|i>>
    by rounding to the nearest integer.

    Now we analyse the complexity. Using Proposition<nbsp><reference|roots-prop>,
    we first compute a table of roots <math|1,\<omega\>,\<ldots\>,\<omega\><rsup|2<rsup|k>-1>>
    in time <math|O<around*|(|2<rsup|k>*p*log p*log log p|)>=O<around*|(|n*lg
    n|)>>, and then extract the required twiddle factor tables in time
    <math|O<around*|(|2<rsup|k>*k*<around*|(|p+k|)>|)>=O<around*|(|n*lg n|)>>
    (see section<nbsp><reference|FFT-sec>). For the Bluestein reductions, we
    may extract a table of <math|2<rsup|r+1>>-th roots in time
    <math|O<around*|(|2<rsup|k>*p|)>=O<around*|(|n|)>>, and then rearrange
    them as required in time <math|O<around*|(|2<rsup|r>*r*<around*|(|p+r|)>|)>=O<around*|(|lg<rsup|2>
    n*lg lg n|)>> (see section<nbsp><reference|Bluestein-sec>). These
    precomputations are then all repeated for the inverse transforms.

    By Corollary<nbsp><reference|CT-cor>,
    Proposition<nbsp><reference|BK-prop> and<nbsp><eqref|fft-rec-bound2>,
    each invocation of <math|\<cal-A\>> (or <math|\<cal-A\><rprime|'>>) has
    cost

    <\eqnarray*>
      <tformat|<table|<row|<cell|>|<cell|>|<cell|O<around*|(|<around*|(|d-1|)>*2<rsup|k-r>*<around*|(|<math-ss|I><around*|(|2<rsup|r>*p|)>+2<rsup|r>*<math-ss|I><around*|(|p|)>|)>+2<rsup|k-r<rsub|d>>*2<rsup|r<rsub|d>>*r<rsub|d>*<math-ss|I><around*|(|p|)>+<around*|(|d-1|)>*2<rsup|k>*<math-ss|I><around*|(|p|)>+p*2<rsup|k>*k|)>>>|<row|<cell|>|<cell|=>|<cell|O<around*|(|<around*|(|d-1|)>*2<rsup|k-r>*<math-ss|I><around*|(|2<rsup|r>*p|)>+<around*|(|d+r<rsub|d>|)>*2<rsup|k>*<math-ss|I><around*|(|p|)>+p*2<rsup|k>*k|)>>>|<row|<cell|>|<cell|=>|<cell|O<around*|(|<frac|n|lg
      n*lg lg n>*<math-ss|I><around|(|lg<rsup|2> n|)>+<frac|n|lg lg
      n>*<math-ss|I><around|(|lg n|)>+n*lg n|)>.>>>>
    </eqnarray*>

    The cost of the <math|O<around*|(|2<rsup|k>|)>> pointwise multiplications
    is subsumed within this bound.
  </proof>
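  To make the parameter choices in the proof concrete, the following sketch computes them for a given bit size. It assumes that lg x denotes the ceiling of the base-2 logarithm and that the input size is at least 3; the function and key names are illustrative only.

```python
def ceil_log2(x):
    """lg x, taken here to mean ceil(log2(x)) for an integer x >= 1."""
    return (x - 1).bit_length()


def simple_algorithm_parameters(n):
    """Parameters of the simple algorithm for n-bit inputs (n >= 3)."""
    b = ceil_log2(n)                       # chunk size in bits
    m = -(-n // b)                         # number of chunks, ceil(n / b)
    k = ceil_log2(2 * m)                   # the transform length is 2**k
    p = 2 * b + 2 * k + ceil_log2(k) + 8   # working precision in bits
    r = ceil_log2(b)                       # lg lg n, layers per short transform
    d = -(-k // r)                         # number of layer groups, ceil(k / r)
    # d - 1 groups of r layers, then a final group with the remaining layers.
    layers = [r] * (d - 1) + [k - (d - 1) * r]
    return {"b": b, "m": m, "k": k, "p": p, "r": r, "d": d, "layers": layers}
```

  For inputs of 2^20 bits this gives chunks of 20 bits, transforms of length 2^17 at precision 87, and layer groups of sizes 5, 5, 5 and 2. The exponent in the final error bound, 2b+2k+lg k-p+6, is then exactly -2, matching the bound 1/4 used for rounding.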

  It is now a straightforward matter to recover Fürer's bound.

  <\theorem>
    <label|simple-th>For some constant <math|K\<gtr\>1>, we have

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|I><around|(|n|)>>|<cell|=>|<cell|O<around*|(|n*lg
      n*K<rsup|log<rsup|\<ast\>> n>|)>.>>>>
    </eqnarray*>
  </theorem>

  <\proof>
    Let <math|T<around*|(|n|)>\<assign\><math-ss|I><around*|(|n|)>/<around*|(|n*lg
    n|)>> for <math|n\<geqslant\>2>. By Theorem<nbsp><reference|thm:simple>,
    there exist <math|x<rsub|0>\<geqslant\>2> and <math|C\<gtr\>1> such that

    <\eqnarray*>
      <tformat|<table|<row|<cell|T<around*|(|n|)>>|<cell|\<leqslant\>>|<cell|C*<around*|(|T<around*|(|lg<rsup|2>
      n|)>+T<around*|(|lg n|)>+1|)>>>>>
    </eqnarray*>

    for all <math|n\<gtr\>x<rsub|0>>. Let
    <math|\<Phi\><around*|(|x|)>\<assign\>4*log<rsup|2> x> for
    <math|x\<in\>\<bbb-R\>>, <math|x\<gtr\>1>. Increasing <math|x<rsub|0>> if
    necessary, we may assume that <math|\<Phi\><around*|(|x|)>\<leqslant\>x-1>
    for <math|x\<gtr\>x<rsub|0>>, so that the function
    <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>\<assign\>min<around*|{|j\<in\>\<bbb-N\>:\<Phi\><rsup|\<circ\>j><around*|(|x|)>\<leqslant\>x<rsub|0>|}>>
    is well-defined. Increasing <math|C> if necessary, we may also assume
    that <math|T<around*|(|n|)>\<leqslant\>3*C> for all
    <math|n\<leqslant\>x<rsub|0>>.

    We prove by induction on <math|\<Phi\><rsup|\<ast\>><around*|(|n|)>>
    that <math|T<around*|(|n|)>\<leqslant\><around*|(|3*C|)><rsup|\<Phi\><rsup|\<ast\>><around*|(|n|)>+1>>
    for all <math|n>. If <math|\<Phi\><rsup|\<ast\>><around*|(|n|)>=0>, then
    <math|n\<leqslant\>x<rsub|0>>, so the bound holds. Now suppose that
    <math|\<Phi\><rsup|\<ast\>><around*|(|n|)>\<geqslant\>1>. Since
    <math|lg<rsup|2> n\<leqslant\>\<Phi\><around*|(|n|)>>, we have
    <math|\<Phi\><rsup|\<ast\>><around*|(|lg
    n|)>\<leqslant\>\<Phi\><rsup|\<ast\>><around*|(|lg<rsup|2>
    n|)>\<leqslant\>\<Phi\><rsup|\<ast\>><around*|(|\<Phi\><around*|(|n|)>|)>=\<Phi\><rsup|\<ast\>><around*|(|n|)>-1>,
    so by induction <math|T<around*|(|n|)>\<leqslant\>C*<around*|(|3*C|)><rsup|\<Phi\><rsup|\<ast\>><around*|(|n|)>>+C*<around*|(|3*C|)><rsup|\<Phi\><rsup|\<ast\>><around*|(|n|)>>+C\<leqslant\><around*|(|3*C|)><rsup|\<Phi\><rsup|\<ast\>><around*|(|n|)>+1>>.

    Finally, since <math|\<Phi\><around*|(|\<Phi\><around*|(|x|)>|)>\<prec\>log
    x\<nocomma\>>, we have <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>\<leqslant\>2*log<rsup|\<ast\>>
    x+O<around*|(|1|)>>, so <math|T<around*|(|n|)>=O<around*|(|K<rsup|log<rsup|\<ast\>>
    n>|)>> for <math|K\<assign\><around*|(|3*C|)><rsup|2>>.
  </proof>

  <section|Logarithmically slow recurrence inequalities><label|iter-sec>

  This section is devoted to developing a framework for handling recurrence
  inequalities, similar to<nbsp>(<reference|eq:simple>), that appear in
  subsequent sections.

  Let <math|\<Phi\>:<around*|(|x<rsub|0>,\<infty\>|)>\<rightarrow\>\<bbb-R\>>
  be a smooth increasing function, for some <math|x<rsub|0>\<in\>\<bbb-R\>>.
  We say that <rigid|<math|\<Phi\><rsup|\<ast\>>:<around*|(|x<rsub|0>,\<infty\>|)>\<rightarrow\>\<bbb-R\><rsup|\<geqslant\>>>>
  is an <em|iterator> of <math|\<Phi\>> if <math|\<Phi\><rsup|\<ast\>>> is
  increasing and if

  <\eqnarray*>
    <tformat|<table|<row|<cell|\<Phi\><rsup|\<ast\>><around*|(|x|)>>|<cell|=>|<cell|\<Phi\><rsup|\<ast\>><around*|(|\<Phi\><around*|(|x|)>|)>+1<eq-number><label|it-gen>>>>>
  </eqnarray*>

  for all sufficiently large <math|x>.

  For instance, the standard iterated logarithm <math|<op|log<rsup|\<ast\>>>>
  defined in<nbsp>(<reference|it-log>) is an iterator of <math|<op|log>>. An
  analogous iterator may be defined for any smooth increasing function
  <math|\<Phi\>:<around*|(|x<rsub|0>,\<infty\>|)>\<rightarrow\>\<bbb-R\>> for
  which there exists some <math|\<sigma\>\<geqslant\>x<rsub|0>> such that
  <math|\<Phi\><around*|(|x|)>\<leqslant\>x-1> for all
  <math|x\<gtr\>\<sigma\>>. Indeed, in that case,

  <\eqnarray*>
    <tformat|<table|<row|<cell|\<Phi\><rsup|\<ast\>><around*|(|x|)>>|<cell|\<assign\>>|<cell|min
    <around*|{|k\<in\>\<bbb-N\>:\<Phi\><rsup|\<circ\>k><around*|(|x|)>\<leqslant\>\<sigma\>|}>>>>>
  </eqnarray*>

  is well-defined and satisfies<nbsp>(<reference|it-gen>) for all
  <math|x\<gtr\>\<sigma\>>. It will sometimes be convenient to increase
  <math|x<rsub|0>> so that <math|\<Phi\><around*|(|x|)>\<leqslant\>x-1> is
  satisfied on the whole domain of <math|\<Phi\>>.
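  The construction of an iterator by counting iterations can be sketched in floating-point arithmetic as follows. This is only an illustration: the function name is hypothetical, and the thresholds 1 and 4 used in the usage note are choices made for the example (the first is the usual convention for the iterated logarithm, the second is a point beyond which 2 log x stays below x-1).

```python
import math


def iterator(phi, x, sigma):
    """Smallest k such that applying phi k times to x gives a value at most
    sigma. The loop terminates when phi(t) is at most t - 1 for every t
    above sigma, since each step then decreases the value by at least 1."""
    k = 0
    while x > sigma:
        x = phi(x)
        k += 1
    return k
```

  For instance, iterator(math.log, 1e6, 1.0) counts the three logarithms needed to bring one million down to at most 1, and iterator(lambda t: 2 * math.log(t), 1e6, 4.0) also returns 3.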

  We say that <math|\<Phi\>> is <em|logarithmically slow> if there exists
  an<nbsp><math|\<ell\>\<in\>\<bbb-N\>> such that

  <\eqnarray*>
    <tformat|<table|<row|<cell|<around*|(|log<rsup|\<circ\>\<ell\>>\<circ\>\<Phi\>\<circ\>exp<rsup|\<circ\>\<ell\>>|)><around*|(|x|)>>|<cell|=>|<cell|log
    x+O<around*|(|1|)><eq-number><label|log-slow-cond>>>>>
  </eqnarray*>

  for <math|x\<rightarrow\>\<infty\>>. For example, the functions <math|log
  <around*|(|2*x|)>>, <math|2*log x>, <math|<around*|(|log x|)><rsup|2>> and
  <math|<around*|(|log x|)><rsup|log log x>> are logarithmically slow, with
  <math|\<ell\>=0,1,2,3> respectively.
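  The four examples can be checked numerically with a small sketch. It conjugates each function by the stated number of exponentials and logarithms and subtracts log x; the helper names are illustrative, and the evaluation point must be small enough that the iterated exponential does not overflow a floating-point number.

```python
import math


def iterate(fn, times, x):
    """Apply fn to x the given number of times."""
    for _ in range(times):
        x = fn(x)
    return x


def conjugate(phi, ell):
    """The map x -> log^(ell)(phi(exp^(ell)(x)))."""
    return lambda x: iterate(math.log, ell, phi(iterate(math.exp, ell, x)))


# The four example functions, paired with the stated value of ell.
EXAMPLES = [
    (lambda x: math.log(2 * x), 0),
    (lambda x: 2 * math.log(x), 1),
    (lambda x: math.log(x) ** 2, 2),
    (lambda x: math.log(x) ** math.log(math.log(x)), 3),
]


def defects(x):
    """Values of conjugate(phi, ell)(x) - log x for the four examples."""
    return [conjugate(phi, ell)(x) - math.log(x) for phi, ell in EXAMPLES]
```

  In all four cases the conjugated function simplifies to log(2x), so each defect equals log 2 up to rounding error, for example at x = 1.5.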

  <\lemma>
    <label|phi-bound>Let <math|\<Phi\>:<around*|(|x<rsub|0>,\<infty\>|)>\<rightarrow\>\<bbb-R\>>
    be a logarithmically slow function. Then there exists
    <math|\<sigma\>\<geqslant\>x<rsub|0>> such that
    <math|\<Phi\><around*|(|x|)>\<leqslant\>x-1> for all
    <math|x\<gtr\>\<sigma\>>. Consequently all logarithmically slow functions
    admit iterators.
  </lemma>

  <\proof>
    The case <math|\<ell\>=0> is clear. For <math|\<ell\>\<geqslant\>1>, let
    <math|\<Psi\>\<assign\>log\<circ\>\<Phi\>\<circ\>exp>. By induction
    <math|\<Psi\><around*|(|x|)>\<leqslant\>x-1> for large<nbsp><math|x>, so
    <math|\<Phi\><around*|(|x|)>\<leqslant\>exp<around*|(|log
    x-1|)>=x/\<mathe\>\<leqslant\>x-1> for large <math|x>.
  </proof>

  In this paper, the main role played by logarithmically slow functions is to
  measure <with|font-shape|italic|size reduction> in multiplication
  algorithms. In other words, multiplication of objects of size <math|n> will
  be reduced to multiplication of objects of size <math|n<rprime|'>>, where
  <math|n<rprime|'>\<leqslant\>\<Phi\><around*|(|n|)>> for some
  logarithmically slow function <math|\<Phi\><around*|(|x|)>>. The following
  result asserts that, from the point of view of iterators, such functions
  are more or less interchangeable with <math|log x>.

  <\lemma>
    <label|iter-lem>For any iterator <math|\<Phi\><rsup|\<ast\>>> of a
    logarithmically slow function <math|\<Phi\>>, we have

    <\eqnarray*>
      <tformat|<table|<row|<cell|\<Phi\><rsup|\<ast\>><around*|(|x|)>>|<cell|=>|<cell|log<rsup|\<ast\>>
      x+O<around*|(|1|)>.>>>>
    </eqnarray*>
  </lemma>

  <\proof>
    First consider the case where <math|\<ell\>=0> in <eqref|log-slow-cond>,
    i.e., assume that <math|<around*|\||\<Phi\><around*|(|x|)>-log
    x|\|>\<leqslant\>C> for some constant <math|C\<gtr\>0> and all
    <math|x\<gtr\>x<rsub|0>>. Increasing <math|x<rsub|0>> and <math|C> if
    necessary, we may assume that <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>=\<Phi\><rsup|\<ast\>><around*|(|\<Phi\><around*|(|x|)>|)>+1>
    for all <math|x\<gtr\>x<rsub|0>\<nocomma\>>, and that
    <math|2*\<mathe\><rsup|2*C>\<gtr\>x<rsub|0>>.

    We claim that

    <\eqnarray*>
      <tformat|<table|<row|<cell|<frac|y|2>\<leqslant\>x\<leqslant\>2*y>|<cell|\<Longrightarrow\>>|<cell|<frac|log
      y|2>\<leqslant\>\<Phi\><around*|(|x|)>\<leqslant\>2*log
      y<eq-number><label|rec-ineq>>>>>
    </eqnarray*>

    for all <math|y\<gtr\>4*\<mathe\><rsup|2*C>>. Indeed, if
    <math|<frac|y|2>\<leqslant\>x\<leqslant\>2*y>, then

    <\equation*>
      <tfrac|1|2>*log y\<leqslant\>log <tfrac|y|2>-C\<leqslant\>\<Phi\><around*|(|<tfrac|<smash|y>|2>|)>\<leqslant\>\<Phi\><around*|(|x|)>\<leqslant\>\<Phi\><around*|(|2*y|)>\<leqslant\>log
      <around*|(|2*y|)>+C\<leqslant\>2*log y.
    </equation*>

    <yes-indent>Now, given any <math|x\<gtr\>4*\<mathe\><rsup|2*C>>, let
    <math|k\<assign\>min<around*|{|i\<in\>\<bbb-N\>:log<rsup|\<circ\>i>
    x\<leqslant\>4*\<mathe\><rsup|2*C>|}>>, so <math|k\<geqslant\>1>. For any
    <math|j=0,\<ldots\>,k-1> we have <math|log<rsup|\<circ\>j>
    x\<gtr\>4*\<mathe\><rsup|2*C>>, so <math|k>-fold iteration of
    <eqref|rec-ineq>, starting with <math|y=x>, yields

    <\equation*>
      <frac|log<rsup|\<circ\>j> x|2>\<leqslant\>\<Phi\><rsup|\<circ\>j><around*|(|x|)>\<leqslant\>2*log<rsup|\<circ\>j>
      x\<nocomma\><space|1em><around*|(|0\<leqslant\>j\<leqslant\>k|)>.
    </equation*>

    Moreover this shows that <math|\<Phi\><rsup|\<circ\>j><around*|(|x|)>\<gtr\>2*\<mathe\><rsup|2*C>\<gtr\>x<rsub|0>>
    for <math|0\<leqslant\>j\<less\>k>, so
    <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>=\<Phi\><rsup|\<ast\>><around*|(|\<Phi\><rsup|\<circ\>k><around*|(|x|)>|)>+k>.
    Since <math|\<Phi\><rsup|\<circ\>k><around*|(|x|)>\<leqslant\>2*log<rsup|\<circ\>k>
    x\<leqslant\>8*\<mathe\><rsup|2*C>> and <math|k=log<rsup|\<ast\>>
    x+O<around*|(|1|)>\<nocomma\>>, we obtain
    <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>=log<rsup|\<ast\>>
    x+O<around*|(|1|)>>.

    Now consider the general case <math|\<ell\>\<geqslant\>0>. Let
    <math|\<Psi\>\<assign\><op|log<rsup|\<circ\>\<ell\>>>\<circ\>\<Phi\>\<circ\><op|exp<rsup|\<circ\>\<ell\>>>>,
    so that <math|\<Psi\><rsup|\<ast\>>\<assign\>\<Phi\><rsup|\<ast\>>\<circ\><op|exp<rsup|\<circ\>\<ell\>>>>
    is an iterator of<nbsp><math|\<Psi\>>. By the above argument
    <math|\<Psi\><rsup|\<ast\>><around*|(|x|)>=log<rsup|\<ast\>>
    x+O<around*|(|1|)>>, and so <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>=\<Psi\><rsup|\<ast\>><around*|(|log<rsup|\<circ\>\<ell\>>
    x|)>=log<rsup|\<ast\>><around*|(|log<rsup|\<circ\>\<ell\>>
    x|)>+O<around*|(|1|)>=log<rsup|\<ast\>>
    x-\<ell\>+O<around*|(|1|)>=log<rsup|\<ast\>> x+O<around*|(|1|)>>.
  </proof>

  The next result, which generalises and refines the argument of
  Theorem<nbsp><reference|simple-th>, is our main tool for converting
  recurrence inequalities into actual asymptotic bounds for solutions. We
  state it in a slightly more general form than is necessary for the present
  paper, anticipating the more complicated situation that arises in
  <cite|vdH:ffmul>.

  <\proposition>
    <label|slow-rec-lem>Let <math|K\<gtr\>1>, <math|B\<geqslant\>0> and
    <math|\<ell\>\<in\>\<bbb-N\>>. Let <math|x<rsub|0>\<geqslant\>exp<rsup|\<circ\>\<ell\>><around*|(|1|)>>,
    and let <math|\<Phi\>:<around*|(|x<rsub|0>,\<infty\>|)>\<rightarrow\>\<bbb-R\>>
    be a<nbsp>logarithmically slow function such that
    <math|\<Phi\><around*|(|x|)>\<leqslant\>x-1> for all
    <math|x\<gtr\>x<rsub|0>>. Then there exists a positive constant <math|C>
    (depending on <math|x<rsub|0>>, <math|\<Phi\>>, <math|K>, <math|B> and
    <math|\<ell\>>) with the following property.

    Let <math|\<sigma\>\<geqslant\>x<rsub|0>> and <math|L\<gtr\>0>. Let
    <math|\<cal-S\>\<subseteq\>\<bbb-R\>>, and let
    <math|T:\<cal-S\>\<rightarrow\>\<bbb-R\><rsup|\<geqslant\>>> be any
    function satisfying the following recurrence. First,
    <math|T<around*|(|y|)>\<leqslant\>L> for all <math|y\<in\>\<cal-S\>>,
    <math|y\<leqslant\>\<sigma\>>. Second, for all <math|y\<in\>\<cal-S\>>,
    <math|y\<gtr\>\<sigma\>>, there exist
    <math|y<rsub|1>,\<ldots\>,y<rsub|d>\<in\>\<cal-S\>> with
    <math|y<rsub|i>\<leqslant\>\<Phi\><around*|(|y|)>>, and weights
    <math|\<gamma\><rsub|1>,\<ldots\>,\<gamma\><rsub|d>\<geqslant\>0> with
    <math|<big|sum><rsub|i>\<gamma\><rsub|i>=1>, such that

    <\eqnarray*>
      <tformat|<table|<row|<cell|T<around*|(|y|)>>|<cell|\<leqslant\>>|<cell|K*<around*|(|1+<frac|B|log<rsup|\<circ\>\<ell\>>
      y>|)>*<big|sum><rsup|d><rsub|i=1>\<gamma\><rsub|i>*T<around*|(|y<rsub|i>|)>+L.>>>>
    </eqnarray*>

    Then we have <math|T<around*|(|y|)>\<leqslant\>C*L*K<rsup|log<rsup|\<ast\>>
    y-log<rsup|\<ast\>> \<sigma\>>> for all <math|y\<in\>\<cal-S\>>,
    <math|y\<gtr\>\<sigma\>>.
  </proposition>

  <\proof>
    Let <math|\<sigma\>>, <math|L>, <math|\<cal-S\>> and
    <math|T<around*|(|x|)>> be as above. Define
    <math|\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|x|)>\<assign\>min<around*|{|k\<in\>\<bbb-N\>:\<Phi\><rsup|\<circ\>k><around*|(|x|)>\<leqslant\>\<sigma\>|}>>
    for <math|x\<gtr\>x<rsub|0>>. We claim that there exists
    <math|r\<in\>\<bbb-N\>>, depending only on <math|x<rsub|0>> and
    <math|\<Phi\>>, such that

    <\eqnarray*>
      <tformat|<table|<row|<cell|\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|x|)>>|<cell|\<leqslant\>>|<cell|log<rsup|\<ast\>>
      x-log<rsup|\<ast\>> \<sigma\>+r<eq-number><label|phi-sigma>>>>>
    </eqnarray*>

    for all <math|x\<gtr\>\<sigma\>>. Indeed, let
    <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>\<assign\>min<around*|{|j\<in\>\<bbb-N\>:\<Phi\><rsup|\<circ\>j><around*|(|x|)>\<leqslant\>x<rsub|0>|}>>.
    First suppose <math|\<sigma\>\<gtr\>x<rsub|0>>, so that
    <math|\<Phi\><rsup|\<ast\>><around*|(|\<sigma\>|)>\<geqslant\>1>. For any
    <math|x\<gtr\>\<sigma\>>, we have <math|\<Phi\><rsup|\<circ\><around*|(|\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|x|)>-1|)>><around*|(|x|)>\<gtr\>\<sigma\>>,
    so

    <\equation*>
      \<Phi\><rsup|\<circ\><around*|(|\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|x|)>-1+\<Phi\><rsup|\<ast\>><around*|(|\<sigma\>|)>-1|)>><around*|(|x|)>\<geqslant\>\<Phi\><rsup|\<circ\><around*|(|\<Phi\><rsup|\<ast\>><around*|(|\<sigma\>|)>-1|)>><around*|(|\<sigma\>|)>\<gtr\>x<rsub|0>,
    </equation*>

    and hence <math|\<Phi\><rsup|\<ast\>><around*|(|x|)>\<gtr\>\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|x|)>+\<Phi\><rsup|\<ast\>><around*|(|\<sigma\>|)>-2\<nocomma\>>.
    This last inequality also clearly holds if <math|\<sigma\>=x<rsub|0>>
    (since <math|0\<gtr\>-2>). By Lemma<nbsp><reference|iter-lem> we obtain
    <math|\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|x|)>\<leqslant\>\<Phi\><rsup|\<ast\>><around*|(|x|)>-\<Phi\><rsup|\<ast\>><around*|(|\<sigma\>|)>+O<around*|(|1|)>=log<rsup|\<ast\>>
    x-log<rsup|\<ast\>> \<sigma\>+O<around*|(|1|)>>.

    Define a sequence of real numbers <math|E<rsub|1>,E<rsub|2>,\<ldots\>> by
    the formula

    <\eqnarray*>
      <tformat|<table|<row|<cell|E<rsub|j>>|<cell|\<assign\>>|<cell|<choice|<tformat|<table|<row|<cell|1+B>|<cell|>|<cell|<text|if
      >j\<leqslant\>r+\<ell\>,>>|<row|<cell|1+B/exp<rsup|\<circ\><around*|(|j-r-\<ell\>-1|)>><around*|(|1|)>>|<cell|>|<cell|<text|if
      >j\<gtr\>r+\<ell\>.>>>>>>>>>
    </eqnarray*>

    We claim that

    <\eqnarray*>
      <tformat|<table|<row|<cell|1+B/log<rsup|\<circ\>\<ell\>>
      x>|<cell|\<leqslant\>>|<cell|E<rsub|\<Phi\><rsub|\<sigma\>><rsup|\<ast\>><around*|(|x|)>><eq-number><label|E-bound>>>>>
    </eqnarray*>

    for all <math|x\<gtr\>\<sigma\>>. Indeed, let
    <math|j\<assign\>\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|x|)>>.
    If <math|j\<leqslant\>r+\<ell\>> then <eqref|E-bound> holds as
    <math|x\<gtr\>\<sigma\>\<geqslant\>x<rsub|0>\<geqslant\>exp<rsup|\<circ\>\<ell\>><around*|(|1|)>>.
    If <math|j\<gtr\>r+\<ell\>> then <math|log<rsup|\<ast\>>
    x\<geqslant\>j-r> by <eqref|phi-sigma>, so
    <math|x\<geqslant\>exp<rsup|\<circ\><around*|(|j-r-1|)>><around*|(|1|)>>
    and hence <math|log<rsup|\<circ\>\<ell\>>
    x\<geqslant\>exp<rsup|\<circ\><around*|(|j-r-\<ell\>-1|)>><around*|(|1|)>>.

    Now let <math|y\<in\>\<cal-S\>>. We will prove by induction on
    <math|j\<assign\>\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|y|)>>
    that

    <\eqnarray*>
      <tformat|<table|<row|<cell|T<around*|(|y|)>>|<cell|\<leqslant\>>|<cell|E<rsub|1>*\<cdots\>*E<rsub|j>*L*<around*|(|K<rsup|j>+\<cdots\>+K+1|)>>>>>
    </eqnarray*>

    for all <math|y\<gtr\>x<rsub|0>>. The base case <math|j\<assign\>0>,
    i.e., <math|y\<leqslant\>\<sigma\>>, holds by assumption. Now assume that
    <math|j\<geqslant\>1>, so <math|y\<gtr\>\<sigma\>>. By hypothesis there
    exist <math|y<rsub|1>,\<ldots\>,y<rsub|d>\<in\>\<cal-S\>>,
    <math|y<rsub|i>\<leqslant\>\<Phi\><around*|(|y|)>>, and
    <math|\<gamma\><rsub|1>,\<ldots\>,\<gamma\><rsub|d>\<geqslant\>0> with
    <math|<big|sum><rsub|i>\<gamma\><rsub|i>=1>, such that

    <\eqnarray*>
      <tformat|<table|<row|<cell|T<around*|(|y|)>>|<cell|\<leqslant\>>|<cell|K*E<rsub|j>*<big|sum><rsub|i>\<gamma\><rsub|i>*T<around*|(|y<rsub|i>|)>+L.>>>>
    </eqnarray*>

    Since <math|\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|y<rsub|i>|)>\<leqslant\>\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|\<Phi\><around*|(|y|)>|)>=\<Phi\><rsup|\<ast\>><rsub|\<sigma\>><around*|(|y|)>-1>,
    we obtain

    <\eqnarray*>
      <tformat|<table|<row|<cell|T<around*|(|y|)>>|<cell|\<leqslant\>>|<cell|K*E<rsub|j>*<big|sum><rsub|i>\<gamma\><rsub|i>*<around*|(|E<rsub|1>*\<cdots\>*E<rsub|j-1>*L*<around*|(|K<rsup|j-1>+\<cdots\>+K+1|)>|)>+L>>|<row|<cell|>|<cell|=>|<cell|E<rsub|1>*\<cdots\>*E<rsub|j>*L*<around*|(|K<rsup|j>+\<cdots\>+K<rsup|2>+K|)>+L>>|<row|<cell|>|<cell|\<leqslant\>>|<cell|E<rsub|1>*\<cdots\>*E<rsub|j>*L*<around*|(|K<rsup|j>+\<cdots\>+K<rsup|2>+K+1|)>.>>>>
    </eqnarray*>

    <yes-indent>Finally, the infinite product

    <\equation*>
      E\<assign\><big|prod><rsub|j\<geqslant\>1>E<rsub|j>\<leqslant\><around*|(|1+B|)><rsup|r+\<ell\>>*<big|prod><rsub|k\<geqslant\>0><around*|(|1+<frac|B|exp<rsup|\<circ\>k><around*|(|1|)>>|)>
    </equation*>

    certainly converges, so we have <math|T<around*|(|y|)>\<leqslant\>E*L*K<rsup|j+1>/<around*|(|K-1|)>>
    for <math|y\<gtr\>x<rsub|0>>. Setting
    <math|C\<assign\>E*K<rsup|r+1>/<around*|(|K-1|)>>,
    by<nbsp><eqref|phi-sigma> we obtain <math|T<around*|(|y|)>\<leqslant\>C*L*K<rsup|log<rsup|\<ast\>>
    y-log<rsup|\<ast\>> \<sigma\>>> for all <math|y\<gtr\>\<sigma\>>.
  </proof>

  <section|Even faster multiplication><label|even-faster-sec>

  In this section, we present an optimised version of the new integer
  multiplication algorithm. The basic outline is the same as in
  section<nbsp><reference|simple-algo-sec>, but our goal is now to minimise
  the ``expansion factor'' at each recursion level. The necessary
  modifications may be summarised as follows.

  <\itemize>
    <item>Since Bluestein's chirp transform reduces a DFT to a complex cyclic
    convolution, we take the basic recursive problem to be complex cyclic
    integer convolution, i.e., multiplication in
    <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|<around*|(|2<rsup|n>-1|)>*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>\<nocomma\>>,
    rather than ordinary integer multiplication.

    <item>In multiplications involving one fixed operand, we reuse the
    transform of the fixed operand.

    <item>In a convolution of length <math|n> with input coefficients of bit
    size <math|b>, the size of the output coefficients is
    <math|2*b+O<around*|(|lg n|)>>, so the ratio of output to input size is
    <math|2+O<around*|(|<around*|(|lg n|)>/b|)>>. We increase<nbsp><math|b>
    from <math|lg n> to <math|<around*|(|lg n|)><rsup|2>>, so as to reduce
    the inflation ratio from <math|O<around*|(|1|)>> to
    <math|2+O<around*|(|1/lg n|)>>.

    <item>We increase the ``short transform length'' from <math|lg n> to
    <math|<around*|(|lg n|)><rsup|lg lg n+O<around*|(|1|)>>>. The complexity
    then becomes dominated by the Bluestein--Kronecker multiplications, while
    the contribution from ordinary arithmetic in <math|\<bbb-C\><rsub|p>>
    becomes asymptotically negligible. (As noted in
    section<nbsp><reference|intro-sec>, this is precisely the opposite of
    what occurs in Fürer's algorithm.)
  </itemize>
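  The effect of the third modification on the inflation ratio can be made concrete with a short sketch. It assumes non-negative coefficients and takes lg to be the ceiling of the base-2 logarithm, so that each output coefficient of a length-n convolution fits in 2b + lg n bits; the function names are illustrative.

```python
from fractions import Fraction


def ceil_log2(x):
    """lg x, taken here to mean ceil(log2(x)) for an integer x >= 1."""
    return (x - 1).bit_length()


def inflation_ratio(n, b):
    """Upper bound on output bits divided by input bits for a length-n
    convolution of non-negative b-bit coefficients: each output coefficient
    is a sum of n products, each below 2**(2*b)."""
    return Fraction(2 * b + ceil_log2(n), b)
```

  For a length of 2^20, coefficients of lg n = 20 bits give a ratio of 3, whereas coefficients of 400 bits give 41/20, that is, 2 + 1/lg n.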

  We begin with a technical preliminary. To perform multiplication in
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|<around*|(|2<rsup|n>-1|)>*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>
  efficiently using FFT multiplication, we need <math|n> to be divisible by a
  high power of two. We say that an integer <math|n\<geqslant\>3> is
  <with|font-shape|italic|admissible> if <math|2<rsup|\<kappa\><around*|(|n|)>>\<divides\>n>,
  where <math|\<kappa\><around*|(|n|)>\<assign\>lg n-lg <around*|(|lg<rsup|2>
  n|)>+1> (note that <math|0\<leqslant\>\<kappa\><around*|(|n|)>\<leqslant\>lg
  n> for all <math|n\<geqslant\>3>). We will need a function that rounds a
  given <math|n> up to an admissible integer. For this purpose we define
  <math|\<alpha\><around*|(|n|)>\<assign\><around*|\<lceil\>|n/2<rsup|\<kappa\><around*|(|n|)>>|\<rceil\>>*2<rsup|\<kappa\><around*|(|n|)>>>
  for <math|n\<geqslant\>3>. Note that <math|\<alpha\><around*|(|n|)>> may be
  computed in time <math|O<around*|(|lg n|)>>.

  <\lemma>
    <label|admissible>Let <math|n\<geqslant\>3>. Then
    <math|\<alpha\><around*|(|n|)>> is admissible and

    <\equation>
      n\<leqslant\>\<alpha\><around*|(|n|)>\<leqslant\>n+<frac|4*n|lg<rsup|2>
      n>.<label|eq:rho-bound>
    </equation>
  </lemma>

  <\proof>
    We have <math|n\<leqslant\>\<alpha\><around*|(|n|)>\<leqslant\>n+2<rsup|\<kappa\><around*|(|n|)>>>,
    which implies<nbsp><eqref|eq:rho-bound>. Since
    <math|n/2<rsup|\<kappa\><around*|(|n|)>>\<leqslant\>2<rsup|lg
    n-\<kappa\><around*|(|n|)>>> and <math|\<kappa\><around*|(|n|)>\<leqslant\>lg
    n>, we have <math|<around*|\<lceil\>|n/2<rsup|\<kappa\><around*|(|n|)>>|\<rceil\>>\<leqslant\>2<rsup|lg
    n-\<kappa\><around*|(|n|)>>> and thus
    <math|\<alpha\><around*|(|n|)>\<leqslant\>2<rsup|lg n>>, i.e., <math|lg
    \<alpha\><around*|(|n|)>=lg n>. In particular
    <math|\<kappa\><around*|(|\<alpha\><around*|(|n|)>|)>=\<kappa\><around*|(|n|)>>,
    so <math|\<alpha\><around*|(|n|)>> is admissible. (In fact, one easily
    checks that <math|\<alpha\><around*|(|n|)>> is the
    <with|font-shape|italic|smallest> admissible integer
    <math|\<geqslant\>n>).
  </proof>
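
  The definitions of <math|\<kappa\>> and <math|\<alpha\>> can be checked
  numerically for small <math|n>. The following Python sketch assumes that
  <math|lg> denotes the ceiling of the base-2 logarithm and that
  <math|lg<rsup|2> n> means <math|<around*|(|lg n|)><rsup|2>>, which is
  consistent with the inequalities used in the proof above; the function
  names are ours and purely illustrative.

```python
def lg(n):
    """Ceiling of log2(n) for an integer n >= 1 (assumed meaning of lg)."""
    return (n - 1).bit_length()

def kappa(n):
    """kappa(n) = lg n - lg(lg^2 n) + 1, for n >= 3."""
    l = lg(n)
    return l - lg(l * l) + 1

def is_admissible(n):
    """n is admissible if 2^kappa(n) divides n."""
    return n % (1 << kappa(n)) == 0

def alpha(n):
    """Round n up to the next multiple of 2^kappa(n)."""
    step = 1 << kappa(n)
    return -(-n // step) * step  # ceiling division, then scale back

if __name__ == "__main__":
    for n in (3, 1000, 10**6):
        a = alpha(n)
        print(n, kappa(n), a, is_admissible(a))
```

  Under these assumptions, <math|\<kappa\><around*|(|1000|)>=4> and
  <math|\<alpha\><around*|(|1000|)>=1008=63\<cdot\>2<rsup|4>>.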

  <\remark>
    It is actually possible to drop the requirement that <math|n> be
    divisible by a high power of two, by using the Crandall--Fagin method
    (see section<nbsp><reference|yet-faster-sec>). We prefer to avoid this
    approach in this section, as it adds an unnecessary layer of complexity
    to the presentation.
  </remark>

  Now let <math|n> be admissible, and consider the problem of computing
  <math|t\<geqslant\>1> products <rigid|<math|u<rsub|1>*v,\<ldots\>,u<rsub|t>*v>>
  with <math|<rigid|u<rsub|1>,\<ldots\>,u<rsub|t>>,v\<in\><rigid|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|<around*|(|2<rsup|n>-1|)>*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>>,
  i.e., <math|t> products with one fixed operand. Denote the cost of this
  operation by <math|<math-ss|C><rsub|t><around*|(|n|)>>. Our
  algorithm for this problem will perform <math|t+1> forward DFTs and
  <math|t> inverse DFTs, so it is convenient to introduce the normalisation

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|C><around*|(|n|)>>|<cell|\<assign\>>|<cell|sup<rsub|t\<geqslant\>1>
    <frac|<math-ss|C><rsub|t><around*|(|n|)>|2*t+1>.>>>>
  </eqnarray*>

  This is well-defined since clearly <math|<math-ss|C><rsub|t><around*|(|n|)>\<leqslant\>t*<math-ss|C><rsub|1><around*|(|n|)>>.
  Roughly speaking, <math|<math-ss|C><around*|(|n|)>> may be thought of as
  the notional cost of a single DFT.

  The problem of multiplying <math|k>-bit integers may be reduced to the
  above problem by using zero-padding, i.e., by taking
  <math|n\<assign\>\<alpha\><around*|(|2*k+1|)>> and <math|t\<assign\>1>.
  Since <math|\<alpha\><around*|(|2*k+1|)>=O<around*|(|k|)>> and
  <math|<math-ss|C><rsub|1><around*|(|n|)>\<leqslant\>3*<math-ss|C><around*|(|n|)>>
  (take <math|t=1> in the definition of <math|<math-ss|C><around*|(|n|)>>),
  we obtain <math|<math-ss|I><around*|(|k|)>\<leqslant\>3*<math-ss|C><around*|(|O<around*|(|k|)>|)>+O<around*|(|k|)>>.
  Thus it suffices to obtain a good bound for
  <math|<math-ss|C><around*|(|n|)>>.

  The recursive step in the main multiplication algorithm involves computing
  ``short'' DFTs via the Bluestein--Kronecker device. As pointed out in
  section<nbsp><reference|Bluestein-sec>, this leads to a<nbsp>cyclic
  convolution with one fixed operand. To take advantage of the fixed operand,
  let <math|<math-ss|B><rsub|p,t><around*|(|2<rsup|r>|)>> denote the cost of
  computing<nbsp><math|t> independent DFTs of length <math|2<rsup|r>> over
  <math|\<bbb-C\><rsub|p>>, and let <math|<math-ss|B><rsub|p><around*|(|2<rsup|r>|)>\<assign\>sup<rsub|t\<geqslant\>1>
  <math-ss|B><rsub|p,t><around*|(|2<rsup|r>|)>/<around*|(|2*t+1|)>>. Then we
  have the following refinement of Proposition<nbsp><reference|BK-prop>. As
  usual we assume that the necessary Bluestein root table has been
  precomputed.

  <\proposition>
    <label|BK-cyclic-prop>Let <math|r\<geqslant\>3>, and assume that
    <math|2<rsup|r>> divides <math|n<rprime|'>\<assign\>\<alpha\><around*|(|<around*|(|2*p+r+2|)>*2<rsup|r>|)>>.
    Then there exists a tight algorithm <math|\<cal-C\><rprime|'><rsub|r>>
    for computing DFTs of length <math|2<rsup|r>> over
    <math|\<bbb-C\><rsub|p>>, with

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|B><rsub|p><around*|(|2<rsup|r>|)>>|<cell|\<leqslant\>>|<cell|<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|2<rsup|r>*<math-ss|I><around*|(|p|)>|)>.>>>>
    </eqnarray*>
  </proposition>

  <\proof>
    We use the same notation and algorithm as in the proof of
    Proposition<nbsp><reference|BK-prop>, except that in the Kronecker
    substitution we take <math|b\<assign\>n<rprime|'>/2<rsup|r>\<geqslant\>2*p+r+2>,
    so that the resulting integer multiplication takes place in
    <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|<around*|(|2<rsup|n<rprime|'>>-1|)>*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>.
    The proof of tightness is identical to that of
    Proposition<nbsp><reference|BK-prop> (this is where we use the assumption
    <math|r\<geqslant\>3>). For the complexity bound, note that
    <math|n<rprime|'>> is admissible by construction, so for any
    <math|t\<geqslant\>1> we have <math|<math-ss|B><rsub|p,t><around*|(|2<rsup|r>|)>\<leqslant\><math-ss|C><rsub|t><around*|(|n<rprime|'>|)>+O<around*|(|t*2<rsup|r>*<math-ss|I><around*|(|p|)>|)>.>
    Here we have used the fact that <math|G<rprime|'>> is fixed over all
    these multiplications. Dividing by <math|2*t+1> and taking suprema over
    <math|t\<geqslant\>1> yields the result.
  </proof>

  The next result gives the main recurrence satisfied by
  <math|<math-ss|C><around*|(|n|)>> (compare with
  Theorem<nbsp><reference|thm:simple>).

  <\theorem>
    <label|main-lem>There exist <math|x<rsub|0>\<geqslant\>3> and a
    logarithmically slow function <math|\<Phi\>:<around*|(|x<rsub|0>,\<infty\>|)>\<rightarrow\>\<bbb-R\>>
    with the following property. For all admissible
    <math|n\<gtr\>x<rsub|0>\<nocomma\>>, there exists an admissible
    <math|n<rprime|'>\<leqslant\>\<Phi\><around*|(|n|)>> such that

    <\eqnarray*>
      <tformat|<table|<row|<cell|<frac|<math-ss|C><around*|(|n|)>|n*lg
      n>>|<cell|\<leqslant\>>|<cell|<around*|(|8+O<around*|(|<frac|1|lg lg
      n>|)>|)>*<frac|<math-ss|C><around*|(|n<rprime|'>|)>|n<rprime|'>*lg
      n<rprime|'>>+O<around*|(|1|)>.<eq-number><label|main-rec-rel>>>>>
    </eqnarray*>
  </theorem>

  <\proof>
    Let <math|n> be admissible and sufficiently large, and consider the
    problem of computing <math|t\<geqslant\>1> products
    <math|u<rsub|1>*v,\<ldots\>,u<rsub|t>*v>, for
    <math|<rigid|u<rsub|1>,\<ldots\>,u<rsub|t>>,v\<in\><rigid|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|<around*|(|2<rsup|n>-1|)>*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>>.
    Let <math|k\<assign\>\<kappa\><around*|(|n|)>\<sim\>lg n>, so that
    <math|2<rsup|k>\<divides\>n>, and let
    <math|b\<assign\>n/2<rsup|k>\<asymp\>lg<rsup|2> n>.

    We cut the inputs into <math|2<rsup|k>> chunks of size <math|b>, i.e., if
    <math|w> is one of the <math|t+1> inputs, we write
    <math|w=w<rsub|0>+w<rsub|1>*2<rsup|b>+\<cdots\>+w<rsub|2<rsup|k>-1>*2<rsup|<around*|(|2<rsup|k>-1|)>*b>>,
    where <math|w<rsub|i>\<in\>\<bbb-Z\><around*|[|\<mathi\>|]>>, and where
    the real and imaginary parts of <math|w<rsub|i>> have absolute value at
    most <math|2<rsup|b>>. Thus <math|<around*|\||w<rsub|i>|\|>\<leqslant\><sqrt|2>\<cdot\>2<rsup|b>\<less\>2<rsup|b+1>\<nocomma\>>,
    and for any <math|p\<geqslant\>b+1> we may encode <math|w> as a
    polynomial <math|W\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|b+1>|)><around*|[|X|]>/<around|(|X<rsup|2<rsup|k>>-1|)>>.

    We will multiply the desired (cyclic) polynomials by using DFTs of length
    <math|2<rsup|k>> over <math|\<bbb-C\><rsub|p>> where
    <math|p\<assign\>2*b+2*k+lg k+10=O<around*|(|lg<rsup|2> n|)>>. We
    construct the DFTs in a similar way to
    section<nbsp><reference|simple-algo-sec>. Let
    <math|r\<assign\><around*|(|lg lg n|)><rsup|2>> and
    <math|d\<assign\><around|\<lceil\>|k/r|\<rceil\>>=O<around|(|lg
    n/<around*|(|lg lg n|)><rsup|2>|)>>. Write
    <math|k=r<rsub|1>+\<cdots\>+r<rsub|d>> with <math|r<rsub|i>\<assign\>r>
    for <math|i\<leqslant\>d-1> and <math|r<rsub|d>\<assign\>k-<around*|(|d-1|)>*r\<leqslant\>r>.
    We use the tight algorithm <math|\<cal-A\>\<assign\>\<cal-A\><rsub|1>\<odot\>\<cdots\>\<odot\>\<cal-A\><rsub|d>>,
    where for <math|1\<leqslant\>i\<leqslant\>d-1> we take
    <math|\<cal-A\><rsub|i>> to be the tight algorithm
    <math|\<cal-C\><rprime|'><rsub|r>> for DFTs of length <math|2<rsup|r>>
    given by Proposition<nbsp><reference|BK-cyclic-prop>, and where
    <math|\<cal-A\><rsub|d>> is <math|\<cal-B\><rsup|\<odot\>r<rsub|d>>> as
    in Corollary<nbsp><reference|CT-cor>. Thus, for the first <math|d-1>
    groups of <math|r> layers, we use Bluestein--Kronecker to reduce to
    complex integer convolution of size <math|n<rprime|'>\<assign\>\<alpha\><around*|(|<around*|(|2*p+r+2|)>*2<rsup|r>|)>>,
    and the remaining layers are handled using ordinary Cooley--Tukey. We
    write <math|\<cal-A\><rprime|'>> for the analogous inverse transform.

    To check the hypothesis of Proposition<nbsp><reference|BK-cyclic-prop>,
    we observe that <math|2<rsup|r>\<divides\>n<rprime|'>> for sufficiently
    large <math|n>, as <math|n<rprime|'>> is divisible
    by<nbsp><math|2<rsup|k<rprime|'>>> where <math|k<rprime|'>\<assign\>lg
    n<rprime|'>-lg<around*|(|lg<rsup|2> n<rprime|'>|)>+1>, and

    <\equation*>
      2<rsup|k<rprime|'>>\<asymp\><frac|n<rprime|'>|lg<rsup|2>
      n<rprime|'>>\<asymp\><frac|<around*|(|2*p+r+2|)>*2<rsup|r>|lg<rsup|2>
      <around*|(|<around*|(|2*p+r+2|)>*2<rsup|r>|)>>\<asymp\><frac|b*2<rsup|r>|<around*|(|lg
      b+r|)><rsup|2>>\<asymp\><frac|<around*|(|lg n|)><rsup|2>|<around*|(|lg
      lg n|)><rsup|4>>*2<rsup|r>\<succ\>2<rsup|r>.
    </equation*>

    <yes-indent>Denote by <math|<math-ss|D>> the cost of a single invocation
    of <math|\<cal-A\>> (or <math|\<cal-A\><rprime|'>>). By
    Corollary<nbsp><reference|CT-cor> and<nbsp><eqref|fft-rec-bound2>, we
    have

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|D>>|<cell|\<leqslant\>>|<cell|<around*|(|d-1|)>*<math-ss|B><rsub|p,2<rsup|k-r>><around*|(|2<rsup|r>|)>+O<around*|(|2<rsup|k-r<rsub|d>>*2<rsup|r<rsub|d>>*r<rsub|d>*<math-ss|I><around*|(|p|)>|)>+O<around*|(|d*2<rsup|k>*<math-ss|I><around*|(|p|)>|)>+O<around*|(|2<rsup|k>*k*b|)>.>>>>
    </eqnarray*>

    The last term is the rearrangement cost, and simplifies to
    <math|O<around*|(|n*lg n|)>>. The second term covers the invocations of
    <math|\<cal-A\><rsub|d>>, and simplifies to
    <math|O<around*|(|r*2<rsup|k>*<math-ss|I><around*|(|p|)>|)>>, so is
    absorbed by the <math|d*2<rsup|k>*<math-ss|I><around*|(|p|)>> term. The
    first term covers the invocations of <math|\<cal-C\><rprime|'><rsub|r>>.
    By definition <math|<math-ss|B><rsub|p,2<rsup|k-r>><around*|(|2<rsup|r>|)>\<leqslant\><around*|(|2\<cdot\>2<rsup|k-r>+1|)>*<math-ss|B><rsub|p><around*|(|2<rsup|r>|)>>,
    and since <math|2<rsup|k-r>\<succ\>lg lg n>,
    Proposition<nbsp><reference|BK-cyclic-prop> yields

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|B><rsub|p,2<rsup|k-r>><around*|(|2<rsup|r>|)>>|<cell|\<leqslant\>>|<cell|<around*|(|2+O<around*|(|1/lg
      lg n|)>|)>*2<rsup|k-r>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|2<rsup|k>*<math-ss|I><around*|(|p|)>|)>.>>>>
    </eqnarray*>

    Thus

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|D>>|<cell|\<leqslant\>>|<cell|<around*|(|2+O<around*|(|1/lg
      lg n|)>|)>*d*2<rsup|k-r>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|d*2<rsup|k>*<math-ss|I><around*|(|p|)>|)>+O<around*|(|n*lg
      n|)>.>>>>
    </eqnarray*>

    <yes-indent>We will use Schönhage--Strassen's algorithm for fixed point
    multiplications in <math|\<bbb-C\><rsub|p>>. Since
    <math|p=O<around*|(|lg<rsup|2> n|)>>, we may take
    <math|<math-ss|I><around*|(|p|)>=O<around*|(|lg<rsup|2> n*lg lg n*lg lg
    lg n|)>>. Thus the <math|d*2<rsup|k>*<math-ss|I><around*|(|p|)>> term
    becomes

    <\equation*>
      O<around*|(|<frac|lg n|<around*|(|lg lg
      n|)><rsup|2>>*<frac|n|lg<rsup|2> n>*lg<rsup|2> n*lg lg n*lg lg lg
      n|)>=O<around*|(|n*lg n*<frac|lg lg lg n|lg lg n>|)>=O<around*|(|n*lg
      n|)>.
    </equation*>

    (We could of course use our algorithm recursively for these
    multiplications; however, it turns out that Schönhage--Strassen is fast
    enough, and leads to simpler recurrences. In fact, the algorithm
    asymptotically spends more time rearranging data than multiplying in
    <math|\<bbb-C\><rsub|p>>!)

    Since <math|<around*|(|2*p+r+2|)>*2<rsup|r>=<around*|(|4*b+O<around*|(|lg
    n|)>|)>*2<rsup|r>=<around*|(|4+O<around*|(|1/lg lg n|)>|)>*b*2<rsup|r>>,
    and since <math|lg<around*|(|b*2<rsup|r>|)>=r+O<around*|(|lg lg
    n|)>=<around*|(|1+O<around*|(|1/lg lg n|)>|)>*r\<succ\>lg lg n>, by
    Lemma<nbsp><reference|admissible> we have

    <\eqnarray*>
      <tformat|<table|<row|<cell|n<rprime|'>>|<cell|=>|<cell|<around*|(|4+O<around*|(|1/lg
      lg n|)>|)>*b*2<rsup|r>,>>|<row|<cell|lg
      n<rprime|'>>|<cell|=>|<cell|<around*|(|1+O<around*|(|1/lg lg
      n|)>|)>*r.>>>>
    </eqnarray*>

    We also have <math|k=lg n+O<around*|(|lg lg n|)>> and
    <math|d=k/r+O<around*|(|1|)>>, so

    <\eqnarray*>
      <tformat|<table|<row|<cell|lg n>|<cell|=>|<cell|<around*|(|1+O<around*|(|1/lg
      lg n|)>|)>*k,>>|<row|<cell|d>|<cell|=>|<cell|<around*|(|1+O<around*|(|1/lg
      lg n|)>|)>*k/r.>>>>
    </eqnarray*>

    Thus

    <\equation*>
      d*2<rsup|k-r>=<frac|4*<around*|(|2<rsup|k>*b|)>*d|<around*|(|4*b*2<rsup|r>|)>>=<around*|(|4+O<around*|(|<frac|1|lg
      lg n>|)>|)>*<frac|n*lg n|n<rprime|'> lg n<rprime|'>>,
    </equation*>

    and consequently

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|D>>|<cell|\<leqslant\>>|<cell|<around*|(|8+O<around*|(|<frac|1|lg
      lg n>|)>|)>*<frac|n*lg n|n<rprime|'>*lg
      n<rprime|'>>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|n*lg
      n|)>.>>>>
    </eqnarray*>

    <yes-indent>To compute the desired <math|t> products, we must execute
    <math|t+1> forward transforms and <math|t> inverse transforms. For each
    product, we must also perform <math|O<around*|(|2<rsup|k>|)>> pointwise
    multiplications in <math|\<bbb-C\><rsub|p>>, at cost
    <math|O<around*|(|2<rsup|k>*<math-ss|I><around*|(|p|)>|)>=O<around*|(|n*lg
    n|)>>. As in the proof of Theorem<nbsp><reference|thm:simple>, the cost
    of all necessary root table precomputations is also bounded by
    <math|O<around*|(|2<rsup|k>*<math-ss|I><around*|(|p|)>|)>=O<around*|(|n*lg
    n|)>>. Thus we obtain

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|C><rsub|t><around*|(|n|)>>|<cell|\<leqslant\>>|<cell|<around*|(|2*t+1|)>*<math-ss|D>+O<around*|(|t*n*lg
      n|)>.>>>>
    </eqnarray*>

    Dividing by <math|<around*|(|2*t+1|)>*n*lg n> and taking suprema yields
    the bound<nbsp>(<reference|main-rec-rel>).

    The error analysis is almost identical to the proof of
    Theorem<nbsp><reference|thm:simple>, the only difference being that
    <math|b> is replaced by <math|b+1>. Denoting one of the <math|t> products
    by <math|h\<in\><around*|(|\<bbb-C\><rsub|p>*2<rsup|2*b+2*k+2>|)><around*|[|X|]>/<around|(|X<rsup|2<rsup|k>>-1|)>>,
    we have <math|\<rho\><rsub|h<rsub|i>>\<leqslant\>2<rsup|6+lg k-p>>
    exactly as in Theorem<nbsp><reference|thm:simple>. Thus
    <math|\<varepsilon\><rsub|h<rsub|i>>\<leqslant\>2<rsup|2*b+2*k+lg
    k-p+8>\<leqslant\>1/4>, and again we obtain <math|h<rsub|i>> by rounding
    to the nearest integer.

    Finally we show how to define <math|\<Phi\><around*|(|x|)>>. We already
    observed that <math|lg n<rprime|'>\<sim\>r\<sim\><around*|(|lg lg
    n|)><rsup|2>>. Thus there exists a constant <math|C\<gtr\>0> such that
    <math|log log log n<rprime|'>\<leqslant\>log log log log n+C> for large
    <math|n>, so we may take <math|\<Phi\><around*|(|x|)>\<assign\>exp<rsup|\<circ\>3><around*|(|log<rsup|\<circ\>4>
    x+C|)>>.
  </proof>
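
  To get a feel for the sizes involved in
  Theorem<nbsp><reference|main-lem>, the following Python sketch computes
  the parameters chosen in the proof for one admissible <math|n>. It
  assumes that <math|lg> is the ceiling of the base-2 logarithm; the
  function names and the sample value <math|n=2<rsup|1000>> are ours. The
  asymptotic statements of the proof only become meaningful for very large
  <math|n>, so this is an illustration rather than a practical parameter
  selection.

```python
def lg(n):
    """Ceiling of log2(n) for an integer n >= 1 (assumed meaning of lg)."""
    return (n - 1).bit_length()

def kappa(n):
    l = lg(n)
    return l - lg(l * l) + 1

def alpha(n):
    step = 1 << kappa(n)
    return -(-n // step) * step

def parameters(n):
    """Parameters from the proof, for an admissible n."""
    k = kappa(n)
    assert n % (1 << k) == 0, "n must be admissible"
    b = n >> k                        # chunk size in bits
    p = 2 * b + 2 * k + lg(k) + 10    # working precision
    r = lg(lg(n)) ** 2                # short transforms have length 2^r
    d = -(-k // r)                    # number of groups of layers
    n_prime = alpha((2 * p + r + 2) << r)  # size of the recursive problem
    return {"k": k, "b": b, "p": p, "r": r, "d": d, "n_prime": n_prime}

if __name__ == "__main__":
    n = alpha(1 << 1000)
    q = parameters(n)
    print({key: (lg(v) if key == "n_prime" else v) for key, v in q.items()})
```

  For <math|n=2<rsup|1000>> this gives <math|k=981>,
  <math|b=2<rsup|19>>, <math|r=100>, <math|d=10> and
  <math|lg n<rprime|'>=122>, so a single recursive step already reduces
  <math|lg n> from 1000 to 122 under these assumptions.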

  Now we may prove the main theorem announced in the introduction.

  <\render-proof|Proof of Theorem <reference|main-thm>>
    Let <math|x<rsub|0>> and <math|\<Phi\><around*|(|x|)>> be as in
    Theorem<nbsp><reference|main-lem>. Increasing <math|x<rsub|0>> if
    necessary, by Lemma<nbsp><reference|phi-bound> we may assume that
    <math|\<Phi\><around*|(|x|)>\<leqslant\>x-1> for
    <math|x\<gtr\>x<rsub|0>>, and that <math|x<rsub|0>\<geqslant\>exp
    <around*|(|exp <around*|(|1|)>|)>>.

    Let <math|T<around*|(|n|)>\<assign\><math-ss|C><around*|(|n|)>/<around*|(|n*lg
    n|)>> for admissible <math|n\<geqslant\>3>. By the theorem, there exist
    constants <math|B,L\<gtr\>0> such that for all admissible
    <math|n\<gtr\>x<rsub|0>>, there exists an admissible
    <math|n<rprime|'>\<leqslant\>\<Phi\><around*|(|n|)>> with

    <\eqnarray*>
      <tformat|<table|<row|<cell|T<around*|(|n|)>>|<cell|\<leqslant\>>|<cell|8*<around*|(|1+<frac|B|log
      log n>|)>*T<around*|(|n<rprime|'>|)>+L.>>>>
    </eqnarray*>

    Increasing <math|L> if necessary, we may also assume that
    <math|T<around*|(|n|)>\<leqslant\>L> for all admissible
    <math|n\<leqslant\>x<rsub|0>>. Taking <math|\<cal-S\>> to be the set of
    admissible integers, we apply Proposition<nbsp><reference|slow-rec-lem>
    with <math|K\<assign\>8>, <math|\<sigma\>\<assign\>x<rsub|0>>,
    <math|\<ell\>\<assign\>2>, and for each admissible
    <math|n\<gtr\>x<rsub|0>> setting <math|d\<assign\>1>,
    <math|\<gamma\><rsub|1>\<assign\>1>, <math|y\<assign\>n> and
    <math|y<rsub|1>\<assign\>n<rprime|'>> as above. We conclude that
    <math|T<around*|(|n|)>=O<around*|(|8<rsup|log<rsup|\<ast\>> n>|)>>, and
    hence <math|<math-ss|C><around*|(|n|)>=O<around*|(|n*lg
    n*8<rsup|log<rsup|\<ast\>> n>|)>> as<nbsp><math|n> runs over admissible
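    integers. Since we already pointed out that
    <math|<math-ss|I><around*|(|k|)>\<leqslant\>3*<math-ss|C><around*|(|O<around*|(|k|)>|)>+O<around*|(|k|)>>,
    where the argument of <math|<math-ss|C>> is the admissible integer
    <math|\<alpha\><around*|(|2*k+1|)>>, the theorem follows.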
  </render-proof>

  <section|An optimised variant of Fürer's algorithm><label|Furer-sec>

  As pointed out in the introduction, Fürer proved that
  <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*K<rsup|log<rsup|\<ast\>> n>|)>> for some <math|K\<gtr\>1>, but did not
  give an explicit bound for <math|K>. In this section we sketch an argument
  showing that one may achieve <math|K=16> in Fürer's algorithm, by reusing
  tools from previous sections, especially
  section<nbsp><reference|even-faster-sec>.

  At the core of Fürer's algorithm is the ring
  <math|R=\<bbb-C\><around*|[|X|]>/<around|(|X<rsup|2<rsup|r-1>>+1|)>>, which
  contains the principal <math|2<rsup|r>><nbhyph>th root of unity <math|X>.
  Note that <math|R> is a direct sum of <math|2<rsup|r-1>> copies of
  <math|\<bbb-C\>>, and hence not a field (for <math|r\<geqslant\>2>). A
  crucial observation is that <math|X> is a ``fast'' root of unity, in the
  sense that multiplication by <math|X> and its powers can be achieved in
  linear time, as in Schönhage--Strassen's algorithm. For any
  <math|k\<gtr\>r>, we need to construct a <math|2<rsup|k-r>>-th root
  <math|\<omega\>> of <math|X>, which is itself a <math|2<rsup|k>>-th
  principal root of unity. We recall Fürer's construction of <math|\<omega\>>
  as follows.

  <\lemma>
    <label|Furer-lem>With <math|R> as above, let <math|\<varrho\>=exp
    <frac|2*\<mathpi\>*\<mathi\>|2<rsup|k>>> and <math|\<sigma\>=exp
    <frac|2*\<mathpi\>*\<mathi\>|2<rsup|r>>>. Then

    \;

    <\eqnarray*>
      <tformat|<table|<row|<cell|\<omega\>>|<cell|\<assign\>>|<cell|<big|sum><rsub|i=0><rsup|2<rsup|r-1>-1>\<varrho\><rsup|2*i+1>*<frac|<big|prod><rsub|j\<neq\>i><around*|(|X-\<sigma\><rsup|2*j+1>|)>|<big|prod><rsub|j\<neq\>i><around*|(|\<sigma\><rsup|2*i+1>-\<sigma\><rsup|2*j+1>|)>><space|1em>\<in\>R>>>>
    </eqnarray*>

    is a principal <math|2<rsup|k>>-th root of unity with
    <math|\<omega\><rsup|2<rsup|k-r>>=X>. The coefficients of
    <math|\<omega\>> have absolute value <math|\<leqslant\>1>.
  </lemma>

  <\proof>
    See <cite-detail|Furer2009|Section 4>.
  </proof>
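
  The construction of Lemma<nbsp><reference|Furer-lem> can be checked
  numerically for small parameters. The following Python sketch builds
  <math|\<omega\>> from the Lagrange interpolation formula in floating
  point arithmetic and verifies that
  <math|\<omega\><rsup|2<rsup|k-r>>=X> holds up to rounding error. The
  helper names and the sample values <math|r=3>, <math|k=6> are ours, and
  the code is only a sanity check, not part of the algorithm.

```python
import cmath

def furer_omega(r, k):
    """Coefficients (low degree first) of omega in R = C[X]/(X^(2^(r-1)) + 1)."""
    m = 1 << (r - 1)
    rho = cmath.exp(2j * cmath.pi / (1 << k))
    sigma = cmath.exp(2j * cmath.pi / (1 << r))
    nodes = [sigma ** (2 * i + 1) for i in range(m)]  # roots of X^m + 1
    omega = [0j] * m
    for i in range(m):
        num = [1 + 0j]  # running product of (X - nodes[j]) over j != i
        den = 1 + 0j
        for j in range(m):
            if j == i:
                continue
            shifted = [0j] + num                         # X * num
            scaled = [nodes[j] * c for c in num] + [0j]  # nodes[j] * num
            num = [s - t for s, t in zip(shifted, scaled)]
            den *= nodes[i] - nodes[j]
        weight = rho ** (2 * i + 1) / den
        for t in range(m):
            omega[t] += weight * num[t]
    return omega

def mul_R(f, g):
    """Product in R, i.e. the polynomial product reduced using X^m = -1."""
    m = len(f)
    h = [0j] * m
    for s in range(m):
        for t in range(m):
            if s + t < m:
                h[s + t] += f[s] * g[t]
            else:
                h[s + t - m] -= f[s] * g[t]
    return h

def power_of_two_power(f, e):
    """Return f^(2^e) in R by repeated squaring."""
    for _ in range(e):
        f = mul_R(f, f)
    return f

if __name__ == "__main__":
    r, k = 3, 6
    omega = furer_omega(r, k)
    x = power_of_two_power(omega, k - r)
    print([round(abs(c), 6) for c in omega])
    print([complex(round(c.real, 6), round(c.imag, 6)) for c in x])
```

  The check works because <math|R> is isomorphic to a direct sum of copies
  of <math|\<bbb-C\>> via evaluation at the points
  <math|\<sigma\><rsup|2*i+1>>, and <math|\<omega\>> is the element taking
  the value <math|\<varrho\><rsup|2*i+1>> at the <math|i>-th point.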

  As our basic recursive problem, we will consider multiplication in
  <math|<around*|(|\<bbb-Z\>/<around*|(|2<rsup|n>+1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>,
  where <math|n> is divisible by a high power of two. We will refer to the
  last property as ``admissibility'', but we will not define it precisely. We
  write <math|<math-ss|C><rsub|t><around*|(|n|)>> for the cost of
  <math|t\<geqslant\>1> such products with one fixed argument, and
  <math|<math-ss|C><around*|(|n|)>\<assign\>sup<rsub|t\<geqslant\>1>
  <math-ss|C><rsub|t><around*|(|n|)>/<around*|(|2*t+1|)>> for the normalised
  cost, exactly as in section<nbsp><reference|even-faster-sec>.

  Fürer worked with <math|\<bbb-Z\>/<around*|(|2<rsup|n>+1|)>*\<bbb-Z\>>
  rather than <math|<around*|(|\<bbb-Z\>/<around*|(|2<rsup|n>+1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>,
  but, since we are interested in constant factors, and since the recursive
  multiplication step involves multiplication of complex quantities, it
  simplifies the exposition to work systematically with complexified objects
  everywhere.

  For suitable parameters <math|r> and <math|k>, we will encode elements of
  <math|<around*|(|\<bbb-Z\>/<around*|(|2<rsup|n>+1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>
  as (nega)cyclic polynomials in <math|R<around*|[|Y|]>/<around*|(|Y<rsup|2<rsup|k>>+1|)>>,
  where <math|R\<assign\>\<bbb-C\><around*|[|X|]>/<around|(|X<rsup|2<rsup|r-1>>+1|)>>
  as above. We choose the parameters later; for now we require only that
  <math|2<rsup|k+r-2>> divides <math|n> and that
  <math|b\<assign\>n/2<rsup|k+r-2>\<geqslant\>lg n> (so that the coefficients
  are not too small).

  The encoding proceeds as follows. Given
  <math|a\<in\>\<bbb-Z\>/<around*|(|2<rsup|n>+1|)>*\<bbb-Z\>>, we split
  <math|a> into <math|2<rsup|k>> parts <rigid|<math|a<rsub|0>,\<ldots\>,a<rsub|2<rsup|k>-1>>>
  of <math|n/2<rsup|k>> bits. Each <math|a<rsub|i>> is cut into
  <math|2<rsup|r-2>> even smaller pieces <math|a<rsub|i,0>,\<ldots\>,a<rsub|i,2<rsup|r-2>-1>>
  of <math|b> bits. Then <math|a> is encoded as

  <\eqnarray*>
    <tformat|<table|<row|<cell|<wide|a|~>>|<cell|\<assign\>>|<cell|<big|sum><rsub|i=0><rsup|2<rsup|k>-1><big|sum><rsub|j=0><rsup|2<rsup|r-2>-1>a<rsub|i,j>*X<rsup|j>*Y<rsup|i>,>>>>
  </eqnarray*>

  and an element <math|u=x+y*\<mathi\>\<in\><around*|(|\<bbb-Z\>/<around*|(|2<rsup|n>+1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>
  is encoded as <math|<wide|u|~>\<assign\><wide|x|~>+<wide|y|~>*\<mathi\>>.
  (Notice that the coefficients of <math|X<rsup|j>> are zero for
  <math|2<rsup|r-2>\<leqslant\>j\<less\>2<rsup|r-1>>; this zero-padding is
  the price Fürer pays for introducing artificial roots of unity.)

  We represent complex coefficients by elements of
  <math|\<bbb-C\><rsub|p>*2<rsup|e>> for a suitable precision parameter
  <math|p>. The exponent <math|e> varies during the algorithm, as explained
  in<nbsp><cite|Furer2009>; nevertheless, additions and subtractions only
  occur for numbers with the same exponent, as in the algorithms from
  sections<nbsp><reference|simple-algo-sec><nbsp>and<nbsp><reference|even-faster-sec>.

  Given <math|u,v\<in\><around*|(|\<bbb-Z\>/<around*|(|2<rsup|n>+1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>,
  to successfully recover the product <math|u*v> from the polynomial product
  <math|<wide|u|~>*<wide|v|~>\<in\>R<around*|[|Y|]>/<around*|(|Y<rsup|2<rsup|k>>+1|)>>,
  we must choose <math|p\<geqslant\>2*b+k+r+h>, where <math|h> is an
  allowance for numerical error. Certainly <math|r\<leqslant\>k\<leqslant\>lg
  n>, and, as shown by Fürer, we may also take <math|h=O<around*|(|lg n|)>>
  (an analogous conclusion is reached in sections<nbsp><reference|simple-algo-sec><nbsp>and<nbsp><reference|even-faster-sec>).
  Thus we may assume that <math|p=2*b+O<around*|(|lg n|)>>.

  We must now show how to compute a product
  <math|<wide|u|~>*<wide|v|~>\<nocomma\>>, for
  <math|<wide|u|~>,<wide|v|~>\<in\>R<around*|[|Y|]>/<around*|(|Y<rsup|2<rsup|k>>+1|)>>.
  Fürer handles these types of multiplications using ``half-DFTs'', i.e.,
  DFTs that evaluate at odd powers of<nbsp><math|\<eta\>>, where
  <math|\<eta\>\<in\>R> is a principal <math|2<rsup|k+1>>-th root of unity
  such that <math|\<eta\><rsup|2<rsup|k+1-r>>=X>
  (Lemma<nbsp><reference|Furer-lem>). To keep terminology and notation
  consistent with previous sections, we prefer to make the substitution
  <math|U<around*|(|X,Y|)>\<assign\><wide|u|~><around*|(|X,\<eta\>*Y|)>>,
  i.e., writing <math|<wide|u|~>=<big|sum><rsub|i=0><rsup|2<rsup|k>-1><wide|u|~><rsub|i><around*|(|X|)>*Y<rsup|i>>,
  we put <math|U\<assign\><big|sum><rsub|i><around*|(|<wide|u|~><rsub|i>*\<eta\><rsup|i>|)>*Y<rsup|i>>,
  and similarly for <math|<wide|v|~>> and <math|V>. This reduces the problem
  to computing the product <math|U*V> in <math|R<around*|[|Y|]>/<around|(|Y<rsup|2<rsup|k>>-1|)>>.
  The change of variable imposes a cost of
  <math|O<around*|(|2<rsup|k>*<math-ss|m><rsub|R>|)>>, where
  <math|<math-ss|m><rsub|R>> is the cost of a multiplication in <math|R>.

  So now consider a product <math|U*V>, where
  <math|U,V\<in\>R<around*|[|Y|]>/<around|(|Y<rsup|2<rsup|k>>-1|)>>. Let
  <math|\<omega\>\<assign\>\<eta\><rsup|2>>, so that
  <math|\<omega\><rsup|2<rsup|k-r>>=X>. Let
  <math|d\<assign\><around|\<lceil\>|k/r|\<rceil\>>>, and write
  <math|k=r<rsub|1>+\<cdots\>+r<rsub|d>> with <math|r<rsub|i>\<assign\>r> for
  <math|i\<leqslant\>d-1> and <math|r<rsub|d>\<assign\>k-<around*|(|d-1|)>*r\<leqslant\>r>.
  For each <math|i\<nocomma\>>, let <math|\<cal-A\><rsub|i>> be the algorithm
  for DFTs of length <math|2<rsup|r<rsub|i>>> that applies the usual
  Cooley--Tukey method, taking advantage of the fast
  <math|2<rsup|r<rsub|i>>>-th root of unity<nbsp><math|X<rsup|2<rsup|r-r<rsub|i>>>.>
  The complexity of <math|\<cal-A\><rsub|i>> is
  <math|O<around*|(|2<rsup|r<rsub|i>+r>*r<rsub|i>*p|)>>, since it performs
  <math|O<around*|(|2<rsup|r<rsub|i>>*r<rsub|i>|)>> linear-time operations on
  objects of bit size <math|O<around*|(|2<rsup|r>*p|)>>. Let
  <math|<math-ss|D>> be the complexity of the algorithm
  <math|\<cal-A\>\<assign\>\<cal-A\><rsub|1>\<odot\>\<cdots\>\<odot\>\<cal-A\><rsub|d>>
  for DFTs of length<nbsp><math|2<rsup|k>> over <math|R>.
  Then<nbsp><eqref|fft-rec-bound2> yields

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|D>>|<cell|\<leqslant\>>|<cell|O<around*|(|<big|sum><rsup|d><rsub|i=1>2<rsup|k-r<rsub|i>>*2<rsup|r<rsub|i>+r>*r<rsub|i>*p|)>+<around*|\<lceil\>|<frac|k|r>|\<rceil\>>*2<rsup|k>*<math-ss|m><rsub|R>+O<around*|(|n*lg
    n|)>.>>>>
  </eqnarray*>

  The first term is bounded by <math|O<around*|(|d*2<rsup|k>*2<rsup|r>*r*p|)>=O<around*|(|<around*|(|2<rsup|k+r>*p|)>*k|)>=O<around*|(|n*lg
  n|)>>, since <math|p=O<around*|(|b|)>>.

  Let us now consider the second term <math|<around*|\<lceil\>|k/r|\<rceil\>>*2<rsup|k>*<math-ss|m><rsub|R>>,
  which describes the cost of the twiddle factor multiplications. This term
  turns out to be the dominant one. Both Kronecker substitution and FFT
  multiplication may be considered for multiplication in<nbsp><math|R>, but
  it turns out that Kronecker substitution is faster (a similar phenomenon
  was noted in Remark<nbsp><reference|faster-rem>). So we reduce
  multiplication in <math|R> to multiplication in
  <math|<around*|(|\<bbb-Z\>/<around*|(|2<rsup|n<rprime|'>>+1|)>*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>
  where <math|n<rprime|'>\<geqslant\>2<rsup|r-1>*<around*|(|2*p+r+2|)>> is
  admissible and divisible by <math|2<rsup|r-1>>. For any reasonable
  definition of admissibility we then have
  <math|n<rprime|'>=<around*|(|1+o<around*|(|1|)>|)>*2<rsup|r>*p>, provided
  that <math|r> is somewhat smaller than <math|p>. (In the interests of
  brevity, we will not specify the <math|o<around*|(|1|)>> terms for the
  remainder of the argument. They can all be controlled along the lines of
  section<nbsp><reference|even-faster-sec>.) Most of the twiddle factors are
  reused many times, so we will assume that
  <math|<math-ss|m><rsub|R>=<around*|(|2+o<around*|(|1|)>|)>*<math-ss|C><around*|(|n<rprime|'>|)>>,
  where the factor <math|2> counts the two (rather than three) DFTs needed
  for each multiplication of size<nbsp><math|n<rprime|'>>. The term of
  interest then becomes

  <\eqnarray*>
    <tformat|<table|<row|<cell|<around*|\<lceil\>|<frac|k|r>|\<rceil\>>*2<rsup|k>*<math-ss|m><rsub|R>>|<cell|=>|<cell|<around*|(|2+o<around*|(|1|)>|)>*<frac|r+lg
    p|r>*<frac|2<rsup|k+r>*p*k|n<rprime|'>*lg
    n<rprime|'>>*<math-ss|C><around*|(|n<rprime|'>|)>.>>>>
  </eqnarray*>

  Since <math|p=2*b+O<around*|(|lg n|)>=<around*|(|2+O<around*|(|<frac|lg
  n|b>|)>|)>*b> and <math|2<rsup|k+r>*b=4*n>, this yields

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|D>>|<cell|\<leqslant\>>|<cell|<around*|(|16+o<around*|(|1|)>|)>*<around*|(|1+O<around*|(|<frac|lg
    n|b>|)>|)>*<frac|r+lg p|r>*<frac|n*lg n|n<rprime|'>*lg
    n<rprime|'>>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|n*lg
    n|)>.>>>>
  </eqnarray*>

  To minimise the leading constant, we must choose <math|b> to grow faster
  than <math|lg n>, and <math|r> to grow faster than <math|lg p>. For
  example, taking <math|r\<assign\><around*|(|lg lg n|)><rsup|2>> and
  <math|k\<assign\>lg n-r-lg <around*|(|lg<rsup|2> n|)>> leads to
  <math|b=4*n/2<rsup|k+r>\<asymp\>lg<rsup|2> n> and <math|lg p\<asymp\>lg
  b\<asymp\>lg lg n>. The function mapping <math|n> to <math|n<rprime|'>> is
  then bounded by a logarithmically slow function, and a similar argument to
  section<nbsp><reference|even-faster-sec> shows that
  <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*16<rsup|log<rsup|\<asterisk\>> n>|)>>.

  <section|Fast multiplication using modular arithmetic><label|param-sec>

  Shortly after Fürer's algorithm appeared, De et al.<nbsp><cite|DeKuSaSa2013>
  presented a variant based on modular arithmetic that also achieves the
  complexity bound <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*K<rsup|log<rsup|\<asterisk\>> n>|)>> for some <math|K\<gtr\>1>. Roughly
  speaking, they replace the coefficient ring <math|\<bbb-C\>> with the field
  <math|\<bbb-Q\><rsub|p>> of <math|p><nbhyph>adic numbers, for
  a<nbsp>suitable prime<nbsp><math|p>. In this context, working to ``finite
  precision'' means performing computations in
  <math|\<bbb-Z\>/p<rsup|\<lambda\>>*\<bbb-Z\>>, where
  <math|\<lambda\>\<geqslant\>1> is a precision parameter.

  The main advantage of this approach is that the error analysis becomes
  trivial; indeed <math|\<bbb-Z\>/p<rsup|\<lambda\>>*\<bbb-Z\>> is a ring
  (unlike our <math|\<bbb-C\><rsub|p>>), and arithmetic operations never lead
  to precision loss (unless one divides by <math|p>, which never happens in
  these algorithms). The main disadvantage is that there are certain
  technical difficulties associated with finding an appropriate <math|p>;
  this is discussed in section<nbsp><reference|computing-p-sec> below.

  The aim of this section is to sketch an analogue of the algorithm of
  section<nbsp><reference|even-faster-sec> that achieves
  <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*8<rsup|log<rsup|\<asterisk\>> n>|)>> using modular arithmetic instead of
  <math|\<bbb-C\>>. We assume familiarity with <math|p>-adic numbers,
  referring the reader to <cite|Gou-p-adic> for an elementary introduction.

  <subsection|Sketch of the algorithm><label|modular-sketch-sec>

  For the basic problem, we take multiplication in
  <math|\<bbb-Z\>/<around*|\<nobracket\>|<around*|(|2<rsup|n>-1|)>*\<bbb-Z\>|\<nobracket\>>>,
  where <math|n> is admissible (in the sense of
  section<nbsp><reference|even-faster-sec>) and where one of the arguments is
  fixed over <math|t\<geqslant\>1> multiplications. As before, we take
  <math|k\<assign\>\<kappa\><around*|(|n|)>>, and cut the inputs into chunks
  of <math|b\<assign\>n/2<rsup|k>=O<around*|(|lg<rsup|2> n|)>> bits. Thus we
  reduce to multiplying polynomials in <math|\<bbb-Z\><around*|[|X|]>/<around|(|X<rsup|2<rsup|k>>-1|)>>
  with coefficients of at most <math|b> bits. The coefficients of the product
  have at most <math|2*b+k> bits.
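
  The reduction to <math|\<bbb-Z\><around*|[|X|]>/<around|(|X<rsup|2<rsup|k>>-1|)>>
  can be illustrated with exact integer arithmetic. The following Python
  sketch cuts residues modulo <math|2<rsup|n>-1> into <math|2<rsup|k>>
  chunks of <math|b> bits, multiplies the resulting polynomials cyclically
  by the schoolbook method (a stand-in for the DFT-based product used in
  the actual algorithm), and evaluates the result at <math|X=2<rsup|b>>.
  The function names are ours, and the sample sizes used for testing are
  not admissible in the sense of this paper.

```python
def split(w, k, b):
    """Cut 0 <= w < 2^(b * 2^k) into 2^k chunks of b bits, low chunk first."""
    mask = (1 << b) - 1
    return [(w >> (i * b)) & mask for i in range(1 << k)]

def cyclic_product(f, g):
    """Schoolbook product in Z[X]/(X^N - 1), where N = len(f) = len(g)."""
    N = len(f)
    h = [0] * N
    for i in range(N):
        for j in range(N):
            h[(i + j) % N] += f[i] * g[j]
    return h

def mul_mod_mersenne(u, v, k, b):
    """Multiply u and v modulo 2^n - 1, where n = b * 2^k."""
    n = b << k
    h = cyclic_product(split(u, k, b), split(v, k, b))
    # Each coefficient is a sum of 2^k products of b-bit numbers.
    assert all(c < 1 << (2 * b + k) for c in h)
    # Evaluate at X = 2^b; X^(2^k) = 2^n is congruent to 1 modulo 2^n - 1.
    return sum(c << (i * b) for i, c in enumerate(h)) % ((1 << n) - 1)

if __name__ == "__main__":
    u, v = 0x123456789ABCDEF0, 0xFEDCBA9876543210
    print(mul_mod_mersenne(u, v, 3, 8) == (u * v) % (2**64 - 1))
```

  The internal assertion reflects the bound of <math|2*b+k> bits on the
  coefficients of the product.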

  Let <math|p> be a prime such that <math|p=1 <pmod|2<rsup|k>>>, so that
  <math|\<bbb-Q\><rsub|p>> contains a primitive <math|2<rsup|k>>-th root of
  unity<nbsp><math|\<omega\>>. The problem of finding such <math|p> and
  <math|\<omega\>> is discussed in the next section; for now we assume only
  that <math|lg p=O<around*|(|lg n|)>>. We may then embed the multiplication
  problem into <math|\<bbb-Q\><rsub|p><around*|[|X|]>/<around|(|X<rsup|2<rsup|k>>-1|)>>,
  and use DFTs with respect to <math|\<omega\>> to compute the product. On a
  Turing machine, we cannot represent elements of <math|\<bbb-Q\><rsub|p>>
  exactly, so we perform all computations in
  <math|\<bbb-Z\>/p<rsup|\<lambda\>>*\<bbb-Z\>> where

  <\eqnarray*>
    <tformat|<table|<row|<cell|\<lambda\>>|<cell|\<assign\>>|<cell|<around*|\<lceil\>|<frac|2*b+k|<around*|(|lg
    p|)>-1>|\<rceil\>>.>>>>
  </eqnarray*>

  This choice ensures that <math|lg<around*|(|p<rsup|\<lambda\>>|)>\<geqslant\>2*b+k>,
  so knowledge of the product in <math|<around*|(|\<bbb-Z\>/p<rsup|\<lambda\>>*\<bbb-Z\>|)><around*|[|X|]>/<around|(|X<rsup|2<rsup|k>>-1|)>>
  determines it unambiguously in <math|\<bbb-Z\><around*|[|X|]>/<around|(|X<rsup|2<rsup|k>>-1|)>>.
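
  As a sanity check on this choice of <math|\<lambda\>>, the following
  minimal Python sketch computes <math|\<lambda\>> for sample parameters and
  verifies that <math|p<rsup|\<lambda\>>\<geqslant\>2<rsup|2*b+k>>. It
  assumes that <math|lg x> denotes <math|<around*|\<lceil\>|log<rsub|2>
  x|\<rceil\>>>; the function names and the sample values of <math|b>,
  <math|k> and <math|p> are illustrative only and do not come from the text.

```python
def lg(x):
    """ceil(log2(x)) for an integer x >= 1 (assumed meaning of lg)."""
    return (x - 1).bit_length()


def precision(b, k, p):
    """lambda := ceil((2b + k) / (lg(p) - 1)); requires lg(p) > 1."""
    d = lg(p) - 1
    return -(-(2 * b + k) // d)  # ceiling division on integers


# Hypothetical sample parameters: p = 12289 = 3 * 2**12 + 1 is a prime
# congruent to 1 modulo 2**10.
b, k, p = 100, 10, 12289
lam = precision(b, k, p)

# p > 2**(lg(p) - 1), hence p**lam >= 2**(2b + k): coefficients of the
# product (at most 2b + k bits) are determined by their residue mod p**lam.
assert p ** lam >= 2 ** (2 * b + k)
```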

  To compute each DFT, we first use the Cooley--Tukey algorithm to decompose
  it into ``short transforms'' of length <math|2<rsup|r>>, where
  <math|r\<assign\><around*|(|lg lg n|)><rsup|2>>. (As in
  section<nbsp><reference|even-faster-sec>, there are also residual
  transforms of length <math|2<rsup|r<rsub|d>>\<nocomma\>> for some
  <math|r<rsub|d>\<leqslant\>r>, whose contribution to the complexity is
  negligible.) Multiplications in <math|\<bbb-Z\>/p<rsup|\<lambda\>>*\<bbb-Z\>\<nocomma\>>,
  such as the multiplications by twiddle factors, are handled using
  Schönhage--Strassen's algorithm, with the divisions by
  <math|p<rsup|\<lambda\>>> being reduced to multiplication via Newton's
  method. We then use Bluestein's algorithm to convert each short transform
  to a cyclic convolution of length <math|2<rsup|r>> over
  <math|\<bbb-Z\>/p<rsup|\<lambda\>>*\<bbb-Z\>\<nocomma\>>, and apply
  Kronecker substitution to convert this to multiplication in
  <math|\<bbb-Z\>/<around*|(|2<rsup|n<rprime|'>>-1|)>*\<bbb-Z\>>,
  where<nbsp><math|n<rprime|'>> is the smallest admissible integer exceeding
  <math|2<rsup|r>*<around*|(|2*\<lambda\>*lg p+r|)>>. This multiplication is
  then handled recursively.

  Now, since <math|lg p=O<around*|(|lg n|)>>, <math|lg p\<geqslant\>k>,
  <math|b\<asymp\>lg<rsup|2> n> and <math|k=O<around*|(|lg n|)>>, we have
  <math|\<lambda\>=<around*|(|2+O<around*|(|1/lg n|)>|)>*b/lg p>, and hence
  <math|n<rprime|'>=<around*|(|4+O<around*|(|1/lg lg
  n|)>|)>*b*2<rsup|r>\<nocomma\>>, just as in
  section<nbsp><reference|even-faster-sec>. The rest of the complexity
  analysis follows exactly as in the proof of
  Theorem<nbsp><reference|main-lem>, except for the computation of <math|p>
  and <math|\<omega\>>, which is considered below.

  <\remark>
    The role of the precision parameter <math|\<lambda\>> is to give some
    extra flexibility regarding the choice of <math|p>. If there were an
    efficient way to find a prime <math|p=1 <pmod|2<rsup|k>>> larger than
    <math|2<rsup|2*b+k>> (but not too much larger), and an efficient way to
    find a suitable <math|2<rsup|k>>-th root of unity modulo <math|p>, then
    we could always take <math|\<lambda\>\<assign\>1> and obtain an algorithm
    working directly over the finite field <math|\<bbb-F\><rsub|p>>.
  </remark>

  <subsection|Computing suitable <math|p> and
  <math|\<omega\>>><label|computing-p-sec>

  Given a transform length <math|2<rsup|k>> for <math|k\<geqslant\>1>, our
  aim is to find a prime <math|p> such that <math|p=1
  <pmod|2<rsup|k>>\<nocomma\>>, i.e., such that<nbsp><math|2<rsup|k>> divides
  <math|p-1>. Denote by <math|p<rsub|0><around*|(|k|)>> the smallest such
  prime.

  Heath-Brown has conjectured that <math|p<rsub|0><around*|(|k|)>=O<around*|(|2<rsup|k>*k<rsup|2>|)>>
  <cite|HB-almost>, but given the current state of knowledge in number
  theory, we are only able to prove a result of the following type.

  <\lemma>
    <label|Linnik-lem>For all sufficiently large <math|k> we have
    <math|p<rsub|0><around*|(|k|)>\<less\>2<rsup|6*k>>, and we may
    compute<nbsp><math|p<rsub|0><around*|(|k|)>> in time
    <math|O<around|(|2<rsup|5*k>*k<rsup|O<around|(|1|)>>|)>>.
  </lemma>

  <\proof>
    This is a special case of Linnik's theorem<nbsp><cite|Lin44a|Lin44b>,
    which states that there exist constants <math|C> and <math|L> such that
    for any <math|a,b\<in\>\<bbb-N\>> with <math|gcd <around|(|a,b|)>=1>,
    there exists a prime number <math|p=a <pmod|b>> with
    <math|p\<less\>C*b<rsup|L>>. The best currently known estimate
    <math|L\<leqslant\>5.2> for <math|L> is due to
    Xylouris<nbsp><cite|Xyl11>. Applying this result for <math|a=1> and
    <math|b=2<rsup|k>>, we get the bound <math|p\<less\>2<rsup|6*k>> for
    large enough <math|k>. The complexity bound follows by testing
    <math|2<rsup|k>+1,2\<cdot\>2<rsup|k>+1,3\<cdot\>2<rsup|k>+1,\<ldots\>>
    for primality until we find <math|p>, using a<nbsp>polynomial time
    primality test<nbsp><cite|AKS04>.
  </proof>
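
  The search described in the proof can be sketched as follows. This is a
  minimal illustration, not the procedure whose complexity is analysed
  above: trial division is used as a stand-in for the polynomial time
  primality test, which is only reasonable for small <math|k>, and the
  function names are hypothetical.

```python
def is_prime(m):
    """Trial division; a simple stand-in for a polynomial time test."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def smallest_prime_1_mod_2k(k):
    """Return p_0(k), the smallest prime p with p = 1 (mod 2**k).

    Candidates 2**k + 1, 2 * 2**k + 1, 3 * 2**k + 1, ... are tested in
    increasing order, exactly as in the proof.
    """
    j = 1
    while not is_prime(j * 2 ** k + 1):
        j += 1
    return j * 2 ** k + 1
```

  For instance, <verbatim|smallest_prime_1_mod_2k(10)> returns
  <math|12289=12\<cdot\>2<rsup|10>+1>.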

  The difficulty with this result <emdash> already noted
  in<nbsp><cite|DeKuSaSa2013> <emdash> is that the time required to
  find<nbsp><math|p> greatly exceeds the time bound we are trying to prove
  for <math|<math-ss|I><around*|(|n|)>>!

  To avoid this problem, De et al. suggested using a multivariate splitting,
  i.e., encoding each integer as a polynomial in
  <math|\<bbb-Z\><around*|[|X<rsub|1>,\<ldots\>,X<rsub|m>|]>> for suitable
  <math|m>, say <math|m\<geqslant\>7>. One then uses
  <math|m><nbhyph>dimensional DFTs to multiply the polynomials. Since the
  transform length is shorter, one can get away with a smaller <math|p>.
  Unfortunately, this introduces further zero-padding and leads to a larger
  value of <math|K>, ruining our attempt to achieve the bound
  <math|O<around*|(|n*log n*8<rsup|log<rsup|\<asterisk\>> n>|)>>.

  On the other hand, we note that the problem only really occurs at the top
  recursion level. Indeed, at deeper recursion levels, there is
  <with|font-shape|italic|exponentially more time> available at the previous
  level to compute <math|p>. So one possible workaround is to use a
  different, sufficiently fast algorithm at the top level, such as Fürer's
  algorithm, and then switch to the algorithm sketched in
  section<nbsp><reference|modular-sketch-sec> for the remaining levels. In
  this way one still obtains the bound <math|O<around*|(|n*log
  n*8<rsup|log<rsup|\<asterisk\>> n>|)>>, and asymptotically almost all of
  the computation is done using the algorithm of
  section<nbsp><reference|modular-sketch-sec>.

  If one insists on avoiding <math|\<bbb-C\>> entirely, there are still many
  choices: one could use the algorithm of De et al. at the top level, or use a
  multivariate version of the algorithm of
  section<nbsp><reference|modular-sketch-sec>. One could even use the
  Schönhage--Strassen algorithm, whose main recursive step yields the bound
  <math|<math-ss|I><around*|(|n|)>=O<around|(|n<rsup|1/2>*<math-ss|I><around|(|n<rsup|1/2>|)>+n*log
  n|)>>; applying this three times gives <math|<math-ss|I><around*|(|n|)>=O<around|(|n<rsup|7/8>*<math-ss|I><around|(|n<rsup|1/8>|)>+n*log
  n|)>>, and then to multiply integers with <math|n<rsup|1/8>> bits, one can
  find a suitable prime using Lemma<nbsp><reference|Linnik-lem> in time
  <math|O<around|(|n<rsup|3/4+o<around*|(|1|)>>|)>=O<around*|(|n|)>>.

  Another way to work around the problem is to assume the generalised Riemann
  hypothesis (GRH). De et al. pointed out that under GRH, it is possible to
  find a suitable prime efficiently using a randomised algorithm. Here we
  show that, under GRH, we can even use deterministic algorithms.

  <\lemma>
    <label|GRH>Assume GRH. Then <math|p<rsub|0><around*|(|k|)>=O<around*|(|2<rsup|2*k>*k<rsup|2>|)>>,
    and we may compute<nbsp><math|p<rsub|0><around*|(|k|)>> in time
    <math|O<around|(|2<rsup|k>*k<rsup|O<around|(|1|)>>|)>>.
  </lemma>

  <\proof>
    The first bound is given in <cite|HB-linnik>, and the complexity bound
    follows similarly to the proof of Lemma<nbsp><reference|Linnik-lem>.
  </proof>

  To use this result, we must modify the algorithm of
  section<nbsp><reference|modular-sketch-sec> slightly. Choose
  a<nbsp>constant <math|C\<gtr\>3> so that we can compute
  <math|p<rsub|0><around*|(|k|)>> in time
  <math|O<around*|(|2<rsup|k>*k<rsup|C>|)>>, as in
  Lemma<nbsp><reference|GRH>. Increase the coefficient size from
  <math|<around*|(|lg n|)><rsup|2>> to <math|<around*|(|lg n|)><rsup|C-1>>,
  and change the definition of admissibility accordingly. The transform
  length then decreases to <math|2<rsup|k>=O<around*|(|n/<around*|(|lg
  n|)><rsup|C-1>|)>>, and the cost of computing <math|p> decreases to only
  <math|O<around*|(|n*lg n|)>>. The rest of the complexity analysis is
  essentially unchanged; the result is an algorithm with complexity
  <math|O<around*|(|n*log n*8<rsup|log<rsup|\<asterisk\>> n>|)>>, working
  entirely with modular arithmetic, in which the top recursion level does not
  need any special treatment.

  Finally, we consider the computation of a suitable approximation to a
  <math|2<rsup|k>>-th root of unity in <math|\<bbb-Q\><rsub|p>>.

  <\lemma>
    Given <math|k,\<lambda\>\<geqslant\>1> and a prime <math|p=1
    <pmod|2<rsup|k>>\<nocomma\>>, we may find
    <math|<wide|\<omega\>|~>\<in\>\<bbb-Z\>/p<rsup|\<lambda\>>*\<bbb-Z\>>
    such that <math|<wide|\<omega\>|~>=\<omega\> <pmod|p<rsup|\<lambda\>>>>
    for some primitive <math|2<rsup|k>>-th root of unity
    <math|\<omega\>\<in\>\<bbb-Q\><rsub|p>>, in time
    <math|O<around|(|p<rsup|1/4+\<epsilon\>>+<around*|(|k*\<lambda\>*log
    p|)><rsup|1+\<epsilon\>>|)>> for any <math|\<epsilon\>\<gtr\>0>.
  </lemma>

  <\proof>
    We may find a generator <math|g> of <math|<around*|(|\<bbb-Z\>/p*\<bbb-Z\>|)><rsup|\<asterisk\>>>
    deterministically in time <math|O<around|(|p<rsup|1/4+\<epsilon\>>|)>>
    <cite|Shparlinski1996>. Then <math|<wide|\<omega\>|~><rsub|0>=g<rsup|<around*|(|p-1|)>/2<rsup|k>>>
    is a primitive <math|2<rsup|k>>-th root of unity in
    <math|\<bbb-Z\>/p*\<bbb-Z\>>, and there is a unique primitive
    <math|2<rsup|k>>-th root of unity <math|\<omega\>\<in\>\<bbb-Q\><rsub|p>>
    congruent to <math|<wide|\<omega\>|~><rsub|0>> modulo <math|p>. Given
    <math|<wide|\<omega\>|~><rsub|0>>, we may compute <math|\<omega\>
    <pmod|p<rsup|\<lambda\>>>> using fast Newton lifting in time
    <math|O<around*|(|<around*|(|k*\<lambda\>*log
    p|)><rsup|1+\<epsilon\>>|)>> <cite-detail|CFA+-handbook|Section<nbsp>12.3>.
  </proof>
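
  The two steps of the proof can be sketched in Python (version 3.8 or
  later, for modular inverses via <verbatim|pow>). Two simplifications are
  assumed here and are not part of the proof: instead of a generator we use
  the smallest quadratic non-residue <math|g>, found by brute force, which
  suffices because <math|g<rsup|<around*|(|p-1|)>/2>=-1> forces
  <math|g<rsup|<around*|(|p-1|)>/2<rsup|k>>> to have order exactly
  <math|2<rsup|k>>; and the Newton iteration is written naively, without any
  attention to the complexity bound. All function names are hypothetical.

```python
def nonresidue(p):
    """Smallest quadratic non-residue modulo an odd prime p (brute force)."""
    g = 2
    while pow(g, (p - 1) // 2, p) != p - 1:
        g += 1
    return g


def root_of_unity(k, p, lam):
    """Primitive 2**k-th root of unity modulo p**lam, for p = 1 (mod 2**k)."""
    n = 1 << k
    assert k >= 1 and (p - 1) % n == 0
    # Root of unity of order exactly 2**k modulo p.
    x = pow(nonresidue(p), (p - 1) // n, p)
    prec = 1
    while prec < lam:
        # Each Newton step for f(x) = x**n - 1 doubles the p-adic precision.
        prec = min(2 * prec, lam)
        mod = p ** prec
        fx = pow(x, n, mod) - 1
        dfx = n * pow(x, n - 1, mod)      # f'(x), a unit modulo p
        x = (x - fx * pow(dfx, -1, mod)) % mod
    return x


# Example: a primitive 16-th root of unity modulo 17**3.
w = root_of_unity(4, 17, 3)
assert pow(w, 16, 17 ** 3) == 1 and pow(w, 8, 17 ** 3) == 17 ** 3 - 1
```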

  In the context of section<nbsp><reference|modular-sketch-sec>, we may
  assume that <math|\<lambda\>=O<around|(|<around*|(|lg
  n|)><rsup|O<around*|(|1|)>>|)>> and <math|k=O<around*|(|lg n|)>>, so the
  cost of finding <math|\<omega\>> is <math|O<around|(|p<rsup|1/4+\<epsilon\>>|)>>.
  This is certainly less than the cost of finding <math|p> itself, using
  either Lemma<nbsp><reference|Linnik-lem> or Lemma<nbsp><reference|GRH>.

  <section|Conjecturally faster multiplication><label|yet-faster-sec>

  It is natural to ask whether the approaches from
  sections<nbsp><reference|even-faster-sec>,<nbsp><reference|Furer-sec>
  or<nbsp><reference|param-sec> can be further optimised, to obtain a
  complexity bound <math|<math-ss|I><around*|(|n|)>=O<around*|(|n*log
  n*K<rsup|log<rsup|\<ast\>> n>|)>> with <math|K\<less\>8>.

  In Fürer's algorithm, the complexity is dominated by the cost of
  multiplications in <math|R=\<bbb-C\><around*|[|X|]>/<around|(|X<rsup|2<rsup|r-1>>+1|)>>.
  If we could use a<nbsp>similar algorithm for a much simpler <math|R>, then
  we might achieve a<nbsp>better bound. Such an algorithm was actually given
  by Fürer<nbsp><cite|Fur89>, under the assumption that there exist
  sufficiently many Fermat primes, i.e., primes of the form
  <math|F<rsub|m>=2<rsup|2<rsup|m>>+1>. More precisely, his algorithm
  requires that there exists a positive integer <math|k> such that for every
  <math|m\<in\>\<bbb-N\>>, the sequence <math|F<rsub|m+1>,\<ldots\>,F<rsub|2<rsup|m+k>>>
  contains a prime number. The DFTs are then computed directly over
  <math|R=\<bbb-F\><rsub|F<rsub|m>>> for suitable<nbsp><math|m>, taking
  advantage of the fact that <math|\<bbb-F\><rsub|F<rsub|m>>> contains a fast
  <math|2<rsup|m+1>>-th primitive root of unity (namely the element 2) as
  well as a<nbsp><math|2<rsup|2<rsup|m>>>-th primitive root of unity. It can
  be shown that a suitably optimised version of this hypothetical algorithm
  achieves <math|K=4>: we still pay a factor of two due to the fact that we
  compute both forward and inverse transforms, and we pay another factor of
  two for the zero-padding in the recursive reduction. Unfortunately, it is
  likely that <math|F<rsub|4>=65537> is the last Fermat prime <cite|CP05>.

  In the <math|K=8> algorithm of section<nbsp><reference|even-faster-sec>, a
  potential bottleneck arises during the short transforms, when we use
  Kronecker substitution to multiply polynomials in
  <rigid|<math|\<bbb-C\><rsub|p><around*|[|X|]>/<around*|(|X<rsup|2<rsup|r>>-1|)>>>.
  We really only need the high <math|p> bits of each coefficient of the
  product (i.e., of the real and imaginary parts), but we are forced to
  allocate roughly <math|2*p> bits per coefficient in the Kronecker
  substitution, and then we discard roughly half of the output. This problem
  is similar to the well-known obstruction that prevents us from using FFT
  methods to compute a ``short product'', i.e., the high <math|n> bits or low
  <math|n> bits of the product of two <math|n><nbhyph>bit integers, any
  faster than computing the full <math|2*n> bits.

  In this section, we present a variant of the algorithm of
  section<nbsp><reference|even-faster-sec>, in which the coefficient
  ring<nbsp><math|\<bbb-C\>> is replaced by a finite field
  <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>>, where <math|p=2<rsup|q>-1>
  is a Mersenne prime. Thus ``short products'' are replaced by ``cyclic
  products'', namely by multiplications modulo <math|2<rsup|q>-1>. This saves
  a factor of two at each recursion level, and consequently reduces <math|K>
  from <math|8> to <math|4>.

  This change of coefficient ring introduces several technical complications.
  First, it is of course unknown if there are infinitely many Mersenne
  primes. Thus we are forced to rely on unproved conjectures about the
  distribution of Mersenne primes.

  Second, <math|q> is always prime (except possibly at the top recursion
  level). Thus we cannot cut up an element of
  <math|\<bbb-Z\>/p*\<bbb-Z\>> into equal-sized chunks with an
  integral number of bits, and still expect to take advantage of cyclic
  products. In other words, <math|q> is very far from being admissible in the
  sense of section<nbsp><reference|even-faster-sec>. To work around this, we deploy a variant of an
  algorithm of Crandall and Fagin<nbsp><cite|CF94>, which allows us to work
  with chunks of varying size. The Crandall--Fagin algorithm was originally
  presented over <math|\<bbb-C\>>, and depended crucially on the fact that
  <math|\<bbb-R\>> contains suitable roots of<nbsp><math|2>. In our setting,
  we work over <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>\<cong\>\<bbb-F\><rsub|<around*|(|p<rprime|'>|)><rsup|2>>>,
  where <math|p<rprime|'>=2<rsup|q<rprime|'>>-1> is a Mersenne prime
  exponentially smaller than <math|p>. Happily,
  <math|\<bbb-F\><rsub|p<rprime|'>>> contains suitable roots of <math|2>, and
  this enables us to adapt their algorithm to our setting. Moreover, since
  <math|<around*|(|p<rprime|'>|)><rsup|2>-1=2<rsup|q<rprime|'>+1>*<around*|(|2<rsup|q<rprime|'>-1>-1|)>>,
  the field <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>
  contains roots of unity of high power-of-two order, namely of order
  <math|2<rsup|q<rprime|'>+1>>, so we can perform FFTs over
  <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>> very efficiently.

  Finally, we can no longer use Kronecker substitution, as this would
  reintroduce the very zero-padding we are trying to avoid. Instead, we take
  our basic problem to be <with|font-shape|italic|polynomial> multiplication
  over <math|<around*|(|\<bbb-Z\>/p*\<bbb-Z\>|)><around*|[|\<mathi\>|]>>
  (where <math|p=2<rsup|q>-1> is not necessarily prime). After the
  Crandall--Fagin splitting step, we have a
  <with|font-shape|italic|bivariate> multiplication problem over
  <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>, which is solved
  using 2-dimensional FFTs over <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>.
  These FFTs are in turn reduced to 1<nbhyph>dimensional FFTs using standard
  methods; this dimension reduction is, roughly speaking, the analogue of
  Kronecker substitution in this algorithm. (Indeed, it is also possible to
  give an algorithm along these lines that works over <math|\<bbb-C\>> but
  avoids Kronecker substitution entirely; this still yields <math|K=8>
  because of the ``short product'' problem mentioned above.) For the
  1<nbhyph>dimensional transforms, we use the same technique as in previous
  sections: we use Cooley--Tukey's algorithm to decompose them into ``short
  transforms'' of exponentially shorter length, then use Bluestein's method
  to convert them to (univariate) polynomial products, and finally evaluate
  these products recursively.

  <subsection|Mersenne primes>

  Let <math|\<pi\><rsub|m><around*|(|x|)>> denote the number of Mersenne
  primes less than <math|x>. Based on probabilistic arguments and numerical
  evidence, Lenstra, Pomerance and Wagstaff have conjectured that

  <\eqnarray*>
    <tformat|<table|<row|<cell|\<pi\><rsub|m><around*|(|x|)>>|<cell|\<sim\>>|<cell|<frac|\<mathe\><rsup|\<matheuler\>>|log
    2>*log log x>>>>
  </eqnarray*>

  as <math|x\<rightarrow\>\<infty\>>, where
  <math|\<matheuler\>=0.5772<math-ignore|\<ldots\>>> is the Euler
  constant<nbsp><cite|Wag83|Pom-primality>. Our fast multiplication algorithm
  relies on the following slightly weaker conjecture.

  <\conjecture>
    <label|Mersenne-conj>There exist constants <math|0\<less\>a\<less\>b>
    such that for all <math|x\<gtr\>3>,

    <\equation*>
      a*log log x\<less\>\<pi\><rsub|m><around*|(|x|)>\<less\>b*log log x.
    </equation*>
  </conjecture>

  <\proposition>
    <label|Mersenne-prop>Assume Conjecture<nbsp><reference|Mersenne-conj> and
    let <math|c\<assign\>b/a>. For any integer <math|n\<geqslant\>2>, there
    exists a Mersenne prime <math|p=2<rsup|q>-1> in the interval
    <math|2<rsup|n>\<less\>p\<less\>2<rsup|n<rsup|c>>>. Given <math|n>, we
    may compute the smallest such <math|p>, and find a primitive
    <math|2<rsup|q+1>>-th root of unity in
    <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>>, in time
    <math|O<around|(|n<rsup|<around*|(|3+o<around*|(|1|)>|)>*c>|)>>.
  </proposition>

  <\proof>
    The required prime exists since for <math|n\<geqslant\>2> we have

    <\equation*>
      \<pi\><rsub|m><around*|(|2<rsup|n<rsup|c>>|)>\<gtr\>a*log log
      <around*|(|2<rsup|n<rsup|c>>|)>=a*c*log n+a*log log 2\<gtr\>b*log
      n+b*log log 2=b*log log<around*|(|2<rsup|n>|)>\<gtr\>\<pi\><rsub|m><around*|(|2<rsup|n>|)>.
    </equation*>

    An integer of the form <math|2<rsup|q>-1> may be tested for primality in
    time <math|q<rsup|2+o<around*|(|1|)>>> using the Lucas--Lehmer primality
    test<nbsp><cite|CP05>. A simple way to compute <math|p> is to apply this
    test successively for all <math|q\<in\><around*|{|n+1,\<ldots\>,<around*|\<lfloor\>|n<rsup|c>|\<rfloor\>>|}>>;
    this takes time <math|O<around|(|n<rsup|<around*|(|3+o<around*|(|1|)>|)>*c>|)>>.
    A primitive <math|2<rsup|q+1>>-th root of unity <math|\<omega\>> may be
    computed by the formula <math|\<omega\>\<assign\>2<rsup|2<rsup|q-2>>+<around*|(|-3|)><rsup|2<rsup|q-2>>*\<mathi\>\<in\>\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>>
    in time <math|O<around|(|q<rsup|2+o<around*|(|1|)>>|)>>;
    see<nbsp><cite|RT-convolutions> or<nbsp><cite-detail|CT89|Corollary<nbsp>5>.
  </proof>
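
  The search and the root of unity formula from the proof can be tried out
  on small exponents with the following Python sketch. It is only
  illustrative: the exponent <math|q> is first checked for primality by
  trial division (an addition of ours, so that the Lucas--Lehmer test is
  only applied to odd prime exponents), elements of
  <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>> are stored as pairs, and
  all function names are hypothetical.

```python
def is_small_prime(q):
    """Trial division, adequate for the small exponents used here."""
    return q >= 2 and all(q % d for d in range(2, int(q ** 0.5) + 1))


def lucas_lehmer(q):
    """True if 2**q - 1 is prime, for an odd prime exponent q."""
    p = (1 << q) - 1
    s = 4
    for _ in range(q - 2):
        s = (s * s - 2) % p
    return s == 0


def smallest_mersenne_exponent(n):
    """Smallest q such that 2**q - 1 is prime and exceeds 2**n (n >= 2).

    Terminates only if such a Mersenne prime exists.
    """
    q = n + 1
    while not (is_small_prime(q) and lucas_lehmer(q)):
        q += 1
    return q


def gauss_mul(x, y, p):
    """Product of x = a + b*i and y = c + d*i in (Z/pZ)[i]."""
    (a, b), (c, d) = x, y
    return ((a * c - b * d) % p, (a * d + b * c) % p)


def mersenne_root_of_unity(q):
    """omega = 2**(2**(q-2)) + (-3)**(2**(q-2)) * i in F_p[i], q >= 3."""
    p = (1 << q) - 1
    e = 1 << (q - 2)
    return (pow(2, e, p), pow(-3, e, p))


def has_order_2_pow_q_plus_1(q):
    """Check that omega**(2**q) = -1, so omega has order exactly 2**(q+1)."""
    p = (1 << q) - 1
    w = mersenne_root_of_unity(q)
    for _ in range(q):
        w = gauss_mul(w, w, p)
    return w == (p - 1, 0)
```

  For example, for <math|q=3> the formula gives
  <math|\<omega\>=4+2*\<mathi\>\<in\>\<bbb-F\><rsub|7><around*|[|\<mathi\>|]>>,
  and one checks by hand that <math|\<omega\><rsup|8>=-1>.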

  <subsection|Crandall and Fagin's algorithm revisited>

  Let <math|p=2<rsup|q>-1> be a Mersenne number (not necessarily prime). The
  main integer multiplication algorithm depends on a<nbsp>variant of Crandall
  and Fagin's algorithm that reduces multiplication in
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|M>-1|)>>
  to multiplication in <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|X,Y|]>/<rigid|<around*|(|X<rsup|M>-1,Y<rsup|N>-1|)>>>,
  where <math|p<rprime|'>=2<rsup|q<rprime|'>>-1> is a suitably smaller
  Mersenne prime (assuming that such a<nbsp>prime exists).

  To explain the idea of this reduction, we first consider the simpler
  univariate case, in which we reduce multiplication in
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>
  to multiplication in<nbsp><math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|Y|]>/<around*|(|Y<rsup|N>-1|)>>.
  Here we require that <math|N\<leqslant\>q>, that
  <math|gcd<around*|(|N,q<rprime|'>|)>=1> and that
  <math|q<rprime|'>\<geqslant\>2*<around*|\<lceil\>|q/N|\<rceil\>>+lg N+3>.
  For any <math|k\<in\>\<bbb-N\>>, we will write
  <math|\<bbb-N\><rsub|k>=<around*|{|0,\<ldots\>,k-1|}>> and
  <math|\<bbb-Z\><rsub|k>=<around*|{|-<around*|(|k-1|)>,\<ldots\>,k-1|}>>.

  Assume that we wish to compute the product of
  <math|u,v\<in\><around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>.
  Considering <math|u> and <math|v> as elements of
  <math|\<bbb-N\><rsub|p><around*|[|\<mathi\>|]>> modulo <math|p>, we
  decompose them as

  <\equation>
    <label|uv-dec>u=<big|sum><rsub|i=0><rsup|N-1>u<rsub|i>*2<rsup|e<rsub|i>>,<space|2em>v=<big|sum><rsub|i=0><rsup|N-1>v<rsub|i>*2<rsup|e<rsub|i>>,
  </equation>

  where

  <\eqnarray*>
    <tformat|<table|<row|<cell|e<rsub|i>>|<cell|\<assign\>>|<cell|<around*|\<lceil\>|q*i/N|\<rceil\>>,>>|<row|<cell|u<rsub|i>,v<rsub|i>>|<cell|\<in\>>|<cell|\<bbb-N\><rsub|2<rsup|e<rsub|i+1>-e<rsub|i>>><around*|[|\<mathi\>|]>.>>>>
  </eqnarray*>

  We regard <math|u<rsub|i>> and <math|v<rsub|i>> as complex ``digits'' of
  <math|u> and <math|v>, where the base <math|2<rsup|e<rsub|i+1>-e<rsub|i>>>
  varies with the position<nbsp><math|i>. Notice that
  <math|e<rsub|i+1>-e<rsub|i>> takes only two possible values:
  <math|<around*|\<lfloor\>|q/N|\<rfloor\>>> or
  <math|<around*|\<lceil\>|q/N|\<rceil\>>>.

  For <math|0\<leqslant\>i\<less\>N>, let

  <\eqnarray*>
    <tformat|<table|<row|<cell|c<rsub|i>>|<cell|\<assign\>>|<cell|N*e<rsub|i>-q*i\<nocomma\>,<eq-number><label|ci-def>>>>>
  </eqnarray*>

  so that <math|0\<leqslant\>c<rsub|i>\<less\>N>. For any
  <math|0\<leqslant\>i<rsub|1>,i<rsub|2>\<less\>N>, define
  <math|\<delta\><rsub|i<rsub|1>,i<rsub|2>>\<in\>\<bbb-Z\>> as follows.
  Choose <math|\<sigma\>\<in\><around*|{|0,1|}>> so that
  <math|i\<assign\>i<rsub|1>+i<rsub|2>-\<sigma\>*N> lies in the interval
  <math|0\<leqslant\>i\<less\>N>, and put

  <\eqnarray*>
    <tformat|<table|<row|<cell|\<delta\><rsub|i<rsub|1>,i<rsub|2>>>|<cell|\<assign\>>|<cell|e<rsub|i<rsub|1>>+e<rsub|i<rsub|2>>-e<rsub|i>-\<sigma\>*q.>>>>
  </eqnarray*>

  From<nbsp>(<reference|ci-def>), we have

  <\equation*>
    c<rsub|i<rsub|1>>+c<rsub|i<rsub|2>>-c<rsub|i>=N*<around*|(|e<rsub|i<rsub|1>>+e<rsub|i<rsub|2>>-e<rsub|i>|)>-q*<around*|(|i<rsub|1>+i<rsub|2>-i|)>=N*\<delta\><rsub|i<rsub|1>,i<rsub|2>>.
  </equation*>

  Since the left hand side lies in the interval <math|<around*|(|-N,2*N|)>>,
  this shows that <math|\<delta\><rsub|i<rsub|1>,i<rsub|2>>\<in\><around*|{|0,1|}>>.
  Now, since <math|2<rsup|q>=1 <pmod|p>> and
  <math|e<rsub|i<rsub|1>>+e<rsub|i<rsub|2>>=e<rsub|i>+\<delta\><rsub|i<rsub|1>,i<rsub|2>>
  <pmod|q>>, we have

  <\equation*>
    u*v<space|0.6spc>=<space|0.6spc><big|sum><rsub|i<rsub|1>=0><rsup|N-1><big|sum><rsub|i<rsub|2>=0><rsup|N-1>u<rsub|i<rsub|1>>*v<rsub|i<rsub|2>>*2<rsup|e<rsub|i<rsub|1>>+e<rsub|i<rsub|2>>><space|0.6spc>=<around*|\<nobracket\>|<big|sum><rsub|i=0><rsup|N-1>w<rsub|i>*2<rsup|e<rsub|i>>|\<nobracket\>><space|0.6spc>
    <pmod|p>,
  </equation*>

  where

  <\eqnarray*>
    <tformat|<table|<row|<cell|w<rsub|i>>|<cell|\<assign\>>|<cell|<big|sum><rsub|i<rsub|1>+i<rsub|2>=i
    <pmod|N>>2<rsup|\<delta\><rsub|i<rsub|1>,i<rsub|2>>>*u<rsub|i<rsub|1>>*v<rsub|i<rsub|2>>.>>>>
  </eqnarray*>

  Since <math|<around*|\||u<rsub|i<rsub|1>>|\|>\<less\><sqrt|2>\<cdot\>2<rsup|<around*|\<lceil\>|q/N|\<rceil\>>>>
  and similarly for <math|v<rsub|i<rsub|2>>>, we have
  <math|w<rsub|i>\<in\>\<bbb-Z\><rsub|4<rsup|<around*|\<lceil\>|q/N|\<rceil\>>+1>*N><around*|[|\<mathi\>|]>>.
  Note that we may recover <math|u*v> from
  <math|w<rsub|0>,\<ldots\>,w<rsub|N-1>> in time
  <math|O<around*|(|q|)>\<nocomma\>>, by a standard overlap-add procedure
  (provided that <math|N=O<around*|(|q/lg q|)>>).

  Let <math|h> be the inverse of <math|N> modulo <math|q<rprime|'>>; this
  inverse exists since we assumed <math|gcd<around*|(|N,q<rprime|'>|)>=1>.
  Let <math|\<theta\>\<assign\>2<rsup|h>\<in\>\<bbb-F\><rsub|p<rprime|'>>>,
  so that

  <\equation*>
    \<theta\><rsup|N>=2<rsup|h*N>=2,
  </equation*>

  since <math|2> has order <math|q<rprime|'>> in
  <math|\<bbb-F\><rsub|p<rprime|'>>>. The quantity <math|\<theta\>> plays the
  same role as the real <math|N><nbhyph>th root of <math|2> appearing in
  Crandall--Fagin's algorithm.

  Now define polynomials <math|U,V\<in\>\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|Y|]>/<around*|(|Y<rsup|N>-1|)>>
  by <math|U<rsub|i>\<assign\>\<theta\><rsup|c<rsub|i>>*u<rsub|i>> and
  <math|V<rsub|i>\<assign\>\<theta\><rsup|c<rsub|i>>*v<rsub|i>> for
  <math|0\<leqslant\>i\<less\>N>, and let
  <math|W=W<rsub|0>+\<cdots\>+W<rsub|N-1>*Y<rsup|N-1>\<assign\>U*V> be their
  (cyclic) product. Then

  <\equation*>
    <wide|w|~><rsub|i>\<assign\>\<theta\><rsup|-c<rsub|i>>*W<rsub|i>=<big|sum><rsub|i<rsub|1>+i<rsub|2>=i
    <pmod|N>>\<theta\><rsup|-c<rsub|i>>*U<rsub|i<rsub|1>>*V<rsub|i<rsub|2>>=<big|sum>\<theta\><rsup|c<rsub|i<rsub|1>>+c<rsub|i<rsub|2>>-c<rsub|i>>*u<rsub|i<rsub|1>>*v<rsub|i<rsub|2>>=<big|sum>2<rsup|\<delta\><rsub|i<rsub|1>,i<rsub|2>>>*u<rsub|i<rsub|1>>*v<rsub|i<rsub|2>>
  </equation*>

  coincides with the reinterpretation of <math|w<rsub|i>> as an element of
  <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>. Moreover, we may
  recover <math|w<rsub|i>> unambiguously from <math|<wide|w|~><rsub|i>>, as
  <math|q<rprime|'>\<geqslant\>2*<around*|\<lceil\>|q/N|\<rceil\>>+lg N+3>
  and <math|w<rsub|i>\<in\>\<bbb-Z\><rsub|4<rsup|<around*|\<lceil\>|q/N|\<rceil\>>+1>*N><around*|[|\<mathi\>|]>>.
  Altogether, this shows how to reduce multiplication in
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>
  to multiplication in <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|Y|]>/<around*|(|Y<rsup|N>-1|)>>.
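
  The univariate reduction above can be sketched in Python (version 3.8 or
  later) as follows. This is only an illustration under stated assumptions:
  the cyclic product in <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|Y|]>/<around*|(|Y<rsup|N>-1|)>>
  is computed by the naive quadratic method rather than by FFTs, Gaussian
  integers are stored as pairs, <math|lg N> is taken to mean
  <math|<around*|\<lceil\>|log<rsub|2> N|\<rceil\>>>, and the function names
  are hypothetical. The sample parameters <math|q=61>, <math|N=8>,
  <math|q<rprime|'>=31> used in the final check are chosen only because they
  satisfy the three requirements stated above.

```python
from math import gcd


def gauss_mul(x, y, m):
    """Product of x = a + b*i and y = c + d*i with components modulo m."""
    (a, b), (c, d) = x, y
    return ((a * c - b * d) % m, (a * d + b * c) % m)


def crandall_fagin_mul(u, v, q, N, qp):
    """Multiply u, v in (Z/pZ)[i], p = 2**q - 1, via F_{p'}[i][Y]/(Y**N - 1).

    u and v are pairs of integers in [0, p); p' = 2**qp - 1 is assumed to
    be a Mersenne prime.
    """
    p, pp = (1 << q) - 1, (1 << qp) - 1
    chunk = -(-q // N)                                  # ceil(q / N)
    assert N <= q and gcd(N, qp) == 1
    assert qp >= 2 * chunk + (N - 1).bit_length() + 3
    e = [-(-q * i // N) for i in range(N + 1)]          # e_i, with e_N = q
    c = [N * e[i] - q * i for i in range(N)]            # 0 <= c_i < N
    theta = pow(2, pow(N, -1, qp), pp)                  # theta**N = 2
    theta_inv = pow(theta, -1, pp)

    def digits(x):
        # Variable-base digits: digit i holds bits e_i .. e_{i+1} - 1 of x.
        return [(x >> e[i]) & ((1 << (e[i + 1] - e[i])) - 1) for i in range(N)]

    def weighted(z):
        re, im = digits(z[0]), digits(z[1])
        t = [pow(theta, c[i], pp) for i in range(N)]
        return [(t[i] * re[i] % pp, t[i] * im[i] % pp) for i in range(N)]

    U, V = weighted(u), weighted(v)
    W = [(0, 0)] * N                                    # cyclic product U * V
    for i1 in range(N):
        for i2 in range(N):
            i = (i1 + i2) % N
            t = gauss_mul(U[i1], V[i2], pp)
            W[i] = ((W[i][0] + t[0]) % pp, (W[i][1] + t[1]) % pp)

    def signed(x):
        # Representative in (-p'/2, p'/2); valid by the size bound on w_i.
        return x if x <= pp // 2 else x - pp

    re = im = 0
    for i in range(N):
        s = pow(theta_inv, c[i], pp)                    # remove the weight
        re += signed(W[i][0] * s % pp) << e[i]
        im += signed(W[i][1] * s % pp) << e[i]
    return (re % p, im % p)


# Check against direct multiplication in (Z/pZ)[i].
q, N, qp = 61, 8, 31
p = (1 << q) - 1
u, v = (pow(3, 50, p), pow(5, 71, p)), (pow(7, 32, p), pow(11, 90, p))
assert crandall_fagin_mul(u, v, q, N, qp) == gauss_mul(u, v, p)
```

  The final summation is written with arbitrary precision integers for
  brevity; it does not reflect the linear time overlap-add procedure
  mentioned above.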

  <\remark>
    <label|CF-rem>The pair <math|<around*|(|e<rsub|i+1>,c<rsub|i+1>|)>> can
    be computed from <math|<around*|(|e<rsub|i>,c<rsub|i>|)>> in
    <math|O<around*|(|lg q|)>> bit operations, so we may compute the
    sequences <math|e<rsub|0>,\<ldots\>,e<rsub|N-1>> and
    <math|c<rsub|0>,\<ldots\>,c<rsub|N-1>> in time <math|O<around*|(|N*lg
    q|)>>. Moreover, since <math|c<rsub|i+1>-c<rsub|i>> takes on only two
    possible values, we may compute the sequence
    <math|\<theta\><rsup|c<rsub|0>>,\<ldots\>,\<theta\><rsup|c<rsub|N-1>>>
    using <math|O<around*|(|N|)>> multiplications in
    <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>.
  </remark>
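
  The incremental computation mentioned in the remark can be sketched as
  follows (Python 3.8 or later). The update rule used here is one possible
  way to realise it, stated as an assumption: writing <math|q=d*N+r> with
  <math|0\<leqslant\>r\<less\>N>, we have
  <math|c<rsub|i+1>-c<rsub|i>\<in\><around*|{|-r,N-r|}>>, and
  <math|e<rsub|i+1>-e<rsub|i>> is <math|d> or <math|d+1> accordingly. The
  function names are hypothetical.

```python
def exponents_direct(q, N):
    """e_i = ceil(q*i/N) and c_i = N*e_i - q*i, from the definitions."""
    e = [-(-q * i // N) for i in range(N)]
    c = [N * e[i] - q * i for i in range(N)]
    return e, c


def exponents_incremental(q, N):
    """Same sequences, each pair obtained from the previous one."""
    d, r = divmod(q, N)
    e, c = [0], [0]
    for _ in range(N - 1):
        ei, ci = e[-1] + d, c[-1] - r
        if ci < 0:                       # need the larger step d + 1
            ei, ci = ei + 1, ci + N
        e.append(ei)
        c.append(ci)
    return e, c


def theta_powers(theta, pp, q, N):
    """theta**c_0, ..., theta**c_{N-1} modulo pp, one product per term."""
    r = q % N
    down = pow(theta, -r, pp)            # theta**(-r)
    up = pow(theta, N - r, pp)           # theta**(N - r)
    _, c = exponents_incremental(q, N)
    powers = [1]
    for i in range(1, N):
        step = up if c[i] > c[i - 1] else down
        powers.append(powers[-1] * step % pp)
    return powers


assert exponents_incremental(61, 8) == exponents_direct(61, 8)
```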

  <subsection|Bivariate Crandall--Fagin reduction><label|CF-sec>

  Generalising the discussion of the previous section, we now show how to
  reduce multiplication in <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|M>-1|)>>,
  for a given <math|M\<geqslant\>1>, to multiplication in
  <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|X,Y|]>/<around*|(|X<rsup|M>-1,Y<rsup|N>-1|)>>.
  For this, we require that <math|N\<leqslant\>q>, that
  <math|gcd<around*|(|N,q<rprime|'>|)>=1> and that
  <math|q<rprime|'>\<geqslant\>2*<around*|\<lceil\>|q/N|\<rceil\>>+lg
  <around|(|M*N|)>+3>.

  Indeed, consider two cyclic polynomials
  <math|u=u<rsub|0>+\<cdots\>+u<rsub|M-1>*X<rsup|M-1>> and
  <math|v=v<rsub|0>+\<cdots\>+v<rsub|M-1>*X<rsup|M-1>> in
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|M>-1|)>>.
  We cut each of the coefficients <math|u<rsub|i>,v<rsub|i>\<in\><around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>>
  into <math|N> chunks <math|u<rsub|i,j>> and <math|v<rsub|i,j>> of bit size
  at most <math|<around*|\<lceil\>|q/N|\<rceil\>>>, using the same varying
  base strategy as above. With <math|\<theta\><rsup|N>=2> and
  <math|c<rsub|j>> as before, we next form the bivariate cyclic polynomials

  <\equation*>
    U\<assign\><big|sum><rsub|i,j>u<rsub|i,j>*\<theta\><rsup|c<rsub|j>>*X<rsup|i>*Y<rsup|j>,<space|2em>V\<assign\><big|sum><rsub|i,j>v<rsub|i,j>*\<theta\><rsup|c<rsub|j>>*X<rsup|i>*Y<rsup|j>
  </equation*>

  in <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|X,Y|]>/<around*|(|X<rsup|M>-1,Y<rsup|N>-1|)>>.
  Setting

  <\eqnarray*>
    <tformat|<table|<row|<cell|W>|<cell|\<assign\>>|<cell|U*V=<big|sum><rsub|i,j>w<rsub|i,j>*\<theta\><rsup|c<rsub|j>>*X<rsup|i>*Y<rsup|j>,>>>>
  </eqnarray*>

  the same arguments as in the previous section yield

  <\eqnarray*>
    <tformat|<table|<row|<cell|w<rsub|i,j>>|<cell|=>|<cell|<big|sum><rsub|i<rsub|1>+i<rsub|2>=i
    <pmod|M>><big|sum><rsub|j<rsub|1>+j<rsub|2>=j
    <pmod|N>>2<rsup|\<delta\><rsub|j<rsub|1>,j<rsub|2>>>*u<rsub|i<rsub|1>,j<rsub|1>>*v<rsub|i<rsub|2>,j<rsub|2>>.>>>>
  </eqnarray*>

  Using the assumption that <math|q<rprime|'>\<geqslant\>2*<around*|\<lceil\>|q/N|\<rceil\>>+lg
  <around|(|M*N|)>+3>, we recover the coefficients <math|w<rsub|i,j>>, and
  hence the product <math|u*v>, from the bivariate cyclic convolution product
  <math|W=U*V>.

  <subsection|Conjecturally faster multiplication>

  Let <math|q\<geqslant\>2> and <math|p\<assign\>2<rsup|q>-1> (not
  necessarily prime). We will take our basic recursive problem to be
  multiplication in <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around|(|X<rsup|M>-1|)>>
  for suitable <math|M>. We need <math|M> somewhat larger than <math|q>; this
  is analogous to the situation in section<nbsp><reference|even-faster-sec>,
  where we chose a ``short transform length'' somewhat larger than the
  coefficient size. Thus we set <math|M=M<around*|(|q|)>\<assign\>2<rsup|\<mu\><around*|(|q|)>>>
  where <math|\<mu\><around*|(|q|)>> is defined as follows.

  <\lemma>
    There exists an increasing function <math|\<mu\>:\<bbb-N\>\<rightarrow\>\<bbb-N\>>
    such that

    <\equation>
      <label|mu-estimate>0\<leqslant\>\<mu\><around*|(|q|)>-<around*|(|log<rsub|2>
      q|)>*<around*|(|log<rsub|2> log<rsub|2> q|)>\<leqslant\>2
    </equation>

    for all <math|q\<geqslant\>2>, and such that we may compute
    <math|\<mu\><around*|(|q|)>> in time <math|<around*|(|log
    q|)><rsup|1+o<around*|(|1|)>>>.
  </lemma>

  <\proof>
    Let <math|f<around*|(|q|)>\<assign\><around*|(|log<rsub|2>
    q|)>*<around*|(|log<rsub|2> log<rsub|2> q|)>>. Using<nbsp><cite|Br76b>,
    we may construct a function <math|g<around*|(|q|)>> such that
    <math|<around*|\||g<around*|(|q|)>-f<around*|(|q|)>|\|>\<leqslant\>1/q>
    for all <math|q\<geqslant\>2>, and which may be computed in time
    <math|<around*|(|log q|)><rsup|1+o<around*|(|1|)>>>. One checks that
    <math|f<around*|(|q+1|)>-f<around*|(|q|)>\<geqslant\>2/q> for all
    <math|q\<geqslant\>2>, so <math|g<around*|(|q+1|)>\<geqslant\>f<around*|(|q+1|)>-<frac|1|q+1>\<geqslant\>f<around*|(|q+1|)>-<frac|1|q>\<geqslant\>f<around*|(|q|)>+<frac|1|q>\<geqslant\>g<around*|(|q|)>>
    for <math|q\<geqslant\>2>. Thus <math|g<around*|(|q|)>> is increasing,
    and <math|\<mu\><around*|(|q|)>\<assign\><around*|\<lfloor\>|g<around*|(|q|)>+3/2|\<rfloor\>>>
    has the desired properties.
  </proof>
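
  As a rough illustration of the lemma, the following Python sketch evaluates
  a candidate for <math|\<mu\><around*|(|q|)>> for moderate <math|q>. It is
  not the construction used in the proof: it assumes that double-precision
  logarithms are accurate enough to play the role of the approximation
  <math|g> obtained from<nbsp><cite|Br76b>, which is only reasonable for
  small inputs. The names <verbatim|f> and <verbatim|mu> are illustrative.

```python
import math


def f(q):
    """f(q) = log2(q) * log2(log2(q)), defined for integers q >= 2."""
    lq = math.log2(q)
    return lq * math.log2(lq)


def mu(q):
    """Floating-point stand-in for mu(q) = floor(g(q) + 3/2).

    Assumption: f itself is used in place of the certified approximation g,
    so this is only trustworthy while rounding errors stay far below 1/q.
    """
    return math.floor(f(q) + 1.5)


if __name__ == "__main__":
    for q in (2, 3, 4, 16, 1000):
        # The last column is mu(q) - f(q), which the lemma bounds by [0, 2].
        print(q, mu(q), round(mu(q) - f(q), 3))
```

  On such inputs one can check numerically that the values are non-decreasing
  and satisfy<nbsp><eqref|mu-estimate>.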

  We say that an integer <math|n\<geqslant\>2> is
  <with|font-shape|italic|admissible> if it is of the form <math|n=q*M> where
  <math|M\<assign\>M<around*|(|q|)>> for some <math|q\<geqslant\>2>. (This
  should not be confused with the notion of admissibility of
  section<nbsp><reference|even-faster-sec>.) An element of
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|M>-1|)>>
  is then represented by <math|2*n> bits. Note that
  <math|q\<mapsto\>q*M<around*|(|q|)>> is strictly increasing, so there is a
  one-to-one correspondence between integers <math|q\<geqslant\>2> and
  admissible <math|n>. For <math|x\<geqslant\>2> we define
  <math|\<beta\><around*|(|x|)>> to be the smallest admissible integer
  <math|n\<geqslant\>x>.

  <\lemma>
    We have <math|\<beta\><around*|(|n|)>=O<around*|(|n|)>> as
    <math|n\<rightarrow\>\<infty\>>. Given <math|n\<geqslant\>2>, we may
    compute <math|\<beta\><around*|(|n|)>>, and the corresponding <math|q>,
    in time <math|o<around*|(|n|)>>.
  </lemma>

  <\proof>
    From<nbsp><eqref|mu-estimate> we have
    <math|<around*|(|q+1|)>*M<around*|(|q+1|)>/<around*|(|q*M<around*|(|q|)>|)>=O<around*|(|2<rsup|\<mu\><around*|(|q+1|)>-\<mu\><around*|(|q|)>>|)>=O<around*|(|1|)>>;
    this immediately implies that <math|\<beta\><around*|(|n|)>=O<around*|(|n|)>>.

    Suppose that we wish to compute <math|\<beta\><around*|(|n|)>> for some
    <math|n>. We assume that <math|n> is large enough that the definition
    <math|q<rsub|0>\<assign\>2<rsup|<around*|\<lceil\>|lg n/<around*|(|lg lg
    n-lg lg lg n-1|)>|\<rceil\>>>> makes sense and so that
    <math|q<rsub|0>\<geqslant\>2>. One checks that
    <math|<around*|(|log<rsub|2> q<rsub|0>|)>*<around*|(|log<rsub|2>
    log<rsub|2> q<rsub|0>|)>\<geqslant\>lg n>, so
    <math|\<mu\><around*|(|q<rsub|0>|)>\<geqslant\>lg n> and hence
    <math|q<rsub|0>*M<around*|(|q<rsub|0>|)>\<geqslant\>n>. To find the
    smallest suitable<nbsp><math|q>, we may simply compute
    <math|q*M<around*|(|q|)>> for each <math|q=2,3,\<ldots\>,q<rsub|0>>, and
    compare with <math|n>. This takes time
    <math|O<around*|(|q<rsub|0>*<around*|(|log
    q<rsub|0>|)><rsup|1+o<around*|(|1|)>>|)>=o<around*|(|n|)>>.
  </proof>
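
  The search described in the proof can be sketched as follows. This is a
  minimal sketch under the same floating-point assumption as before: the
  helper <verbatim|mu> uses double-precision logarithms instead of the
  certified approximation, and the loop simply runs until the first
  admissible value is reached rather than stopping at the explicit
  bound<nbsp><math|q<rsub|0>>. The function names are hypothetical.

```python
import math


def mu(q):
    """Floating-point stand-in for mu(q) (an assumption, see the lead-in)."""
    lq = math.log2(q)
    return math.floor(lq * math.log2(lq) + 1.5)


def admissible(q):
    """Return the admissible integer n = q * M(q), where M(q) = 2^mu(q)."""
    return q * (1 << mu(q))


def beta(x):
    """Return (n, q), with n the smallest admissible integer >= x."""
    q = 2
    # q -> q * M(q) is strictly increasing, so a linear scan finds the minimum.
    while admissible(q) < x:
        q += 1
    return admissible(q), q


if __name__ == "__main__":
    for x in (2, 5, 13, 33, 10**6):
        print(x, beta(x))
```

  The first admissible integers produced in this way are <math|4>,
  <math|12>, <math|32> and <math|80>, for <math|q=2,3,4,5>.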

  Now let <math|q\<geqslant\>2>, <math|p\<assign\>2<rsup|q>-1> and
  <math|M\<assign\>M<around*|(|q|)>>. Consider the problem of computing
  <math|t\<geqslant\>1> products <math|u<rsub|1>*v,\<ldots\>,u<rsub|t>*v>
  with <math|<rigid|u<rsub|1>,\<ldots\>,u<rsub|t>>,v\<in\><around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around|(|X<rsup|M>-1|)>>.
  We denote by <math|<math-ss|C><rsub|t><around*|(|n|)>> the complexity of
  this problem, where <math|n\<assign\>q*M<around*|(|q|)>> is the admissible
  integer corresponding to <math|q>. As in
  section<nbsp><reference|even-faster-sec>, we define
  <math|<math-ss|C><around*|(|n|)>\<assign\>sup<rsub|t\<geqslant\>1>
  <math-ss|C><rsub|t><around*|(|n|)>/<around*|(|2*t+1|)>>.

  Notice that multiplication of two integers of bit size <math|\<leqslant\>k>
  reduces to the above problem, for <math|t=1>, via a suitable Kronecker
  segmentation. Indeed, let <math|n\<assign\>\<beta\><around*|(|8*k|)>=q*M<around*|(|q|)>>
  for some <math|q>, and encode the integers as integer polynomials of degree
  less than <math|M/2> with coefficients of bit size
  <math|m\<assign\><around*|\<lceil\>|k/<around*|(|M/2|)>|\<rceil\>>>. The
  desired product may be recovered from the product in
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around|(|X<rsup|M>-1|)>>,
  as

  <\equation*>
    2*m+lg <around*|(|M/2|)>\<leqslant\><frac|4*k|M>+\<mu\><around*|(|q|)>+1\<leqslant\><frac|q|2>+\<mu\><around*|(|q|)>+1\<leqslant\>q-1
  </equation*>

  for large <math|q>. Thus, as in section<nbsp><reference|even-faster-sec>,
  we have <math|<math-ss|I><around*|(|k|)>\<leqslant\>3*<math-ss|C><around*|(|O<around*|(|k|)>|)>+O<around*|(|k|)>>,
  and it suffices to obtain a good bound for
  <math|<math-ss|C><around*|(|n|)>>.
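
  The Kronecker segmentation step can be illustrated on toy parameters. The
  sketch below is not the algorithm analysed in this section: it assumes a
  naive quadratic cyclic convolution in place of DFTs, works only in the real
  subring <math|\<bbb-Z\>/p*\<bbb-Z\>>, and takes <math|M> and <math|q> as
  free inputs rather than admissible values. The function name
  <verbatim|kronecker_multiply> is hypothetical.

```python
def kronecker_multiply(a, b, k, M, q):
    """Multiply a, b < 2^k via a cyclic convolution of length M modulo 2^q - 1.

    Toy sketch: M is assumed to be a power of two, and the convolution is
    computed naively instead of with fast transforms.
    """
    p = (1 << q) - 1
    half = M // 2
    m = -(-k // half)                    # m = ceil(k / (M/2)) bits per chunk
    lg_half = (half - 1).bit_length()    # ceil(log2(M/2))
    if 2 * m + lg_half > q - 1:
        raise ValueError("product coefficients would not fit below p")
    mask = (1 << m) - 1
    # Polynomials of degree < M/2, zero-padded to length M.
    u = [(a >> (m * i)) & mask for i in range(half)] + [0] * half
    v = [(b >> (m * i)) & mask for i in range(half)] + [0] * half
    w = [0] * M
    for i in range(M):
        for j in range(M):
            idx = (i + j) % M            # exponents are reduced mod X^M - 1
            w[idx] = (w[idx] + u[i] * v[j]) % p
    # Each coefficient is below 2^(q-1), so its residue mod p is exact;
    # evaluating at X = 2^m recovers the integer product.
    return sum(c << (m * i) for i, c in enumerate(w))


if __name__ == "__main__":
    a, b = 123456789, 987654321
    print(kronecker_multiply(a, b, 30, 8, 19) == a * b)
```

  The size check in the code mirrors the inequality displayed above: the
  product has degree less than <math|M>, so no wrap-around occurs, and its
  coefficients are small enough to be read off modulo <math|p>.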

  Now suppose additionally that <math|p=2<rsup|q>-1> is
  <with|font-shape|italic|prime>. In this case
  <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]>=\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>>
  is a field, and as noted above, it contains <math|2<rsup|q+1>>-th roots of
  unity, so we may define DFTs of length <math|2<rsup|r>> over
  <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>> for any
  <math|r\<leqslant\>q+1>. In particular, for <math|r\<leqslant\>q> we may
  use Bluestein's algorithm to compute DFTs of length <math|2<rsup|r>>.
  Denote by <math|<math-ss|B><rsub|q,t><around*|(|2<rsup|r>|)>> the
  cost of evaluating <math|t> independent DFTs of length <math|2<rsup|r>>
  over <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>>, and put
  <math|<math-ss|B><rsub|q><around*|(|2<rsup|r>|)>\<assign\>sup<rsub|t\<geqslant\>1>
  <math-ss|B><rsub|q,t><around*|(|2<rsup|r>|)>/<around*|(|2*t+1|)>>. Here we
  assume as usual that a <math|2<rsup|r+1>>-th root of unity is known, and
  that the corresponding Bluestein root table has been precomputed.
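
  To make the statement about roots of unity concrete, the following sketch
  finds an element of exact order <math|2<rsup|q+1>> in
  <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>> for a small Mersenne prime
  <math|p=2<rsup|q>-1>. It relies on the identity
  <math|p<rsup|2>-1=2<rsup|q+1>*<around*|(|2<rsup|q-1>-1|)>>. The exhaustive
  search over candidates is an assumption made for simplicity and is not the
  method referred to in the text; all function names are hypothetical.

```python
def gi_mul(x, y, p):
    """Multiply x = (a, b) and y = (c, d), read as a + b*i, in F_p[i]."""
    a, b = x
    c, d = y
    return ((a * c - b * d) % p, (a * d + b * c) % p)


def gi_pow(x, e, p):
    """Compute x^e in F_p[i] by binary exponentiation."""
    result = (1, 0)
    while e:
        if e & 1:
            result = gi_mul(result, x, p)
        x = gi_mul(x, x, p)
        e >>= 1
    return result


def primitive_root_of_unity(q):
    """Return omega in F_p[i] of exact order 2^(q+1), for p = 2^q - 1 prime."""
    p = (1 << q) - 1
    cofactor = (p * p - 1) >> (q + 1)    # equals 2^(q-1) - 1
    for a in range(1, p):
        for b in range(1, p):
            omega = gi_pow((a, b), cofactor, p)
            # omega^(2^q) = -1 forces the order to be exactly 2^(q+1).
            if gi_pow(omega, 1 << q, p) == (p - 1, 0):
                return omega
    raise ValueError("no root found; is 2^q - 1 prime?")


if __name__ == "__main__":
    print(primitive_root_of_unity(5))
```

  Since <math|p\<equiv\>3> modulo <math|4> for <math|q\<geqslant\>2>, the
  polynomial <math|\<mathi\><rsup|2>+1> is irreducible over
  <math|\<bbb-F\><rsub|p>>, which is why pairs of residues suffice to
  represent the field elements here.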

  Let us apply these definitions in the case <math|r\<assign\>lg M>; this is
  permissible, as <math|lg M\<leqslant\>q> for sufficiently large <math|q>.
  Since convolution of length <math|M> over
  <math|\<bbb-F\><rsub|p><around*|[|\<mathi\>|]>> is exactly the basic
  recursive problem, and since one of the operands is fixed, we have
  <math|<math-ss|B><rsub|q,t><around*|(|M|)>\<leqslant\><math-ss|C><rsub|t><around*|(|n|)>+O<around*|(|t*M*<math-ss|I><around*|(|q|)>|)>>,
  where <math|n\<assign\>q*M>, and hence

  <\eqnarray*>
    <tformat|<table|<row|<cell|<math-ss|B><rsub|q><around*|(|M|)>>|<cell|\<leqslant\>>|<cell|<math-ss|C><around*|(|n|)>+O<around*|(|M*<math-ss|I><around*|(|q|)>|)>.<eq-number><label|Bluestein-mersenne>>>>>
  </eqnarray*>

  <\theorem>
    <label|conj-rec-th>Assume Conjecture<nbsp><reference|Mersenne-conj>. Then
    there exist <math|x<rsub|0>\<geqslant\>2> and a logarithmically slow
    function <math|\<Phi\>:<around*|(|x<rsub|0>,\<infty\>|)>\<rightarrow\>\<bbb-R\>>
    with the following property. For all admissible
    <math|n\<gtr\>x<rsub|0>\<nocomma\>>, there exists an admissible
    <math|n<rprime|'>\<leqslant\>\<Phi\><around*|(|n|)>> such that

    <\eqnarray*>
      <tformat|<table|<row|<cell|<frac|<math-ss|C><around*|(|n|)>|n*lg
      n>>|<cell|\<leqslant\>>|<cell|<around*|(|4+O<around*|(|<frac|1|lg lg lg
      n>|)>|)>*<frac|<math-ss|C><around*|(|n<rprime|'>|)>|n<rprime|'>*lg
      n<rprime|'>>+O<around*|(|1|)>.<eq-number><label|mersenne-rel-1>>>>>
    </eqnarray*>
  </theorem>

  <\proof>
    Let <math|n\<assign\>q*M> with <math|M=M<around*|(|q|)>>. Assume that we
    wish to compute <math|t\<geqslant\>1> products with one fixed operand.
    Our goal is to reduce to a problem of the same form, but for
    exponentially smaller <math|n>.<vspace|0.5fn>

    <no-indent><strong|Choose parameters.> Let
    <math|p<rprime|'>=2<rsup|q<rprime|'>>-1> be the smallest Mersenne prime
    larger than <math|2<rsup|<around*|(|lg M|)><rsup|2>>>. By
    Proposition<nbsp><reference|Mersenne-prop>, we have
    <math|2<rsup|<around*|(|lg M|)><rsup|2>>\<less\>p<rprime|'>\<less\>2<rsup|<around*|(|lg
    M|)><rsup|2*c>>>, whence <math|<around*|(|lg
    M|)><rsup|2>\<leqslant\>q<rprime|'>\<leqslant\><around*|(|lg
    M|)><rsup|2*c>>, for some absolute constant <math|c\<gtr\>1>. Moreover,
    we may compute <math|p<rprime|'>>, together with a primitive
    <math|2<rsup|q<rprime|'>+1>>-th root of unity <math|\<omega\>> in
    <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>, in time
    <math|O<around|(|<around*|(|lg M|)><rsup|<around*|(|6+o<around*|(|1|)>|)>*c>|)>=O<around*|(|n*lg
    n|)>>. We define <math|M<rprime|'>\<assign\>M<around*|(|q<rprime|'>|)>>
    and <math|n<rprime|'>\<assign\>q<rprime|'>*M<rprime|'>>.

    The algorithm must perform various multiplications in
    <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>, at cost
    <math|O<around*|(|<math-ss|I><around*|(|q<rprime|'>|)>|)>>. For
    simplicity we will use Schönhage--Strassen's algorithm for these
    multiplications, i.e., we will take <math|<math-ss|I><around*|(|q<rprime|'>|)>=O<around*|(|q<rprime|'>*lg
    q<rprime|'>*lg lg q<rprime|'>|)>>. Since <math|lg
    q<rprime|'>=O<around*|(|lg lg M|)>=O<around*|(|lg lg n|)>\<nocomma\>>, we
    have

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|I><around*|(|q<rprime|'>|)>>|<cell|=>|<cell|O<around*|(|q<rprime|'>*lg
      lg n*lg lg lg n|)>.>>>>
    </eqnarray*>

    <no-indent><strong|Crandall--Fagin reduction.> We use the framework of
    section<nbsp><reference|CF-sec> to reduce the basic multiplication
    problem in <math|<around*|(|\<bbb-Z\>/<around*|\<nobracket\>|p*\<bbb-Z\>|\<nobracket\>>|)><around*|[|\<mathi\>|]><around*|[|X|]>/<around*|(|X<rsup|M>-1|)>>
    to multiplication in <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|X,Y|]>/<around*|(|X<rsup|M>-1,Y<rsup|N>-1|)>>
    for suitable <math|N>. We take <math|N\<assign\>2<rsup|\<ell\>>*s> where

    <\eqnarray*>
      <tformat|<cwith|1|1|3|3|cell-bsep|1spc>|<table|<row|<cell|\<ell\>>|<cell|\<assign\>>|<cell|lg
      <around*|(|<frac|2*q|q<rprime|'>*lg lg
      q>|)>,>>|<row|<cell|s>|<cell|\<assign\>>|<cell|2*<around*|\<lceil\>|<frac|q|2<rsup|\<ell\>>*<around*|(|q<rprime|'>-lg<rsup|2>
      q|)>>|\<rceil\>>+1.>>>>
    </eqnarray*>

    We also write <math|L\<assign\>2<rsup|\<ell\>>>. The definition of
    <math|s> makes sense for large <math|q> since
    <math|q<rprime|'>\<geqslant\><around*|(|lg
    M|)><rsup|2>\<asymp\><around*|(|lg q*lg lg q|)><rsup|2>>. Let us check
    that the hypotheses of section<nbsp><reference|CF-sec> are satisfied for
    large <math|q>. We have <math|L\<asymp\>q/<around*|(|q<rprime|'>*lg lg
    q|)>> and hence <math|s\<asymp\>lg lg q>; in particular,
    <math|s\<less\>q<rprime|'>>, so <math|gcd<around*|(|N,q<rprime|'>|)>=1>,
    and also <math|N\<asymp\>q/q<rprime|'>\<prec\>q/lg q>. Since
    <math|N=L*s\<geqslant\>2*q/<around*|(|q<rprime|'>-lg<rsup|2> q|)>>, we
    also have <math|2*<around*|\<lceil\>|q/N|\<rceil\>>\<leqslant\>q<rprime|'>-lg<rsup|2>
    q+O<around*|(|1|)>>, and thus <math|2*<around*|\<lceil\>|q/N|\<rceil\>>+lg
    <around|(|M*N|)>+3\<leqslant\>q<rprime|'>> since
    <math|lg<around*|(|M*N|)>=O<around*|(|lg q*lg lg q|)>>.

    We also note for later use the estimate

    <\eqnarray*>
      <tformat|<table|<row|<cell|M*N*q<rprime|'>>|<cell|=>|<cell|<around*|(|2+O<around*|(|<frac|1|lg
      lg n>|)>|)>*n.>>>>
    </eqnarray*>

    Indeed, since <math|s\<asymp\>lg lg q> we have

    <\eqnarray*>
      <tformat|<table|<row|<cell|s>|<cell|=>|<cell|<around*|(|2+O<around*|(|<frac|1|lg
      lg q>|)>|)>*<frac|q|2<rsup|\<ell\>>*<around*|(|q<rprime|'>-lg<rsup|2>
      q|)>>,>>>>
    </eqnarray*>

    and we already noticed earlier that <math|<around*|(|lg<rsup|2>
    q|)>/q<rprime|'>=O<around*|(|1/<around*|(|lg lg
    q|)><rsup|2>|)>=O<around*|(|1/lg lg q|)>>.

    To assess the cost of the Crandall--Fagin reduction, we note that
    computing the <math|e<rsub|i>> and <math|c<rsub|i>> costs
    <math|O<around*|(|N*lg q|)>=O<around*|(|n*lg n|)>> (see
    Remark<nbsp><reference|CF-rem>), the splitting itself and final
    overlap-add phase require time <math|O<around*|(|t*n|)>>, and the various
    multiplications by <math|\<theta\>>, <math|\<theta\><rsup|c<rsub|i>>> and
    <math|\<theta\><rsup|-c<rsub|i>>> have cost
    <math|O<around*|(|t*M*N*<math-ss|I><around*|(|q<rprime|'>|)>|)>=O<around*|(|t*n*<math-ss|I><around*|(|q<rprime|'>|)>/q<rprime|'>|)>=O<around*|(|t*n*lg
    n|)>>.<vspace|0.5fn>

    <no-indent><strong|Reduction to power-of-two lengths.> Next we reduce
    multiplication in <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|X,Y|]>/<around*|(|X<rsup|M>-1,Y<rsup|N>-1|)>>
    to multiplication in <math|\<cal-R\><around*|[|X,Z|]>/<around*|(|X<rsup|M>-1,Z<rsup|L>-1|)>>,
    where <math|\<cal-R\>\<assign\>\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|U|]>/<around*|(|U<rsup|s>-1|)>>.
    In fact, since <math|gcd<around*|(|L,s|)>=1>, these rings are isomorphic,
    via the map that sends <math|X> to <math|X> and <math|Y> to <math|Z*U>.
    Evaluating this isomorphism corresponds to rearranging the coefficients
    according to the rule <math|i\<mapsto\><around*|(|i<rsub|0>,i<rsub|1>|)>>,
    where <math|i\<in\><around*|{|0,\<ldots\>,N-1|}>> is the exponent of
    <math|Y> and where <math|i<rsub|0>\<assign\>i mod L> and
    <math|i<rsub|1>\<assign\>i mod s> are the exponents of <math|Z> and
    <math|U>. This may be achieved in time <math|O<around*|(|t*M*N*lg
    N*<around*|(|q<rprime|'>+lg N|)>|)>=O<around*|(|t*n*lg n|)>> using the
    same sorting strategy as in section<nbsp><reference|FFT-sec>. The inverse
    rearrangement is handled similarly.<vspace|0.5fn>
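
    The rearrangement can be checked on a small instance. The sketch below
    assumes integer coefficients in place of
    <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>, drops the
    variable <math|X>, and uses naive convolutions; it only illustrates that
    the index map <math|i\<mapsto\><around*|(|i mod L,i mod s|)>> is
    compatible with multiplication when <math|gcd<around*|(|L,s|)>=1>. The
    function names are hypothetical.

```python
from math import gcd


def y_to_zu(coeffs, L, s):
    """Send sum c_i Y^i (mod Y^N - 1, N = L*s) to a table indexed by
    (i mod L, i mod s), the exponents of Z and U."""
    assert len(coeffs) == L * s and gcd(L, s) == 1
    table = [[0] * s for _ in range(L)]
    for i, c in enumerate(coeffs):
        table[i % L][i % s] = c
    return table


def cyclic_mul(a, b):
    """Naive product in Z[Y]/(Y^N - 1)."""
    n = len(a)
    out = [0] * n
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[(i + j) % n] += x * y
    return out


def bicyclic_mul(A, B):
    """Naive product in Z[Z, U]/(Z^L - 1, U^s - 1)."""
    L, s = len(A), len(A[0])
    out = [[0] * s for _ in range(L)]
    for i0 in range(L):
        for i1 in range(s):
            for j0 in range(L):
                for j1 in range(s):
                    out[(i0 + j0) % L][(i1 + j1) % s] += A[i0][i1] * B[j0][j1]
    return out


if __name__ == "__main__":
    L, s = 4, 3
    a = list(range(1, 13))
    b = list(range(5, 17))
    lhs = y_to_zu(cyclic_mul(a, b), L, s)
    rhs = bicyclic_mul(y_to_zu(a, L, s), y_to_zu(b, L, s))
    print(lhs == rhs)
```

    By the Chinese remainder theorem the map on exponents is a bijection, so
    the table has exactly one entry for each coefficient.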

    <no-indent><strong|Reduction to univariate transforms.> For
    multiplication in <math|\<cal-R\><around*|[|X,Z|]>/<around*|(|X<rsup|M>-1,Z<rsup|L>-1|)>>,
    we will use bivariate DFTs over <math|\<cal-R\>>. This is possible
    because <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>
    contains both <math|M><nbhyph>th and <math|L>-th primitive roots of
    unity, namely <math|\<omega\><rsup|2<rsup|q<rprime|'>+1>/M>> and
    <math|\<omega\><rsup|2<rsup|q<rprime|'>+1>/L>>, since
    <math|q<rprime|'>\<succ\>lg M> and <math|q<rprime|'>\<succ\>lg L>. More
    precisely, we must perform <math|t+1> forward bivariate DFTs
    and<nbsp><math|t> inverse bivariate DFTs of length <math|M\<times\>L>
    over<nbsp><math|\<cal-R\>>, and <math|t*M*L> multiplications in
    <math|\<cal-R\>>. Each bivariate DFT reduces further to <math|s*M>
    univariate DFTs of length<nbsp><math|L> over
    <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>> (with respect
    to <math|Z>) and <math|s*L> univariate DFTs of length <math|M> over
    <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>> (with respect
    to <math|X>). Interspersed between these steps are various matrix
    transpose operations of total cost <math|O<around*|(|t*s*M*L*lg<around*|(|s*M*L|)>*q<rprime|'>|)>=O<around*|(|t*n*lg
    n|)>>, to enable efficient access to the ``rows'' and ``columns'' (see
    section<nbsp><reference|arrays-sec>).

    Multiplications in <math|\<cal-R\>> are handled by zero-padding, i.e., we
    first use Cooley--Tukey to multiply in
    <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]><around*|[|U|]>/<around|(|U<rsup|2<rsup|<around*|\<lceil\>|lg
    s|\<rceil\>>+1>>-1|)>>, and then reduce modulo <math|U<rsup|s>-1>. The
    total cost of these multiplications is <math|O<around*|(|t*M*L*s*lg
    s*<math-ss|I><around*|(|q<rprime|'>|)>|)>=O<around*|(|t*n*lg
    s*<math-ss|I><around*|(|q<rprime|'>|)>/q<rprime|'>|)>=O<around*|(|t*n*lg
    lg n*<around*|(|lg lg lg n|)><rsup|2>|)>=O<around*|(|t*n*lg
    n|)>>.<vspace|0.5fn>
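
    The zero-padding step can be sketched as follows. This is only an
    illustration under simplifying assumptions: coefficients live in
    <math|\<bbb-Z\>/p*\<bbb-Z\>> for an arbitrary modulus <math|p> rather
    than in <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>, and
    the power-of-two cyclic product is computed naively instead of with
    Cooley--Tukey. The function names are hypothetical.

```python
def cyclic_mul(a, b, n, p):
    """Naive product in (Z/pZ)[U]/(U^n - 1) for coefficient lists of length n."""
    out = [0] * n
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            k = (i + j) % n
            out[k] = (out[k] + x * y) % p
    return out


def mul_mod_us_minus_1(a, b, s, p):
    """Multiply modulo U^s - 1 via a zero-padded power-of-two cyclic product."""
    P = 1 << ((s - 1).bit_length() + 1)  # 2^(ceil(log2 s) + 1), at least 2*s
    pa = a + [0] * (P - s)
    pb = b + [0] * (P - s)
    # The plain product has degree at most 2*s - 2, below P, so the cyclic
    # product of length P coincides with it.
    prod = cyclic_mul(pa, pb, P, p)
    out = [0] * s
    for i, c in enumerate(prod):
        out[i % s] = (out[i % s] + c) % p  # fold exponents modulo U^s - 1
    return out


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [6, 7, 8, 9, 10]
    print(mul_mod_us_minus_1(a, b, 5, 31) == cyclic_mul(a, b, 5, 31))
```

    The padded length is a power of two, which is what allows a radix-two
    transform to be used even though <math|s> is odd.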

    <no-indent><strong|Reduction to short transforms.> Consider one of the
    ``long'' univariate DFTs of length <math|2<rsup|k>\<in\><around*|{|M,L|}>>
    over <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>. We
    decompose the DFT into ``short'' DFTs of length <math|M<rprime|'>> as
    follows. Let <math|r\<assign\>lg M<rprime|'>=O<around*|(|lg lg n*lg lg lg
    n|)>> and <math|d\<assign\><around*|\<lceil\>|k/r|\<rceil\>>=O<around*|(|lg
    n/<around*|(|lg lg n*lg lg lg n|)>|)>>, and write
    <math|k=r<rsub|1>+\<cdots\>+r<rsub|d>> where <math|r<rsub|i>\<assign\>r>
    for <math|1\<leqslant\>i\<leqslant\>d-1> and
    <math|r<rsub|d>\<assign\>k-<around*|(|d-1|)>*r\<leqslant\>r>. We use the
    algorithm <math|\<cal-A\>\<assign\>\<cal-A\><rsub|1>\<odot\>\<cdots\>\<odot\>\<cal-A\><rsub|d>>,
    where for <math|1\<leqslant\>i\<leqslant\>d-1> we take
    <math|\<cal-A\><rsub|i>> to be the algorithm based on Bluestein's method
    (discussed immediately before<nbsp><eqref|Bluestein-mersenne>), and
    where<nbsp><math|\<cal-A\><rsub|d>> is the usual Cooley--Tukey algorithm
    over <math|\<bbb-F\><rsub|p<rprime|'>><around*|[|\<mathi\>|]>>. Let
    <math|<math-ss|D><rsub|k>> be the cost of a single invocation
    of<nbsp><math|\<cal-A\>> (or of the corresponding inverse transform
    <math|\<cal-A\><rprime|'>>). By <eqref|fft-rec-bound2> we have

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|D><rsub|k>>|<cell|\<leqslant\>>|<cell|<around*|(|d-1|)>*<math-ss|B><rsub|q<rprime|'>,2<rsup|k-r>><around*|(|2<rsup|r>|)>+O<around*|(|2<rsup|k-r<rsub|d>>*2<rsup|r<rsub|d>>*r<rsub|d>*<math-ss|I><around*|(|q<rprime|'>|)>|)>+O<around*|(|d*2<rsup|k>*<math-ss|I><around*|(|q<rprime|'>|)>|)>+O<around*|(|2<rsup|k>*q<rprime|'>*lg
      n|)>.>>>>
    </eqnarray*>

    The cost of precomputing the necessary root tables is only
    <math|O<around*|(|2<rsup|k>*<math-ss|I><around*|(|q<rprime|'>|)>|)>>. By
    definition <math|<math-ss|B><rsub|q<rprime|'>,*2<rsup|k-r>><around*|(|2<rsup|r>|)>\<leqslant\><around*|(|2\<cdot\>2<rsup|k-r>+1|)>*<math-ss|B><rsub|q<rprime|'>><around*|(|M<rprime|'>|)>>.
    From<nbsp><eqref|Bluestein-mersenne> and the estimate
    <math|2<rsup|k-r>\<succ\>lg lg n>, the first term becomes

    <\eqnarray*>
      <tformat|<table|<row|<cell|<around*|(|d-1|)>*<math-ss|B><rsub|q<rprime|'>,2<rsup|k-r>><around*|(|2<rsup|r>|)>>|<cell|\<leqslant\>>|<cell|<around*|(|2+O<around*|(|1/lg
      lg n|)>|)>*<around*|(|d-1|)>*2<rsup|k-r>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|d*2<rsup|k-r>*M<rprime|'>*<math-ss|I><around*|(|q<rprime|'>|)>|)>.>>>>
    </eqnarray*>

    The contribution to <math|<math-ss|D><rsub|k>> from all terms involving
    <math|<math-ss|I><around*|(|q<rprime|'>|)>> is

    <\equation*>
      O<around*|(|2<rsup|k>*<around*|(|r<rsub|d>+d|)>*<math-ss|I><around*|(|q<rprime|'>|)>|)>=O<around*|(|2<rsup|k>*<frac|lg
      n|lg lg n*lg lg lg n>*q<rprime|'>*lg lg n*lg lg lg
      n|)>=O<around*|(|2<rsup|k>*q<rprime|'>*lg n|)>,
    </equation*>

    so

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|D><rsub|k>>|<cell|\<leqslant\>>|<cell|<around*|(|2+O<around*|(|1/lg
      lg n|)>|)>*<around*|(|d-1|)>*2<rsup|k-r>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|2<rsup|k>*q<rprime|'>*lg
      n|)>.>>>>
    </eqnarray*>

    Denoting by <math|<math-ss|D>> the cost of a bivariate DFT of length
    <math|M\<times\>L> over <math|\<cal-R\>>, we thus have (ignoring the
    transposition costs, which were included earlier)

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|D>>|<cell|=>|<cell|s*L*<math-ss|D><rsub|lg
      M>+s*M*<math-ss|D><rsub|lg L>>>|<row|<cell|>|<cell|\<leqslant\>>|<cell|<around*|(|2+O<around*|(|<frac|1|lg
      lg n>|)>|)>*<around*|(|s*L*<around*|\<lfloor\>|<frac|lg M|lg
      M<rprime|'>>|\<rfloor\>>*<frac|M|M<rprime|'>>+s*M*<around*|\<lfloor\>|<frac|lg
      L|lg M<rprime|'>>|\<rfloor\>>*<frac|L|M<rprime|'>>|)>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|s*L*M*q<rprime|'>*lg
      n|)>>>|<row|<cell|>|<cell|\<leqslant\>>|<cell|<around*|(|2+O<around*|(|<frac|1|lg
      lg n>|)>|)>*s*L*M*<frac|lg <around*|(|L*M|)>|M<rprime|'>*lg
      M<rprime|'>>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|s*L*M*q<rprime|'>*lg
      n|)>>>|<row|<cell|>|<cell|\<leqslant\>>|<cell|<around*|(|4+O<around*|(|<frac|1|lg
      lg n>|)>|)>*<frac|n*lg n|n<rprime|'>*lg
      M<rprime|'>>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|n*lg
      n|)>.>>>>
    </eqnarray*>

    Moreover, since

    <\equation*>
      <frac|lg n<rprime|'>|lg M<rprime|'>>=1+<frac|lg q<rprime|'>|lg
      M<rprime|'>>=1+O<around*|(|<frac|1|lg lg
      q<rprime|'>>|)>=1+O<around*|(|<frac|1|lg lg lg n>|)>\<nocomma\>,
    </equation*>

    we get

    <\eqnarray*>
      <tformat|<table|<row|<cell|<math-ss|D>>|<cell|\<leqslant\>>|<cell|<around*|(|4+O<around*|(|<frac|1|lg
      lg lg n>|)>|)>*<frac|n*lg n|n<rprime|'>*lg
      n<rprime|'>>*<math-ss|C><around*|(|n<rprime|'>|)>+O<around*|(|n*lg
      n|)>.>>>>
    </eqnarray*>

    <yes-indent>We must perform <math|2*t+1> bivariate DFTs; the
    bound<nbsp><eqref|mersenne-rel-1> then follows exactly as in the proof of
    Theorem<nbsp><reference|main-lem>.

    For large <math|n>, we have <math|log q<rprime|'>=O<around*|(|log log
    M|)>=O<around*|(|log log n|)>>, so <math|log n<rprime|'>=log
    q<rprime|'>+O<around*|(|\<mu\><around*|(|q<rprime|'>|)>|)>=O<around*|(|log
    q<rprime|'>*log log q<rprime|'>|)>=O<around*|(|log log n*log log log
    n|)>>. Thus there exists a constant <math|C\<gtr\>0> such that <math|log
    log log n<rprime|'>\<leqslant\>log log log log n+C> for large <math|n>,
    and we may take <math|\<Phi\><around*|(|x|)>\<assign\>exp<rsup|\<circ\>3><around*|(|log<rsup|\<circ\>4>
    x+C|)>>.
  </proof>

  <\render-proof|Proof of Theorem <reference|mersenne-thm>>
    Follows from Theorem<nbsp><reference|conj-rec-th> and
    Proposition<nbsp><reference|slow-rec-lem>, analogously to the proof of
    Theorem<nbsp><reference|main-thm>.
  </render-proof>

  <\bibliography|bib|plain|mul.bib>
    <\bib-list|10>
      <bibitem*|1><label|bib-AKS04>M.<nbsp>Agrawal, N.<nbsp>Kayal, and
      N.<nbsp>Saxena. <newblock>PRIMES is in P.
      <newblock><with|font-shape|italic|Annals of Math.>, 160(2):781--793,
      2004.

      <bibitem*|2><label|bib-AhHoUl1974>A.<nbsp>V. Aho, J.<nbsp>E. Hopcroft,
      and J.<nbsp>D. Ullman. <newblock><with|font-shape|italic|The design and
      analysis of computer algorithms>. <newblock>Addison-Wesley, 1974.

      <bibitem*|3><label|bib-Bluestein1970>L.<nbsp>I. Bluestein. <newblock>A
      linear filtering approach to the computation of discrete Fourier
      transform. <newblock><with|font-shape|italic|IEEE Transactions on Audio
      and Electroacoustics>, 18(4):451--455, 1970.

      <bibitem*|4><label|bib-BGS07>A.<nbsp>Bostan, P.<nbsp>Gaudry, and É.
      Schost. <newblock>Linear recurrences with polynomial coefficients and
      application to integer factorization and Cartier-Manin operator.
      <newblock><with|font-shape|italic|SIAM J. Comput.>, 36:1777--1806,
      2007.

      <bibitem*|5><label|bib-Boy85>C.<nbsp>B. Boyer.
      <newblock><with|font-shape|italic|A History of Mathematics>.
      <newblock>Princeton Univ. Press, first paperback edition, 1985.

      <bibitem*|6><label|bib-Br76b>R.<nbsp>P. Brent. <newblock>Fast
      multiple-precision evaluation of elementary functions.
      <newblock><with|font-shape|italic|J. Assoc. Comput. Mach.>,
      23(2):242--251, 1976.

      <bibitem*|7><label|bib-BZ10>R.<nbsp>P. Brent and P.<nbsp>Zimmermann.
      <newblock><with|font-shape|italic|Modern Computer Arithmetic>.
      <newblock>Cambridge University Press, 2010.

      <bibitem*|8><label|bib-BuClSh1997>P.<nbsp>Bürgisser, M.<nbsp>Clausen,
      and M.<nbsp>A. Shokrollahi. <newblock><with|font-shape|italic|Algebraic
      complexity theory>. <newblock>Springer-Verlag, 1997.

      <bibitem*|9><label|bib-CFA+-handbook>H.<nbsp>Cohen, G.<nbsp>Frey,
      R.<nbsp>Avanzi, Ch. Doche, T.<nbsp>Lange, K.<nbsp>Nguyen, and
      F.<nbsp>Vercauteren, editors. <newblock><with|font-shape|italic|Handbook
      of elliptic and hyperelliptic curve cryptography>. <newblock>Discrete
      Mathematics and its Applications. Chapman & Hall/CRC, Boca Raton, FL,
      2006.

      <bibitem*|10><label|bib-Cook66>S.<nbsp>A. Cook.
      <newblock><with|font-shape|italic|On the minimum computation time of
      functions>. <newblock>PhD thesis, Harvard University, 1966.

      <bibitem*|11><label|bib-CT65>J.<nbsp>W. Cooley and J.<nbsp>W. Tukey.
      <newblock>An algorithm for the machine calculation of complex Fourier
      series. <newblock><with|font-shape|italic|Math. Computat.>,
      19:297--301, 1965.

      <bibitem*|12><label|bib-CF94>R.<nbsp>Crandall and B.<nbsp>Fagin.
      <newblock>Discrete weighted transforms and large-integer arithmetic.
      <newblock><with|font-shape|italic|Math. Comp.>, 62(205):305--324, 1994.

      <bibitem*|13><label|bib-CP05>R.<nbsp>Crandall and C.<nbsp>Pomerance.
      <newblock><with|font-shape|italic|Prime numbers. A computational
      perspective>. <newblock>Springer, New York, 2nd edition, 2005.

      <bibitem*|14><label|bib-CT89>R.<nbsp>Creutzburg and M.<nbsp>Tasche.
      <newblock>Parameter determination for complex number-theoretic
      transforms using cyclotomic polynomials.
      <newblock><with|font-shape|italic|Math. Comp.>, 52(185):189--200, 1989.

      <bibitem*|15><label|bib-DeKuSaSa2013>A.<nbsp>De, P.<nbsp>P. Kurur,
      C.<nbsp>Saha, and R.<nbsp>Saptharishi. <newblock>Fast integer
      multiplication using modular arithmetic.
      <newblock><with|font-shape|italic|SIAM J. Comput.>, 42(2):685--699,
      2013.

      <bibitem*|16><label|bib-Ec92>J.<nbsp>Écalle.
      <newblock><with|font-shape|italic|Introduction aux fonctions
      analysables et preuve constructive de la conjecture de Dulac>.
      <newblock>Hermann, collection: Actualités mathématiques, 1992.

      <bibitem*|17><label|bib-Fur89>M.<nbsp>Fürer. <newblock>On the
      complexity of integer multiplication (extended abstract).
      <newblock>Technical Report CS-89-17, Pennsylvania State University,
      1989.

      <bibitem*|18><label|bib-Furer2007>M.<nbsp>Fürer. <newblock>Faster
      integer multiplication. <newblock>In
      <with|font-shape|italic|Proceedings of the Thirty-Ninth ACM Symposium
      on Theory of Computing, STOC 2007>, pages 57--66, New York, NY, USA,
      2007. ACM Press.

      <bibitem*|19><label|bib-Furer2009>M.<nbsp>Fürer. <newblock>Faster
      integer multiplication. <newblock><with|font-shape|italic|SIAM J.
      Comp.>, 39(3):979--1005, 2009.

      <bibitem*|20><label|bib-Furer2014>M.<nbsp>Fürer. <newblock>How fast can
      we multiply large integers on an actual computer? <newblock>Technical
      report, <rigid|<slink|http://arxiv.org/abs/1402.1811>>, 2014.

      <bibitem*|21><label|bib-GaGe2002>J.<nbsp>von<nbsp>zur Gathen and
      J.<nbsp>Gerhard. <newblock><with|font-shape|italic|Modern Computer
      Algebra>. <newblock>Cambridge University Press, 2nd edition, 2002.

      <bibitem*|22><label|bib-Gou-p-adic>F.<nbsp>Q. Gouvêa.
      <newblock><with|font-shape|italic|<math|p>-adic numbers. An
      introduction>. <newblock>Universitext. Springer-Verlag, Berlin, 1993.

      <bibitem*|23><label|bib-GMP>T.<nbsp>Granlund et<nbsp>al. <newblock>GMP,
      the GNU multiple precision arithmetic library.
      <newblock><slink|http://gmplib.org>, 1991. Latest version 6.0.0
      released in 2014.

      <bibitem*|24><label|bib-vdH:ffmul>D.<nbsp>Harvey, J.<nbsp>van<nbsp>der
      Hoeven, and G.<nbsp>Lecerf. <newblock>Faster polynomial multiplication
      over finite fields. <newblock>Technical report, HAL, 2014.
      <newblock><slink|http://hal.archives-ouvertes.fr>.

      <bibitem*|25><label|bib-HB-almost>D.<nbsp>R. Heath-Brown.
      <newblock>Almost-primes in arithmetic progressions and short intervals.
      <newblock><with|font-shape|italic|Math. Proc. Cambridge Philos. Soc.>,
      83(3):357--375, 1978.

      <bibitem*|26><label|bib-HB-linnik>D.<nbsp>R. Heath-Brown.
      <newblock>Zero-free regions for Dirichlet <math|L>-functions, and the
      least prime in an arithmetic progression.
      <newblock><with|font-shape|italic|Proc. London Math. Soc. (3)>,
      64(2):265--338, 1992.

      <bibitem*|27><label|bib-HJB-gauss-fft>M.<nbsp>T. Heideman, D.<nbsp>H.
      Johnson, and C.<nbsp>S. Burrus. <newblock>Gauss and the history of the
      fast Fourier transform. <newblock><with|font-shape|italic|Arch. Hist.
      Exact Sci.>, 34(3):265--277, 1985.

      <bibitem*|28><label|bib-vdH:jncf>J.<nbsp>van<nbsp>der Hoeven.
      <newblock><with|font-shape|italic|Journées Nationales de Calcul Formel
      (2011)>, volume<nbsp>2 of <with|font-shape|italic|Les cours du CIRM>,
      chapter Calcul analytique. <newblock>CEDRAM, 2011. <newblock>Exp. No.
      4, 85 pages, <rigid|<slink|http://ccirm.cedram.org/ccirm-bin/fitem?id=CCIRM_2011__2_1_A4_0>>.

      <bibitem*|29><label|bib-vdH:mmx>J.<nbsp>van<nbsp>der Hoeven,
      G.<nbsp>Lecerf, B.<nbsp>Mourrain, et<nbsp>al. <newblock>Mathemagix,
      2002. <newblock><slink|http://www.mathemagix.org>.

      <bibitem*|30><label|bib-Kar62>A.<nbsp>Karatsuba and J.<nbsp>Ofman.
      <newblock>\<#423\>\<#43C\>\<#43D\>\<#43E\>\<#436\>\<#435\>\<#43D\>\<#438\>\<#435\>
      \<#43C\>\<#43D\>\<#43E\>\<#433\>\<#43E\>\<#437\>\<#43D\>\<#430\>\<#447\>\<#43D\>\<#44B\>\<#445\>
      \<#447\>\<#438\>\<#441\>\<#435\>\<#43B\> \<#43D\>\<#430\>
      \<#430\>\<#432\>\<#442\>\<#43E\>\<#43C\>\<#430\>\<#442\>\<#430\>\<#445\>.
      <newblock><with|font-shape|italic|Doklady Akad. Nauk SSSR>, 7:293--294,
      1962. <newblock>English translation in <cite|Kar63>.

      <bibitem*|31><label|bib-Kar63>A.<nbsp>Karatsuba and J.<nbsp>Ofman.
      <newblock>Multiplication of multidigit numbers on automata.
      <newblock><with|font-shape|italic|Soviet Physics Doklady>, 7:595--596,
      1963.

      <bibitem*|32><label|bib-Kn69>D.<nbsp>E. Knuth.
      <newblock><with|font-shape|italic|The Art of Computer Programming>,
      volume 2: Seminumerical Algorithms. <newblock>Addison-Wesley, 1969.

      <bibitem*|33><label|bib-Knu-vol3>D.<nbsp>E. Knuth.
      <newblock><with|font-shape|italic|The art of computer programming>,
      volume 3: Sorting and Searching. <newblock>Addison-Wesley, Reading, MA,
      1998.

      <bibitem*|34><label|bib-Lin44a>Yu.<nbsp>V. Linnik. <newblock>On the
      least prime in an arithmetic progression I. The basic theorem.
      <newblock><with|font-shape|italic|Rec. Math. (Mat. Sbornik) N.S.>,
      15(57):139--178, 1944.

      <bibitem*|35><label|bib-Lin44b>Yu.<nbsp>V. Linnik. <newblock>On the
      least prime in an arithmetic progression II. The Deuring-Heilbronn
      phenomenon. <newblock><with|font-shape|italic|Rec. Math. (Mat. Sbornik)
      N.S.>, 15(57):347--368, 1944.

      <bibitem*|36><label|bib-Moo66>R.<nbsp>E. Moore.
      <newblock><with|font-shape|italic|Interval Analysis>.
      <newblock>Prentice Hall, Englewood Cliffs, N.J., 1966.

      <bibitem*|37><label|bib-Neu57>O.<nbsp>Neugebauer.
      <newblock><with|font-shape|italic|The Exact Sciences in Antiquity>.
      <newblock>Brown Univ. Press, Providence, R.I., 1957.

      <bibitem*|38><label|bib-Pap94>C.<nbsp>H. Papadimitriou.
      <newblock><with|font-shape|italic|Computational Complexity>.
      <newblock>Addison-Wesley, 1994.

      <bibitem*|39><label|bib-Pol71>J.<nbsp>M. Pollard. <newblock>The fast
      Fourier transform in a finite field.
      <newblock><with|font-shape|italic|Math. Comp.>, 25(114):365--374, 1971.

      <bibitem*|40><label|bib-Pom-primality>C.<nbsp>Pomerance.
      <newblock>Recent developments in primality testing.
      <newblock><with|font-shape|italic|Math. Intelligencer>, 3(3):97--105,
      1980/81.

      <bibitem*|41><label|bib-Rad-prime>C.<nbsp>M. Rader. <newblock>Discrete
      Fourier transforms when the number of data samples is prime.
      <newblock><with|font-shape|italic|Proc. IEEE>, 56(6):1107--1108, June
      1968.

      <bibitem*|42><label|bib-RaKiHw2010>K.<nbsp>R. Rao, D.<nbsp>N. Kim, and
      J.<nbsp>J. Hwang. <newblock><with|font-shape|italic|Fast Fourier
      Transform - Algorithms and Applications>. <newblock>Signals and
      Communication Technology. Springer-Verlag, 2010.

      <bibitem*|43><label|bib-RT-convolutions>I.<nbsp>S. Reed and T.<nbsp>K.
      Truong. <newblock>The use of finite fields to compute convolutions.
      <newblock><with|font-shape|italic|IEEE Trans. Inform. Theory>,
      IT-21:208--213, 1975.

      <bibitem*|44><label|bib-Schm01>M.<nbsp>C. Schmeling.
      <newblock><with|font-shape|italic|Corps de transséries>. <newblock>PhD
      thesis, Université Paris-VII, France, 2001.

      <bibitem*|45><label|bib-Sch66>A.<nbsp>Schönhage.
      <newblock>Multiplikation großer Zahlen.
      <newblock><with|font-shape|italic|Computing>, 1(3):182--196, 1966.

      <bibitem*|46><label|bib-Sch80>A.<nbsp>Schönhage. <newblock>Storage
      modification machines. <newblock><with|font-shape|italic|SIAM J. on
      Comp.>, 9:490--508, 1980.

      <bibitem*|47><label|bib-SS71>A.<nbsp>Schönhage and V.<nbsp>Strassen.
      <newblock>Schnelle Multiplikation großer Zahlen.
      <newblock><with|font-shape|italic|Computing>, 7:281--292, 1971.

      <bibitem*|48><label|bib-Shparlinski1996>I.<nbsp>Shparlinski.
      <newblock>On finding primitive roots in finite fields.
      <newblock><with|font-shape|italic|Theoret. Comput. Sci.>,
      157(2):273--275, 1996.

      <bibitem*|49><label|bib-Smi58>D.<nbsp>E. Smith.
      <newblock><with|font-shape|italic|History of Mathematics>,
      volume<nbsp>2. <newblock>Dover, 1958.

      <bibitem*|50><label|bib-Toom63b>A.<nbsp>L. Toom. <newblock>The
      complexity of a scheme of functional elements realizing the
      multiplication of integers. <newblock><with|font-shape|italic|Soviet
      Mathematics>, 4(2):714--716, 1963.

      <bibitem*|51><label|bib-Toom63a>A.<nbsp>L. Toom. <newblock>\<#41E\>
      \<#441\>\<#43B\>\<#43E\>\<#436\>\<#43D\>\<#43E\>\<#441\>\<#442\>\<#438\>
      \<#441\>\<#445\>\<#435\>\<#43C\>\<#44B\> \<#438\>\<#437\>
      \<#444\>\<#443\>\<#43D\>\<#43A\>\<#446\>\<#438\>\<#43E\>\<#43D\>\<#430\>\<#43B\>\<#44C\>\<#43D\>\<#44B\>\<#445\>
      \<#44D\>\<#43B\>\<#435\>\<#43C\>\<#435\>\<#43D\>\<#442\>\<#43E\>\<#432\>,
      \<#440\>\<#435\>\<#430\>\<#43B\>\<#438\>\<#437\>\<#443\>\<#44E\>\<#449\>\<#435\>\<#439\>
      \<#443\>\<#43C\>\<#43D\>\<#43E\>\<#436\>\<#435\>\<#43D\>\<#438\>\<#435\>
      \<#446\>\<#435\>\<#43B\>\<#44B\>\<#445\>
      \<#447\>\<#438\>\<#441\>\<#435\>\<#43B\>.
      <newblock><with|font-shape|italic|Doklady Akad. Nauk SSSR>,
      150:496--498, 1963. <newblock>English translation in <cite|Toom63b>.

      <bibitem*|52><label|bib-Wag83>S.<nbsp>Wagstaff. <newblock>Divisors of
      Mersenne numbers. <newblock><with|font-shape|italic|Math. Comp.>,
      40(161):385--397, 1983.

      <bibitem*|53><label|bib-Xyl11>T.<nbsp>Xylouris. <newblock>On the least
      prime in an arithmetic progression and estimates for the zeros of
      Dirichlet L-functions. <newblock><with|font-shape|italic|Acta Arith.>,
      150(1):65--91, 2011.
    </bib-list>
  </bibliography>
</body>

<\initial>
  <\collection>
    <associate|font-base-size|11>
    <associate|info-flag|short>
    <associate|par-kerning-stretch|tolerant>
  </collection>
</initial>

<\references>
  <\collection>
    <associate|BK-cyclic-prop|<tuple|6.3|15>>
    <associate|BK-prop|<tuple|4.1|11>>
    <associate|Bluestein-mersenne|<tuple|9.4|25>>
    <associate|Bluestein-roots-prop|<tuple|12|10>>
    <associate|Bluestein-sec|<tuple|2.5|7>>
    <associate|CF-rem|<tuple|9.3|23>>
    <associate|CF-sec|<tuple|9.3|23>>
    <associate|CT-cor|<tuple|3.9|10>>
    <associate|CT-roots-prop|<tuple|11|9>>
    <associate|DFT-prop|<tuple|9|?>>
    <associate|DFT-sec|<tuple|2.2|5>>
    <associate|E-bound|<tuple|5.5|14>>
    <associate|F-red|<tuple|11|?>>
    <associate|F-red-bis|<tuple|4|?>>
    <associate|FFT-dec|<tuple|2.2|6>>
    <associate|FFT-dec-bound|<tuple|5|6>>
    <associate|FFT-mult|<tuple|2.5|6>>
    <associate|FFT-sec|<tuple|2.3|6>>
    <associate|Furer-lem|<tuple|7.1|17>>
    <associate|Furer-sec|<tuple|7|17>>
    <associate|GRH|<tuple|8.3|20>>
    <associate|H-cond|<tuple|14|14>>
    <associate|K-bound|<tuple|1.1|1>>
    <associate|Kronecker-sec|<tuple|2.6|8>>
    <associate|Linnik-conj|<tuple|36|21>>
    <associate|Linnik-lem|<tuple|8.2|20>>
    <associate|Mersenne-conj|<tuple|9.1|22>>
    <associate|Mersenne-prop|<tuple|9.2|22>>
    <associate|T-rec|<tuple|14|13>>
    <associate|Tbar-rec|<tuple|12|11>>
    <associate|U-ineq|<tuple|9|?>>
    <associate|add-prop|<tuple|3.2|9>>
    <associate|admissible|<tuple|6.1|15>>
    <associate|admissible-sec|<tuple|8.1|11>>
    <associate|analysis-sec|<tuple|7.4|?>>
    <associate|arrays-sec|<tuple|2.1|5>>
    <associate|auto-1|<tuple|1|1>>
    <associate|auto-10|<tuple|2.5|7>>
    <associate|auto-11|<tuple|2.6|8>>
    <associate|auto-12|<tuple|3|8>>
    <associate|auto-13|<tuple|3.1|8>>
    <associate|auto-14|<tuple|3.2|9>>
    <associate|auto-15|<tuple|3.3|10>>
    <associate|auto-16|<tuple|3.4|10>>
    <associate|auto-17|<tuple|4|10>>
    <associate|auto-18|<tuple|5|12>>
    <associate|auto-19|<tuple|6|14>>
    <associate|auto-2|<tuple|1.1|2>>
    <associate|auto-20|<tuple|7|17>>
    <associate|auto-21|<tuple|8|19>>
    <associate|auto-22|<tuple|8.1|19>>
    <associate|auto-23|<tuple|8.2|20>>
    <associate|auto-24|<tuple|9|21>>
    <associate|auto-25|<tuple|9.1|22>>
    <associate|auto-26|<tuple|9.2|22>>
    <associate|auto-27|<tuple|9.3|23>>
    <associate|auto-28|<tuple|9.4|24>>
    <associate|auto-29|<tuple|9.5|27>>
    <associate|auto-3|<tuple|1.1|3>>
    <associate|auto-30|<tuple|23|28>>
    <associate|auto-31|<tuple|23|26>>
    <associate|auto-32|<tuple|9.3|29>>
    <associate|auto-33|<tuple|9.4|23>>
    <associate|auto-34|<tuple|24|26>>
    <associate|auto-35|<tuple|38|17>>
    <associate|auto-36|<tuple|35|17>>
    <associate|auto-37|<tuple|<with|mode|<quote|math>|\<bullet\>>|?>>
    <associate|auto-38|<tuple|<with|mode|<quote|math>|\<bullet\>>|?>>
    <associate|auto-4|<tuple|1.2|3>>
    <associate|auto-5|<tuple|2|5>>
    <associate|auto-6|<tuple|2.1|5>>
    <associate|auto-7|<tuple|2.2|5>>
    <associate|auto-8|<tuple|2.3|6>>
    <associate|auto-9|<tuple|2.4|7>>
    <associate|bib-AKS04|<tuple|1|27>>
    <associate|bib-AhHoUl1974|<tuple|2|27>>
    <associate|bib-BGS07|<tuple|4|27>>
    <associate|bib-BZ10|<tuple|7|27>>
    <associate|bib-Bach97|<tuple|3|29>>
    <associate|bib-Bluestein1970|<tuple|3|27>>
    <associate|bib-Boy85|<tuple|5|27>>
    <associate|bib-Br76b|<tuple|6|27>>
    <associate|bib-BuClSh1997|<tuple|8|27>>
    <associate|bib-Bur62|<tuple|8|29>>
    <associate|bib-CF94|<tuple|12|27>>
    <associate|bib-CFA+-handbook|<tuple|9|27>>
    <associate|bib-CP05|<tuple|13|27>>
    <associate|bib-CT65|<tuple|11|27>>
    <associate|bib-CT89|<tuple|14|27>>
    <associate|bib-Cook66|<tuple|10|27>>
    <associate|bib-DeKuSaSa2013|<tuple|15|27>>
    <associate|bib-Ec92|<tuple|16|27>>
    <associate|bib-Fur89|<tuple|17|27>>
    <associate|bib-Furer2007|<tuple|18|27>>
    <associate|bib-Furer2009|<tuple|19|27>>
    <associate|bib-Furer2014|<tuple|20|27>>
    <associate|bib-GMP|<tuple|23|27>>
    <associate|bib-GaGe2002|<tuple|21|27>>
    <associate|bib-Gou-p-adic|<tuple|22|27>>
    <associate|bib-HB-almost|<tuple|25|27>>
    <associate|bib-HB-linnik|<tuple|26|27>>
    <associate|bib-HJB-gauss-fft|<tuple|27|27>>
    <associate|bib-Kar62|<tuple|30|27>>
    <associate|bib-Kar63|<tuple|31|28>>
    <associate|bib-Kn69|<tuple|32|28>>
    <associate|bib-Knu-vol3|<tuple|33|28>>
    <associate|bib-Lin44a|<tuple|34|28>>
    <associate|bib-Lin44b|<tuple|35|28>>
    <associate|bib-MCR79|<tuple|29|29>>
    <associate|bib-Moo66|<tuple|36|28>>
    <associate|bib-Neu57|<tuple|37|28>>
    <associate|bib-Pap94|<tuple|38|28>>
    <associate|bib-Pol71|<tuple|39|28>>
    <associate|bib-Pom-primality|<tuple|40|28>>
    <associate|bib-RT-convolutions|<tuple|43|28>>
    <associate|bib-RaKiHw2010|<tuple|42|28>>
    <associate|bib-Rad-prime|<tuple|41|28>>
    <associate|bib-SS71|<tuple|47|28>>
    <associate|bib-Sch66|<tuple|45|28>>
    <associate|bib-Sch80|<tuple|46|28>>
    <associate|bib-Schm01|<tuple|44|28>>
    <associate|bib-Shparlinski1996|<tuple|48|28>>
    <associate|bib-Smi58|<tuple|49|28>>
    <associate|bib-Toom63a|<tuple|51|28>>
    <associate|bib-Toom63b|<tuple|50|28>>
    <associate|bib-Wag83|<tuple|52|28>>
    <associate|bib-Xyl11|<tuple|53|28>>
    <associate|bib-vdH:ffmul|<tuple|24|27>>
    <associate|bib-vdH:jncf|<tuple|28|27>>
    <associate|bib-vdH:mmx|<tuple|29|27>>
    <associate|cache-sec|<tuple|7.3|?>>
    <associate|chirp-formula|<tuple|2.5|7>>
    <associate|ci-def|<tuple|9.2|23>>
    <associate|comp-FFT|<tuple|3.8|10>>
    <associate|computing-p-sec|<tuple|8.2|20>>
    <associate|conj-rec-th|<tuple|9.6|25>>
    <associate|data-rem|<tuple|4|6>>
    <associate|defn:admissible|<tuple|10|?>>
    <associate|eq:defn-k|<tuple|14|11>>
    <associate|eq:linnik|<tuple|11|?>>
    <associate|eq:rho-bound|<tuple|6.1|15>>
    <associate|eq:simple|<tuple|4.1|11>>
    <associate|eqn-H|<tuple|12|?>>
    <associate|err-BK|<tuple|3.4|9>>
    <associate|err-FFT|<tuple|3.4|10>>
    <associate|err-conv|<tuple|3.4|9>>
    <associate|err-mul|<tuple|3.3|9>>
    <associate|err-sec|<tuple|3|8>>
    <associate|even-faster-sec|<tuple|6|14>>
    <associate|fast-roots-sec|<tuple|7|17>>
    <associate|faster-rem|<tuple|4.2|11>>
    <associate|ff-chunked-prop|<tuple|16|?>>
    <associate|ff-regroup-prop|<tuple|3|?>>
    <associate|fft-rec-bound|<tuple|2.3|6>>
    <associate|fft-rec-bound2|<tuple|2.4|6>>
    <associate|good-prime-conj|<tuple|11|?>>
    <associate|good-prime-conj-0|<tuple|3|?>>
    <associate|history-sec|<tuple|1.1|?>>
    <associate|intro-sec|<tuple|1|1>>
    <associate|it-gen|<tuple|5.1|12>>
    <associate|it-log|<tuple|1.2|1>>
    <associate|iter-lem|<tuple|5.2|13>>
    <associate|iter-sec|<tuple|5|12>>
    <associate|lem-H|<tuple|21|14>>
    <associate|lem:admissible|<tuple|11|?>>
    <associate|lem:b-bound|<tuple|10|?>>
    <associate|lem:linnik|<tuple|10|?>>
    <associate|log-bound|<tuple|8|8>>
    <associate|log-slow-cond|<tuple|5.2|12>>
    <associate|log-slow-lem|<tuple|3|?>>
    <associate|main-cor|<tuple|1|?>>
    <associate|main-lem|<tuple|6.4|15>>
    <associate|main-rec-rel|<tuple|6.2|15>>
    <associate|main-rec-rel-bis|<tuple|19|15>>
    <associate|main-thm|<tuple|1.1|1>>
    <associate|mersenne-rel-1|<tuple|9.5|25>>
    <associate|mersenne-rel-2|<tuple|22|26>>
    <associate|mersenne-thm|<tuple|1.2|2>>
    <associate|mod-fft-sec|<tuple|2.5|7>>
    <associate|modular-sketch-sec|<tuple|8.1|19>>
    <associate|mu-estimate|<tuple|9.3|24>>
    <associate|negacyclic-sec|<tuple|7.1|?>>
    <associate|param-sec|<tuple|8|19>>
    <associate|pc-sec|<tuple|7.2|?>>
    <associate|phi-bound|<tuple|5.1|13>>
    <associate|phi-sigma|<tuple|5.4|13>>
    <associate|phistar|<tuple|5.4|?>>
    <associate|pricipal-root-unity|<tuple|1|?>>
    <associate|prim-root-bound|<tuple|37|22>>
    <associate|prime-mod-sec|<tuple|8.3|21>>
    <associate|principal-p-sec|<tuple|8.4|22>>
    <associate|principal-root-unity|<tuple|2.1|5>>
    <associate|q-rec-rel|<tuple|12|?>>
    <associate|q-rec-rel-bis|<tuple|11|?>>
    <associate|qprime-ineq|<tuple|9.4|?>>
    <associate|rec-ineq|<tuple|5.3|13>>
    <associate|rec-ineq2|<tuple|11|11|mul.tm>>
    <associate|rec-rem|<tuple|25|16>>
    <associate|roots-prop|<tuple|3.6|10>>
    <associate|s-bound|<tuple|16|11>>
    <associate|scale-prop|<tuple|15|?>>
    <associate|sec:simple|<tuple|3|?>>
    <associate|simple-algo|<tuple|12|?>>
    <associate|simple-algo-0|<tuple|4|?>>
    <associate|simple-algo-sec|<tuple|4|10>>
    <associate|simple-th|<tuple|4.4|12>>
    <associate|slow-Phi|<tuple|14|?>>
    <associate|slow-rec-corr|<tuple|4|?>>
    <associate|slow-rec-corr-gen|<tuple|5|?>>
    <associate|slow-rec-lem|<tuple|5.3|13>>
    <associate|spec-iter-lem|<tuple|22|12>>
    <associate|sqrt-prop|<tuple|3.5|10>>
    <associate|survey-sec|<tuple|2|5>>
    <associate|tau-rec|<tuple|14|13>>
    <associate|thm:mersenne|<tuple|2|?>>
    <associate|thm:simple|<tuple|4.3|11>>
    <associate|twiddle-prop|<tuple|9|9>>
    <associate|uv-dec|<tuple|9.1|22>>
    <associate|yet-faster-sec|<tuple|9|21>>
  </collection>
</references>

<\auxiliary>
  <\collection>
    <\associate|bib>
      Pap94

      Furer2007

      Furer2009

      vdH:ffmul

      GaGe2002

      Smi58

      Neu57

      Boy85

      Kar62

      Kar63

      Toom63a

      Toom63b

      Cook66

      Sch66

      Kn69

      CT65

      HJB-gauss-fft

      SS71

      Pol71

      Furer2007

      SS71

      DeKuSaSa2013

      Pol71

      Neu57

      Kar62

      Kar63

      Toom63a

      Toom63b

      Sch66

      Kn69

      SS71

      Furer2007

      Pap94

      Sch80

      Furer2014

      BuClSh1997

      Schm01

      Ec92

      Pol71

      DeKuSaSa2013

      vdH:ffmul

      Fur89

      CF94

      vdH:mmx

      GMP

      vdH:ffmul

      AhHoUl1974

      BuClSh1997

      GaGe2002

      RaKiHw2010

      BGS07

      Knu-vol3

      Bluestein1970

      Rad-prime

      GaGe2002

      BZ10

      Moo66

      vdH:jncf

      BZ10

      SS71

      vdH:ffmul

      Furer2009

      Furer2009

      DeKuSaSa2013

      Gou-p-adic

      HB-almost

      Lin44a

      Lin44b

      Xyl11

      AKS04

      DeKuSaSa2013

      HB-linnik

      Shparlinski1996

      CFA+-handbook

      Fur89

      CP05

      CF94

      Wag83

      Pom-primality

      CP05

      RT-convolutions

      CT89

      Br76b

      Kar63

      Toom63b
    </associate>
    <\associate|table>
      <tuple|normal|Historical overview of known complexity bounds for
      <with|mode|<quote|math>|n>-bit integer
      multiplication.|<pageref|auto-3>>
    </associate>
    <\associate|toc>
      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|1.<space|2spc>Introduction>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-1><vspace|0.5fn>

      <with|par-left|<quote|1tab>|1.1.<space|2spc>Brief history and related
      work <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-2>>

      <with|par-left|<quote|1tab>|1.2.<space|2spc>Our contributions and
      outline of the paper <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-4>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|2.<space|2spc>Survey
      of classical tools> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-5><vspace|0.5fn>

      <with|par-left|<quote|1tab>|2.1.<space|2spc>Arrays and sorting
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-6>>

      <with|par-left|<quote|1tab>|2.2.<space|2spc>Discrete Fourier transforms
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-7>>

      <with|par-left|<quote|1tab>|2.3.<space|2spc>The Cooley--Tukey FFT
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-8>>

      <with|par-left|<quote|1tab>|2.4.<space|2spc>Fast Fourier multiplication
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-9>>

      <with|par-left|<quote|1tab>|2.5.<space|2spc>Bluestein's chirp transform
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-10>>

      <with|par-left|<quote|1tab>|2.6.<space|2spc>Kronecker substitution and
      segmentation <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-11>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|3.<space|2spc>Fixed
      point computations and error bounds>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-12><vspace|0.5fn>

      <with|par-left|<quote|1tab>|3.1.<space|2spc>Fixed point numbers
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-13>>

      <with|par-left|<quote|1tab>|3.2.<space|2spc>Basic arithmetic
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-14>>

      <with|par-left|<quote|1tab>|3.3.<space|2spc>Precomputing roots of unity
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-15>>

      <with|par-left|<quote|1tab>|3.4.<space|2spc>Error analysis for fast
      Fourier transforms <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-16>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|4.<space|2spc>A
      simple and fast multiplication algorithm>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-17><vspace|0.5fn>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|5.<space|2spc>Logarithmically
      slow recurrence inequalities> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-18><vspace|0.5fn>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|6.<space|2spc>Even
      faster multiplication> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-19><vspace|0.5fn>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|7.<space|2spc>An
      optimised variant of Fürer's algorithm>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-20><vspace|0.5fn>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|8.<space|2spc>Fast
      multiplication using modular arithmetic>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-21><vspace|0.5fn>

      <with|par-left|<quote|1tab>|8.1.<space|2spc>Sketch of the algorithm
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-22>>

      <with|par-left|<quote|1tab>|8.2.<space|2spc>Computing suitable
      <with|mode|<quote|math>|p> and <with|mode|<quote|math>|\<omega\>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-23>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|9.<space|2spc>Conjecturally
      faster multiplication> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-24><vspace|0.5fn>

      <with|par-left|<quote|1tab>|9.1.<space|2spc>Mersenne primes
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-25>>

      <with|par-left|<quote|1tab>|9.2.<space|2spc>Crandall and Fagin's
      algorithm revisited <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-26>>

      <with|par-left|<quote|1tab>|9.3.<space|2spc>Bivariate Crandall--Fagin
      reduction <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-27>>

      <with|par-left|<quote|1tab>|9.4.<space|2spc>Conjecturally faster
      multiplication <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-28>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|Bibliography>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-29><vspace|0.5fn>
    </associate>
  </collection>
</auxiliary>