Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

small prime FFT based on ulong #2107

Draft
wants to merge 74 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
74 commits
Select commit Hold shift + click to select a range
0bf0127
add profile for powmod
vneiger Sep 16, 2024
c363adb
Merge branch 'main' into introduce_nmod_fft
vneiger Oct 19, 2024
4a887b4
add .h file
vneiger Oct 19, 2024
aee38b3
fix ifndef
vneiger Oct 19, 2024
17faaea
context and init code
vneiger Oct 19, 2024
e738256
add profile
vneiger Oct 19, 2024
dcaede7
fix include
vneiger Oct 19, 2024
cd50787
improve profile init
vneiger Oct 19, 2024
afa5ddc
rename ctx init
vneiger Oct 20, 2024
fd24de2
testing init
vneiger Oct 20, 2024
211ab75
fix explanations and complete test for init
vneiger Oct 20, 2024
6368823
remove printf
vneiger Oct 20, 2024
9eeedd6
forgot to add main
vneiger Oct 22, 2024
3fa7944
dft, test passes
vneiger Oct 23, 2024
ff33533
add profile
vneiger Oct 23, 2024
f4520c9
clean things a bit
vneiger Oct 24, 2024
e10c29c
introducing dft32 base case
vneiger Oct 24, 2024
7b605a6
dft32 base case
vneiger Oct 24, 2024
1f236d8
cleaning things
vneiger Oct 24, 2024
9bf18c7
testing from length 1
vneiger Oct 24, 2024
fb88c54
fix
vneiger Oct 24, 2024
f6cc96c
remove useless function argument
vneiger Oct 24, 2024
a675b68
vaguely faster with added lazy14 layer
vneiger Oct 24, 2024
28b3276
clean explanations
vneiger Oct 24, 2024
b71649d
finalize lazy14 version
vneiger Oct 24, 2024
8cd392c
small fixes
vneiger Oct 24, 2024
9fa9020
tentative fix for flint_bits == 32
vneiger Oct 24, 2024
ccd3f71
dft8 is now a macro, code generation was too unpredictable
vneiger Oct 24, 2024
f0587e5
putting more args slightly slows down for large lengths...
vneiger Oct 24, 2024
4cf7343
macro for dft16 helps, let's see for dft32
vneiger Oct 24, 2024
b96e0a5
dft32 macroified does help a bit
vneiger Oct 24, 2024
82a6c85
mod4 currently unused
vneiger Oct 24, 2024
0cae527
some notes / todos
vneiger Oct 25, 2024
4818198
test with prime close to announced limitw
vneiger Oct 25, 2024
c413b63
limit is 62 bits
vneiger Oct 25, 2024
174bb2d
fix assert and try struct for args... for circumventing strange behav…
vneiger Oct 25, 2024
c7f6dff
fft args and introducing iw
vneiger Oct 27, 2024
5b84260
dft 16 macro pointers
vneiger Oct 27, 2024
1fb2479
dft 16 macro pointers
vneiger Oct 27, 2024
4d8071f
dft8 simplify a bit
vneiger Oct 27, 2024
6e751ea
clean
vneiger Oct 27, 2024
192a575
dft16 simplify a bit
vneiger Oct 27, 2024
375a84d
dft16 4 4 now ok as well
vneiger Oct 27, 2024
c9b6575
dft32 expanded as well; this is ugly but will help for versions with …
vneiger Oct 27, 2024
c24e7b6
starting precomp of tab_inverse(w)
vneiger Oct 27, 2024
e0ace33
add tab_iw in fitdepth
vneiger Oct 27, 2024
b376f7e
unroll a bit is faster
vneiger Oct 27, 2024
0e0df84
notes about init
vneiger Oct 27, 2024
40539de
wip: use multipoint eval in test
vneiger Oct 27, 2024
8c8b08b
use multipoint eval in test
vneiger Oct 27, 2024
b1fb674
idft_t
vneiger Oct 27, 2024
dcf2ae7
idft_t, not tested yet
vneiger Oct 27, 2024
9d8845d
minor changes
vneiger Oct 27, 2024
a872720
progress
vneiger Oct 27, 2024
c29a5d1
add files
vneiger Oct 27, 2024
66523a7
idft test passes
vneiger Oct 27, 2024
453c068
idft test passes
vneiger Oct 27, 2024
82d5cbf
idft in progress
vneiger Oct 27, 2024
07a7dbe
fix name
vneiger Oct 27, 2024
f0fed07
idft in progress
vneiger Oct 28, 2024
f79ba6f
idft in progress
vneiger Oct 28, 2024
c41bcd4
idft in progress
vneiger Oct 28, 2024
f09bca0
idft in progress
vneiger Oct 28, 2024
38a5aba
idft in progress
vneiger Oct 28, 2024
ece93dc
minor fix
vneiger Oct 28, 2024
9ef50f1
a bit lazier
vneiger Oct 28, 2024
3b5669d
idft becoming good, remains some fine tuning to do
vneiger Oct 28, 2024
614dacb
a bit lazier
vneiger Oct 28, 2024
9567b15
fix name
vneiger Oct 28, 2024
42f6fe6
Merge branch 'flintlib:main' into introduce_nmod_fft
vneiger Nov 9, 2024
c438c43
minor comment
vneiger Nov 9, 2024
eaf6d5c
Merge branch 'introduce_nmod_fft' of github.com:vneiger/flint into in…
vneiger Nov 9, 2024
4bb083a
remove unnecessary includes
vneiger Nov 9, 2024
99284b5
add todo about tests idft any node small depths
vneiger Nov 9, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion Makefile.in
Original file line number Diff line number Diff line change
Expand Up @@ -188,7 +188,8 @@ HEADER_DIRS := \
fmpz_mod_mpoly_factor fmpq_mpoly_factor \
fq_nmod_mpoly_factor fq_zech_mpoly_factor \
\
fft @FFT_SMALL@ fmpz_poly_q fmpz_lll \
fft n_fft @FFT_SMALL@ \
fmpz_poly_q fmpz_lll \
n_poly arith qsieve aprcl \
\
nf nf_elem qfb \
Expand Down
258 changes: 258 additions & 0 deletions src/n_fft.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,258 @@
/*
Copyright (C) 2024 Vincent Neiger

This file is part of FLINT.

FLINT is free software: you can redistribute it and/or modify it under
the terms of the GNU Lesser General Public License (LGPL) as published
by the Free Software Foundation; either version 3 of the License, or
(at your option) any later version. See <https://www.gnu.org/licenses/>.
*/

#ifndef N_FFT_H
#define N_FFT_H

#include "ulong_extras.h"

#define N_FFT_CTX_DEFAULT_DEPTH 12

#ifdef __cplusplus
extern "C" {
#endif

/**
* TODO[short term] augment precomputations with inverse roots
* TODO[short term] add testing for general variants, not only node0
* TODO[long term] large depth can lead to heavy memory usage
* --> provide precomputation-free functions
* TODO[long term] on zen4 (likely on other cpus as well) ctx_init becomes
* slower at some point, losing a factor 4 or more, probably due to caching;
* what is annoying is that the depth where it becomes slower is significantly
* smaller (~13-14) when tab_iw has been incorporated compared to without
* tab_iw (it was depth ~20-21); see if this can be understood, and maybe play
* with vectorization for those simple functions
* TODO[later] provide forward function which reduces output to [0..n) ?
* unclear this is useful... to be decided later
*/

/** n_fft context:
* parameters and tabulated powers of the primitive root of unity "w".
**/

typedef struct
{
ulong mod; // modulus, odd prime
ulong max_depth; // maximum supported depth (w has order 2**max_depth)
ulong cofactor; // prime is 1 + cofactor * 2**max_depth
ulong depth; // depth supported by current precomputation
nn_ptr tab_w; // tabulated powers of w, see below
nn_ptr tab_iw; // tabulated powers of 1/w, see below
ulong tab_w2[2*FLINT_BITS]; // tabulated powers w**(2**k), see below
ulong tab_inv2[2*FLINT_BITS]; // tabulated inverses of 2**k, see below
} n_fft_ctx_struct;
typedef n_fft_ctx_struct n_fft_ctx_t[1];


/** Requirements (not checked upon init):
* - mod is an odd prime < 2**(FLINT_BITS-2)
* - max_depth must be >= 3 (so, 8 must divide mod - 1)
* Total memory cost of precomputations for arrays tab_{w,iw,w2,inv2}:
* at most 2 * (2*FLINT_BITS + 2**depth) ulong's
*/

/** tab_w2:
* - length 2*FLINT_BITS, with undefined entries at index 2*(max_depth-1) and beyond
* - contains powers w**d for d a power of 2, and corresponding
* precomputations for modular multiplication:
* -- for 0 <= k < max_depth-1, tab_w2[2*k] = w**(2**(max_depth-2-k))
* and tab_w2[2*k+1] = floor(tab_w2[2*k] * 2**FLINT_BITS / mod)
* -- for 2*(max_depth-1) <= k < 2*FLINT_BITS, tab_w2[k] is undefined
*
* --> one can retrieve w as tab_w2[2 * (max_depth-2)]
* --> the first elements are tab_w2 = [I, I_pr, J, J_pr, ...]
* where I is a square root of -1 and J is a square root of I
*/

/** tab_w:
* - length 2**depth
* - contains 2**(depth-1) first powers of w in (max_depth-1)-bit reversed order,
* and corresponding precomputations for modular multiplication:
* -- for 0 <= k < 2**(depth-1), tab_w[2*k] = w**(br[k])
* and tab_w[2*k+1] = floor(tab_w[2*k] * 2**FLINT_BITS / mod)
* where br = [0, 2**(max_depth-2), 2**(max_depth-3), 3 * 2**(max_depth-3), ...]
* is the bit reversal permutation of length 2**(max_depth-1)
* (https://en.wikipedia.org/wiki/Bit-reversal_permutation)
*
* In particular the first elements are
* tab_w = [1, 1_pr, I, I_pr, J, J_pr, IJ, IJ_pr, ...]
* where I is a square root of -1, J is a square root of I, and IJ = I*J. Note
* that powers of w beyond 2**(max_depth-1), for example -1, -I, -J, etc. are
* not stored.
**/

/** tab_iw: same as tab_w but for the primitive root 1/w */

/** tab_inv2:
* - length 2*FLINT_BITS, with undefined entries at index 2*max_depth and beyond
* - contains the modular inverses of 2**k, and corresponding
* precomputations for modular multiplication:
* -- for 0 <= k < max_depth, tab_inv2[2*k] = the inverse of 2**(k+1)
* modulo mod, and tab_inv2[2*k+1] = floor(tab_inv2[2*k] * 2**FLINT_BITS / mod)
* -- for 2*max_depth <= k < 2*FLINT_BITS, tab_inv2[k] is undefined
*
* Recall F->mod == 1 + cofactor * 2**max_depth, so
* 1 == F->mod - cofactor * 2**(max_depth - k) * 2**k
* --> the inverse of 2**k in (0, F->mod) is
* F->mod - cofactor * 2**(max_depth - k),
* we do not really need to store it, but we want the precomputations as well
*/







/** Note for init functions, when depth is provided:
* - if it is < 3, it is pretended that it is 3
* - it it is more than F->max_depth (the maximum possible with the given
* prime), it is reduced to F->max_depth
* After calling init, precomputations support DFTs of length up to 2**depth
*/

// initialize with given root and given depth
void n_fft_ctx_init2_root(n_fft_ctx_t F, ulong w, ulong max_depth, ulong cofactor, ulong depth, ulong mod);

// find primitive root, initialize with given depth
void n_fft_ctx_init2(n_fft_ctx_t F, ulong depth, ulong p);

// same, with default depth
FLINT_FORCE_INLINE
void n_fft_ctx_init_root(n_fft_ctx_t F, ulong w, ulong max_depth, ulong cofactor, ulong p)
{ n_fft_ctx_init2_root(F, w, max_depth, cofactor, N_FFT_CTX_DEFAULT_DEPTH, p); }

FLINT_FORCE_INLINE
void n_fft_ctx_init(n_fft_ctx_t F, ulong p)
{ n_fft_ctx_init2(F, N_FFT_CTX_DEFAULT_DEPTH, p); }

// grows F->depth and precomputations to support DFTs of depth up to depth
void n_fft_ctx_fit_depth(n_fft_ctx_t F, ulong depth);

void n_fft_ctx_clear(n_fft_ctx_t F);



typedef struct
{
ulong mod; // modulus, odd prime
ulong mod2; // 2*mod (storing helps for speed)
//ulong mod4; // 4*mod (storing helps for speed)
nn_srcptr tab_w; // tabulated powers of w, see below
} n_fft_args_struct;
typedef n_fft_args_struct n_fft_args_t[1];

FLINT_FORCE_INLINE
void n_fft_set_args(n_fft_args_t F, ulong mod, nn_srcptr tab_w)
{
F->mod = mod;
F->mod2 = 2*mod;
F->tab_w = tab_w;
}





/** dft:
* transforms / inverse transforms / transposed transforms
* at length a power of 2
*/

void dft_node0_lazy14(nn_ptr p, ulong depth, n_fft_args_t F);

/** 2**depth-point DFT
* * in [0..n) / out [0..4n) / max < 4n
* * In-place transform p of length len == 2**depth into
* the concatenation of
* [sum(p[i] * w_k**i for i in range(len), sum(p[i] * (-w_k)**i for i in range(len)]
* for k in range(len),
* where w_k = F->tab_w[2*k] for 0 <= k < 2**(depth-1)
* * By construction these evaluation points are the roots of the polynomial
* x**len - 1, precisely they are all powers of the chosen len-th primitive
* root of unity with exponents listed in bit reversed order
* * Requirements (not checked): depth <= F.depth
*/
FLINT_FORCE_INLINE void n_fft_dft(nn_ptr p, ulong depth, n_fft_ctx_t F)
{
n_fft_args_t Fargs;
n_fft_set_args(Fargs, F->mod, F->tab_w);
dft_node0_lazy14(p, depth, Fargs);
}

// FIXME in progress
// not tested yet --> test == applying dft yields identity
// DOC. Note: output < n.
void idft_node0_lazy12(nn_ptr p, ulong depth, n_fft_args_t F);
FLINT_FORCE_INLINE void n_fft_idft(nn_ptr p, ulong depth, n_fft_ctx_t F)
{
n_fft_args_t Fargs;
n_fft_set_args(Fargs, F->mod, F->tab_iw);
idft_node0_lazy12(p, depth, Fargs);

if (depth > 0)
{
const ulong inv2 = F->tab_inv2[2*depth-2];
const ulong inv2_pr = F->tab_inv2[2*depth-1];
//ulong p_hi, p_lo;
for (ulong k = 0; k < (UWORD(1) << depth); k++)
{
p[k] = n_mulmod_shoup(inv2, p[k], inv2_pr, F->mod);
//umul_ppmm(p_hi, p_lo, inv2_pr, p[k]);
//p[k] = inv2 * p[k] - p_hi * F->mod;
}
// NOTE: apparently no gain from lazy variant, so
// probably better to use non-lazy one (ensures output < n)
}
// FIXME see if that can be made less expensive at least for depths not too
// small, by integrating into base cases of dft_node0
}



// FIXME in progress
// not tested yet --> test == naive version?
// DOC. Note: output < 2n (?).
FLINT_FORCE_INLINE void n_fft_dft_t(nn_ptr p, ulong depth, n_fft_ctx_t F)
{
n_fft_args_t Fargs;
n_fft_set_args(Fargs, F->mod, F->tab_w);
idft_node0_lazy12(p, depth, Fargs);
}

// FIXME in progress
// not tested yet --> test == applying dft_t yields identity?
// DOC. Note: output < n.
FLINT_FORCE_INLINE void n_fft_idft_t(nn_ptr p, ulong depth, n_fft_ctx_t F)
{
n_fft_args_t Fargs;
n_fft_set_args(Fargs, F->mod, F->tab_iw);
dft_node0_lazy14(p, depth, Fargs);

if (depth > 0)
{
// see comments in idft concerning this loop
const ulong inv2 = F->tab_inv2[2*depth-2];
const ulong inv2_pr = F->tab_inv2[2*depth-1];
for (ulong k = 0; k < (UWORD(1) << depth); k++)
p[k] = n_mulmod_shoup(inv2, p[k], inv2_pr, F->mod);
}
}




#ifdef __cplusplus
}
#endif

#endif /* N_FFT_H */
Loading
Loading