SUSPERMS - Editorial

PROBLEM LINK:

Practice
Contest: Division 1
Contest: Division 2
Contest: Division 3
Contest: Division 4

Author: tibinyte
Testers: tabr, iceknight1093
Editorialist: iceknight1093

DIFFICULTY:

TBD

PREREQUISITES:

Combinatorics, the inclusion-exclusion principle, polynomial convolution with arbitrary mod

PROBLEM:

A permutation P of length 2N is called N-beautiful if the N pairs (P_i\bmod N, P_{i+N}\bmod N) are all distinct.

Given K, For each 1 \leq i \leq K find the number of i-beautiful permutations, modulo a given prime number M.

EXPLANATION:

Weā€™ll first solve subtask 1, and then extend that solution to subtask 2.

Subtask 1

The limit on K is small enough here that \mathcal{O}(K^2) will pass.
In particular, we can solve for a fixed N in \mathcal{O}(N) time; so letā€™s attempt to do that.

Note that we only really care about the values of P_i \bmod N, which range from 0 to N-1. Also, we have exactly two of each value.
Weā€™ll call arrays of length 2N containing each integer from 0 to N-1 twice good arrays.

Letā€™s call a permutation that isnā€™t N-beautiful, N-ugly.
Weā€™ll try to count the number of N-ugly permutations.

Suppose we have a N-ugly permutation.
Then, there exist two indices 1 \leq i \lt j \leq N such that P_i = P_j and P_{i+N} = P_{j+N}.
Such a pair will be called a bad pair.

In particular, note that these four positions use all copies of P_i and P_{i+N}, so the rest of the elements are simply a good array of length 2N-4 (after appropriate relabelling).

Itā€™s not too hard to count the number of such arrays, since fixing i, j, the values P_i and P_{i+N}, and which copy of each value goes to which position is simple math and a bunch of multiplications.

However, we canā€™t directly subtract this count from (2N)!, since a good array with multiple bad pairs was counted multiple times!
This should lead you to the idea of using the inclusion-exclusion principle.

Let B_i be the number of good arrays with at least i bad pairs.
By the principle of inclusion-exclusion, the answer weā€™re looking for is

\sum_{i=0}^{N} (-1)^N B_i

So, letā€™s focus on computing B_i quickly.
This is a generalization of how we computed B_1, and is just a bunch of combinatorics.

Details

There are i bad pairs, which fixes 2i positions among the first N. These can be chosen in \binom{N}{2i} ways.
We need to now split these 2i positions into i pairs.
This can be done in \binom{2i}{2, 2, 2, \ldots, 2} = \frac{(2i)!}{2^i} ways by definition of the multinomial coefficient.

Next, we need to fix which elements go into each pair. Letā€™s fix an order of the pairs and assign values one by one.
This can be done in N\cdot (N-1)\cdot (N-2) \cdot \ldots \cdot (N-2i+1) ways, which after a little algebraic manipulation is seen to be just \binom{N}{2i} (2i)!.

However, we also need to divide by i! since the order in which we assign values to the pairs doesnā€™t actually matter (it makes no difference to the final permutation).

Each bad pair now has four choices (we can swap elements with the same value modulo N to obtain a different permutation with the same bad pair count).
This gives us 4^i choices in total.

Finally, there are (2N - 4i)! ways to arrange the elements not among these bad pairs.

Putting everything together, we have

B_i = \binom{N}{2i} \binom{N}{2i} \frac{4^i(2i)! (2i)! (2N - 4i)!}{2^ii!}

which can be computed in \mathcal{O}(\log N) or even \mathcal{O}(1) time, since everything is either a factorial or a power of two (or an inverse of one of them)

Since each B_i can be computed quickly, we can solve for a fixed N in \mathcal{O}(N) time, which is good enough for subtask 1.

Subtask 2

Letā€™s look at our expression for B_i again, and clean it up a bit by cancelling out common terms.
We obtain

B_i = \frac{2^i N! N! (2N - 4i)!}{(N - 2i)! (N - 2i)! i!} = \frac{2^i N! N!}{i!} \binom{2N - 4i}{N - 2i}

The N! terms are independent of i, so letā€™s ignore them for now.
Notice that we have the product of an ā€œi-partā€ comprising \frac{2^i}{i!} and something like an ā€œN-i partā€ comprising \binom{2N - 4i}{N - 2i}.

This should make you think along the lines of convolution.
However, we canā€™t directly multiply two polynomials: the N-i part doesnā€™t quite depend on N-i, it depends on N - 2i (and thereā€™s no apparent nice way to make the i-part depend on 2i instead).
Itā€™d be nice if we had 2N instead of N, so letā€™s do just that!

We split the computation into two cases: even N and odd N. Letā€™s deal with the even case first.
So, suppose N = 2n.

We now have \displaystyle\binom{2N - 4i}{N - 2i} = \binom{4n - 4i}{2n - 2i} = \binom{4(n - i)}{2(n-i)}, which does depend on n - i.

So, create the polynomials p and q, with coefficients

  • p_i =\frac{ (-1)^i 2^i}{i!}
  • q_i = \binom{4i}{2i}

and multiply them together so obtain r.

The answer for 2n is then r_n, multiplied by (2n)!^2 (which we took out at the start).

For odd N, a similar technique works: write N = 2n + 1, and once again youā€™ll get a very similar relation on n-i. Following this, multiply the appropriate polynomials and weā€™re done.

Note that this problem requires multiplying two polynomials and keeping their coefficients modulo some M, where M isnā€™t necessarily ā€˜niceā€™.
Standard FFT and NTT wonā€™t suffice here; a tutorial on how to do this can be found at cp-algorithms.

TIME COMPLEXITY

\mathcal{O}(K\log K) per testcase.

CODE:

Setter's code (C++)
#include <bits/stdc++.h>
using namespace std;

int n, mod;
struct modint
{
    int32_t value;
    modint() = default;
    modint(int32_t value_) : value(value_ % mod) {}
    modint(int64_t value_) : value(value_ % mod) {}
    inline modint operator+(modint other) const
    {
        int32_t c = this->value + other.value;
        return modint(c >= mod ? c - mod : c);
    }
    inline modint operator-(modint other) const
    {
        int32_t c = this->value - other.value;
        return modint(c < 0 ? c + mod : c);
    }
    inline modint operator*(modint other) const
    {
        int32_t c = (int64_t)this->value * other.value % mod;
        return modint(c < 0 ? c + mod : c);
    }
    inline modint &operator+=(modint other)
    {
        this->value += other.value;
        if (this->value >= mod)
            this->value -= mod;
        return *this;
    }
    inline modint &operator-=(modint other)
    {
        this->value -= other.value;
        if (this->value < 0)
            this->value += mod;
        return *this;
    }
    inline modint &operator*=(modint other)
    {
        this->value = (int64_t)this->value * other.value % mod;
        if (this->value < 0)
            this->value += mod;
        return *this;
    }
    inline modint operator-() const { return modint(this->value ? mod - this->value : 0); }
    modint pow(int32_t k) const
    {
        modint x = *this, y = 1;
        for (; k; k >>= 1)
        {
            if (k & 1)
                y *= x;
            x *= x;
        }
        return y;
    }
    modint inv() const { return pow(mod - 2); } // MOD must be a prime
    inline modint operator/(modint other) const { return *this * other.inv(); }
    inline modint operator/=(modint other) { return *this *= other.inv(); }
    inline bool operator==(modint other) const { return value == other.value; }
    inline bool operator!=(modint other) const { return value != other.value; }
    inline bool operator<(modint other) const { return value < other.value; }
    inline bool operator>(modint other) const { return value > other.value; }
};
modint operator*(int64_t value, modint n) { return modint(value) * n; }
modint operator*(int32_t value, modint n) { return modint(value) * n; }
istream &operator>>(istream &in, modint &n) { return in >> n.value; }
ostream &operator<<(ostream &out, modint n) { return out << n.value; }
struct combi
{
    int n;
    vector<modint> facts, finvs, invs;
    combi(int _n) : n(_n), facts(_n), finvs(_n), invs(_n)
    {
        facts[0] = finvs[0] = 1;
        invs[1] = 1;
        for (int i = 2; i < n; i++)
            invs[i] = invs[mod % i] * (-mod / i);
        for (int i = 1; i < n; i++)
        {
            facts[i] = facts[i - 1] * i;
            finvs[i] = finvs[i - 1] * invs[i];
        }
    }
    inline modint fact(int n) { return facts[n]; }
    inline modint finv(int n) { return finvs[n]; }
    inline modint inv(int n) { return invs[n]; }
    inline modint ncr(int n, int k) { return n < k or k < 0 ? 0 : facts[n] * finvs[k] * finvs[n - k]; }
    inline modint aranj(int n, int k) { return ncr(n, k) * facts[k]; }
};
struct base
{
    double x, y;
    base() { x = y = 0; }
    base(double x, double y) : x(x), y(y) {}
};
inline base operator+(base a, base b) { return base(a.x + b.x, a.y + b.y); }
inline base operator-(base a, base b) { return base(a.x - b.x, a.y - b.y); }
inline base operator*(base a, base b) { return base(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x); }
inline base conj(base a) { return base(a.x, -a.y); }
int lim = 1;
vector<base> roots = {{0, 0}, {1, 0}};
vector<int> rev = {0, 1};
const double PI = acosl(-1.0);
void ensure_base(int p)
{
    if (p <= lim)
        return;
    rev.resize(1 << p);
    for (int i = 0; i < (1 << p); i++)
        rev[i] = (rev[i >> 1] >> 1) + ((i & 1) << (p - 1));
    roots.resize(1 << p);
    while (lim < p)
    {
        double angle = 2 * PI / (1 << (lim + 1));
        for (int i = 1 << (lim - 1); i < (1 << lim); i++)
        {
            roots[i << 1] = roots[i];
            double angle_i = angle * (2 * i + 1 - (1 << lim));
            roots[(i << 1) + 1] = base(cos(angle_i), sin(angle_i));
        }
        lim++;
    }
}
void fft(vector<base> &a, int n = -1)
{
    if (n == -1)
        n = a.size();
    assert((n & (n - 1)) == 0);
    int zeros = __builtin_ctz(n);
    ensure_base(zeros);
    int shift = lim - zeros;
    for (int i = 0; i < n; i++)
        if (i < (rev[i] >> shift))
            swap(a[i], a[rev[i] >> shift]);
    for (int k = 1; k < n; k <<= 1)
    {
        for (int i = 0; i < n; i += 2 * k)
        {
            for (int j = 0; j < k; j++)
            {
                base z = a[i + j + k] * roots[j + k];
                a[i + j + k] = a[i + j] - z;
                a[i + j] = a[i + j] + z;
            }
        }
    }
}
vector<int> multiply(vector<int> &a, vector<int> &b, int eq = 0)
{
    int need = a.size() + b.size() - 1;
    int p = 0;
    while ((1 << p) < need)
        p++;
    ensure_base(p);
    int sz = 1 << p;
    vector<base> A, B;
    if (sz > (int)A.size())
        A.resize(sz);
    for (int i = 0; i < (int)a.size(); i++)
    {
        int x = (a[i] % mod + mod) % mod;
        A[i] = base(x & ((1 << 15) - 1), x >> 15);
    }
    fill(A.begin() + a.size(), A.begin() + sz, base{0, 0});
    fft(A, sz);
    if (sz > (int)B.size())
        B.resize(sz);
    if (eq)
        copy(A.begin(), A.begin() + sz, B.begin());
    else
    {
        for (int i = 0; i < (int)b.size(); i++)
        {
            int x = (b[i] % mod + mod) % mod;
            B[i] = base(x & ((1 << 15) - 1), x >> 15);
        }
        fill(B.begin() + b.size(), B.begin() + sz, base{0, 0});
        fft(B, sz);
    }
    double ratio = 0.25 / sz;
    base r2(0, -1), r3(ratio, 0), r4(0, -ratio), r5(0, 1);
    for (int i = 0; i <= (sz >> 1); i++)
    {
        int j = (sz - i) & (sz - 1);
        base a1 = (A[i] + conj(A[j])), a2 = (A[i] - conj(A[j])) * r2;
        base b1 = (B[i] + conj(B[j])) * r3, b2 = (B[i] - conj(B[j])) * r4;
        if (i != j)
        {
            base c1 = (A[j] + conj(A[i])), c2 = (A[j] - conj(A[i])) * r2;
            base d1 = (B[j] + conj(B[i])) * r3, d2 = (B[j] - conj(B[i])) * r4;
            A[i] = c1 * d1 + c2 * d2 * r5;
            B[i] = c1 * d2 + c2 * d1;
        }
        A[j] = a1 * b1 + a2 * b2 * r5;
        B[j] = a1 * b2 + a2 * b1;
    }
    fft(A, sz);
    fft(B, sz);
    vector<int> res(need);
    for (int i = 0; i < need; i++)
    {
        long long aa = A[i].x + 0.5;
        long long bb = B[i].x + 0.5;
        long long cc = A[i].y + 0.5;
        res[i] = (aa + ((bb % mod) << 15) + ((cc % mod) << 30)) % mod;
    }
    return res;
}
int main()
{
    cin.tie(nullptr)->sync_with_stdio(false);
    cin >> n >> mod;
    combi C(n + n + 1);
    int lim = n / 2;
    vector<int> first(lim + 1);
    for (int i = 0; i <= lim; ++i)
    {
        modint rep = C.fact(4 * i) * C.finv(2 * i) * C.finv(2 * i);
        first[i] = rep.value;
    }
    vector<int> second(lim + 1);
    for (int i = 0; i <= lim; ++i)
    {
        modint rep = modint(2).pow(i) * modint(-1).pow(i) * C.finv(i);
        second[i] = rep.value;
    }
    vector<int> sumeven = multiply(first, second);
    first = vector<int>(lim + 2);
    for (int i = 1; i <= lim + 1; ++i)
    {
        modint rep = C.fact(4 * i - 2) * C.finv(2 * i - 1) * C.finv(2 * i - 1);
        first[i] = rep.value;
    }
    vector<int> sumodd = multiply(first, second);
    for (int i = 1; i <= n; ++i)
    {
        if (i % 2 == 1)
        {
            modint rep = C.fact(i) * C.fact(i) * sumodd[(i + 1) / 2];
            cout << rep << ' ';
        }
        else
        {
            modint rep = C.fact(i) * C.fact(i) * sumeven[i / 2];
            cout << rep << ' ';
        }
    }
}
Tester's code (C++)
#include <bits/stdc++.h>
using namespace std;
#ifdef tabr
#include "library/debug.cpp"
#else
#define debug(...)
#endif

struct input_checker {
    string buffer;
    int pos;

    const string all = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    const string number = "0123456789";
    const string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const string lower = "abcdefghijklmnopqrstuvwxyz";

    input_checker() {
        pos = 0;
        while (true) {
            int c = cin.get();
            if (c == -1) {
                break;
            }
            buffer.push_back((char) c);
        }
    }

    string readOne() {
        assert(pos < (int) buffer.size());
        string res;
        while (pos < (int) buffer.size() && buffer[pos] != ' ' && buffer[pos] != '\n') {
            res += buffer[pos];
            pos++;
        }
        return res;
    }

    string readString(int min_len, int max_len, const string& pattern = "") {
        assert(min_len <= max_len);
        string res = readOne();
        assert(min_len <= (int) res.size());
        assert((int) res.size() <= max_len);
        for (int i = 0; i < (int) res.size(); i++) {
            assert(pattern.empty() || pattern.find(res[i]) != string::npos);
        }
        return res;
    }

    int readInt(int min_val, int max_val) {
        assert(min_val <= max_val);
        int res = stoi(readOne());
        assert(min_val <= res);
        assert(res <= max_val);
        return res;
    }

    long long readLong(long long min_val, long long max_val) {
        assert(min_val <= max_val);
        long long res = stoll(readOne());
        assert(min_val <= res);
        assert(res <= max_val);
        return res;
    }

    vector<int> readInts(int size, int min_val, int max_val) {
        assert(min_val <= max_val);
        vector<int> res(size);
        for (int i = 0; i < size; i++) {
            res[i] = readInt(min_val, max_val);
            if (i != size - 1) {
                readSpace();
            }
        }
        return res;
    }

    vector<long long> readLongs(int size, long long min_val, long long max_val) {
        assert(min_val <= max_val);
        vector<long long> res(size);
        for (int i = 0; i < size; i++) {
            res[i] = readLong(min_val, max_val);
            if (i != size - 1) {
                readSpace();
            }
        }
        return res;
    }

    void readSpace() {
        assert((int) buffer.size() > pos);
        assert(buffer[pos] == ' ');
        pos++;
    }

    void readEoln() {
        assert((int) buffer.size() > pos);
        assert(buffer[pos] == '\n');
        pos++;
    }

    void readEof() {
        assert((int) buffer.size() == pos);
    }
};

namespace math314 {
typedef long long ll;
typedef pair<int, int> Pii;

#define FOR(i, n) for (int i = 0; i < (n); i++)
#define sz(c) ((int) (c).size())
#define ten(x) ((int) 1e##x)

template <class T>
T extgcd(T a, T b, T& x, T& y) {
    for (T u = y = 1, v = x = 0; a;) {
        T q = b / a;
        swap(x -= q * u, u);
        swap(y -= q * v, v);
        swap(b -= q * a, a);
    }
    return b;
}
template <class T>
T mod_inv(T a, T m) {
    T x, y;
    extgcd(a, m, x, y);
    return (m + x % m) % m;
}
ll mod_pow(ll a, ll n, ll mod) {
    ll ret = 1;
    ll p = a % mod;
    while (n) {
        if (n & 1) ret = ret * p % mod;
        p = p * p % mod;
        n >>= 1;
    }
    return ret;
}

template <int mod, int primitive_root>
class NTT {
   public:
    int get_mod() const { return mod; }
    void _ntt(vector<ll>& a, int sign) {
        const int n = sz(a);
        assert((n ^ (n & -n)) == 0);  // n = 2^k

        const int g = 3;                               // g is primitive root of mod
        int h = (int) mod_pow(g, (mod - 1) / n, mod);  // h^n = 1
        if (sign == -1) h = (int) mod_inv(h, mod);     // h = h^-1 % mod

        // bit reverse
        int i = 0;
        for (int j = 1; j < n - 1; ++j) {
            for (int k = n >> 1; k > (i ^= k); k >>= 1)
                ;
            if (j < i) swap(a[i], a[j]);
        }

        for (int m = 1; m < n; m *= 2) {
            const int m2 = 2 * m;
            const ll base = mod_pow(h, n / m2, mod);
            ll w = 1;
            FOR(x, m) {
                for (int s = x; s < n; s += m2) {
                    ll u = a[s];
                    ll d = a[s + m] * w % mod;
                    a[s] = u + d;
                    if (a[s] >= mod) a[s] -= mod;
                    a[s + m] = u - d;
                    if (a[s + m] < 0) a[s + m] += mod;
                }
                w = w * base % mod;
            }
        }

        for (auto& x : a)
            if (x < 0) x += mod;
    }
    void ntt(vector<ll>& input) {
        _ntt(input, 1);
    }
    void intt(vector<ll>& input) {
        _ntt(input, -1);
        const int n_inv = mod_inv(sz(input), mod);
        for (auto& x : input) x = x * n_inv % mod;
    }

    // ē•³ćæč¾¼ćæę¼”ē®—ć‚’č”Œć†
    vector<ll> convolution(const vector<ll>& a, const vector<ll>& b) {
        int ntt_size = 1;
        while (ntt_size < sz(a) + sz(b)) ntt_size *= 2;

        vector<ll> _a = a, _b = b;
        _a.resize(ntt_size);
        _b.resize(ntt_size);

        ntt(_a);
        ntt(_b);

        FOR(i, ntt_size) {
            (_a[i] *= _b[i]) %= mod;
        }

        intt(_a);
        return _a;
    }
};

ll garner(vector<Pii> mr, int mod) {
    mr.emplace_back(mod, 0);

    vector<ll> coffs(sz(mr), 1);
    vector<ll> constants(sz(mr), 0);
    FOR(i, sz(mr) - 1) {
        // coffs[i] * v + constants[i] == mr[i].second (mod mr[i].first) ć‚’č§£ć
        ll v = (mr[i].second - constants[i]) * mod_inv<ll>(coffs[i], mr[i].first) % mr[i].first;
        if (v < 0) v += mr[i].first;

        for (int j = i + 1; j < sz(mr); j++) {
            (constants[j] += coffs[j] * v) %= mr[j].first;
            (coffs[j] *= mr[i].first) %= mr[j].first;
        }
    }

    return constants[sz(mr) - 1];
}

typedef NTT<167772161, 3> NTT_1;
typedef NTT<469762049, 3> NTT_2;
typedef NTT<1224736769, 3> NTT_3;

vector<ll> int32mod_convolution(vector<ll> a, vector<ll> b, int mod) {
    for (auto& x : a) x %= mod;
    for (auto& x : b) x %= mod;
    NTT_1 ntt1;
    NTT_2 ntt2;
    NTT_3 ntt3;
    auto x = ntt1.convolution(a, b);
    auto y = ntt2.convolution(a, b);
    auto z = ntt3.convolution(a, b);

    vector<ll> ret(sz(x));
    vector<Pii> mr(3);
    FOR(i, sz(x)) {
        mr[0].first = ntt1.get_mod(), mr[0].second = (int) x[i];
        mr[1].first = ntt2.get_mod(), mr[1].second = (int) y[i];
        mr[2].first = ntt3.get_mod(), mr[2].second = (int) z[i];
        ret[i] = garner(mr, mod);
    }

    return ret;
}
}  // namespace math314

long long mod = 2;

struct mint {
    long long value;
    mint(long long x = 0) {
        value = x % mod;
        if (value < 0) value += mod;
    }
    mint& operator+=(const mint& other) {
        if ((value += other.value) >= mod) value -= mod;
        return *this;
    }
    mint& operator-=(const mint& other) {
        if ((value -= other.value) < 0) value += mod;
        return *this;
    }
    mint& operator*=(const mint& other) {
        value = value * other.value % mod;
        return *this;
    }
    mint& operator/=(const mint& other) {
        long long a = 0, b = 1, c = other.value, m = mod;
        while (c != 0) {
            long long t = m / c;
            m -= t * c;
            swap(c, m);
            a -= t * b;
            swap(a, b);
        }
        a %= mod;
        if (a < 0) a += mod;
        value = value * a % mod;
        return *this;
    }
    friend mint operator+(const mint& lhs, const mint& rhs) { return mint(lhs) += rhs; }
    friend mint operator-(const mint& lhs, const mint& rhs) { return mint(lhs) -= rhs; }
    friend mint operator*(const mint& lhs, const mint& rhs) { return mint(lhs) *= rhs; }
    friend mint operator/(const mint& lhs, const mint& rhs) { return mint(lhs) /= rhs; }
    mint& operator++() { return *this += 1; }
    mint& operator--() { return *this -= 1; }
    mint operator++(int) {
        mint result(*this);
        *this += 1;
        return result;
    }
    mint operator--(int) {
        mint result(*this);
        *this -= 1;
        return result;
    }
    mint operator-() const { return mint(-value); }
    bool operator==(const mint& rhs) const { return value == rhs.value; }
    bool operator!=(const mint& rhs) const { return value != rhs.value; }
    bool operator<(const mint& rhs) const { return value < rhs.value; }
};
string to_string(const mint& x) {
    return to_string(x.value);
}
ostream& operator<<(ostream& stream, const mint& x) {
    return stream << x.value;
}
istream& operator>>(istream& stream, mint& x) {
    stream >> x.value;
    x.value %= mod;
    if (x.value < 0) x.value += mod;
    return stream;
}

mint power(mint a, long long n) {
    mint res = 1;
    while (n > 0) {
        if (n & 1) {
            res *= a;
        }
        a *= a;
        n >>= 1;
    }
    return res;
}

vector<mint> fact(1, 1);
vector<mint> finv(1, 1);

mint C(int n, int k) {
    if (n < k || k < 0) {
        return mint(0);
    }
    while ((int) fact.size() < n + 1) {
        fact.emplace_back(fact.back() * (int) fact.size());
        finv.emplace_back(1 / fact.back());
    }
    return fact[n] * finv[k] * finv[n - k];
}

bool is_prime(long long n) {
    if (n <= 1) {
        return false;
    }
    for (long long i = 2; i * i <= n; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    input_checker in;
    int k = in.readInt(1, 2e5);
    in.readSpace();
    mod = in.readInt(1e7 + 19, 1e9 + 7);
    in.readEoln();
    assert(is_prime(mod));
    C(2 * k + 10, 0);
    vector<mint> f(k + 1);
    f[0] = 1;
    for (int i = 2; i <= k; i += 2) {
        f[i] = f[i - 2] * -4 / i;
    }
    vector<mint> g(k + 1);
    g[0] = 1;
    for (int i = 1; i <= k; i++) {
        g[i] = g[i - 1] * (4 * i - 2) / i;
    }
    vector<long long> ff(k + 1), gg(k + 1);
    for (int i = 0; i <= k; i++) {
        ff[i] = f[i].value;
        gg[i] = g[i].value;
    }
    auto ans = math314::int32mod_convolution(ff, gg, mod);
    for (int i = 1; i <= k; i++) {
        cout << ans[i] * fact[i] * fact[i] << " \n"[i == k];
    }
    in.readEof();
    return 0;
}
Editorialist's code (C++)
#include "bits/stdc++.h"
// #pragma GCC optimize("O3,unroll-loops")
// #pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")
using namespace std;
using ll = long long int;
mt19937_64 rng(chrono::high_resolution_clock::now().time_since_epoch().count());

template<typename T>
struct Z_p{
	using Type = typename decay<decltype(T::value)>::type;
	static vector<Type> MOD_INV;
	constexpr Z_p(): value(){ }
	template<typename U> Z_p(const U &x){ value = normalize(x); }
	template<typename U> static Type normalize(const U &x){
		Type v;
		if(-mod() <= x && x < mod()) v = static_cast<Type>(x);
		else v = static_cast<Type>(x % mod());
		if(v < 0) v += mod();
		return v;
	}
	const Type& operator()() const{ return value; }
	template<typename U> explicit operator U() const{ return static_cast<U>(value); }
	constexpr static Type mod(){ return T::value; }
	Z_p &operator+=(const Z_p &otr){ if((value += otr.value) >= mod()) value -= mod(); return *this; }
	Z_p &operator-=(const Z_p &otr){ if((value -= otr.value) < 0) value += mod(); return *this; }
	template<typename U> Z_p &operator+=(const U &otr){ return *this += Z_p(otr); }
	template<typename U> Z_p &operator-=(const U &otr){ return *this -= Z_p(otr); }
	Z_p &operator++(){ return *this += 1; }
	Z_p &operator--(){ return *this -= 1; }
	Z_p operator++(int){ Z_p result(*this); *this += 1; return result; }
	Z_p operator--(int){ Z_p result(*this); *this -= 1; return result; }
	Z_p operator-() const{ return Z_p(-value); }
	template<typename U = T>
	typename enable_if<is_same<typename Z_p<U>::Type, int>::value, Z_p>::type &operator*=(const Z_p& rhs){
		#ifdef _WIN32
		uint64_t x = static_cast<int64_t>(value) * static_cast<int64_t>(rhs.value);
		uint32_t xh = static_cast<uint32_t>(x >> 32), xl = static_cast<uint32_t>(x), d, m;
		asm(
			"divl %4; \n\t"
			: "=a" (d), "=d" (m)
			: "d" (xh), "a" (xl), "r" (mod())
		);
		value = m;
		#else
		value = normalize(static_cast<int64_t>(value) * static_cast<int64_t>(rhs.value));
		#endif
		return *this;
	}
	template<typename U = T>
	typename enable_if<is_same<typename Z_p<U>::Type, int64_t>::value, Z_p>::type &operator*=(const Z_p &rhs){
		uint64_t ret = static_cast<uint64_t>(value) * static_cast<uint64_t>(rhs.value) - static_cast<uint64_t>(mod()) * static_cast<uint64_t>(1.L / static_cast<uint64_t>(mod()) * static_cast<uint64_t>(value) * static_cast<uint64_t>(rhs.value));
		value = normalize(static_cast<int64_t>(ret + static_cast<uint64_t>(mod()) * (ret < 0) - static_cast<uint64_t>(mod()) * (ret >= static_cast<uint64_t>(mod()))));
		return *this;
	}
	template<typename U = T>
	typename enable_if<!is_integral<typename Z_p<U>::Type>::value, Z_p>::type &operator*=(const Z_p &rhs){
		value = normalize(value * rhs.value);
		return *this;
	}
	template<typename U>
	Z_p &operator^=(U e){
		if(e < 0) *this = 1 / *this, e = -e;
		Z_p res = 1;
		for(; e; *this *= *this, e >>= 1) if(e & 1) res *= *this;
		return *this = res;
	}
	template<typename U>
	Z_p operator^(U e) const{
		return Z_p(*this) ^= e;
	}
	Z_p &operator/=(const Z_p &otr){
		Type a = otr.value, m = mod(), u = 0, v = 1;
		if(a < (int)MOD_INV.size()) return *this *= MOD_INV[a];
		while(a){
			Type t = m / a;
			m -= t * a; swap(a, m);
			u -= t * v; swap(u, v);
		}
		assert(m == 1);
		return *this *= u;
	}
	template<typename U> friend const Z_p<U> &abs(const Z_p<U> &v){ return v; }
	Type value;
};
template<typename T> bool operator==(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value == rhs.value; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator==(const Z_p<T>& lhs, U rhs){ return lhs == Z_p<T>(rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator==(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) == rhs; }
template<typename T> bool operator!=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return !(lhs == rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator!=(const Z_p<T> &lhs, U rhs){ return !(lhs == rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator!=(U lhs, const Z_p<T> &rhs){ return !(lhs == rhs); }
template<typename T> bool operator<(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value < rhs.value; }
template<typename T> bool operator>(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value > rhs.value; }
template<typename T> bool operator<=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value <= rhs.value; }
template<typename T> bool operator>=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value >= rhs.value; }
template<typename T> Z_p<T> operator+(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator+(const Z_p<T> &lhs, U rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator+(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T> Z_p<T> operator-(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator-(const Z_p<T>& lhs, U rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator-(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T> Z_p<T> operator*(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator*(const Z_p<T>& lhs, U rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator*(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T> Z_p<T> operator/(const Z_p<T> &lhs, const Z_p<T> &rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator/(const Z_p<T>& lhs, U rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator/(U lhs, const Z_p<T> &rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T> istream &operator>>(istream &in, Z_p<T> &number){
	typename common_type<typename Z_p<T>::Type, int64_t>::type x;
	in >> x;
	number.value = Z_p<T>::normalize(x);
	return in;
}
template<typename T> ostream &operator<<(ostream &out, const Z_p<T> &number){ return out << number(); }


using ModType = int;
struct VarMod{ static ModType value; };
ModType VarMod::value;
ModType &mod = VarMod::value;
using Zp = Z_p<VarMod>;


// constexpr int mod = 1e9 + 7; // 1000000007
// constexpr int mod = (119 << 23) + 1; // 998244353
// constexpr int mod = 1e9 + 9; // 1000000009
// using Zp = Z_p<integral_constant<decay<decltype(mod)>::type, mod>>;

template<typename T> vector<typename Z_p<T>::Type> Z_p<T>::MOD_INV;
template<typename T = integral_constant<decay<decltype(mod)>::type, mod>>
void precalc_inverse(int SZ){
	auto &inv = Z_p<T>::MOD_INV;
	if(inv.empty()) inv.assign(2, 1);
	for(; inv.size() <= SZ; ) inv.push_back((mod - 1LL * mod / (int)inv.size() * inv[mod % (int)inv.size()]) % mod);
}

template<typename T>
vector<T> precalc_power(T base, int SZ){
	vector<T> res(SZ + 1, 1);
	for(auto i = 1; i <= SZ; ++ i) res[i] = res[i - 1] * base;
	return res;
}

template<typename T>
vector<T> precalc_factorial(int SZ){
	vector<T> res(SZ + 1, 1); res[0] = 1;
	for(auto i = 1; i <= SZ; ++ i) res[i] = res[i - 1] * i;
	return res;
}

using C = complex<double>;
using vd = vector<double>;
const long double PI = 2*acos(0.0);
void fft(vector<C>& a) {
	int n = size(a), L = 31 - __builtin_clz(n);
	static vector<complex<long double>> R(2, 1);
	static vector<C> rt(2, 1);  // (^ 10% faster if double)
	for (static int k = 2; k < n; k *= 2) {
		R.resize(n); rt.resize(n);
		const long double PI = 2*acos(0.0);
		auto x = polar(1.0L, PI / k); // M_PI, lower-case L
		for (int i = k; i < 2*k; ++i) rt[i] = R[i] = i&1 ? R[i/2] * x : R[i/2];
	}
	vector<int> rev(n);
	for (int i = 0; i < n; ++i) rev[i] = (rev[i / 2] | (i & 1) << L) / 2;
	for (int i = 0; i < n; ++i) if (i < rev[i]) swap(a[i], a[rev[i]]);
	for (int k = 1; k < n; k *= 2)
		for (int i = 0; i < n; i += 2 * k) for (int j = 0; j < k; ++j) {
			C z = rt[j+k] * a[i+j+k]; // (25% faster if hand-rolled)  /// include-line
			a[i + j + k] = a[i + j] - z;
			a[i + j] += z;
		}
}
vd conv(const vd& a, const vd& b) {
	if (a.empty() || b.empty()) return {};
	vd res(size(a) + size(b) - 1);
	int L = 32 - __builtin_clz(size(res)), n = 1 << L;
	vector<C> in(n), out(n);
	copy(begin(a), end(a), begin(in));
	for (int i = 0; i < (int)size(b); ++i) in[i].imag(b[i]);
	fft(in);
	for (auto &x : in) x *= x;
	for (int i = 0; i < n; ++i) out[i] = in[-i & (n - 1)] - conj(in[i]);
	fft(out);
	for (int i = 0; i < (int)size(res); ++i) res[i] = imag(out[i]) / (4 * n);
	return res;
}

// FFT-MOD
typedef vector<ll> vl;
vl convMod(const vl &a, const vl &b, int M) {
	if (a.empty() || b.empty()) return {};
	vl res(size(a) + size(b) - 1);
	int B=32-__builtin_clz(size(res)), n=1<<B, cut=int(sqrt(M));
	vector<C> L(n), R(n), outs(n), outl(n);
	for (int i = 0; i < (int)size(a); ++i) L[i] = C((int)a[i] / cut, (int)a[i] % cut);
	for (int i = 0; i < (int)size(b); ++i) R[i] = C((int)b[i] / cut, (int)b[i] % cut);
	fft(L), fft(R);
	for (int i = 0; i < n; ++i) {
		int j = -i & (n - 1);
		outl[j] = (L[i] + conj(L[j])) * R[i] / (2.0 * n);
		outs[j] = (L[i] - conj(L[j])) * R[i] / (2.0 * n) / 1i;
	}
	fft(outl), fft(outs);
	for (int i = 0; i < (int)size(res); ++i) {
		ll av = ll(real(outl[i])+.5), cv = ll(imag(outs[i])+.5);
		ll bv = ll(imag(outl[i])+.5) + ll(real(outs[i])+.5);
		res[i] = ((av % M * cut + bv) % M * cut + cv) % M;
	}
	return res;
}

int main()
{
	ios::sync_with_stdio(false); cin.tie(0);

	int k; cin >> k >> mod;

	auto fac = precalc_factorial<Zp>(2*k+10);
	auto C = [&] (int n, int r) {
		if (n < r or r < 0) return Zp(0);
		return fac[n] / (fac[r] * fac[n-r]);
	};

	vector<ll> poly1(k/2 + 1), poly2(k/2 + 1), poly3(k/2 + 1);
	for (int i = 0; i <= k/2; ++i) {
		Zp cur = (Zp(-2) ^ i) / fac[i];
		poly1[i] = cur.value;

		cur = C(4*i, 2*i);
		poly2[i] = cur.value;

		cur = C(4*i+2, 2*i+1);
		poly3[i] = cur.value;
	}

	auto res1 = convMod(poly1, poly2, mod);
	auto res2 = convMod(poly1, poly3, mod);

	cout << res2[0] << ' ';
	for (int i = 1; i < (k+1)/2; ++i) {
		cout << fac[2*i] * fac[2*i] * res1[i] << ' ';
		cout << fac[2*i+1] * fac[2*i+1] * res2[i] << ' ';
	}
	if (k%2 == 0) cout << fac[k] * fac[k] * res1[k/2] << '\n';
}

This problem can be solved more simply using the theory of holonomic functions. As explained above, the number of n-beautiful permutations is b_n:=n!^2 \ a_n, where

a_n = \sum_{0\le i\le n/2} \frac{(-2)^i}{i!} {2n - 4i \choose n - 2i}.

But since (1-4z)^{-1/2}=\sum_{i\ge 0} {2i \choose i} z^i, a_n will equal the coefficient of z^n in

f(z):=e^{-2 z^2} (1-4 z)^{-1/2}.

Now

f'(z)=(-4z + 2 (1-4z)^{-1})f(z)

so

(1 - 4 z) f'(z) = (2 - 4 z + 16 z^2) f(z)

and, taking the coefficient of z^n and assuming n\ge 2,

(n+1) a_{n+1} - 4 n a_n = 2 a_n - 4 a_{n-1} + 16 a_{n-2}.

Therefore

a_{n+1} = (( 4 n+2) a_n - 4 a_{n-1} + 16 a_{n-2}) (n + 1)^{-1}

and if we multiply by (n+1)!^2, we can get a recurrence for b_n:

b_{n+1} = (n + 1) (( 4 n+2) b_n - 4 b_{n-1} n^2 + 16(n-1)^2n^2 b_{n-2}),

still assuming that n\ge 2. Together with the initial values b_0=1, b_1=2, and b_2=16, this gives a short, quick way of calculating b_1, \dots, b_K modulo M without needing to use Fourier transforms or number-theoretic transforms.

1 Like