SUSPERMS - Editorial

iceknight1093 · March 1, 2023, 5:30pm

PROBLEM LINK:

Practice
Contest: Division 1
Contest: Division 2
Contest: Division 3
Contest: Division 4

Author: tibinyte
Testers: tabr, iceknight1093
Editorialist: iceknight1093

DIFFICULTY:

TBD

PREREQUISITES:

Combinatorics, the inclusion-exclusion principle, polynomial convolution with arbitrary mod

PROBLEM:

A permutation P of length 2N is called N-beautiful if the N pairs (P_i\bmod N, P_{i+N}\bmod N) are all distinct.

Given K, For each 1 \leq i \leq K find the number of i-beautiful permutations, modulo a given prime number M.

EXPLANATION:

We’ll first solve subtask 1, and then extend that solution to subtask 2.

Subtask 1

The limit on K is small enough here that \mathcal{O}(K^2) will pass.
In particular, we can solve for a fixed N in \mathcal{O}(N) time; so let’s attempt to do that.

Note that we only really care about the values of P_i \bmod N, which range from 0 to N-1. Also, we have exactly two of each value.
We’ll call arrays of length 2N containing each integer from 0 to N-1 twice good arrays.

Let’s call a permutation that isn’t N-beautiful, N-ugly.
We’ll try to count the number of N-ugly permutations.

Suppose we have a N-ugly permutation.
Then, there exist two indices 1 \leq i \lt j \leq N such that P_i = P_j and P_{i+N} = P_{j+N}.
Such a pair will be called a bad pair.

In particular, note that these four positions use all copies of P_i and P_{i+N}, so the rest of the elements are simply a good array of length 2N-4 (after appropriate relabelling).

It’s not too hard to count the number of such arrays, since fixing i, j, the values P_i and P_{i+N}, and which copy of each value goes to which position is simple math and a bunch of multiplications.

However, we can’t directly subtract this count from (2N)!, since a good array with multiple bad pairs was counted multiple times!
This should lead you to the idea of using the inclusion-exclusion principle.

Let B_i be the number of good arrays with at least i bad pairs.
By the principle of inclusion-exclusion, the answer we’re looking for is

\sum_{i=0}^{N} (-1)^N B_i

So, let’s focus on computing B_i quickly.
This is a generalization of how we computed B_1, and is just a bunch of combinatorics.

Details

There are i bad pairs, which fixes 2i positions among the first N. These can be chosen in \binom{N}{2i} ways.
We need to now split these 2i positions into i pairs.
This can be done in \binom{2i}{2, 2, 2, \ldots, 2} = \frac{(2i)!}{2^i} ways by definition of the multinomial coefficient.

Next, we need to fix which elements go into each pair. Let’s fix an order of the pairs and assign values one by one.
This can be done in N\cdot (N-1)\cdot (N-2) \cdot \ldots \cdot (N-2i+1) ways, which after a little algebraic manipulation is seen to be just \binom{N}{2i} (2i)!.

However, we also need to divide by i! since the order in which we assign values to the pairs doesn’t actually matter (it makes no difference to the final permutation).

Each bad pair now has four choices (we can swap elements with the same value modulo N to obtain a different permutation with the same bad pair count).
This gives us 4^i choices in total.

Finally, there are (2N - 4i)! ways to arrange the elements not among these bad pairs.

Putting everything together, we have

B_i = \binom{N}{2i} \binom{N}{2i} \frac{4^i(2i)! (2i)! (2N - 4i)!}{2^ii!}

which can be computed in \mathcal{O}(\log N) or even \mathcal{O}(1) time, since everything is either a factorial or a power of two (or an inverse of one of them)

Since each B_i can be computed quickly, we can solve for a fixed N in \mathcal{O}(N) time, which is good enough for subtask 1.

Subtask 2

Let’s look at our expression for B_i again, and clean it up a bit by cancelling out common terms.
We obtain

B_i = \frac{2^i N! N! (2N - 4i)!}{(N - 2i)! (N - 2i)! i!} = \frac{2^i N! N!}{i!} \binom{2N - 4i}{N - 2i}

The N! terms are independent of i, so let’s ignore them for now.
Notice that we have the product of an “i-part” comprising \frac{2^i}{i!} and something like an “N-i part” comprising \binom{2N - 4i}{N - 2i}.

This should make you think along the lines of convolution.
However, we can’t directly multiply two polynomials: the N-i part doesn’t quite depend on N-i, it depends on N - 2i (and there’s no apparent nice way to make the i-part depend on 2i instead).
It’d be nice if we had 2N instead of N, so let’s do just that!

We split the computation into two cases: even N and odd N. Let’s deal with the even case first.
So, suppose N = 2n.

We now have \displaystyle\binom{2N - 4i}{N - 2i} = \binom{4n - 4i}{2n - 2i} = \binom{4(n - i)}{2(n-i)}, which does depend on n - i.

So, create the polynomials p and q, with coefficients

p_i =\frac{ (-1)^i 2^i}{i!}
q_i = \binom{4i}{2i}

and multiply them together so obtain r.

The answer for 2n is then r_n, multiplied by (2n)!^2 (which we took out at the start).

For odd N, a similar technique works: write N = 2n + 1, and once again you’ll get a very similar relation on n-i. Following this, multiply the appropriate polynomials and we’re done.

Note that this problem requires multiplying two polynomials and keeping their coefficients modulo some M, where M isn’t necessarily ‘nice’.
Standard FFT and NTT won’t suffice here; a tutorial on how to do this can be found at cp-algorithms.

TIME COMPLEXITY

\mathcal{O}(K\log K) per testcase.

CODE:

Setter's code (C++)

#include <bits/stdc++.h>
using namespace std;

int n, mod;
struct modint
{
    int32_t value;
    modint() = default;
    modint(int32_t value_) : value(value_ % mod) {}
    modint(int64_t value_) : value(value_ % mod) {}
    inline modint operator+(modint other) const
    {
        int32_t c = this->value + other.value;
        return modint(c >= mod ? c - mod : c);
    }
    inline modint operator-(modint other) const
    {
        int32_t c = this->value - other.value;
        return modint(c < 0 ? c + mod : c);
    }
    inline modint operator*(modint other) const
    {
        int32_t c = (int64_t)this->value * other.value % mod;
        return modint(c < 0 ? c + mod : c);
    }
    inline modint &operator+=(modint other)
    {
        this->value += other.value;
        if (this->value >= mod)
            this->value -= mod;
        return *this;
    }
    inline modint &operator-=(modint other)
    {
        this->value -= other.value;
        if (this->value < 0)
            this->value += mod;
        return *this;
    }
    inline modint &operator*=(modint other)
    {
        this->value = (int64_t)this->value * other.value % mod;
        if (this->value < 0)
            this->value += mod;
        return *this;
    }
    inline modint operator-() const { return modint(this->value ? mod - this->value : 0); }
    modint pow(int32_t k) const
    {
        modint x = *this, y = 1;
        for (; k; k >>= 1)
        {
            if (k & 1)
                y *= x;
            x *= x;
        }
        return y;
    }
    modint inv() const { return pow(mod - 2); } // MOD must be a prime
    inline modint operator/(modint other) const { return *this * other.inv(); }
    inline modint operator/=(modint other) { return *this *= other.inv(); }
    inline bool operator==(modint other) const { return value == other.value; }
    inline bool operator!=(modint other) const { return value != other.value; }
    inline bool operator<(modint other) const { return value < other.value; }
    inline bool operator>(modint other) const { return value > other.value; }
};
modint operator*(int64_t value, modint n) { return modint(value) * n; }
modint operator*(int32_t value, modint n) { return modint(value) * n; }
istream &operator>>(istream &in, modint &n) { return in >> n.value; }
ostream &operator<<(ostream &out, modint n) { return out << n.value; }
struct combi
{
    int n;
    vector<modint> facts, finvs, invs;
    combi(int _n) : n(_n), facts(_n), finvs(_n), invs(_n)
    {
        facts[0] = finvs[0] = 1;
        invs[1] = 1;
        for (int i = 2; i < n; i++)
            invs[i] = invs[mod % i] * (-mod / i);
        for (int i = 1; i < n; i++)
        {
            facts[i] = facts[i - 1] * i;
            finvs[i] = finvs[i - 1] * invs[i];
        }
    }
    inline modint fact(int n) { return facts[n]; }
    inline modint finv(int n) { return finvs[n]; }
    inline modint inv(int n) { return invs[n]; }
    inline modint ncr(int n, int k) { return n < k or k < 0 ? 0 : facts[n] * finvs[k] * finvs[n - k]; }
    inline modint aranj(int n, int k) { return ncr(n, k) * facts[k]; }
};
struct base
{
    double x, y;
    base() { x = y = 0; }
    base(double x, double y) : x(x), y(y) {}
};
inline base operator+(base a, base b) { return base(a.x + b.x, a.y + b.y); }
inline base operator-(base a, base b) { return base(a.x - b.x, a.y - b.y); }
inline base operator*(base a, base b) { return base(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x); }
inline base conj(base a) { return base(a.x, -a.y); }
int lim = 1;
vector<base> roots = {{0, 0}, {1, 0}};
vector<int> rev = {0, 1};
const double PI = acosl(-1.0);
void ensure_base(int p)
{
    if (p <= lim)
        return;
    rev.resize(1 << p);
    for (int i = 0; i < (1 << p); i++)
        rev[i] = (rev[i >> 1] >> 1) + ((i & 1) << (p - 1));
    roots.resize(1 << p);
    while (lim < p)
    {
        double angle = 2 * PI / (1 << (lim + 1));
        for (int i = 1 << (lim - 1); i < (1 << lim); i++)
        {
            roots[i << 1] = roots[i];
            double angle_i = angle * (2 * i + 1 - (1 << lim));
            roots[(i << 1) + 1] = base(cos(angle_i), sin(angle_i));
        }
        lim++;
    }
}
void fft(vector<base> &a, int n = -1)
{
    if (n == -1)
        n = a.size();
    assert((n & (n - 1)) == 0);
    int zeros = __builtin_ctz(n);
    ensure_base(zeros);
    int shift = lim - zeros;
    for (int i = 0; i < n; i++)
        if (i < (rev[i] >> shift))
            swap(a[i], a[rev[i] >> shift]);
    for (int k = 1; k < n; k <<= 1)
    {
        for (int i = 0; i < n; i += 2 * k)
        {
            for (int j = 0; j < k; j++)
            {
                base z = a[i + j + k] * roots[j + k];
                a[i + j + k] = a[i + j] - z;
                a[i + j] = a[i + j] + z;
            }
        }
    }
}
vector<int> multiply(vector<int> &a, vector<int> &b, int eq = 0)
{
    int need = a.size() + b.size() - 1;
    int p = 0;
    while ((1 << p) < need)
        p++;
    ensure_base(p);
    int sz = 1 << p;
    vector<base> A, B;
    if (sz > (int)A.size())
        A.resize(sz);
    for (int i = 0; i < (int)a.size(); i++)
    {
        int x = (a[i] % mod + mod) % mod;
        A[i] = base(x & ((1 << 15) - 1), x >> 15);
    }
    fill(A.begin() + a.size(), A.begin() + sz, base{0, 0});
    fft(A, sz);
    if (sz > (int)B.size())
        B.resize(sz);
    if (eq)
        copy(A.begin(), A.begin() + sz, B.begin());
    else
    {
        for (int i = 0; i < (int)b.size(); i++)
        {
            int x = (b[i] % mod + mod) % mod;
            B[i] = base(x & ((1 << 15) - 1), x >> 15);
        }
        fill(B.begin() + b.size(), B.begin() + sz, base{0, 0});
        fft(B, sz);
    }
    double ratio = 0.25 / sz;
    base r2(0, -1), r3(ratio, 0), r4(0, -ratio), r5(0, 1);
    for (int i = 0; i <= (sz >> 1); i++)
    {
        int j = (sz - i) & (sz - 1);
        base a1 = (A[i] + conj(A[j])), a2 = (A[i] - conj(A[j])) * r2;
        base b1 = (B[i] + conj(B[j])) * r3, b2 = (B[i] - conj(B[j])) * r4;
        if (i != j)
        {
            base c1 = (A[j] + conj(A[i])), c2 = (A[j] - conj(A[i])) * r2;
            base d1 = (B[j] + conj(B[i])) * r3, d2 = (B[j] - conj(B[i])) * r4;
            A[i] = c1 * d1 + c2 * d2 * r5;
            B[i] = c1 * d2 + c2 * d1;
        }
        A[j] = a1 * b1 + a2 * b2 * r5;
        B[j] = a1 * b2 + a2 * b1;
    }
    fft(A, sz);
    fft(B, sz);
    vector<int> res(need);
    for (int i = 0; i < need; i++)
    {
        long long aa = A[i].x + 0.5;
        long long bb = B[i].x + 0.5;
        long long cc = A[i].y + 0.5;
        res[i] = (aa + ((bb % mod) << 15) + ((cc % mod) << 30)) % mod;
    }
    return res;
}
int main()
{
    cin.tie(nullptr)->sync_with_stdio(false);
    cin >> n >> mod;
    combi C(n + n + 1);
    int lim = n / 2;
    vector<int> first(lim + 1);
    for (int i = 0; i <= lim; ++i)
    {
        modint rep = C.fact(4 * i) * C.finv(2 * i) * C.finv(2 * i);
        first[i] = rep.value;
    }
    vector<int> second(lim + 1);
    for (int i = 0; i <= lim; ++i)
    {
        modint rep = modint(2).pow(i) * modint(-1).pow(i) * C.finv(i);
        second[i] = rep.value;
    }
    vector<int> sumeven = multiply(first, second);
    first = vector<int>(lim + 2);
    for (int i = 1; i <= lim + 1; ++i)
    {
        modint rep = C.fact(4 * i - 2) * C.finv(2 * i - 1) * C.finv(2 * i - 1);
        first[i] = rep.value;
    }
    vector<int> sumodd = multiply(first, second);
    for (int i = 1; i <= n; ++i)
    {
        if (i % 2 == 1)
        {
            modint rep = C.fact(i) * C.fact(i) * sumodd[(i + 1) / 2];
            cout << rep << ' ';
        }
        else
        {
            modint rep = C.fact(i) * C.fact(i) * sumeven[i / 2];
            cout << rep << ' ';
        }
    }
}

Tester's code (C++)

#include <bits/stdc++.h>
using namespace std;
#ifdef tabr
#include "library/debug.cpp"
#else
#define debug(...)
#endif

struct input_checker {
    string buffer;
    int pos;

    const string all = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    const string number = "0123456789";
    const string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const string lower = "abcdefghijklmnopqrstuvwxyz";

    input_checker() {
        pos = 0;
        while (true) {
            int c = cin.get();
            if (c == -1) {
                break;
            }
            buffer.push_back((char) c);
        }
    }

    string readOne() {
        assert(pos < (int) buffer.size());
        string res;
        while (pos < (int) buffer.size() && buffer[pos] != ' ' && buffer[pos] != '\n') {
            res += buffer[pos];
            pos++;
        }
        return res;
    }

    string readString(int min_len, int max_len, const string& pattern = "") {
        assert(min_len <= max_len);
        string res = readOne();
        assert(min_len <= (int) res.size());
        assert((int) res.size() <= max_len);
        for (int i = 0; i < (int) res.size(); i++) {
            assert(pattern.empty() || pattern.find(res[i]) != string::npos);
        }
        return res;
    }

    int readInt(int min_val, int max_val) {
        assert(min_val <= max_val);
        int res = stoi(readOne());
        assert(min_val <= res);
        assert(res <= max_val);
        return res;
    }

    long long readLong(long long min_val, long long max_val) {
        assert(min_val <= max_val);
        long long res = stoll(readOne());
        assert(min_val <= res);
        assert(res <= max_val);
        return res;
    }

    vector<int> readInts(int size, int min_val, int max_val) {
        assert(min_val <= max_val);
        vector<int> res(size);
        for (int i = 0; i < size; i++) {
            res[i] = readInt(min_val, max_val);
            if (i != size - 1) {
                readSpace();
            }
        }
        return res;
    }

    vector<long long> readLongs(int size, long long min_val, long long max_val) {
        assert(min_val <= max_val);
        vector<long long> res(size);
        for (int i = 0; i < size; i++) {
            res[i] = readLong(min_val, max_val);
            if (i != size - 1) {
                readSpace();
            }
        }
        return res;
    }

    void readSpace() {
        assert((int) buffer.size() > pos);
        assert(buffer[pos] == ' ');
        pos++;
    }

    void readEoln() {
        assert((int) buffer.size() > pos);
        assert(buffer[pos] == '\n');
        pos++;
    }

    void readEof() {
        assert((int) buffer.size() == pos);
    }
};

namespace math314 {
typedef long long ll;
typedef pair<int, int> Pii;

#define FOR(i, n) for (int i = 0; i < (n); i++)
#define sz(c) ((int) (c).size())
#define ten(x) ((int) 1e##x)

template <class T>
T extgcd(T a, T b, T& x, T& y) {
    for (T u = y = 1, v = x = 0; a;) {
        T q = b / a;
        swap(x -= q * u, u);
        swap(y -= q * v, v);
        swap(b -= q * a, a);
    }
    return b;
}
template <class T>
T mod_inv(T a, T m) {
    T x, y;
    extgcd(a, m, x, y);
    return (m + x % m) % m;
}
ll mod_pow(ll a, ll n, ll mod) {
    ll ret = 1;
    ll p = a % mod;
    while (n) {
        if (n & 1) ret = ret * p % mod;
        p = p * p % mod;
        n >>= 1;
    }
    return ret;
}

template <int mod, int primitive_root>
class NTT {
   public:
    int get_mod() const { return mod; }
    void _ntt(vector<ll>& a, int sign) {
        const int n = sz(a);
        assert((n ^ (n & -n)) == 0);  // n = 2^k

        const int g = 3;                               // g is primitive root of mod
        int h = (int) mod_pow(g, (mod - 1) / n, mod);  // h^n = 1
        if (sign == -1) h = (int) mod_inv(h, mod);     // h = h^-1 % mod

        // bit reverse
        int i = 0;
        for (int j = 1; j < n - 1; ++j) {
            for (int k = n >> 1; k > (i ^= k); k >>= 1)
                ;
            if (j < i) swap(a[i], a[j]);
        }

        for (int m = 1; m < n; m *= 2) {
            const int m2 = 2 * m;
            const ll base = mod_pow(h, n / m2, mod);
            ll w = 1;
            FOR(x, m) {
                for (int s = x; s < n; s += m2) {
                    ll u = a[s];
                    ll d = a[s + m] * w % mod;
                    a[s] = u + d;
                    if (a[s] >= mod) a[s] -= mod;
                    a[s + m] = u - d;
                    if (a[s + m] < 0) a[s + m] += mod;
                }
                w = w * base % mod;
            }
        }

        for (auto& x : a)
            if (x < 0) x += mod;
    }
    void ntt(vector<ll>& input) {
        _ntt(input, 1);
    }
    void intt(vector<ll>& input) {
        _ntt(input, -1);
        const int n_inv = mod_inv(sz(input), mod);
        for (auto& x : input) x = x * n_inv % mod;
    }

    // 畳み込み演算を行う
    vector<ll> convolution(const vector<ll>& a, const vector<ll>& b) {
        int ntt_size = 1;
        while (ntt_size < sz(a) + sz(b)) ntt_size *= 2;

        vector<ll> _a = a, _b = b;
        _a.resize(ntt_size);
        _b.resize(ntt_size);

        ntt(_a);
        ntt(_b);

        FOR(i, ntt_size) {
            (_a[i] *= _b[i]) %= mod;
        }

        intt(_a);
        return _a;
    }
};

ll garner(vector<Pii> mr, int mod) {
    mr.emplace_back(mod, 0);

    vector<ll> coffs(sz(mr), 1);
    vector<ll> constants(sz(mr), 0);
    FOR(i, sz(mr) - 1) {
        // coffs[i] * v + constants[i] == mr[i].second (mod mr[i].first) を解く
        ll v = (mr[i].second - constants[i]) * mod_inv<ll>(coffs[i], mr[i].first) % mr[i].first;
        if (v < 0) v += mr[i].first;

        for (int j = i + 1; j < sz(mr); j++) {
            (constants[j] += coffs[j] * v) %= mr[j].first;
            (coffs[j] *= mr[i].first) %= mr[j].first;
        }
    }

    return constants[sz(mr) - 1];
}

typedef NTT<167772161, 3> NTT_1;
typedef NTT<469762049, 3> NTT_2;
typedef NTT<1224736769, 3> NTT_3;

vector<ll> int32mod_convolution(vector<ll> a, vector<ll> b, int mod) {
    for (auto& x : a) x %= mod;
    for (auto& x : b) x %= mod;
    NTT_1 ntt1;
    NTT_2 ntt2;
    NTT_3 ntt3;
    auto x = ntt1.convolution(a, b);
    auto y = ntt2.convolution(a, b);
    auto z = ntt3.convolution(a, b);

    vector<ll> ret(sz(x));
    vector<Pii> mr(3);
    FOR(i, sz(x)) {
        mr[0].first = ntt1.get_mod(), mr[0].second = (int) x[i];
        mr[1].first = ntt2.get_mod(), mr[1].second = (int) y[i];
        mr[2].first = ntt3.get_mod(), mr[2].second = (int) z[i];
        ret[i] = garner(mr, mod);
    }

    return ret;
}
}  // namespace math314

long long mod = 2;

struct mint {
    long long value;
    mint(long long x = 0) {
        value = x % mod;
        if (value < 0) value += mod;
    }
    mint& operator+=(const mint& other) {
        if ((value += other.value) >= mod) value -= mod;
        return *this;
    }
    mint& operator-=(const mint& other) {
        if ((value -= other.value) < 0) value += mod;
        return *this;
    }
    mint& operator*=(const mint& other) {
        value = value * other.value % mod;
        return *this;
    }
    mint& operator/=(const mint& other) {
        long long a = 0, b = 1, c = other.value, m = mod;
        while (c != 0) {
            long long t = m / c;
            m -= t * c;
            swap(c, m);
            a -= t * b;
            swap(a, b);
        }
        a %= mod;
        if (a < 0) a += mod;
        value = value * a % mod;
        return *this;
    }
    friend mint operator+(const mint& lhs, const mint& rhs) { return mint(lhs) += rhs; }
    friend mint operator-(const mint& lhs, const mint& rhs) { return mint(lhs) -= rhs; }
    friend mint operator*(const mint& lhs, const mint& rhs) { return mint(lhs) *= rhs; }
    friend mint operator/(const mint& lhs, const mint& rhs) { return mint(lhs) /= rhs; }
    mint& operator++() { return *this += 1; }
    mint& operator--() { return *this -= 1; }
    mint operator++(int) {
        mint result(*this);
        *this += 1;
        return result;
    }
    mint operator--(int) {
        mint result(*this);
        *this -= 1;
        return result;
    }
    mint operator-() const { return mint(-value); }
    bool operator==(const mint& rhs) const { return value == rhs.value; }
    bool operator!=(const mint& rhs) const { return value != rhs.value; }
    bool operator<(const mint& rhs) const { return value < rhs.value; }
};
string to_string(const mint& x) {
    return to_string(x.value);
}
ostream& operator<<(ostream& stream, const mint& x) {
    return stream << x.value;
}
istream& operator>>(istream& stream, mint& x) {
    stream >> x.value;
    x.value %= mod;
    if (x.value < 0) x.value += mod;
    return stream;
}

mint power(mint a, long long n) {
    mint res = 1;
    while (n > 0) {
        if (n & 1) {
            res *= a;
        }
        a *= a;
        n >>= 1;
    }
    return res;
}

vector<mint> fact(1, 1);
vector<mint> finv(1, 1);

mint C(int n, int k) {
    if (n < k || k < 0) {
        return mint(0);
    }
    while ((int) fact.size() < n + 1) {
        fact.emplace_back(fact.back() * (int) fact.size());
        finv.emplace_back(1 / fact.back());
    }
    return fact[n] * finv[k] * finv[n - k];
}

bool is_prime(long long n) {
    if (n <= 1) {
        return false;
    }
    for (long long i = 2; i * i <= n; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    input_checker in;
    int k = in.readInt(1, 2e5);
    in.readSpace();
    mod = in.readInt(1e7 + 19, 1e9 + 7);
    in.readEoln();
    assert(is_prime(mod));
    C(2 * k + 10, 0);
    vector<mint> f(k + 1);
    f[0] = 1;
    for (int i = 2; i <= k; i += 2) {
        f[i] = f[i - 2] * -4 / i;
    }
    vector<mint> g(k + 1);
    g[0] = 1;
    for (int i = 1; i <= k; i++) {
        g[i] = g[i - 1] * (4 * i - 2) / i;
    }
    vector<long long> ff(k + 1), gg(k + 1);
    for (int i = 0; i <= k; i++) {
        ff[i] = f[i].value;
        gg[i] = g[i].value;
    }
    auto ans = math314::int32mod_convolution(ff, gg, mod);
    for (int i = 1; i <= k; i++) {
        cout << ans[i] * fact[i] * fact[i] << " \n"[i == k];
    }
    in.readEof();
    return 0;
}

Editorialist's code (C++)

#include "bits/stdc++.h"
// #pragma GCC optimize("O3,unroll-loops")
// #pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")
using namespace std;
using ll = long long int;
mt19937_64 rng(chrono::high_resolution_clock::now().time_since_epoch().count());

template<typename T>
struct Z_p{
	using Type = typename decay<decltype(T::value)>::type;
	static vector<Type> MOD_INV;
	constexpr Z_p(): value(){ }
	template<typename U> Z_p(const U &x){ value = normalize(x); }
	template<typename U> static Type normalize(const U &x){
		Type v;
		if(-mod() <= x && x < mod()) v = static_cast<Type>(x);
		else v = static_cast<Type>(x % mod());
		if(v < 0) v += mod();
		return v;
	}
	const Type& operator()() const{ return value; }
	template<typename U> explicit operator U() const{ return static_cast<U>(value); }
	constexpr static Type mod(){ return T::value; }
	Z_p &operator+=(const Z_p &otr){ if((value += otr.value) >= mod()) value -= mod(); return *this; }
	Z_p &operator-=(const Z_p &otr){ if((value -= otr.value) < 0) value += mod(); return *this; }
	template<typename U> Z_p &operator+=(const U &otr){ return *this += Z_p(otr); }
	template<typename U> Z_p &operator-=(const U &otr){ return *this -= Z_p(otr); }
	Z_p &operator++(){ return *this += 1; }
	Z_p &operator--(){ return *this -= 1; }
	Z_p operator++(int){ Z_p result(*this); *this += 1; return result; }
	Z_p operator--(int){ Z_p result(*this); *this -= 1; return result; }
	Z_p operator-() const{ return Z_p(-value); }
	template<typename U = T>
	typename enable_if<is_same<typename Z_p<U>::Type, int>::value, Z_p>::type &operator*=(const Z_p& rhs){
		#ifdef _WIN32
		uint64_t x = static_cast<int64_t>(value) * static_cast<int64_t>(rhs.value);
		uint32_t xh = static_cast<uint32_t>(x >> 32), xl = static_cast<uint32_t>(x), d, m;
		asm(
			"divl %4; \n\t"
			: "=a" (d), "=d" (m)
			: "d" (xh), "a" (xl), "r" (mod())
		);
		value = m;
		#else
		value = normalize(static_cast<int64_t>(value) * static_cast<int64_t>(rhs.value));
		#endif
		return *this;
	}
	template<typename U = T>
	typename enable_if<is_same<typename Z_p<U>::Type, int64_t>::value, Z_p>::type &operator*=(const Z_p &rhs){
		uint64_t ret = static_cast<uint64_t>(value) * static_cast<uint64_t>(rhs.value) - static_cast<uint64_t>(mod()) * static_cast<uint64_t>(1.L / static_cast<uint64_t>(mod()) * static_cast<uint64_t>(value) * static_cast<uint64_t>(rhs.value));
		value = normalize(static_cast<int64_t>(ret + static_cast<uint64_t>(mod()) * (ret < 0) - static_cast<uint64_t>(mod()) * (ret >= static_cast<uint64_t>(mod()))));
		return *this;
	}
	template<typename U = T>
	typename enable_if<!is_integral<typename Z_p<U>::Type>::value, Z_p>::type &operator*=(const Z_p &rhs){
		value = normalize(value * rhs.value);
		return *this;
	}
	template<typename U>
	Z_p &operator^=(U e){
		if(e < 0) *this = 1 / *this, e = -e;
		Z_p res = 1;
		for(; e; *this *= *this, e >>= 1) if(e & 1) res *= *this;
		return *this = res;
	}
	template<typename U>
	Z_p operator^(U e) const{
		return Z_p(*this) ^= e;
	}
	Z_p &operator/=(const Z_p &otr){
		Type a = otr.value, m = mod(), u = 0, v = 1;
		if(a < (int)MOD_INV.size()) return *this *= MOD_INV[a];
		while(a){
			Type t = m / a;
			m -= t * a; swap(a, m);
			u -= t * v; swap(u, v);
		}
		assert(m == 1);
		return *this *= u;
	}
	template<typename U> friend const Z_p<U> &abs(const Z_p<U> &v){ return v; }
	Type value;
};
template<typename T> bool operator==(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value == rhs.value; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator==(const Z_p<T>& lhs, U rhs){ return lhs == Z_p<T>(rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator==(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) == rhs; }
template<typename T> bool operator!=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return !(lhs == rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator!=(const Z_p<T> &lhs, U rhs){ return !(lhs == rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator!=(U lhs, const Z_p<T> &rhs){ return !(lhs == rhs); }
template<typename T> bool operator<(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value < rhs.value; }
template<typename T> bool operator>(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value > rhs.value; }
template<typename T> bool operator<=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value <= rhs.value; }
template<typename T> bool operator>=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value >= rhs.value; }
template<typename T> Z_p<T> operator+(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator+(const Z_p<T> &lhs, U rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator+(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T> Z_p<T> operator-(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator-(const Z_p<T>& lhs, U rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator-(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T> Z_p<T> operator*(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator*(const Z_p<T>& lhs, U rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator*(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T> Z_p<T> operator/(const Z_p<T> &lhs, const Z_p<T> &rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator/(const Z_p<T>& lhs, U rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator/(U lhs, const Z_p<T> &rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T> istream &operator>>(istream &in, Z_p<T> &number){
	typename common_type<typename Z_p<T>::Type, int64_t>::type x;
	in >> x;
	number.value = Z_p<T>::normalize(x);
	return in;
}
template<typename T> ostream &operator<<(ostream &out, const Z_p<T> &number){ return out << number(); }


using ModType = int;
struct VarMod{ static ModType value; };
ModType VarMod::value;
ModType &mod = VarMod::value;
using Zp = Z_p<VarMod>;


// constexpr int mod = 1e9 + 7; // 1000000007
// constexpr int mod = (119 << 23) + 1; // 998244353
// constexpr int mod = 1e9 + 9; // 1000000009
// using Zp = Z_p<integral_constant<decay<decltype(mod)>::type, mod>>;

template<typename T> vector<typename Z_p<T>::Type> Z_p<T>::MOD_INV;
template<typename T = integral_constant<decay<decltype(mod)>::type, mod>>
void precalc_inverse(int SZ){
	auto &inv = Z_p<T>::MOD_INV;
	if(inv.empty()) inv.assign(2, 1);
	for(; inv.size() <= SZ; ) inv.push_back((mod - 1LL * mod / (int)inv.size() * inv[mod % (int)inv.size()]) % mod);
}

template<typename T>
vector<T> precalc_power(T base, int SZ){
	vector<T> res(SZ + 1, 1);
	for(auto i = 1; i <= SZ; ++ i) res[i] = res[i - 1] * base;
	return res;
}

template<typename T>
vector<T> precalc_factorial(int SZ){
	vector<T> res(SZ + 1, 1); res[0] = 1;
	for(auto i = 1; i <= SZ; ++ i) res[i] = res[i - 1] * i;
	return res;
}

using C = complex<double>;
using vd = vector<double>;
const long double PI = 2*acos(0.0);
void fft(vector<C>& a) {
	int n = size(a), L = 31 - __builtin_clz(n);
	static vector<complex<long double>> R(2, 1);
	static vector<C> rt(2, 1);  // (^ 10% faster if double)
	for (static int k = 2; k < n; k *= 2) {
		R.resize(n); rt.resize(n);
		const long double PI = 2*acos(0.0);
		auto x = polar(1.0L, PI / k); // M_PI, lower-case L
		for (int i = k; i < 2*k; ++i) rt[i] = R[i] = i&1 ? R[i/2] * x : R[i/2];
	}
	vector<int> rev(n);
	for (int i = 0; i < n; ++i) rev[i] = (rev[i / 2] | (i & 1) << L) / 2;
	for (int i = 0; i < n; ++i) if (i < rev[i]) swap(a[i], a[rev[i]]);
	for (int k = 1; k < n; k *= 2)
		for (int i = 0; i < n; i += 2 * k) for (int j = 0; j < k; ++j) {
			C z = rt[j+k] * a[i+j+k]; // (25% faster if hand-rolled)  /// include-line
			a[i + j + k] = a[i + j] - z;
			a[i + j] += z;
		}
}
vd conv(const vd& a, const vd& b) {
	if (a.empty() || b.empty()) return {};
	vd res(size(a) + size(b) - 1);
	int L = 32 - __builtin_clz(size(res)), n = 1 << L;
	vector<C> in(n), out(n);
	copy(begin(a), end(a), begin(in));
	for (int i = 0; i < (int)size(b); ++i) in[i].imag(b[i]);
	fft(in);
	for (auto &x : in) x *= x;
	for (int i = 0; i < n; ++i) out[i] = in[-i & (n - 1)] - conj(in[i]);
	fft(out);
	for (int i = 0; i < (int)size(res); ++i) res[i] = imag(out[i]) / (4 * n);
	return res;
}

// FFT-MOD
typedef vector<ll> vl;
vl convMod(const vl &a, const vl &b, int M) {
	if (a.empty() || b.empty()) return {};
	vl res(size(a) + size(b) - 1);
	int B=32-__builtin_clz(size(res)), n=1<<B, cut=int(sqrt(M));
	vector<C> L(n), R(n), outs(n), outl(n);
	for (int i = 0; i < (int)size(a); ++i) L[i] = C((int)a[i] / cut, (int)a[i] % cut);
	for (int i = 0; i < (int)size(b); ++i) R[i] = C((int)b[i] / cut, (int)b[i] % cut);
	fft(L), fft(R);
	for (int i = 0; i < n; ++i) {
		int j = -i & (n - 1);
		outl[j] = (L[i] + conj(L[j])) * R[i] / (2.0 * n);
		outs[j] = (L[i] - conj(L[j])) * R[i] / (2.0 * n) / 1i;
	}
	fft(outl), fft(outs);
	for (int i = 0; i < (int)size(res); ++i) {
		ll av = ll(real(outl[i])+.5), cv = ll(imag(outs[i])+.5);
		ll bv = ll(imag(outl[i])+.5) + ll(real(outs[i])+.5);
		res[i] = ((av % M * cut + bv) % M * cut + cv) % M;
	}
	return res;
}

int main()
{
	ios::sync_with_stdio(false); cin.tie(0);

	int k; cin >> k >> mod;

	auto fac = precalc_factorial<Zp>(2*k+10);
	auto C = [&] (int n, int r) {
		if (n < r or r < 0) return Zp(0);
		return fac[n] / (fac[r] * fac[n-r]);
	};

	vector<ll> poly1(k/2 + 1), poly2(k/2 + 1), poly3(k/2 + 1);
	for (int i = 0; i <= k/2; ++i) {
		Zp cur = (Zp(-2) ^ i) / fac[i];
		poly1[i] = cur.value;

		cur = C(4*i, 2*i);
		poly2[i] = cur.value;

		cur = C(4*i+2, 2*i+1);
		poly3[i] = cur.value;
	}

	auto res1 = convMod(poly1, poly2, mod);
	auto res2 = convMod(poly1, poly3, mod);

	cout << res2[0] << ' ';
	for (int i = 1; i < (k+1)/2; ++i) {
		cout << fac[2*i] * fac[2*i] * res1[i] << ' ';
		cout << fac[2*i+1] * fac[2*i+1] * res2[i] << ' ';
	}
	if (k%2 == 0) cout << fac[k] * fac[k] * res1[k/2] << '\n';
}

el_tunel · March 2, 2023, 1:33am

This problem can be solved more simply using the theory of holonomic functions. As explained above, the number of n-beautiful permutations is b_n:=n!^2 \ a_n, where

a_n = \sum_{0\le i\le n/2} \frac{(-2)^i}{i!} {2n - 4i \choose n - 2i}.

But since (1-4z)^{-1/2}=\sum_{i\ge 0} {2i \choose i} z^i, a_n will equal the coefficient of z^n in

f(z):=e^{-2 z^2} (1-4 z)^{-1/2}.

Now

f'(z)=(-4z + 2 (1-4z)^{-1})f(z)

so

(1 - 4 z) f'(z) = (2 - 4 z + 16 z^2) f(z)

and, taking the coefficient of z^n and assuming n\ge 2,

(n+1) a_{n+1} - 4 n a_n = 2 a_n - 4 a_{n-1} + 16 a_{n-2}.

Therefore

a_{n+1} = (( 4 n+2) a_n - 4 a_{n-1} + 16 a_{n-2}) (n + 1)^{-1}

and if we multiply by (n+1)!^2, we can get a recurrence for b_n:

b_{n+1} = (n + 1) (( 4 n+2) b_n - 4 b_{n-1} n^2 + 16(n-1)^2n^2 b_{n-2}),

still assuming that n\ge 2. Together with the initial values b_0=1, b_1=2, and b_2=16, this gives a short, quick way of calculating b_1, \dots, b_K modulo M without needing to use Fourier transforms or number-theoretic transforms.