CPS - Editorial

PROBLEM LINK:

Practice
Contest: Division 1
Contest: Division 2
Contest: Division 3
Contest: Division 4

Author: helloLad
Tester: sky_nik
Editorialist: iceknight1093

DIFFICULTY:

TBD

PREREQUISITES:

Math, Dynamic programming

PROBLEM:

The beauty of an array A is defined to be the maximum number of mutually disjoint subarrays with GCD 1 that can be found in it.
You’re given N and K. Compute the sum of beauty across all arrays with length N containing elements between 1 and K.

EXPLANATION:

First, observe that the beauty of an array can be computed greedily: simply repeatedly choose the shortest prefix (or equivalently suffix) that has GCD 1, and discard it.
Once the array has GCD \gt 1, stop.

Proof

We’ll prove that taking the shortest valid suffix is optimal.

Consider any optimal solution, say with segments [l_1, r_1], \ldots, [l_k, r_k].
If r_k \lt N, then replacing [l_k, r_k] with [l_k, N] gives us a solution with the same number of segments, but with a suffix chosen instead (since [l_k, r_k] already had a GCD of 1, adding more elements to it will not change this).

Now we have [l_k, N].
If [l_k+1, N] also has a GCD of 1, clearly we can just replace [l_k, N] with it to end up with a solution of the same size but shorter suffix.
Repeat this till the chosen suffix is as short as possible, as claimed.
Then, repeat the arguments for the other segments.

We can use this strategy to formulate a dynamic programming solution.
Let dp_i denote the answer for length i - that is, the sum of beauty across all arrays of length i and elements in [1, K].
To compute dp_i, we have the following:

  • As noted at the start, we can greedily choose a suffix of the array to cut out.
    Fix the length of this suffix, say x (1 \leq x \leq i).
  • Then, choose some array of length x whose GCD is 1, the actual suffix we’re cutting out.
    Suppose there are c_x ways to do this.
  • Each of these c_x choices will contribute 1 to the answer of K^{i-x} arrays (the first i-x elements can be chosen freely).
    On top of that, for each such choice of subarray, we also get a score of dp_{i-x} from the remaining elements.

Putting it together, we have

dp_i = \sum_{x = 1}^i c_x\cdot (K^{i-x} + dp_{i-x})

Now, we focus on finding c_x.
c_x is, by definition, the number of arrays of length x that form a valid suffix.
An array of length x is valid if and only if:

  • The GCD of the array is 1, and
  • No smaller suffix of it has GCD 1.

Let’s look at only the first condition for now: how many arrays of length x with elements in [1, K] have a GCD of 1?

Answer

This quantity is something that can be computed using dynamic programming.

Let A_{x, g} denote the number of arrays of length x and elements in [1, K] whose GCD is exactly g.
To get a gcd of g, certainly every chosen element should be a multiple of g.
There are \left(\left\lfloor \frac{K}{g} \right\rfloor\right)^x ways to make every element a multiple of g.
However, not all of these are valid: when all the elements are multiples of g, the GCD will also be a multiple of g, which might be greater than g.
So, we must subtract the values of A_{x,2g}, A_{x,3g}, A_{x,4g}, \ldots.

That is, we obtain

A_{x,g} = \left(\left\lfloor \frac{K}{g} \right\rfloor\right)^x - \left(A_{x,2g} + A_{x,3g} + A_{x,4g} + \ldots\right)

Computing this for all g from K to 1 takes \mathcal{O}(K\log K) time.

It’s also possible to be slightly faster.
Since we only care about A_{x,1}, it can be seen that a more ‘direct’ formula for it is

A_{x,1} = \sum_{g=1}^K \mu(g) \left(\left\lfloor \frac{K}{g} \right\rfloor\right)^x

where \mu is the Möbius function.
This can be derived by applying inclusion-exclusion starting from g = 1.
It’s still \mathcal{O}(K\log N) because of the power computation, or \mathcal{O}(K) with precomputed powers (which of course comes at the cost of memory).
Nonetheless, such an optimization isn’t required to get AC.


Next, let’s try to bring in the second condition.
That turns out to be fairly simple: after all, if a smaller suffix has GCD 1, then certainly the suffix of length x-1 must have GCD 1.
So, the number of ‘bad’ arrays of length x simply equals the number of arrays of length x-1 with GCD 1, multiplied by K (our choice for the first element).
That is, we simply have

c_x = A_{x, 1} - K\cdot A_{x-1, 1}

Since the A_{x, 1} values for every x \leq N can be computed in \mathcal{O}(NK(\log K +\log N)) time in total, and our initial DP was quadratic once c_x is known, the problem is solved.

TIME COMPLEXITY:

\mathcal{O}(N^2 + NK(\log K + \log N)) per testcase.
One or both logs can be removed with certain optimizations.

CODE:

Author's code (C++)
#include <bits/stdc++.h>
using namespace std;
#define IOS ios_base::sync_with_stdio(0); cin.tie(0); cout.tie(0);
#define ll long long
const ll mod=1e9+7;

int main(){
    IOS
    int t;
    cin>>t;
    while(t--){
        int n,m;
        cin>>n>>m;
        vector<vector<ll>> pow(m+1,vector<ll>(n+1));
        for(int i=1;i<=m;++i){
            pow[i][0]=1;
            for(int j=1;j<=n;++j){
                pow[i][j]=(i*pow[i][j-1])%mod;
            }
        }
        vector<ll> c(n+1);
        for(ll i=1;i<=n;++i){
            vector<ll> ndp(m+1);
            for(ll j=m;j>=1;--j){
                ndp[j]=pow[m/j][i];
                for(ll k=2*j;k<=m;k+=j){
                    ndp[j]-=ndp[k];
                    ndp[j]+=mod;
                    ndp[j]%=mod;
                }
            }
            c[i]=ndp[1];
        }
        vector<ll> dp(n+1);
        vector<ll> dp2(n+1);
        dp[1]=1;
        dp2[0]=dp2[1]=1;
        for(ll i=2;i<=n;++i){
            dp[i]=m*dp[i-1];
            dp[i]%=mod;
            for(ll j=0;j<i;++j){
                dp2[i]+=(dp2[j]*(c[i-j]-(m*c[i-j-1])%mod+mod)%mod)%mod;
                dp2[i]%=mod;
            }
            dp[i]+=dp2[i];
            dp[i]%=mod;
        }
        cout<<dp[n]<<'\n';
    }
    return 0;
}


Tester's code (C++)
#ifndef ATCODER_INTERNAL_MATH_HPP
#define ATCODER_INTERNAL_MATH_HPP 1

#include <utility>

#ifdef _MSC_VER
#include <intrin.h>
#endif

namespace atcoder {

namespace internal {

// @param m `1 <= m`
// @return x mod m
constexpr long long safe_mod(long long x, long long m) {
    x %= m;
    if (x < 0) x += m;
    return x;
}

// Fast modular multiplication by barrett reduction
// Reference: https://en.wikipedia.org/wiki/Barrett_reduction
// NOTE: reconsider after Ice Lake
struct barrett {
    unsigned int _m;
    unsigned long long im;

    // @param m `1 <= m`
    explicit barrett(unsigned int m) : _m(m), im((unsigned long long)(-1) / m + 1) {}

    // @return m
    unsigned int umod() const { return _m; }

    // @param a `0 <= a < m`
    // @param b `0 <= b < m`
    // @return `a * b % m`
    unsigned int mul(unsigned int a, unsigned int b) const {
        // [1] m = 1
        // a = b = im = 0, so okay

        // [2] m >= 2
        // im = ceil(2^64 / m)
        // -> im * m = 2^64 + r (0 <= r < m)
        // let z = a*b = c*m + d (0 <= c, d < m)
        // a*b * im = (c*m + d) * im = c*(im*m) + d*im = c*2^64 + c*r + d*im
        // c*r + d*im < m * m + m * im < m * m + 2^64 + m <= 2^64 + m * (m + 1) < 2^64 * 2
        // ((ab * im) >> 64) == c or c + 1
        unsigned long long z = a;
        z *= b;
#ifdef _MSC_VER
        unsigned long long x;
        _umul128(z, im, &x);
#else
        unsigned long long x =
            (unsigned long long)(((unsigned __int128)(z)*im) >> 64);
#endif
        unsigned long long y = x * _m;
        return (unsigned int)(z - y + (z < y ? _m : 0));
    }
};

// @param n `0 <= n`
// @param m `1 <= m`
// @return `(x ** n) % m`
constexpr long long pow_mod_constexpr(long long x, long long n, int m) {
    if (m == 1) return 0;
    unsigned int _m = (unsigned int)(m);
    unsigned long long r = 1;
    unsigned long long y = safe_mod(x, m);
    while (n) {
        if (n & 1) r = (r * y) % _m;
        y = (y * y) % _m;
        n >>= 1;
    }
    return r;
}

// Reference:
// M. Forisek and J. Jancina,
// Fast Primality Testing for Integers That Fit into a Machine Word
// @param n `0 <= n`
constexpr bool is_prime_constexpr(int n) {
    if (n <= 1) return false;
    if (n == 2 || n == 7 || n == 61) return true;
    if (n % 2 == 0) return false;
    long long d = n - 1;
    while (d % 2 == 0) d /= 2;
    constexpr long long bases[3] = {2, 7, 61};
    for (long long a : bases) {
        long long t = d;
        long long y = pow_mod_constexpr(a, t, n);
        while (t != n - 1 && y != 1 && y != n - 1) {
            y = y * y % n;
            t <<= 1;
        }
        if (y != n - 1 && t % 2 == 0) {
            return false;
        }
    }
    return true;
}
template <int n> constexpr bool is_prime = is_prime_constexpr(n);

// @param b `1 <= b`
// @return pair(g, x) s.t. g = gcd(a, b), xa = g (mod b), 0 <= x < b/g
constexpr std::pair<long long, long long> inv_gcd(long long a, long long b) {
    a = safe_mod(a, b);
    if (a == 0) return {b, 0};

    // Contracts:
    // [1] s - m0 * a = 0 (mod b)
    // [2] t - m1 * a = 0 (mod b)
    // [3] s * |m1| + t * |m0| <= b
    long long s = b, t = a;
    long long m0 = 0, m1 = 1;

    while (t) {
        long long u = s / t;
        s -= t * u;
        m0 -= m1 * u;  // |m1 * u| <= |m1| * s <= b

        // [3]:
        // (s - t * u) * |m1| + t * |m0 - m1 * u|
        // <= s * |m1| - t * u * |m1| + t * (|m0| + |m1| * u)
        // = s * |m1| + t * |m0| <= b

        auto tmp = s;
        s = t;
        t = tmp;
        tmp = m0;
        m0 = m1;
        m1 = tmp;
    }
    // by [3]: |m0| <= b/g
    // by g != b: |m0| < b/g
    if (m0 < 0) m0 += b / s;
    return {s, m0};
}

// Compile time primitive root
// @param m must be prime
// @return primitive root (and minimum in now)
constexpr int primitive_root_constexpr(int m) {
    if (m == 2) return 1;
    if (m == 167772161) return 3;
    if (m == 469762049) return 3;
    if (m == 754974721) return 11;
    if (m == 998244353) return 3;
    int divs[20] = {};
    divs[0] = 2;
    int cnt = 1;
    int x = (m - 1) / 2;
    while (x % 2 == 0) x /= 2;
    for (int i = 3; (long long)(i)*i <= x; i += 2) {
        if (x % i == 0) {
            divs[cnt++] = i;
            while (x % i == 0) {
                x /= i;
            }
        }
    }
    if (x > 1) {
        divs[cnt++] = x;
    }
    for (int g = 2;; g++) {
        bool ok = true;
        for (int i = 0; i < cnt; i++) {
            if (pow_mod_constexpr(g, (m - 1) / divs[i], m) == 1) {
                ok = false;
                break;
            }
        }
        if (ok) return g;
    }
}
template <int m> constexpr int primitive_root = primitive_root_constexpr(m);

// @param n `n < 2^32`
// @param m `1 <= m < 2^32`
// @return sum_{i=0}^{n-1} floor((ai + b) / m) (mod 2^64)
unsigned long long floor_sum_unsigned(unsigned long long n,
                                      unsigned long long m,
                                      unsigned long long a,
                                      unsigned long long b) {
    unsigned long long ans = 0;
    while (true) {
        if (a >= m) {
            ans += n * (n - 1) / 2 * (a / m);
            a %= m;
        }
        if (b >= m) {
            ans += n * (b / m);
            b %= m;
        }

        unsigned long long y_max = a * n + b;
        if (y_max < m) break;
        // y_max < m * (n + 1)
        // floor(y_max / m) <= n
        n = (unsigned long long)(y_max / m);
        b = (unsigned long long)(y_max % m);
        std::swap(m, a);
    }
    return ans;
}

}  // namespace internal

}  // namespace atcoder

#endif  // ATCODER_INTERNAL_MATH_HPP

#ifndef ATCODER_INTERNAL_TYPE_TRAITS_HPP
#define ATCODER_INTERNAL_TYPE_TRAITS_HPP 1

#include <cassert>
#include <numeric>
#include <type_traits>

namespace atcoder {

namespace internal {

#ifndef _MSC_VER
template <class T>
using is_signed_int128 =
    typename std::conditional<std::is_same<T, __int128_t>::value ||
                                  std::is_same<T, __int128>::value,
                              std::true_type,
                              std::false_type>::type;

template <class T>
using is_unsigned_int128 =
    typename std::conditional<std::is_same<T, __uint128_t>::value ||
                                  std::is_same<T, unsigned __int128>::value,
                              std::true_type,
                              std::false_type>::type;

template <class T>
using make_unsigned_int128 =
    typename std::conditional<std::is_same<T, __int128_t>::value,
                              __uint128_t,
                              unsigned __int128>;

template <class T>
using is_integral = typename std::conditional<std::is_integral<T>::value ||
                                                  is_signed_int128<T>::value ||
                                                  is_unsigned_int128<T>::value,
                                              std::true_type,
                                              std::false_type>::type;

template <class T>
using is_signed_int = typename std::conditional<(is_integral<T>::value &&
                                                 std::is_signed<T>::value) ||
                                                    is_signed_int128<T>::value,
                                                std::true_type,
                                                std::false_type>::type;

template <class T>
using is_unsigned_int =
    typename std::conditional<(is_integral<T>::value &&
                               std::is_unsigned<T>::value) ||
                                  is_unsigned_int128<T>::value,
                              std::true_type,
                              std::false_type>::type;

template <class T>
using to_unsigned = typename std::conditional<
    is_signed_int128<T>::value,
    make_unsigned_int128<T>,
    typename std::conditional<std::is_signed<T>::value,
                              std::make_unsigned<T>,
                              std::common_type<T>>::type>::type;

#else

template <class T> using is_integral = typename std::is_integral<T>;

template <class T>
using is_signed_int =
    typename std::conditional<is_integral<T>::value && std::is_signed<T>::value,
                              std::true_type,
                              std::false_type>::type;

template <class T>
using is_unsigned_int =
    typename std::conditional<is_integral<T>::value &&
                                  std::is_unsigned<T>::value,
                              std::true_type,
                              std::false_type>::type;

template <class T>
using to_unsigned = typename std::conditional<is_signed_int<T>::value,
                                              std::make_unsigned<T>,
                                              std::common_type<T>>::type;

#endif

template <class T>
using is_signed_int_t = std::enable_if_t<is_signed_int<T>::value>;

template <class T>
using is_unsigned_int_t = std::enable_if_t<is_unsigned_int<T>::value>;

template <class T> using to_unsigned_t = typename to_unsigned<T>::type;

}  // namespace internal

}  // namespace atcoder

#endif  // ATCODER_INTERNAL_TYPE_TRAITS_HPP

#ifndef ATCODER_MODINT_HPP
#define ATCODER_MODINT_HPP 1

#include <cassert>
#include <numeric>
#include <type_traits>

#ifdef _MSC_VER
#include <intrin.h>
#endif

namespace atcoder {

namespace internal {

struct modint_base {};
struct static_modint_base : modint_base {};

template <class T> using is_modint = std::is_base_of<modint_base, T>;
template <class T> using is_modint_t = std::enable_if_t<is_modint<T>::value>;

}  // namespace internal

template <int m, std::enable_if_t<(1 <= m)>* = nullptr>
struct static_modint : internal::static_modint_base {
    using mint = static_modint;

  public:
    static constexpr int mod() { return m; }
    static mint raw(int v) {
        mint x;
        x._v = v;
        return x;
    }

    static_modint() : _v(0) {}
    template <class T, internal::is_signed_int_t<T>* = nullptr>
    static_modint(T v) {
        long long x = (long long)(v % (long long)(umod()));
        if (x < 0) x += umod();
        _v = (unsigned int)(x);
    }
    template <class T, internal::is_unsigned_int_t<T>* = nullptr>
    static_modint(T v) {
        _v = (unsigned int)(v % umod());
    }

    unsigned int val() const { return _v; }

    mint& operator++() {
        _v++;
        if (_v == umod()) _v = 0;
        return *this;
    }
    mint& operator--() {
        if (_v == 0) _v = umod();
        _v--;
        return *this;
    }
    mint operator++(int) {
        mint result = *this;
        ++*this;
        return result;
    }
    mint operator--(int) {
        mint result = *this;
        --*this;
        return result;
    }

    mint& operator+=(const mint& rhs) {
        _v += rhs._v;
        if (_v >= umod()) _v -= umod();
        return *this;
    }
    mint& operator-=(const mint& rhs) {
        _v -= rhs._v;
        if (_v >= umod()) _v += umod();
        return *this;
    }
    mint& operator*=(const mint& rhs) {
        unsigned long long z = _v;
        z *= rhs._v;
        _v = (unsigned int)(z % umod());
        return *this;
    }
    mint& operator/=(const mint& rhs) { return *this = *this * rhs.inv(); }

    mint operator+() const { return *this; }
    mint operator-() const { return mint() - *this; }

    mint pow(long long n) const {
        assert(0 <= n);
        mint x = *this, r = 1;
        while (n) {
            if (n & 1) r *= x;
            x *= x;
            n >>= 1;
        }
        return r;
    }
    mint inv() const {
        if (prime) {
            assert(_v);
            return pow(umod() - 2);
        } else {
            auto eg = internal::inv_gcd(_v, m);
            assert(eg.first == 1);
            return eg.second;
        }
    }

    friend mint operator+(const mint& lhs, const mint& rhs) {
        return mint(lhs) += rhs;
    }
    friend mint operator-(const mint& lhs, const mint& rhs) {
        return mint(lhs) -= rhs;
    }
    friend mint operator*(const mint& lhs, const mint& rhs) {
        return mint(lhs) *= rhs;
    }
    friend mint operator/(const mint& lhs, const mint& rhs) {
        return mint(lhs) /= rhs;
    }
    friend bool operator==(const mint& lhs, const mint& rhs) {
        return lhs._v == rhs._v;
    }
    friend bool operator!=(const mint& lhs, const mint& rhs) {
        return lhs._v != rhs._v;
    }

  private:
    unsigned int _v;
    static constexpr unsigned int umod() { return m; }
    static constexpr bool prime = internal::is_prime<m>;
};

template <int id> struct dynamic_modint : internal::modint_base {
    using mint = dynamic_modint;

  public:
    static int mod() { return (int)(bt.umod()); }
    static void set_mod(int m) {
        assert(1 <= m);
        bt = internal::barrett(m);
    }
    static mint raw(int v) {
        mint x;
        x._v = v;
        return x;
    }

    dynamic_modint() : _v(0) {}
    template <class T, internal::is_signed_int_t<T>* = nullptr>
    dynamic_modint(T v) {
        long long x = (long long)(v % (long long)(mod()));
        if (x < 0) x += mod();
        _v = (unsigned int)(x);
    }
    template <class T, internal::is_unsigned_int_t<T>* = nullptr>
    dynamic_modint(T v) {
        _v = (unsigned int)(v % mod());
    }

    unsigned int val() const { return _v; }

    mint& operator++() {
        _v++;
        if (_v == umod()) _v = 0;
        return *this;
    }
    mint& operator--() {
        if (_v == 0) _v = umod();
        _v--;
        return *this;
    }
    mint operator++(int) {
        mint result = *this;
        ++*this;
        return result;
    }
    mint operator--(int) {
        mint result = *this;
        --*this;
        return result;
    }

    mint& operator+=(const mint& rhs) {
        _v += rhs._v;
        if (_v >= umod()) _v -= umod();
        return *this;
    }
    mint& operator-=(const mint& rhs) {
        _v += mod() - rhs._v;
        if (_v >= umod()) _v -= umod();
        return *this;
    }
    mint& operator*=(const mint& rhs) {
        _v = bt.mul(_v, rhs._v);
        return *this;
    }
    mint& operator/=(const mint& rhs) { return *this = *this * rhs.inv(); }

    mint operator+() const { return *this; }
    mint operator-() const { return mint() - *this; }

    mint pow(long long n) const {
        assert(0 <= n);
        mint x = *this, r = 1;
        while (n) {
            if (n & 1) r *= x;
            x *= x;
            n >>= 1;
        }
        return r;
    }
    mint inv() const {
        auto eg = internal::inv_gcd(_v, mod());
        assert(eg.first == 1);
        return eg.second;
    }

    friend mint operator+(const mint& lhs, const mint& rhs) {
        return mint(lhs) += rhs;
    }
    friend mint operator-(const mint& lhs, const mint& rhs) {
        return mint(lhs) -= rhs;
    }
    friend mint operator*(const mint& lhs, const mint& rhs) {
        return mint(lhs) *= rhs;
    }
    friend mint operator/(const mint& lhs, const mint& rhs) {
        return mint(lhs) /= rhs;
    }
    friend bool operator==(const mint& lhs, const mint& rhs) {
        return lhs._v == rhs._v;
    }
    friend bool operator!=(const mint& lhs, const mint& rhs) {
        return lhs._v != rhs._v;
    }

  private:
    unsigned int _v;
    static internal::barrett bt;
    static unsigned int umod() { return bt.umod(); }
};
template <int id> internal::barrett dynamic_modint<id>::bt(998244353);

using modint998244353 = static_modint<998244353>;
using modint1000000007 = static_modint<1000000007>;
using modint = dynamic_modint<-1>;

namespace internal {

template <class T>
using is_static_modint = std::is_base_of<internal::static_modint_base, T>;

template <class T>
using is_static_modint_t = std::enable_if_t<is_static_modint<T>::value>;

template <class> struct is_dynamic_modint : public std::false_type {};
template <int id>
struct is_dynamic_modint<dynamic_modint<id>> : public std::true_type {};

template <class T>
using is_dynamic_modint_t = std::enable_if_t<is_dynamic_modint<T>::value>;

}  // namespace internal

}  // namespace atcoder

#endif  // ATCODER_MODINT_HPP

using namespace atcoder;
using mint = modint1000000007;

#include <bits/stdc++.h>
using namespace std;

constexpr int mx = 20'000;

int main() {
    vector<int> pd(mx + 1);
    iota(pd.begin(), pd.end(), 0);
    for (int d = 2; d * d <= mx; ++d) {
        if (pd[d] != d) {
            continue;
        }
        for (int m = 2 * d; m <= mx; m += d) {
            pd[m] = d;
        }
    }

    auto mu_func = [&](int n) -> int {
        map<int, int> nu;
        while (n > 1) {
            ++nu[pd[n]];
            n /= pd[n];
        }
        for (const auto& [k, v] : nu) {
            if (v >= 2) {
                return 0;
            }
        }
        return (nu.size() & 1) ? -1 : 1;
    };

    vector<int> mu(mx + 1);
    for (int i = 1; i <= mx; ++i) {
        mu[i] = mu_func(i);
    }

    int t; cin >> t; while (t--) {
        int n, k;
        cin >> n >> k;

        vector<mint> cnt(n + 1);
        for (int i = 1; i <= n; ++i) {
            for (int j = 1; j <= k; ++j) {
                cnt[i] += mu[j] * mint(k / j).pow(i);
            }
        }
        for (int i = n; i >= 1; --i) {
            cnt[i] -= k * cnt[i - 1];
        }

        vector<mint> dp(n + 1);
        dp[0] = 1;
        for (int i = 1; i <= n; ++i) {
            for (int j = 1; j <= i; ++j) {
                dp[i] += cnt[j] * dp[i - j];
            }
        }

        mint ans = 0;
        for (int i = 1; i <= n; ++i) {
            ans += dp[i] * mint(k).pow(n - i);
        }
        cout << ans.val() << '\n';
    }
}
Editorialist's code (C++)
// #pragma GCC optimize("O3,unroll-loops")
// #pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")
#include "bits/stdc++.h"
using namespace std;
using ll = long long int;
mt19937_64 RNG(chrono::high_resolution_clock::now().time_since_epoch().count());

int main()
{
    ios::sync_with_stdio(false); cin.tie(0);

    const int mxn = 405;
    const int mxk = 20005;
    const int mod = 1e9 + 7;
    vector<int> pr(mxk, 1);
    vector<int> mu(mxk, 1);
    vector pw(mxk, vector(mxn, 1));
    for (int k = 1; k < mxk; ++k)
        for (int n = 1; n < mxn; ++n)
            pw[k][n] = 1ll * pw[k][n-1] * k % mod;
    for (int i = 2; i < mxk; ++i) {
        if (!pr[i]) continue;
        for (int j = i; j < mxk; j += i) {
            mu[j] *= -1;
            pr[j] = 0;
        }
        for (ll j = 1ll*i*i; j < mxk; j += 1ll*i*i)
            mu[j] = 0;
    }
    
    int t; cin >> t;
    while (t--) {
        int n, k; cin >> n >> k;

        vector<int> dp(n+1), dp3(n+1);
        for (int i = 1; i <= n; ++i) {
            // dp[i] = number of arrays of length i with gcd 1
            for (int g = 1; g <= k; ++g) {
                int muls = k/g;
                if (mu[g] == 0) continue;
                dp[i] = (dp[i] + mu[g] * pw[muls][i]) % mod;
            }
        }
        for (int i = n; i >= 1; --i) {
            dp3[i] = (dp[i] - 1ll*k*dp[i-1])%mod;
            dp3[i] = (dp3[i] + mod) % mod;
        }

        vector<int> dp2(n+1);
        for (int i = 1; i <= n; ++i) {
            for (int x = 1; x <= i; ++x) {
                dp2[i] = (dp2[i] + 1ll * dp3[x] * (pw[k][i-x] + dp2[i-x])) % mod;
            }
        }
        cout << dp2[n]  << '\n';
    }
}

I would like to describe my O(NKlogK) solution which is much simpler than this editorial.
Following the greedy idea described in this editorial, we define a good prefix as the shortest prefix subarray that gcd=1. Then, we define dp status as follows:
dp_{i,j}= the number of arrays with length i, and after recursively removing all good prefix subarray, the remaining subarray’s gcd=j.
After each length i, we add dp_{i, 1} \times K^{N - i} to answer, because for each position after, there are K choices. Then we can move dp_{i,1} to dp_{i,0} because now we restart from an empty array.
The initial status is dp_{0,0}=1 where we define j=0 as after removing all good prefix, the array is empty.
To transform from i - 1 to i, a naive way is brute force through each j and each k,

dp_{i,l}=\sum_{j,k \in [1,K], gcd(j,k)=l}dp_{i-1,j}

However, this results in O(NK^2) time complexity.
To speedup, instead of trying to get dp_{i,j} directly, we first get

g_{i,j}=\sum_{k \in [1,K], j | k} dp_{i,k}

To get g_{i,j}, we can choose all dp_{i-1,k(j|k)}, and for the ith element, we can also choose all k that j | k (which has \left\lfloor{\frac{K}{j}}\right\rfloor choices). So

g_{i,j}=\left\lfloor{\frac{K}{j}}\right\rfloor\times \sum_{k \in [1,K],j|k}dp_{i-1,k}

Then dp_{i,j} could be derived by

dp_{i,j}=g_{i,j}-\sum_{k \gt j, j | k}dp_{i, k}

which we can iterate j from K down to 1 with harmonical trick.

The total time complexity is O(NKlogK) and space complexity is O(K) if using rolling array.
In implementation, we don’t need to keep a g array, just use a temporary variable to record current sum. See my submission https://www.codechef.com/viewsolution/1063419363 for more detail.

1 Like

Thank you for taking the time to write down a detailed explanation! I tried similar ideas in testing but had an O(K^2) overhead in some places and couldn’t optimize it away. We eventually made the second subtask to allow O(K^2 + NK \log K). It’s good to know that it can be further optimized, thus making the subtask not as useless for the full solution as we initially thought.