CNTDISJOINT - Editorial

PROBLEM LINK:

Practice
Contest: Division 1
Contest: Division 2
Contest: Division 3
Contest: Division 4

Author: raysh07
Tester: raysh07
Editorialist: iceknight1093

DIFFICULTY:

Medium

PREREQUISITES:

Dynamic programming, elementary combinatorics

PROBLEM:

Given M, count the number of arrays A, of any length, such that:

  • 1 \leq A_i \leq M
  • A_i\land A_j = 0 for all 1 \leq i \lt j \leq |A|, i.e. every pair of elements have a bitwise AND of 0.

EXPLANATION:

A_i\land A_j = 0 means that any bit can appear in at most one element of A.

This condition tells us that any valid array can’t be very long: in particular the longest we can achieve is \left\lfloor \log_2 M \right\rfloor + 1 , for example with the array 2^0, 2^1, 2^2, \ldots

Let’s solve the problem for a simple case first: suppose M = 2^k - 1, so that we can essentially freely use the first k bits without having to worry about exceeding M.
As noted above, the length can’t exceed k here.

This case can be solved by dynamic programming.
Let dp(i, x) denote the number of valid arrays of length i, such that they’ve used x bits (out of k) so far.
To compute this, let’s fix b, the number of bits used for the element at index i.
Then,

  • There are dp(i-1, x-b) ways to form the first i-1 elements.
  • For the i-th element, we’re choosing b bits. There are k - (x - b) bits remaining, and any b out of them will give us a unique array. So, there are \binom{k-x+b}{b} choices for the value at this index.

This means we obtain dp(i, x) = \sum_{b = 1}^x \binom{k-x+b}{b}\cdot dp(i-1, x-b).

This only needs to be computed for i, x \leq k, so there are \mathcal{O}(k^2) states and \mathcal{O}(k) transitions from each of them, leading to \mathcal{O}(k^3) overall.


Let’s generalize this solution to arbitrary M.

Suppose 2^k \leq M is the largest power of 2 not exceeding M.
Note that the array can contain at most one element in [2^k, M], because these are the elements available to us with the k-th bit set.

If we choose to not have any element in this range, then the problem simply reduces to solving for [0, 2^k) instead, which we already know how to do from above.
What if we do choose an element in [2^k, M]?

Suppose the element we choose is x, and it has b_0 bits set in it other than 2^k.
That leaves us with elements in [0, 2^k) which don’t have any of these b_0 bits set.
Observe that this is equivalent to just solving for [0, 2^{k - b_0}) which we already know how to do!
The only thing that needs to be taken care of, is where to place x.
However, this is easy: if we have an array of length i from [0, 2^{k-b_0}), there are (i+1) choices for where to place x into it, and all of these lead to distinct arrays.

Now, observe that the actual value of x doesn’t matter at all: only the number of its set bits.
The number of set bits (ignoring 2^k) can’t exceed k, so we can simply solve the problem for each b_0 from 0 to k, and multiply this by the number of elements in [2^k, M] that have b_0 set.

That leaves two details:

  1. Computing, for each b_0, the count of integers in [2^k, M] that have exactly b_0 bits set.
  2. Ensuring our solution is fast enough.

The first can be done as follows.

Details

Let \text{freq}[b] denote the number of elements in [2^k, M] that have exactly b bits set.
Note that since we’re ignoring the k+1-th bit, this is the same as asking for the popcount frequencies of the range [0, M-2^k] instead, so we’ll focus on that instead.

In particular, we’ll solve the problem “given an integer N, find the popcount frequencies of all integers \leq N”.
This can be done in a similar fashion to how we’re analyzing this problem in the first place.
Let 2^r \leq N be the largest power of 2 not exceeding N.
Then,

  • For the range [0, 2^r), bits can be chosen freely. There are \binom{r}{b} ways to choose a number with b bits in this range.
  • This leaves the range [2^r, N].
    However, this is basically the same problem we started with, so we can simply repeat the process and solve for [0, N - 2^r] instead.
    The only difference is that the bit 2^r will always be set now, so all popcounts obtained in this case must be increased by 1 - which is easy to do, just maintain the necessary shift and increment it.

This process gives us all the frequencies for [0, N] in \mathcal{O}(\log^2 N) time, because we do \mathcal{O}(\log N) work updating frequencies for each bit.
This is fast enough for our purposes.

As for the second, observe that we have \mathcal{O}(\log M) distinct values of b_0, and for each of them we run a DP that takes \mathcal{O}(\log^3 M) time, we currently have a solution that’s \mathcal{O}(\log^4 M).
This would be fast enough for a single test, but we have 10^4 of them so it’ll likely be too slow since \log M can be as large as 60.

To optimize this, note that the DP doesn’t need to be recomputed over and over again for different tests: it only depends on the initial number of bits present (i.e. the value k), the array length, and the number of bits used so far.
This means we can simply precompute all the necessary DP values once, before answering any tests, and then simply look up these precomputed values when necessary.
What we want to look up is simply the sum of values in some table, which can be done in constant time if it’s saved beforehand, so we’re down to \mathcal{O}(\log^2 M) per test which is quite fast.


To summarize, our solution is as follows.
Let 2^k be the largest power of two not exceeding M.
Initialize the answer to the answer for [0, 2^k) (which is precomputed and stored).

Let \text{freq}[b] be the number of integers in [0, N-2^k] with b set bits.
For each b, and each length l, add \text{freq}[b] \cdot (l+1) \cdot f(k - b, l) to the answer, where f(k - b, l) denotes the number of valid arrays of length l when there are k-b bits to work with initially (which is also a precomputed value).

TIME COMPLEXITY:

\mathcal{O}(B^4) precomputation followed by \mathcal{O}(\log^2 M) per testcase, where B = \log_2 {10^{18}} \approx 60.

CODE:

Author's code (C++)
#include <bits/stdc++.h>
using namespace std;
#define int long long
#define INF (int)1e18

mt19937_64 RNG(chrono::steady_clock::now().time_since_epoch().count());

const int mod = 998244353;

struct mint{
    int x;

    mint (){ x = 0;}
    mint (int32_t xx){ x = xx % mod; if (x < 0) x += mod;}
    mint (long long xx){ x = xx % mod; if (x < 0) x += mod;}

    int val(){
        return x;
    }
    mint &operator++(){
        x++;
        if (x == mod) x = 0;
        return *this;
    }
    mint &operator--(){
        if (x == 0) x = mod;
        x--;
        return *this;
    }
    mint operator++(int32_t){
        mint result = *this;
        ++*this;
        return result;
    }
    
    mint operator--(int32_t){
        mint result = *this;
        --*this;
        return result;
    }
    mint& operator+=(const mint &b){
        x += b.x;
        if (x >= mod) x -= mod;
        return *this;
    }
    mint& operator-=(const mint &b){
        x -= b.x;
        if (x < 0) x += mod;
        return *this;
    }
    mint& operator*=(const mint &b){
        long long z = x;
        z *= b.x;
        z %= mod;
        x = (int)z;
        return *this;
    }
    mint operator+() const {
        return *this;
    }
    mint operator-() const {
        return mint() - *this;
    }
    mint operator/=(const mint &b){
        return *this = *this * b.inv();
    }
    mint power(long long n) const {
        mint ok = *this, r = 1;
        while (n){
            if (n & 1){
                r *= ok;
            }
            ok *= ok;
            n >>= 1;
        }
        return r;
    }
    mint inv() const {
        return power(mod - 2);
    }
    friend mint operator+(const mint& a, const mint& b){ return mint(a) += b;}
    friend mint operator-(const mint& a, const mint& b){ return mint(a) -= b;}
    friend mint operator*(const mint& a, const mint& b){ return mint(a) *= b;}
    friend mint operator/(const mint& a, const mint& b){ return mint(a) /= b;}
    friend bool operator==(const mint& a, const mint& b){ return a.x == b.x;}
    friend bool operator!=(const mint& a, const mint& b){ return a.x != b.x;}
    mint power(mint a, long long n){
        return a.power(n);
    }
    friend ostream &operator<<(ostream &os, const mint &m) {
        os << m.x;
        return os;
    }
    explicit operator bool() const {
        return x != 0;
    }
};

// Remember to check MOD

struct factorials{
    int n;
    vector <mint> ff, iff;
    
    factorials(int nn){
        n = nn;
        ff.resize(n + 1);
        iff.resize(n + 1);
        
        ff[0] = 1;
        for (int i = 1; i <= n; i++){
            ff[i] = ff[i - 1] * i;
        }
        
        iff[n] = ff[n].inv();
        for (int i = n - 1; i >= 0; i--){
            iff[i] = iff[i + 1] * (i + 1);
        }
    }
    
    mint C(int n, int r){
        if (n == r) return mint(1);
        if (n < 0 || r < 0 || r > n) return mint(0);
        return ff[n] * iff[r] * iff[n - r];
    }
    
    mint P(int n, int r){
        if (n < 0 || r < 0 || r > n) return mint(0);
        return ff[n] * iff[n - r];
    }
    
    mint solutions(int n, int r){
        // Solutions to x1 + x2 + ... + xn = r, xi >= 0 
        return C(n + r - 1, n - 1);
    }
    
    mint catalan(int n){
        return ff[2 * n] * iff[n] * iff[n + 1];
    }
};

const int PRECOMP = 3e6 + 69;
factorials F(PRECOMP);

// REMEMBER To check MOD and PRECOMP

const int N = 60;
vector<vector<mint>> f(N, vector<mint>(N));
vector<vector<mint>> g(N, vector<mint>(N));

void Solve() 
{
    int n; cin >> n;
    
    int m = 0;
    for (int i = 0; i < 60; i++){
        if ((1LL << i) <= n){
            m++;
        }
    }
    
    mint ans = 0;
    
    // do not use your last bit, you have m - 1 bits
    for (int n = 1; n <= m - 1; n++){
        ans += g[n][m - 1];
    }
    
    vector <mint> dp(m + 1, 0);
    // dp[i] = ways to get exactly i bits used up in the suffix 
    
    for (int diff = m - 2; diff >= -1; diff--){
        if (diff != -1 && !(n >> diff & 1)){
            continue;
        }
        // different bit here 
        int forced = 0;
        for (int i = m - 1; i > diff; i--){
            if (n >> i & 1){
                forced += 1;
            }
        }
        
        int fr = max(0LL, diff);
        
        for (int i = 0; i <= fr; i++){
            // choose i free bits
            dp[i + forced] += F.C(fr, i);
        }
    }
    
    for (int i = 1; i <= m; i++){
        int left = m - i;
        
        for (int n = 0; n <= left; n++){
            ans += g[n][left] * (n + 1) * dp[i];
        }
    }
    
    cout << ans << "\n";
}

int32_t main() 
{
    auto begin = std::chrono::high_resolution_clock::now();
    ios_base::sync_with_stdio(0);
    cin.tie(0);
    int t = 1;
    // freopen("in",  "r", stdin);
    // freopen("out", "w", stdout);
    
    // f[i][j] = number of arrays using j things with length exactly i 
    f[0][0] = 1;
    for (int i = 1; i < N; i++){
        for (int j = i; j < N; j++){
            mint ans = 0;
            
            for (int k = 1; k <= i; k++){
                mint ways = mint(k).power(j) * F.C(i, k);
                if (k % 2 == i % 2){
                    ans += ways;
                } else {
                    ans -= ways;
                }
            }
            
            f[i][j] = ans;
        }
    }
    
    // g[i][j] = number of arrays using <= j things and i length 
    for (int i = 0; i < N; i++){
        for (int j = 0; j < N; j++){
            for (int use = 0; use <= j; use++){
                g[i][j] += f[i][use] * F.C(j, use);
            }
        }
    }
    
    cin >> t;
    for(int i = 1; i <= t; i++) 
    {
        //cout << "Case #" << i << ": ";
        Solve();
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin);
    cerr << "Time measured: " << elapsed.count() * 1e-9 << " seconds.\n"; 
    return 0;
}
Editorialist's code (C++)
// #include <bits/allocator.h>
// #pragma GCC optimize("O3,unroll-loops")
// #pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")
#include "bits/stdc++.h"
using namespace std;
using ll = long long int;
mt19937_64 RNG(chrono::high_resolution_clock::now().time_since_epoch().count());

/**
 * Integers modulo p, where p is a prime
 * Source: Aeren (modified from tourist?)
 *         Modmul for 64-bit mod from kactl:ModMulLL
 * Works with p < 7.2e18 with x87 80-bit long double, and p < 2^52 ~ 4.5e12 with 64-bit
 */
template<typename T>
struct Z_p{
	using Type = typename decay<decltype(T::value)>::type;
	static vector<Type> MOD_INV;
	constexpr Z_p(): value(){ }
	template<typename U> Z_p(const U &x){ value = normalize(x); }
	template<typename U> static Type normalize(const U &x){
		Type v;
		if(-mod() <= x && x < mod()) v = static_cast<Type>(x);
		else v = static_cast<Type>(x % mod());
		if(v < 0) v += mod();
		return v;
	}
	const Type& operator()() const{ return value; }
	template<typename U> explicit operator U() const{ return static_cast<U>(value); }
	constexpr static Type mod(){ return T::value; }
	Z_p &operator+=(const Z_p &otr){ if((value += otr.value) >= mod()) value -= mod(); return *this; }
	Z_p &operator-=(const Z_p &otr){ if((value -= otr.value) < 0) value += mod(); return *this; }
	template<typename U> Z_p &operator+=(const U &otr){ return *this += Z_p(otr); }
	template<typename U> Z_p &operator-=(const U &otr){ return *this -= Z_p(otr); }
	Z_p &operator++(){ return *this += 1; }
	Z_p &operator--(){ return *this -= 1; }
	Z_p operator++(int){ Z_p result(*this); *this += 1; return result; }
	Z_p operator--(int){ Z_p result(*this); *this -= 1; return result; }
	Z_p operator-() const{ return Z_p(-value); }
	template<typename U = T>
	typename enable_if<is_same<typename Z_p<U>::Type, int>::value, Z_p>::type &operator*=(const Z_p& rhs){
		#ifdef _WIN32
		uint64_t x = static_cast<int64_t>(value) * static_cast<int64_t>(rhs.value);
		uint32_t xh = static_cast<uint32_t>(x >> 32), xl = static_cast<uint32_t>(x), d, m;
		asm(
			"divl %4; \n\t"
			: "=a" (d), "=d" (m)
			: "d" (xh), "a" (xl), "r" (mod())
		);
		value = m;
		#else
		value = normalize(static_cast<int64_t>(value) * static_cast<int64_t>(rhs.value));
		#endif
		return *this;
	}
	template<typename U = T>
	typename enable_if<is_same<typename Z_p<U>::Type, int64_t>::value, Z_p>::type &operator*=(const Z_p &rhs){
		uint64_t ret = static_cast<uint64_t>(value) * static_cast<uint64_t>(rhs.value) - static_cast<uint64_t>(mod()) * static_cast<uint64_t>(1.L / static_cast<uint64_t>(mod()) * static_cast<uint64_t>(value) * static_cast<uint64_t>(rhs.value));
		value = normalize(static_cast<int64_t>(ret + static_cast<uint64_t>(mod()) * (ret < 0) - static_cast<uint64_t>(mod()) * (ret >= static_cast<uint64_t>(mod()))));
		return *this;
	}
	template<typename U = T>
	typename enable_if<!is_integral<typename Z_p<U>::Type>::value, Z_p>::type &operator*=(const Z_p &rhs){
		value = normalize(value * rhs.value);
		return *this;
	}
	template<typename U>
	Z_p &operator^=(U e){
		if(e < 0) *this = 1 / *this, e = -e;
		Z_p res = 1;
		for(; e; *this *= *this, e >>= 1) if(e & 1) res *= *this;
		return *this = res;
	}
	template<typename U>
	Z_p operator^(U e) const{
		return Z_p(*this) ^= e;
	}
	Z_p &operator/=(const Z_p &otr){
		Type a = otr.value, m = mod(), u = 0, v = 1;
		if(a < (int)MOD_INV.size()) return *this *= MOD_INV[a];
		while(a){
			Type t = m / a;
			m -= t * a; swap(a, m);
			u -= t * v; swap(u, v);
		}
		assert(m == 1);
		return *this *= u;
	}
	template<typename U> friend const Z_p<U> &abs(const Z_p<U> &v){ return v; }
	Type value;
};
template<typename T> bool operator==(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value == rhs.value; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator==(const Z_p<T>& lhs, U rhs){ return lhs == Z_p<T>(rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator==(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) == rhs; }
template<typename T> bool operator!=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return !(lhs == rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator!=(const Z_p<T> &lhs, U rhs){ return !(lhs == rhs); }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> bool operator!=(U lhs, const Z_p<T> &rhs){ return !(lhs == rhs); }
template<typename T> bool operator<(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value < rhs.value; }
template<typename T> bool operator>(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value > rhs.value; }
template<typename T> bool operator<=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value <= rhs.value; }
template<typename T> bool operator>=(const Z_p<T> &lhs, const Z_p<T> &rhs){ return lhs.value >= rhs.value; }
template<typename T> Z_p<T> operator+(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator+(const Z_p<T> &lhs, U rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator+(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) += rhs; }
template<typename T> Z_p<T> operator-(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator-(const Z_p<T>& lhs, U rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator-(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) -= rhs; }
template<typename T> Z_p<T> operator*(const Z_p<T> &lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator*(const Z_p<T>& lhs, U rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator*(U lhs, const Z_p<T> &rhs){ return Z_p<T>(lhs) *= rhs; }
template<typename T> Z_p<T> operator/(const Z_p<T> &lhs, const Z_p<T> &rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator/(const Z_p<T>& lhs, U rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T, typename U, typename enable_if<is_integral<U>::value>::type* = nullptr> Z_p<T> operator/(U lhs, const Z_p<T> &rhs) { return Z_p<T>(lhs) /= rhs; }
template<typename T> istream &operator>>(istream &in, Z_p<T> &number){
	typename common_type<typename Z_p<T>::Type, int64_t>::type x;
	in >> x;
	number.value = Z_p<T>::normalize(x);
	return in;
}
template<typename T> ostream &operator<<(ostream &out, const Z_p<T> &number){ return out << number(); }

/*
using ModType = int;
struct VarMod{ static ModType value; };
ModType VarMod::value;
ModType &mod = VarMod::value;
using Zp = Z_p<VarMod>;
*/

// constexpr int mod = 1e9 + 7; // 1000000007
constexpr int mod = (119 << 23) + 1; // 998244353
// constexpr int mod = 1e9 + 9; // 1000000009
using Zp = Z_p<integral_constant<decay<decltype(mod)>::type, mod>>;

template<typename T> vector<typename Z_p<T>::Type> Z_p<T>::MOD_INV;
template<typename T = integral_constant<decay<decltype(mod)>::type, mod>>
void precalc_inverse(int SZ){
	auto &inv = Z_p<T>::MOD_INV;
	if(inv.empty()) inv.assign(2, 1);
	for(; inv.size() <= SZ; ) inv.push_back((mod - 1LL * mod / (int)inv.size() * inv[mod % (int)inv.size()]) % mod);
}

template<typename T>
vector<T> precalc_power(T base, int SZ){
	vector<T> res(SZ + 1, 1);
	for(auto i = 1; i <= SZ; ++ i) res[i] = res[i - 1] * base;
	return res;
}

template<typename T>
vector<T> precalc_factorial(int SZ){
	vector<T> res(SZ + 1, 1); res[0] = 1;
	for(auto i = 1; i <= SZ; ++ i) res[i] = res[i - 1] * i;
	return res;
}

const int N = 65;
Zp dp[N][N][N]; // (i, j, k) -> starting with i bits, length j, k used
Zp C[N][N];
Zp save[N], save2[N][N];

int main()
{
    ios::sync_with_stdio(false); cin.tie(0);

    for (int i = 0; i < N; ++i) {
        C[i][0] = 1;
        for (int j = 1; j <= i; ++j)
            C[i][j] = C[i-1][j] + C[i-1][j-1];
    }

    save2[0][0] = 1;
    for (int i = 1; i < N; ++i) {
        dp[i][0][0] = 1;
        save2[i][0] = 1;
        for (int j = 1; j <= i; ++j) {
            for (int k = 1; k <= i; ++k) {
                for (int x = 0; x < k; ++x) {
                    dp[i][j][k] += dp[i][j-1][x] * C[i-x][k-x];
                }
                save[i] += dp[i][j][k];
                save2[i][j] += dp[i][j][k];
            }
        }
    }

    auto gen_freq = [&] (ll n) {
        // compute number of elements in [0, n] which have b bits set, for every b
        array<Zp, N> ans{};
        int add = 0;
        for (int b = 61; b >= 0; --b) {
            if ((n >> b) & 1) {
                // do [0, 2^b)
                for (int i = 0; i <= b; ++i)
                    ans[add+i] += C[b][i];
                
                // update n and add
                n ^= 1ll << b;
                ++add;
            }
        }
        ++ans[add];
        return ans;
    };

    int t; cin >> t;
    while (t--) {
        ll n; cin >> n;

        int k = 0;
        while ((1LL << (k+1)) <= n) ++k;

        Zp ans = 0;
        // 2^k <= n
        // Solve for [0, 2^k) separately
        ans += save[k];
        
        // Then populate popcount frequencies for [0, n - 2^k]
        auto freq = gen_freq(n - (1LL << k));
        for (int b = 0; b <= k; ++b) {
            int rem = k - b;
            
            for (int i = 0; i <= rem; ++i) {
                ans += freq[b] * (i+1) * save2[rem][i];
            }
        }
        cout << ans << '\n';
    }
}
2 Likes

I have a different solution with digit DP which does not require precomputation.
From the editorial we know below conclusions:

  • the length of final array is at most \left\lfloor \log_2 M \right\rfloor + 1
  • each bit only occurs on one element
  • at most one element could equal to M

So we can define DP status as follow:
At bit k, dp[i] is the number of arrays with length i and have one element equal to the current binary prefix of M, and dp2[i] is the number of arrays with length i and have no element equal to the current binary prefix of M.

Traverse from highest bit to lowest bit, we can have below transitions:

  • if the current bit of M is 0.
    If we set no 1 to any element, this will not change the number of element equal to M.
    Also, we can’t set this 1 on the element equal to M, so we have
    dp[i] = dp[i]+dp[i]*(i-1)
    dp2[i]=dp2[i]+dp2[i]*i
  • if the current bit of M is 1.
    we can set 1 on any element, so we have
    dp[i]=dp[i]
    dp2[i]=dp[i]+dp2[i]+dp2[i]*i

Also, we can always add a new element 2^k if the current bit is k, so dp[1]+=1 if k is the most siginificant bit of M, otherwise dp2[i+1]+=dp[i]+dp2[i]

After finishing dp, we can calculate final result as below, because the order of elements could be rearranged freely:
ans=\Sigma_{i=1}^{60}{(dp[i]+dp2[i])}*i!

The time complexity is O(\log^2M) per testcase.