CURSQURS - Editorial



Author: Ritesh Gupta
Tester: Raja Vardhan Reddy
Editorialist: Akash Bhalotia




Fast Fourier Transform, Bit Manipulation.


Consider a matrix C to be created by two arrays A and B, such that C[i][j]=A[i] \oplus B[j]. For each valid L, find the sum of beauties of all submatrices of dimensions \le L, where the beauty of a submatrix is the xor of all its elements, multiplied by its size, modulo 10^9+7.



C is created using xor on A and B. Thus, whenever the number of rows or columns of a submatrix is even, there will be no contribution of the corresponding array towards the answer. When they are odd, the problem can be reduced to the problem of finding the same, but on a single dimensional array.

This can be solved bit by bit, by computing the prefix xors. Only the subarrays which have an xor of 1 will contribute to the answer. These can be obtained from the prefix array, as they start after a 0 and end at a 1, or vice versa.

However, we want the contributions for subarrays of every size. So, we create two polynomials, one from the prefix array, and one by inverting its coefficients and negating its powers. Now, we just need to match the 1 s. This can be done efficiently by multiplying the two polynomials using FFT. The coefficients of the terms of the product thus obtained shall denote the contribution of the subarrays of length equivalent to the powers.



Let’s first try to solve the problem for a single dimensional array. Suppose our problem is:

Given an array of N elements, for each L, such that 1 \le L \le N, find sum of beauties of all subarrays of size \le L, where the beauty of a subarray is the xor of all its elements, multiplied by the size of the subarray.

For a particular L, we need the sum of beauties of all subarrays of sizes less than L too. So, we can simply find the sum of beauties of subarrays of size L, and add this with the answer for (L-1), to get the answer for L.

Now, how to find the sum of beauties of subarrays of all sizes efficiently? Let’s solve this problem bit by bit. Let’s compute the answers bit by bit, and then add them up. This is possible to do because the operator involved is xor. So, we can compute the answer for the lowest bit, then for the second lowest bit, and so on, then add them as: (ans[0] \times 2^0) + (ans[1] \times 2^1) + (ans[2] \times 2^2) + (ans[3] \times 2^3) and so on. Here, ans[i] signifies the answer for bit at position i, i.e., contribution of bit i to the sum of beauties of all subarrays.

Let’s take an example. Let N=5, and A=\{5,2,9,12,7\}.
Let’s look at their lowest bits. The resulting array will be: Z= \{1,0,1,0,1\}.

So, we need to find the sum of xors of all its subarrays. Let’s see what happens if we consider its prefix xor array.

pref[0]=Z[0]; if i=0
pref[i]=Z[i] xor pref[i-1]; otherwise


Using this, we can easily find the xor of any subarray. Xor of subarray [l,r] =pref[r] \oplus pref[l-1].

Since we need the value of position l-1 for it, we’ll append a 0 to the beginning of the prefix array, to easily get the xor of subarrays starting at position 0. So, our prefix array now is: pref=\{0,1,1,0,0,1\}. Using this, we can easily get the xor of any subarray.

As we are considering only a single bit, only those subarrays will actually contribute to the answer which have an xor value of 1. We can easily find these subarrays from the prefix array: The subarrays which have an xor of 1 either start after a position in the prefix array, which has a prefix value of 1 and end at a position which has a prefix value of 0, or vice versa. So, we can match every 1 in the prefix array, with a 0 in it, and we’ll get the count of the total number of subarrays which contribute to the answer. However, we need the answer for each L. Thus, we need to know the contribution of subarrays of every size.

To efficiently compute the count of the subarrays, we can create an additional array, formed by inverting the bits of the prefix array: \{1,0,0,1,1,0\}. So now, we have two arrays. Let’s call them p_1 and p_2.


Since the bits in p_2 are simply the inverted bits of p_1, we now have to match every 1 in p_1 with every 1 in p_2.

Let’s position p_1 from 0 to 5, left to right.
We’ll position p_2 as the negative of the positions of p_1. So p_2 will be positioned as: 0,-1,-2,-3,-4,-5. Let’s create two polynomials from p_1 and p_2, where the coefficients are the values of the bits in them, and the powers of x are the positions we just defined. Thus,


If we multiply the two polynomials, we get: 2x^1+1x^{-1}+2x^2+2x^{-2}+1x^{-3}+1x^{5}.
The terms in the product, with a particular power of x, actually denote the contribution to the answer by subarrays of the size equivalent to the power. For example, a power of 2 denotes the contribution of subarrays of length 2, which begin after a 0 in the prefix array, and end at a 1, while the terms with power -2 of denote the contribution of subarrays of length 2, which begin after a 1 in the prefix array, and end at a 0.

So, to get the total contribution of all subarrays of length L, we add the coefficients of x with powers L and -L in the product of the two polynomials.

So, the contributions will be: \{3,4,1,0,1\}. However, we still need to multiply them with the lengths of the subarrays, to get the final beauties. So, this will result in: \{3,8,3,0,5\}.

This only represented the contribution of subarrays for the lowest bit. Let’s do the same for other bits:

For bits positioned 1: \{2,6,9,4,0\}
For bits positioned 2: \{3,4,6,0,5\}
For bits positioned 3: \{2,4,3,0,0\}

So, the total contribution of all subarrays of size 1 will be: 3*1+2*2+3*4+2*8=35. This is the answer for L=1.

Similarly, we can find the answers for all subarrays sizes: \{35,68,69,8,25\}.
Answer for a particular L includes the answer for smaller L too. Thus, the final solution will be: \{35,103,172,180,205\}.

Thus, we have reduced the problem to the problem of multiplying two polynomials. We can do this efficiently using Fast Fourier Transform. As the original coefficients only comprise of 0 s and 1 s, the resulting coefficients shall be at most N. So, this is sufficient. Here are some tutorials which may help: [1], [2], [3], [4]. FFT allows multiplication of polynomials with only non-negative powers. To deal with this, we can multiply both the polynomials with x^N, and then proceed as before. This will ensure that there are only non-negative powers of x for multiplication.

Now that we know how to solve the problem for a single dimensional array, let’s see how to solve the original problem. The main observation involved is that the matrix C is actually made up of arrays A and B, using the xor operator. C[i][j]=A[i] \oplus B[j]. Let’s look at the contributions of the submatrices, on a case by case basis.

When the number of rows and the number of columns, both are even:

Here, the contribution will be 0. This is because each A[i] will have an even number of occurrence in the submatrix, and same for each B[j] which made up that submatrix. When a number is xorred with itself, the result is 0. So, submatrices with an even number of rows and columns will have no contribution.

When the number of columns is even, and the number of rows is odd:

Here, array A will have no effect on the contributions of the submatrices, as each A[i] will occur an even number of times in a submatrix. So, we can simply calculate the contribution of each subarray of B normally, as explained in the solution for a single dimensional array. Then, we’ll multiply this with the number of rows to get the total contribution of all submatrices. Let’s consider all 1 \times 2 submatrices. Their contribution will be equal to the contribution of all subarrays of B of size 2 multiplied by N. For 1 \times 4 submatrices, it will equal the contribution of subarrays of B of size 4, multiplied by N. For 3 \times 4 submatrices, it will equal the contributions of subarrays of size 4 of B, multiplied by (N-2), and so on.

When the number of columns is odd, and the number of rows is even:

This case is similar to the previous case. Here, we’ll consider the subarrays of A, and multiply their contributions to the corresponding number of columns.

When the number of rows and the number of columns, both are odd:

Remember that we are solving the problem bit by bit. Considering a particular bit position, let’s again look at the contributions of B for a particular L. Let it be count for a particular L. Let’s first consider it when a single row is involved.

Considering 1 \times L submatrices, let X be the number of elements in A which are 0, and Y be the number of elements in A which are 1. When A[i]=0, it won’t affect the contributions of the subarrays of B. Thus, we’ll add count*X in total for them. If A[i]=1, the subarrays of B of length L, which did not contribute to the answer previously will now do, and those which did, will no longer do so. Thus, they’ll contribute (M-L+1-count) * Y, in total.

But, this was only for submatrices which had a single row. We can do the same for all other odd sized rows too. To do this efficiently, we can compute prefix sums of X's and Y's, using the FFT that we had performed on A. For an X of subarray of size K, Y will simply be (N-K+1-X). After we have this, we can use the value in the prefix sums, to obtain the result for each L.

You can also refer to the setter’s notes here.



Let X denote the maximum element among all A[i] and all B[j].

The time complexity is: O((NlogN+MlogM)*log(X)).
Space required will be O(N+M).


We’ll be performing FFT on N elements of array A. This will take O(NlogN) operations. However, we shall do this for every bit. There are log(X) bits. We’ll also perform FFT on array B, which has M elements, for each bit. Thus, total cost will be O((NlogN+MlogM)*log(X)).


#include <algorithm>
#include <cassert>
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>
using namespace std;
//neal_wu's Iterative FFT:
// This is noticeably faster than std::complex for some reason.
template<typename float_t>
struct complex {
	float_t x, y;
	complex<float_t>(float_t _x = 0, float_t _y = 0) : x(_x), y(_y) {}
	float_t real() const { return x; }
	float_t imag() const { return y; }
	void real(float_t _x) { x = _x; }
	void imag(float_t _y) { y = _y; }
	complex<float_t>& operator+=(const complex<float_t> &other) { x += other.x; y += other.y; return *this; }
	complex<float_t>& operator-=(const complex<float_t> &other) { x -= other.x; y -= other.y; return *this; }
	complex<float_t> operator+(const complex<float_t> &other) const { return complex<float_t>(*this) += other; }
	complex<float_t> operator-(const complex<float_t> &other) const { return complex<float_t>(*this) -= other; }
	complex<float_t> operator*(const complex<float_t> &other) const {
	    return {x * other.x - y * other.y, x * other.y + other.x * y};
	complex<float_t> operator*(float_t mult) const {
	    return {x * mult, y * mult};
	friend complex<float_t> conj(const complex<float_t> &c) {
	    return {c.x, -c.y};
	friend ostream& operator<<(ostream &stream, const complex<float_t> &c) {
	    return stream << '(' << c.x << ", " << c.y << ')';
template<typename float_t>
complex<float_t> polar(float_t magnitude, float_t angle) {
	return {magnitude * cos(angle), magnitude * sin(angle)};
namespace FFT {
	using float_t = double;
	const float_t ONE = 1;
	const float_t PI = acos(-ONE);
	vector<complex<float_t>> roots = {{0, 0}, {1, 0}};
	vector<int> bit_reverse;
	bool is_power_of_two(int n) {
	    return (n & (n - 1)) == 0;
	int round_up_power_two(int n) {
	    while (n & (n - 1))
	        n = (n | (n - 1)) + 1;
	    return max(n, 1);
	// Given n (a power of two), finds k such that n == 1 << k.
	int get_length(int n) {
	    return __builtin_ctz(n);
	// Rearranges the indices to be sorted by lowest bit first, then second lowest, etc., rather than highest bit first.
	// This makes even-odd div-conquer much easier.
	template<typename complex_array>
	void bit_reorder(int n, complex_array &&values) {
	    if ((int) bit_reverse.size() != n) {
	        bit_reverse.assign(n, 0);
	        int length = get_length(n);
	        for (int i = 0; i < n; i++)
	            bit_reverse[i] = (bit_reverse[i >> 1] >> 1) | ((i & 1) << (length - 1));
	    for (int i = 0; i < n; i++)
	        if (i < bit_reverse[i])
	            swap(values[i], values[bit_reverse[i]]);
	void prepare_roots(int n) {
	    if ((int) roots.size() >= n)
	    int length = get_length(roots.size());
	    // The roots array is set up such that for a given power of two n >= 2, roots[n / 2] through roots[n - 1] are
	    // the first half of the n-th roots of unity.
	    while (1 << length < n) {
	        float_t min_angle = 2 * PI / (1 << (length + 1));
	        for (int i = 0; i < 1 << (length - 1); i++) {
	            int index = (1 << (length - 1)) + i;
	            roots[2 * index] = roots[index];
	            roots[2 * index + 1] = polar(ONE, min_angle * (2 * i + 1));
	template<typename complex_array>
	void fft_iterative(int N, complex_array &&values) {
	    bit_reorder(N, values);
	    for (int n = 1; n < N; n *= 2)
	        for (int start = 0; start < N; start += 2 * n)
	            for (int i = 0; i < n; i++) {
	                const complex<float_t> &even = values[start + i];
	                complex<float_t> odd = values[start + n + i] * roots[n + i];
	                values[start + n + i] = even - odd;
	                values[start + i] = even + odd;
	inline complex<float_t> extract(int N, const vector<complex<float_t>> &values, int index, int side) {
	    if (side == -1) {
	        // Return the product of 0 and 1.
	        int other = (N - index) & (N - 1);
	        return (conj(values[other] * values[other]) - values[index] * values[index]) * complex<float_t>(0, 0.25);
	    int other = (N - index) & (N - 1);
	    int sign = side == 0 ? +1 : -1;
	    complex<float_t> multiplier = side == 0 ? complex<float_t>(0.5, 0) : complex<float_t>(0, -0.5);
	    return multiplier * complex<float_t>(values[index].real() + values[other].real() * sign,
	                                         values[index].imag() - values[other].imag() * sign);
	void invert_fft(int N, vector<complex<float_t>> &values) {
	    assert(N >= 2);
	    for (int i = 0; i < N; i++)
	        values[i] = conj(values[i]) * (ONE / N);
	    for (int i = 0; i < N / 2; i++) {
	        complex<float_t> first = values[i] + values[N / 2 + i];
	        complex<float_t> second = (values[i] - values[N / 2 + i]) * roots[N / 2 + i];
	        values[i] = first + second * complex<float_t>(0, 1);
	    fft_iterative(N / 2, values);
	    for (int i = N - 1; i >= 0; i--)
	        values[i] = i % 2 == 0 ? values[i / 2].real() : values[i / 2].imag();
	const int FFT_CUTOFF = 150;
	template<typename T_out, typename T_in>
	vector<T_out> multiply(const vector<T_in> &left, const vector<T_in> &right) {
	    if (left.empty() || right.empty())
	        return {0};
	    int n = left.size();
	    int m = right.size();
	    if (min(n, m) < FFT_CUTOFF) {
	        vector<T_out> result(n + m - 1, 0);
	        for (int i = 0; i < n; i++)
	            for (int j = 0; j < m; j++)
	                result[i + j] += (T_out) left[i] * right[j];
	        return result;
	    int N = round_up_power_two(n + m - 1);
	    vector<complex<float_t>> values(N, 0);
	    for (int i = 0; i < n; i++)
	    for (int i = 0; i < m; i++)
	    fft_iterative(N, values);
	    for (int i = 0; i <= N / 2; i++) {
	        int j = (N - i) & (N - 1);
	        complex<float_t> product_i = extract(N, values, i, -1);
	        values[i] = product_i;
	        values[j] = conj(product_i);
	    invert_fft(N, values);
	    vector<T_out> result(n + m - 1, 0);
	    for (int i = 0; i < (int) result.size(); i++)
	        result[i] = is_integral<T_out>::value ? round(values[i].real()) : values[i].real();
	    return result;
char input[] = "input/input00.txt";
char output[] = "output_solution.txt";
int a[100010],b[100010],ans[100010];
vector<int> A,B;
int N,M,curr;
#define MOD 1000000007
inline int add(int val1, int val2)
	int64_t val = val1;
	val += val2;
	return val%MOD;
inline int mul(int val1, int val2)
	int64_t val = val1;
	val *= val2;
	return val%MOD;
void solve_fft(vector <int> &left, vector <int> &answer) 
	auto &&multiply_and_add = [&](vector<int> &left, vector<int> &right) {
	    vector<int> product = FFT::multiply<int>(left, right);
	    for(int i=1;i<left.size();i++)
	        answer[i] = (product[left.size()-i-1] + product[left.size()+i-1]);
	vector <int> right = left;
	for(int &i:right)
	    i ^= 1;
	reverse(right.begin(), right.end());
	multiply_and_add(left, right);
void solve()
	int n = max(N,M)+1;
	vector<int> A_xor_1(n,0);
	vector<int> B_xor_1(n,0);
	solve_fft(A, A_xor_1);
	solve_fft(B, B_xor_1);
	vector<int> A_xor_0(n,0);
	vector<int> B_xor_0(n,0);
	for(int i=1;i<n;i++)
	    A_xor_0[i] = (A.size()-i-A_xor_1[i]);
	    B_xor_0[i] = (B.size()-i-B_xor_1[i]);
	    if(i >= A.size())
	        A_xor_0[i] = A_xor_1[i] = 0;
	    if(i >= B.size())
	        B_xor_0[i] = B_xor_1[i] = 0;
	    A_xor_0[i] = mul(A_xor_0[i], i);
	    A_xor_1[i] = mul(A_xor_1[i], i);
	    B_xor_0[i] = mul(B_xor_0[i], i);
	    B_xor_1[i] = mul(B_xor_1[i], i);
	    if(i > 1)
	        A_xor_0[i] = add(A_xor_0[i], A_xor_0[i-2]);
	        A_xor_1[i] = add(A_xor_1[i], A_xor_1[i-2]);
	        B_xor_0[i] = add(B_xor_0[i], B_xor_0[i-2]);
	        B_xor_1[i] = add(B_xor_1[i], B_xor_1[i-2]);
	    int cnt = mul(A_xor_1[i - i%2], add(B_xor_0[i - (i-1)%2], B_xor_1[i - (i-1)%2]));
	    cnt = add(cnt, mul(B_xor_1[i - i%2], add(A_xor_0[i - (i-1)%2], A_xor_1[i - (i-1)%2])));
	    cnt = add(cnt, mul(B_xor_1[i - (i-1)%2], A_xor_0[i - (i-1)%2]));
	    cnt = add(cnt, mul(B_xor_0[i - (i-1)%2], A_xor_1[i - (i-1)%2]));
	    cnt = mul(cnt, curr);
	    ans[i] = add(ans[i], cnt);
int32_t main()
	//freopen(input, "r", stdin);
	//freopen(output, "w", stdout);
	int t;
	cin >> t;
	    cin >> N >> M;
	    for(int i=1;i<=max(N,M);i++)
	        ans[i] = 0;
	    A.resize(N+1, 0);
	    B.resize(M+1, 0);
	    for(int i=0;i<N;i++)
	        cin >> a[i];
	    for(int i=1;i<N;i++)
	        a[i] ^= a[i-1];
	    for(int i=0;i<M;i++)
	        cin >> b[i];
	    for(int i=1;i<M;i++)
	        b[i] ^= b[i-1];
	    curr = 1;
	    while(curr < 1e6)
	        A[0] = B[0] = 0;
	        for(int i=0;i<N;i++)
	            A[i+1] = (a[i]&1);
	            a[i] >>= 1;
	        for(int i=0;i<M;i++)
	            B[i+1] = (b[i]&1);
	            b[i] >>= 1;
	        curr <<= 1;
	    for(int i=1;i<=max(N,M);i++)
	        assert(0 <= ans[i] && ans[i] < MOD);
	        cout << ans[i] << " ";
	    cout << endl;
	return 0;
//#pragma comment(linker, "/stack:200000000")
//#pragma GCC optimize("Ofast")
//#pragma GCC target("sse,sse2,sse3,ssse3,sse4,avx,avx2")
#include <bits/stdc++.h>
#include <vector>
#include <set>
#include <map>
#include <string> 
#include <cstdio>
#include <cstdlib>
#include <climits>
#include <utility>
#include <algorithm>
#include <cmath>
#include <queue>
#include <stack>
#include <iomanip> 
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp> 
//setbase - cout << setbase (16)a; cout << 100 << endl; Prints 64
//setfill -   cout << setfill ('x') << setw (5); cout << 77 <<endl;prints xxx77
//setprecision - cout << setprecision (14) << f << endl; Prints x.xxxx
//cout.precision(x)  cout<<fixed<<val;  // prints x digits after decimal in val
using namespace std;
using namespace __gnu_pbds;
#define f(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) f(i,0,n)
#define fd(i,a,b) for(i=a;i>=b;i--)
#define pb push_back
#define mp make_pair
#define vi vector< int >
#define vl vector< ll >
#define ss second
#define ff first
#define ll long long
#define pii pair< int,int >
#define pll pair< ll,ll >
#define sz(a) a.size()
#define inf (1000*1000*1000+5)
#define all(a) a.begin(),a.end()
#define tri pair<int,pii>
#define vii vector<pii>
#define vll vector<pll>
#define viii vector<tri>
#define mod (1000*1000*1000+7)
#define pqueue priority_queue< int >
#define pdqueue priority_queue< int,vi ,greater< int > >
#define int ll
typedef tree<
ll addmod(ll a,ll b){
	return a;
ll submod(ll a,ll b){
	return a;
ll mulmod(ll a,ll b){
	return (a*b)%mod;
const long double pi = 4.0 * atan(1.0);
//const int N = 2e5 + 5;
int P = (1<<19);
int K = 19;
struct base{
	double x , y;
	    x = 0;
	    y = 0;
	base(double a){
	    x = a;
	    y = 0;
	base(double a , double b){
	    x = a;
	    y = b;
	base conj(){
	    return base(x , -y);
	base operator = (const base &o){
	    x = o.x;
	    y = o.y;
	    return *this;
	base operator + (const base &o) const {
	    return base(x + o.x , y + o.y);
	base operator - (const base &o) const {
	    return base(x - o.x , y - o.y);
	base operator * (const base &o) const {
	    return base(x * o.x - y * o.y , y * o.x + x * o.y);
	base operator * (const double num) const {
	    return base(x * num , y * num);
	base operator / (const double num) const {
	    return base(x / num , y / num);
// initializations
int rev[(1<<19)+3];
ll res[(1<<19)+3]; 
void pre(){
	int res;
	for(int i = 0 ; i < P ; ++i){
	    res = 0;
	    for(int j = 0 ; j < K ; ++j){
	        if((i >> j) & 1){
	            res |= 1 << (K - j - 1);
	    rev[i] = res;
// fft works 
void fft(base a[] , bool inv){
	for(int i = 0 ; i < P ; ++i){
	    if(rev[i] > i){
	        swap(a[i] , a[rev[i]]);
	for(int size = 2 ; size <= P ; size <<= 1){
	    int m = size >> 1;
	    base w(cos(2 * pi / size) , sin(2 * pi / size) * (inv ? -1 : 1));
	    for(int i = 0 ; i < P ; i += size){
	        base cur(1 , 0);
	        for(int j = 0 ; j < m ; ++j){
	            base v = a[i + j];
	            base u = a[i + j + m] * cur;
	            a[i + j] = v + u;
	            a[i + j + m] = v - u;
	            cur = cur * w;
// call this to multiply two polynomials result is stored in res
void multiply (base a[],base b[]) {
	fft (a, false),  fft (b, false);
	for (int i=0; i<P; ++i){
		a[i] =a[i]*b[i];
	fft (a, true);
	for (int i=0; i<P; ++i){
		res[i] = (ll)((a[i].x)/P + 0.5);
base p[(1<<19)+3],q[(1<<19)+3];
int coef[2][(1<<19)+3],pref[2][(1<<19)+3],a[2][(1<<19)+3],gg[2][(1<<19)+3],sum[2][(1<<19)+3],odd[(1<<19)+3],mulodd[(1<<19)+3],haha[2][(1<<19)+3],lol[2][(1<<19)+3],ans[(1<<19)+3],result[(1<<19)+3];
int get_even(int fl,int n,int m,int l){
		return 0;
	int x,val,val1;
	return val;
int get_odd(int fl,int n,int m,int l,int fl1){
		return 0;
	int x,y,val,val1;
	return val;
void solve(int bit,int fl,int n){
	int i,j;
	int t,i;
		int n,m,val,qq,l,i,j,fl,id,val1;
		scanf("%lld %lld",&n,&m);
		K =id;
		P = (1<<id);
			printf("%lld ",result[l]);
	return 0;

Feel free to share your approach if it differs. You can ask your doubts below. Please let me know if something’s unclear. I would LOVE to hear suggestions :slight_smile: