 # FOURSQ - Editorial

Practice

Contest

Author: Sidhant Bansal

Tester: Istvan Nagy

Editorialist: Misha Chorniy

Medium

# Pre-Requisites:

segment tree or related data structure

## Problem Statement

You are given array A consisting N elements. All the elements in the input are integers. You must be able to make two queries, first one - change the value of the element, second one - is to
present the product of elements with indices from l to r by modulo P in the sum of the squares of four non-negative integers.

## Explanation

We need to make some precalculation, for each number between 0 and 1000000 know to calculate his decomposition in the sum of four squares. The key idea in this problem, if we can decompose number A in the sum of four squares and B in the sum of four squares, then A*B can be decomposed in the sum of four squares.

A*B = (x_{1}^2+x_{2}^2+x_{3}^2+x_{4}^2)*(y_{1}^2+y_{2}^2+y_{3}^2+y_{4}^2) =

(x_{1}*y_{1}+x_{2}*y_{2}+x_{3}*y_{3}+x_{4}*y_{4})^2 +

(x_{1}*y_{2}-x_{2}*y_{1}+x_{3}*y_{4}-x_{4}*y_{3})^2 +

(x_{1}*y_{3}-x_{2}*y_{4}-x_{3}*y_{1}+x_{4}*y_{2})^2 +

(x_{1}*y_{4}-x_{4}*y_{1}+x_{2}*y_{3}-x_{2}*y_{3})^2

The second observation is Lagrange’s four-square theorem, it’s well-known fact in math. Every number can be represented in the sum of four squares, from here there are no -1 answers in the problem.

## Subtask 1

P is less or equal than 1000. Let’s make precalculation in a next way, iterate with four cycles, where each number less than sqrt(1000)<32

``````
for i = 0..32
for j = i..32
for k = j..32
for l = k..32
s[i*i+j*j+k*k+l*l]=(i,j,k,l)  //t is the array of tuples from 4 numbers

``````

When we know this thing, numbers can be multiplied and values of the sum of 4 square squares can be recalculated. Of course, all operations must be processed by modulo P. After that queries can be simulated, the first one in O(1) time, the second one in O(N) time. Totally, it takes O(Q*N+32^4) time.

## Subtask 2

Precomputation with four for’s as in Subtask 1 get TL for P=10^6. How to speed up this one, let’s make next thing, find all numbers which can be represented as the sum of 3 square numbers.

``````
for i=0..1000
if i*i>1000000
break
for j=i..1000
if i*i+j*j>1000000
break
for k=j..1000
if i*i+j*j+k*k>1000000
break
t[i*i+j*j+k*k]=(i,j,k)  //t is the array of tuples from 3 numbers
s[i*i+j*j+k*k]=(i,j,k,0)

``````

We will find almost 85% of representions after this preprocessing. How to find the rest:

``````
for i=0..1000000
if s[i] != null
continue
for j=1..1000
if i - j * j<0
break
if s[i-j * j]==0   // If number i - j * j can be represented in a sum of 3 squares
s[i]=(s[i-j * j],s[i-j * j],s[i-j * j],j)
break

``````

Let’s estimate total time of work of these two fragments of code, first fragment has three for’s each of those to 1000, but 0<=i<=j<=k<=1000, number of possible triplets (i, j, k) the same thing as 1000 \choose 3. The second fragment works in time which equal number of numbers which can’t be represented as a sum of 3 squares multiply by 1000 less than 2*10^5*1000 = 2*10^8, summary both fragments will process in time less than 5*10^8.

Now we have all decompositions for numbers in the range between 0 and 10^6, but we don’t know how to speed up processing of the 2nd query. A function of representation in a sum of 4 square is associative and multiplicative, we can use segment tree for solving this problem. In each node v which assigned with interval l..r storing decomposition of the product A_{l}*A_{l+1}*...*A_{r} in the sum of 4 squares by modulo P.

Now both queries can be done with complexity O(log N)

## Subtask 3

The difference of a subtask 3 from a subtask 2 that P can be up to 10^{12}. It does problems because 64-bit integers type can store numbers up to 1.8 * 10^{19}, how to cope with a multiplication of two numbers which can’t be stored in the 64-bit integer type, there are several ways:

### First way:

``````
mult(a, b, p)
if b == 0
return 0
if b % 2 == 0
a2=mult(a,b/2,p)     // Finding a * (b / 2), a * b = a * (b / 2) + a * (b / 2)
return (a2+a2) mod p // (a+...+a)+(a+...+a)
return (a+mult(a,b-1, p)) mod p //a * b = a * (b - 1) + a

``````

Complexity of this algorithm is O(log N), unfortunately, this algorithm can’t pass TL

### Second way:

In C++ exists types like __int128, __int128_t, Python and Java has libraries for working with Long Arithmetics, but all of it is very slow to get AC.

### Third way:

Divide multipliers into two parts \$X=A*2^{20}+B\$, \$0<=A, B<2^{20}\$
``````
mult(a, b, p)
a0=a % (2^20)
a1=a / (2^20)
b0=b % (2^20)
b1=b / (2^20)
//(a1*(2^20)+a0)*(b1*(2^20)+b0)=a1*b1*(2^40)+(a1*b0+b1*a0)*(2^20)+(a0*b0)
return ((a1*b1)%p*(2^20)%p*(2^20)+(a1*b0+a0*b1)%p*2^20+a0*b0)%p

``````

If change % and / into bit operations, correct solution must get AC.

### Fourth way:

To calcuate product in big real types(C++ version):
``````
//a * b mod p = a * b - [a * b / p] * p
long long mul(long long a, long long b, long long p) {
long long k = (long long)((long double)a * b / p + 0.5);
return a * b - k * p;
}

``````

This optimization for multiplying numbers up to 10^{12} modulo 10^{12} is fastest.

The overall time complexity of this approach is O(N log N) but with huge constant.

## Solution:

Setter’s solution can be found here

Tester’s solution can be found here

Please feel free to post comments if anything is not clear to you.

1 Like

Sorry I can’t get one thing. In the 4th way of finding modulo we have returned ( ab - kp )

But what would happen if a was like 10^11 , b = 10^11 and p = 10^12. Then won’t a*b cause overflow for long long int ? The 3rd way is really elegant What is zi-j**j == 0 in the second fragment of the code ??
does that , mean j*j is a multiple of i ?

Setter’s solution gets 10 points.Not expected!!! Was I the only one who did all the computation for four squares of each number online and still managed to get 100 points?

1 Like

@ridslk22 How did you manage to do it online?

Here is a fifth way which seemed most intuitive to me. Represent each number as x*10^3 + y.

Why? Because x will be less than 10^9 and you can still compute the product without overflow.

@suraj021, here’s what i did…

1. each number except of the form 4^m(8k+7) can be expressed as sum of 3 square number. Therefore I try to find a square number close to the multiplication that, when subtracted from it, will give me a number that satisifies this condition. Like if 107 is the given number, closest square number is 100. But 107-100=7 is a number violating the condition. So I take the next closest square 81, 107-81=26, which is not of the mentioned format. So I will take my first square number as sqrt(81) = 9. I haven’t proved it but this should not take more than 2 or 3 iterations.

2. After getting such a number, I try to form a three square representation. (in the example, 26) Its a very naive recursive method. I try to create the number by taking closest square numbers iteratively. I kept a dp map to avoid regenerating the same numbers… I found out that most numbers can be formed on the very first attempt (by greedily taking the closest square numbers, like 26 for example: take closest square 25 and you get next two numbers 1 and 0 easily). However there are some numbers which may take upto quite a lot of iterations. But I had a feeling there wouldn’t be too many. And my hunch was proved when I got AC.

1 Like

I used the same algorithms as discussed here, but,i don’t have any idea why am i getting WA in 2 test cases 1 apiece in both the subtasks :(.
Please can anyone help me?

I have doubt in fourth way of multipllication.if a and b are long long types and suppose a=10^11 ,b=10^11 and p=10^12 then how can we directly multiply a*b as given in statement *return a * b - k * p; It should over flow there.Can any one clear my doubt??

yes! @manjunath1996 It will overflow! So we have to add another factor that will make the result in correct form. See my code below! It is correct over all the range upto 10^12

``````long long mul(long long a, long long b, long long p) {
long long k = (long long)((long double)a * b / p + 0.5);
long long ans=a * b - k * p;
while(ans<0)ans+=p;
return ans;
}``````
2 Likes

Yes it will overflow, and that is intentional. I found it on Wiki and it’s really neat. It uses the fact that floating-point multiplication results in the most significant bits of the product kept, while integer multiplication results in the least significant bits kept. Notice that both (a * b) and (k * p) will overflow, however because the difference between them is (a * b) % p, subtracting them will still give us (a * b) % p. It’s like two wrongs making a right 2 Likes

It’s problem with TEX formatting. Fixed, thanks!

But tell me one thing that how did you manage to deduce the number in this form 4^a (8b+7)?

It’s just Legendre’s [three-square theorem].
: https://en.wikipedia.org/wiki/Legendre’s_three-square_theorem

But i want to know about factoring part! How did you factor the numbers in this three square theorem!

This is pretty much the same as the 3rd way of representing each number as `x*2^20 + y`. However multiplication by `2^20` can be done by bit shifting, but that is not so with `10^3`, so this is a slower method.