Author: Sidhant Bansal Difficulty:Medium PreRequisites:segment tree or related data structure Problem StatementYou are given array $A$ consisting $N$ elements. All the elements in the input are integers. You must be able to make two queries, first one  change the value of the element, second one  is to present the product of elements with indices from l to r by modulo P in the sum of the squares of four nonnegative integers. ExplanationWe need to make some precalculation, for each number between 0 and 1000000 know to calculate his decomposition in the sum of four squares. The key idea in this problem, if we can decompose number $A$ in the sum of four squares and $B$ in the sum of four squares, then $A*B$ can be decomposed in the sum of four squares. $A*B$ = $(x_{1}^2+x_{2}^2+x_{3}^2+x_{4}^2)*(y_{1}^2+y_{2}^2+y_{3}^2+y_{4}^2)$ = $(x_{1}*y_{1}+x_{2}*y_{2}+x_{3}*y_{3}+x_{4}*y_{4})^2 + $ $(x_{1}*y_{2}x_{2}*y_{1}+x_{3}*y_{4}x_{4}*y_{3})^2 + $ $(x_{1}*y_{3}x_{2}*y_{4}x_{3}*y_{1}+x_{4}*y_{2})^2 + $ $(x_{1}*y_{4}x_{4}*y_{1}+x_{2}*y_{3}x_{2}*y_{3})^2 $ The second observation is Lagrange's foursquare theorem, it's wellknown fact in math. Every number can be represented in the sum of four squares, from here there are no $1$ answers in the problem.Subtask 1$P$ is less or equal than $1000$. Let's make precalculation in a next way, iterate with four cycles, where each number less than $sqrt(1000)<32$
When we know this thing, numbers can be multiplied and values of the sum of 4 square squares can be recalculated. Of course, all operations must be processed by modulo $P$. After that queries can be simulated, the first one in $O(1)$ time, the second one in $O(N)$ time. Totally, it takes $O(Q*N+32^4)$ time. Subtask 2Precomputation with four for's as in Subtask 1 get TL for $P=10^6$. How to speed up this one, let's make next thing, find all numbers which can be represented as the sum of 3 square numbers.
for i=0..1000
if i i>1000000
break
for j=i..1000
if ii+jj>1000000
break
for k=j..1000
if ii+jj+kk>1000000
break
t[ii+jj+kk]=(i,j,k) //t is the array of tuples from 3 numbers
s[ii+jj+kk]=(i,j,k,0)
We will find almost $85$% of representions after this preprocessing. How to find the rest:
Let's estimate total time of work of these two fragments of code, first fragment has three for's each of those to 1000, but $0<=i<=j<=k<=1000$, number of possible triplets (i, j, k) the same thing as $1000 \choose 3$. The second fragment works in time which equal number of numbers which can't be represented as a sum of 3 squares multiply by 1000 less than $2*10^5*1000$ = $2*10^8$, summary both fragments will process in time less than $5*10^8$. Now we have all decompositions for numbers in the range between $0$ and $10^6$, but we don't know how to speed up processing of the 2nd query. A function of representation in a sum of 4 square is associative and multiplicative, we can use segment tree for solving this problem. In each node $v$ which assigned with interval $l..r$ storing decomposition of the product $A_{l}*A_{l+1}*...*A_{r}$ in the sum of 4 squares by modulo $P$. Now both queries can be done with complexity $O(log N)$ Subtask 3The difference of a subtask 3 from a subtask 2 that $P$ can be up to $10^{12}$. It does problems because 64bit integers type can store numbers up to $1.8 * 10^{19}$, how to cope with a multiplication of two numbers which can't be stored in the 64bit integer type, there are several ways: First way:
mult(a, b, p)
if b == 0
return 0
if b % 2 == 0
a2=mult(a,b/2,p) // Finding a * (b / 2), a * b = a * (b / 2) + a * (b / 2)
return (a2+a2) mod p // (a+...+a)+(a+...+a)
return (a+mult(a,b1, p)) mod p //a * b = a * (b  1) + a
Complexity of this algorithm is O(log N), unfortunately, this algorithm can't pass TL Second way:In C++ exists types like int128, int128_t, Python and Java has libraries for working with Long Arithmetics, but all of it is very slow to get AC. Third way:Divide multipliers into two parts $X=A*2^{20}+B$, $0<=A, B<2^{20}$
mult(a, b, p)
a0=a % (2^20)
a1=a / (2^20)
b0=b % (2^20)
b1=b / (2^20)
//(a1 (2^20)+a0)(b1(2^20)+b0)=a1b1(2^40)+(a1b0+b1a0)(2^20)+(a0b0)
return ((a1b1)%p(2^20)%p(2^20)+(a1b0+a0b1)%p2^20+a0*b0)%p
If change % and / into bit operations, correct solution must get AC.
Fourth way:To calcuate product in big real types(C++ version):
//a * b mod p = a * b  [a * b / p] * p
long long mul(long long a, long long b, long long p) {
long long k = (long long)((long double)a * b / p + 0.5);
return a * b  k * p;
}
This optimization for multiplying numbers up to $10^{12}$ modulo $10^{12}$ is fastest. The overall time complexity of this approach is $O(N log N)$ but with huge constant. Solution:Setter's solution can be found here Please feel free to post comments if anything is not clear to you.
This question is marked "community wiki".
asked 18 Jan, 06:38

yes! @manjunath1996 It will overflow! So we have to add another factor that will make the result in correct form. See my code below! It is correct over all the range upto 10^12
answered 19 Jan, 16:50

Was I the only one who did all the computation for four squares of each number online and still managed to get 100 points? answered 18 Jan, 19:57

@suraj021, here's what i did.... 1) each number except of the form 4^m(8k+7) can be expressed as sum of 3 square number. Therefore I try to find a square number close to the multiplication that, when subtracted from it, will give me a number that satisifies this condition. Like if 107 is the given number, closest square number is 100. But 107100=7 is a number violating the condition. So I take the next closest square 81, 10781=26, which is not of the mentioned format. So I will take my first square number as sqrt(81) = 9. I haven't proved it but this should not take more than 2 or 3 iterations. 2) After getting such a number, I try to form a three square representation. (in the example, 26) Its a very naive recursive method. I try to create the number by taking closest square numbers iteratively. I kept a dp map to avoid regenerating the same numbers.... I found out that most numbers can be formed on the very first attempt (by greedily taking the closest square numbers, like 26 for example: take closest square 25 and you get next two numbers 1 and 0 easily). However there are some numbers which may take upto quite a lot of iterations. But I had a feeling there wouldn't be too many. And my hunch was proved when I got AC. answered 19 Jan, 00:27
But tell me one thing that how did you manage to deduce the number in this form 4^a (8b+7)?
(19 Jan, 01:14)
It's just Legendre's threesquare theorem.
(19 Jan, 01:30)
But i want to know about factoring part! How did you factor the numbers in this three square theorem!
(19 Jan, 17:58)

Sorry I can't get one thing. In the 4th way of finding modulo we have returned ( ab  kp ) But what would happen if a was like 10^11 , b = 10^11 and p = 10^12. Then won't a*b cause overflow for long long int ? :/ The 3rd way is really elegant :) answered 18 Jan, 16:23
2
Yes it will overflow, and that is intentional. I found it on Wiki and it's really neat. It uses the fact that floatingpoint multiplication results in the most significant bits of the product kept, while integer multiplication results in the least significant bits kept. Notice that both (a * b) and (k * p) will overflow, however because the difference between them is (a * b) % p, subtracting them will still give us (a * b) % p. It's like two wrongs making a right :)
(18 Jan, 17:16)

What is zij*j == 0 in the second fragment of the code ?? does that , mean jj is a multiple of i ? answered 18 Jan, 18:44

Setter's solution gets 10 points.Not expected!!! ;) answered 18 Jan, 19:11

I used the same algorithms as discussed here, but,i don't have any idea why am i getting WA in 2 test cases 1 apiece in both the subtasks :(. Please can anyone help me? answered 19 Jan, 01:02

I have doubt in fourth way of multipllication.if a and b are long long types and suppose a=10^11 ,b=10^11 and p=10^12 then how can we directly multiply ab as given in statement return a * b  k * p; It should over flow there.Can any one clear my doubt?? answered 19 Jan, 16:23
