**Author:** Sidhant Bansal

**Tester:** Istvan Nagy

**Editorialist:** Misha Chorniy

# Difficulty:

Medium

# Pre-Requisites:

segment tree or related data structure

# Problem Statement

You are given array $A$ consisting $N$ elements. All the elements in the input are integers. You must be able to make two queries, first one - change the value of the element, second one - is to present the product of elements with indices from l to r by modulo P in the sum of the squares of four non-negative integers.

# Explanation

We need to make some precalculation, for each number between 0 and 1000000 know to calculate his decomposition in the sum of four squares. The key idea in this problem, if we can decompose number $A$ in the sum of four squares and $B$ in the sum of four squares, then $A*B$ can be decomposed in the sum of four squares.

$A*B$ = $(x_{1}^2+x_{2}^2+x_{3}^2+x_{4}^2)*(y_{1}^2+y_{2}^2+y_{3}^2+y_{4}^2)$ =

$(x_{1}*y_{1}+x_{2}*y_{2}+x_{3}*y_{3}+x_{4}*y_{4})^2 + $

$(x_{1}*y_{2}-x_{2}*y_{1}+x_{3}*y_{4}-x_{4}*y_{3})^2 + $

$(x_{1}*y_{3}-x_{2}*y_{4}-x_{3}*y_{1}+x_{4}*y_{2})^2 + $

$(x_{1}*y_{4}-x_{4}*y_{1}+x_{2}*y_{3}-x_{2}*y_{3})^2 $

The second observation is Lagrange's four-square theorem, it's well-known fact in math. Every number can be represented in the sum of four squares, from here there are no $-1$ answers in the problem.# Subtask 1

$P$ is less or equal than $1000$. Let's make precalculation in a next way, iterate with four cycles, where each number less than $sqrt(1000)<32$

```
for i = 0..32
for j = i..32
for k = j..32
for l = k..32
s[i*i+j*j+k*k+l*l]=(i,j,k,l) //t is the array of tuples from 4 numbers
```

When we know this thing, numbers can be multiplied and values of the sum of 4 square squares can be recalculated. Of course, all operations must be processed by modulo $P$. After that queries can be simulated, the first one in $O(1)$ time, the second one in $O(N)$ time. Totally, it takes $O(Q*N+32^4)$ time.

# Subtask 2

Precomputation with four for's as in Subtask 1 get TL for $P=10^6$. How to speed up this one, let's make next thing, find all numbers which can be represented as the sum of 3 square numbers.

```
for i=0..1000
if i*i>1000000
break
for j=i..1000
if i*i+j*j>1000000
break
for k=j..1000
if i*i+j*j+k*k>1000000
break
t[i*i+j*j+k*k]=(i,j,k) //t is the array of tuples from 3 numbers
s[i*i+j*j+k*k]=(i,j,k,0)
```

We will find almost $85$% of representions after this preprocessing. How to find the rest:
```
for i=0..1000000
if s* != null
continue
for j=1..1000
if i - j * j<0
break
if s[i-j * j][3]==0 // If number i - j * j can be represented in a sum of 3 squares
s*=(s[i-j * j][0],s[i-j * j][1],s[i-j * j][2],j)
break
```

Let's estimate total time of work of these two fragments of code, first fragment has three for's each of those to 1000, but $0<=i<=j<=k<=1000$, number of possible triplets (i, j, k) the same thing as $1000 \choose 3$. The second fragment works in time which equal number of numbers which can't be represented as a sum of 3 squares multiply by 1000 less than $2*10^5*1000$ = $2*10^8$, summary both fragments will process in time less than $5*10^8$.

Now we have all decompositions for numbers in the range between $0$ and $10^6$, but we don't know how to speed up processing of the 2nd query. A function of representation in a sum of 4 square is associative and multiplicative, we can use segment tree for solving this problem. In each node $v$ which assigned with interval $l..r$ storing decomposition of the product $A_{l}*A_{l+1}*...*A_{r}$ in the sum of 4 squares by modulo $P$.

Now both queries can be done with complexity $O(log N)$

# Subtask 3

The difference of a subtask 3 from a subtask 2 that $P$ can be up to $10^{12}$. It does problems because 64-bit integers type can store numbers up to $1.8 * 10^{19}$, how to cope with a multiplication of two numbers which can't be stored in the 64-bit integer type, there are several ways:

### First way:

```
mult(a, b, p)
if b == 0
return 0
if b % 2 == 0
a2=mult(a,b/2,p) // Finding a * (b / 2), a * b = a * (b / 2) + a * (b / 2)
return (a2+a2) mod p // (a+...+a)+(a+...+a)
return (a+mult(a,b-1, p)) mod p //a * b = a * (b - 1) + a
```

Complexity of this algorithm is O(log N), unfortunately, this algorithm can't pass TL

### Second way:

In C++ exists types like __int128, __int128_t, Python and Java has libraries for working with Long Arithmetics, but all of it is very slow to get AC.

### Third way:

Divide multipliers into two parts $X=A*2^{20}+B$, $0<=A, B<2^{20}$```
mult(a, b, p)
a0=a % (2^20)
a1=a / (2^20)
b0=b % (2^20)
b1=b / (2^20)
//(a1*(2^20)+a0)*(b1*(2^20)+b0)=a1*b1*(2^40)+(a1*b0+b1*a0)*(2^20)+(a0*b0)
return ((a1*b1)%p*(2^20)%p*(2^20)+(a1*b0+a0*b1)%p*2^20+a0*b0)%p
```

If change % and / into bit operations, correct solution must get AC.
### Fourth way:

To calcuate product in big real types(C++ version):```
//a * b mod p = a * b - [a * b / p] * p
long long mul(long long a, long long b, long long p) {
long long k = (long long)((long double)a * b / p + 0.5);
return a * b - k * p;
}
```

This optimization for multiplying numbers up to $10^{12}$ modulo $10^{12}$ is fastest.

The overall time complexity of this approach is $O(N log N)$ but with huge constant.

# Solution:

Setter's solution can be found here

Tester's solution can be found here

**Please feel free to post comments if anything is not clear to you.**