PROBLEM LINK:

Author: Shchetinin Kirill

DIFFICULTY: Super Hard

PREREQUISITES: Spectral graph theory, fast Fourier transform, linear algebra

PROBLEM: The problem can be restated in terms of strong products of graphs. You are given $T$ undirected graphs, where the $k$th graph has $N_k$ vertices (with no loops or multiple edges). You are also given $Q$ queries. In each query, you are given three numbers $L_i$, $R_i$, $K_i$, and your task is to find out the number of closed paths of length at most $K_i$ in the strong product of graphs numbered $L_i , L_i + 1 , \ldots , R_i$, modulo $1000000007$. The strong product of two graphs $G_1 = \langle V_1,E_1\rangle$ and $G_2 = \langle V_2,E_2\rangle$, denoted $G_1\times G_2$, is defined as the graph with the vertex set $V_1\times V_2$ (pairs of nodes, one from $G_1$ and another from $G_2$), and in which there is an edge $((u_1, u_2), (v_1, v_2))$ iff one of the following is true:

- $u_1 = v_1$ and $(u_2, v_2) \in E_2$,
- $(u_1, v_1) \in E_1$ and $u_2 = v_2$, or
- $(u_1, v_1) \in E_1$ and $(u_2, v_2) \in E_2$.
The strong product is commutative and associative up to isomorphism. (Note that "$F\times G$" is not the standard notation for strong products (the usual symbol is $\boxtimes$), but since we're not dealing with any other type of graph products, we'll use it here.)

QUICK EXPLANATION:

The number of closed paths of length $k$ in a graph $G$ is $\text{tr}[G^k]$, the sum of the $k$th powers of $G$'s eigenvalues. The spectrum of a strong product factors nicely, so the binomial theorem expresses $\text{tr}[(G_l \times \cdots \times G_r)^k]$ through the products $\text{tr}[(G_l + I)^i] \cdots \text{tr}[(G_r + I)^i]$. These traces are computed quickly for each graph using Newton's identities and the Cayley–Hamilton theorem, and the binomial sums for all $k$ are evaluated at once as a single polynomial multiplication (FFT/Karatsuba) per $(l,r)$ pair.
EXPLANATION:

An obvious starting point in solving the problem is understanding how to find the number of closed paths in a graph. Clearly this is a minimum requirement to solve the problem. :) Since we will be focusing on adjacency matrices a lot, for any given graph $G$ we also denote its adjacency matrix by $G$. Which one we're referring to is usually clear from the context. Thus, $G_{i,j}$ means $1$ if $i$ and $j$ are connected, and $0$ otherwise.

Counting closed paths

Let's focus on some graph $G = \langle V,E \rangle$. Suppose we want to compute the number of closed paths of length $k$. Let's generalize the problem a bit: let's compute, for all pairs of nodes $(i,j)$, the number of paths from $i$ to $j$ of length $k$, which we'll denote by $P(k) _ {i,j}$. The number of closed paths is simply $\displaystyle \sum _ {i\in V} P(k) _ {i,i}$. Let's start with the base case: $k = 0$. In this case, we trivially have $P(0) _ {i,i} = 1$ (namely the empty path), and $P(0) _ {i,j} = 0$ if $i \not= j$. Okay, now let's assume $k > 0$. Since the last node is $j$, the node before that could have been any node $t$ that is a neighbor of $j$. Thus, we have the following recurrence: $$P(k) _ {i,j} = \sum _ {\text{$t$ is a neighbor of $j$}} P(k-1) _ {i,t}$$ But we can rewrite the above recurrence in another way, by considering the adjacency matrix of $G$. We have: $$P(k) _ {i,j} = \sum _ {t\in V} P(k-1) _ {i,t}G _ {t,j}$$ If we treat each $P(k)$ as a matrix (of dimension $|V| \times |V|$), then in fact the above base case and recurrence are the same as the following: $$P(0) = I_{|V|}$$ $$P(k) = P(k-1)\cdot G$$ This means that $P(k)$ is simply $G^k$ :) Now back to the original problem of this section. We are looking for: $$\sum _ {i\in V} P(k) _ {i,i} = \sum _ {i\in V} (G^k) _ {i,i}$$ This is simply the sum of the diagonal entries of the matrix $P(k) = G^k$. The sum of the diagonal elements of a matrix $M$ is called the trace of $M$, denoted $\text{tr}[M]$.
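In code, this gives a direct sanity check (an illustrative Python sketch, not the author's solution): brute-force counting of closed walks of length $k$ agrees with $\text{tr}[G^k]$ on the triangle graph.

```python
from itertools import product

def mat_mul(A, B):
    """Multiply two square integer matrices."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, k):
    """Naive k-th power of A (identity matrix for k = 0)."""
    n = len(A)
    R = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = mat_mul(R, A)
    return R

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def closed_walks_bruteforce(A, k):
    """Count closed walks of length k by checking every cyclic vertex sequence."""
    n = len(A)
    if k == 0:
        return n  # one empty walk per vertex
    return sum(all(A[p[i]][p[(i + 1) % k]] for i in range(k))
               for p in product(range(n), repeat=k))

K3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # triangle graph
for k in range(5):
    assert trace(mat_pow(K3, k)) == closed_walks_bruteforce(K3, k)
```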
The conclusion is that the number of closed paths of length $k$ in a graph with adjacency matrix $G$ is simply $\text{tr}[G^k]$ :)

Spectrum, and trace of powers

With this in mind, let's understand a bit of what $\text{tr}[A]$ means (aside from being the sum of the diagonal entries of $A$). We use a bit of linear algebra. Remember that multiplying by a matrix is equivalent to performing a linear transformation. A nonzero (column) vector $v$ is called an eigenvector of $A$ if multiplying $A$ by $v$ doesn't change the direction of $v$, i.e. there exists a scalar $\lambda$ such that $Av = \lambda v$. In this case, the value $\lambda$ is called the eigenvalue associated with the eigenvector $v$. Let's try to compute the eigenvalues of some $N\times N$ matrix $A$. Notice that an eigenvector $v$ satisfies (for some $\lambda$): $$\begin{align*} Av &= \lambda v \\\ \lambda v - Av &= 0 \\\ (\lambda I - A)v &= 0 \end{align*}$$ where $I$ is an identity matrix with the same dimensions as $A$. It is a basic result that the equation $Mv = 0$ has nonzero solutions $v$ if and only if the determinant of $M$ is zero. Therefore, a number $\lambda$ is an eigenvalue of $A$ if and only if it is a solution to the following equation: $$\text{det}(\lambda I - A) = 0$$ But notice that the left hand side is actually a polynomial in $\lambda$ of degree $N\,$! (It's called the characteristic polynomial of $A$.) Therefore, the eigenvalues of a matrix $A$ are simply the roots of the characteristic polynomial of $A$, and $A$ has exactly $N$ eigenvalues (counting multiplicity). The multiset of eigenvalues of $A$ is called the spectrum of $A$, denoted $\text{sp}(A)$. Consider the characteristic polynomial $\text{det}(\lambda I - A)$. It's easy to see that the coefficient of $\lambda^N$ is one (i.e. the polynomial is monic). But what about the coefficient of $\lambda^{N-1}$? Of course, this is the negative of the sum of the roots of the polynomial (i.e.
the eigenvalues), but what is it in terms of the entries of $A$? If you try to expand $\text{det}(\lambda I - A)$ (e.g. for small $N$s), you can see that it is equal to the negative of the sum of the diagonal entries of $A$ (i.e. the trace of $A$)! Therefore, we have the following fact: The trace of a matrix is equal to the sum of its eigenvalues. This is helpful, because the eigenvalues have a few nice properties that will help us solve the problem. For example, if $\lambda$ is an eigenvalue of $A$, then $\lambda^k$ is an eigenvalue of $A^k$ (why?). So the following facts follow:

- If $\text{sp}(A) = \{\lambda_1, \lambda_2, \ldots, \lambda_N\}$, then $\text{sp}(A^k) = \{\lambda_1^k, \lambda_2^k, \ldots, \lambda_N^k\}$.
- Therefore, $\text{tr}[A^k] = \lambda_1^k + \lambda_2^k + \cdots + \lambda_N^k$, i.e. the trace of $A^k$ is the sum of the $k$th powers of the eigenvalues of $A$.
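For example (a quick check in Python, using the well-known spectrum $\{2, -1, -1\}$ of the complete graph $K_3$), the trace of every power equals the corresponding power sum of the eigenvalues:

```python
def mat_mul(A, B):
    """Multiply two square integer matrices."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, k):
    """Naive k-th power of A (identity matrix for k = 0)."""
    n = len(A)
    R = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = mat_mul(R, A)
    return R

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

K3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
spectrum = [2, -1, -1]  # known eigenvalues of the complete graph K3
for k in range(8):
    assert trace(mat_pow(K3, k)) == sum(lam ** k for lam in spectrum)
```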
Closed paths in strong products

Let us now turn to finding the number of closed paths of length $k$ in some graph $F\times G$ which is the strong product of two graphs $F$ and $G$. It is clear that we want to compute the trace of $(F\times G)^k$ (where $F\times G$ also denotes the adjacency matrix). In this case, we can show the following fact: The spectrum of $F\times G$ is equal to $\{(\alpha + 1)(\beta + 1) - 1 : \alpha \in \text{sp}(F), \beta \in \text{sp}(G)\}$. Proof: The rough idea is to construct an eigenvector of $F\times G$ for the eigenvalue $(\alpha + 1)(\beta + 1) - 1$, where $\alpha$ and $\beta$ are eigenvalues of $F$ and $G$. Notice that this value is equal to $\alpha\beta + \alpha + \beta$. Let $v$ and $w$ be eigenvectors corresponding to the eigenvalues $\alpha$ and $\beta$ in $F$ and $G$, respectively. Then we have the following: $$Fv = \alpha v$$ $$Gw = \beta w$$ The first is equivalent to (for every node $j$ of $F$): $$\sum _ {\text{$i$ ~ $j$}} v _ i = \alpha v _ j$$ The second is equivalent to (for every node $j$ of $G$): $$\sum _ {\text{$i$ ~ $j$}} w _ i = \beta w _ j$$ The eigenvector we are going to construct for the eigenvalue $\alpha\beta + \alpha + \beta$ is $(v \otimes w)$, defined by $(v \otimes w) _ {i _ 1,i _ 2} = v _ {i _ 1}w _ {i _ 2}$. In other words, this vector consists of the products of all pairs of entries in $v$ and $w$. Let us now check that this vector is indeed the eigenvector we are looking for.
For a fixed node $(j_1,j_2)$ of $F\times G$: $$\begin{align*} \sum _ {\text{$(i _ 1,i _ 2)$ ~ $(j _ 1,j _ 2)$}} (v \otimes w) _ {i _ 1,i _ 2} &= \sum _ {\text{$(i _ 1,i _ 2)$ ~ $(j _ 1,j _ 2)$}} v _ {i _ 1}w _ {i _ 2} \\\ &= \sum _ {\substack{\text{$i _ 1$ ~ $j _ 1$} \\\ \text{$i _ 2$ ~ $j _ 2$}}} v _ {i _ 1}w _ {i _ 2} + \sum _ {\text{$i _ 1$ ~ $j _ 1$}} v _ {i _ 1}w _ {j _ 2} + \sum _ {\text{$i _ 2$ ~ $j _ 2$}} v _ {j _ 1}w _ {i _ 2} \\\ &= \sum _ {\text{$i _ 1$ ~ $j _ 1$}} v _ {i _ 1} \sum _ {\text{$i _ 2$ ~ $j _ 2$}} w _ {i _ 2} + w _ {j _ 2} \sum _ {\text{$i _ 1$ ~ $j _ 1$}} v _ {i _ 1} + v _ {j _ 1} \sum _ {\text{$i _ 2$ ~ $j _ 2$}} w _ {i _ 2} \end{align*}$$ The next step uses the fact that $v$ and $w$ are eigenvectors of their corresponding graphs: $$\begin{align*} \sum _ {\text{$(i _ 1,i _ 2)$ ~ $(j _ 1,j _ 2)$}} (v \otimes w) _ {i _ 1,i _ 2} &= \alpha v _ {j _ 1} \beta w _ {j _ 2} + w _ {j _ 2} \alpha v _ {j _ 1} + v _ {j _ 1} \beta w _ {j _ 2} \\\ &= (\alpha \beta + \alpha + \beta) v _ {j _ 1} w _ {j _ 2} \\\ &= (\alpha \beta + \alpha + \beta) (v \otimes w) _ {j _ 1, j _ 2} \end{align*}$$ This shows that $(v \otimes w)$ is an eigenvector corresponding to the eigenvalue $(\alpha \beta + \alpha + \beta)$. 
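The fact just proved can also be verified numerically. The sketch below (illustrative Python with small example graphs) uses the standard identity that the adjacency matrix of the strong product satisfies $F\times G = (F+I) \otimes (G+I) - I$, where $\otimes$ is the Kronecker product, and then compares $\text{tr}[(F\times G)^k]$ with $\sum ((\alpha+1)(\beta+1)-1)^k$ over the known spectra of $K_3$ and $K_2$:

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, k):
    n = len(A)
    R = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = mat_mul(R, A)
    return R

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def add_identity(A, c=1):
    """Return A + c*I."""
    n = len(A)
    return [[A[i][j] + c * (i == j) for j in range(n)] for i in range(n)]

def kron(A, B):
    """Kronecker product; row (i1,i2), column (j1,j2) holds A[i1][j1]*B[i2][j2]."""
    n, m = len(A), len(B)
    return [[A[i1][j1] * B[i2][j2] for j1 in range(n) for j2 in range(m)]
            for i1 in range(n) for i2 in range(m)]

F = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # K3, spectrum {2, -1, -1}
G = [[0, 1], [1, 0]]                   # K2, spectrum {1, -1}
# adjacency matrix of the strong product: (F + I) kron (G + I), minus I
SP = add_identity(kron(add_identity(F), add_identity(G)), -1)
for k in range(6):
    lhs = trace(mat_pow(SP, k))
    rhs = sum(((a + 1) * (b + 1) - 1) ** k for a in (2, -1, -1) for b in (1, -1))
    assert lhs == rhs
```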
As a corollary, the trace of $(F\times G)^k$ is the following: $$\text{tr}[(F\times G)^k] = \sum_{\alpha \in \text{sp}(F)} \sum_{\beta \in \text{sp}(G)} ((\alpha + 1)(\beta + 1) - 1)^k$$ When we expand this (using the binomial theorem), we get the following: $$\begin{align*} \text{tr}[(F\times G)^k] &= \sum_{\alpha \in \text{sp}(F)} \sum_{\beta \in \text{sp}(G)} ((\alpha + 1)(\beta + 1) - 1)^k \\\ &= \sum_{\alpha \in \text{sp}(F)} \sum_{\beta \in \text{sp}(G)} \sum_{i=0}^k {k \choose i} (\alpha + 1)^i(\beta + 1)^i (-1)^{k-i} \\\ &= \sum_{i=0}^k {k \choose i} (-1)^{k-i} \sum_{\alpha \in \text{sp}(F)} (\alpha + 1)^i \sum_{\beta \in \text{sp}(G)} (\beta + 1)^i \end{align*}$$ We will now use the fact that the spectrum of the matrix $A + cI$ is $\{\alpha + c : \alpha \in \text{sp}(A)\}$ (why?): $$\text{tr}[(F\times G)^k] = \sum_{i=0}^k {k \choose i} (-1)^{k-i} \text{tr}[(F + I)^i] \text{tr}[(G + I)^i]$$ This equation is very nice! (as we will see below)

Towards a solution

Now, let's try answering some query: given $(l,r,k)$, what is the number of closed paths of length $k$ in the graph $G_l\times G_{l+1}\times \ldots\times G_r$? Let's denote this value by $\text{ans}(l,r,k)$. In fact, we will be more ambitious and we'll try to solve the problem for all triples $(l,r,k)$ :) (a.k.a. the dynamic programming mindset) So, let $G(l,r)$ be the graph $G_l\times \ldots\times G_r$. So naturally, we have the following: $$\text{ans}(l,r,k) = \text{tr}[G(l,r)^k]$$ But remember that $G(l,r) = G_l \times G_{l+1} \times \cdots \times G_r$, so we will need an extension of the equation above about $\text{tr}[(F\times G)^k]$. It's not hard to extend it into the following (verify): $$\text{tr}[(G_1 \times \cdots \times G_t)^k] = \sum_{i=0}^k {k \choose i} (-1)^{k-i} \text{tr}[(G_1 + I)^i] \cdots \text{tr}[(G_t + I)^i]$$ So let's define $P(l,r,k)$ to be the product $\text{tr}[(G_l + I)^k]\text{tr}[(G_{l+1} + I)^k]\cdots \text{tr}[(G_r + I)^k]$.
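As an illustrative sketch (Python with tiny example graphs, not the author's code), the products $P(l,r,k)$ can be tabulated with running prefix products modulo $10^9+7$, and the trace-product formula checked directly against the strong product:

```python
from math import comb

MOD = 10**9 + 7

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, k):
    n = len(A)
    R = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = mat_mul(R, A)
    return R

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def add_identity(A, c=1):
    n = len(A)
    return [[A[i][j] + c * (i == j) for j in range(n)] for i in range(n)]

def kron(A, B):
    n, m = len(A), len(B)
    return [[A[i1][j1] * B[i2][j2] for j1 in range(n) for j2 in range(m)]
            for i1 in range(n) for i2 in range(m)]

graphs = [[[0, 1, 1], [1, 0, 1], [1, 1, 0]],   # K3
          [[0, 1], [1, 0]]]                    # K2
T, K = len(graphs), 6

# tr[(G_i + I)^k] for every graph i and every 0 <= k <= K
tr_pow = [[trace(mat_pow(add_identity(g), k)) % MOD for k in range(K + 1)]
          for g in graphs]

# P[l][r][k] = tr[(G_l+I)^k] * ... * tr[(G_r+I)^k], built with prefix products
P = [[None] * T for _ in range(T)]
for l in range(T):
    row = [1] * (K + 1)
    for r in range(l, T):
        row = [row[k] * tr_pow[r][k] % MOD for k in range(K + 1)]
        P[l][r] = row

def ans(l, r, k):
    """Closed walks of length k in G_l x ... x G_r (binomial formula, mod p)."""
    return sum(comb(k, i) * (-1) ** (k - i) * P[l][r][i]
               for i in range(k + 1)) % MOD

# check against a direct computation on the strong product of the two graphs
SP = add_identity(kron(add_identity(graphs[0]), add_identity(graphs[1])), -1)
for k in range(K + 1):
    assert ans(0, 1, k) == trace(mat_pow(SP, k)) % MOD
    assert ans(0, 0, k) == trace(mat_pow(graphs[0], k)) % MOD
```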
The values of $P(l,r,k)$ can be computed for all triples $(l,r,k)$ in $O(T^2K)$ time, assuming we have the values $\text{tr}[(G_i + I)^k]$ for all $1 \le i \le T$ and $k \le K$ (which we'll deal with later). Then expanding $\text{tr}[G(l,r)^k]$, we get: $$\begin{align*} \text{ans}(l,r,k) &= \text{tr}[G(l,r)^k] \\\ &= \text{tr}[(G_l \times \cdots \times G_r)^k] \\\ &= \sum_{i=0}^k {k \choose i} (-1)^{k-i} \text{tr}[(G_l + I)^i]\text{tr}[(G_{l+1} + I)^i]\cdots \text{tr}[(G_r + I)^i] \\\ &= \sum_{i=0}^k {k \choose i} (-1)^{k-i} P(l,r,i) \end{align*}$$ This gives us a nice little formula to compute $\text{ans}(l,r,k)$ for all $k$ in terms of $P(l,r,k)$ for all $k$! Unfortunately, implementing this as it is will most likely time out, because there are $O(T^2K)$ such triples, and each sum takes $O(K)$ time to compute, so the overall complexity is $O(T^2K^2)$. Luckily, we can exploit the beautiful structure of the above equation to form a faster algorithm. The first thing to do is to expand ${k \choose i}$ as $\frac{k!}{i!(k-i)!}$. $$\begin{align*} \text{ans}(l,r,k) &= \sum_{i=0}^k {k \choose i} (-1)^{k-i} P(l,r,i) \\\ &= \sum_{i=0}^k \frac{k!}{i!(k-i)!} (-1)^{k-i} P(l,r,i) \\\ &= k! \sum_{i=0}^k \frac{1}{i!(k-i)!} (-1)^{k-i} P(l,r,i) \\\ &= k! \sum_{i=0}^k \frac{P(l,r,i)}{i!} \cdot \frac{(-1)^{k-i}}{(k-i)!} \end{align*}$$ Notice that this is now in the form of a convolution of two sequences. Therefore, it seems feasible to use fast polynomial multiplication to compute these numbers quickly. Indeed, let's fix the values $l$ and $r$ (with the intention of calculating $\text{ans}(l,r,k)$ for all $k \le K$). Also, let: $$f(k) = \frac{P(l,r,k)}{k!}$$ $$g(k) = \frac{(-1)^k}{k!}$$ Then the above equality is the same as: $$\text{ans}(l,r,k) = k!
\sum_{i=0}^k f(i) g(k-i)$$ Consider two polynomials: $$F(x) = \sum_{i=0}^K f(i) x^i$$ $$G(x) = \sum_{i=0}^K g(i) x^i$$ Then we have the remarkable property that $\text{ans}(l,r,k)$ is $k!$ times the coefficient of $x^k$ in the polynomial product $F(x)G(x)$! This allows us to calculate $\text{ans}(l,r,k)$ for all $k$ using a single polynomial multiplication! Multiplying two polynomials of degree $K$ can be done in $O(K^{1.586})$ time using Karatsuba's algorithm, or in $O(K \log K)$ time using fast Fourier transforms. (These are really standard topics and there are lots of resources online about them. For the current problem though, you will need to implement this really efficiently because the time limit is rather tight.) Since we need to do $O(T^2)$ multiplications, the overall running time of this step is $O(T^2 K \log K)$. The remaining thing to do is to compute $\text{tr}[(G_i + I)^k]$ for all $1 \le i \le T$ and $0 \le k \le K$, and for that, we will need a few additional tricks.

Computing all traces $\text{tr}[(G_i + I)^k]$

We will just focus on a single graph, say $G$, because we can just do the same procedure once for each of the $T$ graphs. Our goal is to compute $\text{tr}[I], \text{tr}[H], \text{tr}[H^2], \text{tr}[H^3], \ldots, \text{tr}[H^K]$, where $H := G + I$. Note that these values can be computed in $O(N^3K)$ time (where $N$ is the number of vertices of $G$), by simply calculating all powers of $H$ and computing the traces individually. However, this is too slow for our purposes, so we need to find a different way. The key idea is to use the Cayley–Hamilton theorem: Let $c_0 + c_1 x + c_2 x^2 + \cdots + x^N$ be the characteristic polynomial of a matrix $H$. Then $c_0 I + c_1 H + c_2 H^2 + \cdots + H^N = 0$.
(note that the $0$ on the right hand side is actually a zero matrix, not the number zero) Thus, for our given matrix $H$, we have the following (multiplying both sides by $H^{k-N}$): $$\begin{align*} c_0 I + c_1 H + c_2 H^2 + \ldots + H^N &= 0 \\\ c_0 H^{k-N} + c_1 H^{k-N+1} + c_2 H^{k-N+2} + \ldots + H^k &= 0 \\\ \text{tr}[c_0 H^{k-N} + c_1 H^{k-N+1} + c_2 H^{k-N+2} + \ldots + H^k] &= 0 \end{align*}$$ We can continue manipulating this by using the fact that the trace is a linear map (why?). Therefore, $$c_0 \text{tr}[H^{k-N}] + c_1 \text{tr}[H^{k-N+1}] + c_2 \text{tr}[H^{k-N+2}] + \ldots + \text{tr}[H^k] = 0$$ Therefore, for any $N \le k \le K$, we can compute $\text{tr}[H^k]$ in $O(N)$ time assuming we have already computed the previous values! Thus, we can compute most of the values $\text{tr}[H^k]$ in $O(NK)$ time, which is significantly faster. But to do so, we have to do the following two things:

- compute the first few traces $\text{tr}[I], \text{tr}[H], \ldots, \text{tr}[H^N]$ directly, and
- compute the characteristic polynomial of $H$.
The first one can just be done naïvely, i.e. compute the powers of $H$ and compute the traces individually, and it runs in $O(N^3\cdot N) = O(N^4)$ time which is acceptable, so the only remaining thing to do is to compute the characteristic polynomial of $H$.

Finding the characteristic polynomial

We want to compute the characteristic polynomial of $H$. By definition, it is $\text{det}(xI - H)$, which is a monic polynomial of degree $N$ whose roots are the members of $\text{sp}(H)$. Suppose the eigenvalues of $H$ are $\lambda_1, \lambda_2, \ldots, \lambda_N$. Then its characteristic polynomial is $(x - \lambda_1) (x - \lambda_2) \cdots (x - \lambda_N)$. For instance, let's assume $N = 3$, and let's see what the characteristic polynomial looks like: $$(x - \lambda_1) (x - \lambda_2) (x - \lambda_3) = x^3 - (\lambda_1 + \lambda_2 + \lambda_3)x^2 + (\lambda_1\lambda_2 + \lambda_1\lambda_3 + \lambda_2\lambda_3)x - \lambda_1\lambda_2\lambda_3$$ The expressions $(x_1 + x_2 + x_3)$, $(x_1x_2 + x_1x_3 + x_2x_3)$ and $(x_1x_2x_3)$ are called the elementary symmetric polynomials in the variables $x_1$, $x_2$ and $x_3$, and these polynomials have a lot of other useful relations with a polynomial's coefficients and roots. The elementary symmetric polynomial of order $k$ in variables $x_1, \ldots, x_N$ is denoted by $e_k(x_1, \ldots, x_N)$, and is defined as the sum of the products of all size-$k$ subsets of $\{x_1, \ldots, x_N\}$.
For a given monic polynomial $p$, the coefficients of $p$ are (up to sign) the results of substituting the roots of $p$ into the elementary symmetric polynomials, so we have the following clear relationship between $p$'s coefficients and roots (also known as Vieta's formulas): $$e_1(\lambda_1, \ldots, \lambda_N) = -c_{N-1}$$ $$e_2(\lambda_1, \ldots, \lambda_N) = +c_{N-2}$$ $$e_3(\lambda_1, \ldots, \lambda_N) = -c_{N-3}$$ $$e_4(\lambda_1, \ldots, \lambda_N) = +c_{N-4}$$ $$\ldots$$ $$e_N(\lambda_1, \ldots, \lambda_N) = (-1)^Nc_0$$ Thus, we wish to compute these values $e_i(\lambda_1, \ldots, \lambda_N)$. The key insight is to use the values $\text{tr}[I], \text{tr}[H], \text{tr}[H^2], \ldots, \text{tr}[H^N]$ (which we have already computed earlier). Remember that $\text{tr}[H^k]$ is just the sum of the $k$th powers of the eigenvalues of $H$, and there are relationships between elementary symmetric polynomials and power sums! They're known as Newton's identities and can be stated as: $$ke_k(x_1, \ldots, x_N) = \sum_{i=1}^k (-1)^{i-1} e_{k-i}(x_1, \ldots, x_N)\,p_i(x_1, \ldots, x_N)$$ for all $k$. Here, $p_i(x_1, \ldots, x_N)$ is defined as $x_1^i + x_2^i + \cdots + x_N^i$. Substituting the $\lambda_i$s, we have $p_i(\lambda_1, \ldots, \lambda_N) = \text{tr}[H^i]$. This allows us to compute the $e_k(\ldots)$s recursively, in $O(N^2)$ time!

Optimizations

Computing the necessary traces for each graph requires $O(N^4 + NK)$ time, thus doing it for all $T$ graphs requires $O(TN^4 + TNK)$ time. The second step (of using FFT to compute the $\text{ans}(l,r,k)$s) runs in $O(T^2K \log K)$. Finally, answering $Q$ queries simply requires $Q$ lookups in the table for $\text{ans}$, so it takes $O(Q)$ time. Therefore, the overall running time is $O(TN^4 + TNK + T^2K\log K + Q)$, and the memory complexity is $O(T^2K)$ (due to storing the table $\text{ans}$).
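Putting the characteristic-polynomial machinery above into code: an illustrative Python sketch (over exact integers; the actual solution works modulo $10^9+7$, replacing the division by $k$ in Newton's identities with a modular inverse) that recovers the characteristic polynomial from the first $N$ traces and then extends the traces with the Cayley–Hamilton recurrence:

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, k):
    n = len(A)
    R = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = mat_mul(R, A)
    return R

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def char_poly(traces):
    """Newton's identities: given traces[i] = tr[H^i] for 0 <= i <= N,
    return [c_0, ..., c_{N-1}, 1], the characteristic polynomial of H."""
    N = traces[0]  # tr[H^0] = N
    e = [1] + [0] * N
    for k in range(1, N + 1):
        s = sum((-1) ** (i - 1) * e[k - i] * traces[i] for i in range(1, k + 1))
        assert s % k == 0  # e_k is an integer for an integer matrix
        e[k] = s // k
    # by Vieta's formulas, the coefficient of x^(N-k) is (-1)^k e_k
    return [(-1) ** (N - j) * e[N - j] for j in range(N + 1)]

def all_traces(H, K):
    """tr[H^0..H^K]: the first N+1 directly, the rest by Cayley-Hamilton."""
    N = len(H)
    tr = [trace(mat_pow(H, k)) for k in range(min(N, K) + 1)]
    c = char_poly(tr[:N + 1]) if K > N else None
    for k in range(N + 1, K + 1):
        tr.append(-sum(c[j] * tr[k - N + j] for j in range(N)))
    return tr

# check on H = K3 + I (any small integer matrix works)
H = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
assert all_traces(H, 8) == [trace(mat_pow(H, k)) for k in range(9)]
```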
However, we can actually improve the memory complexity by using the fact that the queries can be processed offline: we can process the queries in increasing order of $L_i$, and then of $R_i$, so that at any moment we only store the values $\text{ans}(l,r,k)$ for two pairs $(l,r)$: the current $(l,r)$ and the previous $(l,r-1)$. This reduces our memory requirements to just $O(TK)$ (remember that we still need to store the $O(TK)$ traces of the graphs).

Time Complexity: $O(TN^4 + TNK + T^2K\log K + Q)$

AUTHOR'S AND TESTER'S SOLUTIONS:
asked 17 Aug '15, 02:17

"However, storing the table ans requires $O(T^2K)$ memory which is quite high." I don't find 2.5 million to be quite high... Also, I think FFT is much harder to use here than Karatsuba. Double-valued FFT lacks precision, since the modulus is too large, and it isn't suitable for NTT. I heard there are NTTs that work for a more general class of moduli, but that seems like overkill. Speaking of Karatsuba, I found my implementation kinda slow; are there any esoteric tricks to speed it up? I already used: avoiding modulo operations, mostly working with ints instead of long longs, not copying arrays unless necessary (specifying subarrays to multiply instead), and switching to brute force when multiplying small arrays. answered 17 Aug '15, 16:55
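For reference, a baseline (unoptimized) Karatsuba over $\mathbb{Z}/(10^9+7)$ looks like the sketch below (illustrative Python; the tricks discussed above are about making the inner loops of such an implementation fast in a compiled language):

```python
MOD = 10**9 + 7

def naive_mul(a, b):
    """Schoolbook polynomial multiplication mod MOD."""
    res = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            res[i + j] = (res[i + j] + x * y) % MOD
    return res

def add_poly(a, b):
    n = max(len(a), len(b))
    return [((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % MOD
            for i in range(n)]

def sub_poly(a, b):
    n = max(len(a), len(b))
    return [((a[i] if i < len(a) else 0) - (b[i] if i < len(b) else 0)) % MOD
            for i in range(n)]

def karatsuba(a, b):
    """Karatsuba multiplication: three half-size products instead of four."""
    if min(len(a), len(b)) <= 16:  # cutoff: brute force for small inputs
        return naive_mul(a, b)
    h = min(len(a), len(b)) // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]
    z0 = karatsuba(a0, b0)
    z2 = karatsuba(a1, b1)
    z1 = sub_poly(sub_poly(karatsuba(add_poly(a0, a1), add_poly(b0, b1)), z0), z2)
    res = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(z0):
        res[i] = (res[i] + x) % MOD
    for i, x in enumerate(z1):
        res[i + h] = (res[i + h] + x) % MOD
    for i, x in enumerate(z2):
        res[i + 2 * h] = (res[i + 2 * h] + x) % MOD
    return res
```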
Thanks @xellos0, I have updated the sentence about $O(T^2K)$ so it doesn't say "quite high". By the way, to use NTT for the current problem, the author and tester used three moduli that are suitable for NTT. I described this approach in my comments in https://discuss.codechef.com/questions/31836/realseteditorial . The downside is that their program's constants were multiplied by three.
(17 Aug '15, 20:15)
I wish the modulus were NTT-friendly (but probably that would have given some hint, which is why the author chose the generic $10^9 + 7$). My first version of Karatsuba was very slow, so I thought optimizing that was pointless, but I see that most accepted solutions actually use an optimized version of it. Double-valued FFT was of course giving wrong answers because of precision issues. Finally, I implemented the NTT with three moduli, which was significantly faster than Karatsuba, but I still couldn't optimize it to fit in the time limits.
(17 Aug '15, 21:47)
I don't think a hint to FFT would have been very helpful here. The problem has many parts, so it's not obvious at all where it should be, and an experienced contestant would be able to guess that fast convolution is necessary at some point, anyway. And there are many primes with $p-1$ divisible by a large power of 2, which makes guessing harder. Anyway, with better moduli, at least my code would've been much faster, since computing convolutions was (by far) the slowest part.
(18 Aug '15, 01:32)
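For completeness, here is a sketch of the three-moduli NTT approach mentioned in this thread (illustrative Python; the primes $998244353$, $167772161$ and $469762049$ all have $3$ as a primitive root, and their product comfortably exceeds the largest possible convolution coefficient, so the true integer result can be reconstructed by CRT and then reduced mod $10^9+7$):

```python
MOD = 10**9 + 7
# three NTT-friendly primes, each with 3 as a primitive root
NTT_PRIMES = [998244353, 167772161, 469762049]

def ntt(a, invert, p):
    """Iterative radix-2 NTT in place; len(a) must be a power of two."""
    n = len(a)
    j = 0
    for i in range(1, n):          # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(3, (p - 1) // length, p)   # primitive length-th root of unity
        if invert:
            w = pow(w, p - 2, p)
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % p
                a[k] = (u + v) % p
                a[k + length // 2] = (u - v) % p
                wn = wn * w % p
        length <<= 1
    if invert:
        inv_n = pow(n, p - 2, p)
        for i in range(n):
            a[i] = a[i] * inv_n % p

def conv_mod(a, b, p):
    """Convolution of a and b modulo a single NTT prime p."""
    n = 1
    while n < len(a) + len(b) - 1:
        n <<= 1
    fa = [x % p for x in a] + [0] * (n - len(a))
    fb = [x % p for x in b] + [0] * (n - len(b))
    ntt(fa, False, p)
    ntt(fb, False, p)
    for i in range(n):
        fa[i] = fa[i] * fb[i] % p
    ntt(fa, True, p)
    return fa[:len(a) + len(b) - 1]

def convolution(a, b):
    """Convolution mod 1e9+7 via three NTTs and CRT reconstruction."""
    p1, p2, p3 = NTT_PRIMES
    res = [conv_mod(a, b, p) for p in NTT_PRIMES]
    out = []
    for r1, r2, r3 in zip(*res):
        x = r1                                              # x = r1 (mod p1)
        x += ((r2 - x) * pow(p1, p2 - 2, p2) % p2) * p1     # lift to mod p1*p2
        x += ((r3 - x) * pow(p1 * p2, p3 - 2, p3) % p3) * p1 * p2
        out.append(x % MOD)
    return out
```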

There is another simple way to calculate all the traces, using the following observation: since we only need the sum of the diagonal elements for some power of $G$, we compute $G^1, G^2, \ldots, G^{100}$. After that we compute $G^{200}, G^{300}, G^{400}, \ldots, G^{1000}$. Note that these are just 200 matrix multiplications. The trace of any power of $G$ can then be computed in $O(N^2)$, since we only need the diagonal elements and any power of $G$ can now be represented as a product of two already-calculated matrices. So the complexity for finding all traces by this method becomes $O(200N^3 + KN^2)$. answered 17 Aug '15, 15:27
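A sketch of this observation (illustrative Python, with a toy block size of $4$ instead of $100$): since $\text{tr}[AB] = \sum_{i,j} A_{i,j}B_{j,i}$, one $O(N^2)$ pass over two precomputed powers yields any trace.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, k):
    n = len(A)
    R = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = mat_mul(R, A)
    return R

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def trace_of_product(A, B):
    """tr[A*B] in O(N^2): only the diagonal of the product is needed."""
    n = len(A)
    return sum(A[i][j] * B[j][i] for i in range(n) for j in range(n))

G = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]]  # arbitrary test graph
S = 4                                          # block size (the post above uses 100)
small = [mat_pow(G, r) for r in range(S)]      # G^0 .. G^(S-1)
big = [mat_pow(G, S * q) for q in range(5)]    # G^0, G^S, G^(2S), ...

def trace_power(k):
    """tr[G^k] from two precomputed matrices, without forming G^k."""
    q, r = divmod(k, S)
    return trace_of_product(big[q], small[r])

for k in range(20):
    assert trace_power(k) == trace(mat_pow(G, k))
```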
Yes, I used this idea and managed to get 100 points. Though I had to use a lot of optimizations, like sparse matrix multiplication, multiplying matrices with boolean values without using the '*' operator, and most importantly using minimal '%' operators, etc. My code: https://www.codechef.com/viewsolution/7803746
(17 Aug '15, 16:37)
Yeah, apparently it required a lot of optimizations to get that 100. I didn't think Q*K would pass in time :'(
(18 Aug '15, 00:00)

I had thought about using FFT for the last step, but I felt using doubles was too imprecise, and using 3 mods was going to make the constant factor too big. So instead, I just did the multiplication naively, but with loop unrolling. I also had to unroll my loops for the matrix multiplication and trace calculations (using @viv001's method). This managed to speed it up almost 2x, and got 100 points. (link to my submission: https://www.codechef.com/viewsolution/7780496 ) answered 17 Aug '15, 23:53
