PROBLEM LINK:
Author: Praveen Dhinwa with help of Alexey Zayakin
Tester: Jingbo Shang and Pushkar Mishra
Editorialist: Florin Chirica
PREREQUISITES
dp, fft, karstuba, binomial theorem, multinomial theorem, combinatorics, matrix exponentiation
PROBLEM
Find number of numbers of N digits divisible by P having their sum of digits \leq M. As number could grow very large, print answer modulo 998244353.
CREDITS
I want to thank Praveen Dhinwa for helping me in writing some parts of the editorial. I have also incorporated ideas from Alexey and Shang too.
QUICK EXPLANATION
We will visualize the process of creating n digit number from left to right (i.e. from LSB (Lowest significant bit) to MSB (Most significant bit)). This visualization will help us in constructing a typical dynamic programming solution.
We can first write a simple dp having states (current place, sum of digits, remainder). It is too slow and only pass subtask 1.
But we can optimize it by using the fact that P is too small. We can observe that putting x at the i^{th} place (from LSB to MSB) contributes (x * 10^i) \, \% \, P into
the remainder and it adds x to the sum of digits. Let us say 10^i \, \% \, P to be contributing remainder by i^{th} digit to the number formed.
We will group the places by their contributing remainders.
So we will do a dp in which we will iterate over their contributes remainders (which goes from 0 to P  1 rather than N).
Let us make an array cnt* which denotes number of places having contributing remainder equal to i.
Mathematically it is number of x (0 \leq x < N) such that 10^x \% P = i. For finding this cnt array, you can simply use digit dp.
You can also use the fact of periodicity of 10^x \% P.
At this point, you can also make an interesting and useful observation that distinct values of cnt* won’t be too much.
There indeed will be at most 4 distinct values of cnt* and these are (0, 1, x  1, x) for some x. It can be easily proved if we consider
how to calculate cnt* using the fact that (10^i \mod P) is periodic.
Let us define function f: f*[d] as number of ways of having sum of digits = d for all the positions having contributing
remainder equal to i.
If we know how to compute f faster, then we can use an \mathbb{O}(M^2 P^2) dp to solve the third subtask. State of that
dp will be (at which contributing remainder we currently are, cur_sum_of_digits, cur_rem):
For computing f faster, we can use matrix exponentiation, binomial theorem, multinomial theorem and combinatorics ideas, Few very basic details are given in this section.
Please see next section fore more details.
Let’s fix contributing remainder as r, Let f*[d] denote the number of ways having first i places of number we want to construct such that it is having sum of digits equal to d.
Note that we are considering in all the positions having contributing remainder equal to r.

Matrix exponentiation:
f*[d]: \sum_{cur = 0}^{9} (f[i  1][d  cur]).
For a fixed contributing remainder r, As i goes from 0 to cnt[r]. We can compute it using matrix exponentiation. 
Binomial theorem:
f*[d] is equal to number of solutions of a_1 + a_2 + a_cnt[r] = d where 0 \leq a_i \leq 9.
It is equal to coefficient of x^d in the expression (1 + x + x^2 + . . + x^9)^cnt[r].
Rewriting it smartly as discussed by us will do our task 
Multinomial expressions:
Please just mention that you can write the above expression in a polynomial in two variables too, which can be helpful in solving
some problems. Note that, here multinomial expression can be reduced in binomial expression using the property that for a fixed remainder r, if you are making sum of digits equal to x, then
it will contribute r * x \% P in the overall remainder.
Note that if M \leq 50 (subtask 2), you can find the above expression very fast using the idea of multinomial theorem.
You can write the above expression as using partition function over cnt[r]. 
Combinatorics:
Using some conbinatorial identity and inclusion exclusion, you can get an expression for f too.
Let us write a bivariate polynomial as follows.
We have to find coefficient of x^m y^0 in it. Note that we can convert this bivariate polynomial into univariate by using a smart trick described in next section.
As we noticed the DP transition is a multiplication by some matrix. We can notice that this matrix has exactly P independent components (each of size M + 1), thus we can consider these components independently. Inside each component multiplication by matrix is equal to multiplication by some polynomial which can be done by FFT or Karatsuba’s multiplication algorithm.
EXPLANATION
Simplest DP idea for solving subtask 1
Let’s start by solving the problem using a simple DP. We need to count strings of length N, with sum of digit X, which are divisible by P. Since it’s a counting problem, we can try solving it by DP. The DP depends on the number of digits added so far, their remainder modulo P and their sum of digits.
Let dp[n][rem][sum] = how many strings of length n have remainder rem modulo P and their sum of digit is sum. The answer is obviously dp[N][0][X]. The standard initialization in counting problems is to consider a “fictitious” string  of length 0 (like we added nothing). In this case, adding nothing translates to dp[0][0][0] = 1.
Let’s see the recurrence. How can we reduce the problem from length N to length N  1? By fixing the first digit (for example). Suppose the first digit is c. It adds 10^n*c to the remainder and also it adds c to the sum of digits. So let’s iterate c from 0 to 9. Now first digit is fixed. How to count the rest? We need to add to dp[n][rem][sum] the value dp[n  1][rem'][sum  c], where rem' + 10^n * c = rem.
This approach times out for big values of n, but it’s fast enough to pass the first subtask.
Let’s change the DP
We somehow have to get rid of that n factor. Let’s take a closer look to what DP actually does  the value of n is important because it determines value of 10^n * c, for the first digit added. But now it comes the interesting part  that is 10^n modulo P. This means, there are only P possible distinct answers for all 10^n values. This means, we can group the values n by the value 10^n modulo P.
What do we get by doing this? The idea is to add digits for a group at the same time. Let’s Denote cnt[x] = number of values y (0 \leq y < N) such as 10^y mod P = x. Let’s assume for now this array calculated.
Suppose we know that cnt[0] = 4 and cnt [1] = 3 . There will be 4 positions (let’s note them p_1, p_2, p_3, p_4) such as 10^{p_1} = 10^{p_2} = 10^{p_3} = 10^{p_4} = 0 (modulo P). If we assign them digits d1, d2, d3, d4, the remainder modifies by 10^{p_1} * d1 + 10^{p_2} * d2 + 10^{p_3} * d3 + 10^{p_4} * d4 = 10^{p_1} * (d1 + d2 + d3 + d4) = 0 * (d1 + d2 + d3 + d4) = 0. Let’s consider the other example: cnt [1] = 3. There will be 3 positions (p1, p2, p3) such as 10^{p_1} = 10^{p_2} = 10^{p_3} = 1 (modulo P). If we assign them digits d1, d2, d3, the remainder modifies by 10^{p_1} * (d1 + d2 + d3) = 1 * (d1 + d2 + d3).
The idea is to iterate over i from 0 to P  1. Let’s consider cnt* positions for the given i. We can fix what digits we add. They will have a sum of digits (let’s note it sd). Then, the remainder modifies by i * sd and sum of digits modifies by sd.
We have a new DP. Let dp[g][rem][sum] = in how many ways can we obtain remainder rem modulo P and sum of digits sum if we considered only first g groups of positions (cnt[0], cnt[1], ..., cnt[g  1]).
For the current group we can iterate the sum of digits we assign to it. This uniquely modifies the remainder and the sum of digits. So, for a calculated dp[g][rem][sum], let’s iterate sd, representing sum of digits added for positions corresponding to group g + 1.
We get the recurrence dp[g + 1][(rem + g * sd) \, \% \, P][sum + sd] += dp[g][rem][sum] * ways[g][sd]. You probably wonder what ways can be. We assigned a sum of digits. It needs to be divided between cnt[g + 1] positions. This means, we need to assign to cnt[g + 1] positions values between 0 and 9 such as, after adding all those values from the positions, we get the sum sd. This is ways[g][sd] = in how many ways can we distribute sum of digits sd to cnt[g] positions, such as in each positions we’re allowed to add numbers between 0 and 9.
If we can calculate tables cnt and ways efficiently, we’re done. We’ll calculate firstly cnt, then ways.
Calculating values of cnt

Using periodicity of 10^y \% P
First one, let’s suppose you calculate 10^0, 10^1, ..., 10^P. There are P + 1 values but only P possible remainders (from 0 to P  1). By pigeon principle, it means that at least two values are equal. Suppose 10^x = 10^y (mod P), with x < y. I’ll claim that all possible values are 10^0, 10^1, ..., 10^x, ..., 10^{y1}, 10^x, ...., 10^{y1}, 10^x, ..., 10^{y1}. In other words, after calculating at most P values, the sequence becomes cyclic. This is because 10^{x+1} = 10^{y+1} (we simply multiplied by 10), 10^{x+2} = 10^{y+2} (we multiplied the last equality by 10) and so on. So it’s enough to find the period
and update the cnt array. 
Digit dp solution
The second method does not require any observation. We can simply do digit DP. This means we add digits of x one by one and remember, for each prefix, the remainder of 10^x modulo P. Suppose we have a prefix Pr with remainder r (remainder of 10^Pr modulo P). Then, we can extend Pr by a digit c. The new power of 10 will be
10^{Pr*10+c} = 10^{Pr * 10} * 10^c = r^{10} * 10^c. So we’ll have to add all prefixes Pr of length n which give remainder r to all prefixes of length n+1 which give remainder r^{10} * 10^c. Now let’s think what digits c we can add to a prefix. There are two cases:
the prefix is identical to one prefix of n. In this case, we must be careful not to add a digit such as the prefix will be greater than n. We can add only digits up to current digit of n.

the prefix is not identical to one prefix of n. In this case, we can add all digits from 0 to 9.
This leads us to digit_dp[same][rem][n] = how many prefixes of length n have remainder rem of 10^{prefix} modulo P with (same = 0 means the current prefix is different to prefix of n and same = 1 means the current prefix is the same with the prefix of n).

Using one of those approaches, we can compute array cnt.
Calculate the array ways
As a quick recap: ways[g][sd] = in how many ways can we split sum of digits sd to cnt[g] positions. Denote by now n = cnt[g]. We have to count how many equations satisfy:
x[1] + x[2] + .... + x[n] = sd such that 0 \le x* \le 9
This is a classical problem. There are lot of ways of solving such problem.

Generating function and binomial expression solution
I’ll describe some approaches to do it. One of them is to associate a polynomial for this expression
(1 + x^1 + .... + x^9) ^ n and we’re interested in the coefficient of x^{sd}. Why does it work? Just try to simulate the calculations on paper… you’ll probably see it.
Let’s rewrite S = 1 + x^1 + ... + x^9 using geometric progressions. I’ll write here how to deduce it, without knowing the formula for a geometric progression.
S = 1 + x^1 + ... + x^9
x*S = x^1 + x^2 + .... + x^{10}
S*(x  1) = x^{10}  1
S = \frac{(x^{10}  1)}{(x  1)}
So the problem reduces to find the coefficient of x^{sd} in \frac{(x^{10}  1) ^ n}{(x  1) ^ n}. The trick is to write the expression as (x^{10}  1) ^ n * (x  1) ^ {n}. Let’s note
First term = (x^{10}  1) ^ n
Second term = (x  1) ^ {n}
We’ll split coefficient of x^{sd} in two parts: let’s say that we need to get x^i from the first term and x^{sd  i} from the second term (we iterate this i). We have two problems left: find the coefficient of x^i in the first term and find the coefficient of x^{sd  i} in the second term.
We’ll take i values only multiples of 10 (because otherwise there is no chance to find x^i in the first term). Now we can simply apply binomial theorem to get it.
For second term it is more tricky, as we have negative powers. However, this link explains it pretty well.
If considering also the remainder, this can be solved by multinomial expressions too. As each operations increases the remainder by g, we can consider a polynomial in two variables: x^i * y^{g*i}. The power of x corresponds now to the number of used operations for a given position and the power of y corresponds now to the “gained” remainder after doing the operations.
So we can also write the expression as (x^0 * y^0 + x^1 * y^g + .... + x^9 * y^{9 * g}) ^ n. We’re now interested in the coefficient x^n * y^{sd*g}. We can now use multinomial theorem to solve it.
Finally, this can be solved by combinatorics and inclusion and exclusion principle. Let’s define F(n, pos) = how many solutions are for equation x_1 + x_2 + ... + x_n = sd such as at least pos positions have x* > 9.
Let’s note that something like 1+10+10 will be added twice in F(3, 1): once for position 2 and once for position 3.
How can we use this idea? Initially, the answer is F(n, 0). However, some elements can have values > 9. Let’s erase those who have one position with elements > 9. This is F(n, 1). What now? It was possible we erased twice some arrays: those having two positions > 9. How to fix this? We add F(n, 2). Next, we erase F(n, 3) and so on. This is called inclusion and exclusion principle.
How to get F(n, k)? Let’s fix firstly the positions which must have elements > 9. We can choose them in \binom {n}{k} ways. We add 10 to those positions, to force them to have value > 9. Now, on the other positions, we may or we may not have values > 9, there is no restriction left. So we’re left with solving
x_1 + x_2 + ... + x_n = sd  k * 10 (since we forced the chosen positions to have value at least 10).
such as x* >= 0This problem is a classical one, you can read more about it here. So F(n, k) is \binom {n}{k} * \binom {sd + n  1  10 * k}{n  1}. Binomial coefficient can be calculated using Pascal’s triangle.

Matrix exponentiation
You can write the above expression as a dp in which the matrix with which you are multiplying to get next state is constant. So you can use matrix exponentation for that. 
Combinatorics based solution
Let t = cnt*.
a_1 + a_2 + + . . + a_t = d.
Assume that a_i \geq 0 is only constraint.
Then number of solutions will be \binom{d + t  1}{t  1}. Explanation for this is usually given by beggar method.Idea of inclusion exclusion:
Now as in the above case, our a_i is unbounded, we need to bound them by 9.
Now we know that some of the variables will go larger than t, we will count them smartly using inclusion exclusion.Using these ideas, you can get a combinatorial solution for computing f.
More details about this method will be added later if requested by participants.

Multinomial theorem
One of the possible ways to compute F, to use multinomial theorem. You can simply write the above expression as a multinomial expression. I will add details about this method later. You
can see Praveen’s Solution for it.
Solving the hardest subtask

Using bivariate polynomial multiplication
Let us write a bivariate polynomial as follows.
We have to find coefficient of $x^m y^0$ in it. Note that we can convert this bivariate polynomial into univariate by using a trick that we can represent powers of both
x and y cumulatively in base (2 * m + 1). So replacing y by x^(2 * m + 1) in the bivariate polynomial will convert it to univariate polynomial.
Note that x^p_1 y^p_2 will be represented as a number p_2 p_1 in base 2 * m + 1, whose actual value will be p2 * (2 * m + 1) + p1.
So maximum degree of univariate polynomial won’t be larger than \mathbb{O}(m * p). Multiplying two polynomials in a prime field can be done using Number Theoritic Transform. You
can use some tricks to multiply two polynomials for any modulo. For this problem, you only needed to use simple Number Theoritic Transform.

Using M * P univariate polynomial multiplications
Alexey came up with the idea to reduce the complexity to \mathbb{O}(P * M^2 + P^2 * Mlog_2(3)). As you noticed before the DP transition is a multiplication by some matrix. We can notice that this matrix has exactly P independent components (each of size M + 1), thus we can consider these components independently. Inside each component multiplication by matrix is equal to multiplication by some polynomial and this can be done in \mathbb{O}(M log M) time using FFT or in time \mathbb{O}(M log_2(3)) using Karatsuba multiplication algorithm.
Actually theoritically the first method should be faster, but practically, the second method worked slightly faster than first method. You can see [Praveen’s solution][] for solution based on the first
method. Please see [Alexey’s Solution][] for second method.
Reasons for not hiding modulo
Actually initially we had intended to use 10^9 + 7 as modulo, but already the solution was taking around 12 secs, For having 10^9 + 7 as modulo, we needed to 3 times FFT multiplications. Then the bruteforce
solution becomes fast enough to pass this kind of subtask. So we just decided to use typical NTT modulo.
Please share any other idea you have regarding the problem !!