Clearly,
In addition, if n = q_1 p + r_1 and k = q_2 p + r_2 (0 \le r_1, r_2 < p), then by Lucas theorem we have
So, we only need to compute the sum
where 0 \le k \le n < p.
In fact, this summation can be computed in O(\sqrt{p} \log p) time by using FFT/NTT.
Let v := \lfloor\sqrt{k}\rfloor, f_m(x) := \prod_{i=1}^{m}(x+i) and g_m(x) := \sum_{i=1}^{m} \left(\prod_{j=1}^{i-1}(n+1-j-x) \cdot \prod_{j=i}^{m} (x+j)\right). Then,
So, it suffices to evaluate f_v(iv), f_v(n-(i+1)v) and g_v(iv) for 0 \le i < v in O(\sqrt{k} \log{k}) time. (The last binomial sum can be computed in O(\sqrt{k} + \log{p}) time with those values.)
Firstly, the polynomials f_m(x) and g_m(x) satisfies the following recurrences
Secondly, by Lagrange interpolation and FFT, we can compute the values f(ai+c) (0 \le i \le \deg{f}) from f(ai+b) (0 \le i \le \deg{f}) in O(\textrm{M}(\deg{f})) time. (see https://specfun.inria.fr/bostan/publications/BoGaSc07.pdf.)
Therefore, we can compute the values f_{2m}(0), \ldots, f_{2m}(2mv) and g_{2m}(0), \cdots, g_{2m}(2mv) from f_{m}(0), \ldots, f_{m}(mv) and g_m(0), \ldots, g_m(mv) in O(m \log m) time. In addition, f_{2m+1}(*) and g_{2m+1}(*) can be computed from f_{2m}(*) and g_{2m}(*) in O(m + \log{p}) time. Thus, we can compute the binomial sum modulo prime in O(\sqrt{p} \log{p}) time.
[Updated]
It takes about 0.13 seconds on ideone to compute
(C++ implementation: https://ideone.com/Q3e3Wo)