I don’t know if anyone’s still reading this editorial or if this comment is too late, but I have a solution for this problem that appears to avoid the extremely bad worst cases mentioned above.
Edited to add:
Some notation:
S = \textrm{the string considered as a decimal integer}
s = \log_{10}(S) = \textrm{the length of }S
N = \textrm{the target value}
n = \log_{10}(N) = \textrm{the length of the target value}
Solution 1
There is a search tree whose root is (S,N), and whose general node is (set of remaining substrings of S, target value). The children nodes are those you get by removing a substring from the current set of strings and subtracting its value from the current target to make the child’s target.
This solution aims to strike a balance between breadth-first search and depth-first search, doing exploration (BFS) nearer the top of the tree and something like DFS nearer the bottom. Simple DFS is not as good because it refuses to spend a small amount of time at the top of the tree looking for a better situation to exhaustively search from.
Here there is a priority queue of unexpanded nodes, and the priority of node x=((S_1,\ldots,S_k),N') is given by
p(x)=\log(N')+\lambda k+\mu\,\textrm{depth}(x)-\nu\sum_i|S_i|
(smaller is better), where |S_i| is the length of the substring S_i (up to 120 here). \lambda, \mu and \nu are constants. I found \lambda=7, \mu=1.75, \nu=0.1 worked fairly well.
This prioritises smaller target values, a less-fragmented set of substrings (smaller k),
shallower depth in the tree, and more total substring length remaining to work with. (I’m
sure there are better priority functions, but this one is reasonably good and simple to
evaluate.)
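For concreteness, a minimal sketch of that priority function in Python, assuming a node is
represented as (tuple of remaining substrings, current target, depth); the base-10 log and
the guard for a zero target are my own choices, not necessarily what the implementation
linked below does:

```python
from math import log10

# Sketch only: node = (remaining substrings of S, current target N', depth).
# The constants are the ones quoted above; "lambda" is a reserved word in Python.
LAM, MU, NU = 7.0, 1.75, 0.1

def priority(substrings, target, depth):
    """p(x) = log(N') + lambda*k + mu*depth(x) - nu*sum_i |S_i|, smaller is better.

    Base-10 log and the max(target, 1) guard (a node with N' = 0 is already a
    solution, so it never really needs a priority) are my own choices.
    """
    k = len(substrings)
    total_len = sum(len(piece) for piece in substrings)
    return log10(max(target, 1)) + LAM * k + MU * depth - NU * total_len
```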
At each step it expands the best node (lowest p(x)), then chases down the best child from that node, the best child from that child etc. until it reaches the bottom of the tree, which is either a solution (N'=0) or a node which it can prove with (precalculated) min/max bounds can never reach a solution. This chasing-down phase prevents it purely expanding high-priority nodes and indefinitely postponing actually trying to find a solution.
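In outline, the control flow might look like the sketch below. The problem-specific pieces
(child generation, the precalculated min/max feasibility bounds, the solution test) are
passed in as callables, since this is just my reading of the description rather than the
actual implementation; the duplicate-node check from the next paragraph is omitted here.

```python
import heapq
from itertools import count

def best_first_with_chase(root, priority, children, is_solution, is_dead_end):
    """Sketch of the expand-then-chase-down loop described above.

    priority(node)    -> float, smaller is better
    children(node)    -> iterable of child nodes
    is_solution(node) -> the node has target 0
    is_dead_end(node) -> min/max bounds prove the node can never reach a solution
    """
    tie = count()                         # tie-breaker so the heap never compares nodes
    heap = [(priority(root), next(tie), root)]
    while heap:
        _, _, node = heapq.heappop(heap)  # expand the best unexpanded node
        while True:                       # chase the best child down to the bottom
            if is_solution(node):
                return node
            if is_dead_end(node):
                break                     # provably hopeless; return to the queue
            kids = sorted(children(node), key=priority)
            if not kids:
                break
            for kid in kids[1:]:          # queue the non-best children for later
                heapq.heappush(heap, (priority(kid), next(tie), kid))
            node = kids[0]                # ...and keep chasing the best one
    return None
```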
It also keeps a record of which nodes it has visited to avoid duplicate searches. If the same substrings occur in a different position then it treats these as the same, so it is doing more than avoiding duplicate bitmaps of where the substrings occur. This helps in the case where the original string is very self-similar (which is the difficult case). For example, if the original string is 1212126121212121212, then the two nodes (121212612,121212,xxxx) and (121212612,xxxx,121212) (where xxxx indicates a substring that has been removed) are regarded as the same, because the sets of remaining substrings are the same, even though they come from different positions in the original string.
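One way to implement this is to key the visited set on the multiset of remaining substrings
together with the current target, along these lines (again only a sketch of my reading of
the description):

```python
# Key the visited set on the *multiset* of remaining substrings plus the target,
# so the same fragments occurring at different positions in S collapse together.
visited = set()

def canonical_key(substrings, target):
    return (tuple(sorted(substrings)), target)

def seen_before(substrings, target):
    key = canonical_key(substrings, target)
    if key in visited:
        return True
    visited.add(key)
    return False

# The example from the text: both nodes have remaining substrings
# {121212612, 121212}, so they get the same key (the target value 7 here is
# just an arbitrary illustration).
assert canonical_key(("121212612", "121212"), 7) == canonical_key(("121212", "121212612"), 7)
```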
This is an example implementation in
python/pypy. It’s not written in a super-optimised way from the point of view of low-level
coding, e.g., the nodes are represented as tuples of tuples, rather than bitmaps or
something, but hopefully the underlying algorithm is robust enough that it doesn’t have
hidden bad cases.
For example, it solves fjzzq2002’s difficult example
111111111111111111111111111111111211111111111111111111111111111111311111111111111111111111111111111114111111111111111 122531659665
in about 0.25 seconds.
Solution 2 (two-phase algorithm)
This handles the case separately where there are many repetitions of a single digit in
S. It’s not meant to handle the general case of more random S, where picking off the
biggest number works reasonably well, so it ought to be used in conjunction with a more
general algorithm like solution 1.
My initial implementation of solution 1 was less efficient than it is now, so the
two-phase algorithm was necessary to handle the cases of large numbers of repetitions of a
single digit, but now solution 1 is reasonably good even in high repetition cases, so the
two-phase algorithm is less necessary.
Anyway, let’s say the majority digit is “1” to make the explanation simple - it works in a
similar way for any other non-zero digit (zero is treated slightly differently). It works
by first finding a “virtual” decomposition where (a) each non-1 is given a position within
its substring (effectively a power-of-10 weighting), and (b) the digits of S are grouped
into substrings of specified lengths, but at unspecified locations.
The next phase tries to realise this set of constraints in an actual decomposition of S
into substrings.
For example, in fjzzq2002’s example, there are three non-1 digits: 2, 3 and 4. The first
phase might find the following formula: digit 2 is in place 5, digit 3 is in place 7,
digit 4 is in place 3, and the multiset of substring lengths is 4, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 9,
9, 11, 12. This works because
(2-1)\times10^5+(3-1)\times10^7+(4-1)\times10^3+1111+111111\times4+11111111\times6+
111111111\times2+11111111111+111111111111=122531659665.
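That arithmetic is easy to re-check mechanically; the repunit helper below is only for the
check, not part of either solution:

```python
def repunit(length):
    return int("1" * length)          # e.g. repunit(4) == 1111

lengths = [4, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 9, 9, 11, 12]
base = sum(repunit(L) for L in lengths)                    # every summand as all 1s
corrections = (2 - 1) * 10**5 + (3 - 1) * 10**7 + (4 - 1) * 10**3
assert base + corrections == 122531659665
```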
Then phase 2 might find a realisation of this like
111111111+11111111+111111+11111111+11211111+111111+111111+11111111111+
111131111111+111111+11111111+111111111+11114111+1111+11111111
where the substring lengths are taken in the order 9, 8, 6, 8, 8, 6, 6, 11, 12, 6, 8, 9, 8, 4, 8.
(Note that these substrings include the non-1 digits.)
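The realisation can be checked the same way, just by re-doing the arithmetic from the text:

```python
parts = [111111111, 11111111, 111111, 11111111, 11211111, 111111, 111111,
         11111111111, 111131111111, 111111, 11111111, 111111111, 11114111,
         1111, 11111111]
assert sum(parts) == 122531659665
assert [len(str(p)) for p in parts] == [9, 8, 6, 8, 8, 6, 6, 11, 12, 6, 8, 9, 8, 4, 8]
```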
This is an example implementation. It switches
to the two-phase algorithm if the proportion of non-majority digits is less than 0.052, otherwise
it uses solution 1 above.
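So the top-level dispatch is essentially just a threshold test on the digit histogram,
something like the following sketch (the function only reports which solver would be used;
the solver names are placeholders):

```python
from collections import Counter

def choose_algorithm(S):
    """Report which solver to run, using the 0.052 threshold quoted above."""
    non_majority = len(S) - max(Counter(S).values())
    return "two-phase" if non_majority / len(S) < 0.052 else "priority search"

# e.g. choose_algorithm("1212126121212121212") == "priority search",
# while a 120-digit string with only three non-1 digits would go two-phase.
```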
Analysis part 1
First consider the case where S is a uniform random integer of size s. Fix S for
the moment and consider different values of n. For large n the problem is
overconstrained and morally there are no solutions, though there will always be the
planted one. If you decrease n (i.e., imagine different problem instances with smaller
and smaller n) then eventually there will be a kind of phase change at some critical
value, n_c(s), of n. Below n_c(s) the problem is underconstrained and unplanted
solutions will spontaneously pop into existence. In other words there will be many
solutions for n<n_c(s). In our case of s=120, I think (from experimentation) the
critical value, n_c(120) is something like 20 or 21. You can get a good experimental idea of
this by creating a problem instance and then solving it. If n>n_c(s) then you will
necessarily get back the same solution you started with, but if n<n_c(s) you will probably
get back some random alternative solution. (Really we only care about whether the large
back some random alternative solution. (Really we only care about whether the large
summands are unique. It’s possible that some of the smaller ones can be varied, but this
won’t affect the search very much.)
In these kinds of situations, the most difficult value of n, from the point of view of
finding a solution, is the critical value n_c(s). Or in other words, if you are the
setter and want to make the problem as hard as possible, you should choose something like
n=n_c(s). For n<n_c(s) there are lots of solutions, which means we only need to
search a fraction of the search-space to find one and we can use the extra freedom to our
advantage to search solution-rich areas of the search-space first. For n>n_c(s) there
is only the planted solution, and in general we have to search the whole search-space to
find it, but the bigger n is, the smaller the search-space, so again the most difficult
value of n is n_c(s).
What is n_c(s) and how does it affect the amount of searching you have to do? Here is a
very rough argument which is nonetheless hopefully good enough to get some idea of what is
going on. It relies on subtracting the biggest number from the current target value at
each step. Since there are other options, this should underestimate the chance of finding
a solution, so underestimate the critical value n_c(s). But hopefully subtracting the
biggest number is a good enough approximation to make this calculation somewhat
meaningful.
When you choose a summand from S you can think of it as using up the resource S in
order to reduce N as much as possible (keeping it nonnegative of course). It’s
essentially not a problem to reduce N too much, because it’s easy to back off and make
the sum smaller by breaking down the summands (e.g., replace 917 by 91+7). If you do this
then there will be plenty of freedom to choose any small sum you want. So really the game
is to keep reducing N more-or-less as much as possible. How much can you expect to do
this on the first step? Assuming n\ll s, there will be about s-n\approx s positions in
the string S to choose a substring of length n. The minimum of s random integers
less than N is about N/s, so in one step we expect to reduce N to about N/s at the
expense of decreasing s by n. (You can see how rough and ready this argument is. It
doesn’t distinguish between taking a chunk from the end of S which is less damaging than
taking a chunk out of the middle. And it doesn’t take account of the fact that sometimes
you don’t want to subtract the maximum possible value from N. But let’s cross our
fingers and continue.)
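The “minimum of s random integers” step can be sanity-checked numerically; the exact
expectation is close to N/(s+1), which is near enough to N/s for this rough argument:

```python
import random

def mean_min(N, s, trials=10000):
    """Average of the minimum of s uniform random integers below N."""
    return sum(min(random.randrange(N) for _ in range(s))
               for _ in range(trials)) / trials

# With N = 10**9 and s = 120 this comes out around 8.3e6, i.e. roughly N/s.
print(mean_min(10**9, 120))
```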
If we make the operations N\to N/s and s\to s-n continuous, we get differential equations:
\dot n = -\log_{10}(s)
\dot s = -n
Dividing them to get rid of the time-dependence, we get
dn/ds = \log_{10}(s)/n
which, separating variables (n\,dn=\log_{10}(s)\,ds) and integrating, has the solution
\frac12 n^2 = s\log_{10}(s)-s/\ln(10)+C
Whether n<0 (meaning we managed to reduce target to 0) or n>0 (we didn’t manage
to do so) at s=0 will determine whether there are spontaneous solutions or not, and if
n=0 at s=0 then we have criticality - you expect just about 1 non-planted solution. So
criticality corresponds to C=0, which gives
n_c(s)=\sqrt{2(s\log_{10}(s)-s/\ln(10))}=\sqrt{2s\log_{10}(s/e)}.
If you try s=120 in this formula you get n_c(120)=19.9. The true value is probably around 20 or 21, so, somewhat flukily, it has worked out more accurately than we had a right to expect.
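For reference, evaluating the formula numerically:

```python
from math import e, log10, sqrt

def n_c(s):
    # n_c(s) = sqrt(2 * s * log10(s / e))
    return sqrt(2 * s * log10(s / e))

print(n_c(120))   # ~19.87
```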
Analysis part 2
How is this related to the time taken to find a solution? (To be continued, maybe.)