×

PLUSEQ - Editorial

Practice

Contest

Author: Stacy Hong

Tester: Misha Chorniy

Editorialist: Bhuvnesh Jain

HARD

Prerequisites

Branch and Bound, Dynamic Programming, Big Integers, Hashing, DFS/Recursion

Problem

You are given a string $S$ and integer $N$. So need to place some '+' symbols in the strings $S$ so that the value of the expression equals $N$.

Explanation

A simple knapsack like dynamic programming is sufficient to pass this subtask. The dp state is as follows:

$$dp[i][j] = \text{1 if first 'i' characters of S can be partitioned such that the sum is 'j' else 0}$$

A recursive pseudo-code for the above logic is given below:


def solve(int idx, int remain):
if idx == len(s):
return remain == 0
if dp[idx][remain] != -1:
return dp[idx][remain]
dp[idx][remain] = 0
num = 0
for i in [idx, len(s) - 1]:
num = num * 10 + s[i] - '0'
if (num > remain):
break
if (solve(i + 1, remain - num)):
PLUS[i] = 1;
return dp[idx][remain] = 1
return dp[idx][remain];

solve(0, N)     # N = integer value of string s in input
for i in [0, len(s) - 1]:
if i and PLUS[i]:
print '+'
print s[i]



The above code works in time complexity $O(|S| * N)$ where $|S|$ is the length of string $S$ and $N$ is the given integer. The space complexity is also same. Note that Codechef servers allow programs using such large memory (around 1G) to pass but you can you bitset or other equivalent structures to reduce the memory of your program too. See the linked solution below for subtask 1 for more details.

The solution for this subtask also exists using hashing and dynamic programming.

Adding 2 numbers of sizes $x$ and $y$ leads to final result having size at most $max(x, y) + 1$. This is easy to prove. The size will be $max(x, y)$ if there is no carry-over in the end otherwise it will increase by 1. Extending the above logic to addition of $m$ number, we can conclude that if the numbers have lengths $x_1, x_2, \cdots, x_m$, then the length of final result is bounded by $(max(x_1 + x_2 + \cdots + x_m) + m - 1)$. You can easily prove it using induction.

Note that number of ways to partition the string $S$ into different expression is exponential. But using the above observation, you can the conclude the following fact:

Given a string $S$ and integer $N$, having the length as $n$, there is very less number way to achieve the target $N$ if $n$ is comparable to $|S|$. This is because most of the partitions will either have the maximum number in them as too low. Equivalently, if the number of partitions we put in $S$ is large, then achieving a larger target $N$ is not possible. This hints that for sufficiently large integers, the greedy technique of choosing a larger size partition and checking if the remaining part can add up to desired $N$ will run very fast in practice as the number of iterations will not be large. For example: $S = 114390211, N = 43915$

Considering greedily large size partition of $S$ such that their value is less than $N$, we have the following numbers: [11439, 14390, 43902, 39021]. (Note that the size of this selected numbers can be at most (|S| - n + 1).) Let us greedily start with $43902$. The remaining parts of $S$ are $(11, 11)$ and the required sum now is $(43915 - 43902) = 13$. Note that there are 2 ways to achieve it $(11 + 1 + 1) \text{ or } (1 + 1 + 11)$. As you can see, the number of iterations were less. The example is just a short one to make to understand how greedy recursion might behave for larger test cases.

But the other case where the integer $N$ is small but $|S|$ is very large, there can be large number of ways to achieve the desired result. For this, we have already seen that a dynamic programming solution already exists (Subtask 1). Trying the above greedy approach can be very bad in this case, a simple example being $(S = 99999999999, N = 108)$.

With some of the above ideas, we design a simple branch and bound based algorithm for the problem:


range_tried = [0 for i in [1, len(S) - 1]]
# Find all range in S such that their value is less than N and sort
# them in decreasing order of their value. Let it be called "RANGES"

def recurse(range_idx, remain):
if remain < 0:
return false
if remain < LIMIT:      # LIMIT ~ 100000
else:
for i in [range_idx, len(RANGES) - 1]:
if conflict(current_range, range_tried):
# If current range we are trying conflicts with one
# already tried in recursion before.
continue

X = value of integer for RANGES[i]
# Update range_tried
if recurse(range_idx, remain - X):
return True
# Update range_tried to old value
return False



Note the above is a simple solution based on the initial observations. But do we need to really check for all possible ranges? Can we decide greedily at some point that given range can never result in an answer as $N$, i.e. Say we have selected some ranges, can we say that with the remaining ones we can never achieve the target $N$ without checking all possible partitions or greedily checking large number ranges.

Actually, given initial choice of our ranges, we can bound the maximum number we can achieve with the remaining ones. A simple check which ignores the digits of remaining parts of $S$ and considers all of them to be $9$ and finds the maximum value possible is a good starting point. If this value is already less than $N$, then we can simple prune our solution. Even stronger checks based on actual values of digits in string $S$ can lead to better pruning. So, the loop in the above code modifies as follows:


for i in [range_idx, len(RANGES) - 1]:
if conflict(current_range, range_tried):
# If current range we are trying conflicts with one
# already tried in recursion before.
continue
MAX_POSSIBLE = get_max(ranges_set)
if MAX_POSSIBLE < N:
# As we iterate in decreasing order of numbers, the
# MAX_POSSIBLE can only decrease as 'i' increases
break

X = value of integer for RANGES[i]
# Update range_tried
if recurse(range_idx, remain - X):
return True
# Update range_tried to old value



Another thing we can do is to remove the early exit of recursion where a check is based on "remain < 0". This can be easily done by directly starting from ranges such that value of considered numbers is always less than "remain". This is again helpful as after greedily choosing a large size partition, it is possible in most case the other large size partitions should be ignored in further recursion either due to conflicts in common ranges or "remain" decreasing too fast to become less than $0$. For this, a simple binary search can help us to find the first index in "RANGES" from where we should begin our search. This is possible as we had initially stored our ranges in decreasing order of the value of integers they represent.

With the above ideas, the recursive solution based on branch and bound works quite fast in practice. A small analysis of the time complexity is as follows:

1. $N < 100000$: Assuming we set LIMIT to this value in above pseudo-code. A dynamic programming based solution is run with complexity $O(|S| * N)$. Again to optimise this, we can use recursive version instead of iterative version. This is because will work in exactly the above complexity but in recursive solution, most of the states will be unvisited due to our initial choices of the ranges. This will significantly reduce the constant factor in your code.

2. For most of the other cases, as $N$ becomes larger i.e. size $n$ reaches closer to $|S|$, the possible solutions reduce significantly as explained before.

A more detailed analysis of time complexity is will available soon.

View Content

Tips for Big integer library (mostly for C++ users)

Languages like python and java already have big integer library implemented in them. But C++ users need to implement the same for their usage in this problem. A small tip is to store the numbers as groups instead of single digits. For example: $S = 123456789123456789$. Below are 2 possible ways to store $S$:

$$S = 1|2|3|4|5|6|7|8|9|1|2|3|4|5|6|7|8|9$$

$$S = 123456|789123|456789$$

This helps to perform operations on base different than 10 and reduces the constant factor of your program. Generally, the base is chosen as ${10}^{9}$ so that all possible operations like $(+, -, * , /)$ fit inside the datatypes provided by the language. You can see setter's library for example.

Time Complexity

To be described in detail later.

Space Complexity

To be described in detail later.

The solution for subtask 1 can be found here

Setter's solution can be found here

Ashmelev's solution can be found here

This question is marked "community wiki".

6★likecs
3.7k1774
accept rate: 9%

19.3k348495534

"A more detailed analysis of time complexity is will available soon.",waiting for this eagerly :),please post it soon.

(21 Jun, 12:22)

"Soon"....for some reason this word brings up bad memories of past xD

(21 Jun, 13:21)

I have mailed the author of the problem regarding his complexity analysis about 10 days back but got no reply yet. The editorial is completely based on my understanding and his insights of his solution. I couldn't solve the problem and can't figure out the complexity myself.

(02 Jul, 04:03) 6★

@likecs - Thats sad :( . Can even @mgch not help here? He was the tester and is quite reachable as far as my experience goes. Please do try that!

(02 Jul, 20:29)

 6 The standard solution seems not very promising, for example, check this data: 111111111111111111111111111111111211111111111111111111111111111111311111111111111111111111111111111114111111111111111 122531659665 There should be a solution: 11111111+11111111+111111+111111+11111211111+11111111+111111111+111111+111131111111+111111+11111111+11111111+111114111+11111111+1111 While in fact no AC solution in contest (except java ones, I didn't have a java compiler installed) or even std can produce a solution in reasonably small time. For example see this link on ideone. answered 14 Jun, 07:16 76●2 accept rate: 100% 14.4k●1●16●53 (14 Jun, 07:17) I was first to solve PLUSEQ, but realize my solution is quite bad. It can handle cases with either small targets or targets with some large terms. It will completely fail for large targets consisting of a lot of similar sized small terms, like the case you gave. I honestly have no clue how to solve that case in a timely manner. (14 Jun, 07:41) algmyr7★ 1 Yeah, I think this problem itself is in a NPH-ish manner, so I was quite impressed to see it appear on formal contests with no restriction on the data :/ (14 Jun, 07:43) 1 This fairly simple solution in pypy takes about 1 minute to solve that case (compared with about 18 hours for Ashmelev's; I didn't time fjzzq's solution). It is more robust against self-similar instances like fjzzq's example, but it is quite a memory-hungry solution and can still get tripped up by some other self-similar examples. (19 Jun, 06:11) I tried fjzzq's program but it got stopped by a power cut(!) so all I know is that it takes at least 8 hours to find a solution. (19 Jun, 19:02) I used a different method to handle this kind of case (see my answer below). It now takes about 1 second in python to find a solution to fjzzq's awkward case. (25 Jun, 02:09) showing 5 of 6 show all

I don't know if anyone's still reading this editorial or if this comment is too late, but I have a solution for this problem that appears to avoid the extremely bad worst cases mentioned above.

Some notation:

$$S = \textrm{the string considered as a decimal integer}$$ $$s = \log_{10}(S) = \textrm{the length of }S$$ $$N = \textrm{the target value}$$ $$n = \log_{10}(N) = \textrm{the length of the target value}$$

Solution 1

There is a search tree whose root is $(S,N)$, and whose general node is (set of remaining substrings of $S$, target value). The children nodes are those you get by removing a substring from the current set of strings and subtracting its value from the current target to make the child's target.

This solution aims to strike a balance between breadth-first search and depth-first search, doing exploration (BFS) nearer the top of the tree and something like DFS nearer the bottom. Simple DFS is not as good because it refuses to spend a small amount of time at the top of the tree looking for a better situation to exhaustively search from.

Here there is a priority queue of unexpanded nodes, and the priority of node $x=((S_1,\ldots,S_k),N')$ is given by $$p(x)=\log(N')+\lambda.k+\mu.\textrm{depth}(x)-\nu\sum_i|S_i|$$

(smaller is better), where $|S_i|$ is the length of the substring $S_i$ (up to 120 here). $\lambda, \mu$ and $\nu$ are constants. I found $\lambda=7$, $\mu=1.75$, $\nu=0.1$ worked fairly well.

This prioritises smaller target values, less-fragmented set of substrings (smaller $k$), shallower depth in the tree and more total substring remaining to work with. (I'm sure there are better priority functions, but this one is reasonably good and simple to evaluate.)

At each step it expands the best node (lowest $p(x)$), then chases down the best child from that node, the best child from that child etc. until it reaches the bottom of the tree, which is either a solution ($N'=0$) or a node which it can prove with (precalculated) min/max bounds can never reach a solution. This chasing-down phase prevents it purely expanding high-priority nodes and indefinitely postponing actually trying to find a solution.

It also keeps a record of which nodes it has visited to avoid duplicate searches. If the same substrings occur in a different position then it treats these as the same, so it is doing more than avoiding duplicate bitmaps of where the substrings occur. This helps in the case where the original string is very self-similar (which is the difficult case). For example, if the original string is 1212126121212121212, then the two nodes (121212612,121212,xxxx) and (121212612,xxxx,121212) (where xxxx indicates a substring that has been removed) are regarded as the same, because the sets of remaining substrings are the same, even though they come from different positions in the original string.

This is an example implementation in python/pypy. It's not written in a super-optimised way from the point of view of low-level coding, e.g., the nodes are represented as tuples of tuples, rather than bitmaps or something, but hopefully the underlying algorithm is robust enough that it doesn't have hidden bad cases.

For example, it solves fjzzq2002's difficult example

11111111111111111111111111111111121111111111111111111111111 [continued] 1111111311111111111111111111111111111111114111111111111111 122531659665

Solution 2 (two-phase algorithm)

This handles the case separately where there are many repetitions of a single digit in $S$. It's not meant to handle the general case of more random $S$, where picking off the biggest number works reasonably well, so it ought to be used in conjunction with a more general algorithm like solution 1.

My initial implementation of solution 1 was less efficient than it is now, so the two-phase algorithm was necessary to handle the cases of large numbers of repetitions of a single digit, but now solution 1 is reasonably good even in high repetition cases, so the two-phase algorithm is less necessary.

Anyway, let's say the majority digit is "1" to make the explanation simple - it works in a similar way for any other non-zero digit (zero is treated slightly differently). It works by first finding a "virtual" decomposition where (a) the the non-1s are given positions within their substrings (effectively they are each given a power-of-10 weighting), and (b) the digits of S are restricted into substrings of specified lengths, but at unspecified locations.

The next phase tries to realise this set of constraints in an actual decomposition of $S$ into substrings.

For example, in fjzzq2002's example, there are three non-1 digits: 2, 3 and 4. The first phase might find the following formula: digit 2 is in place 5, digit 3 is in place 7, digit 4 is in 3, and the set of substring lengths are 4, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 9, 9, 11, 12. This works because

$$(2-1)\times10^5+(3-1)\times10^7+(4-1)\times10^3+1111+111111\times4+11111111\times6+$$ $$111111111\times2+11111111111+111111111111=122531659665.$$

Then phase 1 might find a realisation of this like

111111111+11111111+111111+11111111+11211111+111111+111111+11111111111+ 111131111111+111111+11111111+111111111+11114111+1111+11111111

where the substring lengths are taken in the order 9, 8, 6, 8, 8, 6, 6, 11, 12, 6, 8, 9, 8, 4, 8. (Note that these substrings include the non-1 digits.)

This is an example implementation. It switches to the two-phase algorithm if the proportion of non-majority digits is less than 0.052, otherwise it uses solution 1 above.

Analysis part 1

First consider the case where $S$ is a uniform random integer of size $s$. Fix $S$ for the moment and consider different values of $n$. For large $n$ the problem is overconstrained and morally there are no solutions, though there will always be the planted one. If you decrease $n$ (i.e., imagine different problem instances with smaller and smaller $n$) then eventually there will be a kind of phase change at some critical value, $n_c(s)$, of $n$. Below $n_c(s)$ the problem is underconstrained and unplanted solutions will spontaneously pop into existence. In other words there will be many solutions for $n<n_c(s)$. In our case of $s=120$, I think (from experimentation) the critical value, $n_c(120)$ is something like 20 or 21. You can get a good experimental idea of this by creating a problem instance and then solving it. If $n>n_c(s)$ then you have to get back the same solution you started with, but if $n<n_c(s)$ you will probably get back some random alternative solution. (Really we only care about whether the large summands are unique. It's possible that some of the smaller ones can be varied, but this won't affect the search very much.)

In these kinds of situations, the most difficult value of $n$, from the point of view of finding a solution, is the critical value $n_c(s)$. Or in other words, if you are the setter and want to make the problem as hard as possible, you should choose something like $n=n_c(s)$. For $n<n_c(s)$ there are lots of solutions, which means we only need to search a fraction of the search-space to find one and we can use the extra freedom to our advantage to search solution-rich areas of the search-space first. For $n>n_c(s)$ there is only the planted solution, and in general we have to search the whole search-space to find it, but the bigger $n$ is, the smaller the search-space, so again the most difficult value of $n$ is $n_c(s)$.

What is $n_c(s)$ and how does it affect the amount of searching you have to do? Here is a very rough argument which is nonetheless hopefully good enough to get some idea of what is going on. It relies on subtracting the biggest number from the current target value at each step. Since there are other options, this should underestimate the chance of finding a solution, so underestimate the critical value $n_c(s)$. But hopefully subtracting the biggest number is a good enough approximation to make this calculation somewhat meaningful.

When you choose a summand from $S$ you can thinking of it as using up the resource $S$ in order to reduce $N$ as much as possible (keeping it nonnegative of course). It's essentially not a problem to reduce $N$ too much, because it's easy to back off and make the sum smaller by breaking down the summands (e.g., replace 917 by 91+7). If you do this then there will be plenty of freedom to choose any small sum you want. So really the game is to keep reducing $N$ more-or-less as much as possible. How much can you expect to do this on the first step? Assuming $n\ll s$, there will be about $s-n\approx s$ positions in the string $S$ to choose a substring of length $n$. The minimum of $s$ random integers less than $N$ is about $N/s$, so in one step we expect to reduce $N$ to about $N/s$ at the expense of decreasing $s$ by $n$. (You can see how rough and ready this argument is. It doesn't distinguish between taking a chunk from the end of $S$ which is less damaging than taking a chunk out of the middle. And it doesn't take account of the fact that sometimes you don't want to subtract the maximum possible value from $N$. But let's cross our fingers and continue.)

If we make the operations $N\to N/s$ and $s\to s-n$ continuous, we get differential equations: $$\dot n = -\log_{10}(s)$$ $$\dot s = -n$$

Dividing them to get rid of the time-dependence, we get $$dn/ds = \log_{10}(s)/n$$

which has the solution $$\frac12 n^2 = s\log_{10}(s)-s/\log(10)+C$$

Whether $n<0$ (meaning we managed to reduce target to 0) or $n>0$ (we didn't manage to do so) at $s=0$ will determine whether there are spontaneous solutions or not, and if $n=0$ at $s=0$ then we have criticality - you expect just about 1 non-planted solution. So criticality corresponds to $C=0$, which gives $$n_c(s)=\sqrt{2(s\log_{10}(s)-s/\log(10))}=\sqrt{2s\log_{10}(s/e)}.$$

If you try $s=120$ in this formula you get $n_c(120)=19.9$. The true value is probably around 20 or 21, so it's a bit flukily worked more accurately than we had a right to expect.

Analysis part 2

How is this related to the time taken to find a solution? (To be continued, maybe.)

1924
accept rate: 25%

I would really like to know it!! Please explain it :)

(25 Jun, 11:35)

@alexthelemon , @vivek_1998299 is very much interested. Please explain your solution now :p :) xD

(25 Jun, 20:50)
1

I'll write something soon - sorry been a bit busy!

(29 Jun, 16:46)

I added something. Sorry, I was meaning to write more, but didn't have the time so I wrote what I had.

(05 Jul, 08:40)

simply amazing!!!

Thank you so much..

(05 Jul, 12:43)

Very decent explanation!!

(05 Jul, 15:16)
showing 5 of 6 show all
 1 I talked with ashmelev on codeforces and his solution goes like this Explanation in his words-> The first thing - we just use the branch and bound approach with recursive iteration function. So it looks like a simple DFS - lets select the first term (substring), fix it, then select the second term and so on. If we failed to find a correct solution in some branch, we go back and change the selected terms. Of course it works extremely long time and I think that it is impossible to prove that some optimizations can solve any test in the given time limit. Now, we have to improve the solution to make is significantly faster. In general there are two useful methods: 1. make operations asymptotically faster 2. detect states where we cannot find the solution and stop deeper recursion in such cases The first type optimization are just implementation-depended, there are no useful ideas regarding the problem. For example, if we are selecting the next substring which will be used as a single term, we should check, whether it intersects with already selected substrings (if we already took substring [1..5], we cannot use substring [4...11] for example). The naive implementation iterates through all the positions and checks whether they are empty. It requires about 120 (full string length) operations in the worst case. But we may replace array of 120 boolean values by two long long (64 bit) variables - w1, w2, and we assume that position X is empty if the Xth bit of w1 is 0 (or (X-60)th bit of w2 is 0 for X >= 60). So we have only 2 bitmask operations instead of 120. well this was one question that was easy to undertand but too difficult to fit in TL:) answered 25 Jun, 22:12 11●1 accept rate: 0%
 toggle preview community wiki:
Preview

By Email:

Markdown Basics

• *italic* or _italic_
• **bold** or __bold__
• image?![alt text](/path/img.jpg "title")
• numbered list: 1. Foo 2. Bar
• to add a line break simply add two spaces to where you would like the new line to be.
• basic HTML tags are also supported
• mathemetical formulas in Latex between \$ symbol

Question tags:

×15,005
×1,879
×1,261
×179
×87
×22
×16
×10