PROBLEM LINKSDIFFICULTYHARD PREREQUISITESString Parsing, Predictive Parsing of an LL(1) Grammar, Regular Expressions PROBLEMYou are given a string which defines what an ID should look like followed by a list of IDs. You have to report which one of them are valid. QUICK EXPLANATIONRefer to the problem statement to carefully understand the description of the grammar for parsing the expressions that describe the valid IDs for each test case. Following that description should lead you to the following context free grammar (from the Problem Setter's notes). Tokens or : "or" times : "times" optional: "optional" letter : "letter" letters : "letters" digit : "digit" digits : "digits" to : "to" exactly : "exactly" upto : "upto" l : [AZ] d : [09] num : [09]{2,} //take care of leading zeroes separately string : [AZ09]+ Grammar S : E S' ; S' : E S'  ; E : T E' ; E' : or E  ; T : F T' ; T' : num T''  d T''  optional T'  ; T'' : times T'  ; F : d F1  l F2  upto num F3  upto d F3  exactly num F4  exactly d F4  digit  letter  string  num  ( S ) ; F1 : to d  ; F2 : to l  ; F3 : digits  letters ; F4 : digits  letters ; Note that "" followed by nothing indicates the 'epsilon production'. The grammar is indeed a LL(1) grammar. This means that you can always figure out which of the above rules to apply after parsing some part of the string, just by looking at the next token. EXPLANATIONAfter taking the input you may first want to clean the spaces in the expression
This ensures that all tokens are now separated by single space. Now you can tokenize the string and mark each token to the type of token provided. The next step is to parse the list of tokens according to the rules of the grammar provided. It is important to note that given a LL(1) grammar, you can parse the array of tokens in linear time. The tester's solution below uses the Recursive Descent Parsing with the simplification that because the grammar is LL(1), you never have to back track. The outcome of parsing the expression is creating a meaningful regular expression. As well as, calculating the largest possible ID that can be generated so as to discard the case if it exceeds the bound on the length of the IDs. The GNU C library has <regex.h> include that lets you use create a regex using the POSIX regular expressions syntax. Once you have the regular expression compiled, you can execute it on all the IDs to check whether they match or they don't. Refer to the Setter's Solution for a clean implementation of the Predictive parsing and learn how the various rules map to ideas in POSIX regular expressions. Each rule's corresponding method are named accordingly and the approach is easy to follow. SETTER'S SOLUTIONCan be found here. (some permission issue on uploaded solution, fixing ASAP) TESTER'S SOLUTIONCan be found here.
This question is marked "community wiki".
asked 15 May '13, 21:41

I cannot access the setter's and tester's solutions at the moment, so I am not entirely sure, but, from reading the editorial, it seems that the algorithmically interesting part (deciding if a string matches a regular expression) was "delegated" to the regex.h header/library. What is the time complexity of: 1) "compiling" a regular expression ? (I am using the term from the editorial) 2) deciding if a given string matches the regular expression ? For me, efficiently deciding if a given string matches the regular expression was the most interesting part of the problem. I first built an evaluation tree for the expression (although not in linear time because of parantheses) and then I built a nondeterministic finite automaton (NFA) out of the evaluation tree (the NFA can be built in linear time in the size of the tree). Then, using the NFA, we can decide if a string of length L matches the initial expression or not in O(L * N), where N = the number of states of the NFA (although this is usually much faster, because the number of matched states at each step is usually much smaller than N). The alternative would have been to turn the NFA into a deterministic finite automaton (DFA) in which case deciding if a string matches the regular expression takes only O(L) time. The drawback is that the DFA may have an exponential number of states relative to the number of states of the NFA (and even if this were not the case in this problem, building the DFA would still be algorithmically challenging). Personally I would have expected some discussion about automata in the editorial (because that's what regex.h probably uses). answered 15 May '13, 23:40
i totally agree with you. i didn't succeed in solving this problem because of the second part (meaning matching the string ids to their regexp format). the re module in Python is waaayyyy too slow to handle the number of strings provided by the input test case, and i was convinced that this part could have been implemented with a finite state automaton (but unfortunately, i've been really busy at work at the end of the contest, and i've not been able to find time to implement it properly).
(16 May '13, 00:52)
2
From the discussions on this problem and going through setter's and tester's solutions I assumed that this was the approach I could most easily explain. I never bothered to read and understand how a regex works, but do have some idea about FSM. I did some googling and found this link very helpful in understanding just what, implementing regex is about. The implementation varies from language to language. glibc implements NFA Algorithm with caching to DFA. Python uses backtracking.
(16 May '13, 01:31)
1
oh ! thank you for the link, it's exactly what i was looking for !
(16 May '13, 03:44)

Actually, it's impossible to solve the problem using
Actually much tougher tests could be created but even this one seems tough enough. I even believe that it is impossible to solve the problem by any known algorithm under the given constraints. answered 17 May '13, 12:23
3
@anton_lunyov: I can confirm that my solution CANNOT solve this test case under the given time constraints (the reason is, of course, the large number of states and transitions in the NFA). Since the NFA is too slow, I am wondering if it could be turned into a DFA (with a sufficiently low number of states and sufficiently quickly)... Now I am also wondering if anyone could solve the problem within the given time constraints. I guess I could try to run the source code of the other contestants, but I do not have enough time now...
(17 May '13, 14:10)
2
@mugurelionut @anton_lunyov @vinayak.garg My solution takes about 2 minutes on this input! The NFA my code constructs has 43800 states for this input. During the contest, I was looking for ways to convert the NFA into a DFA. I had submitted a solution using only NFA to check some assertions when, to my surprise, it got accepted. I suspected that the test cases were weak, but I never expected that it would take 2 minutes.
(17 May '13, 14:45)

My solution in practice section. It has 3 parts
Note
EDIT  The test cases were weak for the given constraints on length of F and length of IDS. Ideally I should have lowered them further. I apologize if it prevented you from solving the problem. answered 16 May '13, 01:37

I forgot that after, I invented even tougher test case where NFA (quite compressed) has about 160000 states. Here the testgen:
The structure of the test is
but with randomness in choosing pairs of chars. Constraint 1000 allows to use up to 160 pairs of chars. So before applying @balajiganapath UPD answered 17 May '13, 14:54

My opinion is that is kind of akward that an editorial indicates that a problem should be solved by "GNU C library" (probably the submission should have been restricted to C/C++). I tried the same thing in Java but it's too slow. Related with NFA @mugurelionut: can you please explain the matching in O(N*L). I have tried a simple DF traversal on NFA with first matching char edges and only after empty edges. Thanks. answered 16 May '13, 11:57
1
Well.. we start at the initial state of the NFA and "expand" the set of reachable states along the empty edges. Then, for each character from the string, we first move along the char edge from each reachable state and then we "expand" the set of (new) reachable states along the empty edges. Since each state has O(1) empty edges in this NFA (unless I computed something wrong), each "expansion" along empty edges takes only O(N) time (e.g. use a BFS and initially add to the queue all the reachable states). Thus, you get O(N*L) overall.
(17 May '13, 13:56)
@mugurelionut  I have tried this exact algorithm in java but it seems that java is not good enough (TLE). Thak you very much for explanation. My solution here  http://www.codechef.com/viewsolution/2164945
(17 May '13, 14:33)
@inseder: I ran your solution and it takes 33 sec on first file (2.0 KB) and gives WA. One case it fails on 1 (((38)or(38)or(38))or(UM)) 2 38 UM It is quite strange that NFA approach is taking so much time. I unfortunately am not comfortable enough with java to find the reason.
(17 May '13, 18:18)

BTW. answered 17 May '13, 19:53

mugurelionut is right, if we use regex.h to do the job, this problem will become fairly easy. I think the toughest part of the problem was upto N digits/letters and there should have been at least some explanation of it answered 16 May '13, 00:59
Yes, @pulkit has done a similar thing. Nice solution though. http://www.codechef.com/viewsolution/2122210
(16 May '13, 01:31)
1
@bhambya I am sorry the solutions are still not up :( I have asked the codechef team to fix the permission issues, but the recent migrations of the site (for performance) has kept them quite busy. The upto N digits part is actually very simple using POSIX style regex(s) by using [09]{0,N} or [AZ]{0,N}. I left that part to be discovered from the Setter's Solution which is really very very clean.
(16 May '13, 01:39)

When will be setter's and tester's sol accessible
Setter's solution  http://www.codechef.com/viewsolution/2164314