×

HARD

# PREREQUISITES

String Parsing, Predictive Parsing of an LL(1) Grammar, Regular Expressions

# PROBLEM

You are given a string which defines what an ID should look like followed by a list of IDs. You have to report which one of them are valid.

# QUICK EXPLANATION

Refer to the problem statement to carefully understand the description of the grammar for parsing the expressions that describe the valid IDs for each test case. Following that description should lead you to the following context free grammar (from the Problem Setter's notes).

Tokens

or      : "or"
times   : "times"
optional: "optional"
letter  : "letter"
letters : "letters"
digit   : "digit"
digits  : "digits"
to      : "to"
exactly : "exactly"
upto    : "upto"
l       : [A-Z]
d       : [0-9]
num     : [0-9]{2,} //take care of leading zeroes separately
string  : [A-Z0-9]+

Grammar

S   : E S'
;

S'  : E S'
|
;

E   : T E'
;

E'  : or E
|
;

T   : F T'
;

T'  : num T''
| d T''
| optional T'
|
;

T'' : times T'
|
;

F   : d F1
| l F2
| upto num F3
| upto d F3
| exactly num F4
| exactly d F4
| digit
| letter
| string
| num
| ( S )
;

F1  : to d
|
;

F2  : to l
|
;

F3  : digits
| letters
;

F4  : digits
| letters
;


Note that "|" followed by nothing indicates the 'epsilon production'.

The grammar is indeed a LL(1) grammar. This means that you can always figure out which of the above rules to apply after parsing some part of the string, just by looking at the next token.

# EXPLANATION

After taking the input you may first want to clean the spaces in the expression

• Remove repeated spaces

This ensures that all tokens are now separated by single space. Now you can tokenize the string and mark each token to the type of token provided.

The next step is to parse the list of tokens according to the rules of the grammar provided. It is important to note that given a LL(1) grammar, you can parse the array of tokens in linear time. The tester's solution below uses the Recursive Descent Parsing with the simplification that because the grammar is LL(1), you never have to back track.

The outcome of parsing the expression is creating a meaningful regular expression. As well as, calculating the largest possible ID that can be generated so as to discard the case if it exceeds the bound on the length of the IDs.

The GNU C library has <regex.h> include that lets you use create a regex using the POSIX regular expressions syntax.

Once you have the regular expression compiled, you can execute it on all the IDs to check whether they match or they don't.

Refer to the Setter's Solution for a clean implementation of the Predictive parsing and learn how the various rules map to ideas in POSIX regular expressions. Each rule's corresponding method are named accordingly and the approach is easy to follow.

# SETTER'S SOLUTION

Can be found here. (some permission issue on uploaded solution, fixing ASAP)

# TESTER'S SOLUTION

Can be found here.

This question is marked "community wiki".

2.4k128183169
accept rate: 14%

3.7k113249

When will be setter's and tester's sol accessible

(16 May '13, 00:56)
2
(16 May '13, 01:57)

 10 I cannot access the setter's and tester's solutions at the moment, so I am not entirely sure, but, from reading the editorial, it seems that the algorithmically interesting part (deciding if a string matches a regular expression) was "delegated" to the regex.h header/library. What is the time complexity of: 1) "compiling" a regular expression ? (I am using the term from the editorial) 2) deciding if a given string matches the regular expression ? For me, efficiently deciding if a given string matches the regular expression was the most interesting part of the problem. I first built an evaluation tree for the expression (although not in linear time because of parantheses) and then I built a non-deterministic finite automaton (NFA) out of the evaluation tree (the NFA can be built in linear time in the size of the tree). Then, using the NFA, we can decide if a string of length L matches the initial expression or not in O(L * N), where N = the number of states of the NFA (although this is usually much faster, because the number of matched states at each step is usually much smaller than N). The alternative would have been to turn the NFA into a deterministic finite automaton (DFA) in which case deciding if a string matches the regular expression takes only O(L) time. The drawback is that the DFA may have an exponential number of states relative to the number of states of the NFA (and even if this were not the case in this problem, building the DFA would still be algorithmically challenging). Personally I would have expected some discussion about automata in the editorial (because that's what regex.h probably uses). answered 15 May '13, 23:40 10.0k●26●69●90 accept rate: 18% i totally agree with you. i didn't succeed in solving this problem because of the second part (meaning matching the string ids to their regexp format). the re module in Python is waaayyyy too slow to handle the number of strings provided by the input test case, and i was convinced that this part could have been implemented with a finite state automaton (but unfortunately, i've been really busy at work at the end of the contest, and i've not been able to find time to implement it properly). http://www.codechef.com/viewsolution/2134402 (16 May '13, 00:52) 2 From the discussions on this problem and going through setter's and tester's solutions I assumed that this was the approach I could most easily explain. I never bothered to read and understand how a regex works, but do have some idea about FSM. I did some googling and found this link very helpful in understanding just what, implementing regex is about. The implementation varies from language to language. glibc implements NFA Algorithm with caching to DFA. Python uses backtracking. (16 May '13, 01:31) 1 oh ! thank you for the link, it's exactly what i was looking for ! (16 May '13, 03:44)
 6 Actually, it's impossible to solve the problem using regex.h for every possible input under the given constraints. After several annoying hours of thinking how to solve the problem and reading several advanced papers where the best known solutions to regexp matching problem were described, I've found test case where NFA has a lot of states and a lot of epsilon-transitions (about 30000 states), while it accepts only strings of length upto 2000. I've decided to mail the testgen to feedback@codechef.com and what was the surprise when author responded that his solution is too slow on this test and official test data are "easy solvable" using some "practical approach". Here is the testgen of this testcase: #include #include #include using namespace std; int main() { freopen("XXX.in","wb",stdout); string base = "(upto 12 digits 12 times ab)12 times"; string col[100]; int k=0; for(int a=0;a<10;a++) for(int b=0;b<10;b++) { base[25] = a+'0'; base[26] = b+'0'; col[k++] = base; } int T = 100; printf("%d\n", T); for(int t=0;t
 3 My solution in practice section. It has 3 parts- DFA to do lexical analysis. Which breaks Format string into tokens and lexemes. Predictive parser for LL(1) grammar. The grammar of Format string is represented as- S -> E S | E E -> T or E | T T -> F num times T | F optional T | F F -> d to d | l to l | digit | letter | upto num digits | exactly num digits | upto num letters | exactly num letters | string | ( S ) But this grammar needs to be left-factored and the left-recursion needs to be removed before building predictive parser. The editorial shows Left factored and left recursion eliminated version of this grammar. In predictive parsing, we essentially build a parsing table which has some production to be applied for each input symbol. This in turn requires building First and Follow Set from Left factored and left recursion eliminated grammar. I am skipping the details as the theory is best read from book. Essential fact is that this parsing is linear time. And apart from parsing we perform some actions for each production (or rule in grammar) applied. I performed three actions while parsing- 1. Error checking - If there is no matching entry in parsing table for current symbol, the format is erroneous and rejected. 2. Construct regex pattern. More below. 3. Check if format can match ID string which exceed constraint on ID string. It was quite easy to see that the format string are simply describing regular expressions. Like A to D can be converted to [A-D]. Each keyword (optional, to, or, times etc.) could be converted to regex pattern. So while parsing we construct this pattern string. And now we "compile" it using regex.h regcomp(). Once compiled we simply match each ID string using regexec() in linear time. Note- A more elegant solution is "Recursive descent parser", but since it uses backtracking, it is too slow. No one used predictive parsing in solution. It is too much of theory and not much used in practice. Also it is a top-down parser, while there are bottom-up parser also which are much more powerful. Only few used regex library in the solution. Most language (C, python, Java) have it. STL regex could not be used unfortunately. :( As it requires C++11. And regex.h is not available on MSVC. EDIT - The test cases were weak for the given constraints on length of F and length of IDS. Ideally I should have lowered them further. I apologize if it prevented you from solving the problem. answered 16 May '13, 01:37 3.7k●11●32●49 accept rate: 18%
 3 I forgot that after, I invented even tougher test case where NFA (quite compressed) has about 160000 states. Here the testgen: #include #include #include using namespace std; int main() { freopen("XXXX.in","wb",stdout); string suffix = ")optional 10 times 9 times A)11 times"; int T = 100; printf("%d\n", T); for(int t=0;t 1000) exit(1); puts(pat.c_str()); int Q = 1000; printf("%d\n",Q); for(int q=0;q
 1 My opinion is that is kind of akward that an editorial indicates that a problem should be solved by "GNU C library" (probably the submission should have been restricted to C/C++). I tried the same thing in Java but it's too slow. Related with NFA @mugurelionut: can you please explain the matching in O(N*L). I have tried a simple DF traversal on NFA with first matching char edges and only after empty edges. Thanks. answered 16 May '13, 11:57 6★inseder 350●4●8●19 accept rate: 0% 1 Well.. we start at the initial state of the NFA and "expand" the set of reachable states along the empty edges. Then, for each character from the string, we first move along the char edge from each reachable state and then we "expand" the set of (new) reachable states along the empty edges. Since each state has O(1) empty edges in this NFA (unless I computed something wrong), each "expansion" along empty edges takes only O(N) time (e.g. use a BFS and initially add to the queue all the reachable states). Thus, you get O(N*L) overall. (17 May '13, 13:56) @mugurelionut - I have tried this exact algorithm in java but it seems that java is not good enough (TLE). Thak you very much for explanation. My solution here - http://www.codechef.com/viewsolution/2164945 (17 May '13, 14:33) inseder6★ @inseder: I ran your solution and it takes 33 sec on first file (2.0 KB) and gives WA. One case it fails on 1 (((38)or(38)or(38))or(UM)) 2 38 UM It is quite strange that NFA approach is taking so much time. I unfortunately am not comfortable enough with java to find the reason. (17 May '13, 18:18)
 1 BTW. Here is probably the most recent article about regexp matching problem that improves all algorithms known before. I learned it more or less fully but algorithm described there is quite cumbersome to code and improvement of asymptotic is rather small comparing with classical simple Thompson NFA approach, so possibly it will not give any time improvement. answered 17 May '13, 19:53 6.7k●62●119●138 accept rate: 12%
 0 mugurelionut is right, if we use regex.h to do the job, this problem will become fairly easy. I think the toughest part of the problem was upto N digits/letters and there should have been at least some explanation of it answered 16 May '13, 00:59 2★bhambya 310●2●5 accept rate: 16% Yes, @pulkit has done a similar thing. Nice solution though. http://www.codechef.com/viewsolution/2122210 (16 May '13, 01:31) 1 @bhambya I am sorry the solutions are still not up :( I have asked the codechef team to fix the permission issues, but the recent migrations of the site (for performance) has kept them quite busy. The upto N digits part is actually very simple using POSIX style regex(s) by using [0-9]{0,N} or [A-Z]{0,N}. I left that part to be discovered from the Setter's Solution which is really very very clean. (16 May '13, 01:39)
 toggle preview community wiki:
Preview

By Email:

Markdown Basics

• *italic* or _italic_
• **bold** or __bold__
• image?![alt text](/path/img.jpg "title")
• numbered list: 1. Foo 2. Bar
• to add a line break simply add two spaces to where you would like the new line to be.
• basic HTML tags are also supported
• mathemetical formulas in Latex between \$ symbol

Question tags:

×15,683
×1,343
×18
×12
×8
×1

question asked: 15 May '13, 21:41

question was seen: 3,775 times

last updated: 17 May '13, 19:53