1. Pattern matching: Find one of a specified set of strings in text.
2. Regular expression : a notation to specify a set of strings.
-- basic operations
-- additional operations: can be expressed by basic operations. eg. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*
3. Duality between REs and DFAs
-- RE: Concise way to describe a set of strings.
-- DFA: Machine to recognize whether a given string is in a given set.
-- Kleene's theorem.
- For any DFA, there exists a RE that describes the same set of strings.
- For any RE, there exists a DFA that recognizes the same set of strings.
-- Bad news : DFA may have exponential # of states.
4. Regular-expression-matching NFA.
-- RE enclosed in parentheses.
-- One state per RE character (start = 0, accept = M).
-- ε-transition: change state, but don't scan text.
-- match transition: change state and scan to next text char.
-- Accept if any sequence of transitions ends in accept state.(after scanning all text characters)
5. How to determine whether a string is matched by an automaton?
-- DFA: Deterministic ⇒ easy because exactly one applicable transition.
-- NFA: Nondeterministic ⇒ can be several applicable transitions; need to select the right one!
6. NFA representation
-- State names: Integers from 0 to M(number of symbols in RE).
-- Match-transitions: Keep regular expression in array re[]. Use (re[position] == input character) to verify the match transition.
-- ε-transitions. Store in a digraph G.
7. NFA simulation
-- Maintain set of all possible states that NFA could be in after reading in the first i text characters.
-- perform reachability in ε-transitions diagraph : Run DFS from each source (possible states after match transition), without unmarking vertices. (Runs in time proportional to E + V)
-- Implementation
public boolean recognizes(String txt) { Bag<Integer> ps = new Bag<Integer>(); // possible states after ε-transitions DirectedDFS dfs = new DirectedDFS(G, 0); // G is ε-transitions diagraph for (int v = 0; v < G.V(); v++) //states reachable from start by ε-transitions if (dfs.marked(v)) pc.add(v); for (int i = 0; i < txt.length(); i++) { Bag<Integer> match = new Bag<Integer>(); //possible states after match transition for (int v : pc) { if (v == M) continue; if ((re[v] == txt.charAt(i)) || re[v] == '.') // match transition match.add(v+1); } dfs = new DirectedDFS(G, match); // DFS search from set of vertices ps = new Bag<Integer>(); for (int v = 0; v < G.V(); v++) //follow ε-transitions if (dfs.marked(v)) pc.add(v); } for (int v : pc) if (v == M) return true; return false; }
-- Determining whether an N-character text is recognized by the NFA corresponding to an M-character pattern takes time proportional to MN in the worst case.
Pf. For each of the N text characters, we iterate through a set of states of size no more than M and run DFS on the graph of ε-transitions. [The NFA construction we will consider ensures the number of edges ≤ 3M.]
8. Building an NFA corresponding to an RE
-- States: Include a state for each symbol in the RE, plus an accept state.
-- Concatenation: Add match-transition edge from state corresponding to characters in the alphabet to next state.
-- Parentheses: Add ε-transition edge from parentheses to next state.
-- Closure: Add three ε-transition edges for each * operator.
-- Or: Add two ε-transition edges for each | operator.
-- Implementation
private Digraph buildEpsilonTransitionDigraph() { Digraph G = new Digraph(M+1); Stack<Integer> ops = new Stack<Integer>(); for (int i = 0; i < M; i++) { int lp = i; if (re[i] == '(' || re[i] == '|') ops.push(i); //left parentheses and |,push onto stack else if (re[i] == ')') { int or = ops.pop(); if (re[or] == '|') { lp = ops.pop(); G.addEdge(lp, or+1); G.addEdge(or, i); } else lp = or; } if (i < M-1 && re[i+1] == '*') { //closure,needs 1-character lookahead G.addEdge(lp, i+1); G.addEdge(i+1, lp); } if (re[i] == '(' || re[i] == '*' || re[i] == ')') G.addEdge(i, i+1); } return G; }
-- Building the NFA corresponding to an M-character RE takes time and space proportional to M.
Pf. For each of the M characters in the RE, we add at most three ε-transitions and execute at most two stack operations.
9. Generalized regular expression print
-- Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE.
-- Implementation:
public class GREP { public static void main(String[] args) { String re = "(.*" + args[0] + ".*)"; NFA nfa = new NFA(re); while (StdIn.hasNextLine()) { String line = StdIn.readLine(); if (nfa.recognizes(line)) StdOut.println(line); } } }
-- Worst-case for grep (proportional to M N ) is the same as for brute-force substring search.
10. Regular expressions in Java
-- String.matches(String regex)
//basic RE matching public class Validate { public static void main(String[] args) { String regexp = args[0]; String input = args[1]; StdOut.println(input.matches(re)); } }
-- java.util.regexp.Pattern and java.util.regexp.Matcher classes:
//Print all substrings of input that match a RE. public class Harvester { public static void main(String[] args) { String regexp = args[0]; In in = new In(args[1]); String input = in.readAll(); Pattern pattern = Pattern.compile(regexp);//compile() creates a Pattern (NFA) from RE Matcher matcher = pattern.matcher(input);//matcher() creates a Matcher (NFA simulator) from NFA and text while (matcher.find()) // find() looks for the next match { //group() returns the substring most recently found by find() StdOut.println(matcher.group()); } } }
相关推荐
Mastering Regular Expressions(3rd) 英文无水印pdf 第3版 pdf所有页面使用FoxitReader和PDF-XChangeViewer测试都可以打开 本资源转载自网络,如有侵权,请联系上传者或csdn删除 本资源转载自网络,如有侵权,...
正则表达式(Regular Expressions)是一种强有力的文本匹配工具,用于在字符串中执行模式匹配和提取信息。从给出的文件内容来看,我们正在讨论一本关于正则表达式的电子书——《Introducing Regular Expressions》,...
《Mastering Regular Expressions》(第三版)是正则表达式领域的权威著作,由拥有近30年开发经验的专家Jeffrey E.F. Friedl撰写。这本书深入浅出地介绍了正则表达式的概念、语法以及实际应用,是编程者提升正则...
而Delphi自XE2版本起,内置了RegularExpressions组件,它基于.NET Framework的System.Text.RegularExpressions类库,提供了一套原生的正则表达式支持。虽然它可能没有PerlRegEx那么灵活,但对于大部分日常的正则...
《Wrox - Beginning Regular Expressions》是一本专为初学者设计的正则表达式入门教程。这本书深入浅出地介绍了正则表达式的基本概念、语法和应用,旨在帮助读者掌握这一强大的文本处理工具。 正则表达式(Regular ...
#### 标题:Mastering Regular Expressions - **主要内容**:本书深入探讨了正则表达式的高级用法和技术细节,旨在帮助读者掌握正则表达式的各个方面。 #### 描述:Mastering Regular Expressions.pdf - **内容...
**"Regular Expressions Cookbook.pdf"** 这个标题明确指出本书的主题是正则表达式(Regular Expressions,简称 Regex)。正则表达式是一种强大的文本处理工具,被广泛应用于搜索、替换以及解析文本等任务中。...
PCRE(Perl Compatible Regular Expressions)是一个Perl库,包括 perl 兼容的正规表达式库.这些在执行正规表达式模式匹配时用与Perl 5同样的语法和语义是很有用的。Boost太庞大了,使用boost regex后,程序的编译速度...
Mastering Python Regular Expressions 英文无水印pdf pdf所有页面使用FoxitReader和PDF-XChangeViewer测试都可以打开 本资源转载自网络,如有侵权,请联系上传者或csdn删除 本资源转载自网络,如有侵权,请...
正则表达式(Regular Expressions)是一种强大的文本处理工具,用于在字符串中执行搜索、替换、提取等操作,它是一种在计算机科学和编程领域广泛使用的工具。正则表达式被设计为一种模式,能够匹配一系列符合特定...
To introduce readers to regular expressions in several technologies. While the material is primarily for people who have little or no experience with regular expressions, there is also some content ...
PCRE(Perl Compatible Regular Expressions)是一个Perl库,包括 perl 兼容的正则表达式库。这些在执行正规表达式模式匹配时用与Perl 5同样的语法和语义是很有用的。
书名:Mastering Regular Expressions, 3rd Edition 格式:CHM 语言:English 简介: Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard ...
本部分内容主要介绍了正则表达式的相关知识,包括锚点、字符集、特殊字符、字符类、量词、模式修饰符、逃脱字符、正则表达式元字符、前后匹配、位置匹配等。 1. 锚点:锚点是正则表达式中的特殊字符,用于指定匹配...