`
hax
  • 浏览: 962468 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

写对正则:一行代码,速度差50倍

    博客分类:
  • JS
阅读更多

2009-05-11

A lesson of RegExp: 50x faster with just one line patch

While I'm developing WebSHi (which is the fastest syntax highlighter written by JavaScript), I also write many performance testings for other rivals. One of them is SyCODE Syntax Highlighter , which is written by silverdrag (水月). It derives from the famous SyntaxHighlighter 1.5.x (dp.SH for short) and as silverdrag's words, it should be 5x to 10x faster than original dp.SH .

But unfortunately, my testings can't prove it. Though it won't trigger the "script slowly" dialog like dp.SH when highlighting large file, in most cases, it only shows 2x faster than dp.SH on IE6. On the other side, when I tested it on FF, I was so surprised that SyCODE is extremely slow, it will cost 5s+ for processing a 700 lines JavaScript file while the original dp.SH only half second.

It's very strange, so I digged into SyCODE. I found that SyCODE highlights more words for JavaScript language (currently all my testcases are to highlight some JavaScript source code files). The original dp.SH (and most other rivals) only highlights the keywords of JavaScript language. SyCODE also highlights global names like Array , Boolean , String , etc., and properties and methods like alert , charAt , onclick etc.. That means SyCODE need to do more text searching and replacement. I disabled such features and tested again, this time SyCODE is 2x faster than dp.SH.

So you will think the problem is just the extra words replacement. And what interesting is it just affect FF a lot, even SyCODE do more text processing, it's still faster than (or at least as fast as) dp.SH on other browsers (Safari, Chrome, Opera and IE).

I'm curious about the root cause of the problem. After some researching, I located it. Just one simple function:


GetKeywords: function(str) {
 return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
},

The function GetKeywords is used to generate a regexp for keywords search and replacement. For example, GetKeywords("abstract break byte case catch") will return a regexp /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ .

The code is straightforward, but it's bad and generate a very inefficient regexp.

The keypoint is \b , \b is a word boundary assertion. To test whether a position is a word boundary, the regexp engine need to consider both the left character of the position and the right character of the position. If one is a word character (aka a-z, A-Z, 0-9 and the underscore "_") and the other is not, then it's a word boundary. You see it need both look forward one char and look backward one char. Though \b assertion is not very expensive, each failed match of /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ will do such look forward/backward 10 times, and JavaScript language has 50+ keywords means each failed match will do 100 times, and SyCODE add 400+ properties/methods words means extra 800+ times!

Of coz, \b assertion can be easily optimized, but our test result shows that FF's regexp engine doesn't do a good optimization at all.

Anyway, there is a very cheap way to solve the problem. Most of those \b assertions are unnecessary . /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ can be rewrite as /\b(?:abstract|break|byte|case|catch)\b/ , those two regexp are equal, the only difference is the latter only need two \b assertion. Yes, we just need two , even SyCODE add 400+ words, we still just need two .

It's trivial to fix GetKeywords :


GetKeywords: function(str) {
 return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';

 return '\\b(' + str.replace(/\s+/g, '|') + ')\\b';

},

Let's see the result of applying this one line patch:

Test results of FF3 code lines original patched
700 6.7s 0.1s
1600 15.5s 0.3s
4300 41.5s 0.7s

Oops, one line code cause 50x difference.

Besides FF, the patch also help other browsers a lot.

Test results of 4300 lines of code browser original patched
IE6 6.3s 2.8s
Safari3 2.6s 0.8s
Opera9 8.2s 1.7s
Chrome1 2.3s 0.5s

As we can see, even Chrome, which introduce a very optimized regexp engine, also shows at least 4x difference.

SyCODE derives from dp.SH, the GetKeywords function is also the legacy from dp.SH, and even the new SyntaxHighlighter 2 still use the similar code. Because dp.SH only highlight about 50 keywords for JavaScript langauge, you will not see performance issue like SyCODE, but applying this one line patch still introduce 20% faster on most browsers.

But this patch is not the end. In next article, I will discuss a complex technique to get another 10% to 40% faster for keywords search/replacement.

7
5
分享到:
评论
5 楼 sohighthesky 2010-07-25  
     学习
4 楼 terryang 2009-05-14  
yidao620c 写道



3 楼 yidao620c 2009-05-14  
2 楼 i_love_sc 2009-05-13  
帖子是英文的,没什么。反正能看懂。还用英文回帖就太……
1 楼 boin 2009-05-12  
aweson post!
greatly inspires me on digging more deeper into regexp.
can't wait to see the following posts.

相关推荐

    正则表达式学习资料以及练习项目代码很多

    正则表达式(Regular Expression,简称regex)是用于匹配字符串的一种模式,广泛应用于文本处理、数据验证、搜索和替换等场景。在Python编程语言中,正则表达式提供了强大的文本处理能力,使得开发者能够高效地处理...

    PB实现的正则表达式

    2. 编译模式:将正则表达式编译成一个可以执行的对象,以提高后续匹配的速度。 3. 执行匹配:使用编译后的对象在目标字符串上执行匹配操作。 4. 处理结果:检查匹配结果,获取匹配的子串,或进行替换等操作。 需要...

    测试正则表达式软件

    正则表达式(Regular Expression,简称regex)是用于匹配字符串的一种模式,广泛应用于文本处理、数据验证、搜索和替换等场景。在Java编程语言中,正则表达式是一个强大的工具,能够帮助开发者高效地处理字符串。...

    正则表达式:文本处理的强大工具.pdf

    - **分组和捕获**:通过圆括号(())可以对正则表达式的一部分进行分组,并可以通过编号访问这些分组的内容。这对于复杂的模式匹配非常重要。 #### 应用案例 - **文本搜索**:在大量文本中查找符合特定模式的字符...

    正则表达式:深入理解与应用.zip

    正则表达式(Regular Expression,简称regex)是用于匹配字符串的一种模式,它是计算机科学中的一个概念,被广泛应用于文本处理、数据验证、搜索和替换等多个领域。深入理解和应用正则表达式,对于任何IT专业人士来...

    .net正则表达式测试工具

    1. **单选模式**:在单行模式下,正则表达式会将整个输入字符串视为一行,使"^"和"$"匹配字符串的开始和结束,而非每一行的开始和结束。这对于处理连续的文本块非常有用。 2. **多行模式**:多行模式下,"^"和"$"将...

    正则库表达式IOS

    在iOS开发中,正则表达式(Regular Expression)是一种强大的文本处理工具,它能用于模式匹配、字符串查找、替换和分割等操作。本类库专为iOS环境设计,旨在简化和增强应用程序对正则表达式的处理能力。下面将详细...

    第11.25节 Python正则表达式编译re.compile及正则对象使用.rar

    在Python编程语言中,正则表达式是一种强大的文本处理工具,用于匹配、查找、替换等操作。本节将深入探讨`re.compile()`函数及其在创建正则表达式对象中的应用。`re.compile()`是Python标准库`re`模块中的一员,它...

    VB6 正则表达式类

    设为True时,"^"和"$"分别匹配每一行的开始和结束。 8. **替换功能**:`Replace`方法可以用来替换匹配的子串,例如: ```vb Dim newString As String newString = regex.Replace("your_string", "replacement_...

    Python正则表达式基础

    1. `compile()`:编译正则表达式模式,生成一个正则表达式对象,提高匹配速度。 2. `findall()`:找到所有非重叠匹配的子串,并以列表形式返回。 3. `finditer()`:与`findall()`类似,但返回的是一个迭代器,每个...

    精通正则表达式~~~

    精通正则表达式第三版 搜集于网络 前言..........I 第1章:正则表达式入门.... 1 解决实际问题... 2 作为编程语言的正则表达式... 4 以文件名做类比... 4 以语言做类比... 5 正则表达式的知识框架... 6 对于...

    正则表达式

    这行代码创建一个新的RegExp对象,并将它赋给变量parttern.这个特殊的RegExp对象和所有以字母"s"结尾的字符串都匹配.用RegExp()也可以定义 一个等价的正则表达式,代码如下: var pattern = new RegExp("s$"); ...

    C#表单正则表达式验证手册

    1. **正则表达式基础**:了解基本的元字符,如`.`(任意字符)、`^`(行首)、`$`(行尾)、`\d`(数字)、`\w`(字母数字字符)、`\s`(空白字符)等,以及量词如`*`(零次或多次)、`+`(一次或多次)、`?...

    常用正则工具类

    在Java编程语言中,正则表达式是一种强大的文本处理工具,用于模式匹配、字符串查找、替换等操作。本文将深入探讨“常用正则工具类”所涉及的关键知识点,并提供一些实用的示例来帮助理解。 1. **正则表达式基础** ...

    正则表达式 正则表达式

    正则表达式是一种强大的文本处理工具,用于匹配、查找、替换和提取字符串模式。在编程领域,正则表达式(Regular Expression,简称regex)被广泛应用于数据验证、文本搜索和处理等方面。它由一系列特殊字符和操作符...

    正则表达式生成和检验工具

    正则表达式是一种强大的文本处理工具,用于匹配、查找、替换和分析字符串模式。在编程领域,正则表达式被广泛应用于数据验证、文本搜索和提取等任务。本篇文章将详细探讨与“正则表达式生成和检验工具”相关的知识点...

    mingw pcre8.12正则表达式库

    在实际编程中,你可能需要了解PCRE的一些特殊标志和选项,如`PCRE_CASELESS`(不区分大小写)、`PCRE_DOTALL`(使`.`匹配任何字符,包括换行符)和`PCRE_MULTILINE`(使`^`和`$`能分别匹配每一行的开始和结束)。...

    java文件读写和正则表达式检索字符次位.pdf

    【字符位置检索】代码中,通过`split()`方法将每行内容按正则表达式分割后,计算每个匹配项的起始位置(包括之前所有字符的长度加一),然后使用`substring()`方法获取匹配的单个字符。`substring(j-1, j)`表示从...

    python正则表达式_深入浅出

    - **`re.compile()`**:编译正则表达式为`Pattern`对象,提高后续匹配速度。 - **`re.Pattern`对象的方法**:如`match()`、`search()`、`fullmatch()`、`findall()`、`finditer()`、`split()`、`sub()`和`subn()`...

Global site tag (gtag.js) - Google Analytics