Getting the regexp right: one line of code, a 50x speed difference

hax

Categories: JavaScript, Chrome, IE, performance, Safari

2009-05-11

A lesson in RegExp: 50x faster with a one-line patch

While developing WebSHi (the fastest syntax highlighter written in JavaScript), I have also written many performance tests for its rivals. One of them is SyCODE Syntax Highlighter, written by silverdrag (水月). It derives from the famous SyntaxHighlighter 1.5.x (dp.SH for short) and, in silverdrag's words, it should be 5x to 10x faster than the original dp.SH.

Unfortunately, my tests couldn't confirm that. Although it doesn't trigger the "slow script" dialog the way dp.SH does when highlighting a large file, in most cases it is only 2x faster than dp.SH on IE6. On the other hand, when I tested it on FF, I was surprised to find SyCODE extremely slow: it takes 5s+ to process a 700-line JavaScript file, while the original dp.SH needs only half a second.

That was strange, so I dug into SyCODE. I found that SyCODE highlights more words for the JavaScript language (currently all my test cases highlight JavaScript source files). The original dp.SH (and most other rivals) highlights only JavaScript keywords. SyCODE also highlights global names like Array, Boolean, String, etc., and properties and methods like alert, charAt, onclick, etc. That means SyCODE needs to do more text searching and replacement. I disabled those features and tested again; this time SyCODE was 2x faster than dp.SH.

So you might think the problem is just the extra word replacement. What's interesting is that it only hurts FF badly: even though SyCODE does more text processing, it is still faster than (or at least as fast as) dp.SH on the other browsers (Safari, Chrome, Opera and IE).

I was curious about the root cause. After some research, I located it in one simple function:


GetKeywords: function(str) {
 return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
},

The function GetKeywords generates the source of a regexp used for keyword search and replacement. For example, GetKeywords("abstract break byte case catch") returns a pattern string that compiles to the regexp /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ .
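To make the generated pattern concrete, here is a minimal standalone sketch. The free function `getKeywords`, the sample text and the `'g'` flag are assumptions made for illustration; they are not taken from SyCODE's actual code.

```javascript
// Standalone copy of the original logic (hypothetical free function).
function getKeywords(str) {
  return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
}

var source = getKeywords('abstract break byte case catch');

// source is only a pattern string; it still has to be compiled.
var re = new RegExp(source, 'g'); // the 'g' flag is an assumption

var found = 'try { break; } catch (e) {}'.match(re);

console.log(source); // the generated pattern, one \b pair per keyword
console.log(found);  // the keywords found in the sample text
```

Every keyword in the list brings its own pair of \b assertions, so the pattern grows by two assertions per word.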

The code is straightforward, but it generates a very inefficient regexp.

The key point is \b , the word boundary assertion. To test whether a position is a word boundary, the regexp engine has to consider both the character to the left of the position and the character to the right. If one is a word character (a-z, A-Z, 0-9 or the underscore "_") and the other is not, the position is a word boundary. So each test looks one character ahead and one character behind. A single \b assertion is not very expensive, but each failed match of /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ can perform that look-ahead/look-behind up to 10 times. JavaScript has 50+ keywords, which means up to 100 times per failed match, and the 400+ property/method words SyCODE adds mean up to 800+ more!
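The boundary test described above can be written out by hand. This is only a sketch of the rule, not how any browser's regexp engine implements it; the helper names `isWordChar` and `isWordBoundary` are made up for this example.

```javascript
// Is ch a word character in the \w sense (a-z, A-Z, 0-9, underscore)?
function isWordChar(ch) {
  return /^[A-Za-z0-9_]$/.test(ch);
}

// True when position pos (between text[pos - 1] and text[pos]) is a word boundary.
// The start and the end of the text count as non-word neighbours.
function isWordBoundary(text, pos) {
  var left = pos > 0 ? text.charAt(pos - 1) : '';
  var right = pos < text.length ? text.charAt(pos) : '';
  return isWordChar(left) !== isWordChar(right);
}

var text = 'a case!';
var boundaries = [];
for (var i = 0; i <= text.length; i++) {
  if (isWordBoundary(text, i)) boundaries.push(i);
}
console.log(boundaries); // [0, 1, 2, 6]
```

Each call reads two characters, which is the look-behind/look-ahead pair the paragraph above counts.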

Of course, the \b assertion is easy to optimize, but our test results show that FF's regexp engine doesn't optimize it well at all.

Anyway, there is a very cheap way to solve the problem. Most of those \b assertions are unnecessary. /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ can be rewritten as /\b(?:abstract|break|byte|case|catch)\b/ . The two regexps are equivalent; the only difference is that the latter needs just two \b assertions. Yes, we need only two: even with the 400+ words SyCODE adds, we still need only two.
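A quick way to convince yourself that the two forms match the same things is to run both over the same text. The sample string below is made up for this sketch, and the `'g'` flag is an assumption.

```javascript
var words = ['abstract', 'break', 'byte', 'case', 'catch'];

// One \b pair per keyword (the original form).
var slow = new RegExp('\\b' + words.join('\\b|\\b') + '\\b', 'g');
// A single \b pair around a non-capturing group (the rewritten form).
var fast = new RegExp('\\b(?:' + words.join('|') + ')\\b', 'g');

// breakfast, bytecode and catch_all contain keywords but are not keywords.
var sample = 'case x: break; // breakfast, bytecode and catch_all\n' +
             'try {} catch (e) { var b = byte; }';

var slowHits = sample.match(slow);
var fastHits = sample.match(fast);

console.log(slowHits.join(' '));
console.log(slowHits.join(' ') === fastHits.join(' '));
```

Both patterns skip words that merely contain a keyword, because the trailing \b still has to succeed after whichever alternative matched.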

It's trivial to fix GetKeywords:


GetKeywords: function(str) {
 // Removed: return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
 // Added: a single pair of \b assertions around the whole alternation.
 return '\\b(' + str.replace(/\s+/g, '|') + ')\\b';
},

Let's see the result of applying this one line patch:

Test results on FF3

Lines of code	Original	Patched
700	6.7s	0.1s
1600	15.5s	0.3s
4300	41.5s	0.7s

Oops: one line of code causes a 50x difference.
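If you want to try a comparison like this yourself, a rough harness could look like the sketch below. The word list and the input text are synthetic stand-ins, not the files used for the numbers above, and the timings will vary widely between engines and versions, so treat any result as indicative only.

```javascript
// Synthetic word list and input text (both made up for this sketch).
var words = [];
for (var i = 0; i < 400; i++) words.push('word' + i);

var lines = [];
for (var j = 0; j < 2000; j++) {
  lines.push('var x' + j + ' = foo(bar, word' + (j % 400) + ') + 1;');
}
var text = lines.join('\n');

var original = new RegExp('\\b' + words.join('\\b|\\b') + '\\b', 'g');
var patched = new RegExp('\\b(?:' + words.join('|') + ')\\b', 'g');

// Count the matches and report the elapsed wall-clock time in milliseconds.
function timeIt(re) {
  var start = Date.now();
  var hits = (text.match(re) || []).length;
  return { hits: hits, ms: Date.now() - start };
}

var a = timeIt(original);
var b = timeIt(patched);
console.log('original: ' + a.hits + ' hits in ' + a.ms + ' ms');
console.log('patched:  ' + b.hits + ' hits in ' + b.ms + ' ms');
```

Checking that both runs report the same number of hits guards against measuring two patterns that do not do the same work.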

Besides FF, the patch also helps the other browsers a lot.

Test results for 4300 lines of code

Browser	Original	Patched
IE6	6.3s	2.8s
Safari3	2.6s	0.8s
Opera9	8.2s	1.7s
Chrome1	2.3s	0.5s

As we can see, even Chrome, which introduced a highly optimized regexp engine, shows a difference of more than 4x.

SyCODE derives from dp.SH, and the GetKeywords function is a legacy of dp.SH; even the new SyntaxHighlighter 2 still uses similar code. Because dp.SH highlights only about 50 keywords for the JavaScript language, you won't see a performance problem as severe as SyCODE's, but applying this one-line patch still makes it about 20% faster on most browsers.

But this patch is not the end. In the next article, I will discuss a more complex technique that makes keyword search/replacement another 10% to 40% faster.


2009-05-12 03:43

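#5 sohighthesky 2010-07-25

Learning from this.

#4 terryang 2009-05-14

yidao620c wrote:

Lame

#3 yidao620c 2009-05-14

Lame

#2 i_love_sc 2009-05-13

The post is in English, and that's fine; I can read it anyway. But replying in English as well is a bit much...

#1 boin 2009-05-12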

aweson post!
greatly inspires me on digging more deeper into regexp.
can't wait to see the following posts.
