写对正则:一行代码,速度差50倍
http://hax.iteye.com/blog/383961
2009-05-11
A lesson of RegExp: 50x faster with just one line patch
While I'm developing WebSHi (which is the fastest syntax highlighter written by JavaScript), I also write many performance testings for other rivals. One of them is SyCODE Syntax Highlighter , which is written by silverdrag (水月). It derives from the famous SyntaxHighlighter 1.5.x (dp.SH for short) and as silverdrag's words, it should be 5x to 10x faster than original dp.SH .
But unfortunately, my testings can't prove it. Though it won't trigger the "script slowly" dialog like dp.SH when highlighting large file, in most cases, it only shows 2x faster than dp.SH on IE6. On the other side, when I tested it on FF, I was so surprised that SyCODE is extremely slow, it will cost 5s+ for processing a 700 lines JavaScript file while the original dp.SH only half second.
It's very strange, so I digged into SyCODE. I found that SyCODE highlights more words for JavaScript language (currently all my testcases are to highlight some JavaScript source code files). The original dp.SH (and most other rivals) only highlights the keywords of JavaScript language. SyCODE also highlights global names likeArray , Boolean , String , etc., and properties and methods like alert , charAt , onclick etc.. That means SyCODE need to do more text searching and replacement. I disabled such features and tested again, this time SyCODE is 2x faster than dp.SH.
So you will think the problem is just the extra words replacement. And what interesting is it just affect FF a lot, even SyCODE do more text processing, it's still faster than (or at least as fast as) dp.SH on other browsers (Safari, Chrome, Opera and IE).
I'm curious about the root cause of the problem. After some researching, I located it. Just one simple function:
GetKeywords: function(str) {
return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
},
The function GetKeywords is used to generate a regexp for keywords search and replacement. For example,GetKeywords("abstract break byte case catch")
will return a regexp/\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/
.
The code is straightforward, but it's bad and generate a very inefficient regexp.
The keypoint is \b
, \b
is a word boundary assertion. To test whether a position is a word boundary, the regexp engine need to consider both the left character of the position and the right character of the position. If one is a word character (aka a-z, A-Z, 0-9 and the underscore "_") and the other is not, then it's a word boundary. You see it need both look forward one char and look backward one char. Though \b assertion is not very expensive, each failed match of /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/
will do such look forward/backward 10 times, and JavaScript language has 50+ keywords means each failed match will do 100 times, and SyCODE add 400+ properties/methods words means extra 800+ times!
Of coz, \b assertion can be easily optimized, but our test result shows that FF's regexp engine doesn't do a good optimization at all.
Anyway, there is a very cheap way to solve the problem. Most of those \b assertions are unnecessary ./\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/
can be rewrite as/\b(?:abstract|break|byte|case|catch)\b/
, those two regexp are equal, the only difference is the latter only need two \b assertion. Yes, we just need two , even SyCODE add 400+ words, we still just need two .
It's trivial to fix GetKeywords :
GetKeywords: function(str) {
return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
return '\\b(' + str.replace(/\s+/g, '|') + ')\\b';
},
Let's see the result of applying this one line patch:
Test results of FF3 code lines original patched
700 | 6.7s | 0.1s |
1600 | 15.5s | 0.3s |
4300 | 41.5s | 0.7s |
Oops, one line code cause 50x difference.
Besides FF, the patch also help other browsers a lot.
Test results of 4300 lines of code browser original patched
IE6 | 6.3s | 2.8s |
Safari3 | 2.6s | 0.8s |
Opera9 | 8.2s | 1.7s |
Chrome1 | 2.3s | 0.5s |
As we can see, even Chrome, which introduce a very optimized regexp engine, also shows at least 4x difference.
SyCODE derives from dp.SH, the GetKeywords function is also the legacy from dp.SH, and even the newSyntaxHighlighter 2 still use the similar code. Because dp.SH only highlight about 50 keywords for JavaScript langauge, you will not see performance issue like SyCODE, but applying this one line patch still introduce 20% faster on most browsers.
But this patch is not the end. In next article, I will discuss a complex technique to get another 10% to 40% faster for keywords search/replacement.
相关推荐
Escape RegExp特殊字符 安装 $ npm install escape-string-regexp 用法 import escapeStringRegexp from 'escape-string-regexp' ; const escapedString = escapeStringRegexp ( 'How much $ for a :unicorn:?' ) ; ...
官方离线安装包,亲测可用
mysql-udf-regexp 该程序包将正则表达式函数用作MySQL用户定义函数(UDF)。 该软件包实现的功能是: REGEXP_LIKE(text, pattern [, mode]) REGEXP_SUBSTR(text, pattern [,position [,occurence [,mode]]]) ...
npm install path-to-regexp --save 用法 const { pathToRegexp , match , parse , compile } = require ( "path-to-regexp" ) ; // pathToRegexp(path, keys?, options?) // match(path) // parse(path) // compile...
Regexp::Lexer - 用于 perl 正则表达式的词法分析器 概要 use Regexp::Lexer qw(tokenize); my $tokens = tokenize(qr{\Ahello\s+world\z}i); 描述 Regexp::Lexer 是 perl 正则表达式的词法分析器。 该模块将正则...
path_to_regexp 将/user/:id类的路径转换为正则表达式。 匹配 pathToRegExp()将路径规范转换为与符合条件的路径匹配的正则表达式。 final regExp = pathToRegExp ( '/user/:id' ); regExp. hasMatch ( '/user/12'...
是用于查找RegExp错误和RegExp样式指南违规的ESLint插件。 :name_badge:特征 此ESLint插件提供了与更好的方法相关的整理规则,可以帮助您避免在使用RegExp时出现问题。 查找正则表达式及其提示的错误用法。 强制...
$ npm install clone-regexp 用法 import cloneRegexp from 'clone-regexp' ; const regex = / [ a-z ] / gi ; cloneRegexp ( regex ) ; //=> /[a-z]/gi cloneRegexp ( regex ) === regex ; //=> false cloneRegexp ...
在本课程"ajs_lesson_9_regexp:ПродвинутыйJS。 Лекция9.Задача2"中,我们深入探讨了JavaScript中的正则表达式(Regular Expressions),这是一个强大的文本处理工具,尤其在数据验证、搜索...
- `new RegExp()`:构造函数创建正则表达式对象。 - `/pattern/flags`:字面量语法创建正则表达式,如`/abc/gi`匹配所有"abc",`g`全局搜索,`i`忽略大小写。 - `test()`:测试字符串是否匹配正则表达式。 - `...
concat-regexp 是一个函数,它接受一系列正则表达式(作为 String 和/或 RegExp 对象)并以连接形式返回它们。 安装 $ npm install concat-regexp 例子 var concat = require ( 'concat-regexp' ) var username = '...
is.regexp 检查给定的值是一个RegExp 对象 应用程序接口 var isRegExp = require('is.regexp') isRegExp(正则表达式对象) 用法 var isRegExp = require ( 'is.regexp' ) ; var obj1 = / 2 [ 0-4 ] [ 0-9 ] / var ...
* [a-zA-Z] a through z or A through Z, inclusive (range) [a-zA-Z] 从a 到 z 或 从A 到 Z(包括a,z,A,Z)(范围) * [a-d[m-p]] a through d, or m through p: [a-dm-p] (union) [a-d[m-p]] 从a 到...
用法 var isRegexp = require ( 'validate.io-regexp' ) ;isRegexp( 值 ) 验证value是否为正则表达式。 var value = / \. + / ;var bool = isRegexp ( value ) ;// returns true例子 console . log ( isRegexp ( / \...
实现基本的正则表达式匹配: 字符 含义 c 匹配任意的字母c . 匹配任意的单个字符 ^ 匹配输入字符串的开头 $ 匹配输入字符串的结尾 * 匹配前一个字符的零个或者多个出现 ...匹配所包含的任意一个字符
import isRegexp from 'is-regexp' ; isRegexp ( 'unicorn' ) ; //=> false isRegexp ( / unicorn / ) ; //=> true isRegexp ( new RegExp ( 'unicorn' ) ) ; //=> true 有关的 -输入检查值 Tidelift帮助维护人员...
import com.google.code.regexp.Pattern ; import com.google.code.regexp.Matcher ; public class NamedRegexpTest { public static void main ( String [] args ) { // pattern contains capture group, named ...
Eclipse Regexp是一款针对Eclipse集成开发环境的插件,专为Java开发者设计,用于测试和验证正则表达式。正则表达式是编程中的一种强大工具,它用于模式匹配和字符串处理,尤其是在数据提取、文本搜索和替换等方面。...
静态静态正则表达式 静态文件服务中间件从koa-static克隆添加option.regexp以支持通过regexp进行过滤安装$ npm install koa-static-regexp原料药var koa = require ( 'koa' ) ;var app = koa ( ) ;app . use ( ...
在"RegExp-Patch8"这样的文件名中,我们可以推测这是一个针对正则表达式学习应用的更新补丁,可能包含了一些新的功能或者修复了之前的问题。在实际的软件开发中,补丁文件通常用来修正程序中的错误,增强性能,或者...