- 浏览: 962468 次
- 性别:
- 来自: 上海
文章分类
最新评论
-
sscsacdsadcsd:
mike8625 写道react还要自己的一些标签 还得编译 ...
对于React体系的一点想法 -
mike8625:
说的都是给大公司听的,国内很多还是小公司,做个小项目, 说实话 ...
关于国内前端和JS技术发展的乱想 -
mike8625:
react还要自己的一些标签 还得编译 编译吧浏览器端说还慢 ...
对于React体系的一点想法 -
u012814086:
下意识想到了Golang
JavaScript语句后应该加分号么? -
xueduanyang:
我是水羊,年轻的时候觉得只要有好斧子就能做成好产品,各种产品都 ...
关于国内前端和JS技术发展的乱想
2009-05-11
A lesson of RegExp: 50x faster with just one line patch
While I'm developing WebSHi (which is the fastest syntax highlighter written by JavaScript), I also write many performance testings for other rivals. One of them is SyCODE Syntax Highlighter , which is written by silverdrag (水月). It derives from the famous SyntaxHighlighter 1.5.x (dp.SH for short) and as silverdrag's words, it should be 5x to 10x faster than original dp.SH .
But unfortunately, my testings can't prove it. Though it won't trigger the "script slowly" dialog like dp.SH when highlighting large file, in most cases, it only shows 2x faster than dp.SH on IE6. On the other side, when I tested it on FF, I was so surprised that SyCODE is extremely slow, it will cost 5s+ for processing a 700 lines JavaScript file while the original dp.SH only half second.
It's very strange, so I digged into SyCODE. I found that SyCODE highlights more words for JavaScript language (currently all my testcases are to highlight some JavaScript source code files). The original dp.SH (and most other rivals) only highlights the keywords of JavaScript language. SyCODE also highlights global names like Array , Boolean , String , etc., and properties and methods like alert , charAt , onclick etc.. That means SyCODE need to do more text searching and replacement. I disabled such features and tested again, this time SyCODE is 2x faster than dp.SH.
So you will think the problem is just the extra words replacement. And what interesting is it just affect FF a lot, even SyCODE do more text processing, it's still faster than (or at least as fast as) dp.SH on other browsers (Safari, Chrome, Opera and IE).
I'm curious about the root cause of the problem. After some researching, I located it. Just one simple function:
GetKeywords: function(str) {
return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
},
The function GetKeywords
is used to generate a regexp
for keywords search and replacement. For example, GetKeywords("abstract break byte case catch")
will return a regexp /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/
.
The code is straightforward, but it's bad and generate a very inefficient regexp.
The keypoint is \b
, \b
is a word boundary assertion. To test whether a position is a word
boundary, the regexp engine need to consider both the left character of
the position and the right character of the position. If one is a word
character (aka a-z, A-Z, 0-9 and the underscore "_") and the other is
not, then it's a word boundary. You see it need both look forward one
char and look backward one char. Though \b assertion is not very
expensive, each failed match of /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/
will do such look forward/backward 10 times, and JavaScript language
has 50+ keywords means each failed match will do 100 times, and SyCODE
add 400+ properties/methods words means extra 800+ times!
Of coz, \b assertion can be easily optimized, but our test result shows that FF's regexp engine doesn't do a good optimization at all.
Anyway, there is a very cheap way to solve the problem. Most of those \b assertions are unnecessary
. /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/
can be rewrite as /\b(?:abstract|break|byte|case|catch)\b/
, those two regexp are equal, the only difference is the latter only need two \b assertion. Yes, we just need two
, even SyCODE add 400+ words, we still just need two
.
It's trivial to fix GetKeywords :
GetKeywords: function(str) {
return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
return '\\b(' + str.replace(/\s+/g, '|') + ')\\b';
},
Let's see the result of applying this one line patch:
700 | 6.7s | 0.1s |
1600 | 15.5s | 0.3s |
4300 | 41.5s | 0.7s |
Oops, one line code cause 50x difference.
Besides FF, the patch also help other browsers a lot.
IE6 | 6.3s | 2.8s |
Safari3 | 2.6s | 0.8s |
Opera9 | 8.2s | 1.7s |
Chrome1 | 2.3s | 0.5s |
As we can see, even Chrome, which introduce a very optimized regexp engine, also shows at least 4x difference.
SyCODE derives from dp.SH, the GetKeywords function is also the legacy from dp.SH, and even the new SyntaxHighlighter 2 still use the similar code. Because dp.SH only highlight about 50 keywords for JavaScript langauge, you will not see performance issue like SyCODE, but applying this one line patch still introduce 20% faster on most browsers.
But this patch is not the end. In next article, I will discuss a complex technique to get another 10% to 40% faster for keywords search/replacement.
评论
挫
greatly inspires me on digging more deeper into regexp.
can't wait to see the following posts.
发表评论
-
论ES6模块系统的静态解析
2013-03-14 04:56 15890本文是Dave Herman的《Stati ... -
如何创建一个JavaScript裸对象
2012-08-27 02:11 8064所谓裸对象,即 naked object ,是指没有原型(sp ... -
JavaScript语句后应该加分号么?
2012-06-19 03:10 14396这是一个老生常谈的问 ... -
shim是应该抛异常还是应该fail silently?
2011-08-11 17:26 5600玉伯发布了es5-safe模块 ... -
7月30日的广州演讲视频和Slides
2011-08-01 23:38 31797月30日在W3CTech广州站活动上的演讲,题目是:ECMA ... -
如何判断一个函数的意图是被用作构造器,也就是可视为“类”
2011-07-21 13:55 2952前提是不要求做什么特殊标记。只是最大可能的猜测函数的作用大概是 ... -
关于国内前端和JS技术发展的乱想
2011-07-19 18:53 33286玉伯在我的一条微博后面写了一些(和主题不是很相关但)非常值得思 ... -
Module与Trait的比较
2011-08-12 12:50 4014最近我多次提及module和trait。 粗看,我们可以发现 ... -
如何将let结构(block scope)转换到当前的JavaScript代码
2011-07-12 17:24 3018本文是对如何将let结构转换到ES3代码的补充。 首先,原文 ... -
JavaScript的未来方向之观察
2011-07-12 02:53 8378最近每次去杭州,都有 ... -
我为什么力挺NodeJS
2011-07-04 00:27 0之前在参加CNodeJS社区在 ... -
JS之父再谈JS历史(续完)
2010-12-31 04:20 3448又到年底,我觉得是时候还债了。自开blog来,我出了不少“太监 ... -
我为什么是DC黑续,兼答Tin
2010-04-27 14:29 0我同意安全是一个重要问题。我不同意的是把所谓安全放到凌驾于其他 ... -
我为什么是DC黑─Why I disagree with Douglas Crockford
2010-04-26 17:51 11019参加完了QCon北京大会, ... -
JavaScript的EOS(分号)问题
2009-05-08 16:24 5787在http://bbs.51js.com/viewthre ... -
JavaScript五大关键字
2009-05-06 17:53 4773近期做语法高亮项目的副产品,是统计了一下几个主流JS工具包中各 ... -
curry和partial的差别
2009-03-28 00:15 371751js上asfman翻译了http://ejohn.org/ ... -
Eval is Evil , With evil too
2009-03-27 18:39 0with的问题: 2009-3-27 17:12:40 ... -
IE全局变量的Dissociative Identity Disorder(人格分裂症)
2009-03-16 02:47 14742最近,小麦提出了一个疑惑: 小麦 写道最后介绍一个我也搞不明白 ... -
JScript下Array对象的性能问题
2009-02-14 02:51 4005今天看了微软JScript官方blog上去年的两篇文章: ht ...
相关推荐
正则表达式(Regular Expression,简称regex)是用于匹配字符串的一种模式,广泛应用于文本处理、数据验证、搜索和替换等场景。在Python编程语言中,正则表达式提供了强大的文本处理能力,使得开发者能够高效地处理...
2. 编译模式:将正则表达式编译成一个可以执行的对象,以提高后续匹配的速度。 3. 执行匹配:使用编译后的对象在目标字符串上执行匹配操作。 4. 处理结果:检查匹配结果,获取匹配的子串,或进行替换等操作。 需要...
正则表达式(Regular Expression,简称regex)是用于匹配字符串的一种模式,广泛应用于文本处理、数据验证、搜索和替换等场景。在Java编程语言中,正则表达式是一个强大的工具,能够帮助开发者高效地处理字符串。...
- **分组和捕获**:通过圆括号(())可以对正则表达式的一部分进行分组,并可以通过编号访问这些分组的内容。这对于复杂的模式匹配非常重要。 #### 应用案例 - **文本搜索**:在大量文本中查找符合特定模式的字符...
正则表达式(Regular Expression,简称regex)是用于匹配字符串的一种模式,它是计算机科学中的一个概念,被广泛应用于文本处理、数据验证、搜索和替换等多个领域。深入理解和应用正则表达式,对于任何IT专业人士来...
1. **单选模式**:在单行模式下,正则表达式会将整个输入字符串视为一行,使"^"和"$"匹配字符串的开始和结束,而非每一行的开始和结束。这对于处理连续的文本块非常有用。 2. **多行模式**:多行模式下,"^"和"$"将...
在iOS开发中,正则表达式(Regular Expression)是一种强大的文本处理工具,它能用于模式匹配、字符串查找、替换和分割等操作。本类库专为iOS环境设计,旨在简化和增强应用程序对正则表达式的处理能力。下面将详细...
在Python编程语言中,正则表达式是一种强大的文本处理工具,用于匹配、查找、替换等操作。本节将深入探讨`re.compile()`函数及其在创建正则表达式对象中的应用。`re.compile()`是Python标准库`re`模块中的一员,它...
设为True时,"^"和"$"分别匹配每一行的开始和结束。 8. **替换功能**:`Replace`方法可以用来替换匹配的子串,例如: ```vb Dim newString As String newString = regex.Replace("your_string", "replacement_...
1. `compile()`:编译正则表达式模式,生成一个正则表达式对象,提高匹配速度。 2. `findall()`:找到所有非重叠匹配的子串,并以列表形式返回。 3. `finditer()`:与`findall()`类似,但返回的是一个迭代器,每个...
精通正则表达式第三版 搜集于网络 前言..........I 第1章:正则表达式入门.... 1 解决实际问题... 2 作为编程语言的正则表达式... 4 以文件名做类比... 4 以语言做类比... 5 正则表达式的知识框架... 6 对于...
这行代码创建一个新的RegExp对象,并将它赋给变量parttern.这个特殊的RegExp对象和所有以字母"s"结尾的字符串都匹配.用RegExp()也可以定义 一个等价的正则表达式,代码如下: var pattern = new RegExp("s$"); ...
1. **正则表达式基础**:了解基本的元字符,如`.`(任意字符)、`^`(行首)、`$`(行尾)、`\d`(数字)、`\w`(字母数字字符)、`\s`(空白字符)等,以及量词如`*`(零次或多次)、`+`(一次或多次)、`?...
在Java编程语言中,正则表达式是一种强大的文本处理工具,用于模式匹配、字符串查找、替换等操作。本文将深入探讨“常用正则工具类”所涉及的关键知识点,并提供一些实用的示例来帮助理解。 1. **正则表达式基础** ...
正则表达式是一种强大的文本处理工具,用于匹配、查找、替换和提取字符串模式。在编程领域,正则表达式(Regular Expression,简称regex)被广泛应用于数据验证、文本搜索和处理等方面。它由一系列特殊字符和操作符...
正则表达式是一种强大的文本处理工具,用于匹配、查找、替换和分析字符串模式。在编程领域,正则表达式被广泛应用于数据验证、文本搜索和提取等任务。本篇文章将详细探讨与“正则表达式生成和检验工具”相关的知识点...
在实际编程中,你可能需要了解PCRE的一些特殊标志和选项,如`PCRE_CASELESS`(不区分大小写)、`PCRE_DOTALL`(使`.`匹配任何字符,包括换行符)和`PCRE_MULTILINE`(使`^`和`$`能分别匹配每一行的开始和结束)。...
【字符位置检索】代码中,通过`split()`方法将每行内容按正则表达式分割后,计算每个匹配项的起始位置(包括之前所有字符的长度加一),然后使用`substring()`方法获取匹配的单个字符。`substring(j-1, j)`表示从...
- **`re.compile()`**:编译正则表达式为`Pattern`对象,提高后续匹配速度。 - **`re.Pattern`对象的方法**:如`match()`、`search()`、`fullmatch()`、`findall()`、`finditer()`、`split()`、`sub()`和`subn()`...