Cobra Parsing: Disable Persistent Connections and Set Socket Timeouts
I use the Cobra Toolkit to parse web pages for various projects. In a project that has eight concurrent parsers running, I found that some of the parsers would hang indefinitely in a socket read during JavaScript processing. I think, but have not confirmed, that most of the hung sockets are related to persistent connections to Google's safe browser/YouTube servers in the 1E100.net domain. So I wanted a method to disable persistent connections in Cobra. As it's also possible for a server not to respond, URLConnection read timeouts would be a useful option, too.
I created some simple code that causes URLConnection objects created by Cobra to be configured with a timeout and to optionally disable persistent connections.
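As a rough illustration of why a read timeout matters, the following sketch uses only java.net and is independent of Cobra. The class ReadTimeoutDemo and its method readTimesOut() are hypothetical names made up for this example. It assumes Java 7 or later (for try-with-resources) and a local server socket that accepts TCP connections but never sends a response, which stands in for an unresponsive remote server.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical demo class; not part of Cobra or of the code on this page. */
class ReadTimeoutDemo {

    /**
     * Requests a page from a local server that never answers and reports
     * whether the read gave up with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // A listening socket that never calls accept(): the OS completes the
        // TCP handshake, but no HTTP response is ever written.
        try (ServerSocket silent =
                 new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            URI uri = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/");
            URLConnection c = uri.toURL().openConnection(Proxy.NO_PROXY);
            c.setConnectTimeout(timeoutMillis);
            c.setReadTimeout(timeoutMillis);
            try {
                // Without a read timeout this call would block indefinitely.
                c.getInputStream().close();
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(500));
    }
}
```

With the two timeout calls removed, the same request would be expected to wait for as long as the silent server stays open, which is the kind of hang described above.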
The Cobra DOM parser requires an org.lobobrowser.html.UserAgentContext object. A sample UserAgentContext object is provided in org.lobobrowser.html.test.SimpleUserAgentContext. This class has a createHttpRequest() method that returns an org.lobobrowser.html.test.SimpleHttpRequest object whenever the parser needs to open a socket connection. In one of the several open() methods, SimpleHttpRequest creates a URLConnection object. The timeout and persistence settings need to be applied to the URLConnection.
I created two simple classes in the package com.benjysbrain.CobraExtension:
- CobraUserAgentContext extends SimpleUserAgentContext.
- CobraHttpRequest extends SimpleHttpRequest.
CobraUserAgentContext
The CobraUserAgentContext constructor takes two parameters:
- int timeout - the timeout in milliseconds used on the URLConnection setReadTimeout() and setConnectTimeout() methods
- boolean persistent - false to disable persistent connections
The createHttpRequest() method returns a CobraHttpRequest object that has been configured with the timeout and persistence values.
CobraHttpRequest
The new constructor for this class takes the UserAgentContext and proxy parameters, as in the parent class, and adds the timeout and persistent settings described for CobraUserAgentContext. The five-parameter open() method is overridden. When it is called, the parent's five-parameter open() method creates a URLConnection, and then the timeout and persistence settings are applied to that URLConnection.
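The configuration step can be tried on its own, without Cobra, on a plain java.net.URLConnection. The sketch below is only an illustration: ConnectionSettings and apply() are hypothetical names, and http://example.com/ is a placeholder URL. It relies on the fact that openConnection() creates the connection object without contacting the server, so the settings can be applied and read back before any traffic is sent.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper; a plain java.net sketch, independent of Cobra. */
class ConnectionSettings {

    /** Applies both timeouts and, if requested, a "Connection: close" header. */
    static void apply(URLConnection c, int timeoutMillis, boolean persistent) {
        c.setConnectTimeout(timeoutMillis);
        c.setReadTimeout(timeoutMillis);
        if (!persistent) {
            // Asks the server to close the connection after the response.
            c.setRequestProperty("Connection", "close");
        }
    }

    public static void main(String[] args) throws IOException {
        // openConnection() only creates the object; nothing is sent yet.
        URLConnection c = URI.create("http://example.com/").toURL().openConnection();
        apply(c, 60 * 1000, false);
        System.out.println("read timeout: " + c.getReadTimeout());
        System.out.println("Connection: " + c.getRequestProperty("Connection"));
    }
}
```

Request properties have to be set before the connection is established, which is why the settings are applied right after the connection object is created and before anything is read from it.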
Using the code
Substitute CobraUserAgentContext for SimpleUserAgentContext in your programs. Use the constructor that allows you to set timeouts and persistence values, and pass the object to org.lobobrowser.html.parser.DocumentBuilderImpl. Whenever the parser needs a new URL connection, it will use the CobraHttpRequest object, which sets the timeouts and persistence settings.
To compile the code, listed below, you will need Java 1.5 or greater, as the timeout methods of URLConnection are not in earlier versions. Create the com.benjysbrain.CobraExtension directory structure, put the source in the leaf directory, add the Cobra Toolkit jar to your classpath, and compile the source. I recommend setting a timeout value of about one minute, but you might want to increase this depending on the responsiveness of the servers from which pages are parsed.
package com.benjysbrain.CobraExtension ;
import org.lobobrowser.html.test.* ;
import org.lobobrowser.html.* ;
/** CobraUserAgentContext is a subclass of
org.lobobrowser.html.test.SimpleUserAgentContext that overrides the
createHttpRequest() method to provide an HttpRequest object with
a URLConnection object with timeouts and other properties. In addition
to the new createHttpRequest() method, a new constructor has been
added.
<p>
The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of
the Lobo Project.
<p>
Java 1.5 or later required.
<p>
Copyright 2010 by Ben E. Cline. This source code is provided
for educational purposes "as is" with no warranty. If you use
the code, please acknowledge the author.
<p>
http://www.benjysbrain.com
@author Benjy Cline
*/
public class CobraUserAgentContext extends SimpleUserAgentContext {

    /** The read timeout and connection timeout in milliseconds. */
    int timeout ;

    /** If false, HttpRequest objects have the "Connection: close" property
        set to discourage persistent connections. */
    boolean persistent ;

    /** Create a CobraUserAgentContext object where createHttpRequest()
        returns an HttpRequest object with a URLConnection object with
        the specified timeout and persistence setting. */
    public CobraUserAgentContext(int timeout, boolean persistent) {
        super() ;
        this.timeout = timeout ;
        this.persistent = persistent ;
    }

    /** Create an HttpRequest object, used to load images, scripts, etc.,
        with timeout and persistence values. */
    public HttpRequest createHttpRequest() {
        return new CobraHttpRequest(this, this.getProxy(), timeout, persistent) ;
    }
}

package com.benjysbrain.CobraExtension ;
import org.lobobrowser.html.test.* ;
import org.lobobrowser.html.* ;
import java.io.* ;
/**
CobraHttpRequest is a subclass of
org.lobobrowser.html.test.SimpleHttpRequest. It adds a constructor
and a modified version of the open() method. If the new constructor
is used, a timeout and persistent state are used during open() calls
to configure the URLConnection object. See
com.benjysbrain.CobraExtension.CobraUserAgentContext.
<p>
The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of
the Lobo Project.
<p>
Java 1.5 or later required.
<p>
Copyright 2010 by Ben E. Cline. This source code is provided
for educational purposes "as is" with no warranty. If you use
the code, please acknowledge the author.
<p>
http://www.benjysbrain.com
<p>
@author Benjy Cline
*/
public class CobraHttpRequest extends SimpleHttpRequest {

    /** The read timeout and connection timeout in milliseconds. */
    int timeout = 1000*60*30 ;

    /** If false, HttpRequest objects have the "Connection: close" property
        set to discourage persistent connections. */
    boolean persistent = false ;

    /** Create an HttpRequest object whose open() methods create
        URLConnection objects with timeout and persistence values.
    */
    public CobraHttpRequest(UserAgentContext context, java.net.Proxy proxy,
                            int timeout, boolean persistent) {
        super(context, proxy) ;
        this.timeout = timeout ;
        // The parameter was previously named "persistence", so this line
        // assigned the field to itself and the argument was ignored.
        this.persistent = persistent ;
    }

    /** Override the primary open() method so that the URLConnection object
        can be configured. */
    public void open(final String method, final java.net.URL url,
                     boolean asyncFlag, final String userName,
                     final String password) throws java.io.IOException {
        // The parent creates the URLConnection and stores it in "connection".
        super.open(method, url, asyncFlag, userName, password) ;
        connection.setReadTimeout(timeout) ;
        connection.setConnectTimeout(timeout) ;
        if (!persistent)
            connection.setRequestProperty("Connection", "close") ;
    }
}

These classes are not particularly general, but they can serve as a model for more elegant code. If you have questions or comments or if you discover errors in this page or the code, please let me know at the e-mail address in the footer of this page.
This page © copyright 2010 by Ben E. Cline. E-Mail: