Spider: a simple web crawler
1. Prerequisites
htmlparser
Homepage: http://sourceforge.net/projects/htmlparser/
Download: http://sourceforge.net/project/showfiles.php?group_id=24399
File: htmlparser1_6_20060610.zip
<dependency>
<groupId>org.htmlparser</groupId>
<artifactId>htmlparser</artifactId>
<version>1.6</version>
</dependency>
cpdetector
Homepage: http://cpdetector.sourceforge.net/
Download: http://sourceforge.net/project/showfiles.php?group_id=114421
File: cpdetector_eclipse_project_1.0.7.zip
<dependency>
<groupId>cpdetector</groupId>
<artifactId>cpdetector</artifactId>
<version>1.0.5</version>
</dependency>
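Note: these cpdetector coordinates are unlikely to exist in a public Maven repository, so the dependency may not resolve on its own. One option is to install the jar from the downloaded zip into the local repository by hand (the jar path below is a placeholder):
mvn install:install-file -Dfile=path/to/cpdetector.jar -DgroupId=cpdetector -DartifactId=cpdetector -Dversion=1.0.5 -Dpackaging=jar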
spindle
Homepage: http://www.bitmechanic.com/projects/spindle/ (no longer accessible)
2. A spider adapted from the spindle code
For now it simply prints the URLs it finds; the parsed content itself is not processed any further.
The HTML parsing base class, HtmlParserUtil.java:
package com.sillycat.api.commons.utils.html;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.net.UnknownHostException;
import java.nio.charset.Charset;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;
import cpdetector.io.ASCIIDetector;
import cpdetector.io.CodepageDetectorProxy;
import cpdetector.io.JChardetFacade;
import cpdetector.io.ParsingDetector;
import cpdetector.io.UnicodeDetector;
public class HtmlParserUtil {
/* initial capacity of the StringBuffer */
public static int TRANSFER_SIZE = 4096;
/* line separator of the current platform */
public static String lineSep = System.getProperty("line.separator");
/* auto-detect the page encoding to avoid garbled Chinese text */
public static String autoDetectCharset(URL url) {
CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
/**
* ParsingDetector inspects the declared encoding of HTML, XML and similar
* files or character streams; the constructor flag controls whether
* verbose detection details are printed (false = silent)
*/
detector.add(new ParsingDetector(false));
detector.add(JChardetFacade.getInstance());
detector.add(ASCIIDetector.getInstance());
detector.add(UnicodeDetector.getInstance());
Charset charset = null;
try {
charset = detector.detectCodepage(url);
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ie) {
ie.printStackTrace();
}
if (charset == null)
charset = Charset.defaultCharset();
return charset.name();
}
/* parse a standard HTML page with the given encoding, in preparation for indexing */
public static String[] parseHtml(String url, String charset) {
String result[] = null;
String content = null;
try {
URL source = new URL(url);
InputStream in = source.openStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(
in, charset));
String line;
StringBuffer temp = new StringBuffer(TRANSFER_SIZE);
while ((line = reader.readLine()) != null) {
temp.append(line);
temp.append(lineSep);
}
reader.close();
in.close();
content = temp.toString();
} catch (UnsupportedEncodingException uee) {
uee.printStackTrace();
} catch (MalformedURLException mue) {
System.err.println("Invalid URL : " + url);
} catch (UnknownHostException uhe) {
System.err.println("UnknowHost : " + url);
} catch (SocketException se) {
System.err.println("Socket Error : " + se.getMessage() + " " + url);
} catch (SocketTimeoutException ste) {
System.err.println("Socket Connection Time Out : " + url);
} catch (FileNotFoundException fnfe) {
/* fnfe.getCause() is normally null here; dereferencing it caused an NPE */
System.err.println("broken link " + fnfe.getMessage() + " ignored");
} catch (IOException ie) {
ie.printStackTrace();
}
if (content != null) {
Parser myParser = Parser.createParser(content, charset);
HtmlPage visitor = new HtmlPage(myParser);
try {
myParser.visitAllNodesWith(visitor);
String body = null;
String title = "Untitled";
if (visitor.getBody() != null) {
NodeList nodelist = visitor.getBody();
body = nodelist.asString().trim();
}
if (visitor.getTitle() != null){
title = visitor.getTitle();
}
result = new String[] { body, title };
} catch (ParserException pe) {
pe.printStackTrace();
}
}
return result;
}
}
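A minimal usage sketch for the helper class above (HtmlParserUtilDemo and the URL are illustrative, not part of the original project):
package com.sillycat.api.commons.utils.html;

import java.net.URL;

public class HtmlParserUtilDemo {
    public static void main(String[] args) throws Exception {
        // any reachable HTML page works here; example.com is a placeholder
        URL url = new URL("http://www.example.com/index.html");
        String charset = HtmlParserUtil.autoDetectCharset(url);
        String[] parsed = HtmlParserUtil.parseHtml(url.toString(), charset);
        if (parsed != null) {
            // parsed[0] is the body text (may be null), parsed[1] the title
            System.out.println("title: " + parsed[1]);
            System.out.println("body length: "
                    + (parsed[0] == null ? 0 : parsed[0].length()));
        }
    }
}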
The multi-threaded crawler class, HtmlCaptureRunner.java:
package com.sillycat.api.thread.runner;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.HashSet;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.BaseHrefTag;
import org.htmlparser.tags.FrameTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import com.sillycat.api.commons.utils.StringUtil;
import com.sillycat.api.commons.utils.html.HtmlParserUtil;
public class HtmlCaptureRunner implements Runnable {
public Log logger = LogFactory.getLog(getClass());
/* base (seed) URL */
protected String baseURL = null;
private String contentPath = null;
/**
* queue of URLs waiting to be parsed; every newly discovered link is
* stored here and taken out again in first-in first-out (FIFO) order
*/
protected ArrayList URLs = new ArrayList();
/* URLs already handled, kept to avoid fetching the same link twice */
protected HashSet indexedURLs = new HashSet();
protected Parser parser = new Parser();
/* number of worker threads, 2 by default */
protected int threads = 2;
/* character encoding used when parsing pages */
protected String charset;
/* base port */
protected int basePort;
/* base host */
protected String baseHost;
/* whether to process/store the pages, true by default */
protected boolean justDatabase = true;
/* whether to check the index for the current URL, to avoid re-fetching */
protected boolean isRepeatedCheck = false;
public HtmlCaptureRunner() {
PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
factory.registerTag(new LocalLinkTag());
factory.registerTag(new LocalFrameTag());
factory.registerTag(new LocalBaseHrefTag());
parser.setNodeFactory(factory);
}
public void capture() {
URLs.clear();
URLs.add(getBaseURL());
int responseCode = 0;
String contentType = "";
try {
HttpURLConnection uc = (HttpURLConnection) new URL(baseURL)
.openConnection();
responseCode = uc.getResponseCode();
contentType = uc.getContentType();
} catch (MalformedURLException mue) {
logger.error("Invalid URL : " + getBaseURL());
} catch (UnknownHostException uhe) {
logger.error("UnknowHost : " + getBaseURL());
} catch (SocketException se) {
logger.error("Socket Error : " + se.getMessage() + " "
+ getBaseURL());
} catch (IOException ie) {
logger.error("IOException : " + ie);
}
if (responseCode == HttpURLConnection.HTTP_OK
&& contentType.startsWith("text/html")) {
try {
charset = HtmlParserUtil.autoDetectCharset(new URL(baseURL));
basePort = new URL(baseURL).getPort();
baseHost = new URL(baseURL).getHost();
/* JChardet often reports windows-1252 for GBK-encoded Chinese pages, so override it */
if (charset.equals("windows-1252"))
charset = "GBK";
long start = System.currentTimeMillis();
ArrayList threadList = new ArrayList();
for (int i = 0; i < threads; i++) {
Thread t = new Thread(this, "Spider Thread #" + (i + 1));
t.start();
threadList.add(t);
}
while (threadList.size() > 0) {
Thread child = (Thread) threadList.remove(0);
try {
child.join();
} catch (InterruptedException ie) {
logger.error("InterruptedException : " + ie);
}
}
// for (int i = 0; i < threads; i++) {
// threadPool.getThreadPoolExcutor().execute(new
// Thread(this,"Spider Thread #" + (i + 1)));
// }
long elapsed = System.currentTimeMillis() - start;
logger.info("Finished in " + (elapsed / 1000) + " seconds");
logger.info("The Count of the Links Captured is "
+ indexedURLs.size());
} catch (MalformedURLException e) {
e.printStackTrace();
}
}
}
public void run() {
String url;
while ((url = dequeueURL()) != null) {
if (justDatabase) {
process(url);
}
}
threads--;
}
/**
* process a single URL: parse the page and add it to the Lucene index;
* automatic charset detection keeps the crawl robust across encodings
*/
protected void process(String url) {
/* parseHtml returns null when the page could not be fetched or parsed */
String[] result = HtmlParserUtil.parseHtml(url, charset);
String content = (result == null) ? null : result[0];
String title = (result == null) ? null : result[1];
if (content != null && content.trim().length() > 0) {
// content
System.out.println(url);
// title
// DateTools.timeToString(System.currentTimeMillis()
}
}
/* dequeue a single URL from the pending queue */
public synchronized String dequeueURL() {
while (true)
if (URLs.size() > 0) {
String url = (String) URLs.remove(0);
indexedURLs.add(url);
if (isToBeCaptured(url)) {
NodeList list;
try {
int bookmark = URLs.size();
/* fetch all nodes of the page */
parser.setURL(url);
try {
list = new NodeList();
for (NodeIterator e = parser.elements(); e
.hasMoreNodes();)
list.add(e.nextNode());
} catch (EncodingChangeException ece) {
/* the encoding changed mid-parse: reset and parse again */
parser.reset();
list = new NodeList();
for (NodeIterator e = parser.elements(); e
.hasMoreNodes();)
list.add(e.nextNode());
}
/**
* handle the robots <META> tag according to
* http://www.robotstxt.org/wc/meta-user.html
*/
NodeList robots = list
.extractAllNodesThatMatch(
new AndFilter(new NodeClassFilter(
MetaTag.class),
new HasAttributeFilter("name",
"robots")), true);
if (0 != robots.size()) {
MetaTag robot = (MetaTag) robots.elementAt(0);
String content = robot.getAttribute("content")
.toLowerCase();
if ((-1 != content.indexOf("none"))
|| (-1 != content.indexOf("nofollow")))
/* remove from the tail so the shifting indexes stay valid */
for (int i = URLs.size() - 1; i >= bookmark; i--)
URLs.remove(i);
}
} catch (ParserException pe) {
logger.error("ParserException : " + pe);
}
return url;
}
} else {
threads--;
if (threads > 0) {
try {
wait();
threads++;
} catch (InterruptedException ie) {
logger.error("InterruptedException : " + ie);
}
} else {
notifyAll();
return null;
}
}
}
private boolean isHTML(String url) {
if (!url.endsWith(".html")) {
return false;
}
if (StringUtil.isNotBlank(contentPath)) {
if (!url.startsWith(baseURL + "/" + contentPath)) {
return false;
}
}
return true;
}
/**
* decide whether an extracted link should be crawled: its port and host
* must match the base URL and its type must be text/html or text/plain
*/
public boolean isToBeCaptured(String url) {
boolean flag = false;
HttpURLConnection uc = null;
int responseCode = 0;
String contentType = "";
String host = "";
int port = 0;
try {
URL source = new URL(url);
String protocol = source.getProtocol();
if (protocol != null && protocol.equals("http")) {
host = source.getHost();
port = source.getPort();
uc = (HttpURLConnection) source.openConnection();
uc.setConnectTimeout(8000);
responseCode = uc.getResponseCode();
contentType = uc.getContentType();
}
} catch (MalformedURLException mue) {
logger.error("Invalid URL : " + url);
} catch (UnknownHostException uhe) {
logger.error("UnknowHost : " + url);
} catch (SocketException se) {
logger.error("Socket Error : " + se.getMessage() + " " + url);
} catch (SocketTimeoutException ste) {
logger.error("Socket Connection Time Out : " + url);
} catch (FileNotFoundException fnfe) {
logger.error("broken link " + url + " ignored");
} catch (IOException ie) {
logger.error("IOException : " + ie);
}
if (port == basePort
&& responseCode == HttpURLConnection.HTTP_OK
&& host.equals(baseHost)
&& (contentType.startsWith("text/html") || contentType
.startsWith("text/plain")))
flag = true;
return flag;
}
class LocalLinkTag extends LinkTag {
public void doSemanticAction() {
String link = getLink();
if (link.endsWith("/"))
link = link.substring(0, link.length() - 1);
int pos = link.indexOf("#");
if (pos != -1)
link = link.substring(0, pos);
/* add the link to the processing queue */
if (!(indexedURLs.contains(link) || URLs.contains(link))) {
if (isHTML(link)) {
URLs.add(link);
}
}
setLink(link);
}
}
/**
* Frame tag that rewrites the SRC URLs. The SRC URLs are mapped to local
* targets if they match the source.
*/
class LocalFrameTag extends FrameTag {
public void doSemanticAction() {
String link = getFrameLocation();
if (link.endsWith("/"))
link = link.substring(0, link.length() - 1);
int pos = link.indexOf("#");
if (pos != -1)
link = link.substring(0, pos);
/* add the link to the processing queue */
if (!(indexedURLs.contains(link) || URLs.contains(link))) {
if (isHTML(link)) {
URLs.add(link);
}
}
setFrameLocation(link);
}
}
/**
* Base tag that doesn't show. The toHtml() method is overridden to return
* an empty string, effectively shutting off the base reference.
*/
class LocalBaseHrefTag extends BaseHrefTag {
public String toHtml() {
return ("");
}
}
public String getBaseURL() {
return baseURL;
}
public void setBaseURL(String baseURL) {
this.baseURL = baseURL;
}
public int getThreads() {
return threads;
}
public void setThreads(int threads) {
this.threads = threads;
}
public String getCharset() {
return charset;
}
public void setCharset(String charset) {
this.charset = charset;
}
public int getBasePort() {
return basePort;
}
public void setBasePort(int basePort) {
this.basePort = basePort;
}
public String getBaseHost() {
return baseHost;
}
public void setBaseHost(String baseHost) {
this.baseHost = baseHost;
}
public boolean isJustDatabase() {
return justDatabase;
}
public void setJustDatabase(boolean justDatabase) {
this.justDatabase = justDatabase;
}
public String getContentPath() {
return contentPath;
}
public void setContentPath(String contentPath) {
this.contentPath = contentPath;
}
}
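Outside of Spring, the runner can also be wired up directly; a minimal sketch (SpiderDemo is illustrative and not part of the original project; the values mirror the property file further below):
package com.sillycat.api.thread.runner;

public class SpiderDemo {
    public static void main(String[] args) {
        HtmlCaptureRunner runner = new HtmlCaptureRunner();
        runner.setBaseURL("http://www.safedv.com");
        runner.setBasePort(80);
        runner.setCharset("UTF-8"); // overwritten by auto-detection in capture()
        runner.setThreads(3);
        runner.setContentPath("product");
        runner.capture(); // blocks until all spider threads have finished
    }
}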
The Spring configuration file, applicationContext-bean.xml:
<bean id="productCapture"
class="com.sillycat.api.thread.runner.HtmlCaptureRunner" >
<property name="contentPath" value="${product.contentPath}" />
<property name="basePort" value="${product.base.port}" />
<property name="baseURL" value="${product.base.url}" />
<property name="charset" value="${product.base.code}" />
<property name="threads" value="${product.base.threads}"/>
</bean>
<bean id="messageCapture"
class="com.sillycat.api.thread.runner.HtmlCaptureRunner" >
<property name="contentPath" value="${message.contentPath}" />
<property name="basePort" value="${message.base.port}" />
<property name="baseURL" value="${message.base.url}" />
<property name="charset" value="${message.base.code}" />
<property name="threads" value="${message.base.threads}"/>
</bean>
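The ${...} placeholders above only resolve if a placeholder configurer is registered in the same context; a minimal sketch, assuming easySearch.properties (shown next) is on the classpath:
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="location" value="classpath:easySearch.properties" />
</bean>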
The easySearch.properties configuration file:
#==========================================
# spider configuration
#==========================================
product.contentPath=product
product.base.port=80
product.base.url=http://www.safedv.com
product.base.code=UTF-8
product.base.threads=3
message.contentPath=message
message.base.port=80
message.base.url=http://www.safedv.com
message.base.code=UTF-8
message.base.threads=3
The unit test class, HtmlRunnerTest.java:
package com.sillycat.api.thread;
import com.sillycat.api.commons.base.BaseManagerTest;
import com.sillycat.api.thread.runner.HtmlCaptureRunner;
public class HtmlRunnerTest extends BaseManagerTest {
private HtmlCaptureRunner productCapture;
private HtmlCaptureRunner messageCapture;
protected void setUp() throws Exception {
super.setUp();
productCapture = (HtmlCaptureRunner) appContext.getBean("productCapture");
messageCapture = (HtmlCaptureRunner) appContext.getBean("messageCapture");
}
protected void tearDown() throws Exception {
super.tearDown();
}
public void testDummy() {
assertTrue(true);
}
/* the "n" prefix keeps JUnit 3 from running this long crawl by default */
public void ntestProductCapture() {
productCapture.capture();
}
public void testMessageCapture(){
messageCapture.capture();
}
}