Installing Twisted and the Scrapy crawling framework on Ubuntu 10.04
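Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl web sites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Part of Scrapy's appeal is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and the latest version adds support for crawling web 2.0 sites.
Preparation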
Requirements
Python 2.5, 2.6, 2.7 (3.x is not yet supported)
Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
w3lib
lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
simplejson (not required if using Python 2.6 or above)
pyopenssl (for HTTPS support. Optional, but highly recommended)
---------------------------------------------
Installing Twisted
sudo apt-get install python-twisted python-libxml2 python-simplejson
When the installation finishes, start python and check that Twisted was installed successfully.
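The check itself is not shown in the original. A minimal sketch of one way to do it, assuming only the standard library: the helper name try_import is hypothetical, and a successful import is a necessary sign of a working install, not a complete test.

```python
from __future__ import print_function
import sys


def try_import(name):
    """Return (True, version or None) if `name` imports, else (False, error text)."""
    try:
        __import__(name)
    except ImportError as exc:
        return False, str(exc)
    module = sys.modules[name]
    # Not every module exposes __version__, so fall back to None.
    return True, getattr(module, "__version__", None)


if __name__ == "__main__":
    # Module names provided by the three packages installed with apt-get above.
    for name in ("twisted", "libxml2", "simplejson"):
        ok, detail = try_import(name)
        print("%-12s %-8s %s" % (name, "OK" if ok else "MISSING", detail or ""))
```

Any module reported as MISSING points back to the apt-get step above.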
pycrypto
wget http://pypi.python.org/packages/source/p/pycrypto/pycrypto-2.5.tar.gz#md5=783e45d4a1a309e03ab378b00f97b291
tar -zxvf pycrypto-2.5.tar.gz
cd pycrypto-2.5
sudo python setup.py install
The hostname in /etc/hosts and /etc/hostname must be consistent; otherwise an error is reported.
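A minimal sketch of such a consistency check. It assumes that "consistent" means the name stored in /etc/hostname also appears as a host name on some line of /etc/hosts; the function names are hypothetical.

```python
from __future__ import print_function


def hostname_in_hosts(hostname, hosts_text):
    """True if `hostname` is listed as a name on any non-comment line of hosts_text."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments
        fields = line.split()
        # fields[0] is the IP address; the remaining fields are host names.
        if len(fields) >= 2 and hostname in fields[1:]:
            return True
    return False


def read_text(path):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    try:
        hostname = read_text("/etc/hostname").strip()
        hosts = read_text("/etc/hosts")
    except (IOError, OSError) as exc:
        print("could not read host files: %s" % exc)
    else:
        if hostname_in_hosts(hostname, hosts):
            print("%s is listed in /etc/hosts" % hostname)
        else:
            print("%s is NOT listed in /etc/hosts" % hostname)
```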
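The Python version here is 2.6.5. Install the Python development package first; otherwise the build fails with a gcc exit-status error.
sudo apt-get install python-dev
pycrypto (build again after installing python-dev)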
wget http://pypi.python.org/packages/source/p/pycrypto/pycrypto-2.5.tar.gz#md5=783e45d4a1a309e03ab378b00f97b291
tar -zxvf pycrypto-2.5.tar.gz
cd pycrypto-2.5
sudo python setup.py install
Installing pycrypto under Python 2.6.5 produces this warning:
warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.
After upgrading Python to 2.7, the warning disappeared.
wget -c http://www.python.org/ftp/python/2.7/Python-2.7.tar.bz2
tar -xvjpf Python-2.7.tar.bz2
cd Python-2.7
./configure
make
sudo make altinstall
cd /usr/bin
sudo mv python python.bak
sudo mv python-config python-config.bak
sudo mv python2 python2.bak
cd /usr/local/bin
sudo ln -s python2.7 python
sudo ln -s python2.7-config python-config
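After switching the links, it helps to confirm which interpreter the python command now starts. A minimal sketch: the SUPPORTED tuple is copied from the Requirements list above, and the helper name is_supported is hypothetical.

```python
from __future__ import print_function
import sys

# (major, minor) pairs taken from the Requirements list above.
SUPPORTED = ((2, 5), (2, 6), (2, 7))


def is_supported(version_info):
    """True if the (major, minor) pair is in the Requirements list."""
    return tuple(version_info[:2]) in SUPPORTED


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    print("in the requirements list:", is_supported(sys.version_info))
```

Run it with plain python; the interpreter path shows whether the new link is the one being picked up.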
pyOpenSSL
wget http://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.13.tar.gz#md5=767bca18a71178ca353dff9e10941929
tar -zxvf pyOpenSSL-0.13.tar.gz
cd pyOpenSSL-0.13
sudo python setup.py install
Test whether the installation succeeded
$ python
>>> import Crypto
>>> import twisted.conch.ssh.transport
>>> print Crypto.PublicKey.RSA
<module 'Crypto.PublicKey.RSA' from '/usr/python/lib/python2.5/site-packages/Crypto/PublicKey/RSA.pyc'>
>>> import OpenSSL
>>> import twisted.internet.ssl
>>> twisted.internet.ssl
<module 'twisted.internet.ssl' from '/usr/python/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/ssl.pyc'>
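If you see output like this, the pyOpenSSL module is installed correctly. Otherwise, re-check the installation steps above (note that the twisted.conch check also needs pycrypto).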
> # python
> Python 2.6.6 (r266:84292, Dec 7 2011, 20:38:36)
> [GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import OpenSSL
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "OpenSSL/__init__.py", line 40, in <module>
>     from OpenSSL import crypto
> ImportError: cannot import name crypto

Notice that the complaint is about "OpenSSL/__init__.py" instead of something more sensible like "/usr/lib/python2.6/site-packages/OpenSSL/__init__.py". You're probably testing this using a working directory inside the pyOpenSSL source directory, and thus getting the wrong version of the OpenSSL package (one that does not include the built extension modules). Try testing in a different directory - or build the extension modules in-place using the -i option to distutils' build_ext command.
In my case the shell was still inside the pyOpenSSL-0.13 source directory. Leaving that directory made the error go away:
cd ..
If this doesn't solve the problem, consider asking in a forum dedicated to CentOS 6 or pyOpenSSL, since the issue isn't really based on any software or other materials from the Twisted project. Also, include more information when you do so, for example a full installation transcript and a manifest of installed files, otherwise it's not likely anyone will be able to provide a better answer.
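An interactive python session searches the current directory before site-packages, which is why a source tree can shadow the installed package. A minimal sketch that checks for this without importing anything; the helper name shadowed_by_dir is hypothetical.

```python
from __future__ import print_function
import os


def shadowed_by_dir(name, directory):
    """True if `directory` contains its own package or module called `name`."""
    package_init = os.path.join(directory, name, "__init__.py")
    module_file = os.path.join(directory, name + ".py")
    return os.path.isfile(package_init) or os.path.isfile(module_file)


if __name__ == "__main__":
    cwd = os.getcwd()
    if shadowed_by_dir("OpenSSL", cwd):
        print("%s contains its own OpenSSL package; start python elsewhere" % cwd)
    else:
        print("no local OpenSSL package in %s" % cwd)
```

Run from inside the pyOpenSSL-0.13 source directory, it should report the local package; one level up it should not.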
Install the easy_install tool
sudo apt-get install python-setuptools
w3lib
sudo easy_install -U w3lib
Scrapy
wget http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.3.tar.gz#md5=59f1225f7692f28fa0f78db3d34b3850
tar -zxvf Scrapy-0.14.3.tar.gz
cd Scrapy-0.14.3
sudo python setup.py install
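Verifying the Scrapy installation
With the installation and configuration steps above complete, Scrapy is installed. We can verify it from the command line: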
$ scrapy
Scrapy 0.14.3 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
fetch Fetch a URL using the Scrapy downloader
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
The output above lists a fetch command, which downloads a given web page. First look at the help for fetch:
$ scrapy fetch --help
Usage
=====
scrapy fetch [options] <url>
Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging
Options
=======
--help, -h show this help message and exit
--spider=SPIDER use this spider
--headers print response HTTP headers instead of body
Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--lsprof=FILE write lsprof profiling stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
Following the usage above, pass a URL to fetch the content of one page:
ubuntu[/home/ioslabs/scrapy]scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
2012-07-19 11:11:34+0800 [scrapy] INFO: Scrapy 0.14.3 started (bot: scrapybot)
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-07-19 11:11:35+0800 [default] INFO: Spider opened
2012-07-19 11:11:35+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-19 11:11:35+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
2012-07-19 11:11:35+0800 [default] INFO: Closing spider (finished)
2012-07-19 11:11:35+0800 [default] INFO: Dumping spider stats:
{'downloader/request_bytes': 227,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 21943,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 19, 3, 11, 35, 902943),
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 7, 19, 3, 11, 35, 559084)}
2012-07-19 11:11:35+0800 [default] INFO: Spider closed (finished)
2012-07-19 11:11:35+0800 [scrapy] INFO: Dumping global stats:
{'memusage/max': 23015424, 'memusage/startup': 23015424}
As you can see, we fetched a web page successfully.
To go further with the Scrapy framework, follow the guide on the official Scrapy site.
The tutorial is at http://doc.scrapy.org/en/latest/intro/tutorial.html