2018 Scrapy Environment Enhance(3)Docker ENV
Set Up Scrapy Ubuntu DEV
> sudo apt-get install -qy python python-dev python-distribute python-pip ipython
> sudo apt-get install -qy firefox xvfb
> sudo apt-get install -qy libffi-dev libxml2-dev libxslt-dev lib32z1-dev libssl-dev
> sudo apt-get install python3-venv
> sudo apt-get install python3-dev
> sudo apt install unzip
> sudo apt-get install libxi6 libgconf-2-4
> sudo apt-get install libnss3 libgconf-2-4
> sudo apt-get install chromium-browser
If needed, make git remember the username and password:
> git config credential.helper 'cache --timeout=300000'
Create the virtual ENV and activate it
> python3 -m venv ./env
> source ./env/bin/activate
> pip install --upgrade pip
> pip install selenium pyvirtualdisplay
> pip install boto3
> pip install beautifulsoup4 requests
Install Twisted
> wget http://twistedmatrix.com/Releases/Twisted/17.9/Twisted-17.9.0.tar.bz2
> tar xjf Twisted-17.9.0.tar.bz2
> cd Twisted-17.9.0
> python setup.py install
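Optionally, verify that Twisted is importable from the virtual ENV:
> python -c "import twisted; print(twisted.version)"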
> pip install lxml scrapy scrapyjs
Install Browser and Driver
> wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
> unzip chromedriver_linux64.zip
> chmod a+x chromedriver
> sudo mv chromedriver /usr/local/bin/
> chromedriver --version
ChromeDriver 2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7)
> chromium-browser -version
Chromium 65.0.3325.181 Built on Ubuntu , running on Ubuntu 16.04
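With the driver and browser in place, a quick smoke test is worthwhile. A minimal sketch, assuming Chromium lives at /usr/bin/chromium-browser and chromedriver is on the PATH; the URL is only an example:
from pyvirtualdisplay import Display
from selenium import webdriver

# xvfb-backed virtual display, since the DEV box has no real screen
display = Display(visible=0, size=(1366, 768))
display.start()

options = webdriver.ChromeOptions()
options.binary_location = '/usr/bin/chromium-browser'  # assumed default Ubuntu path
options.add_argument('--no-sandbox')  # often needed when running as root
driver = webdriver.Chrome(options=options)  # chromedriver is found on the PATH
driver.get('http://icanhazip.com/')
print(driver.page_source)
driver.quit()
display.stop()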
Set Up Tor Network Proxy
> sudo apt-get install tor
> sudo apt-get install netcat
> sudo apt-get install curl
> sudo apt-get install privoxy
Check my Local IP
> curl http://icanhazip.com/
52.14.197.xxx
Set Up Tor
> tor --hash-password prxxxxxxxx
16:01D5D02xxxxxxxxxxxxxxxxxxxxxxxxxxx
> cat /etc/tor/torrc
ControlPort 9051
> cat /etc/tor/torrcpassword
HashedControlPassword 16:01D5D02EFA3D6A5xxxxxxxxxxxxxxxxxxx
Note: tor only reads /etc/tor/torrc by default, so the HashedControlPassword line must also end up in torrc for the control-port password to take effect.
Start Tor
> sudo service tor start
Verify it changes my IP
> torify curl http://icanhazip.com/
192.36.27.4
This command did not work for me here
> echo -e 'AUTHENTICATE "pricemonitor1234"\r\nsignal NEWNYM\r\nQUIT' | nc 127.0.0.1 9051
Try using Python to change the IP
> pip install stem
> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from stem import Signal
>>> from stem.control import Controller
>>> with Controller.from_port(port=9051) as controller:
... controller.authenticate()
... controller.signal(Signal.NEWNYM)
...
That should work if the permissions are right.
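If the controller requires the hashed password configured above, authenticate with the plain-text password explicitly. A sketch — the password is a placeholder standing in for whatever was fed to tor --hash-password:
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='prxxxxxxxx')  # placeholder: the plain-text password hashed above
    controller.signal(Signal.NEWNYM)  # request a fresh Tor circuit / exit IP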
Configure the Proxy
> cat /etc/privoxy/config
forward-socks5t / 127.0.0.1:9050 .
Start the Service
> sudo service privoxy start
Verify the IP
> curl -x 127.0.0.1:8118 http://icanhazip.com/
185.220.101.6
Verify with the requests API
> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
>>> import requests
>>> response = requests.get('http://icanhazip.com/', proxies={'http': '127.0.0.1:8118'})
>>> response.text.strip()
'185.220.101.6'
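To route Scrapy traffic through the same Privoxy endpoint, a downloader middleware can set the proxy on every request. A minimal sketch, assuming a standard Scrapy project; the middleware and project names are hypothetical:
# middlewares.py (hypothetical module inside the Scrapy project)
class TorProxyMiddleware(object):
    def process_request(self, request, spider):
        # Privoxy listens on 8118 and forwards to the Tor SOCKS port configured above
        request.meta['proxy'] = 'http://127.0.0.1:8118'

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.TorProxyMiddleware': 100,  # 'myproject' is a placeholder
}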
Think About the Docker Application
Dockerfile
#Run a scrapy server side
#Prepare the OS
FROM ubuntu:16.04
MAINTAINER Carl Luo <luohuazju@gmail.com>
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update
RUN apt-get -qqy dist-upgrade
#Prepare the dependencies
RUN apt-get install -qy python3 python3-dev python-distribute python3-pip ipython
RUN apt-get install -qy firefox xvfb
RUN pip3 install selenium pyvirtualdisplay
RUN pip3 install boto3 beautifulsoup4 requests
RUN apt-get install -qy libffi-dev libxml2-dev libxslt-dev lib32z1-dev libssl-dev
RUN pip3 install lxml scrapy scrapyjs
RUN pip3 install --upgrade pip
RUN apt-get install -qy python3-venv
RUN apt-get install -qy libxi6 libgconf-2-4 libnss3 libgconf-2-4
RUN apt-get install -qy chromium-browser
RUN apt-get install -qy wget unzip git
#add tool
ADD install/chromedriver /usr/local/bin/
RUN pip3 install scrapyd
#copy the config
RUN mkdir -p /tool/scrapyd/
ADD conf/scrapyd.conf /tool/scrapyd/
#set up the app
EXPOSE 6801
RUN mkdir -p /app/
ADD start.sh /app/
WORKDIR /app/
CMD [ "./start.sh" ]
Makefile
IMAGE=sillycat/public
TAG=ubuntu-scrapy-1.0
NAME=ubuntu-scrapy-1.0

docker-context:

build: docker-context
	docker build -t $(IMAGE):$(TAG) .

run:
	docker run -d -p 6801:6801 --name $(NAME) $(IMAGE):$(TAG)

debug:
	docker run -p 6801:6801 --name $(NAME) -ti $(IMAGE):$(TAG) /bin/bash

clean:
	docker stop $(NAME)
	docker rm $(NAME)

logs:
	docker logs $(NAME)

publish:
	docker push $(IMAGE)
start.sh
#!/bin/sh -ex
#start the service
cd /tool/scrapyd/
scrapyd
Configuration in conf/scrapyd.conf
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 100
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 20
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6801
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
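Once scrapyd is listening on 6801, the services above map to plain HTTP endpoints. A sketch of scheduling and listing jobs with requests — the project and spider names are placeholders for whatever has actually been deployed (e.g. with scrapyd-client):
import requests

BASE = 'http://localhost:6801'

# Kick off a crawl
r = requests.post(BASE + '/schedule.json',
                  data={'project': 'pricemonitor', 'spider': 'demo'})
print(r.json())  # expect something like {'status': 'ok', 'jobid': '...'}

# Inspect pending/running/finished jobs
r = requests.get(BASE + '/listjobs.json', params={'project': 'pricemonitor'})
print(r.json())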
References:
http://sillycat.iteye.com/blog/2418353
http://sillycat.iteye.com/blog/2418229