Multithreaded download of cnblogs news images

mushme

Blog category: python

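Two main problems need to be solved:
1. How to avoid downloading the same image twice.
2. Network access is usually slow, so multiple threads are needed to speed up the downloads.
Solution:
1. First walk through the list pages and save each image URL to a database, checking for duplicates before saving.
2. Use multiple threads to download the images recorded in the database.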
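The duplicate check in step 1 can also be delegated to the database itself. The following is only a minimal sketch of that idea, not the code used in this post: it assumes a UNIQUE constraint on the link column and SQLite's INSERT OR IGNORE, and the helper name save_link_once is made up for illustration.

```python
import sqlite3


def save_link_once(conn, link, title=""):
    """Insert link unless it is already stored.

    Returns True if a new row was added, False if the link was a duplicate.
    save_link_once is a hypothetical helper, not part of the original script.
    """
    # Assumption: the table is created here with a UNIQUE link column.
    # An images table that already exists without UNIQUE is left unchanged.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "id INTEGER PRIMARY KEY, title TEXT, link TEXT UNIQUE, "
        "content TEXT, status INTEGER DEFAULT 0)"
    )
    # OR IGNORE silently skips the row when the UNIQUE constraint would fail.
    cur = conn.execute(
        "INSERT OR IGNORE INTO images (title, link, status) VALUES (?, ?, 0)",
        (title, link),
    )
    conn.commit()
    return cur.rowcount == 1
```

With this layout the select-then-insert pair collapses into one statement, so two writers cannot both decide that the same link is new.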
I. Collecting the image URLs


# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup
from urllib import request
# SQLite driver from the standard library
import sqlite3

DB_FILE_NAME = "images.sqlite"

def saveLinks(link, title=None):
    """Store an image link in the database unless it is already there."""
    if title is None:
        title = ""
    conn = sqlite3.connect(DB_FILE_NAME)
    cursor = conn.cursor()
    # Create the images table on first use
    cursor.execute('create table IF NOT EXISTS images (id INTEGER PRIMARY KEY, title varchar(100), link varchar(100), content text, status Integer default(0))')
    # Parameterized queries keep the SQL valid even if a link or title contains a quote
    cursor.execute('select id from images where link=?', (link,))
    values = cursor.fetchall()
    if len(values) > 0:  # the link was saved earlier
        print('Link already exists: ' + link)
    else:
        cursor.execute('insert into images (title, link, status) values (?, ?, 0)', (title, link))
        print("save success. " + link)
    # Close the cursor
    cursor.close()
    # Commit the transaction
    conn.commit()
    # Close the connection
    conn.close()
    
def getListPage(id):
    # 1. Fetch the HTML of the list page
    listlink = 'http://news.cnblogs.com/n/page/' + id + '/'
    print('-'*20 + listlink)
    with request.urlopen(listlink) as f:
        html_doc = f.read()
    # 2. Parse the page and pick out the topic images, which look like this:
    #    <div class="entry_summary" style="display: block;">
    #        <a href="/n/topic_389.htm"><img src="http://images0.cnblogs.com/news_topic/阿里云.gif" class="topic_img" alt=""/></a>
    soup = BeautifulSoup(html_doc, "html.parser")
    news_array = soup.find_all('img', {'class': 'topic_img'})
    for news in news_array:
        src = news.get("src")
        if src:  # skip <img> tags that have no src attribute
            saveLinks(src)
        
# ---- main script starts here ----
startId = 6  # first list page to crawl
size = 5     # number of list pages to crawl
for m in range(startId, startId + size):
    getListPage(str(m))

II. Downloading the actual images

# -*- coding:utf-8 -*-
import threading
import urllib.request
import sqlite3
from urllib.parse import quote

SAVE_PATH = "F:\\python\\download\\"
DB_FILE_NAME = "images.sqlite"

hosturl = 'http://www.weibo.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0',
           'Referer': hosturl}
 

# Download the image at imgUrl; if savePath is not given, save it as SAVE_PATH + image file name
def downImg(imgUrl, savePath=None):
    # Percent-encode only the file name, which may contain non-ASCII characters
    preUrl, imgName = imgUrl.rsplit('/', 1)
    request = urllib.request.Request(preUrl + '/' + quote(imgName), None, headers)

    if savePath is None:
        savePath = SAVE_PATH + imgName
    # The with block closes both the response and the file, even on errors
    with urllib.request.urlopen(request) as response, open(savePath, 'wb') as f:
        f.write(response.read())
    print('Saved:' + savePath)

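# When saving, make sure the file extension matches the image type: a JPEG must be written to a .jpg file name, otherwise the saved file may not be recognized as a valid image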

def saveImage(id):
    conn = sqlite3.connect(DB_FILE_NAME)
    cursor = conn.cursor()
    # Fetch the row only if it has not been downloaded yet (status=0)
    cursor.execute('select * from images where status=0 and id=?', (id,))
    values = cursor.fetchall()

    for line in values:
        link = line[2]  # columns: id, title, link, content, status
        try:
            downImg(link)
            # Mark the image as downloaded
            cursor.execute('update images set status=1 where id=?', (id,))
        except Exception as e:
            print('except:', e)
    cursor.close()
    conn.commit()
    conn.close()

# Download the images whose ids fall in [startId, startId + size)
def saveImagess(startId, size):
    for m in range(startId, startId + size):
        saveImage(str(m))
    
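threads = []
startPos = 1  # first image id to download

# 4 threads, each handling a block of 10 consecutive ids
threadCount = 4  # number of threads to start
p_size = 10      # number of ids per thread

for i in range(threadCount):
    t1 = threading.Thread(target=saveImagess, args=(startPos + i*p_size, p_size))
    threads.append(t1)

for t in threads:
    t.start()
# Wait for every thread to finish, not only the last one
for t in threads:
    t.join()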
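Splitting the ids into fixed blocks works, but a thread that gets a block of already downloaded rows finishes early while the others are still busy. As a rough alternative sketch (not the approach used in this post), a thread pool from concurrent.futures can hand out links one at a time. The function name download_pending and its parameters are assumptions made for illustration; download stands for any callable that takes a URL and raises on failure, for example downImg.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def download_pending(db_file, download, max_workers=4):
    """Download every image with status=0 using a thread pool.

    download_pending is a hypothetical helper; `download` is any callable
    that takes a URL and raises an exception when the download fails.
    Returns the number of rows marked as downloaded.
    """
    conn = sqlite3.connect(db_file)
    try:
        rows = conn.execute(
            "SELECT id, link FROM images WHERE status=0"
        ).fetchall()
        done = 0
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Only the network work runs in worker threads.
            futures = {pool.submit(download, link): row_id
                       for row_id, link in rows}
            for future, row_id in futures.items():
                try:
                    future.result()  # re-raises any error from the worker
                except Exception as e:
                    print("except:", e)
                    continue
                # All database writes stay in the calling thread.
                conn.execute("UPDATE images SET status=1 WHERE id=?", (row_id,))
                done += 1
        conn.commit()
        return done
    finally:
        conn.close()
```

Under these assumptions it would be called as download_pending(DB_FILE_NAME, downImg). Keeping the SQLite connection in one thread also sidesteps lock contention between several writers.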

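While we are at it, here is a variant for the girl-picture board on jandan.net; handle the links the same way as in Part I.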

def getListPage(id):
    # 1. Fetch the HTML of the page (id is unused here because the URL is fixed)
    listlink = 'http://jandan.net/ooxx'
    print('-'*20 + listlink)
    with request.urlopen(listlink) as f:
        html_doc = f.read()
    soup = BeautifulSoup(html_doc, "html.parser")
    news_array = soup.find_all('a', {'class': 'view_img_link'})
    for news in news_array:
        print(news.get("href"))
        # saveLinks(news.get("href"))

The image download code has since been optimized further; see the attachment.

jandan.zip (88.7 KB)

2016-03-31 10:17