Image fetch failure

san_yun

Blog category: Problems

Today I found this entry in the error log:

2013-06-06 12:25:13,332 [ERROR] upload.service.UploadFileService - image open error ,url = http://img.xitisi.com/Commodity/BOBOTou_2204/RiXiFaXingNvShengJiaFa_HuaBuWu2011XinKuan_QiLiuHaiBoboBoBoTouXiuLianDuanFaZongSe20120210034904.jpg ,cannot identify image fil

I checked the response headers for the image:

Accept-Ranges	`bytes`
Content-Encoding	`gzip`
Content-Length	`452449`
Content-Type	`image/jpeg`
Date	`Thu, 06 Jun 2013 05:03:08 GMT`
Etag	`"8041952b9a50cd1:1a9a"`
Last-Modified	`Fri, 22 Jun 2012 17:12:15 GMT`
Server	`Microsoft-IIS/6.0`
Vary	`Accept-Encoding`
X-Powered-By	`ASP.NET`

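It turns out the response body is gzip-compressed (note `Content-Encoding: gzip`), so Image cannot identify it as an image. The body has to be decompressed first.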
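
To illustrate why the decoder gives up, here is a minimal Python 3 sketch (the original code is Python 2). It compares the leading bytes of a plain JPEG with those of the same data after gzip compression. The helper `sniff_format` and the fake image bytes are hypothetical and only serve the illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of every gzip stream
JPEG_MAGIC = b"\xff\xd8\xff"  # first three bytes of a JPEG file


def sniff_format(body):
    """Guess the container format from the leading bytes of a response body."""
    if body.startswith(GZIP_MAGIC):
        return "gzip"
    if body.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


# Stand-in for real image bytes: a JPEG signature followed by filler.
fake_jpeg = JPEG_MAGIC + b"\xe0" + b"\x00" * 32
compressed = gzip.compress(fake_jpeg)

print(sniff_format(fake_jpeg))                     # jpeg
print(sniff_format(compressed))                    # gzip
print(sniff_format(gzip.decompress(compressed)))   # jpeg
```

An image library that looks for a JPEG signature sees the gzip signature instead, which is consistent with the "cannot identify image" error in the log.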

Solutions:

1. Decompress the body with Python's gzip module

    # Python 2; needs: import gzip; from StringIO import StringIO
    def _read_content(self, response):
        content_type = response.headers.get('Content-Type')
        content_encoding = response.headers.get('Content-Encoding')
        if response.code == 200 and content_type and 'image' in content_type:
            data = StringIO(response.read())
            if content_encoding == 'gzip':
                # The body is gzip-compressed; decompress it before handing it to Image.
                data = StringIO(gzip.GzipFile(fileobj=data).read())
            return data
        else:
            # Log the final URL as reported by the response object.
            logger.error("can't open image, content type=%s, url=%s" % (content_type, response.geturl()))
            return None
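
For readers on Python 3, the same idea can be sketched roughly as follows. This is not the author's code: the function name `decode_image_body` is hypothetical, and accepting `x-gzip` and mixed-case header values is an added assumption about what servers may send.

```python
import gzip
import io


def decode_image_body(body, content_encoding=None):
    """Return a file-like object holding the decoded image bytes.

    body: raw bytes read from the HTTP response.
    content_encoding: value of the Content-Encoding header, or None.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        # Undo the transfer compression so an image library sees the real file.
        body = gzip.decompress(body)
    return io.BytesIO(body)


# Usage with stand-in bytes instead of a real download.
original = b"\xff\xd8\xff\xe0" + b"\x00" * 16
decoded = decode_image_body(gzip.compress(original), "GZIP")
print(decoded.read() == original)  # True
```

Returning a `BytesIO` mirrors the `StringIO` wrapper in the method above, so the result can be passed to whatever opens the image.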

2. Tell the server in the request headers that gzip is not accepted

        # Request headers, set up once (placement in the constructor is assumed).
        self.headers = {}
        self.headers['User-Agent'] = """Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 GTB6"""
        self.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        self.headers['Accept-Encoding'] = 'identity'
        self.headers['Accept-Language'] = "zh,en-us;q=0.7,en;q=0.3"
        self.headers['Accept-Charset'] = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"
        self.headers['Connection'] = "keep-alive"
        self.headers['Keep-Alive'] = "115"
        self.headers['Cache-Control'] = "no-cache"

    # Python 2; needs: import urllib2
    def open(self, url):
        try:
            request = urllib2.Request(url, headers=self.headers)
            response = self.opener.open(request, timeout=self.timeout)
            return self._read_content(response)
        except Exception as e:
            logger.error(url)
            logger.exception(e)
            return None
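
A rough Python 3 counterpart of the header approach, using `urllib.request`, might look like this. The function name `build_image_request`, the user agent string, and the example URL are hypothetical; no network request is made here.

```python
import urllib.request


def build_image_request(url, user_agent="example-image-fetcher/0.1"):
    """Build a request that asks the server for an uncompressed body."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "image/*,*/*;q=0.8",
        # "identity" means: send the body without any content encoding.
        "Accept-Encoding": "identity",
    }
    return urllib.request.Request(url, headers=headers)


req = build_image_request("http://example.com/picture.jpg")
# urllib stores header names capitalized as "Accept-encoding".
print(req.get_header("Accept-encoding"))  # identity
```

A server is not guaranteed to honor this header, so it seems safer to keep the decompression step from solution 1 in place as well.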


2013-06-06 13:07