
Study notes on HTTP: The Definitive Guide

 

Chapters 1~4 give a brief overview of the HTTP protocol, mainly covering how connections are established over TCP.

Chapters 5~10 cover web servers, HTTP proxies, web caches, HTTP encryption, web clients, and the future of HTTP.

Section 2.4

1. Why is URL encoding needed?

    Applications need to walk a fine line. It is best for client applications to convert any unsafe or restricted characters before sending any URL to any other application. Once all the unsafe characters have been encoded, the URL is in a canonical form that can be shared between applications; there is no need to worry about the other application getting confused by any of the characters' special meanings.
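A minimal Java sketch of this conversion, using the standard java.net.URLEncoder; the query string is made up for the example:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlEncodeDemo {
    public static void main(String[] args) {
        // Unsafe characters (the space and the quote here) become %XX
        // escapes, leaving the URL in a canonical, shareable form.
        String query = "stainless steel 3\" screws";
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        System.out.println(encoded); // stainless+steel+3%22+screws
    }
}

Note that URLEncoder follows the form-encoding convention of turning spaces into "+" rather than %20.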

2. The principle for extensions:

"Be conservative in what you send; be liberal in what you accept." (the Robustness Principle)

6.1.1 Types of proxies

Public proxies and private proxies (for example, the proxy built into the Sogou browser is a private proxy).

 

6.1.2 Proxies versus gateways

Strictly speaking, proxies connect two or more applications that speak the same protocol, while gateways hook up two or more parties that speak different protocols. A gateway acts as a "protocol converter," allowing a client to complete a transaction with a server, even when the client and server speak different protocols.

In practice, the difference between proxies and gateways is blurry. Because browsers and servers implement different versions of HTTP, proxies often do some amount of protocol conversion. And commercial proxy servers implement gateway functionality to support SSL security protocols, SOCKS firewalls, FTP access, and web-based applications.

 

9.1.5

A good reference book for implementing huge data structures is Managing Gigabytes: Compressing and Indexing Documents and Images, by Witten et al. (Morgan Kaufmann). This book is full of tricks and techniques for managing large amounts of data.

 

9.4 Excluding Robots

The robot community understood the problems that robotic web site access could cause. In 1994, a simple, voluntary technique was proposed to keep robots out of where they don't belong and to provide webmasters with a mechanism to better control robot behavior. The standard was named the "Robots Exclusion Standard," but it is often just called robots.txt, after the file where the access-control information is stored.

The idea of robots.txt is simple. Any web server can provide an optional file named robots.txt in the document root of the server. This file contains information about which robots can access which parts of the server. If a robot follows this voluntary standard, it will request the robots.txt file from the web site before accessing any other resource on the site.

Before visiting any URLs on a web site, a robot must retrieve and process the robots.txt file on the web site, if it is present. There is a single robots.txt resource for the entire web site, defined by the hostname and port number. If the site is virtually hosted, there can be a different robots.txt file for each virtual docroot, as with any other file.

 

9.4.2.1 Fetching robots.txt

Robots fetch the robots.txt resource using the HTTP GET method, like any other file on the web server. The server returns the robots.txt file, if present, in a text/plain body. If the server responds with a 404 Not Found HTTP status code, the robot can assume that there are no robotic access restrictions and that it can request any file.

Robots should pass along identifying information in the From and User-Agent headers, to help site administrators track robotic accesses and to provide contact information in the event that the site administrator needs to inquire or complain about the robot. Here's an example HTTP crawler request from a commercial web robot:

GET /robots.txt HTTP/1.0
Host: www.joes-hardware.com
User-Agent: Slurp/2.0
Date: Wed Oct 3 20:22:48 EST 2001
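A sketch of the same request issued from Java, assuming the java.net.http client that ships with Java 11 and later; the From address is a hypothetical placeholder:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsFetcher {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Identify the robot through User-Agent and From, as recommended.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://www.joes-hardware.com/robots.txt"))
                .header("User-Agent", "Slurp/2.0")
                .header("From", "crawler-admin@example.com") // hypothetical
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 404) {
            // No robots.txt: no robotic access restrictions.
            System.out.println("unrestricted");
        } else {
            System.out.println(response.body());
        }
    }
}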

 

9.4.3 robots.txt File Format

The robots.txt file has a very simple, line-oriented syntax. There are three types of lines in a robots.txt file: blank lines, comment lines, and rule lines. Rule lines look like HTTP headers (<Field>: <value>) and are used for pattern matching. For example:

User-Agent: slurp
User-Agent: webcrawler
Disallow: /private

User-Agent: *
Disallow:

The example shows a robots.txt file that allows the Slurp and Webcrawler robots to access any file except those in the private subdirectory. The same file also prevents all other robots from accessing anything on the site.
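A rough Java sketch of parsing these line types into field/value rule pairs; it is not a full robots.txt engine (record grouping and path matching are omitted):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RobotsTxtParser {
    // Splits robots.txt text into (field, value) rule pairs, skipping
    // blank lines and comment lines starting with '#'.
    static List<Map.Entry<String, String>> parse(String text) {
        List<Map.Entry<String, String>> rules = new ArrayList<>();
        for (String line : text.split("\r?\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int colon = line.indexOf(':');
            if (colon < 0) continue; // not a rule line
            rules.add(Map.entry(line.substring(0, colon).trim(),
                                line.substring(colon + 1).trim()));
        }
        return rules;
    }

    public static void main(String[] args) {
        String sample = "User-Agent: slurp\nDisallow: /private\n\nUser-Agent: *\nDisallow:\n";
        parse(sample).forEach(r ->
                System.out.println(r.getKey() + " -> " + r.getValue()));
    }
}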

 

9.4.3.1 The User-Agent line

Each robot record starts with one or more User-Agent lines, of the form:

User-Agent: <robot-name>

or:

User-Agent: *

The robot name (chosen by the robot implementor) is sent in the User-Agent header of the robot's HTTP GET request.

When a robot processes a robots.txt file, it must obey the record with either:

    The first robot name that is a case-insensitive substring of the robot's name

    The first robot name that is "*"

If the robot can't find a User-Agent line that matches its name, and can't find a wildcarded "User-Agent: *" line, no record matches, and access is unlimited.

Because robot names are matched as case-insensitive substrings, be careful about false matches. For example, "User-Agent: bot" matches all robots named Bot, Robot, Bottom-Feeder, Spambot, and Dont-Bother-Me.
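A quick Java demonstration of that pitfall, using the names above:

import java.util.List;

public class RobotNameMatch {
    public static void main(String[] args) {
        // A record naming "bot" matches any robot whose name contains
        // "bot" case-insensitively, usually far more than intended.
        String recordName = "bot";
        for (String robot : List.of("Bot", "Robot", "Bottom-Feeder",
                                    "Spambot", "Dont-Bother-Me")) {
            boolean matches = robot.toLowerCase()
                                   .contains(recordName.toLowerCase());
            System.out.println(robot + ": " + matches); // all print true
        }
    }
}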

 

11.6.4 Different Cookies for Different Sites

In general, a browser sends to a server only those cookies that the server generated. Cookies generated by joes-hardware.com are sent to joes-hardware.com and not to bobs-books.com or marys-movies.com.

Many web sites contract with third-party vendors to manage advertisements. These advertisements are made to look like they are integral parts of the web site, and they do push persistent cookies. When the user goes to a different web site serviced by the same advertisement company, the persistent cookie set earlier is sent back again by the browser (because the domains match). A marketing company could use this technique, combined with the Referer header, to potentially build an exhaustive data set of user profiles and browsing habits. Modern browsers allow you to configure privacy settings to restrict third-party cookies.

 

11.6.4.1 Cookie Domain attribute

A server generating a cookie can control which sites get to see that cookie by adding a Domain attribute to the Set-Cookie response header. For example, the following HTTP response header tells the browser to send the cookie user="mary17" to any site in the domain .airtravelbargains.com:

Set-Cookie: user="mary17"; domain="airtravelbargains.com"

If the user visits www.airtravelbargains.com, specials.airtravelbargains.com, or any other site ending in .airtravelbargains.com, the following Cookie header will be issued:

Cookie: user="mary17"
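A small Java sketch of the tail-matching rule, using java.net.HttpCookie's domain-matching helper and the hostnames from the example:

import java.net.HttpCookie;

public class DomainCookieDemo {
    public static void main(String[] args) {
        // The cookie a server would generate with a Domain attribute.
        HttpCookie cookie = new HttpCookie("user", "mary17");
        cookie.setDomain(".airtravelbargains.com");

        String domain = cookie.getDomain();
        // Hosts ending in .airtravelbargains.com receive the cookie.
        System.out.println(HttpCookie.domainMatches(domain,
                "www.airtravelbargains.com"));      // true
        System.out.println(HttpCookie.domainMatches(domain,
                "specials.airtravelbargains.com")); // true
        System.out.println(HttpCookie.domainMatches(domain,
                "www.bobs-books.com"));             // false
    }
}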

 

11.6.5 Cookie Ingredients

 

11.6.10 Cookies, Security, and Privacy

Still, it is good to be cautious when dealing with privacy and user tracking, because there is always potential for abuse. The biggest misuse comes from third-party web sites using persistent cookies to track users. This practice, combined with IP addresses and information from the Referer header, has enabled these marketing companies to build fairly accurate user profiles and browsing patterns.

 

Chapter 15. Entities and Encodings

 

Chapter 16. Internationalization: charset and character encoding

 

16.2.1 Charset Is a Character-to-Bits Encoding

16.2.5 Content-Type Charset Header and META Tags

Web servers send the client the MIME charset tag in the Content-Type header, using the charset parameter: Content-Type: text/html; charset=iso-2022-jp

If no charset is explicitly listed, the receiver may try to infer the character set from the document contents. For HTML content, character sets might be found in <META HTTP-EQUIV="Content-Type"> tags that describe the charset.

<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">

This specifies both the encoding of the HTML file and the encoding of the bit stream in the entity body. The example shows how HTML META tags set the charset to the Japanese encoding iso-2022-jp. If the document is not HTML, or there is no META Content-Type tag, software may attempt to infer the character encoding by scanning the actual text for common patterns indicative of languages and encodings.

If a client cannot infer a character encoding, it assumes iso-8859-1.
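A minimal Java sketch of this fallback, assuming a plain string parse of the Content-Type header value (META tag sniffing is omitted):

public class CharsetSniffer {
    // Returns the charset parameter of a Content-Type value, or the
    // iso-8859-1 default when none is listed.
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase().startsWith("charset=")) {
                return p.substring("charset=".length()).replace("\"", "");
            }
        }
        return "iso-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=iso-2022-jp")); // iso-2022-jp
        System.out.println(charsetOf("text/html"));                      // iso-8859-1
    }
}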

 

16.2.6 The Accept-Charset Header

There are thousands of defined character encoding and decoding methods, developed over the past several decades. Most clients do not support all the various character coding and mapping systems. HTTP clients can tell servers precisely which character systems they support, using the Accept-Charset request header. The Accept-Charset header value provides a list of character encoding schemes that the client supports. For example, the following HTTP request header indicates that a client accepts the Western European iso-8859-1 character system as well as the UTF-8 variable-length Unicode compatibility system. A server is free to return content in either of these character encoding schemes.

Accept-Charset: iso-8859-1, utf-8

Note that there is no Content-Type response header to match the Accept-Charset request header. The response character set is carried back from the server by the charset parameter of the Content-Type response header, to be compatible with MIME. It's too bad this isn't symmetric, but all the information still is there.
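A sketch of such a request in Java, again assuming the java.net.http client; the URL is a placeholder:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AcceptCharsetDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Advertise the character systems this client supports.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://www.example.com/"))
                .header("Accept-Charset", "iso-8859-1, utf-8")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The chosen charset comes back in Content-Type, not in a
        // matching Accept-Charset response header.
        System.out.println(response.headers()
                .firstValue("Content-Type").orElse("(none)"));
    }
}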

 

16.3.5.1 US-ASCII: The mother of all character sets

   HTTP messages (headers, URIs, etc.) use US-ASCII.

16.3.5.2 iso-8859

   iso-8859-1, also known as Latin-1, is the default character set for HTML.

16.5 Internationalized URIs

Today's URIs are composed of a restricted subset of US-ASCII characters (basic Latin alphabet letters, digits, and a few special characters); anything else must be URL-encoded.

16.5.2 URI Character Repertoire

    The subset of US-ASCII characters permitted in URLs can be divided into reserved, unreserved, and escape character classes.

    URI character syntax:

    Unreserved: [A-Za-z0-9] | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

    Reserved: ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

    Escape: "%" <HEX> <HEX>
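These classes transcribe directly into Java regular expressions; a sketch:

import java.util.regex.Pattern;

public class UriCharClasses {
    static final Pattern UNRESERVED = Pattern.compile("[A-Za-z0-9\\-_.!~*'()]");
    static final Pattern RESERVED   = Pattern.compile("[;/?:@&=+$,]");
    static final Pattern ESCAPE     = Pattern.compile("%[0-9A-Fa-f]{2}");

    public static void main(String[] args) {
        System.out.println(UNRESERVED.matcher("~").matches()); // true
        System.out.println(RESERVED.matcher("/").matches());   // true
        System.out.println(ESCAPE.matcher("%20").matches());   // true
        System.out.println(UNRESERVED.matcher(" ").matches()); // false: must be escaped
    }
}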

 

16.5.3 Escaping and Unescaping (escapes are for characters whose code values fall within the US-ASCII range (0~127) but that may not be used directly; they are written as %<hex><hex>)

    URI "escapes" provide a way to safely insert reserved characters and other unsupported characters (such as spaces) inside URIs. An escape is a three-character sequence, consisting of a percent character (%) followed by two hexadecimal digit characters. The two hex digits represent the code for a US-ASCII character.

    For example, to insert a space (ASCII 32) in a URL, you could use the escape "%20", because 20 is the hexadecimal representation of 32. Similarly, if you wanted to include a percent sign and have it not be treated as an escape, you could enter "%25", where 25 is the hexadecimal value of the ASCII code for percent.

    Internally, HTTP applications should transport and forward URIs with the escapes in place. HTTP applications should unescape the URIs only when the data is needed. And, more importantly, the applications should ensure that no URI is ever unescaped twice, because percent signs that might have been encoded in an escape will themselves be unescaped, leading to loss of data.

    URL encoding: URLs can be sent over the Internet using only ASCII characters. Since URLs often contain characters outside the ASCII set, the URL has to be converted. URL encoding converts the URL into a valid ASCII format, replacing unsafe ASCII characters with "%" followed by two hexadecimal digits corresponding to the character values in the ISO-8859-1 character set.

    "中文" ==encodeURI==> "%E4%B8%AD%E6%96%87" (in the page's encoding) ==encodeURI again (the page encoding is converted to iso-8859-1, HTTP's default transport encoding, and the % signs are themselves encoded)==> "%25E4%25B8%25AD%25E6%2596%2587" (iso-8859-1) ==Tomcat decodes with ISO-8859-1==> "%E4%B8%AD%E6%96%87" ==Java decodes with UTF-8==> "中文"
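The same round trip can be sketched with Java's URLEncoder and URLDecoder; the ISO-8859-1 decode below merely simulates a container such as Tomcat decoding the URI with its default charset:

import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DoubleEncodeDemo {
    public static void main(String[] args) {
        String original = "中文";

        // First encode: page encoding (UTF-8 here) to %XX escapes.
        String once = URLEncoder.encode(original, StandardCharsets.UTF_8);
        System.out.println(once);   // %E4%B8%AD%E6%96%87

        // Second encode: the '%' signs themselves become %25.
        String twice = URLEncoder.encode(once, StandardCharsets.ISO_8859_1);
        System.out.println(twice);  // %25E4%25B8%25AD%25E6%2596%2587

        // The container decodes once with its default ISO-8859-1 ...
        String step1 = URLDecoder.decode(twice, StandardCharsets.ISO_8859_1);
        System.out.println(step1);  // %E4%B8%AD%E6%96%87

        // ... and application code decodes again with UTF-8.
        System.out.println(URLDecoder.decode(step1, StandardCharsets.UTF_8)); // 中文
    }
}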

16.5.4 Escaping International Characters (characters outside the ASCII range need URL encoding, using the character encoding the document was saved with)

    Note that escape values should be in the range of US-ASCII codes (0-127). Some applications attempt to use escape values to represent iso-8859-1 extended characters (128-255); for example, web servers might erroneously use escapes to encode filenames that contain international characters. This is incorrect and may cause problems with some applications.

    For example, the filename Sven Ölssen.html (containing an umlaut) might be encoded by a web server as Sven%20%D6lssen.html. It's fine to encode the space with %20, but it is technically illegal to encode the Ö with %D6, because %D6 represents a code value outside the US-ASCII range.

 

16.6.1 Headers and Out-of-Spec Data

HTTP headers must consist of characters from the US-ASCII character set.

However, not all clients and servers implement this correctly, so you may on occasion receive illegal characters with code values larger than 127.

 

Chapters 18~21 cover the technology for publishing and disseminating web content:

    Chapter 18 discusses the ways people deploy servers in modern web hosting environments, HTTP support for virtual web hosting, and how to replicate content across geographically distant servers.

    Chapter 19 discusses the technology for creating web content and installing it onto web servers.

    Chapter 20 surveys the tools and techniques for distributing incoming web traffic among a collection of servers.

    Chapter 21 covers log formats and common questions.

 
