CPU Study Notes

cooldatabase

Blog category: linux

Tags: Cache, SuSE, GCC, performance, data structures

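These are notes the author took while learning the basics of computer hardware. Having had
little prior exposure to the subject and no systematic training in it, the author has
probably made mistakes; corrections are welcome.

I. Main components of an Intel CPU: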

1. CPU core:

The processor proper. It executes instructions and processes data, and its computing power
depends closely on the CPU clock speed.

2. L1 Cache

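The L1 cache (level-1 cache) integrated into the CPU, also called the primary cache, temporarily
holds a subset of instructions and data. It runs at the same frequency as the CPU core and is
the fastest of all the caches. It is usually built from SRAM, which is expensive and
structurally complex, and because die area is limited, the L1 cache is usually small. The L1
cache is further split into a D-Cache (data cache) and an I-Cache (instruction cache). This
split design reduces contention between instruction fetches and data accesses and so improves
processor performance. The Pentium 4, however, replaced the I-Cache with a T-Cache (trace
cache), which stores already-decoded micro-operations.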

3. L2 Cache

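The level-2 cache, which supplies the instructions and data the CPU needs for its computations.
It usually consists of three parts: the L2 cache controller, the cache SRAM, and the cache tag
RAM, which act as the controller, the data store, and the lookup table respectively.
Because L1 cache is expensive, the CPU integrates an L2 cache to make up for the small L1
capacity. L2 cache is also built from SRAM (hence "cache SRAM"), not SDRAM, but it is larger
and slower than L1. An L2 cache integrated into the CPU is called on-die; it is usually two to
four times the size of the L1 cache, or more. In early systems the L2 cache chips were mostly
placed on the motherboard (on-board) rather than integrated into the CPU.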

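4. BSB (Backside Bus):

The backside bus connects the CPU core to the L2 cache and delivers the instructions and data
held in the L2 cache to the CPU. BSBs have run at three speeds: 66 MHz, half the core speed,
and full core speed. The BSB speed determines how fast the CPU can access the L2 cache, and
because most of the instructions and data that miss in L1 are served from the L2 cache, BSB
speed has a significant effect on system performance.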

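5. FSB (Frontside Bus):

The frontside bus connects the CPU to the motherboard chipset, typically to the memory
controller. The FSB clock is what is commonly called the external frequency. FSB speed affects
how fast the CPU can access main memory.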

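II. Related terms:

External frequency:

The base clock of the CPU's external (system) bus; the core frequency is this clock times the
multiplier. It is a clock rate rather than a bandwidth, although it bounds the bandwidth the
CPU has to main memory.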

Cache Line:

The smallest unit of memory that can be transferred between the main memory and the cache.
Rather than reading a single word or byte from main memory at a time, each cache entry
usually holds a certain number of words, known as a "cache line" or "cache block" and a
whole line is read and cached at once. This takes advantage of the principle of locality
of reference: if one location is read then nearby locations (particularly following locations)
are likely to be read soon afterwards. It can also take advantage of page-mode DRAM which
allows faster access to consecutive locations.

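In other words, it is the smallest unit of data transfer between main memory and the cache.
Every memory access by the CPU is made in units of cache lines: it requests one or more whole
lines. On Intel P5 and P6 class CPUs a cache line holds 32 bytes of data or instructions,
i.e. 256 bits. When the CPU requests one cache line from the L2 cache, 256 bits are
transferred to the CPU over the BSB. If the BSB is 64 bits wide, that takes at least 4
transfers, and if each transfer completes in 1 clock, moving one cache line takes at least
4 clocks; if the BSB is 256 bits wide, it takes only 1 clock.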

write through:

A cache architecture in which data is written to main memory at the same time as it is cached.

write back:

A cache architecture in which data is only written to main memory when it is forced out of the cache.

ATC:

An Intel BSB technology, called Advanced Transfer Cache, or ATC for short.

MIPS:

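MIPS (Million Instructions Per Second) expresses how many millions of instructions a processor
can execute per second. It is an outdated and unscientific measure of processor speed and
performance, since instruction counts are not comparable across different instruction sets.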

SSE

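SSE is the instruction set Intel developed for its Pentium III processors. It comprises 8
memory-optimization instructions for streaming block data transfers, 12 instructions that
extend MMX integer arithmetic, and 50 SIMD floating-point instructions. These instructions
speed up graphics, video, and audio processing. SSE2 added 144 instructions over the previous
generation.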

ALU

The ALU (Arithmetic Logic Unit) is the part of the CPU that performs arithmetic and logic
operations on data.

FPU

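The FPU (Floating-Point Unit) is the unit dedicated to floating-point arithmetic. Before the
Intel 80486, the FPU was a specially designed separate chip plugged into the motherboard, known
as a math coprocessor or floating-point processor. From the Intel 80486 onward, CPUs have
generally had a built-in FPU.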

MMU

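The MMU (Memory Management Unit) is the system component that manages virtual memory. It is
usually part of the CPU and has a small amount of storage of its own for the TLB (Translation
Look-aside Buffer), a table that maps virtual addresses to physical addresses. All data
requests go to the MMU, which determines whether the data is in RAM or on a mass-storage device.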

This article is from a CSDN blog; please cite the source when reposting: http://blog.csdn.net/yayong/archive/2005/04/17/351514.aspx

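I. Cache Coherence

An article written in 2004, "X86 Assembly Language Study Notes (1)", touched on the fact that
code compiled by gcc keeps the stack 16-byte aligned by default. This is done mainly for
performance.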

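Most modern CPUs have on-die L1 and L2 caches. The L1 cache is mostly write-through; the L2
cache is write-back and does not write to memory immediately, so the contents of the cache and
of memory can differ. In addition, in an MP (multiprocessor) environment each CPU has its own
private cache, so the contents of different CPUs' caches can also differ. Many MP
architectures, whether ccNUMA or SMP, therefore implement a cache coherence mechanism that
keeps the caches of the different CPUs consistent.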

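One way to implement cache coherence is a cache-snooping protocol, in which each CPU snoops the
bus to monitor the cache reads and writes of the other CPUs:

First, a cache line is the smallest unit of data transfer between cache and memory.

1. When CPU1 writes to its cache, every other CPU checks the corresponding cache line in its
own cache. If the line is dirty, it is written back to memory and CPU1's copy of the line is
refreshed; if it is not dirty, the line is invalidated.

2. When CPU1 reads from its cache, every other CPU writes the corresponding cache line back to
memory if its own copy is marked dirty, and CPU1's copy of the line is refreshed.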

So raising the CPU's cache hit rate and reducing data transfer between cache and memory
improves system performance.

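It is therefore important to keep data cache-line aligned when allocating memory for programs
and binary objects. Without that alignment, it becomes quite likely that processes or threads
running in parallel on different CPUs will read and write the same cache line at the same time.
The caches and memory then go through repeated write-backs and refreshes; this situation is
called cache thrashing (when the threads touch different variables that merely share a line,
it is also known as false sharing).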

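There are usually two ways to avoid cache thrashing effectively:

1. For heap allocations, many systems enforce an alignment in their malloc implementation.
2. For stack allocations, many compilers provide a stack alignment option.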

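Of course, if stack alignment is specified to the compiler, the program grows and uses more
memory, so the trade-off needs careful consideration. Here is a discussion I found through a
Google search: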

One of our customers complained about the additional code generated to
maintain the stack aligned to 16-byte boundaries, and suggested us to
default to the minimum alignment when optimizing for code size. This
has the caveat that, when you link code optimized for size with code
optimized for speed, if a function optimized for size calls a
performance-critical function with the stack misaligned, the
performance-critical function may perform poorly.

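II. gcc alignment options

-mpreferred-stack-boundary was already mentioned in "X86 Assembly Language Study Notes (1)".
I also found, through a Google search, a mailing-list message discussing stack alignment,
which I share here: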

----- Original Message -----
From: "Andreas Jaeger"
To: gcc@gcc.gnu.org
Cc: "Jens Wallner" wallner@ims.uni-hannover.de
Sent: Saturday, February 03, 2001 2:37 AM
Subject: Question about -mpreferred-stack-boundary

>
> We (glibc team) got a bug report that the stack is not aligned
> properly - and I'm a bit confused by the documentation of
> -mpreferred-stack-boundary which is:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @item -mpreferred-stack-boundary=@var{num}
> Attempt to keep the stack boundary aligned to a 2 raised to @var{num}
> byte boundary. If @samp{-mpreferred-stack-boundary} is not specified,
> the default is 4 (16 bytes or 128 bits).
>
> The stack is required to be aligned on a 4 byte boundary. On Pentium
> and PentiumPro, @code{double} and @code{long double} values should be
> aligned to an 8 byte boundary (see @samp{-malign-double}) or suffer
> significant run time performance penalties. On Pentium III, the
> Streaming SIMD Extension (SSE) data type @code{__m128} suffers similar
> penalties if it is not 16 byte aligned.
>
> To ensure proper alignment of this values on the stack, the stack boundary
> must be as aligned as that required by any value stored on the stack.
> Further, every function must be generated such that it keeps the stack
> aligned. Thus calling a function compiled with a higher preferred
> stack boundary from a function compiled with a lower preferred stack
> boundary will most likely misalign the stack. It is recommended that
> libraries that use callbacks always use the default setting.
>
> This extra alignment does consume extra stack space. Code that is sensitive
> to stack space usage, such as embedded systems and operating system kernels,
> may want to reduce the preferred alignment to
> @samp{-mpreferred-stack-boundary=2}.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Who has to align the stack for calls to a function - the caller or the
> callee? In other words: Does this mean that the stack has to be
> aligned before calling a function? Or does it have to be aligned when
> entering a function?
>
> Andreas
> --
> Andreas Jaeger
> SuSE Labs aj@suse.de
> private aj@arthur.inka.de
> http://www.suse.de/~aj
I believe the preferred alignment for long double is a 16 byte boundary, and
the stack (and instruction) alignments must be so set before entering a function.
Pentium 4 increases preferred data alignments to 32 bytes in some situations,
as well as increasing the number of situations (SSE2 instructions) where 16 byte
alignment is needed.

As this shows, stack alignment must already be guaranteed before a function is called.

This article is from a CSDN blog; please cite the source when reposting: http://blog.csdn.net/yayong/archive/2005/04/17/351550.aspx

2010-03-30 14:16