duguyidao

out of order write

17. Out of order execution (PPro, PII and PIII)

--------------------------------------------------------------------------------
http://www.XiaoHui.com Date: 2000-04-01 14:00

The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its operands are ready and there is a vacant execution unit for it. This makes out-of-order execution possible. If one part of the code is delayed because of a cache miss then it won't delay later parts of the code if they are independent of the delayed operations.

Writes to memory cannot execute out of order relative to other writes. There are four write buffers, so if you expect many cache misses on writes, or you are writing to uncached memory, it is recommended that you schedule four writes at a time and make sure the processor has something else to do before you give it the next four writes. Memory reads and other instructions can execute out of order, except IN, OUT and serializing instructions.
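As a rough illustration of the advice above, the following Python sketch arranges pending writes into groups of four, with one unit of unrelated work after each group. The function name `interleave_writes`, the string labels, and the choice of one filler item per group are assumptions made for this sketch. It models only the ordering of operations, not their timing.

```python
from itertools import islice

# From the text: the PPro, PII and PIII have four write buffers.
WRITE_BUFFERS = 4


def interleave_writes(writes, other_work, batch=WRITE_BUFFERS):
    """Return a schedule with at most `batch` writes in a row.

    After each group of writes, one item of unrelated work is emitted
    (if any is left), so the write buffers get time to drain.
    Any remaining unrelated work is appended at the end.
    """
    writes = iter(writes)
    other = iter(other_work)
    schedule = []
    while True:
        group = list(islice(writes, batch))
        if not group:
            break
        schedule.extend(group)
        filler = next(other, None)
        if filler is not None:
            schedule.append(filler)
    schedule.extend(other)  # leftover unrelated work
    return schedule
```

For example, six writes `w0` to `w5` and three work items `a`, `b`, `c` come out as `w0 w1 w2 w3 a w4 w5 b c`. In real code the filler would be register-only computation that does not depend on the writes.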

If your code writes to a memory address and soon afterwards reads from the same address, the read may mistakenly be executed before the write, because the ROB doesn't know the memory addresses at the time of reordering. The error is detected when the write address is calculated, and the read operation (which was executed speculatively) then has to be redone. The penalty for this is approximately 3 clocks. The only way to avoid this penalty is to make sure the execution unit has other things to do between a write and a subsequent read from the same memory address.

There are several execution units clustered around five ports. Ports 0 and 1 handle arithmetic and similar operations. Simple move, arithmetic and logic operations can go to either port 0 or port 1, whichever is vacant first. Port 0 also handles multiplication, division, integer shifts and rotates, and floating point operations. Port 1 also handles jumps and some MMX and XMM operations. Port 2 handles all reads from memory and a few string and XMM operations, port 3 calculates addresses for memory writes, and port 4 executes all memory write operations. In chapter 29 you'll find a complete list of the uops generated by code instructions with an indication of which ports they go to. Note that all memory write operations require two uops, one for port 3 and one for port 4, while memory read operations use only one uop (port 2).

In most cases each port can receive one new uop per clock cycle. This means that you can execute up to 5 uops in the same clock cycle if they go to five different ports, but since there is a limit of 3 uops per clock earlier in the pipeline you will never execute more than 3 uops per clock on average.

You must make sure that no execution port receives more than one third of the uops if you want to maintain a throughput of 3 uops per clock. Use the table of uops in chapter 29 and count how many uops go to each port. If ports 0 and 1 are saturated while port 2 is free, you can improve your code by replacing some MOV register,register or MOV register,immediate instructions with MOV register,memory in order to move some of the load from ports 0 and 1 to port 2.
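The port-counting rule can be sketched as a small Python calculation. This is a simplified model, not a simulator: it assumes each port accepts one uop per clock, that at most 3 uops per clock pass through the pipeline overall, and it ignores latencies and dependencies. The function name `min_clocks` and the port labels are made up for this sketch.

```python
def min_clocks(counts):
    """Lower bound on clock cycles for a block of uops.

    counts maps a port label to a number of uops:
      "p0", "p1"        uops that need that specific port
      "p01"             uops that can go to either port 0 or port 1
      "p2", "p3", "p4"  memory read, write address, write data
    Returns (clocks, name of the limiting resource).
    """
    p0 = counts.get("p0", 0)
    p1 = counts.get("p1", 0)
    p01 = counts.get("p01", 0)
    total = sum(counts.values())
    bounds = {
        "3 uops per clock limit": total / 3,
        "port 0": p0,
        "port 1": p1,
        # Flexible uops are shared between the two ports.
        "ports 0 and 1 combined": (p0 + p1 + p01) / 2,
        "port 2": counts.get("p2", 0),
        "port 3": counts.get("p3", 0),
        "port 4": counts.get("p4", 0),
    }
    limiter = max(bounds, key=bounds.get)
    return bounds[limiter], limiter


# Six register moves all compete for ports 0 and 1.
before = min_clocks({"p01": 6})
# Two of them replaced by MOV register,memory, which uses port 2.
after = min_clocks({"p01": 4, "p2": 2})
```

In this model the first block needs at least 3 clocks because ports 0 and 1 are the bottleneck, while the second reaches the 2-clock bound set by the 3 uops per clock limit.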

Most uops take only one clock cycle to execute, but multiplications, divisions, and many floating point operations take more:

Floating point addition and subtraction take 3 clocks, but the execution unit is fully pipelined so that it can receive a new FADD or FSUB in every clock cycle before the preceding ones are finished (provided, of course, that they are independent).

Integer multiplication takes 4 clocks, floating point multiplication 5, and MMX multiplication 3 clocks. Integer and MMX multiplication are pipelined so that the unit can receive a new instruction every clock cycle. Floating point multiplication is partially pipelined: the execution unit can receive a new FMUL instruction two clocks after the preceding one, so that the maximum throughput is one FMUL per two clock cycles. The holes between the FMULs cannot be filled by integer multiplications because they use the same circuitry. XMM additions and multiplications take 3 and 4 clocks respectively, and are fully pipelined. But since each logical XMM register is implemented as two physical 64-bit registers, you need two uops for a packed XMM operation, and the throughput will then be one arithmetic XMM instruction every two clock cycles. XMM add and multiply instructions can execute in parallel because they don't use the same execution port.

Integer and floating point division take up to 39 clocks and are not pipelined. This means that the execution unit cannot begin a new division until the previous division is finished. The same applies to square root and transcendental functions.

Jump instructions, calls, and returns are also not fully pipelined. You cannot execute a new jump in the first clock cycle after a preceding jump. So the maximum throughput for jumps, calls, and returns is one for every two clocks.
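The difference between latency and throughput in the figures above can be made concrete with a little arithmetic. The Python sketch below takes the latencies and issue intervals from the text and estimates the time for a run of identical operations, first when they are independent and then when each depends on the previous result. The table `UNITS` and both function names are invented for this sketch, division is entered at its worst case of 39 clocks, and all other bottlenecks are ignored.

```python
# (latency, clocks between successive issues), taken from the text.
UNITS = {
    "FADD": (3, 1),          # fully pipelined
    "IMUL": (4, 1),          # pipelined
    "MMX multiply": (3, 1),  # pipelined
    "FMUL": (5, 2),          # partially pipelined
    "DIV": (39, 39),         # worst case, not pipelined
}


def clocks_independent(op, n):
    """Clocks for n independent operations issued back to back."""
    if n <= 0:
        return 0
    latency, interval = UNITS[op]
    # The first result arrives after `latency` clocks; each further
    # operation can start `interval` clocks after the previous one.
    return latency + (n - 1) * interval


def clocks_dependent(op, n):
    """Clocks for n operations where each needs the previous result."""
    latency, _ = UNITS[op]
    return max(n, 0) * latency
```

Under these assumptions ten independent FADDs finish in 12 clocks, whereas a chain of ten dependent FADDs takes 30. Ten independent FMULs take 23 clocks because a new one can start only every second clock.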

You should, of course, avoid instructions that generate many uops. The LOOP XX instruction, for example, should be replaced by DEC ECX / JNZ XX.

If you have consecutive POP instructions then you may break them up to reduce the number of uops:

POP ECX / POP EBX / POP EAX ; can be changed to:
MOV ECX,[ESP] / MOV EBX,[ESP+4] / MOV EAX,[ESP+8] / ADD ESP,12


The former code generates 6 uops; the latter generates only 4 and decodes faster. Doing the same with PUSH instructions is less advantageous because the split-up code is likely to generate register read stalls unless you have other instructions to put in between or the registers have been renamed recently. Doing it with CALL and RET instructions will interfere with prediction in the return stack buffer. Note also that the ADD ESP instruction can cause an AGI stall in earlier processors.
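The uop totals quoted above can be checked with a short tally. The per-instruction counts in the Python sketch below are an assumption chosen to be consistent with those totals (2 uops for each POP, 1 for each register load from memory, 1 for the ADD). The authoritative figures are in the uop table in chapter 29. The names `UOPS` and `count_uops` are made up for this sketch.

```python
# Assumed uops per instruction form; see the lead-in for the caveat.
UOPS = {
    "POP r": 2,    # one memory read plus one stack pointer update
    "MOV r,m": 1,  # one memory read
    "ADD r,i": 1,  # one arithmetic uop
}


def count_uops(instructions):
    """Sum the uops generated by a sequence of instruction forms."""
    return sum(UOPS[form] for form in instructions)


pop_version = ["POP r", "POP r", "POP r"]
mov_version = ["MOV r,m", "MOV r,m", "MOV r,m", "ADD r,i"]

pop_uops = count_uops(pop_version)  # 6
mov_uops = count_uops(mov_version)  # 4
```

The saving comes from updating ESP once at the end instead of after every load.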
http://www.xiaohui.com/dev/mmx/mmx_p_17.htm