Lessons from the test lab: investigating a pleasant surprise
This post describes our recent investigation into an interesting performance problem: benchmarks that we were surprised to find running significantly faster than we expected on new hardware. Along the way we discuss useful benchmarking tools, how to validate results, and why it pays to know exactly what hardware you're running on.
This all started in our performance test lab. During the development of Visual Studio, each new build undergoes a suite of automated performance tests, running in a lab full of identical machines. These performance tests allow us to track Visual Studio's performance over time, and detect performance regressions (when something gets unexpectedly worse). We recently added a batch of new machines in our lab, and that's when the fun started.
Pop Quiz: How Much Faster?
Old machine: dual-core Intel Pentium D 830 processor, running at 3 GHz, with 1 GB of RAM.
New machine: quad-core Intel Xeon 5355 processor, running at 2.66 GHz, with 4 GB of RAM.
Given the differences in the two hardware configurations above, how much faster would you expect the new machine to be when running a Visual Studio performance test? Less than, the same as, twice, three times, or four times the performance of the older machine?
One line of reasoning might look at the relative clock frequencies of the processors on the two machines. This might lead you to expect the newer processor cores to be slower than the older ones, since their clock frequency is 11% lower. By this reasoning you might conclude that single-threaded applications would perform poorly on the new machine.
Another line of reasoning would factor in the number of cores in the two systems. Since the new machine has twice the number of cores, you might expect it to have about twice the performance on multi-threaded applications. (If you also accounted for the lower clock frequency, you'd end up with a figure of 1.78 times the performance of the old machine.)
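The arithmetic behind these first two estimates is simple enough to check; here is a minimal sketch using only the clock speeds and core counts quoted above:

```python
# Naive estimates from the spec-sheet numbers above.
old_clock_ghz, new_clock_ghz = 3.00, 2.66
old_cores, new_cores = 2, 4

# Reasoning 1: clock frequency alone (single-threaded).
clock_ratio = new_clock_ghz / old_clock_ghz
print(f"single-threaded estimate: {clock_ratio:.2f}x")  # ~0.89x, i.e. 11% slower

# Reasoning 2: core count, adjusted for the lower clock (multi-threaded).
# Prints ~1.77x; the 1.78 figure above uses the exact 2.667 GHz clock.
print(f"multi-threaded estimate: {clock_ratio * new_cores / old_cores:.2f}x")
```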
A third approach might estimate the impact of RAM size. We’ve quadrupled the amount of RAM, so maybe any benchmarks that used to page to disk can now execute entirely in memory and hence will be orders of magnitude faster. [We'll cheat here and tell you that our benchmarks are generally not memory constrained].
So far, all these options seem plausible. What's your guess?
What we naively expected to find lay somewhere between the first two lines of reasoning: that the new machines would deliver one to two times the performance of the old machines, depending on the particular benchmark.
What we actually found is that many of our single-threaded CPU-bound benchmarks run about twice as fast on the new machine, while scalable multi-threaded benchmarks run up to four times as fast. This was a pleasant surprise, because it significantly reduces the overall time to run all the benchmarks. But it did leave us wondering why we were getting much greater speedups than our naive explanations would suggest. The rest of this post explores that question.
Using WinSAT and SPEC to Validate Benchmark Results
To make sure this wasn't a fluke result, we used the Windows System Assessment Tool (winsat.exe). This is a built-in tool that can quickly give a representative view of a machine's performance. It is multi-threaded, taking full advantage of all the cores on a machine. Here are the WinSAT CPU results:
Benchmark                  Old Machine   New Machine   Speedup
CPU – Compression (MB/s)   70.5          262.0         3.7
CPU – Encryption (MB/s)    52.3          139.3         2.7
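For anyone who wants to reproduce numbers like these, WinSAT can be driven from a script. Below is a minimal sketch; it assumes the Vista-era assessment names used here (cpu -compression, cpu -encryption, mem), requires an elevated prompt, and the exact report format varies by Windows version:

```python
import subprocess

# Run the individual WinSAT assessments used in this post and capture
# their console output. Assumes an elevated prompt; the assessment
# names are the Vista-era ones (later Windows versions add variants).
for args in (["cpu", "-compression"], ["cpu", "-encryption"], ["mem"]):
    result = subprocess.run(["winsat", *args],
                            capture_output=True, text=True, check=True)
    print(result.stdout)  # the MB/s figures appear in this report text
```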
We also wanted to validate our results against other real-world benchmarks. For this we turned to the SPEC website. SPEC produces a series of benchmark suites, plus a very formal process that ensures results are reproducible and can fairly be compared across different manufacturers. More importantly for our purposes, SPEC posts all reported benchmark results on their web site. You won't always be able to find your exact machine listed, but with the details from a tool like CPU-Z you can generally find results for a machine with the same CPU configuration and clock speed.
We used the "CINT2006" benchmarks – this is a widely-used benchmark suite concentrating on integer performance. We compared results for both CINT2006, which is a good test of single-threaded performance, and CINT2006 Rate, which tests the ability of a system to execute multiple copies of CINT2006, and is therefore a better test of multi-threaded performance. For two representative machines that are similar to our old and new hardware, here are the results:
Benchmark       Old Machine   New Machine   Speedup
CINT2006        9.85          15.5          1.6
CINT2006 Rate   18.0          44.4          2.5
The WinSAT and SPEC results confirm that the new machines are much faster than our naive expectations, even for benchmarks such as CINT2006 that cannot take advantage of the extra cores. So what were we missing?
Using CPU-Z to Examine Machine Configurations
To answer this, we need a deeper understanding of the configurations of the two systems.
Unfortunately, finding detailed configuration information isn't always straightforward. For example, we know that level two (L2) cache size impacts performance, but Windows doesn't report it, and it's not easy to reboot into the BIOS to take a look at cache size when the machine is located in a remote test lab. This is where machine reporting tools like CPU-Z come in. You can run CPU-Z remotely on an unknown machine and get back a nicely formatted HTML report showing exactly what the hardware is. Here's a deeper look at our old and new systems:
Feature               Old Machine                    New Machine
CPU name              Pentium D 830 (“Smithfield”)   Xeon X5355 (“Clovertown”)
CPU speed             3.00 GHz                       2.66 GHz
Number of cores       2                              4
L1 cache (per core)   16 KB                          32 KB
L2 cache (total)      2 MB                           8 MB
System RAM            1 GB DDR2                      4 GB DDR2
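Collecting these reports across a lab full of machines is also scriptable. A minimal sketch, assuming CPU-Z accepts an -html=<file> report switch (an assumption about its command line) and leaving the remote-execution mechanism (PsExec, a lab agent, etc.) out of scope:

```python
import subprocess

# Generate an HTML hardware report with CPU-Z. Assumes cpuz.exe is on
# the PATH and accepts the -html=<file> report switch; run this on the
# target machine via whatever remote-execution mechanism your lab uses.
subprocess.run(["cpuz.exe", "-html=hw_report"], check=True)
# CPU-Z then writes hw_report.html listing the CPU, caches, and memory.
```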
Using BCDEdit to Disable Cores
Now we can try to tease out the relative impacts of the many changes from the old configuration to the new one. The first and easiest step is to disable two of the four cores on a new machine, to enable a fairer "apples to apples" comparison of cores between old and new machines.
To do this we used the Windows BCDEdit tool, which replaces the old method of editing BOOT.INI by hand. Here we were particularly concerned with the order in which cores are disabled. This is important because the 8 MB of L2 cache in the Xeon “Clovertown” processors is divided: two of the four cores share 4 MB, and the other two cores share the other 4 MB. To keep our benchmark comparisons as fair as possible, we wanted to make sure that only one of the L2 caches was in use after disabling two cores. We used CPU-Z again after rebooting to confirm this.
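For reference, the core-count change itself is a one-line BCDEdit edit. A minimal sketch (run elevated; numproc is the BCD element that caps the number of logical processors, and it only takes effect after a reboot):

```python
import subprocess

# Cap Windows at two logical processors via BCDEdit (the successor to
# the old /NUMPROC switch in BOOT.INI). Run from an elevated prompt;
# reboot for it to take effect, then confirm with CPU-Z as described.
subprocess.run(["bcdedit", "/set", "{current}", "numproc", "2"], check=True)

# To restore all cores later:
#   bcdedit /deletevalue {current} numproc
```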
Now we were in a position to do a fairer “cores to cores” comparison between the old and new machines. Here's a summary from WinSAT:
Benchmark                  Old Machine   New (2 cores)   Speedup
CPU – Compression (MB/s)   70.5          131.9           1.9
CPU – Encryption (MB/s)    52.3          69.7            1.3
Memory Bandwidth (MB/s)    4,041         3,360           0.8
Now we can really see the advantage of the latest processors – on a core-for-core basis, they are 1.3-1.9x faster on the CPU-intensive WinSAT benchmarks, despite having lower clock frequencies.
Good, now on to the next… wait a second. Look at that memory bandwidth result. Our new machines have less memory bandwidth than the old machines? That doesn't look right: although memory performance hasn't been keeping pace with CPU speeds, it has been improving over time. Compared to a three-year-old machine, we'd expect these new machines to have slightly better memory bandwidth, and definitely not worse. What gives?
Memory Channels
A primary limiting factor to memory bandwidth is the number of memory channels that are in use. And this turns out to be the problem here: although the new machines have four memory channels and eight memory slots, only two of those slots are filled, because the vendor supplied us with two 2 GB memory modules per machine. This maximizes future expansion potential – we can take the machine up to 16 GB without throwing away any of our initial investment in memory. But in the meantime using two memory slots limits us to two memory channels in use. If instead we had four 1 GB memory modules we'd have four memory channels in use, improving memory interleaving from 2:1 to 4:1 and increasing memory bandwidth. To confirm this, we populated four memory slots on one of the new machines (going from 4 GB to 8 GB) and reran WinSAT:
Benchmark                 2 channels   4 channels   Speedup
Memory Bandwidth (MB/s)   3,360        4,134        1.2
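A back-of-envelope model shows why channel count matters so much. Peak bandwidth scales linearly with populated channels; the sketch below assumes DDR2-667 modules (an assumption about these lab machines, not a fact from the post), and real sustained bandwidth falls well short of these theoretical peaks:

```python
# Peak-bandwidth model: channels x transfer rate x bytes per transfer.
# DDR2-667 (667 MT/s, 8-byte bus) is assumed here for illustration only.
mt_per_second = 667e6
bytes_per_transfer = 8
for channels in (2, 4):
    peak_mb_s = channels * mt_per_second * bytes_per_transfer / 1e6
    print(f"{channels} channels: ~{peak_mb_s:,.0f} MB/s theoretical peak")
# The measured WinSAT figures (3,360 vs 4,134 MB/s) sit far below these
# peaks, but they move in the same direction as the channel count.
```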
Conclusions
It's always possible to run more experiments to further isolate and explain benchmark results, but after a while you reach a point of diminishing returns. With the results we have so far, we can already draw some useful conclusions.
The first conclusion is that our naive explanations greatly underestimated just how much better the newer processors are at executing real benchmarks, despite their slower clock speeds. The results from WinSAT and SPEC clearly show this, with core-to-core performance that is 1.3-1.9x faster on the new machines, depending on the benchmark.
This is perhaps the most important lesson for developers to learn: clock speeds are no longer a good indicator of true performance. Although clock speeds have plateaued, processor designers continue to find ways to make each new generation significantly faster than the last. In our case, the old machines have Pentium D processors (“Smithfield”), while the new machines have Xeon 5-series processors (“Clovertown”). And while the newer processors have slightly slower clock speeds, their micro-architecture executes more instructions per clock cycle.
The second conclusion is that it's very hard to perform fair comparisons. The two machines have several configuration differences, including clock frequency, number of cores, core micro-architecture, cache sizes, bus speed, memory size and speed, and so on. We showed an example of isolating the effect of just one of these differences, the number of cores, using the BCDEdit tool. Isolating the effect of every single difference would require much more effort.
Indeed, some of these differences are interrelated, and it is hard to change one without affecting another. For example, CPU architects make their micro-architecture design decisions based on cache sizes. Now imagine a hypothetical experiment that tried to isolate the effect of L2 cache size by giving each core just 1 MB of cache. This would be especially hard on the newer processors, which have been designed on the assumption that they have 2 MB of L2 cache per core[1]. In trying to perform a fairer comparison, we would have actually handicapped one system!
Our final conclusion is that it truly pays to benchmark and compare systems. In our case, the simplest possible benchmark (WinSAT) showed an unexpected memory bandwidth loss, which we then traced back to a machine mis-configuration. So that was the final pleasant surprise: if we hadn't gotten curious about why the new machines were so much faster, we would never have found that they could be faster still!
David Berg
Sunny Egbo
Jonathan Hardwick
Peter Okonski
--------------------------------------------------------------------------------
[1] Because two cores share a single 4 MB L2 cache on the Clovertown processors, the exact size of the cache that is used by each core is not fixed at 2 MB per core; the use will vary during program execution. Cache hungry threads might get more of the cache, while less cache hungry threads get less. Even when two cache hungry threads run on the two cores, their memory hotspots are asynchronous; thus, the net effect is that each thread gets more of the cache when they need it and less when they don’t need it.
MarkBFriedman said:
Antonio:
It looks like we should have been a little clearer about what we meant when we used the word "thread." Sorry about that. (Reminds me of the famous words of a former US President and semiotician, "It depends on what the meaning of the word 'is' is.")
From the standpoint of the OS, the thread is the dispatchable unit. From the standpoint of the CPU, a thread is any set of executing instructions that aren't executing an Idle loop. There are software and hardware guys collaborating on this post. (It may be unusual, but they do get along -- most days.) And while we knew what we meant, it seems we used "thread" without clearly distinguishing the two meanings and contexts.
Knowing the author and his tendencies, my guess is that the footnote about threads sharing the cache was written from the hardware perspective.
On the new lab machines, there is a dedicated L1 cache for each processor core, and a shared L2 cache that each processor core can access. The L2 cache is dynamically allocated. If CPU A is idle and CPU B is cranking, CPU B is capable of allocating the entire L2 cache. (If you don't expect the CPUs sharing the cache to all be cranking all of the time, this is probably a good approach.) I hope that clarifies the point.
Of course, I am a software guy, so, from the Windows point of view, let me also try to clarify your thread dispatching question:
It is true that "thread x is not guaranteed and is not going to run always on the first core of the processor." Having said that, however, the statement that follows isn't entirely true: "So if you have two concurrent threads, they are probably running both on both (or all four) cores."
Yes and no. On a symmetric multiprocessor (SMP), a thread by default tends to be a bit sticky to the processor it was last dispatched on. This is called "soft affinity" and is done to increase the probability that a cache warm start will occur. This stickiness is especially noticeable when the processors are lightly loaded. The stickiness is also prominent in the WinSAT benchmarks described here, which run single-threaded and were run in isolation.
But, in general, you are correct and you often observe threads switching back and forth between available processors. Because thread scheduling is priority-based with preemptive scheduling, and User mode threads are typically subject to dynamic adjustments, once the machine is loaded, threads will usually wander (somewhat randomly) from CPU to CPU.
The SMP soft affinity scheduling algorithm is roughly as follows: A waiting thread that transitions to the Ready state has an "ideal processor" where it will run if that processor is currently idle or running a lower priority thread. If the ideal (i.e., last) processor is busy or running a higher priority thread, but another processor is idle or running a lower priority thread, the ready thread will be scheduled there. This is the preemptive scheduling bit -- the highest priority Ready threads are always dispatched.
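In rough pseudocode, that rule looks like this (an illustrative sketch only, not actual Windows scheduler code):

```python
def pick_cpu(ready_thread, cpus):
    """Illustrative sketch of the SMP soft-affinity rule described above."""
    def preemptible(cpu):
        # A CPU can take the thread if it's idle or running lower-priority work.
        return cpu.is_idle or cpu.current_priority < ready_thread.priority

    ideal = cpus[ready_thread.ideal_cpu]  # typically where the thread last ran
    if preemptible(ideal):
        return ideal                      # soft affinity: prefer the ideal CPU
    for cpu in cpus:
        if preemptible(cpu):
            return cpu                    # otherwise run anywhere preemptible
    return None                           # stay Ready until a CPU frees up
```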
You will find more details in my Windows Server 2003 Resource Kit "Performance Guide" book: the priority scheme, hard processor affinity, etc. The ntttcp program discussed in my recent "Mainstream NUMA and the TCP/IP stack" post used hard processor affinity, for example, to ensure that all network processing was confined to a single CPU. Which was why I only showed what was happening on the one CPU. Hard affinity is the exception, though, not the rule.
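Hard affinity, by contrast, is something an application requests explicitly. Here is a minimal Windows-only sketch using the Win32 SetThreadAffinityMask API, one way to confine a thread to a single CPU in the manner described above:

```python
import ctypes

# Pin the calling thread to CPU 0 via Win32 SetThreadAffinityMask.
# GetCurrentThread returns a pseudo-handle (no CloseHandle needed);
# the call returns the previous affinity mask, or 0 on failure.
kernel32 = ctypes.windll.kernel32
thread = kernel32.GetCurrentThread()
previous_mask = kernel32.SetThreadAffinityMask(thread, 1)  # bit 0 = CPU 0
if previous_mask == 0:
    raise ctypes.WinError()
```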
Once you move to a NUMA architecture -- see my earlier blog posts on this subject, like it or not, NUMA is in your future if you are running server-class machines -- the thread scheduling scheme gets node-oriented. (Physical memory allocations are also node-oriented on NUMA machines.) Once scheduled on a node, a thread is likely to continue to be scheduled to run on one of that node's CPUs. (Subject to availability, similar to the SMP case.)
This NUMA node-oriented soft affinity scheme works pretty well when the L2 cache is a resource that is shared by all the processor cores on the socket. In today's multi-core machines, so long as the thread is re-dispatched to a CPU on the same socket (or node) where it last ran, the thread will likely benefit from an L2 cache warm start. But for an L1 cache warm start, the thread still has to be dispatched on its ideal (and still preferred) processor since that resource is dedicated.
This description of the behavior of the Windows Scheduler is also worth mentioning here because of its significance to my earlier "Mainstream NUMA and the TCP/IP stack" posting. In the next part of the "Mainstream NUMA" post, which I hope to have ready in another week or so, I will try to make this connection explicit.
So, thanks for keeping us honest and stay tuned!
-- Mark