False Sharing
Stephen Toub

Unless you've been living under a rock, you've likely heard about the "manycore shift." Processor manufacturers such as Intel and AMD are increasing the power of their hardware by scaling out the number of cores on a processor rather than by attempting to continue to provide exponential increases in clock speed. This shift demands that software developers start to write all of their applications with concurrency in mind in order to benefit from these significant increases in computing power.

To address this issue, a multitude of concurrency libraries and languages are beginning to emerge, including Parallel Extensions to the Microsoft .NET Framework, the Parallel Pattern Library (PPL), the Concurrency & Coordination Runtime (CCR), Intel's Threading Building Blocks (TBB), and others. These libraries all aim to decrease the amount of boilerplate code necessary to write efficient parallel applications by providing constructs such as Parallel.For and AsParallel, as in the sketch below.
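As a rough illustration of what those constructs look like in use (our example, not the article's; the workload is made up and only the two constructs are the point):

```csharp
// Illustrative sketch of the constructs named above: Parallel.For from the
// Task Parallel Library and PLINQ's AsParallel. The data is arbitrary.
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelSketch
{
    static void Main()
    {
        int[] data = Enumerable.Range(1, 1000000).ToArray();

        // Parallel.For partitions the index range across the available cores,
        // replacing the hand-rolled threading code this loop would otherwise need.
        long[] squares = new long[data.Length];
        Parallel.For(0, data.Length, i =>
        {
            squares[i] = (long)data[i] * data[i];
        });

        // AsParallel turns an ordinary LINQ query into a parallel (PLINQ) query.
        long sum = data.AsParallel().Sum(x => (long)x * x);

        Console.WriteLine(sum);
    }
}
```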
Unfortunately, while these constructs represent a monumental step forward in expressing parallelism, they don't obviate the need for developers to be aware of what their code is doing, how it's structured, and how hardware can have a significant impact on the performance of the application. While the software industry is making strides in developing new programming models for concurrency, there does not seem to be a programming model on the horizon that would magically eliminate all concurrency-related issues. At least in the near term, understanding how memory and caches work will be important in order to write efficient parallel programs.

The concept of knowing what's happening at the lower levels of an application is, of course, not new. To achieve optimal performance, developers need to have a good understanding of how things like memory accesses affect the performance of an application. When we talk about reading and writing from memory, we typically gloss over the fact that, in modern hardware, it's rare to read from and write to the machine's memory banks directly. Memory access is slow: orders of magnitude slower than mathematical calculations, though orders of magnitude faster than accessing hard disks and network resources. To account for this slow memory access, most processors today use memory caches to improve application performance.

Caches come in multiple levels, with most consumer machines today having at least two levels, referred to as L1 and L2, and some having more than that. L1 is the fastest, but it's also the most expensive, so machines will typically have a small amount of it. L2 is a bit slower, but is less expensive, so machines will have more of it. (The laptop on which we're writing this column has 128KB of L1 cache.)

When data is read from memory, the requested data as well as data around it (referred to as a cache line) is loaded from memory into the caches, and the program is then served from the caches. On our laptop, the cache line size for both L1 and L2 is 64 bytes. This loading of a whole cache line rather than individual bytes can dramatically improve application performance. Since applications frequently read bytes sequentially in memory (common when accessing arrays and the like), applications can avoid hitting main memory on every request by loading a series of data in a cache line, since it's likely that the data about to be read has already been loaded into the cache. However, this does mean that a developer needs to be cognizant of how the application accesses memory in order to take the greatest advantage of the cache.

Consider the C# program in Figure 1. We've created a two-dimensional array to which the application then writes in two different ways. In the first code segment (labeled Faster in the comments), the application loops over each row, and within each row it loops over each column. In the second code segment (labeled Slower), the application loops over each column, and within each column it loops over each row.

Figure 1 Memory Access Patterns Are Important
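Only a fragment of the original listing survives in this copy, so the following is a minimal reconstruction from the description above; the array dimensions and the timing scaffolding are illustrative choices rather than the article's exact code:

```csharp
//C#
using System;
using System.Diagnostics;

class Program
{
    // Illustrative size; large enough that the array far exceeds the L2 cache.
    const int Size = 5000;

    static void Main()
    {
        int[,] matrix = new int[Size, Size];

        // Faster: loop over each row, and within each row over each column.
        // C# stores two-dimensional arrays row by row, so these writes are
        // sequential in memory and mostly land in cache lines already loaded.
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Size; i++)
        {
            for (int j = 0; j < Size; j++)
            {
                matrix[i, j] = 1;
            }
        }
        Console.WriteLine("Faster: {0}", sw.Elapsed);

        // Slower: loop over each column, and within each column over each row.
        // Consecutive writes land a full row apart, so almost every write
        // touches a different cache line.
        sw = Stopwatch.StartNew();
        for (int j = 0; j < Size; j++)
        {
            for (int i = 0; i < Size; i++)
            {
                matrix[i, j] = 1;
            }
        }
        Console.WriteLine("Slower: {0}", sw.Elapsed);
    }
}
```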
The labels in this case give away the punch line, but on the laptop we tested this on, the Faster version runs almost twice as fast as the Slower version, even though the only difference in code between the two is that the order of the for loops has been swapped. In the Slower version, consecutive writes land a full row apart in memory, so nearly every write touches a different cache line; in the Faster version, most writes fall into a cache line that has already been loaded.

You can see this directly in a profiler: choose L2 Misses from the Memory Events category or L2 Lines In from the Platform Events > L2 Cache category. (On some processors, L2 Misses will be marked as unsupported.) The reported numbers will be based on profiling samples taken every 1M L2 miss events. This means that when sorted by exclusive samples, the functions causing most L2 misses will be on top. To profile for L2 misses and CPI in instrumentation mode, start with an instrumentation-based performance session.