`
zhangxiong0301
  • 浏览: 361602 次
社区版块
存档分类
最新评论

Transparent huge pages in 2.6.38(转载)

阅读更多

The memory management unit in almost any contemporary processor can handle multiple page sizes, but the Linux kernel almost always restricts itself to just the smallest of those sizes - 4096 bytes on most architectures. Pages which are larger than that minimum - collectively called "huge pages" - can offer better performance for some workloads, but that performance benefit has gone mostly unexploited on Linux. That may change in 2.6.38, though, with the merging of the transparent huge page feature.

Huge pages can improve performance through reduced page faults (a single fault brings in a large chunk of memory at once) and by reducing the cost of virtual to physical address translation (fewer levels of page tables must be traversed to get to the physical address). But the real advantage comes from avoiding translations altogether. If the processor must translate a virtual address, it must go through as many as four levels of page tables, each of which has a good chance of being cache-cold, and, thus, slow. For this reason, processors maintain a "translation lookaside buffer" (TLB) to cache the results of translations. The TLB is often quite small; running cpuid on your editor's aging desktop machine yields:

 

   cache and TLB information (2):
      0xb1: instruction TLB: 2M/4M, 4-way, 4/8 entries
      0xb0: instruction TLB: 4K, 4-way, 128 entries
      0x05: data TLB: 4M pages, 4-way, 32 entries

So there is room for 128 instruction translations, and 32 data translations. Such a small cache is easily overrun, forcing the CPU to perform large numbers of address translations. A single 2MB huge page requires a single TLB entry; the same memory, in 4KB pages, would need 512 TLB entries. Given that, it's not surprising that the use of huge pages can make programs run faster.

The main kernel address space is mapped with huge pages, reducing TLB pressure from kernel code. The only way for user-space to take advantage of huge pages in current kernels, though, is through the hugetlbfs, which was extensively documented here in early 2010. Using hugetlbfs requires significant work from both application developers and system administrators; huge pages must be set aside at boot time, and applications must map them explicitly. The process is fiddly enough that use of hugetlbfs is restricted to those who really care and who have the time to mess with it. Hugetlbfs is often seen as a feature for large, proprietary database management systems and little else.

There would be real value in a mechanism which would make the use of huge pages easy, preferably requiring no development or administrative attention at all. That is the goal of the transparent huge pages (THP) patch, which was written by Andrea Arcangeli and merged for 2.6.38. In short, THP tries to make huge pages "just happen" in situations where they would be useful.

Current Linux kernels assume that all pages found within a given virtual memory area (VMA) will be the same size. To make THP work, Andrea had to start by getting rid of that assumption; thus, much of the initial part of the patch series is dedicated to enabling mixed page sizes within a VMA. Then the patch modifies the page fault handler in a simple way: when a fault happens, the kernel will attempt to allocate a huge page to satisfy it. Should the allocation succeed, the huge page will be filled, any existing small pages in the new page's address range will be released, and the huge page will be inserted into the VMA. If no huge pages are available, the kernel falls back to small pages and the application never knows the difference.

This scheme will increase the use of huge pages transparently, but it does not yet solve the whole problem. Huge pages must be swappable, lest the system run out of memory in a hurry. Rather than complicate the swapping code with an understanding of huge pages, Andrea simply splits a huge page back into its component small pages if that page needs to be reclaimed. Many other operations (mprotect()mlock(), ...) will also result in the splitting of a page.

The allocation of huge pages depends on the availability of large, physically-contiguous chunks of memory - something which Linux kernel programmers can never count on. It is to be expected that those pages will become available at inconvenient times - just after a process has faulted in a number of small pages, for example. The THP patch tries to improve this situation through the addition of a "khugepaged" kernel thread. That thread will occasionally attempt to allocate a huge page; if it succeeds, it will scan through memory looking for a place where that huge page can be substituted for a bunch of smaller pages. Thus, available huge pages should be quickly placed into service, maximizing the use of huge pages in the system as a whole.

The current patch only works with anonymous pages; the work to integrate huge pages with the page cache has not yet been done. It also only handles one huge page size (2MB). Even so, some useful performance improvements can be seen. Mel Gorman ran some benchmarks showing improvements of up to 10% or so in some situations. In general, the results were not as good as could be obtained with hugetlbfs, but THP is much more likely to actually be used.

No application changes need to be made to take advantage of THP, but interested application developers can try to optimize their use of it. A call tomadvise() with the MADV_HUGEPAGE flag will mark a memory range as being especially suited to huge pages, while MADV_NOHUGEPAGE will suggest that huge pages are better used elsewhere. For applications that want to use huge pages, use of posix_memalign() can help to ensure that large allocations are aligned to huge page (2MB) boundaries.

System administrators have a number of knobs that they can tweak, all found under /sys/kernel/mm/transparent_hugepage. The enabled value can be set to "always" (to always use THP), "madvise" (to use huge pages only in VMAs marked with MADV_HUGEPAGE), or "never" (to disable the feature). Another knob, defrag, takes the same values; it controls whether the kernel should make aggressive use of memory compaction to make more huge pages available. There's also a whole set of parameters controlling the operation of the khugepaged thread; see Documentation/vm/transhuge.txt for all the details.

The THP patch has had a bit of a rough ride since being merged into the mainline. This code never appeared in linux-next, so it surprised some architecture maintainers when it caused build failures in the mainline. Some bugs have also been found - unsurprising for a patch which is this large and which affects so much core code. Those problems are being ironed out, so, while 2.6.38-rc1 testers might want to be careful, THP should be in a usable state by the time the final 2.6.38 kernel is released.

分享到:
评论

相关推荐

    基于Tiny6410上移植_kernel-2.6.38

    在本文中,我们将深入探讨如何在Tiny6410开发板上移植Linux内核2.6.38。Tiny6410是一款基于ARM Cortex-A8处理器的嵌入式开发板,具有2GB NAND Flash存储和256MB RAM。移植内核到这样的硬件平台涉及多个步骤,包括...

    linux内核2.6.38 正品行货

    Linux内核2.6.38是Linux操作系统的核心部分,负责管理系统的硬件资源,调度进程,实现硬件抽象层,提供系统调用接口等关键功能。对于开发者而言,它是一个极其重要的基础,尤其对于那些需要定制化操作系统的专业人员...

    Mini6410 kernel-patch Linux-2.6.38

    Mini6410 Kernel-patch Linux-2.6.38

    openswan-2.6.38

    openswan-2.6.38源码,搬运自openswan.org openswan-2.6.38.tar.gz

    cs8900移植 kernel2.6.38

    ### CS8900网卡驱动在Linux Kernel 2.6.38中的移植与调试 #### 一、概述 本文旨在详细介绍如何将原本适用于Linux Kernel 2.6.27及以下版本的CS8900网卡驱动移植到更高版本的Linux Kernel 2.6.38上。由于内核版本的...

    ubuntu下编译linux kernel 2.6.38

    在Ubuntu 10.10环境下编译Linux内核2.6.38涉及一系列步骤,需要对操作系统、内核版本、编译工具和配置选项有深入理解。以下是详细的编译流程和注意事项: 首先,确保你的环境是Pentium 4架构的Ubuntu 10.10(内核...

    linux2.6.38wifi的初步分析

    对2.6.38内核中wifi驱动中出现的SD卡热插拔,wifi驱动进行了一个初步的分析.有什么不对的地方,还希望高手指点!

    mini6410_2.6.38内核_uart1_platform_device驱动

    原创的友善之臂的mini6410 linux-2.6.38内核的uart1 串口驱动,使用platform_device方式,压缩包里面有驱动源代码、编译好了的ko文件、使用说明文档、用户例程的源代码和可执行程序,但是并没有给出直接编译驱动和...

    ubuntu编译Tiny6410内核linux-2.6.38所需mkimage工具

    ubuntu编译ARM内核linux-2.6.38所需的工具

    linux2.6.38内核驱动常用函数.pdf

    Linux 2.6.38内核是Linux操作系统的一个版本,它包含了多种驱动开发所需的函数和API。在开发基于Linux内核的设备驱动程序时,熟练使用这些内核提供的函数是至关重要的。下面将详细介绍文档中提及的一些内核驱动常用...

    Linux简单的字符驱动例子,改自宋宝华的书,移植2.6.38成功

    Linux简单的字符驱动例子,改自宋宝华的书,移植2.6.38成功 使用make modules编译,生成的ko放到tiny6410开发板,可以insmod和rmmod 此例是移植到了2.6.38,书上原有的例子不适合2.6.36以上。因为ioctl被改名了

    基于 linux-2.6.38 内核的嵌入式驱动常用的函数调用

    ### 基于 Linux-2.6.38 内核的嵌入式驱动常用函数调用解析 #### 一、`copy_from_user()` **函数定义:** ```c static inline unsigned long copy_from_user(void __user *to, const void *from, unsigned long n)...

    可以引导linux2.6.38以上内核的uboot

    TQ2440的u_boot不能引导2.6.36以上的内核,用TQ2440的u_boot修改的,可以引导linux2.6.36以上的内核,同时兼容linux2.6.37以下老版本内核 用法: #make distclean #make Wamy2440_config #make ...

    CS8900A移植到linux-2.6.38和linux-2.6.35内核

    在本文中,我们将深入探讨如何将CS8900A网络控制器从Linux 2.6.35内核移植到2.6.38内核的过程。CS8900A是一款广泛使用的以太网控制器,尤其适用于嵌入式系统。在基于FS2410的开发板上进行移植工作,我们需要理解内核...

    Linux-2.6.38.4

    最新的Linux2.6操作系统,Linux-2.6.38.4,加入了新的功能,下载体验吧

    linux2.6.38.8 nand 与yaffs2 移植

    ### Linux 2.6.38.8 NAND 与 YAFFS2 移植详解 #### 前言 在嵌入式系统开发过程中,针对特定硬件平台进行内核及文件系统的移植是一项基本且重要的任务。本文将详细介绍如何在 Linux 2.6.38.8 内核上移植 NAND Flash...

    Linux-2.6.38驱动的几个结构体关系总结

    在Linux 2.6.38版本中,驱动程序设计涉及到多个关键的结构体,包括`struct file_operations`, `struct inode`, 和 `struct file`。下面是对这些结构体及其关系的详细解释。 1. **struct file_operations** 这个...

    Linux kernel-2.6.38.1-i686.cfg

    Linux kernel-2.6.38.1-i686.cfg

    linux-2.6.38-tiny6410:tiny6410开发板,Linux内核源码学习

    在"linux-2.6.38-tiny6410"这个项目中,我们将关注的是Linux内核的一个特定版本,针对tiny6410开发板进行了优化和适配。 tiny6410开发板是基于三星S5PV210处理器的嵌入式平台,常用于教学、实验和产品原型设计。S5...

    用于虚拟化的动态内存平衡

    Linux内核从2.6.38版本开始支持透明巨大页面(Transparent Hugepage),这是一种不需用户干预,就能在系统中自动使用的巨大页面。在论文中,作者分析了透明巨大页面对系统性能的影响,并指出其在效率上存在不足。...

Global site tag (gtag.js) - Google Analytics