Things that happen during PostgreSQL startup, part 7: initializing shared memory and semaphores, section 5: initializing multixact in shmem

BeiGang


Category: PostgreSQL internals

Tags: PostgreSQL, transactions, shared memory, hash table, index, multi-transaction

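After pg initializes shmem and adds the index "ShmemIndex" to it, it goes on to initialize xlog in shmem. It then initializes clog, subtrans, twophase and multixact, in that order. I am writing these up in the order clog, subtrans, multixact, twophase. twophase comes after multixact because the first three use the same algorithm and data structures, and covering them back to back makes them easier to remember and to group together. I originally wanted to cover the initialization of clog, subtrans and multixact in a single article, but it got too long, so I split it up; these articles are best read together.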

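The pg multixact log manager is a manager similar to the pg commit log (clog) manager; it stores an array of transaction IDs for each MultiXactId. It is a fundamental part of the shared-row-lock implementation. A share-locked tuple stores a MultiXactId in its Xmax field, and a transaction that needs to wait for the tuple to be unlocked can sleep on the possibly several transaction IDs that are members of that MultiXactId.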

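Pg uses two sets of SLRU structures. One stores offsets; each offset is the position at which the data for one MultiXactId starts in the other set of SLRU structures. This design lets pg store variable-length arrays of transaction IDs.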
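The two-area layout can be illustrated with a small in-memory model. This is only a sketch: the array sizes and the toy_ names are invented for illustration, IDs start at 0 for simplicity, and none of it reflects the real on-disk page format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t MultiXactId;      /* index into the offsets area */
typedef uint32_t MultiXactOffset;  /* index into the members area */
typedef uint32_t TransactionId;

#define TOY_MAX_MULTIS   16   /* hypothetical capacity of the offsets area */
#define TOY_MAX_MEMBERS  64   /* hypothetical capacity of the members area */

/* toy_offsets[m] is where the member array of MultiXactId m starts. */
static MultiXactOffset toy_offsets[TOY_MAX_MULTIS + 1];
static TransactionId   toy_members[TOY_MAX_MEMBERS];
static MultiXactId     toy_next_multi = 0;
static MultiXactOffset toy_next_offset = 0;

/* Store a variable-length array of xids and return its new MultiXactId. */
static MultiXactId
toy_create_multixact(const TransactionId *xids, size_t nxids)
{
    assert(toy_next_multi < TOY_MAX_MULTIS);
    assert(toy_next_offset + nxids <= TOY_MAX_MEMBERS);

    MultiXactId multi = toy_next_multi++;

    toy_offsets[multi] = toy_next_offset;
    for (size_t i = 0; i < nxids; i++)
        toy_members[toy_next_offset++] = xids[i];

    /* The start of the next entry doubles as the end of this one. */
    toy_offsets[multi + 1] = toy_next_offset;
    return multi;
}

/* Point *xids at the first member and return the number of members. */
static size_t
toy_get_members(MultiXactId multi, const TransactionId **xids)
{
    assert(multi < toy_next_multi);
    *xids = &toy_members[toy_offsets[multi]];
    return toy_offsets[multi + 1] - toy_offsets[multi];
}
```

Because each entry only records where its members begin, entries of different lengths can sit back to back in the members area without any per-entry size limit.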

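Relationship with XLOG: the MultiXact module generates an XLOG record whenever a new offsets or members page is initialized to zeroes, and another XLOG record whenever a new MultiXactId is defined. This lets pg completely rebuild the data entered since the last checkpoint during XLOG replay. Because of this, pg does not have to follow the general rule of "write WAL before data"; the only guarantee needed for correctness is that all dirty OFFSET and MEMBER pages (the pages of the two sets of SLRU structures mentioned above) are flushed and synced to disk before a checkpoint is considered complete. If a page does reach disk ahead of its corresponding WAL records, it will be forcibly zeroed before use anyway. Therefore pg does not need to mark in-memory pages with LSN information; there is already enough synchronization.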

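Like the transaction commit log (CLOG), but unlike subtransactions (subtrans), pg must preserve state across a crash and crash recovery, and must ensure that MultiXactId and offset numbering increases monotonically across a crash and recovery. Pg guarantees this the same way it does for transaction IDs: the WAL records are guaranteed to contain evidence of every MXID we could need to worry about, so we only have to make sure that, at the end of replay, the next-MXID and next-offset counters are at least as large as the largest values seen during replay.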
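The "at least as large as anything seen during replay" rule amounts to a pair of counters that only ever move forward. The sketch below is a minimal illustration: the ToyCounters type and the toy_ functions are hypothetical names, and the wraparound-aware comparison (treating the 32-bit difference as signed) is an assumption about how "larger" is decided for counters that can wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

/* Hypothetical container for the two counters that must survive a crash. */
typedef struct ToyCounters
{
    MultiXactId     next_mxact;
    MultiXactOffset next_offset;
} ToyCounters;

/* Wraparound-aware "a is older than b" for 32-bit counters (assumption). */
static bool
toy_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Called for each value seen during replay: a counter is raised to the
 * given minimum if it is behind, and is never moved backwards.
 */
static void
toy_advance(ToyCounters *c, MultiXactId min_mxact, MultiXactOffset min_offset)
{
    if (toy_precedes(c->next_mxact, min_mxact))
        c->next_mxact = min_mxact;
    if (toy_precedes(c->next_offset, min_offset))
        c->next_offset = min_offset;
}
```

Calling toy_advance() for every replayed record leaves both counters at or beyond the largest value encountered, which is all the monotonicity requirement asks for.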

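That was an overview of MultiXact; now let's look at the function call flow.

1 First, a diagram outlining the function call flow, with some details omitted

Figure: call flow for initializing MultiXact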

2 Initializing the MultiXact-related structures

Following the chain main() -> … -> PostmasterMain() -> … -> reset_shared() -> CreateSharedMemoryAndSemaphores() -> … -> MultiXactShmemInit(), pg initializes the MultiXact data structures MultiXactOffsetCtl, MultiXactMemberCtl, MultiXactState and so on. They are used to manage and cache in memory the MultiXact log files (the files stored in the "data/pg_multixact/offsets" and "data/pg_multixact/members" directories).

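MultiXactShmemInit() -> SimpleLruInit() -> ShmemInitStruct(): ShmemInitStruct() calls hash_search() to look up "MultiXactOffset Ctl" in the hash table index "ShmemIndex". If it is not found, a HashElement and a ShmemIndexEnt (the entry) are allocated for "MultiXactOffset Ctl" in ShmemIndex, and "MultiXactOffset Ctl" is written into the entry. Back in ShmemInitStruct(), ShmemAlloc() is called to allocate space in shared memory for the "MultiXactOffset Ctl" structures (see the "MultiXact structures" diagram below); the location member of the entry (here, the ShmemIndexEnt variable) is set to point to that space, and its size member records the size of that space. On the way back to MultiXactShmemInit(), the shared member of MultiXactOffsetCtlData (the static SlruCtlData variable whose address the macro MultiXactOffsetCtl expands to) is set to the start address of the memory allocated in shmem for "MultiXactOffset Ctl", and the other members of the SlruCtlData structure are filled in.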

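Next comes MultiXactShmemInit() -> SimpleLruInit() -> ShmemInitStruct() again: ShmemInitStruct() calls hash_search() to look up "MultiXactMember Ctl" in the hash table index "ShmemIndex". If it is not found, a HashElement and a ShmemIndexEnt (the entry) are allocated for "MultiXactMember Ctl" in ShmemIndex, and "MultiXactMember Ctl" is written into the entry. Back in ShmemInitStruct(), ShmemAlloc() is called to allocate space in shared memory for the "MultiXactMember Ctl" structures (see the "MultiXact structures" diagram below); the location member of the entry (here, the ShmemIndexEnt variable) is set to point to that space, and its size member records the size of that space. On the way back to MultiXactShmemInit(), the shared member of MultiXactMemberCtlData (the static SlruCtlData variable whose address the macro MultiXactMemberCtl expands to) is set to the start address of the memory allocated in shmem for "MultiXactMember Ctl", and the other members of the SlruCtlData structure are filled in.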

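Then ShmemInitStruct() is called directly. It calls hash_search() to look up "Shared MultiXact State" in the hash table index "ShmemIndex". If it is not found, a HashElement and a ShmemIndexEnt (the entry) are allocated for "Shared MultiXact State" in ShmemIndex, and "Shared MultiXact State" is written into the entry. Back in ShmemInitStruct(), ShmemAlloc() is called to allocate space in shared memory for the "Shared MultiXact State" structures (see the "MultiXact structures" diagram below); the location member of the entry (here, the ShmemIndexEnt variable) is set to point to that space, and its size member records the size of that space. Control finally returns to MultiXactShmemInit(), which makes MultiXactState, a static global variable of type MultiXactStateData *, point to the MultiXactStateData instance whose start address is the start of the memory allocated in shmem for "Shared MultiXact State", and sets the members of that MultiXactStateData structure.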
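The same "look up by name, allocate on first use, record location and size" pattern repeats for all three structures. A minimal sketch of that pattern follows. It is not the PostgreSQL implementation: the toy_ names, the fixed-size arena and the linear search (standing in for the hash table lookup) are all simplifications made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_ARENA_SIZE  4096   /* hypothetical size of the "shared" arena */
#define TOY_MAX_ENTRIES 8
#define TOY_KEY_SIZE    48

/* Stand-in for an index entry: name, where the struct lives, how big it is. */
typedef struct ToyIndexEnt
{
    char    key[TOY_KEY_SIZE];
    void   *location;
    size_t  size;
} ToyIndexEnt;

/* The union forces suitable alignment for the byte arena. */
static union { long long ll; double d; void *p; char bytes[TOY_ARENA_SIZE]; } toy_arena;
static size_t      toy_arena_used = 0;
static ToyIndexEnt toy_index[TOY_MAX_ENTRIES];
static int         toy_index_count = 0;

/* Bump allocator over the arena; sizes are rounded up to 8 bytes. */
static void *
toy_alloc(size_t size)
{
    size_t rounded = (size + 7) & ~(size_t) 7;

    assert(toy_arena_used + rounded <= TOY_ARENA_SIZE);
    void *p = toy_arena.bytes + toy_arena_used;
    toy_arena_used += rounded;
    return p;
}

/* Return the named structure, allocating and indexing it on first use. */
static void *
toy_init_struct(const char *name, size_t size, bool *found)
{
    for (int i = 0; i < toy_index_count; i++)
    {
        if (strcmp(toy_index[i].key, name) == 0)
        {
            assert(toy_index[i].size == size);
            *found = true;
            return toy_index[i].location;
        }
    }

    assert(toy_index_count < TOY_MAX_ENTRIES);
    assert(strlen(name) < TOY_KEY_SIZE);

    ToyIndexEnt *ent = &toy_index[toy_index_count++];

    strcpy(ent->key, name);
    ent->location = toy_alloc(size);
    ent->size = size;
    *found = false;
    return ent->location;
}
```

The found flag tells the caller whether it is the first to see the structure and therefore responsible for initializing its contents, or whether it is attaching to something that already exists.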

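The related variable and structure definitions, and the diagram of the data structures after initialization, are shown below.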

static MT_LOCAL SlruCtlData MultiXactOffsetCtlData;
static MT_LOCAL SlruCtlData MultiXactMemberCtlData;

#define MultiXactOffsetCtl (&MultiXactOffsetCtlData)
#define MultiXactMemberCtl (&MultiXactMemberCtlData)

typedef struct SlruCtlData
{
    SlruShared  shared;

    /*
     * This flag tells whether to fsync writes (true for pg_clog, false for
     * pg_subtrans).
     */
    bool        do_fsync;

    /*
     * Decide which of two page numbers is "older" for truncation purposes. We
     * need to use comparison of TransactionIds here in order to do the right
     * thing with wraparound XID arithmetic.
     */
    bool        (*PagePrecedes) (int, int);

    /*
     * Dir is set during SimpleLruInit and does not change thereafter. Since
     * it's always the same, it doesn't need to be in shared memory.
     */
    char        Dir[64];
} SlruCtlData;

typedef SlruCtlData *SlruCtl;

/*
 * Shared-memory state
 */
typedef struct SlruSharedData
{
    LWLockId    ControlLock;

    /* Number of buffers managed by this SLRU structure */
    int         num_slots;

    /*
     * Arrays holding info for each buffer slot. Page number is undefined
     * when status is EMPTY, as is page_lru_count.
     */
    char      **page_buffer;
    SlruPageStatus *page_status;
    bool       *page_dirty;
    int        *page_number;
    int        *page_lru_count;
    LWLockId   *buffer_locks;

    /*----------
     * We mark a page "most recently used" by setting
     *      page_lru_count[slotno] = ++cur_lru_count;
     * The oldest page is therefore the one with the highest value of
     *      cur_lru_count - page_lru_count[slotno]
     * The counts will eventually wrap around, but this calculation still
     * works as long as no page's age exceeds INT_MAX counts.
     *----------
     */
    int         cur_lru_count;

    /*
     * latest_page_number is the page number of the current end of the log;
     * this is not critical data, since we use it only to avoid swapping out
     * the latest page.
     */
    int         latest_page_number;
} SlruSharedData;

typedef SlruSharedData *SlruShared;
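The LRU bookkeeping described in the comment on cur_lru_count can be tried out in isolation. The sketch below uses a hypothetical ToySlru type with a fixed number of slots; it ignores page status, locking and counter wraparound, and only shows how the "age" calculation picks the least recently used slot.

```c
#include <assert.h>

#define TOY_NUM_SLOTS 4   /* hypothetical number of buffer slots */

/* Only the two fields involved in the LRU calculation. */
typedef struct ToySlru
{
    int page_lru_count[TOY_NUM_SLOTS];
    int cur_lru_count;
} ToySlru;

/* Mark a slot "most recently used". */
static void
toy_mark_recently_used(ToySlru *s, int slotno)
{
    assert(slotno >= 0 && slotno < TOY_NUM_SLOTS);
    s->page_lru_count[slotno] = ++s->cur_lru_count;
}

/* The oldest slot has the highest cur_lru_count - page_lru_count[slotno]. */
static int
toy_oldest_slot(const ToySlru *s)
{
    int best_slot = 0;
    int best_age = -1;

    for (int slotno = 0; slotno < TOY_NUM_SLOTS; slotno++)
    {
        int age = s->cur_lru_count - s->page_lru_count[slotno];

        if (age > best_age)
        {
            best_age = age;
            best_slot = slotno;
        }
    }
    return best_slot;
}
```

Touching a slot costs one increment and one store, and the scan for a victim only runs when a slot actually has to be reused.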

static MultiXactStateData *MultiXactState;

typedef struct MultiXactStateData
{
    /* next-to-be-assigned MultiXactId */
    MultiXactId nextMXact;

    /* next-to-be-assigned offset */
    MultiXactOffset nextOffset;

    /* the Offset SLRU area was last truncated at this MultiXactId */
    MultiXactId lastTruncationPoint;

    /*
     * Per-backend data starts here. We have two arrays stored in the area
     * immediately following the MultiXactStateData struct. Each is indexed by
     * BackendId. (Note: valid BackendIds run from 1 to MaxBackends; element
     * zero of each array is never used.)
     *
     * OldestMemberMXactId[k] is the oldest MultiXactId each backend's current
     * transaction(s) could possibly be a member of, or InvalidMultiXactId
     * when the backend has no live transaction that could possibly be a
     * member of a MultiXact. Each backend sets its entry to the current
     * nextMXact counter just before first acquiring a shared lock in a given
     * transaction, and clears it at transaction end. (This works because only
     * during or after acquiring a shared lock could an XID possibly become a
     * member of a MultiXact, and that MultiXact would have to be created
     * during or after the lock acquisition.)
     *
     * OldestVisibleMXactId[k] is the oldest MultiXactId each backend's
     * current transaction(s) think is potentially live, or InvalidMultiXactId
     * when not in a transaction or not in a transaction that's paid any
     * attention to MultiXacts yet. This is computed when first needed in a
     * given transaction, and cleared at transaction end. We can compute it
     * as the minimum of the valid OldestMemberMXactId[] entries at the time
     * we compute it (using nextMXact if none are valid). Each backend is
     * required not to attempt to access any SLRU data for MultiXactIds older
     * than its own OldestVisibleMXactId[] setting; this is necessary because
     * the checkpointer could truncate away such data at any instant.
     *
     * The checkpointer can compute the safe truncation point as the oldest
     * valid value among all the OldestMemberMXactId[] and
     * OldestVisibleMXactId[] entries, or nextMXact if none are valid.
     * Clearly, it is not possible for any later-computed OldestVisibleMXactId
     * value to be older than this, and so there is no risk of truncating data
     * that is still needed.
     */
    MultiXactId perBackendXactIds[1];   /* VARIABLE LENGTH ARRAY */
} MultiXactStateData;
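The truncation rule in the last paragraph of that comment can be written out as a short function. This is a sketch under assumptions: ToyState and the toy_ names are hypothetical, the two per-backend arrays are declared as separate fixed-size members rather than as one trailing variable-length area, InvalidMultiXactId is taken to be 0, and "older" uses a wraparound-aware comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

#define TOY_INVALID_MXID ((MultiXactId) 0)   /* assumed invalid value */
#define TOY_MAX_BACKENDS 4                   /* hypothetical MaxBackends */

/* Element 0 of each array is unused; valid backend ids run from 1. */
typedef struct ToyState
{
    MultiXactId next_mxact;
    MultiXactId oldest_member[TOY_MAX_BACKENDS + 1];
    MultiXactId oldest_visible[TOY_MAX_BACKENDS + 1];
} ToyState;

/* Wraparound-aware "a is older than b" (assumption). */
static bool
toy_mxid_precedes(MultiXactId a, MultiXactId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Oldest valid value among all per-backend entries, or next_mxact if no
 * entry is valid.
 */
static MultiXactId
toy_truncation_point(const ToyState *st)
{
    MultiXactId oldest = st->next_mxact;

    for (int k = 1; k <= TOY_MAX_BACKENDS; k++)
    {
        MultiXactId m = st->oldest_member[k];

        if (m != TOY_INVALID_MXID && toy_mxid_precedes(m, oldest))
            oldest = m;

        m = st->oldest_visible[k];
        if (m != TOY_INVALID_MXID && toy_mxid_precedes(m, oldest))
            oldest = m;
    }
    return oldest;
}
```

Starting from next_mxact means that an idle system, where every entry is invalid, allows truncation right up to the next ID to be assigned.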

Now let's look at the in-memory layout after the "MultiXactOffset Ctl", "MultiXactMember Ctl" and "Shared MultiXact State" structures have been initialized.

Figure: memory layout after the MultiXact structures have been initialized

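To keep the diagram above compact, I removed the HCTL structure that is created along with the shmem hash table index "ShmemIndex"; its role is to record the information needed to create an extensible hash table. I added the part on the left with the gray background, which gives an overview of the physical layout of the variables in shared memory (shmem), from bottom to top and from low addresses to high addresses. The "MultiXact Ctl" structures in it, that is, the structures of MultiXactOffsetCtl and MultiXactMemberCtl, are shown in separate diagrams below; otherwise the diagram above would be too large and too complex.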