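Postgres Source Code Analysis 46: The Visibility Map (VM)

Introduction

To implement multi-version concurrency control (MVCC), Postgres does not physically remove a tuple when a transaction deletes or updates it. Instead, the tuple is deleted logically, by setting fields in the tuple header such as xmax and the infomask flag bits. As the workload accumulates, the table bloats, which skews plan generation and the choice of the cheapest path. VACUUM can be called to clean up these dead tuples. However, a table may consist of many pages, so quickly locating the pages that contain dead tuples matters, especially under high concurrency. Fortunately, PG keeps an auxiliary file for each table, the visibility map (VM), to speed up deciding whether a heap block contains dead tuples.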

VM File Structure


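The VM stores two bits for every heap page (all-visible and all-frozen), which record whether the page may contain dead tuples and whether every tuple on the page is frozen, respectively.
When the all-visible bit is set, every tuple on the page is visible to all subsequent transactions, so the page does not need to be vacuumed;
When the all-frozen bit is set, every tuple on the page has been frozen, so even a whole-table-scanning VACUUM can skip the page.
NOTE: the all-frozen bit may only be set on a page whose all-visible bit is also set.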

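A brief look at how the flag bits are written and updated.
The bits have the following meaning:
all-visible bit: 0 ==> the page may contain dead tuples; 1 ==> every tuple is visible, no dead tuples
all-frozen bit: 0 ==> the page contains unfrozen tuples; 1 ==> every tuple is frozen and visible
For ease of explanation, take the first byte of the map contents as an example:
Binary content of the byte, written as one bit pair per heap page from left to right, all-visible bit first: 00 00 00 10
From this, the first three pages of the heap table may contain dead tuples, while the fourth page has none.
Scenario: VACUUM runs on the heap table. The dead tuples in block 1 are removed, so its all-visible bit must be set, and all tuples in block 4 are now frozen.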

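Data is read in units of bytes, so the start address of the page contents is obtained as a char *map array, and the all-visible and all-frozen bits are then located by offset.
1 The bits for Block-1 are 00; after all-visible is set they become 10;
2 The bits for Block-4 are 10; after all-frozen is set they become 11;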
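
The update above can be replayed on a single byte. The following is a minimal sketch, not PostgreSQL code: the helpers slot_status, slot_set and replay_example are hypothetical names, and slots 0..3 stand for the article's Block-1..Block-4. It follows the macros quoted in this article, where the all-visible flag is 0x01 and the pair for the first block sits in the lowest two bits of the byte, so the in-memory value of the example byte is 0x40 rather than the left-to-right notation used above.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK        2
#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02
#define VISIBILITYMAP_VALID_BITS  0x03

/* Hypothetical helper: read the two status bits of slot 0..3 in one map byte. */
uint8_t
slot_status(uint8_t byte, unsigned slot)
{
    assert(slot < 4);
    return (uint8_t) ((byte >> (slot * BITS_PER_HEAPBLOCK)) & VISIBILITYMAP_VALID_BITS);
}

/* Hypothetical helper: OR the given flags into slot 0..3 and return the new byte. */
uint8_t
slot_set(uint8_t byte, unsigned slot, uint8_t flags)
{
    assert(slot < 4);
    assert((flags & ~VISIBILITYMAP_VALID_BITS) == 0);
    return (uint8_t) (byte | (flags << (slot * BITS_PER_HEAPBLOCK)));
}

/* Replays the scenario from the text on one byte and returns the result. */
uint8_t
replay_example(void)
{
    uint8_t byte = 0x40;    /* only the fourth page (slot 3) is all-visible */

    byte = slot_set(byte, 0, VISIBILITYMAP_ALL_VISIBLE);   /* first page: 0x41 */
    byte = slot_set(byte, 3, VISIBILITYMAP_ALL_FROZEN);    /* fourth page: 0xC1 */
    return byte;
}
```

Because flags are only ORed in, setting a bit that is already set leaves the byte unchanged, which is why the real function can skip the write when the requested flags are already present.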

Macro Definitions and Data Structures

/* Number of bits for one heap page */
#define BITS_PER_HEAPBLOCK 2             // each heap block takes 2 bits

/* Flags for bit map */
#define VISIBILITYMAP_ALL_VISIBLE	0x01	// all_visible
#define VISIBILITYMAP_ALL_FROZEN	0x02    // all_frozen
#define VISIBILITYMAP_VALID_BITS	0x03	/* OR of all valid visibilitymap* flags bits */

/*
 * Size of the bitmap on each visibility map page, in bytes. There's no
 * extra headers, so the whole page minus the standard page header is
 * used for the bitmap.
 */
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))    // size of the bitmap area on a map page

/* Number of heap blocks we can represent in one byte */
#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)  // 1 byte covers 4 heap blocks

/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)  // heap blocks covered by one map page

/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

/* Masks for counting subsets of bits in the visibility map. */
#define VISIBLE_MASK64	UINT64CONST(0x5555555555555555) /* The lower bit of each bit pair */
#define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each bit pair */

// Access method for pages that have no line pointers; well suited to VM pages
/*
 * PageGetContents
 *		To be used in cases where the page does not contain line pointers.
 *
 * Note: prior to 8.3 this was not guaranteed to yield a MAXALIGN'd result.
 * Now it is.  Beware of old code that might think the offset to the contents
 * is just SizeOfPageHeaderData rather than MAXALIGN(SizeOfPageHeaderData).
 */
#define PageGetContents(page) \
	((char *) (page) + MAXALIGN(SizeOfPageHeaderData))
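
To make the address arithmetic of the three HEAPBLK_TO_* macros concrete, here is a small standalone sketch. It assumes a default build with an 8192-byte block size and a 24-byte aligned page header; those two constants, the VmAddress struct and the vm_locate function are illustrative assumptions, not part of the PostgreSQL source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for a default build; the real ones come from PostgreSQL headers. */
#define BLCKSZ               8192
#define PAGE_HEADER_SIZE     24     /* assumed MAXALIGN(SizeOfPageHeaderData) */
#define BITS_PER_BYTE        8
#define BITS_PER_HEAPBLOCK   2

#define MAPSIZE              (BLCKSZ - PAGE_HEADER_SIZE)
#define HEAPBLOCKS_PER_BYTE  (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
#define HEAPBLOCKS_PER_PAGE  (MAPSIZE * HEAPBLOCKS_PER_BYTE)

/* Hypothetical result type: where the bit pair of one heap block lives. */
typedef struct VmAddress
{
    uint32_t map_block;     /* which page of the VM fork */
    uint32_t map_byte;      /* which byte inside that page's bitmap */
    uint8_t  bit_offset;    /* shift of the bit pair inside that byte: 0, 2, 4 or 6 */
} VmAddress;

/* Same arithmetic as HEAPBLK_TO_MAPBLOCK / _MAPBYTE / _OFFSET. */
VmAddress
vm_locate(uint32_t heap_blk)
{
    VmAddress addr;

    addr.map_block = heap_blk / HEAPBLOCKS_PER_PAGE;
    addr.map_byte = (heap_blk % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
    addr.bit_offset = (uint8_t) ((heap_blk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK);
    return addr;
}
```

Under these assumptions one map page covers 32672 heap blocks, so heap block 32671 is the last one on map page 0 (byte 8167, offset 6) and heap block 32672 is the first one on map page 1.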

Interface Functions

1 visibilitymap_set
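The main job of this function is to set visibility flag bits. Its flow is:
1) Sanity checks first: verify that the heap buffer and vmbuf passed in are valid and that each of them holds the page that corresponds to heapBlk;
2) Get the start address of the VM page contents (skipping PageHeaderData) and take BUFFER_LOCK_EXCLUSIVE on vmbuf;
3) If the requested flag bits are not already set, do the following:
(1) Enter a critical section, set the bits at the computed position, and mark vmbuf dirty;
(2) If the relation needs WAL, write a WAL record; when data checksums or wal_log_hints are enabled, the LSN of that record must also be stamped on the heap page; finally update the LSN of the VM page and leave the critical section.
4) Release the exclusive lock held on vmbuf.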

/*
 *	visibilitymap_set - set bit(s) on a previously pinned page
 *
 * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
 * or InvalidXLogRecPtr in normal running.  The page LSN is advanced to the
 * one provided; in normal running, we generate a new XLOG record and set the
 * page LSN to that value.  cutoff_xid is the largest xmin on the page being
 * marked all-visible; it is needed for Hot Standby, and can be
 * InvalidTransactionId if the page contains no tuples.  It can also be set
 * to InvalidTransactionId when a page that is already all-visible is being
 * marked all-frozen.
 *
 * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
 * this function. Except in recovery, caller should also pass the heap
 * buffer. When checksums are enabled and we're not in recovery, we must add
 * the heap buffer to the WAL chain to protect it from being torn.
 *
 * You must pass a buffer containing the correct map page to this function.
 * Call visibilitymap_pin first to pin the right one. This function doesn't do
 * any I/O.
 */
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
				  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
				  uint8 flags)
{
	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
	Page		page;
	uint8	   *map;

#ifdef TRACE_VISIBILITYMAP
	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
#endif

	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
	Assert(InRecovery || BufferIsValid(heapBuf));
	Assert(flags & VISIBILITYMAP_VALID_BITS);

	/* Check that we have the right heap page pinned, if present */
	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
		elog(ERROR, "wrong heap buffer passed to visibilitymap_set");

	/* Check that we have the right VM page pinned */
	if (!BufferIsValid(vmBuf) || BufferGetBlockNumber(vmBuf) != mapBlock)
		elog(ERROR, "wrong VM buffer passed to visibilitymap_set");

	page = BufferGetPage(vmBuf);
	map = (uint8 *) PageGetContents(page);
	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);

	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
	{
		START_CRIT_SECTION();

		map[mapByte] |= (flags << mapOffset);
		MarkBufferDirty(vmBuf);

		if (RelationNeedsWAL(rel))
		{
			if (XLogRecPtrIsInvalid(recptr))
			{
				Assert(!InRecovery);
				recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
										  cutoff_xid, flags);

				/*
				 * If data checksums are enabled (or wal_log_hints=on), we
				 * need to protect the heap page from being torn.
				 */
				if (XLogHintBitIsNeeded())
				{
					Page		heapPage = BufferGetPage(heapBuf);

					/* caller is expected to set PD_ALL_VISIBLE first */
					Assert(PageIsAllVisible(heapPage));
					PageSetLSN(heapPage, recptr);
				}
			}
			PageSetLSN(page, recptr);
		}

		END_CRIT_SECTION();
	}

	LockBuffer(vmBuf, BUFFER_LOCK_UNLOCK);
}

2 visibilitymap_get_status

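1) First check whether vmbuf is valid. If it is, check whether the page it holds is the map page for the heap block; if the two do not match, release the pin on vmbuf;
2) If vmbuf is invalid, call vm_readbuf to load the VM page into a buffer and return it in vmbuf; if the returned buffer is still invalid, return false (no bits set) and exit;
3) Then take the start address of the VM page contents and read the flag bits at the computed offset.
Only a pin is needed here; BUFFER_LOCK_SHARE is not taken.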

/*
 *	visibilitymap_get_status - get status of bits
 *
 * Are all tuples on heapBlk visible to all or are marked frozen, according
 * to the visibility map?
 *
 * On entry, *buf should be InvalidBuffer or a valid buffer returned by an
 * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
 * relation. On return, *buf is a valid buffer with the map page containing
 * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
 * releasing *buf after it's done testing and setting bits.
 *
 * NOTE: This function is typically called without a lock on the heap page,
 * so somebody else could change the bit just after we look at it.  In fact,
 * since we don't lock the visibility map page either, it's even possible that
 * someone else could have changed the bit just before we look at it, but yet
 * we might see the old value.  It is the caller's responsibility to deal with
 * all concurrency issues!
 */
uint8
visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
	char	   *map;
	uint8		result;

#ifdef TRACE_VISIBILITYMAP
	elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif

	/* Reuse the old pinned buffer if possible */
	if (BufferIsValid(*buf))
	{
		if (BufferGetBlockNumber(*buf) != mapBlock)
		{
			ReleaseBuffer(*buf);
			*buf = InvalidBuffer;
		}
	}

	if (!BufferIsValid(*buf))
	{
		*buf = vm_readbuf(rel, mapBlock, false);
		if (!BufferIsValid(*buf))
			return false;
	}

	map = PageGetContents(BufferGetPage(*buf));

	/*
	 * A single byte read is atomic.  There could be memory-ordering effects
	 * here, but for performance reasons we make it the caller's job to worry
	 * about that.
	 */
	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
	return result;
}
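
The read path boils down to one shift and one mask on a byte of the map. The sketch below models it on a plain byte array instead of a pinned buffer; toy_get_status, toy_all_visible and toy_all_frozen are hypothetical names, and the out-of-range branch is only an assumption that imitates the "no VM page, so no bits set" result of the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK        2
#define HEAPBLOCKS_PER_BYTE       4
#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02
#define VISIBILITYMAP_VALID_BITS  0x03

/*
 * Hypothetical stand-in for visibilitymap_get_status: 'map' plays the role
 * of the bitmap area of one VM page, 'map_len' is its length in bytes.
 */
uint8_t
toy_get_status(const uint8_t *map, size_t map_len, uint32_t heap_blk)
{
    size_t   map_byte = heap_blk / HEAPBLOCKS_PER_BYTE;
    unsigned map_offset = (heap_blk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

    if (map_byte >= map_len)
        return 0;               /* nothing mapped: report no bits set */

    /* one byte is read, then the pair is shifted down and masked */
    return (uint8_t) ((map[map_byte] >> map_offset) & VISIBILITYMAP_VALID_BITS);
}

/* Callers usually test a single flag of the returned status. */
bool
toy_all_visible(const uint8_t *map, size_t map_len, uint32_t heap_blk)
{
    return (toy_get_status(map, map_len, heap_blk) & VISIBILITYMAP_ALL_VISIBLE) != 0;
}

bool
toy_all_frozen(const uint8_t *map, size_t map_len, uint32_t heap_blk)
{
    return (toy_get_status(map, map_len, heap_blk) & VISIBILITYMAP_ALL_FROZEN) != 0;
}
```

Since no lock is held, a value read this way is only a snapshot. A caller that acts on it has to tolerate the bit changing right afterwards, as the NOTE in the function header points out.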

3 vm_readbuf

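vm_readbuf loads the specified VM page into the buffer pool. If the page lies beyond the end of the fork and extend is true, the fork is extended first, and a page that is still new is initialized.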

/*
 * Read a visibility map page.
 *
 * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
 * true, the visibility map file is extended.
 */
static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
	Buffer		buf;
	SMgrRelation reln;

	/*
	 * Caution: re-using this smgr pointer could fail if the relcache entry
	 * gets closed.  It's safe as long as we only do smgr-level operations
	 * between here and the last use of the pointer.
	 */
	reln = RelationGetSmgr(rel);

	/*
	 * If we haven't cached the size of the visibility map fork yet, check it
	 * first.
	 */
	if (reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
	{
		if (smgrexists(reln, VISIBILITYMAP_FORKNUM))    // if the fork exists, smgrnblocks() caches its size
			smgrnblocks(reln, VISIBILITYMAP_FORKNUM);
		else
			reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
	}

	/* Handle requests beyond EOF */
	// The requested block is past the current end of the fork: call
	// vm_extend if extension was requested, otherwise return InvalidBuffer
	if (blkno >= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
	{
		if (extend)
			vm_extend(rel, blkno + 1);
		else
			return InvalidBuffer;
	}

	/*
	 * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
	 * always safe to clear bits, so it's better to clear corrupt pages than
	 * error out.
	 *
	 * The initialize-the-page part is trickier than it looks, because of the
	 * possibility of multiple backends doing this concurrently, and our
	 * desire to not uselessly take the buffer lock in the normal path where
	 * the page is OK.  We must take the lock to initialize the page, so
	 * recheck page newness after we have the lock, in case someone else
	 * already did it.  Also, because we initially check PageIsNew with no
	 * lock, it's possible to fall through and return the buffer while someone
	 * else is still initializing the page (i.e., we might see pd_upper as set
	 * but other page header fields are still zeroes).  This is harmless for
	 * callers that will take a buffer lock themselves, but some callers
	 * inspect the page without any lock at all.  The latter is OK only so
	 * long as it doesn't depend on the page header having correct contents.
	 * Current usage is safe because PageGetContents() does not require that.
	 */
	// Normal path: pick a buffer from the shared buffer pool to hold the
	// requested VM page. If the page is new, take BUFFER_LOCK_EXCLUSIVE and
	// check again whether it is still new. The check is done twice because
	// another backend may have finished the initialization while this one
	// was waiting for the lock.
	buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
							 RBM_ZERO_ON_ERROR, NULL);
	if (PageIsNew(BufferGetPage(buf)))
	{
		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
		if (PageIsNew(BufferGetPage(buf)))
			PageInit(BufferGetPage(buf), BLCKSZ, 0);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	}
	return buf;
}
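
The check, lock, recheck pattern at the end of vm_readbuf can be isolated in a few lines. This is a toy model under stated assumptions, not the buffer manager: ToyPage, toy_page_reset and toy_init_if_new are hypothetical, a C11 atomic_flag spinlock stands in for the buffer content lock, and a zero pd_upper stands in for PageIsNew().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy page: pd_upper == 0 plays the role of PageIsNew(). */
typedef struct ToyPage
{
    atomic_flag lock;           /* stand-in for the buffer content lock */
    atomic_uint pd_upper;       /* 0 means the page was never initialized */
    atomic_uint init_calls;     /* how many times the init path really ran */
} ToyPage;

/* Put the toy page into the "new, unlocked" state. */
void
toy_page_reset(ToyPage *p)
{
    atomic_flag_clear(&p->lock);
    atomic_init(&p->pd_upper, 0);
    atomic_init(&p->init_calls, 0);
}

static bool
toy_page_is_new(ToyPage *p)
{
    return atomic_load(&p->pd_upper) == 0;
}

/* Initialize the page at most once, taking the lock only if it looks new. */
void
toy_init_if_new(ToyPage *p, unsigned page_size)
{
    if (toy_page_is_new(p))             /* cheap check without the lock */
    {
        while (atomic_flag_test_and_set(&p->lock))
            ;                           /* spin until the lock is ours */

        if (toy_page_is_new(p))         /* recheck: someone may have won the race */
        {
            atomic_store(&p->pd_upper, page_size);
            atomic_fetch_add(&p->init_calls, 1);
        }
        atomic_flag_clear(&p->lock);
    }
}
```

The first check keeps the common case (page already initialized) lock-free; the second check under the lock is what prevents two backends from initializing the same page twice.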

4 vm_extend

When the requested VM page does not yet exist in the file, vm_extend is called to add new pages and initialize them. Its flow is:

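1) Initialize a page image in local memory first, filling in the PageHeader fields pd_lower and pd_upper and the initial flags;
2) Take the relation extension lock so that no other backend extends the fork at the same time;
3) If the file does not exist, call smgrcreate to create it; otherwise go to step 4);
4) Get the current number of VM blocks; while it is smaller than the requested number, append the initialized page with smgrextend in a loop;
5) Send a shared-invalidation message that forces other backends to close their smgr references to rel. This way they do not have to keep checking whether the file was created or extended, which happens infrequently;
6) Finally release the lock.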

/*
 * Ensure that the visibility map fork is at least vm_nblocks long, extending
 * it if necessary with zeroed pages.
 */
static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
	BlockNumber vm_nblocks_now;
	PGAlignedBlock pg;
	SMgrRelation reln;

	PageInit((Page) pg.data, BLCKSZ, 0);

	/*
	 * We use the relation extension lock to lock out other backends trying to
	 * extend the visibility map at the same time. It also locks out extension
	 * of the main fork, unnecessarily, but extending the visibility map
	 * happens seldom enough that it doesn't seem worthwhile to have a
	 * separate lock tag type for it.
	 *
	 * Note that another backend might have extended or created the relation
	 * by the time we get the lock.
	 */
	LockRelationForExtension(rel, ExclusiveLock);

	/*
	 * Caution: re-using this smgr pointer could fail if the relcache entry
	 * gets closed.  It's safe as long as we only do smgr-level operations
	 * between here and the last use of the pointer.
	 */
	reln = RelationGetSmgr(rel);

	/*
	 * Create the file first if it doesn't exist.  If smgr_vm_nblocks is
	 * positive then it must exist, no need for an smgrexists call.
	 */
	if ((reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
		 reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
		!smgrexists(reln, VISIBILITYMAP_FORKNUM))
		smgrcreate(reln, VISIBILITYMAP_FORKNUM, false);

	/* Invalidate cache so that smgrnblocks() asks the kernel. */
	reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
	vm_nblocks_now = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);

	/* Now extend the file */
	while (vm_nblocks_now < vm_nblocks)
	{
		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);

		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
		vm_nblocks_now++;
	}

	/*
	 * Send a shared-inval message to force other backends to close any smgr
	 * references they may have for this rel, which we are about to change.
	 * This is a useful optimization because it means that backends don't have
	 * to keep checking for creation or extension of the file, which happens
	 * infrequently.
	 */
	CacheInvalidateSmgr(reln->smgr_rnode);

	UnlockRelationForExtension(rel, ExclusiveLock);
}
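
The core of vm_extend is an "extend until at least N blocks" loop. The sketch below reproduces only that loop on an ordinary stdio file; it leaves out locking, checksums and cache invalidation, uses an all-zero template block instead of a PageInit'ed page, and toy_nblocks, toy_extend and TOY_BLCKSZ are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_BLCKSZ 8192     /* assumed block size */

/* Returns the number of whole blocks in the file, or -1 on error. */
long
toy_nblocks(FILE *f)
{
    long size;

    if (fseek(f, 0L, SEEK_END) != 0)
        return -1;
    size = ftell(f);
    return (size < 0) ? -1 : size / TOY_BLCKSZ;
}

/*
 * Make sure the file is at least want_blocks long by appending copies of a
 * template block. Returns the resulting block count, or -1 on error.
 */
long
toy_extend(FILE *f, long want_blocks)
{
    static const char template_block[TOY_BLCKSZ];  /* zero-filled */
    long now = toy_nblocks(f);

    if (now < 0)
        return -1;

    /* a plain loop: one block is appended per iteration */
    while (now < want_blocks)
    {
        if (fseek(f, now * (long) TOY_BLCKSZ, SEEK_SET) != 0)
            return -1;
        if (fwrite(template_block, 1, TOY_BLCKSZ, f) != TOY_BLCKSZ)
            return -1;
        now++;
    }

    if (fflush(f) != 0)
        return -1;
    return now;
}
```

Asking for fewer blocks than the file already has is a no-op, which matches the "at least vm_nblocks long" contract in the function header: another backend may already have extended the fork by the time the lock is obtained.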