PostgreSQL transactions: snapshots

Snapshots in PostgreSQL

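A snapshot is a data structure that records the state of the database's transactions at one instant. A PostgreSQL snapshot stores the lowest XID still running (xmin), the upper bound on completed XIDs (xmax), the list of transactions in progress at that moment, the command ID of the current transaction, and so on.
Snapshot data is held in the SnapshotData struct, defined in src/include/utils/snapshot.h.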

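typedef struct SnapshotData
{
    SnapshotType snapshot_type; /* type of snapshot */
    TransactionId xmin;         /* all XIDs < xmin are visible to this snapshot */
    TransactionId xmax;         /* all XIDs >= xmax are invisible to this snapshot */

    /* Transactions in progress when the snapshot was taken. Only XIDs between xmin and xmax are listed. */
    TransactionId *xip;
    uint32      xcnt;           /* number of XIDs in xip[] */

    /* Subtransactions in progress when the snapshot was taken */
    TransactionId *subxip;
    int32       subxcnt;        /* number of XIDs in subxip[] */
    bool        suboverflowed;  /* has the subxip array overflowed? This happens when there are many subtransactions */

    bool        takenDuringRecovery;    /* recovery-shaped snapshot? */
    bool        copied;         /* false if it's a static snapshot; presumably true when the snapshot
                                 * was copied (repeatable read and serializable copy their snapshot) */
    CommandId   curcid;         /* within my own transaction, CID < curcid is visible */
    ...
    TimestampTz whenTaken;      /* timestamp when the snapshot was taken */
    XLogRecPtr  lsn;            /* WAL position (LSN) when the snapshot was taken */
} SnapshotData;
typedef struct SnapshotData *Snapshot;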

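The most important fields of a snapshot are xmin, xmax and xip_list. pg_current_snapshot() shows the snapshot of the current transaction (in PG12 and earlier, use txid_current_snapshot()).

Do not confuse a snapshot's xmin and xmax with the xmin and xmax stored on a tuple: they mean different things.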

lzldb=*# select pg_current_snapshot();
 pg_current_snapshot
---------------------
 100:104:100,102

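xmin	The earliest XID still active. Every transaction with txid < xmin has finished: it either committed, so its changes are visible, or rolled back, so its tuples are dead.
xmax	One past the highest completed XID, that is, xmax = latestCompletedXid + 1. Every transaction with txid >= xmax had not completed when the snapshot was taken (it was either not started or still running), so it is invisible to this snapshot.
xip_list	The list of transactions in progress when the snapshot was taken, stored in the array xip[]. Transactions do not necessarily finish in the order they start: one that starts later may finish earlier. xmin and xmax alone therefore cannot describe every transaction that was active at snapshot time, and xip_list fills that gap.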
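
To make the three fields concrete, here is a minimal sketch of how a snapshot such as 100:104:100,102 classifies an XID. The names toy_snapshot, xid_t and xid_in_progress_for are hypothetical and not part of PostgreSQL. The sketch deliberately ignores XID wraparound, subtransactions and the commit log lookup that the real visibility code also performs, so it only answers "was this XID still in progress as far as the snapshot is concerned?".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;   /* hypothetical stand-in for TransactionId */

/* Hypothetical, simplified snapshot: only the three fields discussed above. */
typedef struct {
    xid_t        xmin;   /* every XID < xmin had finished */
    xid_t        xmax;   /* every XID >= xmax had not completed */
    const xid_t *xip;    /* XIDs in [xmin, xmax) still running at snapshot time */
    size_t       xcnt;   /* number of entries in xip */
} toy_snapshot;

/*
 * Returns true if the snapshot treats xid as "in progress", meaning its
 * effects must be invisible. Plain integer comparison is an assumption of
 * this sketch; it does not handle XID wraparound.
 */
bool xid_in_progress_for(const toy_snapshot *snap, xid_t xid)
{
    assert(snap != NULL && snap->xmin <= snap->xmax);

    if (xid < snap->xmin)
        return false;            /* finished before the snapshot was taken */
    if (xid >= snap->xmax)
        return true;             /* not completed at snapshot time */

    /* Between the bounds: in progress only if listed in xip[]. */
    for (size_t i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}
```

With the snapshot 100:104:100,102, XIDs 99, 101 and 103 count as finished, while 100, 102 and 104 count as in progress. Whether a finished XID committed or aborted still has to be looked up separately.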


Snapshot types

Besides the MVCC snapshot, src/include/utils/snapshot.h defines several other snapshot types:

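typedef enum SnapshotType
{
    /*
     * A tuple is visible if and only if it satisfies the MVCC visibility rules.
     * This is the most important snapshot type: it is what PostgreSQL uses to implement MVCC.
     * Visibility is decided from the snapshot's xmin, xmax, xip_list, curcid and so on.
     * Changes made by the current command are not visible; they become visible only to a later command.
     */
    SNAPSHOT_MVCC = 0,

    /*
     * Visible if the transaction that wrote the tuple has committed.
     * Tuples written by other in-progress transactions are not visible.
     * Changes made by the current command are visible.
     */
    SNAPSHOT_SELF,

    /*
     * Every tuple is visible.
     */
    SNAPSHOT_ANY,

    /*
     * A TOAST tuple is visible as long as it is valid. Its visibility follows
     * the visibility of the main-table tuple that references it.
     */
    SNAPSHOT_TOAST,

    /*
     * Changes made by the current command are visible, and so are the effects
     * of other transactions that are still in progress.
     * A dirty snapshot also reports which in-progress transactions touched the tuple:
     * snapshot->xmin is set to the tuple's xmin if that is another transaction
     * still in progress, and snapshot->xmax is handled the same way.
     */
    SNAPSHOT_DIRTY,

    /*
     * Same visibility rules as SNAPSHOT_MVCC; used for logical decoding.
     */
    SNAPSHOT_HISTORIC_MVCC,

    /*
     * Visible if the tuple might still be visible to some transaction,
     * that is, if it is not yet dead to everyone (not vacuumable).
     */
    SNAPSHOT_NON_VACUUMABLE
} SnapshotType;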

Snapshots and isolation levels

Different isolation levels obtain their snapshots in different ways.


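In read committed (RC) mode every SQL statement in a transaction takes its own snapshot, whereas in repeatable read (RR) mode the whole transaction uses a single snapshot. The logic that chooses between the two lives in GetTransactionSnapshot().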

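Per-process transaction structs

To build a snapshot, PostgreSQL has to inspect the transaction state of every backend process.

So before reading GetSnapshotData(), the function that builds the snapshot data, it helps to understand a few structs that describe backend processes: PGPROC, PGXACT, PROC_HDR (ProcGlobal) and ProcArray.

These structs also hold process and lock information; this article only looks at the transaction-related parts. The code shown is from the PG13 source.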

The PGPROC struct

Source: src/include/storage/proc.h

// Every backend process has a PGPROC struct in shared memory.
// Think of it as the main struct of a backend process.
struct PGPROC
{
    ...
    LocalTransactionId lxid;    /* local id of top-level transaction currently
                                 * being executed by this proc, if running;
                                 * else InvalidLocalTransactionId */
    ...
    struct XidCache subxids;    /* cache of subtransaction XIDs */
    ...
    /* Support for group transaction status update in clog */
    bool        clogGroupMember;    /* is this proc taking part in a clog group update? */
    pg_atomic_uint32 clogGroupNext; /* atomic int pointing to the next proc in the group */
    TransactionId clogGroupMemberXid;   /* the XID whose status is being updated */
    XidStatus   clogGroupMemberXidStatus;   /* the status to set for that XID */
    int         clogGroupMemberPage;    /* the clog page that XID belongs to */
    XLogRecPtr  clogGroupMemberLsn; /* LSN of the commit record for that XID */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h (surprisingly, not next to the struct definition). */

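The PGXACT struct

// Before 9.2 the PGXACT fields lived inside PGPROC. They were split out because
// benchmarks on multi-CPU systems showed GetSnapshotData() running faster when
// it has to fetch fewer cache lines.
typedef struct PGXACT
{
    TransactionId xid;          /* id of top-level transaction currently being
                                 * executed by this proc, if running and XID
                                 * is assigned; else InvalidTransactionId */

    // This is the backend's own XID, not a snapshot xmax.
    TransactionId xmin;         /* lowest running XID as of when this transaction started,
                                 * excluding lazy vacuum; vacuum must not remove
                                 * tuples deleted by xid >= xmin */

    uint8       vacuumFlags;    /* vacuum-related flags, see above */
    bool        overflowed;     // has this backend's subtransaction XID cache overflowed?
    uint8       nxids;
} PGXACT;

PGXACT is small: it holds the backend's xid, xmin and a few other transaction fields. PGPROC leans toward general backend information; it still contains some transaction data that is read less often, but the core per-backend transaction state is in PGXACT.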

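The PROC_HDR (ProcGlobal) struct

Every backend process has its own proc struct. Under high concurrency, visiting each backend's struct one by one to find transaction information would be slow, so an instance-wide struct that gathers all the proc information is needed. That struct is reached through ProcGlobal.

In the source, ProcGlobal is a pointer of type PROC_HDR. PROC_HDR holds the global proc information: the arrays of all procs, the free proc lists, and so on.

Source: src/include/storage/proc.h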

typedef struct PROC_HDR
{
    /* Array of PGPROC structures (not including dummies for prepared txns) */
    PGPROC     *allProcs;
    /* Array of PGXACT structures (not including dummies for prepared txns) */
    PGXACT     *allPgXact;
    ...
    /* Current shared estimate of appropriate spins_per_delay value */
    int         spins_per_delay;
    /* The proc of the Startup process, since not in ProcArray */
    PGPROC     *startupProc;
    int         startupProcPid;
    /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
    int         startupBufferPinWaitBufId;
} PROC_HDR;

The ProcArray struct

The ProcArray is defined in procarray.c, the module that maintains the PGPROC and PGXACT structures of all backends.

Source: src/backend/storage/ipc/procarray.c

typedef struct ProcArrayStruct
{
    int         numProcs;       /* number of valid proc entries */
    int         maxProcs;       /* allocated size of the proc array */

    // Bookkeeping for already-assigned XIDs
    int         maxKnownAssignedXids;   /* allocated size of array */
    int         numKnownAssignedXids;   /* current # of valid entries */
    int         tailKnownAssignedXids;  /* index of oldest valid element */
    int         headKnownAssignedXids;  /* index of newest element, + 1 */
    slock_t     known_assigned_xids_lck;    /* protects head/tail pointers */

    /*
     * Highest subxid that has been removed from KnownAssignedXids array to
     * prevent overflow; or InvalidTransactionId if none.  We track this for
     * similar reasons to tracking overflowing cached subxids in PGXACT
     * entries.  Must hold exclusive ProcArrayLock to change this, and shared
     * lock to read it.
     */
    TransactionId lastOverflowedXid;

    /* oldest xmin of any replication slot */
    TransactionId replication_slot_xmin;
    /* oldest catalog xmin of any replication slot */
    TransactionId replication_slot_catalog_xmin;

    /* pgprocnos: indexes into allPgXact[], used to look up entries there; the array has PROCARRAY_MAXPROCS entries */
    int         pgprocnos[FLEXIBLE_ARRAY_MEMBER];
} ProcArrayStruct;
static ProcArrayStruct *procArray;
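
The pgprocnos[] member is a dense list of slot numbers: only the first numProcs entries are valid, and each one is an index into the much larger per-backend arrays. A minimal sketch of maintaining such a list follows. The names toy_proc_array, toy_proc_array_add and toy_proc_array_remove are hypothetical, the capacity is arbitrary, and keeping the list in ascending order is an assumption of the sketch rather than a statement about procarray.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PROCS 8          /* arbitrary capacity for the sketch */

/* Hypothetical miniature of the numProcs / pgprocnos[] pair. */
typedef struct {
    int numProcs;                    /* number of valid entries */
    int pgprocnos[TOY_MAX_PROCS];    /* slot numbers of live backends */
} toy_proc_array;

/* Register a backend slot, keeping the list dense and ascending. */
bool toy_proc_array_add(toy_proc_array *a, int pgprocno)
{
    assert(a != NULL);
    if (a->numProcs >= TOY_MAX_PROCS)
        return false;                /* no room left */

    int i = 0;
    while (i < a->numProcs && a->pgprocnos[i] < pgprocno)
        i++;
    /* Shift the tail up by one to open a hole at position i. */
    memmove(&a->pgprocnos[i + 1], &a->pgprocnos[i],
            (size_t) (a->numProcs - i) * sizeof(int));
    a->pgprocnos[i] = pgprocno;
    a->numProcs++;
    return true;
}

/* Unregister a backend slot; returns false if it was not present. */
bool toy_proc_array_remove(toy_proc_array *a, int pgprocno)
{
    assert(a != NULL);
    for (int i = 0; i < a->numProcs; i++)
    {
        if (a->pgprocnos[i] == pgprocno)
        {
            /* Close the hole so the first numProcs entries stay valid. */
            memmove(&a->pgprocnos[i], &a->pgprocnos[i + 1],
                    (size_t) (a->numProcs - i - 1) * sizeof(int));
            a->numProcs--;
            return true;
        }
    }
    return false;
}
```

Because the list is dense, a scan only has to visit numProcs entries instead of every possible slot, which is the property the snapshot code relies on when it loops over the backends.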

Taking a snapshot

GetTransactionSnapshot()

Snapshots are obtained through the function GetTransactionSnapshot().

Source: src/backend/utils/time/snapmgr.c

// GetTransactionSnapshot() hands out the appropriate snapshot for a SQL statement in a transaction
Snapshot
GetTransactionSnapshot(void)
{
    // Return historic snapshot if doing logical decoding. This transaction
    // will never need a non-historic snapshot afterwards, so just return.
    if (HistoricSnapshotActive())
    {
        Assert(!FirstSnapshotSet);
        return HistoricSnapshot;
    }

    /* First call in this transaction? If so, enter the if block */
    if (!FirstSnapshotSet)
    {
        /*
         * Make sure the catalog snapshot is fresh
         */
        InvalidateCatalogSnapshot();

        Assert(pairingheap_is_empty(&RegisteredSnapshots));
        Assert(FirstXactSnapshot == NULL);

        // Taking a snapshot in parallel mode is an error
        if (IsInParallelMode())
            elog(ERROR,
                 "cannot take query snapshot during a parallel operation");

        // IsolationUsesXactSnapshot() is true for repeatable read and serializable.
        // These levels use one snapshot for the whole transaction, so it is copied only once.
        if (IsolationUsesXactSnapshot())
        {
            // First, create the snapshot in CurrentSnapshotData.
            // For the serializable level, also initialize the data structures SSI needs.
            if (IsolationIsSerializable())
                CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
            else
                CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
            /* Make a saved copy */
            /* This snapshot lives for the whole transaction, so it is copied exactly once */
            CurrentSnapshot = CopySnapshot(CurrentSnapshot);
            FirstXactSnapshot = CurrentSnapshot;
            /* Mark it as "registered" in FirstXactSnapshot */
            FirstXactSnapshot->regd_count++;
            pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
        }
        else
            // Read committed: just take a snapshot
            CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

        // Record that the first snapshot has been taken, so later calls
        // in this transaction skip this block
        FirstSnapshotSet = true;
        return CurrentSnapshot;
    }

    // Not the first call in the transaction (a first snapshot already exists).
    // Repeatable read or serializable: return the saved copy of the first snapshot
    if (IsolationUsesXactSnapshot())
        return CurrentSnapshot;

    /* Don't allow catalog snapshot to be older than xact snapshot. */
    InvalidateCatalogSnapshot();

    // Read committed: take a fresh snapshot
    CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

    return CurrentSnapshot;
}

About IsolationUsesXactSnapshot() and IsolationIsSerializable()

Both are macros defined in src/include/access/xact.h:

#define XACT_READ_UNCOMMITTED	0
#define XACT_READ_COMMITTED	1
#define XACT_REPEATABLE_READ	2
#define XACT_SERIALIZABLE	3
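// Internally only three levels behave differently: 1, 2 and 3
// Two of them (2 and 3) use one snapshot per transaction; the others take one snapshot per SQL statement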
#define IsolationUsesXactSnapshot() (XactIsoLevel >= XACT_REPEATABLE_READ)
#define IsolationIsSerializable() (XactIsoLevel == XACT_SERIALIZABLE)

IsolationUsesXactSnapshot() is true at the repeatable read and serializable isolation levels.

IsolationIsSerializable() is true only at the serializable isolation level.

Flowchart of GetTransactionSnapshot():

[Figure: flowchart of GetTransactionSnapshot(); image not available]
(Image from CSDN: https://blog.csdn.net/Hehuyi_In)

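The main decisions in GetTransactionSnapshot():

During logical decoding, the historic snapshot is returned directly.
At repeatable read or serializable, the first call takes a snapshot and copies it, so that every later call (any call that is not the first) simply returns that snapshot.
At read committed, every call builds a new snapshot.
At serializable, the first call additionally sets up the SSI data.
GetTransactionSnapshot() hands out the snapshot; the function it calls to build the snapshot data is GetSnapshotData().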

GetSnapshotData()

Source: src/backend/storage/ipc/procarray.c

Snapshot
GetSnapshotData(Snapshot snapshot)
{
    // Local variables first: the arrayP pointer to the ProcArray, xmin, xmax, the replication slot XIDs, and so on
    ProcArrayStruct *arrayP = procArray;
    TransactionId xmin;
    TransactionId xmax;
    TransactionId globalxmin;
    int         index;
    int         count = 0;
    int         subcount = 0;
    bool        suboverflowed = false;
    TransactionId replication_slot_xmin = InvalidTransactionId;
    TransactionId replication_slot_catalog_xmin = InvalidTransactionId;

    Assert(snapshot != NULL);

    if (snapshot->xip == NULL)
    {
        /*
         * First call for this snapshot. Snapshot is same size whether or not
         * we are in recovery, see later comments.
         */
        // Allocate the xip[] array for top-level XIDs
        snapshot->xip = (TransactionId *)
            malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
        ...
        Assert(snapshot->subxip == NULL);
        // Allocate the subxip[] array for subtransaction XIDs
        snapshot->subxip = (TransactionId *)
            malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
        ...
    }

    // Reading the ProcArray requires ProcArrayLock in shared mode
    LWLockAcquire(ProcArrayLock, LW_SHARED);

    /* xmax = latest completed XID + 1 */
    xmax = ShmemVariableCache->latestCompletedXid;
    Assert(TransactionIdIsNormal(xmax));
    TransactionIdAdvance(xmax);     // xmax + 1

    /* xmax is known now; xmin still requires scanning PGPROC, PGXACT and the ProcArray */
    /* Start with globalxmin = xmin = xmax; if no backend has transaction info, these values simply stay */
    globalxmin = xmin = xmax;

    // Snapshots taken during recovery are handled separately
    snapshot->takenDuringRecovery = RecoveryInProgress();

    // Outside recovery, collect the transaction info from the backends
    if (!snapshot->takenDuringRecovery)
    {
        int        *pgprocnos = arrayP->pgprocnos;
        int         numProcs;

        /*
         * Spin over procArray checking xid, xmin, and subxids.  The goal is
         * to gather all active xids, find the lowest xmin, and try to record
         * subxids.
         */
        // "Spin over" here means looping over the array: the loop collects every
        // active XID, the lowest xmin, and the subtransaction XIDs.
        numProcs = arrayP->numProcs;
        for (index = 0; index < numProcs; index++)
        {
            int         pgprocno = pgprocnos[index];    // slot number of the index-th live backend
            PGXACT     *pgxact = &allPgXact[pgprocno];  // that backend's PGXACT entry
            TransactionId xid;

            ...

            /* Update globalxmin to be the smallest valid xmin */
            xid = UINT32_ACCESS_ONCE(pgxact->xmin);
            if (TransactionIdIsNormal(xid) &&
                NormalTransactionIdPrecedes(xid, globalxmin))
                globalxmin = xid;

            /* Fetch xid just once - see GetNewTransactionId */
            xid = UINT32_ACCESS_ONCE(pgxact->xid);

            ...

            /* Store the backend's xid in the snapshot's xip[] */
            /* In other words, all active XIDs are found by walking every PGXACT */
            snapshot->xip[count++] = xid;

            ...

            /* Subtransaction handling */
            if (!suboverflowed)     // only while no overflow has been seen yet
            {
                if (pgxact->overflowed)
                    suboverflowed = true;   // this backend's subxid cache overflowed, so mark the snapshot as overflowed
                else
                {
                    int         nxids = pgxact->nxids;

                    if (nxids > 0)
                    {
                        PGPROC     *proc = &allProcs[pgprocno];

                        pg_read_barrier();  /* pairs with GetNewTransactionId */

                        memcpy(snapshot->subxip + subcount,
                               (void *) proc->subxids.xids,
                               nxids * sizeof(TransactionId));
                        subcount += nxids;
                    }
                }
            }
        }
    }
    else    // matches if (!snapshot->takenDuringRecovery)
    {
        // Standby-only path: the instance is a hot standby and a query is running on it
        subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
                                                  xmax);

        if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
            suboverflowed = true;
    }

    // Copy the replication slot xmin and catalog xmin into local variables.
    // The slot xmin exists to keep tuples from being removed too early.
    // The source comments explain that copying to locals avoids holding ProcArrayLock for long.
    replication_slot_xmin = procArray->replication_slot_xmin;
    replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;

    // Collecting transaction info from the backends is done. What follows is
    // a series of checks that finish the job.
    if (!TransactionIdIsValid(MyPgXact->xmin))
        MyPgXact->xmin = TransactionXmin = xmin;

    LWLockRelease(ProcArrayLock);   // release ProcArrayLock

    if (TransactionIdPrecedes(xmin, globalxmin))
        globalxmin = xmin;      // globalxmin takes the smaller of itself and this backend's xmin

    RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
    if (!TransactionIdIsNormal(RecentGlobalXmin))
        RecentGlobalXmin = FirstNormalTransactionId;    // special case: if RecentGlobalXmin <= 2, use 3

    /* Check whether there's a replication slot requiring an older xmin. */
    if (TransactionIdIsValid(replication_slot_xmin) &&
        NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
        RecentGlobalXmin = replication_slot_xmin;

    /* Non-catalog tables can be vacuumed if older than this xid */
    RecentGlobalDataXmin = RecentGlobalXmin;

    // Compare once more, this time against the catalog xmin
    if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
        NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
        RecentGlobalXmin = replication_slot_catalog_xmin;

    RecentXmin = xmin;

    // Fill in the snapshot struct that will be returned
    snapshot->xmin = xmin;
    snapshot->xmax = xmax;
    snapshot->xcnt = count;
    snapshot->subxcnt = subcount;
    snapshot->suboverflowed = suboverflowed;

    snapshot->curcid = GetCurrentCommandId(false);

    // This is a new snapshot, so initialize its reference counts
    snapshot->active_count = 0;
    snapshot->regd_count = 0;
    snapshot->copied = false;

    // The "snapshot too old" handling, somewhat surprisingly, lives here
    if (old_snapshot_threshold < 0)
    {
        /*
         * If not using "snapshot too old" feature, fill related fields with
         * dummy values that don't require any locking.
         */
        // With old_snapshot_threshold < 0 the feature is off and "snapshot too old" cannot occur.
        // Only constants are assigned, so no lock is taken.
        snapshot->lsn = InvalidXLogRecPtr;
        snapshot->whenTaken = 0;
    }
    else
    {
        // With old_snapshot_threshold >= 0 the old-snapshot bookkeeping has to run
        snapshot->lsn = GetXLogInsertRecPtr();  // get the LSN
        snapshot->whenTaken = GetSnapshotCurrentTimestamp();    // get the snapshot time
        MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
        // GetXLogInsertRecPtr(), GetSnapshotCurrentTimestamp() and
        // MaintainOldSnapshotTimeMapping() each use SpinLockAcquire and SpinLockRelease;
        // MaintainOldSnapshotTimeMapping() also uses LWLockAcquire and LWLockRelease.
        // Every snapshot goes through this function, so it is called very frequently.
        // In PG13, therefore, a negative old_snapshot_threshold saves a lot of spinlock and LWLock traffic.
    }

    return snapshot;
}
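
Stripped of locking, subtransactions, recovery and the global horizon bookkeeping, the core of the loop above can be sketched in a few lines. The names toy_snap and toy_build_snapshot, the TOY_INVALID_XID marker and the fixed capacity are hypothetical. The rule for skipping XIDs at or above xmax and lowering xmin sits in code that the listing above elides, so it is stated here as an assumption of the sketch, and plain integer comparison again ignores XID wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;          /* hypothetical stand-in for TransactionId */
#define TOY_INVALID_XID 0u       /* "no XID assigned" marker in this sketch */
#define TOY_MAX_XIP     16       /* arbitrary capacity for the sketch */

/* Hypothetical result type: the three fields a reader of a snapshot needs. */
typedef struct {
    xid_t  xmin;
    xid_t  xmax;
    xid_t  xip[TOY_MAX_XIP];
    size_t xcnt;
} toy_snap;

/*
 * backend_xids[i] is the XID currently assigned to backend i, or
 * TOY_INVALID_XID if that backend has none.
 */
void toy_build_snapshot(const xid_t *backend_xids, size_t nbackends,
                        xid_t latest_completed, toy_snap *out)
{
    assert(out != NULL);

    xid_t xmax = latest_completed + 1;   /* one past the latest completed XID */
    xid_t xmin = xmax;                   /* stays at xmax if nothing is running */

    out->xcnt = 0;
    for (size_t i = 0; i < nbackends; i++)
    {
        xid_t xid = backend_xids[i];

        if (xid == TOY_INVALID_XID)
            continue;                    /* backend has no XID assigned */
        if (xid >= xmax)
            continue;                    /* already treated as running via xmax */
        if (xid < xmin)
            xmin = xid;                  /* track the lowest running XID */
        if (out->xcnt < TOY_MAX_XIP)
            out->xip[out->xcnt++] = xid; /* record it as in progress */
    }

    out->xmin = xmin;
    out->xmax = xmax;
}
```

With backend XIDs {100, none, 102, 105} and a latest completed XID of 103, the sketch yields xmin = 100, xmax = 104 and xip = {100, 102}, which is the 100:104:100,102 snapshot shown earlier.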

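Transaction optimizations in PG14

Source analysis of the PG14 optimizations

The PG13 source shows that when old_snapshot_threshold >= 0, every call to GetSnapshotData() takes several extra spinlocks and an LWLock. Taking a snapshot is one of the most frequent operations in the database, so this is bound to cost performance. In PG14 this block no longer sits inline in GetSnapshotData(). As far as I can tell it was moved into a helper, GetSnapshotDataInitOldSnapshot(), rather than deleted, and the old_snapshot_threshold feature itself was only removed in a later major release (PG17).

Beyond that change to GetSnapshotData(), PG14 made many other optimizations: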

RecentGlobalXmin and RecentGlobalDataXmin were removed, and the GlobalVisTest* family of functions was added.

A new concept of boundaries was introduced, with two boundaries: definitely_needed and maybe_needed.

struct GlobalVisState
{
    /* XIDs >= are considered running by some backend */
    // Rows deleted by an XID >= definitely_needed are definitely still needed and cannot be removed
    FullTransactionId definitely_needed;

    /* XIDs < are not considered to be running by any backend */
    // Rows deleted by an XID < maybe_needed can definitely be removed
    FullTransactionId maybe_needed;
};

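ComputeXidHorizons() was added to compute the horizons accurately (the xmin and removable-XID information). It still has to scan the PGPROC array, so it is only needed for XIDs that fall between the two boundaries, that is, XID >= maybe_needed && XID < definitely_needed.

GlobalVisTestShouldUpdate() was added to decide whether the boundaries need to be recomputed.

First, note the variable ComputeXidHorizonsResultLastXmin: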

static TransactionId ComputeXidHorizonsResultLastXmin;  // xmin at the time of the last accurate computation

static bool
GlobalVisTestShouldUpdate(GlobalVisState *state)
{
    // An invalid value (0) means the horizons have never been computed, so they must be computed now
    if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin))
        return true;

    /*
     * If the maybe_needed/definitely_needed boundaries are the same, it's
     * unlikely to be beneficial to refresh boundaries.
     */
    // No recomputation when the boundaries coincide. The test is not plain equality
    // but maybe_needed >= definitely_needed: in both cases there is no uncertain
    // range left between the boundaries for a recomputation to narrow.
    if (FullTransactionIdFollowsOrEquals(state->maybe_needed,
                                         state->definitely_needed))
        return false;

    /* does the last snapshot built have a different xmin? */
    // If the xmin of the latest snapshot equals the xmin of the last accurate computation, do not recompute
    return RecentXmin != ComputeXidHorizonsResultLastXmin;
}

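maybe_needed and definitely_needed play a role similar to a snapshot's xmin and xmax, with one more layer of computation nested on top: the cheap boundaries are consulted first, and the accurate horizons are computed only when the boundaries cannot decide. GlobalVisTestShouldUpdate() cuts down the cases in which the boundaries are refreshed, and the accurate computation in ComputeXidHorizons() is itself more efficient.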

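Results

A recommended article on the PostgreSQL snapshot optimization:

https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462

The before/after comparison in that article shows a marked improvement.

In production on PG13, GetSnapshotData also regularly shows up as an expensive function. I have no screenshot of my own, and the figure borrowed from the article above is not reproduced here.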

References

Books:
《postgresql指南内幕探索》
《postgresql实战》
《postgresql技术内幕事务处理深度探索》
《postgresql数据库内核分析》
https://edu.postgrespro.com/postgresql_internals-14_parts1-2_en.pdf
Official documentation and wikis:
https://en.wikipedia.org/wiki/Concurrency_control
https://wiki.postgresql.org/wiki/Hint_Bits
https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
https://www.postgresql.org/docs/10/storage-page-layout.html
https://www.postgresql.org/docs/13/pageinspect.html
Must-read articles on PostgreSQL transactions (interdb):
https://www.interdb.jp/pg/pgsql05.html
https://www.interdb.jp/pg/pgsql06.html
Source code walkthroughs:
https://blog.csdn.net/Hehuyi_In/article/details/102920988
https://blog.csdn.net/Hehuyi_In/article/details/127955762
https://blog.csdn.net/Hehuyi_In/article/details/125023923
Performance comparison of the PostgreSQL snapshot optimization:
https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462
Other material:
https://brandur.org/postgres-atomicity
https://mp.weixin.qq.com/s/j-8uRuZDRf4mHIQR_ZKIEg