GlusterFS in Depth: A Comprehensive Guide from Architecture Principles to Case Studies (Part 2)

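  • 7. Basic Gluster commands
  • 8. Client mount access
  • 9. Routine inspection
  • 10. In-depth optimization
  • 11. Common faults and troubleshooting
  • 12. Classic GlusterFS cases
  • 13. Comparison chart of GlusterFS volume disaster-tolerance capabilities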


7. Basic Gluster Commands

  1. Manage the glusterd service
# systemctl start glusterd
# systemctl stop glusterd
# systemctl enable --now glusterd
# systemctl disable --now glusterd
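  2. Manage the TSP (Trusted Storage Pool)
Add a server: to add a server to the TSP, probe it from a server that is already in the pool:
# gluster peer probe <server>
List the servers in the pool:
# gluster pool list
View the peer status:
# gluster peer status
Remove a server:
# gluster peer detach <server>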
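For routine checks, the peer status output can be parsed by a script. The sketch below assumes the usual `gluster peer status` layout, in which each peer block has a `Hostname:` line followed by a `State:` line such as `Peer in Cluster (Connected)`; the function name `check_peers` is hypothetical. It reads the output on stdin, prints every peer whose state is anything other than `Peer in Cluster (Connected)` (so a rejected peer is reported even if it is still connected), and returns non-zero if any are found.

```bash
# check_peers: hypothetical helper. Reads `gluster peer status` output on
# stdin and reports peers that are not "Peer in Cluster (Connected)".
check_peers() {
    awk '
        /^Hostname:/ { host = $2 }                          # remember current peer
        /^State:/ {
            if ($0 !~ /Peer in Cluster \(Connected\)/) {    # disconnected, rejected, ...
                print host ": " substr($0, 8)              # text after "State: "
                bad = 1
            }
        }
        END { exit bad }                                    # non-zero if any peer is unhealthy
    '
}

# Usage: gluster peer status | check_peers || echo "peer problem detected"
```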
  3. Manage volumes




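Changing the transport type of a volume:
  1) Unmount the volume on all clients:
# umount mount-point
  2) Stop the volume:
# gluster volume stop <VOLNAME>
  3) Change the transport type. For example, to enable both tcp and rdma:
# gluster volume set test-volume config.transport tcp,rdma
(use config.transport tcp or config.transport rdma to select a single transport)
  4) Mount the volume on all clients. For example, to mount using the rdma transport:
# mount -t glusterfs -o transport=rdma server1:/test-volume /mnt/glusterfs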

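Expanding a volume:
You can expand a volume as needed while the cluster is online and available. For example, you might add bricks to a distributed volume to increase its distribution and the capacity of the GlusterFS volume. Similarly, you might add a group of bricks to a distributed replicated volume to increase its capacity.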

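 Note: when expanding a distributed replicated or distributed dispersed volume, the number of bricks you add must be a multiple of the replica or disperse count. For example, to expand a distributed replicated volume with a replica count of 2, add bricks in multiples of 2 (such as 4, 6, 8).
1) If the server is not yet part of the TSP, probe the server that holds the brick to be added:
# gluster peer probe server4
2) Add the brick:
# gluster volume add-brick test-volume server4:/exp4
3) Check the volume information:
# gluster volume info test-volume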
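The rule in the note above can be checked before running add-brick. A minimal sketch: the function name `check_brick_count` is hypothetical, and the counts are passed in by hand rather than read from the cluster.

```bash
# check_brick_count NEW_BRICKS SET_SIZE  (hypothetical helper)
# Succeeds if NEW_BRICKS is a positive multiple of SET_SIZE, i.e. the
# replica or disperse count of a distributed replicated/dispersed volume.
check_brick_count() {
    new=$1
    set_size=$2
    if [ "$new" -le 0 ] || [ "$set_size" -le 0 ]; then
        echo "brick counts must be positive" >&2
        return 2
    fi
    if [ $((new % set_size)) -ne 0 ]; then
        echo "adding $new bricks is not a multiple of $set_size" >&2
        return 1
    fi
    return 0
}

# Usage: check_brick_count 4 2 && gluster volume add-brick <VOLNAME> <bricks...>
```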


Shrinking a volume (removing bricks):
1) Start removing the brick:
# gluster volume remove-brick test-volume server2:/exp2 start
volume remove-brick start: success
2) Check the status of the remove-brick operation:
# gluster volume remove-brick <VOLNAME> <BRICKNAME> status
For example:
# gluster volume remove-brick test-volume server2:/exp2 status
Node                                  Rebalanced-files  size  scanned  status
---------                             ----------------  ----  -------  -----------
617c923e-6450-4065-8e33-865e28d9428f  34                340   162      in progress
3) Once the status shows "completed", commit the remove-brick operation:
# gluster volume remove-brick test-volume server2:/exp2 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
Check the removed bricks to ensure all files are migrated.
If files with data are found on the brick path, copy them via a gluster mount point before re-purposing the removed brick.
4) Check the volume information:
# gluster volume info
Volume Name: test-volume
Type: Distribute
Status: Started
Number of Bricks: 3
Bricks:
Brick1: server1:/exp1
Brick3: server3:/exp3
Brick4: server4:/exp4
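Step 3 should only run after the status reaches "completed". A polling loop such as the following sketch can gate the commit. It assumes the status table contains the state text (`in progress`, `completed` or `failed`) for each node; the function name `wait_remove_brick` and the 10-second default interval are assumptions. It only waits and does not run the commit itself.

```bash
# wait_remove_brick VOLNAME BRICK [INTERVAL]  (hypothetical helper)
# Polls `gluster volume remove-brick VOLNAME BRICK status` until no node is
# "in progress". Returns 0 when all nodes completed, 1 on failure.
wait_remove_brick() {
    vol=$1
    brick=$2
    interval=${3:-10}
    while :; do
        out=$(gluster volume remove-brick "$vol" "$brick" status 2>&1) || return 1
        case $out in
            *failed*)        echo "remove-brick reported a failure" >&2; return 1 ;;
            *"in progress"*) sleep "$interval" ;;
            *completed*)     return 0 ;;
            *)               echo "unexpected status output" >&2; return 1 ;;
        esac
    done
}

# Usage: wait_remove_brick test-volume server2:/exp2 &&
#        gluster volume remove-brick test-volume server2:/exp2 commit
```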


   Stopping a volume:
Stop a volume with the following command:
# gluster volume stop <VOLNAME>
For example, to stop test-volume:
# gluster volume stop test-volume
stopping volume will make its data inaccessible. Do you want to continue? (y/n)
Enter y to confirm. The command then prints:
stopping volume test-volume has been successful


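Deleting a volume:
A volume must be stopped before it can be deleted. Delete it with the following command:
# gluster volume delete <VOLNAME>
For example, to delete test-volume:
# gluster volume delete test-volume
Deleting volume will erase all information about the volume. Do you want to continue? (y/n)
Enter y to confirm. The command then prints:
Deleting volume test-volume has been successful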

8. Client Mount Access

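There are several ways to access a Gluster volume. On GNU/Linux clients you can use the Gluster Native Client, which provides high concurrency, high performance and transparent failover. You can also access Gluster volumes over NFS v3.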

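  1. Gluster Native Client
    The Gluster Native Client is a FUSE-based client that runs in user space. It is the recommended way to access volumes when high concurrency and high write performance are required.
1) Install the Gluster Native Client
Before installing the Gluster Native Client, verify that the FUSE module is loaded on the client and that the required modules are accessible:
Add the FUSE loadable kernel module (LKM) to the Linux kernel:
# modprobe fuse
Verify that the FUSE module is loaded:
# dmesg | grep -i fuse
fuse init (API version 7.13)
Install the required dependency packages on the client:
# yum -y install openssh-server wget fuse fuse-libs libibverbs
Stop the firewall:
# systemctl stop firewalld
Install the client on CentOS 7:
# yum install centos-release-gluster && yum install glusterfs glusterfs-cli glusterfs-libs glusterfs-fuse -y
2) Mount a volume manually
To mount a volume, run:
# mount -t glusterfs HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR
For example:
# mount -t glusterfs server1:/test-volume /mnt/glusterfs
Mount options
The following options can be specified with mount -t glusterfs. Note that all options must be separated by commas.
backupvolfile-server=server-name
volfile-max-fetch-attempts=number of attempts
log-level=loglevel
log-file=logfile
transport=transport-type
direct-io-mode=[enable|disable]
use-readdirp=[yes|no]
For example:
# mount -t glusterfs -o backup-volfile-servers=server2:server3:server4,use-readdirp=no,volfile-max-fetch-attempts=2,log-level=WARNING,log-file=/var/log/gluster.log server1:/test-volume /mnt/glusterfs
Because options are separated by commas, several backup servers cannot be written as backupvolfile-server=server2,server3,server4; give a single server to backupvolfile-server, or list several with backup-volfile-servers separated by colons as above.
If the backupvolfile-server option is given when mounting the FUSE client, then when server1 fails, the server specified in this option is used as the volfile server to mount the client.
The volfile-max-fetch-attempts=X option specifies how many times to try fetching the volume file while mounting the volume. It is useful when you mount a server that has multiple IP addresses or when round-robin DNS is configured for the server name.
If use-readdirp is set to ON, readdirp mode is forced in the FUSE kernel module.
3) Mount volumes automatically
You can configure the system to mount the Gluster volume automatically every time it boots. Edit /etc/fstab and add the following line:
HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR glusterfs defaults,_netdev 0 0
For example:
server1:/test-volume /mnt/glusterfs glusterfs defaults,_netdev 0 0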
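To avoid typos in /etc/fstab, the entry can be generated. The sketch below prints the native-client line in the format shown above and optionally appends backup volfile servers; it assumes the colon-separated `backup-volfile-servers` mount option of mount.glusterfs, and the function name `gluster_fstab_line` is hypothetical.

```bash
# gluster_fstab_line SERVER VOLNAME MOUNTDIR [BACKUP_SERVER...]  (hypothetical)
# Prints an /etc/fstab entry for the Gluster native (FUSE) client.
# _netdev delays the mount until the network is up.
gluster_fstab_line() {
    server=$1 vol=$2 dir=$3
    shift 3
    opts="defaults,_netdev"
    if [ $# -gt 0 ]; then
        # join the remaining arguments with ':' and drop the trailing ':'
        backups=$(printf '%s:' "$@")
        opts="$opts,backup-volfile-servers=${backups%:}"
    fi
    printf '%s:/%s %s glusterfs %s 0 0\n' "$server" "$vol" "$dir" "$opts"
}

# Usage: gluster_fstab_line server1 test-volume /mnt/glusterfs server2 >> /etc/fstab
```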
  2. Mount volumes with NFS
    Prerequisite: install the NFS client utilities on the servers and clients. On CentOS/RHEL-based distributions the package is nfs-utils (nfs-common is the Debian/Ubuntu name):
    $ sudo yum install nfs-utils -y
  1) Mount a volume manually with NFS
To mount a volume, use:
# mount -t nfs -o vers=3 HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR
For example:
# mount -t nfs -o vers=3 server1:/test-volume /mnt/glusterfs
Note: the Gluster NFS server does not support UDP. If your NFS client uses UDP by default, the following message appears:
requested NFS version or transport protocol is not supported.
  2) Connect over TCP
Add the following option to the mount command:
-o mountproto=tcp
For example:
# mount -o mountproto=tcp -t nfs server1:/test-volume /mnt/glusterfs
  3) Mount volumes automatically with NFS
You can configure the system to mount the Gluster volume over NFS automatically every time it boots. Edit /etc/fstab and add the following line:
HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR nfs defaults,_netdev,vers=3 0 0
For example:
server1:/test-volume /mnt/glusterfs nfs defaults,_netdev,vers=3 0 0
  4) Connect over TCP
Add the following entry to /etc/fstab:
HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR nfs defaults,_netdev,mountproto=tcp 0 0
For example:
server1:/test-volume /mnt/glusterfs nfs defaults,_netdev,mountproto=tcp 0 0



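9. Routine Inspection

You can use the volume top and volume profile commands to view performance and identify bottlenecks on each brick of a volume. They provide key performance information about the system.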

  1. Enable profiling
# gluster volume profile test-volume start
  2. Display I/O information
# gluster volume profile test-volume info
Brick: Test:/export/2
Cumulative Stats:

Block Size:       1b+      32b+      64b+
Read:               0         0         0
Write:            908        28         8

Block Size:     128b+     256b+     512b+
Read:               0         6         4
Write:              5        23        16

Block Size:    1024b+    2048b+    4096b+
Read:               0        52        17
Write:             15       120       846

Block Size:    8192b+   16384b+   32768b+
Read:              52         8        34
Write:            234       134       286

Block Size:             65536b+  131072b+
Read:                       118       622
Write:                     1341       594

%-latency  Avg-latency  Min-Latency  Max-Latency  calls  Fop
___________________________________________________________
 4.82         1132.28        21.00    800970.00   4575  WRITE
 5.70          156.47         9.00    665085.00  39163  READDIRP
11.35          315.02         9.00   1433947.00  38698  LOOKUP
11.88         1729.34        21.00   2569638.00   7382  FXATTROP
47.35       104235.02      2485.00   7789367.00    488  FSYNC
------------------------------------
Duration     : 335
BytesRead    : 94505058
BytesWritten : 195571980
Stop profiling:
# gluster volume profile <VOLNAME> stop
For example, to stop profiling on test-volume:
# gluster volume profile test-volume stop
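When reading profile output, the FOP with the highest %-latency is usually the first place to look (FSYNC in the example above). A small awk filter can pick it out. This sketch assumes latency rows start with the %-latency number and end with the upper-case FOP name, as in the table above; `top_fop` is a hypothetical name.

```bash
# top_fop: hypothetical helper. Reads `gluster volume profile VOL info`
# output on stdin and prints the FOP with the largest %-latency.
top_fop() {
    awk '
        # latency rows: first field numeric, last field an upper-case FOP name
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $NF ~ /^[A-Z]+$/ {
            if (fop == "" || $1 + 0 > max) { max = $1 + 0; fop = $NF }
        }
        END { if (fop != "") print fop, max }
    '
}

# Usage: gluster volume profile test-volume info | top_fop
```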
  3. View the open fd count and maximum fd count
# gluster volume top VOLUME-NAME open [brick BRICK] [list-cnt {0..100}]

For example, to view the current and maximum open fd counts on brick server1:/export/dir1 of test-volume and list the top 10 open calls:

# gluster volume top test-volume open brick server1:/export/dir1 list-cnt 10
Brick: server:/export/dir1
Current open fd's: 34 Max open fd's: 209
==========Open file stats========
open call count   file name
2                 /clients/client0/~dmtmp/PARADOX/COURSES.DB
11                /clients/client0/~dmtmp/PARADOX/ENROLL.DB
11                /clients/client0/~dmtmp/PARADOX/STUDENTS.DB
10                /clients/client0/~dmtmp/PWRPNT/TIPS.PPT
10                /clients/client0/~dmtmp/PWRPNT/PCBENCHM.PPT
9                 /clients/client7/~dmtmp/PARADOX/STUDENTS.DB
9                 /clients/client1/~dmtmp/PARADOX/STUDENTS.DB
9                 /clients/client2/~dmtmp/PARADOX/STUDENTS.DB
9                 /clients/client0/~dmtmp/PARADOX/STUDENTS.DB
9                 /clients/client8/~dmtmp/PARADOX/STUDENTS.DB
  4. View the read performance of each brick

# gluster volume top <VOLNAME> {open|read|write|opendir|readdir|clear} [nfs|brick <brick>] [list-cnt <value>] | {read-perf|write-perf} [bs <size> count <count>] [brick <brick>] [list-cnt <value>]
For example, to view the read performance of a brick of test-volume with a block size of 256 bytes, a count of 1 and a list count of 10:
# gluster volume top test-volume read-perf bs 256 count 1 brick master:/bricks/brick1/gv0 list-cnt 10
Brick: master:/bricks/brick1/gv0
Throughput 64.00 MBps time 0.0000 secs
MBps  Filename     Time
====  ========     ====
0     /hello.8669  2022-06-01 02:43:24 +0000.1593
0     /hello.5004  2022-06-01 02:43:15 +0000.1479
0     /hello.1326  2022-06-01 02:43:06 +0000.4425
0     /hello.1325  2022-06-01 02:43:06 +0000.3147
0     /hello.1324  2022-06-01 02:43:06 +0000.1737
0     /hello.1323  2022-06-01 02:43:06 +0000.98
0     /hello.1322  2022-06-01 02:43:05 +0000.998853
0     /hello.1321  2022-06-01 02:43:05 +0000.997531
0     /hello.1320  2022-06-01 02:43:05 +0000.996195
0     /hello.132   2022-06-01 02:43:05 +0000.994810
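  5. View the write performance of each brick

    This command runs dd with the specified count and block size and measures the resulting throughput. To view the write performance of each brick:

# gluster volume top <VOLNAME> {open|read|write|opendir|readdir|clear} [nfs|brick <brick>] [list-cnt <value>] | {read-perf|write-perf} [bs <size> count <count>] [brick <brick>] [list-cnt <value>]
For example, to view the write performance of brick server:/export/dir1 of test-volume with a block size of 256 bytes, a count of 1 and a list count of 10:
# gluster volume top test-volume write-perf bs 256 count 1 brick server:/export/dir1 list-cnt 10
Brick: server:/export/dir1
256 bytes (256 B) copied, Throughput: 2.8 MB/s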
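  6. Display volume information

Display information about a specific volume:
# gluster volume info test-volume

Display information about all volumes:
# gluster volume info all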
  7. Display volume status

# gluster volume status [all|<VOLNAME> [<BRICKNAME>]] [detail|clients|mem|inode|fd|callpool]
1) Display the status of a specific volume. For example, for test-volume:
# gluster volume status test-volume
STATUS OF VOLUME: test-volume
BRICK           PORT   ONLINE  PID
----------------------------------------
arch:/export/1  24009  Y       22445
----------------------------------------
arch:/export/2  24010  Y       22450


2) Display the status of all volumes:
# gluster volume status all
STATUS OF VOLUME: volume-test
BRICK           PORT   ONLINE  PID
----------------------------------------
arch:/export/4  24010  Y       22455
STATUS OF VOLUME: test-volume
BRICK           PORT   ONLINE  PID
----------------------------------------
arch:/export/1  24009  Y       22445
----------------------------------------
arch:/export/2  24010  Y       22450


3) Display additional information about the bricks:
# gluster volume status test-volume detail
STATUS OF VOLUME: test-volume
-------------------------------------------
Brick                : arch:/export/1
Port                 : 24009
Online               : Y
Pid                  : 16977
File System          : rootfs
Device               : rootfs
Mount Options        : rw
Disk Space Free      : 13.8GB
Total Disk Space     : 46.5GB
Inode Size           : N/A
Inode Count          : N/A
Free Inodes          : N/A
Number of Bricks: 1
Bricks:
Brick: server:/brick6


4) Display the list of clients accessing the volume:
# gluster volume status <VOLNAME> clients
For example, to display the clients connected to test-volume:
# gluster volume status test-volume clients
Brick : arch:/export/1
Clients connected : 2
Hostname          Bytes Read   BytesWritten
--------          ---------    ------------
                  776          676
127.0.0.1:1012    50440        51200


5) Display the memory usage and memory pool details of the bricks:
# gluster volume status <VOLNAME> mem
For example, for the bricks of test-volume:
# gluster volume status test-volume mem
Memory status for volume : test-volume
----------------------------------------------
Brick : arch:/export/1
Mallinfo
--------
Arena    : 434176
Ordblks  : 2
Smblks   : 0
Hblks    : 12
Hblkhd   : 40861696
Usmblks  : 0
Fsmblks  : 0
Uordblks : 332416
Fordblks : 101760
Keepcost : 100400
Mempool Stats
-------------
Name                                 HotCount ColdCount PaddedSizeof AllocCount MaxAlloc
----                                 -------- --------- ------------ ---------- --------
test-volume-server:fd_t                     0     16384           92         57        5
test-volume-server:dentry_t                59       965           84         59       59
test-volume-server:inode_t                 60       964          148         60       60
test-volume-server:rpcsvc_request_t         0       525         6372        351        2
glusterfs:struct saved_frame                0      4096          124          2        2
glusterfs:struct rpc_req                    0      4096         2236          2        2
glusterfs:rpcsvc_request_t                  1       524         6372          2        1
glusterfs:call_stub_t                       0      1024         1220        288        1
glusterfs:call_stack_t                      0      8192         2084        290        2
glusterfs:call_frame_t                      0     16384          172       1728

6) Display the inode tables of the volume:

# gluster volume status <VOLNAME> inode
For example, to display the inode tables of test-volume:
# gluster volume status test-volume inode
inode tables for volume test-volume
----------------------------------------------
Brick : arch:/export/1
Active inodes:
GFID                                  Lookups  Ref  IA type
----                                  -------  ---  -------
6f3fe173-e07a-4209-abb6-484091d75499  1        9    2
370d35d7-657e-44dc-bac4-d6dd800ec3d3  1        1    2
LRU inodes:
GFID                                  Lookups  Ref  IA type
----                                  -------  ---  -------
80f98abe-cdcf-4c1d-b917-ae564cf55763  1        0    1
3a58973d-d549-4ea6-9977-9aa218f233de  1        0    1
2ce0197d-87a9-451b-9094-9baa38121155  1        0    2

7) Display the open fd tables of the volume:

# gluster volume status <VOLNAME> fd
For example, to display the open fd tables of test-volume:
# gluster volume status test-volume fd
FD tables for volume test-volume
----------------------------------------------
Brick : arch:/export/1
Connection 1:
RefCount = 0  MaxFDs = 128  FirstFree = 4
FD Entry  PID    RefCount  Flags
--------  ---    --------  -----
0         26311  1         2
1         26310  3         2
2         26310  1         2
3         26311  3         2
Connection 2:
RefCount = 0  MaxFDs = 128  FirstFree = 0
No open fds
Connection 3:
RefCount = 0  MaxFDs = 128  FirstFree = 0
No open fds
8) Display the pending calls of the volume:
# gluster volume status <VOLNAME> callpool
Each call has a call stack that contains call frames. For example, to display the pending calls of test-volume:
# gluster volume status test-volume callpool
Pending calls for volume test-volume
Brick : arch:/export/1
Pending calls: 2
Call Stack1
UID    : 0
GID    : 0
PID    : 26338
Unique : 192138
Frames : 7
Frame 1
Ref Count   = 1
Translator  = test-volume-server
Completed   = No
Frame 2
Ref Count   = 0
Translator  = test-volume-posix
Completed   = No
Parent      = test-volume-access-control
Wind From   = default_fsync
Wind To     = FIRST_CHILD(this)->fops->fsync
Frame 3
Ref Count   = 1
Translator  = test-volume-access-control
Completed   = No
Parent      = repl-locks
Wind From   = default_fsync
Wind To     = FIRST_CHILD(this)->fops->fsync
Frame 4
Ref Count   = 1
Translator  = test-volume-locks
Completed   = No
Parent      = test-volume-io-threads
Wind From   = iot_fsync_wrapper
Wind To     = FIRST_CHILD (this)->fops->fsync
Frame 5
Ref Count   = 1
Translator  = test-volume-io-threads
Completed   = No
Parent      = test-volume-marker
Wind From   = default_fsync
Wind To     = FIRST_CHILD(this)->fops->fsync
Frame 6
Ref Count   = 1
Translator  = test-volume-marker
Completed   = No
Parent      = /export/1
Wind From   = io_stats_fsync
Wind To     = FIRST_CHILD(this)->fops->fsync
Frame 7
Ref Count   = 1
Translator  = /export/1
Completed   = No
Parent      = test-volume-server
Wind From   = server_fsync_resume
Wind To     = bound_xl->fops->fsync
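For daily inspection, the output of gluster volume status can be scanned for bricks whose ONLINE column is N. This sketch assumes the four-column `BRICK PORT ONLINE PID` layout used in the examples of this section (newer releases print a different layout, in which case the column test must be adjusted); `offline_bricks` is a hypothetical name.

```bash
# offline_bricks: hypothetical helper. Reads `gluster volume status VOL`
# output on stdin and prints each brick whose ONLINE column is N.
offline_bricks() {
    awk '
        # brick rows look like "host:/path  PORT  Y|N  PID"
        $1 ~ /^[^ ]+:\/.+/ && NF == 4 {
            if ($3 == "N") { print $1; down = 1 }
        }
        END { exit down }                 # non-zero if any brick is offline
    '
}

# Usage: gluster volume status test-volume | offline_bricks || echo "brick offline"
```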


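10. In-depth Optimization

Enable metadata caching:
# gluster volume set <volname> group metadata-cache
This group command enables caching of stat and xattr information for files and directories. The cache is refreshed every 10 minutes, and cache invalidation is enabled to keep the cache consistent.
A. To increase the number of files that can be cached, run:
# gluster volume set <volname> network.inode-lru-limit <n>
where n is set to 50000. It can be increased if the number of active files in the volume is very large; increasing it raises the memory footprint of the brick processes.
B. To enable Samba-specific metadata caching, run:
# gluster volume set <volname> cache-samba-metadata on
C. By default, gluster caches some xattrs, such as capability xattrs, ima xattrs and ACLs. If the application stores any other xattrs on Gluster storage, add them to the metadata cache list with:

# gluster volume set <volname> xattr-cache-list "comma separated xattr list"
For example:
# gluster volume set <volname> xattr-cache-list "*,user.swift.metadata"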



### Directory listing performance: enable parallel-readdir
# gluster volume set <VOLNAME> performance.readdir-ahead on
# gluster volume set <VOLNAME> performance.parallel-readdir on
### File/directory creation performance: enable nl-cache
# gluster volume set <volname> group nl-cache
# gluster volume set <volname> nl-cache-positive-entry on
The commands above also enable cache invalidation and increase the timeout to 10 minutes.


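For use cases that mainly read small files, enable the following options:
# gluster volume set <volname> performance.cache-invalidation on
# gluster volume set <volname> features.cache-invalidation on
# gluster volume set <volname> performance.qr-cache-timeout 600
# gluster volume set <volname> cache-invalidation-timeout 600
(600 seconds, i.e. 10 minutes, is the recommended setting for both timeouts.)
These settings let the client cache hold the contents of small files, and enabling cache invalidation keeps that cache consistent. The total cache size can be set with:
# gluster volume set <volname> cache-size <size>
By default, files of up to 64KB are cached. To change this limit:
# gluster volume set <volname> performance.cache-max-file-size <size>
Note that the size parameter uses SI unit suffixes, for example 64KB or 2MB.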
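Because these options are normally applied together, they can be wrapped in a small script. The sketch below applies the small-file read settings listed above to one volume; the `DRY_RUN` variable and the function name `tune_small_file_reads` are assumptions added for illustration. With `DRY_RUN=1` the commands are printed instead of executed, so they can be reviewed first.

```bash
# tune_small_file_reads VOLNAME  (hypothetical helper)
# Applies the small-file read options above. With DRY_RUN=1 the gluster
# commands are only printed, not run.
tune_small_file_reads() {
    vol=$1
    for kv in \
        "performance.cache-invalidation on" \
        "features.cache-invalidation on" \
        "performance.qr-cache-timeout 600" \
        "cache-invalidation-timeout 600"
    do
        set -- $kv                        # split "key value" into $1 and $2
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "gluster volume set $vol $1 $2"
        else
            gluster volume set "$vol" "$1" "$2" || return 1
        fi
    done
}

# Usage: DRY_RUN=1 tune_small_file_reads tank   # review, then run without DRY_RUN
```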

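Write Behind Translator
gluster volume set tank write-behind on
  Writes are usually slower than reads. Using an "aggregated background write" technique, the write-behind translator improves write performance considerably: many small writes are combined into a few larger writes, which are then performed in the background (non-blocking). On the client side, write-behind aggregates writes and so reduces the number of network packets that must be sent. On the server side, it helps the server reduce disk seek time for writes.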


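Read Ahead Translator

gluster volume set tank read-ahead on

Based on preset values, read-ahead prefetches a number of blocks sequentially. While the application is busy processing some data, GlusterFS can already read the next batch of data waiting to be processed, which makes reads smoother and faster. It also acts as a read aggregator: many small, scattered reads are combined into a few larger reads, which reduces network and disk load. page-size sets the block size, and page-count sets the total number of blocks to read ahead.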

IO Cache Translator
gluster volume set tank io-cache on

Quick Read Translator
gluster volume set tank quick-read on


Operations on a file system over the network are expensive, so quick-read uses GlusterFS's internal get interface to perform several POSIX system calls (open/read/close) at once: a single get call covers one open call, multiple read calls and one close call.
Open Behind Translator
gluster volume set tank open-behind on
Perform open in the backend only when a necessary FOP arrives (e.g. writev on the FD, unlink of the file). When the option is disabled, perform the backend open right after unwinding open().

IO Threads
gluster volume set tank io-thread-count 16


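11. Common Faults and Troubleshooting

1. Error: “Another transaction is in progress for volname” or “Locking failed on”

Possible causes:
1) Multiple transactions are contending for the same lock.
2) A stale lock exists on one of the nodes. To locate and clear it:
a. Check the glusterd.log file to find out which node holds the stale lock. Look for the message: lock being held by <uuid>
b. Run gluster peer status to identify the node whose UUID appears in the log message.
c. Restart glusterd on that node.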
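Steps a and b can be combined: extract the UUID from the most recent "lock being held by" message and look it up in the peer list. This is a sketch under assumptions: the UUID is taken as the last UUID-shaped token on that log line, `gluster peer status` is assumed to print `Hostname:` and `Uuid:` lines for each peer, GNU `grep -o` is assumed, and `lock_holder` is a hypothetical name. The local node does not appear in its own peer list, so an unmatched UUID may mean the lock is held locally.

```bash
# lock_holder LOGFILE  (hypothetical helper)
# Prints the peer hostname named in the last "lock being held by <uuid>"
# message of LOGFILE (for example glusterd.log).
lock_holder() {
    uuid=$(grep 'lock being held by' "$1" | tail -n 1 |
        grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
        tail -n 1)
    [ -n "$uuid" ] || return 1
    # map the UUID to a hostname using the peer list
    gluster peer status | awk -v u="$uuid" '
        /^Hostname:/ { host = $2 }
        /^Uuid:/ && $2 == u { print host; found = 1 }
        END { if (!found) print u " (not a remote peer; check the local node)" }
    '
}
```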

2. Error: “Transport endpoint is not connected” is reported although all bricks are up

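3. Error: “Peer Rejected”
Running the gluster peer status command returns “Peer Rejected”.
This indicates that the volume configuration on that node is out of sync with the rest of the trusted storage pool. In the glusterd log of the node on which you ran the peer status command, you should see the following message:
Version of Cksums differ. local cksum = xxxxxx, remote cksum = xxxxyx on peer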


 Resolution: run gluster volume get all cluster.max-op-version to get the latest supported op-version, then update cluster.op-version to that value:
# gluster volume set all cluster.op-version <op-version>
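Reading the value can be scripted so that the number is not copied by hand. This sketch assumes `gluster volume get all cluster.max-op-version` prints a two-column table whose data row is `cluster.max-op-version  <number>`; the function name `max_op_version` is hypothetical. It only prints the value, so it can be checked before the set command is run.

```bash
# max_op_version: hypothetical helper. Reads the output of
# `gluster volume get all cluster.max-op-version` on stdin and prints the
# numeric op-version (exit status 1 if none is found).
max_op_version() {
    awk '$1 == "cluster.max-op-version" && $2 ~ /^[0-9]+$/ { v = $2 }
         END { if (v == "") exit 1; print v }'
}

# Usage (review the number, then apply it):
#   v=$(gluster volume get all cluster.max-op-version | max_op_version) &&
#   gluster volume set all cluster.op-version "$v"
```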


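4. Error: “RPC Error: Program not registered”

Start the portmap or rpcbind service:
  # /etc/init.d/portmap start
or
  # /etc/init.d/rpcbind start
(on systemd-based systems: systemctl start rpcbind)
After starting portmap or rpcbind, restart the Gluster NFS server.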

5. Error: the mount command reports an “rpc.statd” error
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.


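12. Classic GlusterFS Cases

(1) Hands-on: scaling out and scaling in

  1. Use cases

  2. Scaling out
    Scaling out or in is done in units of nodes or subvolumes. This changes the number of DHT subvolumes, so the hash range that each subvolume covers for a directory changes and must be recalculated and reassigned. The hash values of some files then fall into other subvolumes, and those files should be migrated to the correct subvolume. Data rebalancing is not automatic: you must run the gluster rebalance command manually to trigger it.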



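  3. Scaling in


  4. Renaming


  5. Data rebalance workflow


  6. Rebalancing recommendations
    (10) If data migration would have a significant impact on applications, you can run only fix-layout. This repairs the hash distribution of directories without actually migrating files, so new files can be stored on the newly added nodes (or bricks). Run the data migration later at a suitable time, when the system is relatively idle.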
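The advice in (10) amounts to splitting rebalancing into two phases. A minimal sketch, using the `fix-layout start` and `start` subcommands that appear in the hands-on example of this section; the function name `rebalance_phase` and the phase names `layout` and `data` are hypothetical. Run the layout phase right after add-brick and the data phase later, during a quiet period.

```bash
# rebalance_phase VOLNAME layout|data  (hypothetical helper)
#   layout: only recompute directory hash ranges, no file migration
#   data:   full rebalance that also migrates files to their new bricks
rebalance_phase() {
    vol=$1
    case $2 in
        layout) gluster volume rebalance "$vol" fix-layout start ;;
        data)   gluster volume rebalance "$vol" start ;;
        *)      echo "usage: rebalance_phase VOLNAME layout|data" >&2; return 2 ;;
    esac
}

# Usage: rebalance_phase test1 layout   # now
#        rebalance_phase test1 data     # later, when the system is idle
```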

  7. Hands-on scaling walkthrough


  1) Create a distributed volume and start it
[root@master ~]# gluster volume create test1 master:/bricks/brick1/test1 node01:/bricks/brick1/test1
volume create: test1: success: please start the volume to access data
[root@master ~]# gluster volume start test1
volume start: test1: success
  2) View the volume information
[root@master ~]# gluster volume info test1
Volume Name: test1
Type: Distribute
Volume ID: 498151f3-5b8c-4f51-bc9f-104ffa60a0ed
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Brick1: master:/bricks/brick1/test1
Brick2: node01:/bricks/brick1/test1
Options Reconfigured:
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on


  3) Mount the volume and write 100 test files
[root@master ~]# mount -t glusterfs node01:/test1 /mnt/test
[root@master ~]# for i in `seq -w 1 100`; do cp -rp /var/log/messages /mnt/test/copy-test-$i; done
On the master node:
[root@master ~]# ls /bricks/brick1/test1/
copy-test-001  copy-test-019  copy-test-032  copy-test-052  copy-test-079  copy-test-094
copy-test-004  copy-test-021  copy-test-033  copy-test-054  copy-test-081  copy-test-095
copy-test-006  copy-test-022  copy-test-034  copy-test-057  copy-test-082  copy-test-098
copy-test-008  copy-test-023  copy-test-038  copy-test-060  copy-test-083  copy-test-099
copy-test-011  copy-test-024  copy-test-039  copy-test-063  copy-test-086  copy-test-100
copy-test-012  copy-test-028  copy-test-041  copy-test-065  copy-test-087
copy-test-015  copy-test-029  copy-test-046  copy-test-073  copy-test-088
copy-test-016  copy-test-030  copy-test-048  copy-test-077  copy-test-090
copy-test-017  copy-test-031  copy-test-051  copy-test-078  copy-test-093
On node01:
[root@node01 ~]# ls /bricks/brick1/test1/
copy-test-002  copy-test-020  copy-test-043  copy-test-058  copy-test-070  copy-test-089
copy-test-003  copy-test-025  copy-test-044  copy-test-059  copy-test-071  copy-test-091
copy-test-005  copy-test-026  copy-test-045  copy-test-061  copy-test-072  copy-test-092
copy-test-007  copy-test-027  copy-test-047  copy-test-062  copy-test-074  copy-test-096
copy-test-009  copy-test-035  copy-test-049  copy-test-064  copy-test-075  copy-test-097
copy-test-010  copy-test-036  copy-test-050  copy-test-066  copy-test-076
copy-test-013  copy-test-037  copy-test-053  copy-test-067  copy-test-080
copy-test-014  copy-test-040  copy-test-055  copy-test-068  copy-test-084
copy-test-018  copy-test-042  copy-test-056  copy-test-069  copy-test-085
As you can see, the data is distributed across the master and node01 nodes.


  4) Add a brick on node02 to scale out the volume
[root@master ~]# gluster volume add-brick test1 node02:/bricks/brick1/test1
volume add-brick: success
[root@master ~]# gluster volume info test1
Volume Name: test1
Type: Distribute
Volume ID: 498151f3-5b8c-4f51-bc9f-104ffa60a0ed
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Brick1: master:/bricks/brick1/test1
Brick2: node01:/bricks/brick1/test1
Brick3: node02:/bricks/brick1/test1
Options Reconfigured:
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on


  5) Write 50 more files (copy-test-101 to copy-test-150)
[root@master ~]# for i in `seq -w 101 150`; do cp -rp /var/log/messages /mnt/test/copy-test-$i; done
On the master node:
[root@master ~]# ls /bricks/brick1/test1/
copy-test-001  copy-test-028  copy-test-054  copy-test-088  copy-test-108  copy-test-135
copy-test-004  copy-test-029  copy-test-057  copy-test-090  copy-test-110  copy-test-136
copy-test-006  copy-test-030  copy-test-060  copy-test-093  copy-test-111  copy-test-138
copy-test-008  copy-test-031  copy-test-063  copy-test-094  copy-test-115  copy-test-142
copy-test-011  copy-test-032  copy-test-065  copy-test-095  copy-test-121  copy-test-143
copy-test-012  copy-test-033  copy-test-073  copy-test-098  copy-test-123  copy-test-145
copy-test-015  copy-test-034  copy-test-077  copy-test-099  copy-test-124  copy-test-147
copy-test-016  copy-test-038  copy-test-078  copy-test-100  copy-test-125  copy-test-148
copy-test-017  copy-test-039  copy-test-079  copy-test-101  copy-test-128  copy-test-150
copy-test-019  copy-test-041  copy-test-081  copy-test-103  copy-test-129
copy-test-021  copy-test-046  copy-test-082  copy-test-104  copy-test-131
copy-test-022  copy-test-048  copy-test-083  copy-test-105  copy-test-132
copy-test-023  copy-test-051  copy-test-086  copy-test-106  copy-test-133
copy-test-024  copy-test-052  copy-test-087  copy-test-107  copy-test-134
On node01:
[root@node01 ~]# ls /bricks/brick1/test1/
copy-test-002  copy-test-027  copy-test-053  copy-test-070  copy-test-096  copy-test-122
copy-test-003  copy-test-035  copy-test-055  copy-test-071  copy-test-097  copy-test-126
copy-test-005  copy-test-036  copy-test-056  copy-test-072  copy-test-102  copy-test-127
copy-test-007  copy-test-037  copy-test-058  copy-test-074  copy-test-109  copy-test-130
copy-test-009  copy-test-040  copy-test-059  copy-test-075  copy-test-112  copy-test-137
copy-test-010  copy-test-042  copy-test-061  copy-test-076  copy-test-113  copy-test-139
copy-test-013  copy-test-043  copy-test-062  copy-test-080  copy-test-114  copy-test-140
copy-test-014  copy-test-044  copy-test-064  copy-test-084  copy-test-116  copy-test-141
copy-test-018  copy-test-045  copy-test-066  copy-test-085  copy-test-117  copy-test-144
copy-test-020  copy-test-047  copy-test-067  copy-test-089  copy-test-118  copy-test-146
copy-test-025  copy-test-049  copy-test-068  copy-test-091  copy-test-119  copy-test-149
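copy-test-026  copy-test-050  copy-test-069  copy-test-092  copy-test-120
On node02:
[root@node02 ~]# ls /bricks/brick1/test1/
The brick on node02 is empty: the new data still lands only on the master and node01 nodes, because the hash layout of the existing directory has not yet been extended to the new brick.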


  6) Run fix-layout so that the directory hash ranges include the new brick
[root@node02 ~]# gluster volume rebalance test1 fix-layout start
volume rebalance: test1: success: Rebalance on test1 has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: c77d8f6c-d932-4da2-be38-441fdaeecb11
[root@node02 ~]# gluster volume rebalance test1 status
Node        status                  run time in h:m:s
---------   -----------             ------------
master      fix-layout completed    0:0:0
node01      fix-layout completed    0:0:0
localhost   fix-layout completed    0:0:0
volume rebalance: test1: success
Then write some more files (up to copy-test-200) and check each node again. On the master node:
[root@master ~]# ls /bricks/brick1/test1/
copy-test-001  copy-test-031  copy-test-077  copy-test-103  copy-test-133  copy-test-169
copy-test-004  copy-test-032  copy-test-078  copy-test-104  copy-test-134  copy-test-171
copy-test-006  copy-test-033  copy-test-079  copy-test-105  copy-test-135  copy-test-172
copy-test-008  copy-test-034  copy-test-081  copy-test-106  copy-test-136  copy-test-179
copy-test-011  copy-test-038  copy-test-082  copy-test-107  copy-test-138  copy-test-180
copy-test-012  copy-test-039  copy-test-083  copy-test-108  copy-test-142  copy-test-190
copy-test-015  copy-test-041  copy-test-086  copy-test-110  copy-test-143  copy-test-191
copy-test-016  copy-test-046  copy-test-087  copy-test-111  copy-test-145  copy-test-192
copy-test-017  copy-test-048  copy-test-088  copy-test-115  copy-test-147  copy-test-196
copy-test-019  copy-test-051  copy-test-090  copy-test-121  copy-test-148  copy-test-197
copy-test-021  copy-test-052  copy-test-093  copy-test-123  copy-test-150  copy-test-198
copy-test-022  copy-test-054  copy-test-094  copy-test-124  copy-test-151  copy-test-199
copy-test-023  copy-test-057  copy-test-095  copy-test-125  copy-test-155  copy-test-200
copy-test-024  copy-test-060  copy-test-098  copy-test-128  copy-test-160
copy-test-028  copy-test-063  copy-test-099  copy-test-129  copy-test-161
copy-test-029  copy-test-065  copy-test-100  copy-test-131  copy-test-166
copy-test-030  copy-test-073  copy-test-101  copy-test-132  copy-test-167
On node01:
[root@node01 ~]# ls /bricks/brick1/test1/
copy-test-002  copy-test-036  copy-test-059  copy-test-080  copy-test-117  copy-test-149
copy-test-003  copy-test-037  copy-test-061  copy-test-084  copy-test-118  copy-test-152
copy-test-005  copy-test-040  copy-test-062  copy-test-085  copy-test-119  copy-test-157
copy-test-007  copy-test-042  copy-test-064  copy-test-089  copy-test-120  copy-test-158
copy-test-009  copy-test-043  copy-test-066  copy-test-091  copy-test-122  copy-test-159
copy-test-010  copy-test-044  copy-test-067  copy-test-092  copy-test-126  copy-test-163
copy-test-013  copy-test-045  copy-test-068  copy-test-096  copy-test-127  copy-test-168
copy-test-014  copy-test-047  copy-test-069  copy-test-097  copy-test-130  copy-test-170
copy-test-018  copy-test-049  copy-test-070  copy-test-102  copy-test-137  copy-test-176
copy-test-020  copy-test-050  copy-test-071  copy-test-109  copy-test-139  copy-test-182
copy-test-025  copy-test-053  copy-test-072  copy-test-112  copy-test-140  copy-test-183
copy-test-026  copy-test-055  copy-test-074  copy-test-113  copy-test-141  copy-test-187
copy-test-027  copy-test-056  copy-test-075  copy-test-114  copy-test-144  copy-test-193
copy-test-035  copy-test-058  copy-test-076  copy-test-116  copy-test-146  copy-test-194
On node02:
[root@node02 ~]# ls /bricks/brick1/test1/
copy-test-153  copy-test-162  copy-test-173  copy-test-177  copy-test-184  copy-test-188
copy-test-154  copy-test-164  copy-test-174  copy-test-178  copy-test-185  copy-test-189
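copy-test-156  copy-test-165  copy-test-175  copy-test-181  copy-test-186  copy-test-195
As you can see, after the hash layout is readjusted the new node can store files, but the existing files remain on the old nodes, which increases the load on them.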


  7) Run a full rebalance to migrate the existing files
[root@node02 ~]# gluster volume rebalance test1 start
volume rebalance: test1: success: Rebalance on test1 has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 4c712f20-3e1b-4d84-9912-a766a00a9bb0
[root@node02 ~]# gluster volume rebalance test1 status
Node       Rebalanced-files  size    scanned  failures  skipped  status     run time in h:m:s
---------  ----------------  ------  -------  --------  -------  ---------  -----------------
master     0                 0Bytes  98       0         21       completed  0:00:00
node01     18                2.6MB   84       0         0        completed  0:00:00
localhost  0                 0Bytes  18       0         0        completed  0:00:00
volume rebalance: test1: success
[root@master ~]# ls /bricks/brick1/test1/|wc -l
98
[root@node01 ~]# ls /bricks/brick1/test1/|wc -l
66
[root@node02 ~]# ls /bricks/brick1/test1/|wc -l



  8) Remove the brick on node02 to scale in the volume
[root@master ~]# gluster volume remove-brick test1 node02:/bricks/brick1/test1 start
It is recommended that remove-brick be run with cluster.force-migration option disabled to prevent possible data corruption. Doing so will ensure that files that receive writes during migration will not be migrated and will need to be manually copied after the remove-brick commit operation. Please check the value of the option and update accordingly.
Do you want to continue with your current cluster.force-migration settings? (y/n) y
volume remove-brick start: success
ID: a2b3c2e8-1bd6-4c23-8c0b-701f05107624
[root@master ~]# gluster volume remove-brick test1 node02:/bricks/brick1/test1 status
Node    Rebalanced-files  size   scanned  failures  skipped  status     run time in h:m:s
------  ----------------  -----  -------  --------  -------  ---------  -----------------
node02  36                5.1MB  57       0         0        completed  0:00:01
[root@master ~]# ls /bricks/brick1/test1/|wc -l
126
[root@node01 ~]# ls /bricks/brick1/test1/|wc -l
[root@master ~]# for i in `seq -w 201 250`; do cp -rp /var/log/messages /mnt/test/copy-test-$i; done
[root@master ~]# ls /bricks/brick1/test1/|wc -l
147
[root@node01 ~]# ls /bricks/brick1/test1/|wc -l


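(2) Split-brain

  1. Introduction to split-brain

  2. Causes of split-brain

  3. Split-brain resolution methods
View the entries that need healing:
# gluster volume heal <VOLNAME> info
View the entries that are in split-brain:
# gluster volume heal <VOLNAME> info split-brain

Resolving data/metadata split-brain with the gluster CLI

a. Use the bigger file as the source
# gluster volume heal <VOLNAME> split-brain bigger-file <FILE>
Here, <FILE> can be either the full file name as seen from the root of the volume, or the GFID string representation of the file, which is sometimes shown in the output of the heal info command. The command finds the replica of <FILE> with the larger size and completes the heal using that brick as the source.
b. Use the file with the latest mtime as the source
# gluster volume heal <VOLNAME> split-brain latest-mtime <FILE>
c. Use one brick of the replica as the source for a particular file
# gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>
d. Use one brick of the replica as the source for all files
# gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME>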
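When many files are in split-brain, the entries can be collected from the `heal info split-brain` output and turned into heal commands. This sketch only prints the `bigger-file` commands for review and does not run them; it assumes one entry per line under each `Brick` header, as in the example that follows, and the function name `split_brain_cmds` is hypothetical. Using bigger-file for every entry is only an illustration: each file should be inspected (for example with stat and md5sum, as in the example) before choosing a policy, and directory entries cannot be resolved this way.

```bash
# split_brain_cmds VOLNAME  (hypothetical helper)
# Reads `gluster volume heal VOLNAME info split-brain` output on stdin and
# prints one "bigger-file" heal command per unique entry (not executed).
split_brain_cmds() {
    awk -v vol="$1" '
        /^Brick / || /^Status:/ || /^Number of entries/ || NF == 0 { next }  # skip headers
        !seen[$0]++ {
            printf "gluster volume heal %s split-brain bigger-file %s\n", vol, $0
        }
    '
}

# Usage: gluster volume heal test info split-brain | split_brain_cmds test
```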
  4. A real repair example

Consider a volume named test with two bricks, b1 and b2, and the self-heal daemon turned off.

# gluster volume heal test info split-brain
Brick <hostname:brickpath-b1>
<gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
<gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
<gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
Number of entries in split-brain: 3

Brick <hostname:brickpath-b2>
/dir/file1
/dir
/file4
Number of entries in split-brain: 3



Check the file on each brick first. On b1:
[brick1]# stat b1/dir/file1
  File: ‘b1/dir/file1’
  Size: 17              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919362      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 13:55:40.149897333 +0530
Modify: 2015-03-06 13:55:37.206880347 +0530
Change: 2015-03-06 13:55:37.206880347 +0530
 Birth: -
[brick1]#
[brick1]# md5sum b1/dir/file1
040751929ceabf77c3c0b3b662f341a8  b1/dir/file1
On b2:
[brick2]# stat b2/dir/file1
  File: ‘b2/dir/file1’
  Size: 13              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919365      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 13:54:22.974451898 +0530
Modify: 2015-03-06 13:52:22.910758923 +0530
Change: 2015-03-06 13:52:22.910758923 +0530
 Birth: -
[brick2]#
[brick2]# md5sum b2/dir/file1
cb11635a45d45668a403145059c2a0d5  b2/dir/file1
Heal the file with the following command. Because the copy on b1 (17 bytes) is larger than the one on b2 (13 bytes), b1 is used as the source:
# gluster volume heal test split-brain bigger-file /dir/file1
After the heal completes, the md5sum and file size should be the same on both bricks.
On b1:
[brick1]# stat b1/dir/file1
  File: ‘b1/dir/file1’
  Size: 17              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919362      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 14:17:27.752429505 +0530
Modify: 2015-03-06 13:55:37.206880347 +0530
Change: 2015-03-06 14:17:12.880343950 +0530
 Birth: -
[brick1]#
[brick1]# md5sum b1/dir/file1
040751929ceabf77c3c0b3b662f341a8  b1/dir/file1
On b2:
[brick2]# stat b2/dir/file1
  File: ‘b2/dir/file1’
  Size: 17              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919365      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 14:17:23.249403600 +0530
Modify: 2015-03-06 13:55:37.206880000 +0530
Change: 2015-03-06 14:17:12.881343955 +0530
 Birth: -
[brick2]#
[brick2]# md5sum b2/dir/file1
040751929ceabf77c3c0b3b662f341a8  b2/dir/file1




