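Table of Contents
- Basic Concepts
- Building from Source and Download
- Running the HBCK2 Tool
- HBCK2 Options
- Notes
- HBCK2 Overview
- Finding Problems
- Diagnostic Tools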
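Basic Concepts
HBCK2 performs a single, discrete task each time it is run. Unlike hbck1, it is not a tool that analyzes everything about the running cluster and then fixes "all the problems" it finds.
hbck1 is still bundled in hbase-2.x to minimize surprise, but it is deprecated and will be removed in hbase-3.x.
HBCK2 is for fixes. For a list of inconsistencies or blockages in the running cluster, you look elsewhere: in the logs and UI of the running cluster's Master. Once a problem has been identified, you use the HBCK2 tool to ask the Master to make the fix or to skip over the bad state. This is another important difference between HBCK2 and hbck1: HBCK2 asks the Master to make the fix rather than attempting a local repair in the context of the fix-it tool. See the sections below for more on how this interactive fix-it process works and on how HBCK2 works.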
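Building from Source and Download
- Download link
https://hbase.apache.org/downloads.html
- What to download:
HBase Operator Tools
src package
- Build preparation
Download Maven
https://dlcdn.apache.org/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
- Extract the downloaded archives
- cd hbase-operator-tools-1.2.0
- Start the build
# Build command (adjust the Maven path to wherever you extracted it)
apache-maven-3.6.3/bin/mvn clean install -DskipTests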
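Running the HBCK2 Tool
The HBCK2 jar does not include its dependencies; it is not built as a fat jar. The dependencies must be provided. When building, adjust the target hbase version in the top-level pom to match your deployment; this gives the smoothest operation when running against that deployment (see the parent pom.xml of hbase-operator-tools for where to set hbase.version).
Runtime interaction between HBCK2 and a running cluster gets interesting when HBCK2 is ahead of your hbase deployment, so that your hbase does not support every API the current HBCK2 uses. If HBCK2 lacks the server-side support it needs, it should fail gracefully. If you run into this, use an older version of HBCK2 or, if you can, upgrade your cluster.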
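Before building, it can help to confirm which hbase version the pom currently targets. The snippet below is only a sketch: the function name pom_hbase_version is made up for this example, and it assumes the hbase.version property sits on a single line of the parent pom.xml.

```shell
# Print the value of the <hbase.version> property found in a pom.xml.
# Assumption: the property is written on one line, e.g.
#   <hbase.version>x.y.z</hbase.version>
pom_hbase_version() {
  pom="$1"
  sed -n 's:.*<hbase\.version>\(.*\)</hbase\.version>.*:\1:p' "$pom" | head -n 1
}

# Example (the path is illustrative):
#   pom_hbase_version hbase-operator-tools-1.2.0/pom.xml
```

Compare the printed value with the version reported by your deployment (for example via bin/hbase version) and edit the pom if they differ.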
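The easiest way to "provide" HBCK2 with its dependencies is to launch it through the $HBASE_HOME/bin/hbase script. The bin/hbase script already knows about hbck: its help output lists an hbck option. By default, running bin/hbase hbck runs the built-in hbck1 tool. To run HBCK2, use the -j option to point at the HBCK2 jar you built, as follows:
$ ${HBASE_HOME}/bin/hbase --config /etc/hbase-conf hbck -j ~/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-xxx.jar
In the command above, /etc/hbase-conf is the directory that holds the deployment's configuration (any empty directory is accepted as well; the client then falls back to its default settings, so pass the ZooKeeper details with -q, -p and -z if needed).
The HBCK2 jar is located at ~/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-xxx.jar.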
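Typing the full launch command for every HBCK2 invocation gets tedious, so a small wrapper function is handy. This is a sketch only: the function name hbck2 and the variables HBCK2_JAR, HBCK2_CONF_DIR and HBCK2_DRY_RUN are names invented here, not part of hbase.

```shell
# hbck2: thin wrapper around "bin/hbase hbck -j <jar>".
# HBCK2_JAR, HBCK2_CONF_DIR and HBCK2_DRY_RUN are hypothetical names
# chosen for this sketch; HBASE_HOME must point at your hbase install.
hbck2() {
  hbase_home="${HBASE_HOME:?set HBASE_HOME to your hbase install}"
  jar="${HBCK2_JAR:?set HBCK2_JAR to the built hbase-hbck2 jar}"
  conf_dir="${HBCK2_CONF_DIR:-/etc/hbase-conf}"
  if [ -n "${HBCK2_DRY_RUN:-}" ]; then
    # Dry run: print the command instead of executing it.
    echo "$hbase_home/bin/hbase --config $conf_dir hbck -j $jar $*"
  else
    "$hbase_home/bin/hbase" --config "$conf_dir" hbck -j "$jar" "$@"
  fi
}

# Example:
#   export HBCK2_JAR=~/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-xxx.jar
#   hbck2 -h
```

The dry-run switch is there so you can check exactly what will be run before pointing the tool at a production cluster.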
HBCK2 Options
usage: HBCK2 [OPTIONS] COMMAND <ARGS>
Options:
 -d,--debug                                      run with debug output
 -h,--help                                       output this help message
 -p,--hbase.zookeeper.property.clientPort <arg>  port of hbase ensemble
 -q,--hbase.zookeeper.quorum <arg>               hbase ensemble
 -s,--skip                                       skip hbase version check (PleaseHoldException)
 -v,--version                                    this hbck2 version
 -z,--zookeeper.znode.parent <arg>               parent znode of hbase ensemble
Command:
 addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
   Options:
    -d,--force_disable aborts fix for table if disable fails.
   To be used when regions missing from hbase:meta but directories
   are present still in HDFS. Can happen if user has run _hbck1_
   'OfflineMetaRepair' against an hbase-2.x cluster. Needs hbase:meta
   to be online. For each table name passed as parameter, performs diff
   between regions available in hbase:meta and region dirs on HDFS.
   Then for dirs with no hbase:meta matches, it reads the 'regioninfo'
   metadata file and re-creates given region in hbase:meta. Regions are
   re-created in 'CLOSED' state in the hbase:meta table, but not in the
   Masters' cache, and they are not assigned either. To get these
   regions online, run the HBCK2 'assigns'command printed when this
   command-run completes.
   NOTE: If using hbase releases older than 2.3.0, a rolling restart of
   HMasters is needed prior to executing the set of 'assigns' output.
   An example adding missing regions for tables 'tbl_1' in the default
   namespace, 'tbl_2' in namespace 'n1' and for all tables from
   namespace 'n2':
     $ HBCK2 addFsRegionsMissingInMeta default:tbl_1 n1:tbl_2 n2
   Returns HBCK2 an 'assigns' command with all re-inserted regions.
   SEE ALSO: reportMissingRegionsInMeta
   SEE ALSO: fixMeta

 assigns [OPTIONS] <ENCODED_REGIONNAME/INPUTFILES_FOR_REGIONNAMES>...
   Options:
    -o,--override  override ownership by another procedure
    -i,--inputFiles  take one or more encoded region names
   A 'raw' assign that can be used even during Master initialization (if
   the -skip flag is specified). Skirts Coprocessors. Pass one or more
   encoded region names. 1588230740 is the hard-coded name for the
   hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an example of
   what a user-space encoded region name looks like. For example:
     $ HBCK2 assigns 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created AssignProcedure(s) or -1 if none.
   If -i or --inputFiles is specified, pass one or more input file names.
   Each file contains encoded region names, one per line. For example:
     $ HBCK2 assigns -i fileName1 fileName2

 bypass [OPTIONS] <PID>...
   Options:
    -o,--override   override if procedure is running/stuck
    -r,--recursive  bypass parent and its children. SLOW! EXPENSIVE!
    -w,--lockWait   milliseconds to wait before giving up; default=1
   Pass one (or more) procedure 'pid's to skip to procedure finish. Parent
   of bypassed procedure will also be skipped to the finish. Entities will
   be left in an inconsistent state and will require manual fixup. May
   need Master restart to clear locks still held. Bypass fails if
   procedure has children. Add 'recursive' if all you have is a parent pid
   to finish parent and children. This is SLOW, and dangerous so use
   selectively. Does not always work.

 extraRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
   Options:
    -f, --fix    fix meta by removing all extra regions found.
   Reports regions present on hbase:meta, but with no related
   directories on the file system. Needs hbase:meta to be online.
   For each table name passed as parameter, performs diff
   between regions available in hbase:meta and region dirs on the given
   file system. Extra regions would get deleted from Meta
   if passed the --fix option.
   NOTE: Before deciding on use the "--fix" option, it's worth check if
   reported extra regions are overlapping with existing valid regions.
   If so, then "extraRegionsInMeta --fix" is indeed the optimal solution.
   Otherwise, "assigns" command is the simpler solution, as it recreates
   regions dirs in the filesystem, if not existing.
   An example triggering extra regions report for tables 'table_1'
   and 'table_2', under default namespace:
     $ HBCK2 extraRegionsInMeta default:table_1 default:table_2
   An example triggering extra regions report for table 'table_1'
   under default namespace, and for all tables from namespace 'ns1':
     $ HBCK2 extraRegionsInMeta default:table_1 ns1
   Returns list of extra regions for each table passed as parameter, or
   for each table on namespaces specified as parameter.

 filesystem [OPTIONS] [<TABLENAME>...]
   Options:
    -f, --fix    sideline corrupt hfiles, bad links, and references.
   Report on corrupt hfiles, references, broken links, and integrity.
   Pass '--fix' to sideline corrupt files and links. '--fix' does NOT
   fix integrity issues; i.e. 'holes' or 'orphan' regions. Pass one or
   more tablenames to narrow checkup. Default checks all tables and
   restores 'hbase.version' if missing. Interacts with the filesystem
   only! Modified regions need to be reopened to pick-up changes.

 fixMeta
   Do a server-side fix of bad or inconsistent state in hbase:meta.
   Available in hbase 2.2.1/2.1.6 or newer versions. Master UI has
   matching, new 'HBCK Report' tab that dumps reports generated by
   most recent run of _catalogjanitor_ and a new 'HBCK Chore'. It
   is critical that hbase:meta first be made healthy before making
   any other repairs. Fixes 'holes', 'overlaps', etc., creating
   (empty) region directories in HDFS to match regions added to
   hbase:meta. Command is NOT the same as the old _hbck1_ command
   named similarily. Works against the reports generated by the last
   catalog_janitor and hbck chore runs. If nothing to fix, run is a
   noop. Otherwise, if 'HBCK Report' UI reports problems, a run of
   fixMeta will clear up hbase:meta issues. See 'HBase HBCK' UI
   for how to generate new report.
   SEE ALSO: reportMissingRegionsInMeta

 generateMissingTableDescriptorFile <TABLENAME>
   Trying to fix an orphan table by generating a missing table descriptor
   file. This command will have no effect if the table folder is missing
   or if the .tableinfo is present (we don't override existing table
   descriptors). This command will first check it the TableDescriptor is
   cached in HBase Master in which case it will recover the .tableinfo
   accordingly. If TableDescriptor is not cached in master then it will
   create a default .tableinfo file with the following items:
     - the table name
     - the column family list determined based on the file system
     - the default properties for both TableDescriptor and
       ColumnFamilyDescriptors
   If the .tableinfo file was generated using default parameters then
   make sure you check the table / column family properties later (and
   change them if needed).
   This method does not change anything in HBase, only writes the new
   .tableinfo file to the file system. Orphan tables can cause e.g.
   ServerCrashProcedures to stuck, you might need to fix these still
   after you generated the missing table info files.

 replication [OPTIONS] [<TABLENAME>...]
   Options:
    -f, --fix    fix any replication issues found.
   Looks for undeleted replication queues and deletes them if passed the
   '--fix' option. Pass a table name to check for replication barrier and
   purge if '--fix'.

 reportMissingRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
   To be used when regions missing from hbase:meta but directories
   are present still in HDFS. Can happen if user has run _hbck1_
   'OfflineMetaRepair' against an hbase-2.x cluster. This is a CHECK only
   method, designed for reporting purposes and doesn't perform any
   fixes, providing a view of which regions (if any) would get re-added
   to hbase:meta, grouped by respective table/namespace. To effectively
   re-add regions in meta, run addFsRegionsMissingInMeta.
   This command needs hbase:meta to be online. For each namespace/table
   passed as parameter, it performs a diff between regions available in
   hbase:meta against existing regions dirs on HDFS. Region dirs with no
   matches are printed grouped under its related table name. Tables with
   no missing regions will show a 'no missing regions' message. If no
   namespace or table is specified, it will verify all existing regions.
   It accepts a combination of multiple namespace and tables. Table names
   should include the namespace portion, even for tables in the default
   namespace, otherwise it will assume as a namespace value.
   An example triggering missing regions report for tables 'table_1'
   and 'table_2', under default namespace:
     $ HBCK2 reportMissingRegionsInMeta default:table_1 default:table_2
   An example triggering missing regions report for table 'table_1'
   under default namespace, and for all tables from namespace 'ns1':
     $ HBCK2 reportMissingRegionsInMeta default:table_1 ns1
   Returns list of missing regions for each table passed as parameter, or
   for each table on namespaces specified as parameter.

 setRegionState <ENCODED_REGIONNAME> <STATE>
   Possible region states:
    OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, SPLIT,
    FAILED_OPEN, FAILED_CLOSE, MERGING, MERGED, SPLITTING_NEW,
    MERGING_NEW, ABNORMALLY_CLOSED
   WARNING: This is a very risky option intended for use as last resort.
   Example scenarios include unassigns/assigns that can't move forward
   because region is in an inconsistent state in 'hbase:meta'. For
   example, the 'unassigns' command can only proceed if passed a region
   in one of the following states: SPLITTING|SPLIT|MERGING|OPEN|CLOSING
   Before manually setting a region state with this command, please
   certify that this region is not being handled by a running procedure,
   such as 'assign' or 'split'. You can get a view of running procedures
   in the hbase shell using the 'list_procedures' command. An example
   setting region 'de00010733901a05f5a2a3a382e27dd4' to CLOSING:
     $ HBCK2 setRegionState de00010733901a05f5a2a3a382e27dd4 CLOSING
   Returns "0" if region state changed and "1" otherwise.

 setTableState <TABLENAME> <STATE>
   Possible table states: ENABLED, DISABLED, DISABLING, ENABLING
   To read current table state, in the hbase shell run:
     hbase> get 'hbase:meta', '<TABLENAME>', 'table:state'
   A value of \x08\x00 == ENABLED, \x08\x01 == DISABLED, etc.
   Can also run a 'describe "<TABLENAME>"' at the shell prompt.
   An example making table name 'user' ENABLED:
     $ HBCK2 setTableState users ENABLED
   Returns whatever the previous table state was.

 scheduleRecoveries <SERVERNAME>...
   Schedule ServerCrashProcedure(SCP) for list of RegionServers. Format
   server name as '<HOSTNAME>,<PORT>,<STARTCODE>' (See HBase UI/logs).
   Example using RegionServer 'a.example.org,29100,1540348649479':
     $ HBCK2 scheduleRecoveries a.example.org,29100,1540348649479
   Returns the pid(s) of the created ServerCrashProcedure(s) or -1 if
   no procedure created (see master logs for why not).
   Command support added in hbase versions 2.0.3, 2.1.2, 2.2.0 or newer.

 unassigns <ENCODED_REGIONNAME>...
   Options:
    -o,--override  override ownership by another procedure
   A 'raw' unassign that can be used even during Master initialization
   (if the -skip flag is specified). Skirts Coprocessors. Pass one or
   more encoded region names. 1588230740 is the hard-coded name for the
   hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an example
   of what a userspace encoded region name looks like. For example:
     $ HBCK2 unassign 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created UnassignProcedure(s) or -1 if none.

 SEE ALSO, org.apache.hbase.hbck1.OfflineMetaRepair, the offline
 hbase:meta tool. See the HBCK2 README for how to use.
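Since assigns accepts input files with one encoded region name per line (the -i option described above), it is worth sanity-checking the names before handing them to the Master. The following is a minimal sketch: the function names are invented for this example, and it assumes a user-space encoded region name is always 32 lowercase hex characters, like the sample de00010733901a05f5a2a3a382e27dd4.

```shell
# Succeeds if the argument looks like an encoded region name: either the
# hbase:meta region (1588230740) or 32 lowercase hex characters.
# Assumption: all user-space names follow the 32-hex-character pattern.
is_encoded_region_name() {
  [ "$1" = "1588230740" ] && return 0
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{32}$'
}

# Print the valid names one per line on stdout; report rejects on stderr.
filter_region_names() {
  for name in "$@"; do
    if is_encoded_region_name "$name"; then
      printf '%s\n' "$name"
    else
      printf 'skipping invalid region name: %s\n' "$name" >&2
    fi
  done
}

# Example (the output path is illustrative):
#   filter_region_names 1588230740 de00010733901a05f5a2a3a382e27dd4 > /tmp/regions.txt
#   then: HBCK2 assigns -i /tmp/regions.txt
```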
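Notes
Note that when you pass the hbck argument to bin/hbase, it uses the default client to reach the target hbase cluster. This is sufficient for most HBCK2 usage. If you run into an exception like the following: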
$ bin/hbase --config hbase-conf hbck
2019-08-30 05:04:54,467 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
    at org.apache.hadoop.hbase.util.CommonFSUtils.getRootDir(CommonFSUtils.java:361)
    at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3605)
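This happens because the HDFS jars are not on the CLASSPATH. By default, HDFS jars are not bundled on the CLASSPATH when hbck is run through bin/hbase. Define HADOOP_HOME in the environment so that bin/hbase can find your local hadoop installation; it will then load the HDFS jars from there.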
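A quick pre-flight check can catch a missing HADOOP_HOME before the stack trace does. This is a sketch: check_hadoop_home is a name made up for this example, and /opt/hadoop in the usage comment is a placeholder for your own install path.

```shell
# Returns 0 if HADOOP_HOME points at an existing directory, 1 otherwise.
check_hadoop_home() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$HADOOP_HOME" ]; then
    echo "HADOOP_HOME does not exist: $HADOOP_HOME" >&2
    return 1
  fi
  return 0
}

# Example (the install path is an assumption; use your own):
#   export HADOOP_HOME=/opt/hadoop
#   check_hadoop_home && bin/hbase --config hbase-conf hbck
```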
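HBCK2 Overview
HBCK2 is currently a simple tool that does one thing at a time.
In hbase-2.x, the Master is the final arbiter of all state, so the general principle behind most HBCK2 commands is that HBCK2 asks the Master to carry out every fix. This means the Master must be up before you can run HBCK2 commands.
HBCK2 works by making use of the HbckService hosted on the Master. The service publishes a few methods for the HBCK2 tool to call. So, for any HBCK2 command that relies on the Master's HbckService facade, the first thing HBCK2 does is poke the cluster to make sure the service is available. This fails if the remote server does not publish the service or if the HbckService lacks the requested method. In the latter case, upgrade your cluster if you can, so that more fix facilities become available.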
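Finding Problems
While hbck1 performed an analysis and reported your cluster as GOOD or BAD, HBCK2 is less presumptuous. In hbase-2.x, the operator works out what needs fixing and then uses tooling, HBCK2 included, to do the fix. The operator may have to go back and forth for a few rounds, running HBCK2 and then checking cluster state.
To track down cluster problems, use the utilities and methods below.
Diagnostic Tools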
Master Logs
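The Master runs all assignments, server crash handling, cluster start and stop, and so on. In hbase-2.x, everything the Master does has been recast as Procedures that run on a state machine engine. See the Procedure Framework and Assignment Manager documentation for details on how this new infrastructure works. Each Procedure has a unique Procedure id, its pid, which appears in every log record it writes. By following the pid, you can trace the lifecycle of a Procedure in the Master log as it moves from start, through its various stages, to finish. Some Procedures spawn sub-procedures, wait on their children, and then finish themselves. Each child logs its own pid as well as its ppid, the pid of its parent.
Generally everything runs without problems, but if some unforeseen circumstance arises, the assignment framework may sustain damage that requires operator intervention. Some such scenarios are discussed below. In the Master log they can show up as a Region being STUCK, or as a Procedure that is transitioning an entity (a Region or a Table) being blocked because another Procedure holds the exclusive lock and is not letting go.
A STUCK Procedure looks like this: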
2018-09-12 15:29:06,558 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager:
STUCK Region-In-Transition rit=OPENING, location=va1001.example.org,22101,1536173230599,
table=IntegrationTestBigLinkedList_20180626110336, region=dbdb56242f17610c46ea044f7a42895b
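To collect the affected regions from a Master log, a grep and sed pipeline is usually enough. The sketch below assumes that each STUCK record is a single line in the actual log file (it is wrapped above only for display) and that the encoded region name follows region= as 32 hex characters; stuck_regions is a name invented for this example.

```shell
# Read Master log text on stdin and print the encoded names of regions
# reported as STUCK Region-In-Transition, de-duplicated.
# Assumption: each STUCK record is one line containing "region=<32 hex chars>".
stuck_regions() {
  grep 'STUCK Region-In-Transition' \
    | sed -n 's/.*region=\([0-9a-f]\{32\}\).*/\1/p' \
    | sort -u
}

# Example (the log path is illustrative):
#   stuck_regions < /var/log/hbase/hbase-master.log
```

The resulting names are in the form that the HBCK2 assigns and unassigns commands expect.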
Master UI: /master-status#tables
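This section, about midway down the Master UI home page, shows a list of tables with columns for whether each table is ENABLED, ENABLING, DISABLING, or DISABLED, among other attributes. It also has columns with counts of Regions in each transition state: OPEN, CLOSED, and so on. Reading this table is a good way to tell whether a table's Regions are in the states you would expect. For example, if a table is ENABLED, some of its Regions are not in the OPEN state, and the Master log is silent about any ongoing assigns, then something is amiss.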
Master UI: ‘Procedures & Locks’
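This page, reached from the Master UI home page through the Procedures & Locks menu item in the page header, lists all ongoing Procedures and Locks as well as the current set of Master Procedure WALs (files named pv2-0000000000000000###.log under the MasterProcWALs directory of your hbase install). At startup on a large cluster, while furious assigning is under way, this page fills up with lists of Procedures and Locks, and the number of MasterProcWALs swells too. If, after the cluster settles, a Lock or Procedure remains stuck, or the WAL count never comes down and only grows, operator intervention is needed to clear the blockage.
The lists of locks and procedures can also be obtained through the hbase shell: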
$ echo "list_locks"| hbase shell &> /tmp/locks.txt
$ echo "list_procedures"| hbase shell &> /tmp/procedures.txt
Master UI: The ‘HBCK Report’
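In hbase 2.1.6, 2.2.1 and 2.3.0, an HBCK Report page was added to the Master at /hbck.jsp. It shows the output of two inspections that the Master runs on an interval. One is produced by the CatalogJanitor whenever it runs: if there are overlaps or holes in hbase:meta, the CatalogJanitor half of the page lists what it found (otherwise it is quiet). The other comes from a background 'chore' process that was added to compare hbase:meta against the filesystem content; if it finds an anomaly, it notes it in its section of the HBCK Report.
See the 'HBCK Report' page itself for how to force the inspectors to run.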
The HBase Canary Tool
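The Canary tool is useful for verifying the state of assignment. It can be run with a focus on one table or against the whole cluster.
For example, to check cluster assignment: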
$ hbase canary -f false -t 6000000 &>/tmp/canary.log
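The -f false option tells the Canary to keep going past failed Region fetches, and -t 6000000 tells the Canary to run for at most 6,000,000 ms (100 minutes, or roughly two hours). When it is done, look at /tmp/canary.log and grep for ERROR lines to find problematic Region assignments.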
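The two steps above (choosing a timeout and scanning the log) can be sketched in shell. Both function names are made up for this example, and the conversion assumes that -t is given in milliseconds, which is consistent with the "roughly two hours" description.

```shell
# Convert a millisecond timeout to whole minutes (integer division).
# Assumption: the Canary -t value is in milliseconds.
ms_to_minutes() {
  echo $(( $1 / 1000 / 60 ))
}

# Print the lines of a Canary log that contain ERROR.
# Prints nothing, and still succeeds, when there are none.
canary_error_lines() {
  grep 'ERROR' "$1" || true
}

# Example:
#   ms_to_minutes 6000000
#   canary_error_lines /tmp/canary.log
```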
You can run a Canary-like probe from the hbase shell. For example, given a Region with start row d1dddd0c that belongs to the table testtable, do the following:
hbase> scan 'testtable', {STARTROW => 'd1dddd0c', LIMIT => 10}
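Other Tools
To work out which Regions are not OPEN on an ENABLED or ENABLING table, read the info:state column of the hbase:meta table. For example, to find the state of all Regions in the table IntegrationTestBigLinkedList_20180626064758, do the following: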
$ echo " scan 'hbase:meta', {ROWPREFIXFILTER => 'IntegrationTestBigLinkedList_20180626064758,', COLUMN => 'info:state'}"| hbase shell > /tmp/t.txt
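...then grep the output for OPENING or CLOSING Regions.
To move a Region that is stuck in OPENING to OPEN so that it agrees with the table's ENABLED state, use the assign command in the hbase shell to queue a new Assign Procedure (watch the Master log to see the assign run). If there are many Regions to assign, use the HBCK2 tool, which can do bulk assigns.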
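For a bulk assign, the encoded region names can be pulled out of the saved scan output and fed to HBCK2 as an input file. This is only a sketch: regions_in_state is a name invented here, and it assumes each line of the hbase shell scan output has the form "<table>,<startkey>,<id>.<encoded name>. column=info:state, timestamp=..., value=<STATE>", so check it against your own /tmp/t.txt before relying on it.

```shell
# Print the encoded names of regions whose info:state equals the given state.
# Assumption: each scan output line ends with "value=<STATE>" and the row key
# ends with ".<32 hex chars>." followed by a space.
regions_in_state() {
  state="$1"
  file="$2"
  sed -n "s/.*\.\([0-9a-f]\{32\}\)\. .*value=${state}[[:space:]]*\$/\1/p" "$file"
}

# Example (the output path is illustrative):
#   regions_in_state OPENING /tmp/t.txt > /tmp/opening.txt
#   then: HBCK2 assigns -i /tmp/opening.txt
```

Anchoring the match at the end of the line keeps a query for OPEN from also matching OPENING.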