Merging Small Parquet Files

ops/2024/12/21 21:39:33/

Small Parquet files to merge, for example:
./2024-7-26/0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

hadoop jar ./parquet-tools-1.9.0.jar --help
WARNING: Use "yarn jar" to launch YARN applications.
usage: parquet-tools cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools head [option...] <input>
where option is one of:
       --debug          Enable debug output
    -h,--help           Show this help string
    -n,--records <arg>  The number of records to show (default: 5)
       --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show

usage: parquet-tools meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools dump [option...] <input>
where option is one of:
    -c,--column <arg>  Dump only the given column, can be specified more than once
    -d,--disable-data  Do not dump column data
       --debug         Enable debug output
    -h,--help          Show this help string
    -m,--disable-meta  Do not dump row group and page metadata
    -n,--disable-crop  Do not crop the output based on console width
       --no-color      Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
   <output> is the destination parquet file

View the schema:
hadoop jar ./parquet-tools-1.9.0.jar schema ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq
message schema {
  optional binary id;
  optional binary sn;
  optional binary mes_sn;
  optional binary line_code;
  optional binary section_code;
  optional binary station_code;
  optional binary station_slot;
  optional binary test_software_version;
  optional binary test_time;
  optional double elapsed_time;
  optional binary test_result;
  optional binary failitem;
  optional binary failitems;
  optional binary bg;
  optional binary bu;
  optional binary project_code;
  optional binary project_name;
}
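
Before merging, it can be worth confirming that the small files share one schema. The sketch below assumes the output of the schema command has already been saved to one text file per Parquet file; the function name schemas_match and the .schema file naming are made up for this illustration and are not part of parquet-tools.

```shell
#!/usr/bin/env bash
# Sketch: compare saved schema dumps, e.g. produced beforehand with
#   hadoop jar ./parquet-tools-1.9.0.jar schema FILE > FILE.schema
# (the .schema naming is an assumption, not a parquet-tools convention).

# schemas_match FIRST OTHER...
# Returns 0 when every OTHER dump is byte-identical to FIRST, 1 otherwise.
schemas_match() {
  local first="$1"
  shift
  local other
  for other in "$@"; do
    if ! cmp -s "$first" "$other"; then
      echo "schema differs: $other" >&2
      return 1
    fi
  done
  return 0
}
```

A call such as schemas_match a.schema b.schema c.schema returns a non-zero status at the first dump that differs from a.schema and names it on stderr.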

View the contents:
hadoop jar ./parquet-tools-1.9.0.jar head -n 10 ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

Merge the small Parquet files. The source files are not deleted; a new merged file is created:
hadoop jar ./parquet-tools-1.9.0.jar merge ./2024-7-26/ /tmp/all.parquet
Merge result (the first column is the file size, the second the space consumed across all replicas):
hdfs dfs -du -h /tmp/all.parquet
280.6 M 841.7 M /tmp/all.parquet
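
The merge and size check above can be wrapped in a small function that previews the command before running it. This is a minimal sketch: PARQUET_TOOLS_JAR, DRY_RUN, and merge_parquet_dir are hypothetical names introduced here, and only the merge and du commands already shown in this note are used.

```shell
#!/usr/bin/env bash
# Sketch: wrap the merge command so it can be previewed before it runs.
# PARQUET_TOOLS_JAR and DRY_RUN are hypothetical variables for this example.

PARQUET_TOOLS_JAR="${PARQUET_TOOLS_JAR:-./parquet-tools-1.9.0.jar}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command only, 0 = execute it

# merge_parquet_dir SRC_DIR DEST_FILE
merge_parquet_dir() {
  local src_dir="$1" dest_file="$2"
  if [ -z "$src_dir" ] || [ -z "$dest_file" ]; then
    echo "usage: merge_parquet_dir SRC_DIR DEST_FILE" >&2
    return 2
  fi
  if [ "$DRY_RUN" = "1" ]; then
    # Preview only: show the command that would be executed.
    echo "hadoop jar $PARQUET_TOOLS_JAR merge $src_dir $dest_file"
  else
    # Merge, then report the size of the merged file.
    hadoop jar "$PARQUET_TOOLS_JAR" merge "$src_dir" "$dest_file" &&
      hdfs dfs -du -h "$dest_file"
  fi
}

# Preview the merge from this note; set DRY_RUN=0 to actually run it.
merge_parquet_dir ./2024-7-26/ /tmp/all.parquet
```

Defaulting to a dry run keeps the sketch safe to paste; since the source files are left in place, removing them after verifying the merged file remains a separate manual step.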
