The Strongest on Earth Series: Sqoop Installation and Usage


What is Sqoop

Sqoop (pronounced "skup") is an open-source tool mainly used to move data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and it can also export data from HDFS back into a relational database.

Installing and Configuring Sqoop

  • Sqoop installation package (extraction code: m18x)
  • Installation steps
  • Download and unpack
    1. Upload the package sqoop-1.4.6-cdh5.14.2.tar.gz to the virtual machine
    2. Unpack the Sqoop package into the target directory: tar -zxf sqoop-1.4.6-cdh5.14.2.tar.gz -C /opt/install/
    3. Rename the unpacked directory (e.g. to /opt/install/sqoop)
    4. Configure the environment variables in /etc/profile
    Add the following:
export SQOOP_HOME=/opt/install/sqoop
export PATH=$SQOOP_HOME/bin:$PATH

Make the changes take effect: source /etc/profile

  • Modify the configuration file
    As with most big-data frameworks, Sqoop's configuration files live in the conf directory
    under the Sqoop root directory.
    1. Rename the template configuration file: mv sqoop-env-template.sh sqoop-env.sh
    2. Edit sqoop-env.sh and set:
export HADOOP_COMMON_HOME=/opt/install/hadoop
export HADOOP_MAPRED_HOME=/opt/install/hadoop
export HIVE_HOME=/opt/install/hive
export ZOOKEEPER_HOME=/opt/install/zookeeper
export ZOOCFGDIR=/opt/install/zookeeper
export HBASE_HOME=/opt/install/hbase
  • Copy the JDBC driver and Hive-related jars
    Copy the following jars into Sqoop's lib directory (the packages are provided in the extra materials; jar packages, extraction code: loqp):

mysql-connector-java-5.1.27-bin.jar
hive-exec-1.1.0-cdh5.14.2.jar
java-json.jar
hive-common-1.1.0-cdh5.14.2.jar

For example:

cp mysql-connector-java-5.1.27-bin.jar /opt/install/sqoop/lib/
  • Verify Sqoop
We can check whether Sqoop is configured correctly by running a command: sqoop help. A few warnings appear (omitted here), followed by the help output:

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
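
As a further check that the JDBC driver and connection settings work, list-databases can be pointed at the MySQL instance used in the examples below; the host, port, and root/root credentials here are simply the ones assumed throughout this post:
sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--driver com.mysql.jdbc.Driver \
--username root \
--password root
If everything is wired up correctly, this should print the databases visible to the root user.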

Using Sqoop

  1. Preparation
    Create the database and load the tables in MySQL:
    mysql> create database retail_db;

mysql> use retail_db;

mysql> source /root/data/sqoop/retail_db.sql

mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)
  2. Import the customers table into HDFS with Sqoop:
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--table customers \
--username root \
--password root \
--target-dir /data/retail_db/customers \
--m 3
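
To confirm the import, list the target directory; with --m 3 there should be three part files (the exact part-file names can vary):
hdfs dfs -ls /data/retail_db/customers
hdfs dfs -cat /data/retail_db/customers/part-m-00000 | head -5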

  3. Filter rows with a where clause
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--table orders \
--where "order_id<100" \
--username root \
--password root \
--delete-target-dir \
--target-dir /data/retail_db/orders \
--m 3

  4. Select specific columns with --columns

sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--table customers \
--columns "customer_id,customer_fname,customer_lname" \
--username root \
--password root \
--delete-target-dir \
--target-dir /data/retail_db/customers \
--m 3

  5. Filter with a free-form query (--query)
Note that the query must contain the \$CONDITIONS placeholder; Sqoop replaces it with a split predicate for each mapper, and --split-by is required when more than one mapper is used.

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--driver com.mysql.jdbc.Driver \
--query "select * from orders where order_status!='CLOSED' and \$CONDITIONS" \
--username root \
--password ok \
--split-by order_id \
--delete-target-dir \
--target-dir /data1/retail_db/orders \
--m 3
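
Roughly speaking, Sqoop first runs a boundary query over the --split-by column (with $CONDITIONS replaced by (1 = 1)) and then gives each mapper one slice of that range. The queries below are only a sketch of that behaviour, and the boundary values are purely illustrative:
-- boundary query (illustrative)
SELECT MIN(order_id), MAX(order_id)
FROM (select * from orders where order_status!='CLOSED' and (1 = 1)) AS t1;
-- one of the three mappers then runs something like (values illustrative):
select * from orders where order_status!='CLOSED' and ( order_id >= 1 ) AND ( order_id < 22935 );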

  6. Incremental import

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--driver com.mysql.jdbc.Driver \
--table orders \
--username root \
--password ok \
--incremental append \
--check-column order_date \
--last-value '0' \
--target-dir /data1/sqoop1/orders \
--m 3
Sqoop logs the boundary query it generates from the incremental filter:
SELECT MIN(order_id), MAX(order_id) FROM orders WHERE ( order_date > '2013-07-24 00:00:00' AND order_date <= '2014-07-24 00:00:00.0' )
Insert a new row in MySQL and run the incremental import again:
insert into orders values(99999,'2015-05-30 00:00:00',11599,'CLOSED');
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--table orders \
--username root \
--password root \
--incremental append \
--check-column order_date \
--last-value '0' \
--target-dir /data/retail_db/orders \
--m 3
The logged boundary query for this run:
SELECT MIN(order_id), MAX(order_id) FROM orders WHERE ( order_date > '2014-07-24 00:00:00.0' AND order_date <= '2015-05-30 00:00:00.0' )
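
A quick way to check that the newly inserted order 99999 actually landed in HDFS (the part-file names can differ from run to run):
hdfs dfs -cat /data/retail_db/orders/part-m-* | grep 99999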
  7. Create a job
sqoop job --create mysql2hdfs \
-- import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--table orders \
--username root \
--password root \
--incremental append \
--check-column order_date \
--last-value '0' \
--target-dir /data/retail_db/orders \
--m 3
  8. List jobs
sqoop job --list
  9. Execute the job
sqoop job --exec mysql2hdfs
Insert another row in MySQL and execute the job again:
insert into orders values(999999,'2016-05-30 00:00:00',11599,'CLOSED');
The logged boundary query shows that only rows newer than the previously imported maximum are picked up:
SELECT MIN(order_id), MAX(order_id) FROM orders WHERE ( order_date > '2015-05-30 00:00:00.0' AND order_date <= '2016-05-30 00:00:00.0' )
After each successful run the job updates --last-value, filling in the maximum value seen in that run.
The initial '0' has no real meaning; it only guarantees that the first import starts from the smallest possible value. The job can then be scheduled to run once a day:
crontab -e
0 2 */1 * * sqoop job --exec mysql2hdfs
  10. Import data into Hive
First create the target database in Hive, then run the import:
hive -e "create database if not exists retail_db;"
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--table orders \
--username root \
--password root \
--hive-import \
--create-hive-table \
--hive-database retail_db \
--hive-table orders \
--m 3
This fails with an error: the output path already exists.
20/07/15 22:29:04 ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop1:9000/user/root/orders already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:143)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1325)
	at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:203)
	at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:176)
	at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:273)
	at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
	at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:513)
	at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Delete the existing directory:
[root@hadoop1 lib]# hdfs dfs -rmr hdfs://hadoop1:9000/user/root/orders
Re-running the import fails again, this time with:
20/07/15 22:26:56 ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
	at org.apache.sqoop.hive.HiveConfig.getHiveConf(HiveConfig.java:50)
	at org.apache.sqoop.hive.HiveImport.getHiveArgs(HiveImport.java:392)
	at org.apache.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:379)
	at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:337)
	at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:241)
	at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:530)
	at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
The missing class lives in hive-common-1.1.0-cdh5.14.2.jar, so copy it from Hive's lib directory into Sqoop's:
cp /opt/install/hive-1.1.0-cdh5.14.2/lib/hive-common-1.1.0-cdh5.14.2.jar /opt/install/sqoop-1.4.6-cdh5.14.2/lib/
Running the import once more produces yet another error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/shims/ShimLoader
	at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:370)
	at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:108)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.sqoop.hive.HiveConfig.getHiveConf(HiveConfig.java:44)
	at org.apache.sqoop.hive.HiveImport.getHiveArgs(HiveImport.java:392)
	at org.apache.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:379)
	at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:337)
	at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:241)
	at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:530)
	at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.shims.ShimLoader
The Hive shims jars are missing as well; copy them from Hive's lib directory too:
cp /opt/install/hive-1.1.0-cdh5.14.2/lib/hive-shims* /opt/install/sqoop-1.4.6-cdh5.14.2/lib/
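
With the missing jars in place the import should now succeed, and the result can be checked from Hive (database and table names as used above):
hive -e "use retail_db; select count(*) from orders;"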
  11. Import data into a Hive partition
Drop the existing Hive table first:
drop table if exists orders;
Then import:
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--query "select order_id,order_status from orders where order_date>='2013-11-03' and order_date <'2013-11-04' and \$CONDITIONS" \
--username root \
--password root \
--delete-target-dir \
--target-dir /data/retail_db/orders \
--split-by order_id \
--hive-import \
--hive-database retail_db \
--hive-table orders \
--hive-partition-key "order_date" \
--hive-partition-value "2013-11-03" \
--m 3
Note: the partition column cannot also be imported as an ordinary column of the table, which is why order_date only appears as the partition key here.
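
To confirm that the partition was created and populated (names as in the command above):
hive -e "show partitions retail_db.orders;"
hive -e "select * from retail_db.orders where order_date='2013-11-03' limit 5;"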
  12. Import data into HBase
1. Create the table in HBase:
create 'products','data','category'
2. Import with Sqoop:
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--username root \
--password root \
--table products \
--hbase-table products \
--column-family data \
--m 3
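
The imported rows can be spot-checked from the HBase shell (table and column family as created above):
echo "scan 'products', {LIMIT => 2}" | hbase shell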
  13. Export data from HDFS to MySQL
1. Create the target table in MySQL (an empty copy of customers):
create table customers_demo as select * from customers where 1=2;
2. Upload the data to HDFS:
hdfs dfs -mkdir /customerinput
hdfs dfs -put customers.csv /customerinput
3. Export the data:
sqoop export \
--connect jdbc:mysql://localhost:3306/retail_db \
--driver com.mysql.jdbc.Driver \
--username root \
--password root \
--table customers_demo \
--export-dir /customerinput \
--m 1
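
To verify the export, count the rows that arrived in MySQL:
mysql> select count(*) from customers_demo;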
  14. Sqoop options file
1. Write the options file job_RDBMS2HDFS.opt:
--------------------------------
import
--connect
jdbc:mysql://localhost:3306/retail_db
--driver
com.mysql.jdbc.Driver
--table
customers
--username
root
--password
root
--target-dir
/data/retail_db/customers
--delete-target-dir
--m
3
--------------------------------
2. Run it with --options-file:
sqoop --options-file job_RDBMS2HDFS.opt
