Installing DataX on Windows 10 (JMzz's blog, CSDN)
DataX/userGuid.md at master · alibaba/DataX · GitHub
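Prerequisites:
1. JDK (1.8 or later; 1.8 recommended)
2. (a) Python (Python 2.7.x recommended)
(b) Python 3.x.x users can download the package below and replace the bundled files
With Python 3, three files under the bin folder of the install directory need to be replaced
Replacement files: Baidu Netdisk link, extraction code: re42
3. Apache Maven 3.x (to compile DataX); not needed if you install the prebuilt package
Installing Python is not covered here; download and install it yourself.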
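Before going further it can help to confirm that the JDK on PATH is new enough. The following is a minimal sketch, not part of DataX: the helper names parse_java_major and installed_java_major are made up for illustration, and the script assumes Python 3.7 or later (it will not run under the Python 2.7 that DataX recommends).

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_172" means Java 8; newer scheme: "11.0.2" means Java 11.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` if java is on PATH; return the major version or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` prints its banner to stderr, so check that stream first.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("java not found on PATH, or its version could not be parsed")
    else:
        print("Java major version:", major, "(DataX needs 8 or later)")
```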
1. Download and extract
Note: I used Python 3.0.X and did not replace the Python files
Install directory
E:\DATAX\datax
Run all scripts from E:\DATAX\datax\bin
Open cmd and enter:
e:
cd E:\DATAX\datax\bin
2. Self-check script
python datax.py ../job/job.json
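3. Practice example: read data from a stream and print it to the console
Step 1: create the job configuration file (JSON format)
You can print a configuration template with: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}
For example: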
python datax.py -r streamreader -w streamwriter
This returns:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
Fill in the template to get the following JSON:
stream2stream.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,你好,世界-DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
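Hand-editing nested JSON makes it easy to drop a brace. One option is to build the same structure as a Python dict and let the json module serialize it. This is only a sketch: build_stream_job is a hypothetical helper, not something DataX provides.

```python
import json


def build_stream_job(record_count, channels):
    """Build a streamreader -> streamwriter job as a plain dict."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": record_count,
                        "column": [
                            {"type": "long", "value": "10"},
                            {"type": "string", "value": "hello,你好,世界-DataX"},
                        ],
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "UTF-8", "print": True},
                },
            }],
            "setting": {"speed": {"channel": channels}},
        }
    }


job = build_stream_job(record_count=10, channels=5)
# ensure_ascii=False keeps the Chinese characters readable in the output.
text = json.dumps(job, indent=4, ensure_ascii=False)
print(text)
```

To save the result, write text to a file such as stream2stream.json opened with encoding="utf-8".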
Step 2: start DataX
$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json
python datax.py ../job/stream2stream.json
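If you launch jobs often, the command above can be wrapped in a small script. The sketch below is an assumption-laden illustration: build_datax_command and run_job are hypothetical names, "python" is assumed to be on PATH, and the install path in the usage note is the one used in this post.

```python
import os
import subprocess


def build_datax_command(datax_home, job_file, python_exe="python"):
    """Return the argument list for running one DataX job."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [python_exe, launcher, job_file]


def run_job(datax_home, job_file):
    """Run the job from the bin directory and return the launcher's exit code."""
    cmd = build_datax_command(datax_home, job_file)
    # cwd is set to bin so that relative paths like ../job/x.json resolve
    # the same way as when typing the command by hand in that directory.
    completed = subprocess.run(cmd, cwd=os.path.join(datax_home, "bin"))
    return completed.returncode
```

For example, run_job(r"E:\DATAX\datax", "../job/stream2stream.json") would start the job shown above.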
When the sync finishes, the log looks like this:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2023-03-16 13:52:50.773 [main] INFO MessageSource - JVM TimeZone: GMT+08:00, Locale: zh_CN
2023-03-16 13:52:50.776 [main] INFO MessageSource - use Locale: zh_CN timeZone: sun.util.calendar.ZoneInfo[id="GMT+08:00",offset=28800000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]
2023-03-16 13:52:50.786 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2023-03-16 13:52:50.791 [main] INFO Engine - the machine info =>
    osInfo: Oracle Corporation 1.8 25.172-b11
    jvmInfo: Windows 10 amd64 10.0
    cpu num: 8

    totalPhysicalMemory: -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names [PS MarkSweep, PS Scavenge]

    MEMORY_NAME | allocation_size | init_size
    PS Eden Space | 256.00MB | 256.00MB
    Code Cache | 240.00MB | 2.44MB
    Compressed Class Space | 1,024.00MB | 0.00MB
    PS Survivor Space | 42.50MB | 42.50MB
    PS Old Gen | 683.00MB | 683.00MB
    Metaspace | -0.00MB | 0.00MB

2023-03-16 13:52:50.815 [main] INFO Engine - {"content":[{"reader":{"name":"streamreader","parameter":{"column":[{"type":"long","value":"10"},{"type":"string","value":"hello,你好,世界-DataX"}],"sliceRecordCount":10}},"writer":{"name":"streamwriter","parameter":{"encoding":"UTF-8","print":true}}}],"setting":{"speed":{"channel":5}}}
2023-03-16 13:52:50.833 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2023-03-16 13:52:50.835 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2023-03-16 13:52:50.835 [main] INFO JobContainer - DataX jobContainer starts job.
2023-03-16 13:52:50.837 [main] INFO JobContainer - Set jobId = 0
2023-03-16 13:52:50.855 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2023-03-16 13:52:50.856 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2023-03-16 13:52:50.857 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2023-03-16 13:52:50.857 [job-0] INFO JobContainer - jobContainer starts to do split ...
2023-03-16 13:52:50.858 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels.
2023-03-16 13:52:50.859 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2023-03-16 13:52:50.859 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2023-03-16 13:52:50.880 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2023-03-16 13:52:50.889 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2023-03-16 13:52:50.892 [job-0] INFO JobContainer - Running by standalone Mode.
2023-03-16 13:52:50.900 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2023-03-16 13:52:50.905 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2023-03-16 13:52:50.906 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2023-03-16 13:52:50.916 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2023-03-16 13:52:50.919 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2023-03-16 13:52:50.923 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2023-03-16 13:52:50.929 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2023-03-16 13:52:50.933 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2023-03-16 13:52:51.049 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[130]ms
2023-03-16 13:52:51.049 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[120]ms
2023-03-16 13:52:51.050 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[135]ms
2023-03-16 13:52:51.052 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[129]ms
2023-03-16 13:52:51.052 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[119]ms
2023-03-16 13:52:51.053 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2023-03-16 13:53:00.918 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.008s | Percentage 100.00%
2023-03-16 13:53:00.919 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2023-03-16 13:53:00.923 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2023-03-16 13:53:00.923 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2023-03-16 13:53:00.923 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2023-03-16 13:53:00.924 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: E:\DATAX\datax\hook
2023-03-16 13:53:00.925 [job-0] INFO JobContainer -
    [total cpu info] =>
        averageCpu | maxDeltaCpu | minDeltaCpu
        -1.00% | -1.00% | -1.00%

    [total gc info] =>
        NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
        PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
        PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s

2023-03-16 13:53:00.925 [job-0] INFO JobContainer - PerfTrace not enable!
2023-03-16 13:53:00.926 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.008s | Percentage 100.00%
2023-03-16 13:53:00.927 [job-0] INFO JobContainer -
任务启动时刻 : 2023-03-16 13:52:50
任务结束时刻 : 2023-03-16 13:53:00
任务总计耗时 : 10s
任务平均流量 : 95B/s
记录写入速度 : 5rec/s
读出记录总数 : 50
读写失败总数 : 0
The seven summary lines at the end report, in order: job start time, job end time, total elapsed time, average throughput, record write speed, total records read, and total read/write failures.
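The totals in the log can be checked by hand. In this run the reader split into 5 tasks (one per channel) and each task emitted sliceRecordCount records; the short calculation below reproduces the record count and the two speed figures from those inputs. It describes this particular run only, not a general rule for every reader.

```python
channels = 5             # "channel": 5 in stream2stream.json
slice_record_count = 10  # "sliceRecordCount": 10
total_bytes = 950        # from the "Total 50 records, 950 bytes" log line
elapsed_seconds = 10     # total elapsed time from the job summary

total_records = channels * slice_record_count
records_per_second = total_records // elapsed_seconds
bytes_per_second = total_bytes // elapsed_seconds

# Prints: 50 5 95
print(total_records, records_per_second, bytes_per_second)
```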
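4. Real-world configuration
Only the MySQL plugins were tested here; the others still need to be explored
The job file format for each plugin is documented at https://github.com/alibaba/DataX/
Since I am learning this alongside ClickHouse, note that the page lists write support for ClickHouse
1. mysqlreader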
DataX/mysqlreader/doc/mysqlreader.md at master · alibaba/DataX · GitHub
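The table/column mode and the querySql mode conflict with each other; use only one of them
Key points:
jdbcUrl can list several URLs; they are checked for validity in order
table can list several tables, which must share the same schema structure; table must be placed inside the connection block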
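The "only one mode" rule is easy to check before a job is submitted. The function below is a hypothetical pre-flight check written for this post, not DataX code; it only looks at whether the keys table and querySql appear inside the connection entries of a mysqlreader parameter block.

```python
def reader_mode(parameter):
    """Classify a mysqlreader "parameter" block as "table" or "querySql" mode.

    Raises ValueError when the connections mix both modes or use neither.
    """
    connections = parameter.get("connection", [])
    has_table = any("table" in conn for conn in connections)
    has_query = any("querySql" in conn for conn in connections)
    if has_table and has_query:
        raise ValueError("table and querySql are both set; use only one")
    if not has_table and not has_query:
        raise ValueError("neither table nor querySql is set")
    return "table" if has_table else "querySql"


table_style = {"column": ["ryxm", "rysfz"],
               "connection": [{"table": ["sys_czry"], "jdbcUrl": ["jdbc:mysql://..."]}]}
print(reader_mode(table_style))
```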
1. Configure a job that extracts data from a MySQL database and syncs it to the local console:
Using the table/column mode
mysql2stream1.json
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "sa",
                        "column": ["ryxm", "rysfz"],
                        "splitPk": "id",
                        "connection": [
                            {
                                "table": ["sys_czry"],
                                "jdbcUrl": ["jdbc:mysql://172.16.0.101:3306/qyjx_v3.1?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=GMT%2B8&&nullCatalogMeansCurrent=true&allowMultiQueries=true&rewriteBatchedStatements=true"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": true
                    }
                }
            }
        ]
    }
}
python datax.py ../job/mysql2stream1.json
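The jdbcUrl above packs many options into one long string, which makes typos hard to spot. The sketch below splits it with Python's generic URL query parser to list the options one per line. This is only a reading aid: the MySQL driver has its own parsing rules, which may differ from urllib's.

```python
from urllib.parse import parse_qs

jdbc_url = (
    "jdbc:mysql://172.16.0.101:3306/qyjx_v3.1"
    "?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull"
    "&useSSL=false&serverTimezone=GMT%2B8&&nullCatalogMeansCurrent=true"
    "&allowMultiQueries=true&rewriteBatchedStatements=true"
)

# Everything after the first "?" is the option list.
_, _, query = jdbc_url.partition("?")
# parse_qs decodes percent-escapes (%2B becomes "+") and, by default,
# skips the empty field produced by the doubled "&&".
params = {name: values[0] for name, values in parse_qs(query).items()}
for name, value in params.items():
    print(name, "=", value)
```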
2. Configure a job that syncs the result of a custom SQL query to the local console:
Using the querySql mode
mysql2stream2.json
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "sa",
                        "connection": [
                            {
                                "querySql": ["SELECT ryxm,rysfz,rygh from sys_czry;"],
                                "jdbcUrl": ["jdbc:mysql://172.16.0.101:3306/qyjx_v3.1?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=GMT%2B8&&nullCatalogMeansCurrent=true&allowMultiQueries=true&rewriteBatchedStatements=true"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": true,
                        "encoding": "UTF-8"
                    }
                }
            }
        ]
    }
}
python datax.py ../job/mysql2stream2.json
2. mysqlwriter
DataX/mysqlwriter/doc/mysqlwriter.md at master · alibaba/DataX · GitHub
1. This job imports data generated in memory into MySQL
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {
                                "value": "DataX",
                                "type": "string"
                            }
                        ],
                        "sliceRecordCount": 1000
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "root",
                        "column": ["name"],
                        "session": ["set session sql_mode='ANSI'"],
                        "preSql": ["delete from test"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk",
                                "table": ["test"]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
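In the job above the reader produces one field per record and the writer lists one target column. A mismatch between those two counts is a likely cause of a failed import, so a quick comparison before running can save a round trip. The helper column_counts below is hypothetical, and the dict is a trimmed copy of the configuration that keeps only the column-related keys.

```python
def column_counts(job):
    """Return (reader_columns, writer_columns) for the first content entry."""
    content = job["job"]["content"][0]
    reader_cols = content["reader"]["parameter"].get("column", [])
    writer_cols = content["writer"]["parameter"].get("column", [])
    return len(reader_cols), len(writer_cols)


# Trimmed copy of the stream-to-MySQL job: only the column settings are kept.
job = {
    "job": {
        "content": [{
            "reader": {
                "name": "streamreader",
                "parameter": {"column": [{"value": "DataX", "type": "string"}]},
            },
            "writer": {
                "name": "mysqlwriter",
                "parameter": {"column": ["name"]},
            },
        }]
    }
}

reader_n, writer_n = column_counts(job)
print("reader fields:", reader_n, "writer columns:", writer_n)
```

This check only applies when the reader lists its columns explicitly; a reader in querySql mode has no column key, so the count would have to come from the SQL itself.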
python datax.py ../job/stream2mysql1.json
Result:
2. This job imports data from MySQL (server 1) into MySQL (local)
It simulates a cross-server, cross-database environment
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "sa",
                        "connection": [
                            {
                                "querySql": ["SELECT ryxm,rysfz from sys_czry;"],
                                "jdbcUrl": ["jdbc:mysql://172.16.0.101:3306/qyjx_v3.1?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=GMT%2B8&&nullCatalogMeansCurrent=true&allowMultiQueries=true&rewriteBatchedStatements=true"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "root",
                        "column": ["name", "rysfz"],
                        "session": ["set session sql_mode='ANSI'"],
                        "preSql": ["delete from test"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk",
                                "table": ["test"]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
python datax.py ../job/mysql2mysql.json
Result:
Troubleshooting:
1. If the console output is garbled,
run CHCP 65001 first
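CHCP 65001 switches the console to the UTF-8 code page. A likely explanation for the garbling, assuming the console was on code page 936 (GBK), the usual default on Simplified Chinese Windows, is that UTF-8 bytes were being read as GBK. The snippet below reproduces that mismatch in isolation; it does not touch the console itself.

```python
text = "你好"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes with the wrong code page produces garbled characters.
garbled = utf8_bytes.decode("gbk", errors="replace")
# Reading them as UTF-8 (what code page 65001 selects) restores the text.
restored = utf8_bytes.decode("utf-8")

# Prints: True True
print(garbled != text, restored == text)
```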