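This article is a follow-up to 《docker下,极速搭建spark集群(含hdfs集群)》 (quickly building a Spark cluster, including an HDFS cluster, under Docker). The previous article got the Spark cluster running and verified it briefly, but a few small problems remain:
- Spark has only one worker node, so it only suits jobs with small data volumes; jobs with large data volumes take much longer.
- The HDFS data directory sits on the same disk as the Docker installation directory, so uploads are likely to fail for lack of disk space when many files are stored.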
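- Port 4040 on the master and the web UI ports of the workers (8081 and up in this setup) are not all published, so the running state of jobs, stages, and executors cannot be inspected.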
Today we adjust the configuration to solve these problems.
The original docker-compose.yml
Before optimization, docker-compose.yml looks like this:
version: "2.2"
services:
  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
    container_name: namenode
    volumes:
      - hadoop_namenode:/hadoop/dfs/name
      - ./input_files:/input_files
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    ports:
      - 50070:50070
  resourcemanager:
    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8
    container_name: resourcemanager
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env
  historyserver:
    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8
    container_name: historyserver
    depends_on:
      - namenode
      - datanode1
      - datanode2
    volumes:
      - hadoop_historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
  nodemanager1:
    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8
    container_name: nodemanager1
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env
  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - hadoop_datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - hadoop_datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - hadoop_datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars
  worker:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf
      - ./data:/tmp/data
volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:
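Now let's start optimizing.
Environment
The machine used for this exercise is a Lenovo laptop:
- CPU: i5-6300HQ (4 cores, 4 threads)
- Memory: 16 GB
- Disk: 256 GB NVMe plus a 500 GB hard disk drive
- OS: Deepin 15
- docker: 18.09.1
- docker-compose: 1.17.1
- spark: 2.3.0
- hdfs: 2.7.1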
Adjusting the number of worker nodes
Since the machine has 16 GB of memory, I decided to raise the number of worker nodes from 1 to 6. The worker container configuration after the change is:
  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data
  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data
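Note the volumes parameter above: each worker maps its own subdirectory under the conf and data directories that sit next to docker-compose.yml. Only worker1 and worker2 are shown here; worker3 to worker6 follow the same pattern.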
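A quick budget check makes the choice of six workers more concrete. The sketch below is plain arithmetic on the numbers given in this article (a 16 GB, 4-core host; 2 cores and 2g per worker). The function name and the returned fields are made up for illustration.

```python
def worker_budget(workers, cores_each, mem_gb_each, host_cores, host_mem_gb):
    """Summarise what a set of identical Spark workers claims from the host."""
    total_cores = workers * cores_each
    total_mem = workers * mem_gb_each
    return {
        "worker_cores": total_cores,
        "worker_mem_gb": total_mem,
        # Memory left for the OS, the Hadoop containers and the Spark master.
        "mem_left_gb": host_mem_gb - total_mem,
        # Advertised worker cores divided by physical cores.
        "core_oversubscription": total_cores / host_cores,
    }

budget = worker_budget(workers=6, cores_each=2, mem_gb_each=2,
                       host_cores=4, host_mem_gb=16)
print(budget)
```

With these inputs the workers claim 12 GB of the 16 GB and advertise 12 cores on a 4-core CPU, so the cores are oversubscribed three times over. That is worth keeping in mind when reading the timings later in the article.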
Running out of disk space because of the HDFS data directory
- First, look at how the HDFS data directory is configured:
    volumes:
      - hadoop_datanode1:/hadoop/dfs/data
- The hadoop_datanode1 volume above is declared at the very bottom of docker-compose.yml, with default settings:
volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:
- While the container is running, run docker inspect datanode1 to view the container details. The part related to volumes is:
"Mounts": [
    {
        "Type": "volume",
        "Name": "temp_hadoop_datanode1",
        "Source": "/var/lib/docker/volumes/temp_hadoop_datanode1/_data",
        "Destination": "/hadoop/dfs/data",
        "Driver": "local",
        "Mode": "rw",
        "RW": true,
        "Propagation": ""
    }
]
So the data directory of the HDFS container lives under /var/lib/docker/volumes on the host.
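Reading the mount information by eye works for one container, but with three datanodes a small script can help. The following is a minimal sketch that parses the JSON printed by docker inspect and lists where each mount lives on the host. The sample string is the output shown above, wrapped in the array-of-objects shape that docker inspect prints; the function name list_mounts is an assumption for illustration.

```python
import json

# Trimmed sample of `docker inspect datanode1` output, taken from this article.
SAMPLE = '''
[{"Mounts": [{"Type": "volume", "Name": "temp_hadoop_datanode1",
  "Source": "/var/lib/docker/volumes/temp_hadoop_datanode1/_data",
  "Destination": "/hadoop/dfs/data", "Driver": "local",
  "Mode": "rw", "RW": true, "Propagation": ""}]}]
'''

def list_mounts(inspect_json):
    """Return (type, host source, container destination) for every mount."""
    containers = json.loads(inspect_json)
    return [
        (m["Type"], m["Source"], m["Destination"])
        for c in containers
        for m in c.get("Mounts", [])
    ]

mounts = list_mounts(SAMPLE)
for kind, source, dest in mounts:
    print(f"{kind}: {source} -> {dest}")
```

On a real system the input would come from running docker inspect and capturing its standard output.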
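- Check disk space with df -m, as shown below. The device /dev/nvme0n1p3, which holds /var/lib/docker/volumes, has only about 29 GB available (29561 MB). That is clearly not enough for storing a large number of files, especially since HDFS keeps 3 replicas of every block by default: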
root@willzhao-deepin:/data/work/spark/temp# df -m
Filesystem     1M-blocks   Used Available Use% Mounted on
udev                7893      0      7893   0% /dev
tmpfs               1584      4      1581   1% /run
/dev/nvme0n1p3     43927  12107     29561  30% /
tmpfs               7918      0      7918   0% /dev/shm
tmpfs                  5      1         5   1% /run/lock
tmpfs               7918      0      7918   0% /sys/fs/cgroup
/dev/nvme0n1p4     87854    181     83169   1% /home
/dev/nvme0n1p1       300      7       293   3% /boot/efi
/dev/sda1         468428 109152    335430  25% /data
tmpfs               1584      1      1584   1% /run/user/108
tmpfs               1584      0      1584   0% /run/user/0
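- The disk information above shows that /dev/sda1 still has more than 300 GB available, so mapping the HDFS data directories onto /dev/sda1 relieves the space problem. docker-compose.yml lives under /data (see the shell prompt above), which is mounted from /dev/sda1, so a relative bind mount such as ./hadoop/datanode1 ends up on that device. Change the three datanode services in docker-compose.yml as follows:
  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
Then delete the named-volume declarations below. The namenode and historyserver services also reference these named volumes, so in the final file they are switched to ./hadoop/namenode and ./hadoop/historyserver in the same way:
volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver: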
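To see why the default replication factor matters here, the sketch below divides the available space by the number of replicas. It assumes that all three datanodes store their blocks on the same physical disk (true in this single-host setup) and ignores everything else that competes for the space. The function name is hypothetical, and the figures are the "Available" values from the df -m output above.

```python
def usable_hdfs_mb(available_mb, replication=3):
    """Rough upper bound on unique data that fits when every block is
    stored `replication` times on the same disk."""
    return available_mb // replication

# "Available" column (MB) from the df -m output in this article.
root_mb = 29561    # /dev/nvme0n1p3, which holds /var/lib/docker/volumes
data_mb = 335430   # /dev/sda1, mounted at /data

root_usable = usable_hdfs_mb(root_mb)
data_usable = usable_hdfs_mb(data_mb)
print(root_usable, "MB of unique data on /")
print(data_usable, "MB of unique data on /data")
```

Under these assumptions the root partition holds a little under 10 GB of unique data, while /data holds roughly 110 GB.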
Opening port 4040 on the master and the web UI ports on the workers
- While a job is running, a web UI for inspecting the details gives a fuller and more intuitive picture of what is happening, so the configuration has to be changed to open the ports.
- As shown below, 4040 is added under expose, which declares port 4040 on the container, and 4040:4040 is added under ports, which maps container port 4040 to host port 4040. Port 4040 serves the application UI of the driver, which runs in the master container because spark-submit is executed there:
  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars
- The worker web ports must be opened as well. The worker web page shows the worker's status and gives access to the task logs (this is important). Because there are several workers, each one has to be mapped to a different host port. In the configuration below, worker1 sets environment.SPARK_WORKER_WEBUI_PORT to 8081, exposes 8081, and maps container port 8081 to host port 8081; worker2 sets environment.SPARK_WORKER_WEBUI_PORT to 8082, exposes 8082, and maps container port 8082 to host port 8082:
  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data
  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data
worker3 to worker6 are configured the same way; make sure each one uses a different port number.
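Writing six nearly identical blocks by hand invites copy-and-paste mistakes in the port numbers. As an illustration only, the following sketch renders the worker services from a template, giving worker N the web UI port 8080 + N as in the configuration above. The template text is copied from the worker1 block; the function name render_workers and its parameters are assumptions, not part of the original setup.

```python
WORKER_TEMPLATE = """\
  worker{n}:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker{n}
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker{n}
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: {ui_port}
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - {ui_port}
    ports:
      - {ui_port}:{ui_port}
    volumes:
      - ./conf/worker{n}:/conf
      - ./data/worker{n}:/tmp/data
"""

def render_workers(count, first_ui_port=8081):
    """Render `count` worker service blocks with consecutive web UI ports."""
    return "".join(
        WORKER_TEMPLATE.format(n=i, ui_port=first_ui_port + i - 1)
        for i in range(1, count + 1)
    )

workers_yaml = render_workers(6)
print(workers_yaml)
```

The printed text can be pasted under services: in docker-compose.yml. The two-space indentation of the template matches a service nested directly under that key.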
That completes the changes. The final docker-compose.yml is:
version: "2.2"
services:
  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
    container_name: namenode
    volumes:
      - ./hadoop/namenode:/hadoop/dfs/name
      - ./input_files:/input_files
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    ports:
      - 50070:50070
  resourcemanager:
    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8
    container_name: resourcemanager
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env
  historyserver:
    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8
    container_name: historyserver
    depends_on:
      - namenode
      - datanode1
      - datanode2
    volumes:
      - ./hadoop/historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
  nodemanager1:
    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8
    container_name: nodemanager1
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env
  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars
  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data
  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data
  worker3:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker3
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker3
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8083
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8083
    ports:
      - 8083:8083
    volumes:
      - ./conf/worker3:/conf
      - ./data/worker3:/tmp/data
  worker4:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker4
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker4
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8084
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8084
    ports:
      - 8084:8084
    volumes:
      - ./conf/worker4:/conf
      - ./data/worker4:/tmp/data
  worker5:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker5
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker5
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8085
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8085
    ports:
      - 8085:8085
    volumes:
      - ./conf/worker5:/conf
      - ./data/worker5:/tmp/data
  worker6:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker6
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker6
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8086
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8086
    ports:
      - 8086:8086
    volumes:
      - ./conf/worker6:/conf
      - ./data/worker6:/tmp/data
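With this many services, two of them can easily end up publishing the same host port, in which case the second container fails to start. The sketch below is a rough sanity check, not a YAML parser: it scans the text for short-form HOST:CONTAINER port lines of the kind used in this file and reports host ports that occur more than once. The function name and the regular expression are assumptions chosen for this particular layout.

```python
import re
from collections import Counter

# Matches short-form port mappings such as "- 8081:8081" or '- "8081:8081"'.
# Volume lines like "- ./data:/tmp/data" do not match because they are not numeric.
PORT_LINE = re.compile(r'^\s*-\s*"?(\d+):(\d+)"?\s*$')

def duplicate_host_ports(compose_text):
    """Return the host ports that are published more than once."""
    counts = Counter()
    for line in compose_text.splitlines():
        match = PORT_LINE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return sorted(port for port, n in counts.items() if n > 1)

# A deliberately broken fragment: host port 8081 is published twice.
sample = """
    ports:
      - 8081:8081
    ports:
      - 8081:8082
    ports:
      - 4040:4040
"""
duplicates = duplicate_host_ports(sample)
print(duplicates)
```

Running the function on the contents of the real docker-compose.yml should report no duplicates if every worker was given its own port.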
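Next, we run a real job to verify the setup.
Verification
- In the directory that contains docker-compose.yml, create a file named hadoop.env with the following content: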
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource___tracker_address=resourcemanager:8031
- Once docker-compose.yml has been updated, start the containers with:
docker-compose up -d
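- The Spark application used for this verification analyzes Wikipedia's site statistics and finds the most visited pages. This exercise uses a prebuilt jar, so no coding is involved; for the source code and development details, see 《spark实战之:分析维基百科网站统计数据(java版)》.
- Download the prebuilt Spark application jar from GitHub: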
wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/sparkdemo-1.0-SNAPSHOT.jar
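- Download the Wikipedia page-count data set from GitHub. Only one file is downloaded here; to practice with more data, see 《寻找海量数据集用于大数据开发实战(维基百科网站统计数据)》 for how to obtain more files: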
wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/pagecounts-20160801-000000
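- Put the downloaded sparkdemo-1.0-SNAPSHOT.jar into the jars directory next to docker-compose.yml.
- Inside the input_files directory next to docker-compose.yml, create a directory named input, then put the downloaded pagecounts-20160801-000000 file into it.
- Run the following command to copy the whole input directory into HDFS: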
docker exec namenode hdfs dfs -put /input_files/input /
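- Run the following command to submit a job that requests 12 executor cores in total (all the cores of the six workers) and 1 GB of memory per executor: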
docker exec -it master spark-submit \
--class com.bolingcavalry.sparkdemo.app.WikiRank \
--executor-memory 1g \
--total-executor-cores 12 \
/root/jars/sparkdemo-1.0-SNAPSHOT.jar \
namenode \
8020
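How the 12 requested cores turn into executors depends on the scheduler. The sketch below uses a deliberately simplified model of the standalone scheduler: it assumes spark.executor.cores is left unset, so each worker hosts a single executor that takes all of that worker's free cores until the --total-executor-cores budget is used up. The function name and the model itself are illustrative assumptions, not Spark's actual allocation code.

```python
def plan_executors(total_cores, executor_mem_gb, workers):
    """workers: list of (cores, mem_gb) per worker.

    Returns the number of cores granted to the executor on each worker that
    gets one. Assumes one executor per worker, taking every free core there,
    capped by the remaining total-cores budget.
    """
    remaining = total_cores
    plan = []
    for cores, mem_gb in workers:
        if remaining <= 0 or mem_gb < executor_mem_gb:
            continue  # budget exhausted, or the worker cannot fit the executor
        granted = min(cores, remaining)
        plan.append(granted)
        remaining -= granted
    return plan

# Six workers with 2 cores and 2 GB each, as configured in docker-compose.yml.
plan = plan_executors(total_cores=12, executor_mem_gb=1, workers=[(2, 2)] * 6)
print(len(plan), "executors,", sum(plan), "cores")
```

Under this model the command above yields one 2-core executor on each of the six workers, each using 1 GB of the 2 GB its worker offers.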
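- The state of the host machine is shown below; CPU and memory are both fully used:
- The host's IP address is 192.168.1.102. The cluster status page is at http://192.168.1.102:8080/
- The stages of the job are shown in the figure below. This information is essential for learning and mastering Spark. The address is http://192.168.1.102:4040
- The overview of worker1 is shown in the figure below. The address is http://192.168.1.102:8081
- To view the application logs on worker1, click the link in the red box in the figure below. At first the page fails to load. The URL behind it is "http://localhost:8081/logPage?appId=app-20190216081637-0002&executorId=5&logType=stdout"; it is generated by the page, and replacing "localhost" with the host's IP address makes it work:
- The corrected link opens, and the application log looks like the figure below; the red box marks the log lines written by the application code: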
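The manual fix for the log link can also be scripted. This is a small sketch that replaces only the host name of a URL and keeps the port, path, and query string unchanged. The helper name with_host is made up; the URL and the IP address are the ones from this article.

```python
from urllib.parse import urlsplit, urlunsplit

def with_host(url, host):
    """Return `url` with its hostname replaced, keeping port, path and query."""
    parts = urlsplit(url)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))

log_url = ("http://localhost:8081/logPage"
           "?appId=app-20190216081637-0002&executorId=5&logType=stdout")
fixed = with_host(log_url, "192.168.1.102")
print(fixed)
```

The links presumably contain localhost because SPARK_PUBLIC_DNS is set to localhost in docker-compose.yml. Setting it to the host's IP address might make the generated links work directly, but that was not tested in this article.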
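That is the whole optimization and verification process. You can tune the parameters to your own machine to make full use of its capacity.
Later I used 24 files of 300 MB each as the data set, about 150 million records in total. Running the command above on the hardware described above took 30 minutes to finish, as shown in the figure below: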
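For a rough sense of scale, the figures from that run translate into throughput as follows. The inputs are the approximate numbers quoted above (24 files, 300 MB each, about 150 million records, 30 minutes), so the results are only estimates.

```python
files = 24
file_mb = 300
records = 150_000_000   # approximate, as stated in the article
seconds = 30 * 60

total_mb = files * file_mb
mb_per_second = total_mb / seconds
records_per_second = round(records / seconds)

print(total_mb, "MB in total")
print(mb_per_second, "MB/s")
print(records_per_second, "records/s")
```

That works out to about 7 GB of input processed at around 4 MB/s on a single laptop whose four cores are shared by all the containers.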