Video course: https://www.bilibili.com/video/BV1WY4y197g7
Course materials: https://pan.baidu.com/s/15KpnWeKpvExpKmOC8xjmtQ?pwd=5ay8

Hadoop Introductory Learning Notes (Collected)

Table of contents

4. MapReduce Framework Configuration and YARN Deployment
    4.1. Configuring MapReduce and YARN
    4.2. YARN Cluster Start/Stop Scripts
        4.2.1. One-command start/stop scripts
        4.2.2. Starting/stopping individual processes
    4.3. Submitting Sample MapReduce Programs to YARN
        4.3.1. Submitting the wordcount word-count sample program
        4.3.2. Submitting the Monte Carlo pi-estimation sample program

4. MapReduce Framework Configuration and YARN Deployment
The cluster for this YARN deployment consists of three servers (virtual machines), planned as follows:
| Host | Services deployed |
| --- | --- |
| node1 | ResourceManager, NodeManager, ProxyServer, JobHistoryServer |
| node2 | NodeManager |
| node3 | NodeManager |
MapReduce runs on top of YARN, so MapReduce only needs to be configured, while YARN needs to be both deployed and started.
4.1. Configuring MapReduce and YARN
1. On node1, edit the mapred-env.sh file:
```bash
# Enter the Hadoop configuration directory
cd /export/server/hadoop-3.3.4/etc/hadoop/
# Open the mapred-env.sh file
vim mapred-env.sh
```

After opening it, add the following to the file:
```bash
# Set the JDK path
export JAVA_HOME=/export/server/jdk
# Set the JobHistoryServer process heap to 1 GB
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000
# Set the log level to INFO
export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA
```

2. Next, edit the mapred-site.xml configuration file in the same directory and add the following inside its `<configuration>` tag:

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>node1:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>node1:19888</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/data/mr-history/tmp</value>
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/data/mr-history/done</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
```

Where:

- mapreduce.framework.name: the framework MapReduce runs on; set to yarn here.
- mapreduce.jobhistory.address: the history server's communication address and port, node1:10020 here.
- mapreduce.jobhistory.webapp.address: the history server's web address and port, node1:19888 here.
- mapreduce.jobhistory.intermediate-done-dir: the HDFS path where in-progress job history is staged, /data/mr-history/tmp here.
- mapreduce.jobhistory.done-dir: the HDFS path where completed job history is kept, /data/mr-history/done here.
- yarn.app.mapreduce.am.env: sets the MapReduce home for the ApplicationMaster to the same path as HADOOP_HOME.
- mapreduce.map.env: sets the MapReduce home for map tasks to the same path as HADOOP_HOME.
- mapreduce.reduce.env: sets the MapReduce home for reduce tasks to the same path as HADOOP_HOME.
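A broken angle bracket in these files is an easy mistake to make, so a quick well-formedness check can save a confusing startup failure later; a minimal sketch, assuming xmllint happens to be installed on the node:

```bash
# Verify the edited file is well-formed XML (assumes xmllint is available)
xmllint --noout mapred-site.xml && echo "mapred-site.xml OK"
# Spot-check a key without opening the editor
grep -A1 'mapreduce.framework.name' mapred-site.xml
```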
This completes the MapReduce configuration.
3. Next, configure YARN. On node1, edit the yarn-env.sh file:
```bash
# Enter the Hadoop configuration directory
cd /export/server/hadoop-3.3.4/etc/hadoop/
# Open the yarn-env.sh file
vim yarn-env.sh
```

Add the following to the file:
```bash
# Set the JDK path environment variable
export JAVA_HOME=/export/server/jdk
# Set the HADOOP_HOME environment variable
export HADOOP_HOME=/export/server/hadoop
# Set the configuration directory environment variable
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Set the log directory environment variable
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
```

4. Edit the yarn-site.xml configuration file in the same directory and add the following inside its `<configuration>` node:

```xml
<!-- Site specific YARN configuration properties -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://node1:19888/jobhistory/logs</value>
</property>
<property>
  <name>yarn.web-proxy.address</name>
  <value>node1:8089</value>
  <description>proxy server hostname and port</description>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
  <description>Configuration to enable or disable log aggregation</description>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
  <description>Configuration to enable or disable log aggregation</description>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>node1</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/nm-local</value>
  <description>Comma-separated list of paths on the local filesystem where intermediate data is written.</description>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data/nm-log</value>
  <description>Comma-separated list of paths on the local filesystem where logs are written.</description>
</property>
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>10800</value>
  <description>Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled.</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Shuffle service that needs to be set for Map Reduce applications.</description>
</property>
```

Core settings:

- yarn.resourcemanager.hostname: the node the ResourceManager runs on, node1 here.
- yarn.nodemanager.local-dirs: the local (Linux filesystem) path where the NodeManager stores intermediate data.
- yarn.nodemanager.log-dirs: the local (Linux filesystem) path where the NodeManager stores logs.
- yarn.nodemanager.aux-services: enables the Shuffle service for MapReduce programs.

Additional settings:

- yarn.log.server.url: the URL of the history server.
- yarn.web-proxy.address: the host and port of the proxy server.
- yarn.log-aggregation-enable: whether log aggregation is enabled.
- yarn.nodemanager.remote-app-log-dir: the HDFS path where application logs are stored.
- yarn.resourcemanager.scheduler.class: the scheduler YARN uses; the Fair Scheduler is chosen here.
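Since yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs point at local paths that may not exist yet, it helps to pre-create them on every node with the right owner; a minimal sketch, run as root, assuming the hadoop user runs the cluster processes as elsewhere in these notes:

```bash
# Pre-create the NodeManager local and log directories and hand them to the hadoop user
mkdir -p /data/nm-local /data/nm-log
chown -R hadoop:hadoop /data/nm-local /data/nm-log
```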
5. After completing the configuration above, distribute the MapReduce and YARN configuration files to the same location on node2 and node3. As the hadoop user, run:
```bash
# Copy the four configuration files mapred-env.sh, mapred-site.xml, yarn-env.sh and yarn-site.xml to the same path on node2
scp mapred-env.sh mapred-site.xml yarn-env.sh yarn-site.xml node2:`pwd`/
# Copy the same four configuration files to the same path on node3
scp mapred-env.sh mapred-site.xml yarn-env.sh yarn-site.xml node3:`pwd`/
```
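Equivalently, the distribution can be written as a loop, which scales better if more worker nodes are added later; a sketch under the same assumptions (passwordless SSH between nodes, identical directory layout everywhere):

```bash
# Distribute the four configuration files to every worker node
for host in node2 node3; do
  scp mapred-env.sh mapred-site.xml yarn-env.sh yarn-site.xml "$host":`pwd`/
done
```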
4.2. YARN Cluster Start/Stop Scripts

Before starting the YARN cluster, make sure the HDFS cluster is already running. Likewise, the YARN cluster must be started and stopped as the hadoop user.
4.2.1. One-command start/stop scripts
Running `$HADOOP_HOME/sbin/start-yarn.sh` (or simply `start-yarn.sh`) starts the YARN cluster with one command. It:

- starts the ResourceManager on the machine named by yarn.resourcemanager.hostname in yarn-site.xml;
- starts a NodeManager on every host listed in the workers file;
- starts the ProxyServer (proxy server) on the current machine.

After the command runs, checking the processes with `jps` shows that ResourceManager, NodeManager and WebAppProxyServer have all started. The HistoryServer still needs to be started separately, using the `mapred --daemon start historyserver` command described in a later section. With that, the whole YARN cluster is up. The YARN cluster monitoring page, i.e. the ResourceManager web UI, can now be reached at http://node1:8088/.
`$HADOOP_HOME/sbin/stop-yarn.sh` (or `stop-yarn.sh`) stops the YARN cluster with one command. Once the YARN cluster is configured and deployed, you can stop the YARN cluster, stop the JobHistoryServer, stop the HDFS cluster and shut down the virtual machines, then take a snapshot of each VM to preserve the current environment.
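The startup order matters: HDFS first, then YARN, then the history server. A minimal wrapper capturing that order; this is a sketch, not part of the course scripts, and it assumes the standard Hadoop sbin and bin directories are on the PATH:

```bash
#!/bin/bash
# Bring the whole stack up in dependency order: HDFS -> YARN -> JobHistoryServer
start-dfs.sh                         # HDFS must be running before YARN starts
start-yarn.sh                        # ResourceManager, NodeManagers, ProxyServer
mapred --daemon start historyserver  # JobHistoryServer (on node1)

# Stopping reverses the order:
# mapred --daemon stop historyserver
# stop-yarn.sh
# stop-dfs.sh
```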
4.2.2. Starting/stopping individual processes
On any single machine, individual processes can be started or stopped with the following command:
```bash
$HADOOP_HOME/bin/yarn --daemon start|stop resourcemanager|nodemanager|proxyserver
```

`start` and `stop` select whether the process is started or stopped; the three controllable processes are resourcemanager, nodemanager and proxyserver. For example:
```bash
# Start the ResourceManager on node1
yarn --daemon start resourcemanager
# Start a NodeManager on node1, node2 and node3 respectively
yarn --daemon start nodemanager
# Start the WebProxyServer on node1
yarn --daemon start proxyserver
```

The history server (JobHistoryServer) is started and stopped with:
```bash
$HADOOP_HOME/bin/mapred --daemon start|stop historyserver
```

Usage:

```bash
# Start the JobHistoryServer
mapred --daemon start historyserver
# Stop the JobHistoryServer
mapred --daemon stop historyserver
```
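After starting or stopping daemons, `jps` is a quick way to confirm what is actually running on a node; a minimal check, filtering on the process names as they appear in jps output:

```bash
# List the YARN- and history-related JVMs on this node
jps | grep -E 'ResourceManager|NodeManager|WebAppProxyServer|JobHistoryServer'
```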
4.3. Submitting Sample MapReduce Programs to YARN

As a resource scheduling and management framework, YARN itself supplies the resources that many kinds of programs run on. Common ones include:
- MapReduce programs
- Spark programs
- Flink programs
Hadoop ships with a set of pre-built example MapReduce programs, stored in the $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar file. These programs can be submitted to YARN with the `hadoop jar` command, whose syntax is:

```bash
hadoop jar <program file> <Java class name> [program argument] ... [program argument]
```
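Running the examples jar with no program name makes its driver print the list of valid program names (wordcount, pi and others), which is a handy way to see what is available:

```bash
# With no arguments, the example driver lists the valid program names
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar
```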
4.3.1. Submitting the wordcount word-count sample program

1. What the program does:
Given an input path (HDFS) and an output path (HDFS), it counts the words in the data under the input path and writes the result to the output path.
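For intuition, the same computation can be done locally with shell tools on small data, which is also a convenient way to double-check the job's output later; a sketch, assuming words.txt (created in the next step) is in the current directory:

```bash
# Local equivalent of wordcount: split on spaces, count duplicates, print word<TAB>count
tr -s ' ' '\n' < words.txt | sort | uniq -c | awk '{printf "%s\t%s\n", $2, $1}'
```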
2. Prepare a data file to count and upload it to HDFS.

Create a words.txt file locally on Linux with `vim words.txt`, containing:

```
itheima itcast itheima itcast hadoop hdfs hadoop hdfs hadoop mapreduce hadoop yarn itheima hadoop itcast hadoop itheima itcast hadoop yarn mapreduce
```

- Create an input folder in the HDFS root to hold the file to be counted: `hdfs dfs -mkdir -p /input`
- Create an output folder in the HDFS root to hold the results: `hdfs dfs -mkdir -p /output`
- Upload the local words.txt into HDFS: `hdfs dfs -put words.txt /input`

(The same steps are collected into a single script below.)
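The whole preparation as one runnable sketch. The original file had six lines (the job log below reports Map input records=6), but the exact line breaks were lost, so the six-line split here is a guess; it does not affect the word counts:

```bash
# Create the sample file locally (six lines; the split is illustrative)
cat > words.txt <<'EOF'
itheima itcast itheima itcast hadoop
hdfs hadoop hdfs
hadoop mapreduce hadoop yarn
itheima hadoop
itcast hadoop itheima itcast
hadoop yarn mapreduce
EOF

# Create the input/output folders in HDFS and upload the file
hdfs dfs -mkdir -p /input /output
hdfs dfs -put -f words.txt /input
```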
3. Submit the MapReduce program with the following command:
```bash
hadoop jar /export/server/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount hdfs://node1:8020/input/ hdfs://node1:8020/output/wc
```

Where:

- `hadoop jar` submits a Java program to YARN;
- /export/server/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar is the path of the program being submitted;
- `wordcount` is the name of the Java class (example program) to run;
- hdfs://node1:8020/input/ is argument 1: for this program, the folder whose contents are counted. The hdfs:// scheme marks it explicitly as an HDFS path; testing shows it can be omitted, in which case the HDFS filesystem is assumed by default;
- hdfs://node1:8020/output/wc is argument 2: for this program, the folder the counts are written to. Again, the hdfs:// scheme can be omitted. Make sure this folder does not already exist, otherwise the job fails.

The run log is shown below:
```
[hadoop@node1 ~]$ hadoop jar /export/server/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount hdfs://node1:8020/input hdfs://node1:8020/output/wc
2023-12-14 15:31:53,988 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node1/192.168.88.101:8032
2023-12-14 15:31:55,818 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1702538855741_0001
2023-12-14 15:31:56,752 INFO input.FileInputFormat: Total input files to process : 1
2023-12-14 15:31:57,040 INFO mapreduce.JobSubmitter: number of splits:1
2023-12-14 15:31:57,607 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1702538855741_0001
2023-12-14 15:31:57,607 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-12-14 15:31:58,167 INFO conf.Configuration: resource-types.xml not found
2023-12-14 15:31:58,170 INFO resource.ResourceUtils: Unable to find resource-types.xml.
2023-12-14 15:31:59,119 INFO impl.YarnClientImpl: Submitted application application_1702538855741_0001
2023-12-14 15:31:59,406 INFO mapreduce.Job: The url to track the job: http://node1:8089/proxy/application_1702538855741_0001/
2023-12-14 15:31:59,407 INFO mapreduce.Job: Running job: job_1702538855741_0001
2023-12-14 15:32:23,043 INFO mapreduce.Job: Job job_1702538855741_0001 running in uber mode : false
2023-12-14 15:32:23,045 INFO mapreduce.Job: map 0% reduce 0%
2023-12-14 15:32:37,767 INFO mapreduce.Job: map 100% reduce 0%
2023-12-14 15:32:50,191 INFO mapreduce.Job: map 100% reduce 100%
2023-12-14 15:32:51,220 INFO mapreduce.Job: Job job_1702538855741_0001 completed successfully
2023-12-14 15:32:51,431 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=84
        FILE: Number of bytes written=553527
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=248
        HDFS: Number of bytes written=54
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=11593
        Total time spent by all reduces in occupied slots (ms)=9650
        Total time spent by all map tasks (ms)=11593
        Total time spent by all reduce tasks (ms)=9650
        Total vcore-milliseconds taken by all map tasks=11593
        Total vcore-milliseconds taken by all reduce tasks=9650
        Total megabyte-milliseconds taken by all map tasks=11871232
        Total megabyte-milliseconds taken by all reduce tasks=9881600
    Map-Reduce Framework
        Map input records=6
        Map output records=21
        Map output bytes=233
        Map output materialized bytes=84
        Input split bytes=98
        Combine input records=21
        Combine output records=6
        Reduce input groups=6
        Reduce shuffle bytes=84
        Reduce input records=6
        Reduce output records=6
        Spilled Records=12
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=300
        CPU time spent (ms)=2910
        Physical memory (bytes) snapshot=353423360
        Virtual memory (bytes) snapshot=5477199872
        Total committed heap usage (bytes)=196218880
        Peak Map Physical memory (bytes)=228843520
        Peak Map Virtual memory (bytes)=2734153728
        Peak Reduce Physical memory (bytes)=124579840
        Peak Reduce Virtual memory (bytes)=2743046144
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=150
    File Output Format Counters
        Bytes Written=54
```
4. Check the results
After the job completes, `hadoop fs -ls /output/wc` lists the files the job wrote, and `hadoop fs -cat /output/wc/part-r-00000` prints the actual results.
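Worked out from the sample words.txt above (21 words, 6 distinct), the result file should contain one tab-separated word/count pair per line, sorted by key:

```
hadoop	7
hdfs	2
itcast	4
itheima	4
mapreduce	2
yarn	2
```

To re-run the job, delete the output folder first (`hdfs dfs -rm -r /output/wc`), since it must not exist when the job starts.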
Besides that, on the YARN cluster monitoring page at http://node1:8088/, clicking the Applications menu on the left shows the job that was just run. Clicking the job's ID opens the job detail page. Clicking the Logs link of a given stage shows the client logs for that stage; because log aggregation was enabled in yarn-site.xml, this page is actually served by the JobHistoryServer (port 19888). Clicking the History link on the job detail page shows the job's historical run state, including its Map and Reduce tasks; you can drill into each Map or Reduce task to view its logs and related information, which is very helpful when troubleshooting a failing program.
4.3.2. Submitting the Monte Carlo pi-estimation sample program
1. Submit the program:
```bash
hadoop jar /export/server/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 3 1000
```

Where:

- `hadoop jar` submits a Java program to YARN;
- /export/server/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar is the path of the program being submitted;
- `pi` is the name of the Java class (example program) to run;
- 3 means 3 map tasks are used;
- 1000 is the number of samples; more samples give a more accurate estimate of pi, at the cost of a longer run.

The run log is shown below:
```
[hadoop@node1 ~]$ hadoop jar /export/server/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 3 1000
Number of Maps 3
Samples per Map 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Starting Job
2023-12-14 16:06:12,042 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node1/192.168.88.101:8032
2023-12-14 16:06:13,550 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1702538855741_0002
2023-12-14 16:06:13,888 INFO input.FileInputFormat: Total input files to process : 3
2023-12-14 16:06:14,149 INFO mapreduce.JobSubmitter: number of splits:3
2023-12-14 16:06:14,658 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1702538855741_0002
2023-12-14 16:06:14,659 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-12-14 16:06:15,065 INFO conf.Configuration: resource-types.xml not found
2023-12-14 16:06:15,065 INFO resource.ResourceUtils: Unable to find resource-types.xml.
2023-12-14 16:06:15,256 INFO impl.YarnClientImpl: Submitted application application_1702538855741_0002
2023-12-14 16:06:15,403 INFO mapreduce.Job: The url to track the job: http://node1:8089/proxy/application_1702538855741_0002/
2023-12-14 16:06:15,404 INFO mapreduce.Job: Running job: job_1702538855741_0002
2023-12-14 16:06:32,155 INFO mapreduce.Job: Job job_1702538855741_0002 running in uber mode : false
2023-12-14 16:06:32,156 INFO mapreduce.Job: map 0% reduce 0%
2023-12-14 16:06:47,156 INFO mapreduce.Job: map 67% reduce 0%
2023-12-14 16:06:50,188 INFO mapreduce.Job: map 100% reduce 0%
2023-12-14 16:06:57,275 INFO mapreduce.Job: map 100% reduce 100%
2023-12-14 16:06:58,328 INFO mapreduce.Job: Job job_1702538855741_0002 completed successfully
2023-12-14 16:06:58,589 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=72
        FILE: Number of bytes written=1108329
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=786
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=17
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=3
        Launched reduce tasks=1
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=39354
        Total time spent by all reduces in occupied slots (ms)=7761
        Total time spent by all map tasks (ms)=39354
        Total time spent by all reduce tasks (ms)=7761
        Total vcore-milliseconds taken by all map tasks=39354
        Total vcore-milliseconds taken by all reduce tasks=7761
        Total megabyte-milliseconds taken by all map tasks=40298496
        Total megabyte-milliseconds taken by all reduce tasks=7947264
    Map-Reduce Framework
        Map input records=3
        Map output records=6
        Map output bytes=54
        Map output materialized bytes=84
        Input split bytes=432
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=84
        Reduce input records=6
        Reduce output records=0
        Spilled Records=12
        Shuffled Maps =3
        Failed Shuffles=0
        Merged Map outputs=3
        GC time elapsed (ms)=699
        CPU time spent (ms)=11980
        Physical memory (bytes) snapshot=775233536
        Virtual memory (bytes) snapshot=10945183744
        Total committed heap usage (bytes)=466890752
        Peak Map Physical memory (bytes)=227717120
        Peak Map Virtual memory (bytes)=2734153728
        Peak Reduce Physical memory (bytes)=113000448
        Peak Reduce Virtual memory (bytes)=2742722560
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=354
    File Output Format Counters
        Bytes Written=97
Job Finished in 46.895 seconds
Estimated value of Pi is 3.14133333333333333333
```

2. Check the run. On the YARN cluster monitoring page, the job's History information shows that this job used 3 map tasks and 1 reduce task, and the corresponding run logs can be viewed there as well.
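For reference, the estimator behind the `pi` example, in standard Monte Carlo terms: each map task scatters its sample points over a unit square and counts how many fall inside the inscribed circle; since the ratio of the circle's area to the square's is pi/4, pi is estimated as

$$\pi \approx 4 \cdot \frac{N_{\text{inside}}}{N_{\text{total}}}$$

With 3 × 1000 = 3000 samples, an estimate like 3.14133 is in line with what this method can deliver; raising the sample count tightens it. (The bundled example class is actually named QuasiMonteCarlo: it draws its points from a low-discrepancy Halton sequence rather than a pseudo-random generator, which converges faster than plain random sampling.)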