The distcp Command in Hadoop

说文科技

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/liu16659/article/details/86481832

1. What is the `distcp` command?

Hadoop comes with a useful program called distcp for copying data to and from Hadoop filesystems in parallel.

2. How is `distcp` implemented?

distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. There are no reducers.

Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations.

By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp.
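The bucketing idea can be illustrated with a small shell sketch. This is a simplified model under stated assumptions, not DistCp's actual code: the function name `bucket_files`, the file sizes, and the map count are all made up for illustration. Files are never split, which matches the statement that each file is copied by a single map.

```shell
#!/bin/sh
# Simplified illustration of size-based bucketing; NOT DistCp's real code.
# The sizes (bytes) and the map count below are hypothetical values.

# bucket_files MAPS SIZE...
# Prints one line per bucket: "map <n>: <sizes> (total <bytes>)".
bucket_files() {
    maps=$1; shift
    total=0
    for size in "$@"; do total=$((total + size)); done
    # Target number of bytes per map, rounded up.
    target=$(( (total + maps - 1) / maps ))

    n=1; current=0; members=""
    for size in "$@"; do
        # Open a new bucket when this file would push the current bucket
        # past the target, unless the last allowed bucket is already open.
        if [ "$current" -gt 0 ] && [ "$n" -lt "$maps" ] \
            && [ $((current + size)) -gt "$target" ]; then
            echo "map $n:$members (total $current)"
            n=$((n + 1)); current=0; members=""
        fi
        members="$members $size"
        current=$((current + size))
    done
    if [ "$current" -gt 0 ]; then
        echo "map $n:$members (total $current)"
    fi
}

bucket_files 3 400 150 300 250 500 100
# map 1: 400 150 (total 550)
# map 2: 300 250 (total 550)
# map 3: 500 100 (total 600)
```

With distcp itself, the upper bound on the number of maps would be set with the `-m` argument, for example `hadoop distcp -m 3 dir1 dir2`.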

3. How do you use the `distcp` command?

hadoop distcp dir1 dir2

If dir2 does not exist, it will be created, and the contents of the dir1 directory will be copied there.
If dir2 already exists, then dir1 will be copied under it, creating the directory structure dir2/dir1. If this isn’t what you want, you can supply the -overwrite option to keep the same directory structure and force files to be overwritten.
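The destination rule above can be written down as a small helper. This is only a sketch of the rule as described in the text, not DistCp's own logic; `resolve_target` is a hypothetical name, and the `/test` and `/lawson` paths are borrowed from the example in section 5.

```shell
#!/bin/sh
# Sketch of the destination rule described above; NOT DistCp's own logic.
# resolve_target SRC_DIR DST_DIR DST_EXISTS OVERWRITE FILE
#   DST_EXISTS and OVERWRITE are "yes" or "no".
resolve_target() {
    src=$1; dst=$2; dst_exists=$3; overwrite=$4; file=$5
    if [ "$dst_exists" = "yes" ] && [ "$overwrite" = "no" ]; then
        # Existing target, no -overwrite: the source directory is nested.
        echo "$dst/$(basename "$src")/$file"
    else
        # New target, or -overwrite given: files land directly in the target.
        echo "$dst/$file"
    fi
}

resolve_target /test /lawson no  no  users.txt    # /lawson/users.txt
resolve_target /test /lawson yes no  distcp.txt   # /lawson/test/distcp.txt
resolve_target /test /lawson yes yes distcp.txt   # /lawson/distcp.txt
```

The third case corresponds to running `hadoop distcp -overwrite dir1 dir2`. DistCp also has an `-update` option, which keeps the same flat layout but copies only files that have changed.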

4. What is `distcp` used for?

A very common use case for distcp is for transferring data between two HDFS clusters.
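As a rough sketch of what such a transfer might look like, the script below assembles an inter-cluster command and prints it for review rather than running it. The NameNode host names `namenode1` and `namenode2` and the path `/foo` are placeholders, not values from this article.

```shell
#!/bin/sh
# Hypothetical NameNode hosts and path; replace them with your own.
SRC="hdfs://namenode1/foo"
DST="hdfs://namenode2/foo"

# -update copies only files that have changed.
# -delete removes files from the target that are absent from the source
#         (it is valid only together with -update or -overwrite).
# -p      preserves file status attributes such as permissions.
CMD="hadoop distcp -update -delete -p $SRC $DST"

# Print the command for review instead of executing it.
echo "$CMD"
```

Because `-delete` removes data on the target, it seems prudent to try the command against a test directory first.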

5. Example

Before running the command
Check what is stored in HDFS.
Run the command `hadoop distcp /test /lawson`.
This command copies the contents of the `/test` directory into `/lawson`.

[root@server4 hadoop]# hadoop distcp /test /lawson
19/01/14 18:08:17 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[/test], targetPath=/lawson, targetPathExists=false, preserveRawXattrs=false}
19/01/14 18:08:17 INFO client.RMProxy: Connecting to ResourceManager at server4/192.168.211.4:8032
19/01/14 18:08:17 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
19/01/14 18:08:17 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
19/01/14 18:08:17 INFO client.RMProxy: Connecting to ResourceManager at server4/192.168.211.4:8032
19/01/14 18:08:18 INFO mapreduce.JobSubmitter: number of splits:2
19/01/14 18:08:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547460247783_0001
19/01/14 18:08:19 INFO impl.YarnClientImpl: Submitted application application_1547460247783_0001
19/01/14 18:08:19 INFO mapreduce.Job: The url to track the job: http://server4:8088/proxy/application_1547460247783_0001/
19/01/14 18:08:19 INFO tools.DistCp: DistCp job-id: job_1547460247783_0001
19/01/14 18:08:19 INFO mapreduce.Job: Running job: job_1547460247783_0001
19/01/14 18:08:31 INFO mapreduce.Job: Job job_1547460247783_0001 running in uber mode : false
19/01/14 18:08:31 INFO mapreduce.Job:  map 0% reduce 0%
19/01/14 18:08:42 INFO mapreduce.Job:  map 50% reduce 0%
19/01/14 18:08:43 INFO mapreduce.Job:  map 100% reduce 0%
19/01/14 18:08:44 INFO mapreduce.Job: Job job_1547460247783_0001 completed successfully
19/01/14 18:08:44 INFO mapreduce.Job: Counters: 33
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=218172
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=889
		HDFS: Number of bytes written=30
		HDFS: Number of read operations=28
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=7
	Job Counters 
		Launched map tasks=2
		Other local map tasks=2
		Total time spent by all maps in occupied slots (ms)=15439
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=15439
		Total vcore-milliseconds taken by all map tasks=15439
		Total megabyte-milliseconds taken by all map tasks=15809536
	Map-Reduce Framework
		Map input records=2
		Map output records=0
		Input split bytes=268
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=293
		CPU time spent (ms)=740
		Physical memory (bytes) snapshot=201781248
		Virtual memory (bytes) snapshot=4150059008
		Total committed heap usage (bytes)=93454336
	File Input Format Counters 
		Bytes Read=591
	File Output Format Counters 
		Bytes Written=0
	org.apache.hadoop.tools.mapred.CopyMapper$Counter
		BYTESCOPIED=30
		BYTESEXPECTED=30
		COPY=2

After running the command

A `/lawson` directory has been created, and it contains a `users.txt` file.
Next, rename the file under `/test` to `distcp.txt` and run the same command again:

[root@server4 hadoop]# hadoop distcp /test /lawson
19/01/14 18:14:17 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[/test], targetPath=/lawson, targetPathExists=true, preserveRawXattrs=false}
19/01/14 18:14:17 INFO client.RMProxy: Connecting to ResourceManager at server4/192.168.211.4:8032
19/01/14 18:14:18 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
19/01/14 18:14:18 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
19/01/14 18:14:18 INFO client.RMProxy: Connecting to ResourceManager at server4/192.168.211.4:8032
19/01/14 18:14:19 INFO mapreduce.JobSubmitter: number of splits:2
19/01/14 18:14:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547460247783_0002
19/01/14 18:14:19 INFO impl.YarnClientImpl: Submitted application application_1547460247783_0002
19/01/14 18:14:19 INFO mapreduce.Job: The url to track the job: http://server4:8088/proxy/application_1547460247783_0002/
19/01/14 18:14:19 INFO tools.DistCp: DistCp job-id: job_1547460247783_0002
19/01/14 18:14:19 INFO mapreduce.Job: Running job: job_1547460247783_0002
19/01/14 18:14:26 INFO mapreduce.Job: Job job_1547460247783_0002 running in uber mode : false
19/01/14 18:14:26 INFO mapreduce.Job:  map 0% reduce 0%
19/01/14 18:14:33 INFO mapreduce.Job:  map 50% reduce 0%
19/01/14 18:14:35 INFO mapreduce.Job:  map 100% reduce 0%
19/01/14 18:14:35 INFO mapreduce.Job: Job job_1547460247783_0002 completed successfully
19/01/14 18:14:35 INFO mapreduce.Job: Counters: 33
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=218164
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=906
		HDFS: Number of bytes written=30
		HDFS: Number of read operations=29
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=7
	Job Counters 
		Launched map tasks=2
		Other local map tasks=2
		Total time spent by all maps in occupied slots (ms)=11509
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=11509
		Total vcore-milliseconds taken by all map tasks=11509
		Total megabyte-milliseconds taken by all map tasks=11785216
	Map-Reduce Framework
		Map input records=2
		Map output records=0
		Input split bytes=266
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=100
		CPU time spent (ms)=520
		Physical memory (bytes) snapshot=203853824
		Virtual memory (bytes) snapshot=4150059008
		Total committed heap usage (bytes)=93454336
	File Input Format Counters 
		Bytes Read=610
	File Output Format Counters 
		Bytes Written=0
	org.apache.hadoop.tools.mapred.CopyMapper$Counter
		BYTESCOPIED=30
		BYTESEXPECTED=30
		COPY=2

List the contents of `/lawson`:

[root@server4 hadoop]# hadoop fs -ls /lawson
Found 2 items
drwxr-xr-x   - root supergroup          0 2019-01-14 18:14 /lawson/test
-rw-r--r--   3 root supergroup         30 2019-01-14 18:08 /lawson/users.txt


[root@server4 hadoop]# hadoop fs -ls /lawson/test
Found 1 items
-rw-r--r--   3 root supergroup         30 2019-01-14 18:14 /lawson/test/distcp.txt

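A `test` directory has been created under `/lawson`, and it contains a `distcp.txt` file. This matches the behavior described in section 3: on the second run `/lawson` already existed (the log shows `targetPathExists=true`), so `/test` was copied under it as `/lawson/test`.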