<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- Generated by Apache Maven Doxia at 2014-02-11 -->
<html
xmlns=
"http://www.w3.org/1999/xhtml"
>
<head>
<title>
Apache Hadoop Distributed Copy -
Usage
</title>
<style
type=
"text/css"
media=
"all"
>
@import
url
(
"./css/maven-base.css"
)
;
@import
url
(
"./css/maven-theme.css"
)
;
@import
url
(
"./css/site.css"
)
;
</style>
<link
rel=
"stylesheet"
href=
"./css/print.css"
type=
"text/css"
media=
"print"
/>
<meta
name=
"Date-Revision-yyyymmdd"
content=
"20140211"
/>
<meta
http-equiv=
"Content-Type"
content=
"text/html; charset=UTF-8"
/>
</head>
<body
class=
"composite"
>
<div
id=
"banner"
>
<a
href=
"http://hadoop.apache.org/"
id=
"bannerLeft"
>
<img
src=
"http://hadoop.apache.org/images/hadoop-logo.jpg"
alt=
""
/>
</a>
<a
href=
"http://www.apache.org/"
id=
"bannerRight"
>
<img
src=
"http://www.apache.org/images/asf_logo_wide.png"
alt=
""
/>
</a>
<div
class=
"clear"
>
<hr/>
</div>
</div>
<div
id=
"breadcrumbs"
>
<div
class=
"xleft"
>
<a
href=
"http://www.apache.org/"
class=
"externalLink"
>
Apache
</a>
>
<a
href=
"http://hadoop.apache.org/"
class=
"externalLink"
>
Hadoop
</a>
>
Apache Hadoop Distributed Copy
</div>
<div
class=
"xright"
>
<a
href=
"http://wiki.apache.org/hadoop"
class=
"externalLink"
>
Wiki
</a>
|
<a
href=
"https://svn.apache.org/repos/asf/hadoop/"
class=
"externalLink"
>
SVN
</a>
| Last Published: 2014-02-11
| Version: 2.3.0
</div>
<div
class=
"clear"
>
<hr/>
</div>
</div>
<div
id=
"leftColumn"
>
<div
id=
"navcolumn"
>
<h5>
General
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../index.html"
>
Overview
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/SingleCluster.html"
>
Single Node Setup
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/ClusterSetup.html"
>
Cluster Setup
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/CommandsManual.html"
>
Hadoop Commands Reference
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/FileSystemShell.html"
>
File System Shell
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/Compatibility.html"
>
Hadoop Compatibility
</a>
</li>
</ul>
<h5>
Common
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/CLIMiniCluster.html"
>
CLI Mini Cluster
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/NativeLibraries.html"
>
Native Libraries
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/Superusers.html"
>
Superusers
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/SecureMode.html"
>
Secure Mode
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html"
>
Service Level Authorization
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/HttpAuthentication.html"
>
HTTP Authentication
</a>
</li>
</ul>
<h5>
HDFS
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html"
>
HDFS User Guide
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html"
>
High Availability With QJM
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html"
>
High Availability With NFS
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/Federation.html"
>
Federation
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html"
>
HDFS Snapshots
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html"
>
HDFS Architecture
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html"
>
Edits Viewer
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html"
>
Image Viewer
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html"
>
Permissions and HDFS
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html"
>
Quotas and HDFS
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/Hftp.html"
>
HFTP
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/LibHdfs.html"
>
C API libhdfs
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/WebHDFS.html"
>
WebHDFS REST API
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-hdfs-httpfs/index.html"
>
HttpFS Gateway
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html"
>
Short Circuit Local Reads
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html"
>
Centralized Cache Management
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html"
>
HDFS NFS Gateway
</a>
</li>
</ul>
<h5>
MapReduce
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html"
>
Compatibility between Hadoop 1.x and Hadoop 2.x
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"
>
Encrypted Shuffle
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"
>
Pluggable Shuffle/Sort
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"
>
Distributed Cache Deploy
</a>
</li>
</ul>
<h5>
YARN
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/YARN.html"
>
YARN Architecture
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"
>
Writing YARN Applications
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html"
>
Capacity Scheduler
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/FairScheduler.html"
>
Fair Scheduler
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html"
>
Web Application Proxy
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/YarnCommands.html"
>
YARN Commands
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-sls/SchedulerLoadSimulator.html"
>
Scheduler Load Simulator
</a>
</li>
</ul>
<h5>
YARN REST APIs
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html"
>
Introduction
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html"
>
Resource Manager
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html"
>
Node Manager
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html"
>
MR Application Master
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html"
>
History Server
</a>
</li>
</ul>
<h5>
Auth
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/index.html"
>
Overview
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/Examples.html"
>
Examples
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/Configuration.html"
>
Configuration
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/BuildingIt.html"
>
Building
</a>
</li>
</ul>
<h5>
Reference
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/releasenotes.html"
>
Release Notes
</a>
</li>
<li
class=
"none"
>
<a
href=
"../api/index.html"
>
API docs
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/CHANGES.txt"
>
Common CHANGES.txt
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/CHANGES.txt"
>
HDFS CHANGES.txt
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-mapreduce/CHANGES.txt"
>
MapReduce CHANGES.txt
</a>
</li>
</ul>
<h5>
Configuration
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/core-default.xml"
>
core-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml"
>
hdfs-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml"
>
mapred-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-common/yarn-default.xml"
>
yarn-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/DeprecatedProperties.html"
>
Deprecated Properties
</a>
</li>
</ul>
<a
href=
"http://maven.apache.org/"
title=
"Built by Maven"
class=
"poweredBy"
>
<img
alt=
"Built by Maven"
src=
"./images/logos/maven-feather.png"
/>
</a>
</div>
</div>
<div
id=
"bodyColumn"
>
<div
id=
"contentBox"
>
<!-- Copyright 2002-2004 The Apache Software Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. -->
<div
class=
"section"
>
<h2>
Basic Usage
<a
name=
"Basic_Usage"
></a></h2>
<p>
The most common invocation of DistCp is an inter-cluster copy:
</p>
<p><tt>
bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/bar \
</tt><br
/>
<tt>
hdfs://nn2:8020/bar/foo
</tt></p>
<p>
This will expand the namespace under
<tt>
/foo/bar
</tt>
on nn1
into a temporary file, partition its contents among a set of map
tasks, and start a copy on each TaskTracker from nn1 to nn2.
</p>
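<p>
The expand-and-partition step can be sketched as follows. This is a
simplified, hypothetical model in Python (real DistCp writes the
listing to a temporary file and balances work internally); the
function name and the greedy bin-packing strategy are illustrative
assumptions, not DistCp's actual code.
</p>

```python
# Hypothetical sketch: split a (path, size) listing among map tasks,
# roughly balancing the total bytes each task must copy.
def partition_listing(files, num_maps):
    chunks = [[] for _ in range(num_maps)]
    totals = [0] * num_maps
    # Greedy bin-packing: assign each file (largest first) to the
    # chunk with the smallest byte total so far.
    for path, size in sorted(files, key=lambda f: -f[1]):
        i = totals.index(min(totals))
        chunks[i].append(path)
        totals[i] += size
    return chunks

listing = [("/foo/bar/a", 64), ("/foo/bar/b", 32), ("/foo/bar/c", 32)]
print(partition_listing(listing, 2))  # [['/foo/bar/a'], ['/foo/bar/b', '/foo/bar/c']]
```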
<p>
One can also specify multiple source directories on the command
line:
</p>
<p><tt>
bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/a \
</tt><br
/>
<tt>
hdfs://nn1:8020/foo/b \
</tt><br
/>
<tt>
hdfs://nn2:8020/bar/foo
</tt></p>
<p>
Or, equivalently, from a file using the
<tt>
-f
</tt>
option:
<br
/>
<tt>
bash$ hadoop jar hadoop-distcp.jar -f hdfs://nn1:8020/srclist \
</tt><br
/>
<tt>
hdfs://nn2:8020/bar/foo
</tt><br
/></p>
<p>
Where
<tt>
srclist
</tt>
contains
<br
/>
<tt>
hdfs://nn1:8020/foo/a
</tt><br
/>
<tt>
hdfs://nn1:8020/foo/b
</tt></p>
<p>
When copying from multiple sources, DistCp will abort the copy with
an error message if two sources collide, but collisions at the
destination are resolved per the
<a
href=
"#options"
>
options
</a>
specified. By default, files already existing at the destination are
skipped (i.e. not replaced by the source file). A count of skipped
files is reported at the end of each job, but it may be inaccurate if a
copier failed for some subset of its files, but succeeded on a later
attempt.
</p>
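<p>
The source-collision rule above can be modeled with a short sketch
(a hypothetical helper, not DistCp's implementation): two sources
collide when they would claim the same relative path at the
destination, and DistCp aborts rather than choose between them.
</p>

```python
# Hypothetical sketch of the source-collision check: report relative
# names claimed by more than one distinct source path.
def find_collisions(sources):
    seen = {}
    collisions = []
    for src in sources:
        rel = src.rstrip("/").rsplit("/", 1)[-1]  # last path component
        if rel in seen and seen[rel] != src:
            collisions.append(rel)
        seen.setdefault(rel, src)
    return collisions

# /foo/a and /bar/a both map to "a" at the destination -> DistCp aborts.
print(find_collisions(["hdfs://nn1:8020/foo/a", "hdfs://nn1:8020/bar/a"]))  # ['a']
```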
<p>
It is important that each TaskTracker can reach and communicate with
both the source and destination file systems. For HDFS, both the source
and destination must be running the same version of the protocol or use
a backwards-compatible protocol (see
<a
href=
"#cpver"
>
Copying Between
Versions
</a>
).
</p>
<p>
After a copy, it is recommended that one generate and cross-check
a listing of the source and destination to verify that the copy was
truly successful. Since DistCp employs both Map/Reduce and the
FileSystem API, issues in or between any of the three could adversely
and silently affect the copy. Some have had success running with
<tt>
-update
</tt>
enabled to perform a second pass, but users should
be acquainted with its semantics before attempting this.
</p>
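<p>
The recommended cross-check can be approximated by diffing listings of
the two sides. The sketch below assumes you have already collected
<tt>{relative_path: size}</tt> maps for source and destination (for
example from <tt>hadoop fs -ls -R</tt> output); it is an illustration,
not a substitute for checksum verification.
</p>

```python
# Hypothetical post-copy cross-check over {relative_path: size} listings.
def diff_listings(source, target):
    missing = sorted(p for p in source if p not in target)
    mismatched = sorted(p for p in source
                        if p in target and source[p] != target[p])
    return missing, mismatched

src = {"foo/bar/1": 32, "foo/bar/2": 64}
dst = {"foo/bar/1": 32}
print(diff_listings(src, dst))  # (['foo/bar/2'], [])
```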
<p>
It's also worth noting that if another client is still writing to a
source file, the copy will likely fail. Attempting to overwrite a file
being written at the destination should also fail on HDFS. If a source
file is (re)moved before it is copied, the copy will fail with a
FileNotFoundException.
</p>
<p>
Please refer to the detailed Command Line Reference for information
on all the options available in DistCp.
</p>
</div>
<div
class=
"section"
>
<h2>
Update and Overwrite
<a
name=
"Update_and_Overwrite"
></a></h2>
<p><tt>
-update
</tt>
is used to copy only those files from the source that are
missing at the target, or whose contents differ from the target's.
<tt>
-overwrite
</tt>
overwrites files at the target even if they already exist there, and
even if their contents match the source.
</p>
<p><br
/>
Update and Overwrite options warrant special attention, since their
handling of source-paths varies from the defaults in a very subtle manner.
Consider a copy from
<tt>
/source/first/
</tt>
and
<tt>
/source/second/
</tt>
to
<tt>
/target/
</tt>
, where the source
paths have the following contents:
</p>
<p><tt>
hdfs://nn1:8020/source/first/1
</tt><br
/>
<tt>
hdfs://nn1:8020/source/first/2
</tt><br
/>
<tt>
hdfs://nn1:8020/source/second/10
</tt><br
/>
<tt>
hdfs://nn1:8020/source/second/20
</tt><br
/></p>
<p><br
/>
When DistCp is invoked without
<tt>
-update
</tt>
or
<tt>
-overwrite
</tt>
, the DistCp defaults would create directories
<tt>
first/
</tt>
and
<tt>
second/
</tt>
, under
<tt>
/target
</tt>
.
Thus:
<br
/></p>
<p><tt>
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
</tt></p>
<p><br
/>
would yield the following contents in
<tt>
/target
</tt>
:
</p>
<p><tt>
hdfs://nn2:8020/target/first/1
</tt><br
/>
<tt>
hdfs://nn2:8020/target/first/2
</tt><br
/>
<tt>
hdfs://nn2:8020/target/second/10
</tt><br
/>
<tt>
hdfs://nn2:8020/target/second/20
</tt><br
/></p>
<p><br
/>
When either
<tt>
-update
</tt>
or
<tt>
-overwrite
</tt>
is
specified, the
<b>
contents
</b>
of the source-directories
are copied to target, and not the source directories themselves. Thus:
</p>
<p><tt>
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
</tt></p>
<p><br
/>
would yield the following contents in
<tt>
/target
</tt>
:
</p>
<p><tt>
hdfs://nn2:8020/target/1
</tt><br
/>
<tt>
hdfs://nn2:8020/target/2
</tt><br
/>
<tt>
hdfs://nn2:8020/target/10
</tt><br
/>
<tt>
hdfs://nn2:8020/target/20
</tt><br
/></p>
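<p>
The path-handling difference can be summarized in a small sketch
(hypothetical helper; the <tt>flatten</tt> parameter stands for
"either <tt>-update</tt> or <tt>-overwrite</tt> was given"): without a
flag the last component of each source directory is recreated under
the target, while with either flag only the directories' contents are
copied.
</p>

```python
# Hypothetical sketch of DistCp's source-path handling.
def target_path(source_dir, file_rel, target, flatten):
    src_name = source_dir.rstrip("/").rsplit("/", 1)[-1]
    if flatten:  # -update or -overwrite: copy directory *contents*
        return f"{target}/{file_rel}"
    return f"{target}/{src_name}/{file_rel}"  # default: keep dir name

print(target_path("hdfs://nn1:8020/source/first", "1", "/target", False))  # /target/first/1
print(target_path("hdfs://nn1:8020/source/first", "1", "/target", True))   # /target/1
```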
<p><br
/>
By extension, if both source folders contained a file with the same
name (say,
<tt>
0
</tt>
), then both sources would map an entry to
<tt>
/target/0
</tt>
at the destination. Rather than permit this conflict, DistCp will abort.
</p>
<p><br
/>
Now, consider the following copy operation:
</p>
<p><tt>
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
</tt></p>
<p><br
/>
With sources/sizes:
</p>
<p><tt>
hdfs://nn1:8020/source/first/1 32
</tt><br
/>
<tt>
hdfs://nn1:8020/source/first/2 32
</tt><br
/>
<tt>
hdfs://nn1:8020/source/second/10 64
</tt><br
/>
<tt>
hdfs://nn1:8020/source/second/20 32
</tt><br
/></p>
<p><br
/>
And destination/sizes:
</p>
<p><tt>
hdfs://nn2:8020/target/1 32
</tt><br
/>
<tt>
hdfs://nn2:8020/target/10 32
</tt><br
/>
<tt>
hdfs://nn2:8020/target/20 64
</tt><br
/></p>
<p><br
/>
The copy would yield:
</p>
<p><tt>
hdfs://nn2:8020/target/1 32
</tt><br
/>
<tt>
hdfs://nn2:8020/target/2 32
</tt><br
/>
<tt>
hdfs://nn2:8020/target/10 64
</tt><br
/>
<tt>
hdfs://nn2:8020/target/20 32
</tt><br
/></p>
<p><br
/><tt>
1
</tt>
is skipped because the file-length and contents match.
<tt>
2
</tt>
is copied because it doesn't exist at the target.
<tt>
10
</tt>
and
<tt>
20
</tt>
are overwritten since the contents
don't match the source.
</p>
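<p>
The per-file decision just described can be sketched as follows
(simplified and hypothetical: real DistCp compares file length and,
when needed, checksums; this model compares sizes only, standing in
for "contents differ"):
</p>

```python
# Hypothetical model of the per-file -update decision.
def update_action(rel, source_sizes, target_sizes):
    if rel not in target_sizes:
        return "copy"       # e.g. file 2: absent at the target
    if source_sizes[rel] != target_sizes[rel]:
        return "overwrite"  # e.g. files 10 and 20: contents differ
    return "skip"           # e.g. file 1: length and contents match

source = {"1": 32, "2": 32, "10": 64, "20": 32}
target = {"1": 32, "10": 32, "20": 64}
for name in sorted(source, key=int):
    print(name, update_action(name, source, target))
```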
<p>
If
<tt>
-overwrite
</tt>
is used,
<tt>
1
</tt>
is overwritten as well.
</p>
</div>
</div>
</div>
<div
class=
"clear"
>
<hr/>
</div>
<div
id=
"footer"
>
<div
class=
"xright"
>
©
2014
Apache Software Foundation
-
<a
href=
"http://maven.apache.org/privacy-policy.html"
>
Privacy Policy
</a></div>
<div
class=
"clear"
>
<hr/>
</div>
</div>
</body>
</html>