<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- Generated by Apache Maven Doxia at 2014-02-11 -->
<html
xmlns=
"http://www.w3.org/1999/xhtml"
>
<head>
<title>
Apache Hadoop Distributed Copy - Frequently Asked Questions
</title>
<style
type=
"text/css"
media=
"all"
>
@import
url
(
"./css/maven-base.css"
)
;
@import
url
(
"./css/maven-theme.css"
)
;
@import
url
(
"./css/site.css"
)
;
</style>
<link
rel=
"stylesheet"
href=
"./css/print.css"
type=
"text/css"
media=
"print"
/>
<meta
name=
"Date-Revision-yyyymmdd"
content=
"20140211"
/>
<meta
http-equiv=
"Content-Type"
content=
"text/html; charset=UTF-8"
/>
</head>
<body
class=
"composite"
>
<div
id=
"banner"
>
<a
href=
"http://hadoop.apache.org/"
id=
"bannerLeft"
>
<img
src=
"http://hadoop.apache.org/images/hadoop-logo.jpg"
alt=
""
/>
</a>
<a
href=
"http://www.apache.org/"
id=
"bannerRight"
>
<img
src=
"http://www.apache.org/images/asf_logo_wide.png"
alt=
""
/>
</a>
<div
class=
"clear"
>
<hr/>
</div>
</div>
<div
id=
"breadcrumbs"
>
<div
class=
"xleft"
>
<a
href=
"http://www.apache.org/"
class=
"externalLink"
>
Apache
</a>
>
<a
href=
"http://hadoop.apache.org/"
class=
"externalLink"
>
Hadoop
</a>
>
Apache Hadoop Distributed Copy
</div>
<div
class=
"xright"
>
<a
href=
"http://wiki.apache.org/hadoop"
class=
"externalLink"
>
Wiki
</a>
|
<a
href=
"https://svn.apache.org/repos/asf/hadoop/"
class=
"externalLink"
>
SVN
</a>
| Last Published: 2014-02-11
| Version: 2.3.0
</div>
<div
class=
"clear"
>
<hr/>
</div>
</div>
<div
id=
"leftColumn"
>
<div
id=
"navcolumn"
>
<h5>
General
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../index.html"
>
Overview
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/SingleCluster.html"
>
Single Node Setup
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/ClusterSetup.html"
>
Cluster Setup
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/CommandsManual.html"
>
Hadoop Commands Reference
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/FileSystemShell.html"
>
File System Shell
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/Compatibility.html"
>
Hadoop Compatibility
</a>
</li>
</ul>
<h5>
Common
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/CLIMiniCluster.html"
>
CLI Mini Cluster
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/NativeLibraries.html"
>
Native Libraries
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/Superusers.html"
>
Superusers
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/SecureMode.html"
>
Secure Mode
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html"
>
Service Level Authorization
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/HttpAuthentication.html"
>
HTTP Authentication
</a>
</li>
</ul>
<h5>
HDFS
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html"
>
HDFS User Guide
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html"
>
High Availability With QJM
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html"
>
High Availability With NFS
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/Federation.html"
>
Federation
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html"
>
HDFS Snapshots
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html"
>
HDFS Architecture
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html"
>
Edits Viewer
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html"
>
Image Viewer
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html"
>
Permissions and HDFS
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html"
>
Quotas and HDFS
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/Hftp.html"
>
HFTP
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/LibHdfs.html"
>
C API libhdfs
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/WebHDFS.html"
>
WebHDFS REST API
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-hdfs-httpfs/index.html"
>
HttpFS Gateway
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html"
>
Short Circuit Local Reads
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html"
>
Centralized Cache Management
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html"
>
HDFS NFS Gateway
</a>
</li>
</ul>
<h5>
MapReduce
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html"
>
Compatibility between Hadoop 1.x and Hadoop 2.x
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"
>
Encrypted Shuffle
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"
>
Pluggable Shuffle/Sort
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"
>
Distributed Cache Deploy
</a>
</li>
</ul>
<h5>
YARN
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/YARN.html"
>
YARN Architecture
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"
>
Writing YARN Applications
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html"
>
Capacity Scheduler
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/FairScheduler.html"
>
Fair Scheduler
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html"
>
Web Application Proxy
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/YarnCommands.html"
>
YARN Commands
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-sls/SchedulerLoadSimulator.html"
>
Scheduler Load Simulator
</a>
</li>
</ul>
<h5>
YARN REST APIs
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html"
>
Introduction
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html"
>
Resource Manager
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html"
>
Node Manager
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html"
>
MR Application Master
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html"
>
History Server
</a>
</li>
</ul>
<h5>
Auth
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/index.html"
>
Overview
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/Examples.html"
>
Examples
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/Configuration.html"
>
Configuration
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-auth/BuildingIt.html"
>
Building
</a>
</li>
</ul>
<h5>
Reference
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/releasenotes.html"
>
Release Notes
</a>
</li>
<li
class=
"none"
>
<a
href=
"../api/index.html"
>
API docs
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/CHANGES.txt"
>
Common CHANGES.txt
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/CHANGES.txt"
>
HDFS CHANGES.txt
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-mapreduce/CHANGES.txt"
>
MapReduce CHANGES.txt
</a>
</li>
</ul>
<h5>
Configuration
</h5>
<ul>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/core-default.xml"
>
core-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml"
>
hdfs-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml"
>
mapred-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-yarn/hadoop-yarn-common/yarn-default.xml"
>
yarn-default.xml
</a>
</li>
<li
class=
"none"
>
<a
href=
"../hadoop-project-dist/hadoop-common/DeprecatedProperties.html"
>
Deprecated Properties
</a>
</li>
</ul>
<a
href=
"http://maven.apache.org/"
title=
"Built by Maven"
class=
"poweredBy"
>
<img
alt=
"Built by Maven"
src=
"./images/logos/maven-feather.png"
/>
</a>
</div>
</div>
<div
id=
"bodyColumn"
>
<div
id=
"contentBox"
>
<!-- Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License. -->
<div
class=
"section"
>
<h2><a
name=
"top"
>
Frequently Asked Questions
</a><a
name=
"Frequently_Asked_Questions"
></a></h2>
<p><b>
General
</b></p>
<ol
style=
"list-style-type: decimal"
>
<li><a
href=
"#Update"
>
Why does -update not create the parent source-directory under
a pre-existing target directory?
</a></li>
<li><a
href=
"#Deviation"
>
How does the new DistCp differ in semantics from the Legacy
DistCp?
</a></li>
<li><a
href=
"#nMaps"
>
Why does the new DistCp use more maps than legacy DistCp?
</a></li>
<li><a
href=
"#more_maps"
>
Why does DistCp not run faster when more maps are specified?
</a></li>
<li><a
href=
"#client_mem"
>
Why does DistCp run out of memory?
</a></li></ol></div>
<div
class=
"section"
>
<h2>
General
<a
name=
"General"
></a></h2>
<dl>
<dt><a
name=
"Update"
>
Why does -update not create the parent source-directory under
a pre-existing target directory?
</a></dt>
<dd>
The behaviour of
<tt>
-update
</tt>
and
<tt>
-overwrite
</tt>
is described in detail in the Usage section of this document. In short,
if either option is used with a pre-existing destination directory, the
<b>
contents
</b>
of each source directory are copied over, rather
than the source directory itself.
This behaviour is consistent with the legacy DistCp implementation.
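<p>
As a rough illustration of the rule above, the following shell sketch models
where a file from a source directory would land under a pre-existing target.
It does not invoke Hadoop. The function name landing_path, the sub-directory
"first" and the file name "1" are assumptions made for this example only.
</p>

```shell
# Illustrative model only: prints where a source file would land under a
# pre-existing target directory. It does not call Hadoop.
# Usage: landing_path MODE SRC_DIR REL_FILE TARGET_DIR
#   MODE is "default", "-update" or "-overwrite".
landing_path() {
  mode=$1; src_dir=$2; rel_file=$3; target_dir=$4
  case "$mode" in
    -update|-overwrite)
      # The contents of the source directory go directly under the target.
      printf '%s/%s\n' "$target_dir" "$rel_file" ;;
    *)
      # The source directory itself is recreated under the target.
      printf '%s/%s/%s\n' "$target_dir" "$(basename "$src_dir")" "$rel_file" ;;
  esac
}

landing_path default /source/first 1 /target   # prints /target/first/1
landing_path -update /source/first 1 /target   # prints /target/1
```

<p>
The second call shows why no "first" directory appears under the target when
either option is used.
</p>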
<p
align=
"right"
><a
href=
"#top"
>
[top]
</a></p><hr
/></dd>
<dt><a
name=
"Deviation"
>
How does the new DistCp differ in semantics from the Legacy
DistCp?
</a></dt>
<dd>
<ul>
<li>
With Legacy DistCp, files that were skipped during a copy also kept
their existing file attributes (permissions, owner/group info, etc.)
at the target. The new DistCp updates these attributes, even if
the file copy is skipped.
</li>
<li>
Legacy DistCp did not create empty root directories among the
source-path inputs at the target. The new DistCp creates them.
</li>
</ul>
<p
align=
"right"
><a
href=
"#top"
>
[top]
</a></p><hr
/></dd>
<dt><a
name=
"nMaps"
>
Why does the new DistCp use more maps than legacy DistCp?
</a></dt>
<dd>
<p>
Legacy DistCp works by determining which files actually need to be
copied to the target
<b>
before
</b>
the copy job is launched, and then
launching only as many maps as the copy requires. So if most of the files
need to be skipped (because they already exist, for example), fewer maps
are needed. As a consequence, more time is spent in setup (i.e. before the
M/R job).
</p>
<p>
The new DistCp only enumerates the contents of the source paths. It
does not try to filter out files that can be skipped. That decision is
deferred until the M/R job runs. This is much faster in terms of execution time,
but the number of maps launched will be as specified in the
<tt>
-m
</tt>
option, or 20 (default) if unspecified.
</p>
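<p>
For illustration, the sketch below prints (without running) a DistCp command
line with an explicit map count. The helper name distcp_command and its
optional third argument are assumptions for this example. The value 20 is the
default cited above, and /source and /target are placeholder paths.
</p>

```shell
# Sketch: print, rather than run, a distcp command line with a map count.
# Usage: distcp_command SRC DST [MAPS]
# MAPS falls back to 20, the default the FAQ cites when -m is not given.
distcp_command() {
  src=$1; dst=$2
  maps=${3:-20}
  printf 'hadoop distcp -m %s %s %s\n' "$maps" "$src" "$dst"
}

distcp_command /source /target      # prints: hadoop distcp -m 20 /source /target
distcp_command /source /target 50   # prints: hadoop distcp -m 50 /source /target
```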
<p
align=
"right"
><a
href=
"#top"
>
[top]
</a></p><hr
/></dd>
<dt><a
name=
"more_maps"
>
Why does DistCp not run faster when more maps are specified?
</a></dt>
<dd>
<p>
At present, the smallest unit of work for DistCp is a file; that is,
a file is processed by only one map. Increasing the number of maps to
a value exceeding the number of files yields no performance
benefit: the number of maps launched would equal the number of files.
</p>
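<p>
The arithmetic behind this can be sketched in shell. This is a simplified
model, not DistCp code: the function name useful_maps is an assumption, and
the only rule it encodes is that a file is handled by exactly one map.
</p>

```shell
# Illustrative arithmetic only: with a whole file as the smallest unit of
# work, maps beyond the file count have nothing to copy.
# Usage: useful_maps REQUESTED_MAPS FILE_COUNT
useful_maps() {
  requested=$1; files=$2
  if [ "$requested" -gt "$files" ]; then
    echo "$files"
  else
    echo "$requested"
  fi
}

useful_maps 20 5      # prints 5
useful_maps 20 1000   # prints 20
```

<p>
In this model, requesting 20 maps for 5 files leaves only 5 maps with work to
do, so raising the map count further cannot shorten the copy.
</p>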
<p
align=
"right"
><a
href=
"#top"
>
[top]
</a></p><hr
/></dd>
<dt><a
name=
"client_mem"
>
Why does DistCp run out of memory?
</a></dt>
<dd>
<p>
If the number of individual files/directories being copied from
the source path(s) is extremely large (e.g. 1,000,000 paths), DistCp might
run out of memory while determining the list of paths for copy. This is
not unique to the new DistCp implementation.
</p>
<p>
To get around this, consider changing the
<tt>
-Xmx
</tt>
JVM
heap-size parameters, as follows:
</p>
<p><tt>
bash
$
export HADOOP_CLIENT_OPTS=
"
-Xms64m -Xmx1024m
"
</tt></p>
<p><tt>
bash
$
hadoop distcp /source /target
</tt></p>
<p
align=
"right"
><a
href=
"#top"
>
[top]
</a></p></dd></dl></div>
</div>
</div>
<div
class=
"clear"
>
<hr/>
</div>
<div
id=
"footer"
>
<div
class=
"xright"
>
©
2014
Apache Software Foundation
-
<a
href=
"http://maven.apache.org/privacy-policy.html"
>
Privacy Policy
</a></div>
<div
class=
"clear"
>
<hr/>
</div>
</div>
</body>
</html>