<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- Generated by Apache Maven Doxia at 2014-02-11 -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Distributed Copy -
Appendix</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20140211" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://svn.apache.org/repos/asf/hadoop/" class="externalLink">SVN</a>
&nbsp;| Last Published: 2014-02-11
&nbsp;| Version: 2.3.0
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!-- Copyright 2002-2004 The Apache Software Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. -->
<div class="section">
<h2>Map sizing<a name="Map_sizing"></a></h2>
<p>By default, DistCp attempts to size each map comparably, so
that each copies roughly the same number of bytes. Note that files are the
finest level of granularity, so increasing the number of simultaneous
copiers (i.e. maps) may not always increase the number of
simultaneous copies or the overall throughput.</p>
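<p>The effect of file-level granularity can be illustrated with a minimal
Python sketch. It is not DistCp's own splitting code: the function name
<tt>split_by_bytes</tt>, the paths and the sizes are all hypothetical, and the
sketch only assumes that files are grouped in listing order until a per-map
byte target is reached.</p>
<pre>
```python
def split_by_bytes(file_sizes, num_maps):
    """Group files into at most num_maps contiguous splits of similar byte size.

    file_sizes: list of (path, size_in_bytes) pairs; the paths are hypothetical.
    A file is never divided, so one large file can fill a split on its own.
    """
    total = sum(size for _, size in file_sizes)
    # Target bytes per split, rounded up (ceiling division), and at least 1.
    target = max(1, -(-total // num_maps))
    splits, current, current_bytes = [], [], 0
    for path, size in file_sizes:
        current.append(path)
        current_bytes += size
        if current_bytes >= target:
            splits.append(current)
            current, current_bytes = [], 0
    if current:
        # Leftover files form a final split, or join the last one if the
        # requested number of splits has already been reached.
        if len(splits) == num_maps:
            splits[-1].extend(current)
        else:
            splits.append(current)
    return splits


# Illustrative sizes only: one large file and two small ones, four maps requested.
files = [("/data/big", 1000), ("/data/s1", 10), ("/data/s2", 10)]
print(split_by_bytes(files, 4))
```
</pre>
<p>In this sketch only two splits are produced although four maps were
requested, because the large file cannot be divided further.</p>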
<p>The new DistCp also provides a strategy to &quot;dynamically&quot; size maps,
allowing faster data-nodes to copy more bytes than slower nodes. With
<tt>-strategy dynamic</tt> (explained in the Architecture), rather
than assigning a fixed set of source-files to each map-task, the files are
split into several sets, or chunks. The number of chunks exceeds the number of
maps, usually by a factor of 2-3. Each map picks up a chunk and copies all
files listed in it. When a chunk is exhausted, the map acquires a new chunk and
processes it, until no more chunks remain.</p>
<p> By not assigning a source-path to a fixed map, faster map-tasks (i.e.
data-nodes) are able to consume more chunks, and thus copy more data,
than slower nodes. While this distribution isn't uniform, it is
<b>fair</b> with regard to each mapper's capacity.</p>
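<p>The chunk-pulling behaviour described above can be approximated with a
small simulation. This is only a sketch under stated assumptions, not the
DynamicInputFormat implementation: the function name
<tt>simulate_dynamic</tt>, the map names and the copy speeds are hypothetical,
and each map is assumed to copy at a constant rate.</p>
<pre>
```python
import heapq
from collections import deque


def simulate_dynamic(chunks, speeds):
    """Simulate maps pulling chunks from a shared pool until it is empty.

    chunks: list of chunk sizes in bytes.
    speeds: dict of map name -> bytes copied per second (hypothetical values).
    Returns a dict of map name -> total bytes copied.
    """
    pool = deque(chunks)
    copied = {name: 0 for name in speeds}
    # Priority queue of (time at which the map becomes idle, map name).
    # Every map starts idle at time 0.
    idle = [(0.0, name) for name in sorted(speeds)]
    heapq.heapify(idle)
    while pool:
        # The map that becomes idle first acquires the next chunk.
        now, name = heapq.heappop(idle)
        size = pool.popleft()
        copied[name] += size
        heapq.heappush(idle, (now + size / speeds[name], name))
    return copied


# Two maps and six equal chunks (a factor of 3); speeds are illustrative.
print(simulate_dynamic([100] * 6, {"fast": 100, "slow": 50}))
```
</pre>
<p>With these illustrative numbers the faster map ends up copying four chunks
and the slower map two, which matches the idea that the distribution is not
uniform but follows each map's capacity.</p>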
<p>The dynamic-strategy is implemented by the DynamicInputFormat. It
provides superior performance under most conditions. </p>
<p>Tuning the number of maps to the size of the source and
destination clusters, the size of the copy, and the available
bandwidth is recommended for long-running and regularly run jobs.</p>
</div>
<div class="section">
<h2>Copying between versions of HDFS<a name="Copying_between_versions_of_HDFS"></a></h2>
<p>For copying between two different versions of Hadoop, one will
usually use HftpFileSystem. This is a read-only FileSystem, so DistCp
must be run on the destination cluster (more specifically, on
TaskTrackers that can write to the destination cluster). Each source is
specified as <tt>hftp://&lt;dfs.http.address&gt;/&lt;path&gt;</tt>
(the default <tt>dfs.http.address</tt> is
&lt;namenode&gt;:50070).</p>
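<p>As a hedged illustration of the source format, the following Python sketch
builds an <tt>hftp://</tt> source URI and the argument list for a DistCp run.
The helper names, host names and paths are hypothetical, and the destination
port 8020 is an assumption about the destination cluster rather than
something stated in this document.</p>
<pre>
```python
def hftp_source(namenode_host, path, http_port=50070):
    """Build an hftp:// source URI from the namenode HTTP address and an absolute path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return "hftp://{}:{}{}".format(namenode_host, http_port, path)


def distcp_command(source_uri, dest_uri):
    """Return the argument list for a DistCp run.

    HFTP is read-only, so this command is meant to be launched on the
    destination cluster.
    """
    return ["hadoop", "distcp", source_uri, dest_uri]


# Hypothetical hosts and paths; the port 8020 is an assumed RPC port.
cmd = distcp_command(
    hftp_source("nn1.example.com", "/user/alice/logs"),
    "hdfs://nn2.example.com:8020/user/alice/logs",
)
print(" ".join(cmd))
```
</pre>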
</div>
<div class="section">
<h2>Map/Reduce and other side-effects<a name="MapReduce_and_other_side-effects"></a></h2>
<p>As mentioned in the preceding sections, a map that fails to copy
one of its inputs causes several side-effects.</p>
<ul>
<li>Unless <tt>-overwrite</tt> is specified, files successfully
copied by a previous map will be marked as &quot;skipped&quot; on
re-execution.</li>
<li>If a map fails <tt>mapred.map.max.attempts</tt> times, the
remaining map tasks will be killed (unless <tt>-i</tt> is
set).</li>
<li>If <tt>mapred.speculative.execution</tt> is set
<tt>final</tt> and <tt>true</tt>, the result of the copy is
undefined.</li>
</ul>
</div>
<div class="section">
<h2>SSL Configurations for HSFTP sources:<a name="SSL_Configurations_for_HSFTP_sources:"></a></h2>
<p>To use an HSFTP source (i.e. using the hsftp protocol), a Map-Red SSL
configuration file needs to be specified (via the <tt>-mapredSslConf</tt>
option). This file can specify 3 parameters, of which only the first is
required:</p>
<ul>
<li><tt>ssl.client.truststore.location</tt>: The local-filesystem
location of the trust-store file, containing the certificate for
the namenode.</li>
<li><tt>ssl.client.truststore.type</tt>: (Optional) The format of
the trust-store file.</li>
<li><tt>ssl.client.truststore.password</tt>: (Optional) Password
for the trust-store file.</li>
</ul>
<p>The following is an example of the contents of
a Map-Red SSL Configuration file:</p>
<pre>
&lt;configuration&gt;

  &lt;property&gt;
    &lt;name&gt;ssl.client.truststore.location&lt;/name&gt;
    &lt;value&gt;/work/keystore.jks&lt;/value&gt;
    &lt;description&gt;Truststore to be used by clients like distcp. Must be specified.&lt;/description&gt;
  &lt;/property&gt;

  &lt;property&gt;
    &lt;name&gt;ssl.client.truststore.password&lt;/name&gt;
    &lt;value&gt;changeme&lt;/value&gt;
    &lt;description&gt;Optional. Default value is &quot;&quot;.&lt;/description&gt;
  &lt;/property&gt;

  &lt;property&gt;
    &lt;name&gt;ssl.client.truststore.type&lt;/name&gt;
    &lt;value&gt;jks&lt;/value&gt;
    &lt;description&gt;Optional. Default value is &quot;jks&quot;.&lt;/description&gt;
  &lt;/property&gt;

&lt;/configuration&gt;
</pre>
<p>The SSL configuration file must be in the class-path of the
DistCp program.</p>
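<p>A configuration file of this shape could also be generated
programmatically. The following Python sketch uses only the standard
<tt>xml.etree.ElementTree</tt> module; the function name
<tt>build_ssl_conf</tt> is hypothetical, and the sketch assumes that the
optional parameters may simply be omitted when they are not set.</p>
<pre>
```python
import xml.etree.ElementTree as ET


def build_ssl_conf(truststore_location, password=None, store_type=None):
    """Return a configuration XML string holding the truststore settings.

    Only the truststore location is required; the optional parameters are
    written only when a value is given.
    """
    if not truststore_location:
        raise ValueError("ssl.client.truststore.location must be specified")
    settings = [("ssl.client.truststore.location", truststore_location)]
    if password is not None:
        settings.append(("ssl.client.truststore.password", password))
    if store_type is not None:
        settings.append(("ssl.client.truststore.type", store_type))

    root = ET.Element("configuration")
    for name, value in settings:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Same illustrative truststore path as in the example above.
xml_text = build_ssl_conf("/work/keystore.jks", store_type="jks")
print(xml_text)
```
</pre>
<p>The resulting text would still need to be saved to a file on the
class-path and passed via <tt>-mapredSslConf</tt>, as described above.</p>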
</div>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">&#169; 2014
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a></div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>
