FAQ: What is the "distributed cache" feature provided by Hadoop, and how can my application use it?
Answer:
Some jobs require each Map task to read in one or more large configuration or data files before processing any data. The Hadoop distributed cache is a mechanism for accessing these files from the processes that make up your distributed task. The information on using the distributed cache is scattered across different documents; in general, the documentation for newer versions is more detailed on this and other topics. Here are a few starting points:
You can use one of the following file types with the distributed cache (e.g., through the -cacheArchive command line argument):
- A ZIP file (with a .zip extension). These files are copied to the local data node, and their contents are unpacked there (i.e., decompressed and extracted).
- A JAR file (with a .jar extension). These are handled in the same way as ZIP files. Their location may also be added to the tasks' CLASSPATH variables.
- Other files (with extensions other than .zip or .jar). These are copied to the local node and left as they are, so tasks can access them locally.
You can specify multiple files to be cached for a job by using the -cacheArchive argument multiple times, once per file.
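For example, a streaming job might be launched along the following lines. This is a minimal sketch: the streaming jar location, input/output paths, mapper/reducer scripts, and archive URIs are placeholders, not values from this FAQ; the #dict and #tables fragments name the links under which the unpacked archives appear in each task's working directory.

```sh
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper my_mapper.py \
    -reducer my_reducer.py \
    -cacheArchive hdfs://namenode:9000/user/me/dictionaries.zip#dict \
    -cacheArchive hdfs://namenode:9000/user/me/lookup-tables.zip#tables
```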
If you are not running a streaming job, you can still use the facility through the Java API, as specified in the documentation for the org.apache.hadoop.filecache.DistributedCache class.
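A rough sketch of the driver-side usage is shown below, using the old org.apache.hadoop.mapred API. The class name and HDFS URIs are made up for illustration.

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheDriver {
    public static JobConf configureCache() throws Exception {
        JobConf conf = new JobConf(CacheDriver.class);

        // Archives (.zip/.jar) are unpacked on each task node.
        DistributedCache.addCacheArchive(
                new URI("hdfs://namenode:9000/user/me/dictionaries.zip"), conf);

        // Plain files are copied to each task node as they are.
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/user/me/stopwords.txt"), conf);

        // Inside a task (e.g., in Mapper.configure()), the local copies can be
        // looked up with:
        //   Path[] localFiles    = DistributedCache.getLocalCacheFiles(conf);
        //   Path[] localArchives = DistributedCache.getLocalCacheArchives(conf);
        return conf;
    }
}
```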
Alternatively, you can use the following undocumented approach: set the respective job configuration parameters directly. The key names corresponding to the DistributedCache.addCache{Archive|File} methods are:
- mapred.cache.archives
- mapred.cache.files
These configuration parameters are lists of comma-separated file URIs.
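A hedged sketch of this approach follows; the URIs are placeholders, and setting these keys by hand mirrors what the DistributedCache methods above do internally.

```java
import org.apache.hadoop.mapred.JobConf;

public class CacheViaConfKeys {
    public static JobConf configureCache() {
        JobConf conf = new JobConf();
        // Comma-separated URI lists; setting these keys directly is
        // equivalent to calling DistributedCache.addCacheArchive /
        // addCacheFile once per entry.
        conf.set("mapred.cache.archives",
                "hdfs://namenode:9000/user/me/dictionaries.zip,"
                + "hdfs://namenode:9000/user/me/lookup-tables.jar");
        conf.set("mapred.cache.files",
                "hdfs://namenode:9000/user/me/stopwords.txt");
        return conf;
    }
}
```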