FAQ: What is the "distributed cache" feature provided by Hadoop? and How can my application use it?

Answer:

Some jobs require each Map task to read in one or more large configuration or data files before processing any data. The Hadoop distributed cache is a mechanism for accessing these files from the processes comprising your distributed task. The information for using the distributed cache is scattered across different documents. In general, the documentation for newer version is more detailed with respect to this and other topics. Here are a few starting points:

You can use one of the following file types with the distributed cache (e.g., through the -cacheArchive command line argument):

  1. A ZIP file, (with .zip extension). These files are copied to the local data node and made locally available at the data node. The file contents are unpacked (i.e., decompressed and extracted).
  2. A JAR file (with .jar extension). These are handled in the same way as zip files. Their location may be added to the tasks' CLASSPATH variables.
  3. Other files (with extensions other than .zip or .jar) are copied to the local node and left as they are. Tasks can locally access these files.

You can specify multiple files to be cached for a job by using the -cacheArchive argument multiple times, once per file.

If you are not running a streaming job, you can still use the facility through the Java API as specified in the documentation for the org.apache.hadoop.filecache.DistributedCache class.

Alternatively, you can use the following undocumented approach. You can set the respective job configuration parameters. The corresponding key names for the DistributedCache.addCache{Archive|File} methods are:
  • mapred.cache.archives
  • mapred.cache.files
These configuration parameters are lists of comma separated file URIs.

Back to: CloudClusterFAQ
Topic revision: r2 - 10 Mar 2010, JulioLopez - This page was cached on 14 Nov 2024 - 01:37.

This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding PDLWiki? Send feedback