I've written an MR algorithm that builds a data structure from some data. Once it is built, I need to answer some queries against it. To answer these queries faster, I generated metadata (several MB in size) from the result.
Now my question is this:
Is it possible to create this metadata in the memory of the master node, so that I avoid file I/O and can answer queries faster?
2 Answers
Answers 1
Assuming, based on the OP's response to the other answer, that the metadata will be required for another MR job, using the distributed cache is rather easy:
In the driver class:
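import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverClass extends Configured {
    public static void main(String[] args) throws Exception {
        /* ...some init code... */

        /* Instantiate a Job object for your job's configuration. */
        Configuration job_conf = new Configuration();
        // Register the file before creating the Job, which copies the Configuration.
        DistributedCache.addCacheFile(new Path("path/to/your/data.txt").toUri(), job_conf);
        Job job = new Job(job_conf);

        /* ... configure and start the job... */
    }
}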
In the mapper class you can read the data at the setup stage and make it available to the map method:
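import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import com.google.common.io.Files; // Guava, provides Files.readLines
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class YourMapper extends Mapper<LongWritable, Text, Text, Text> {
    private List<String> lines = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        /* Get the cached archives/files */
        Path[] cached_file = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached_file == null || cached_file.length == 0) {
            // Fail early instead of indexing into an empty array below
            throw new IOException("No file found in the distributed cache");
        }
        /* Read the data, something like: */
        File f = new File(cached_file[0].toString());
        lines = Files.readLines(f, StandardCharsets.UTF_8); // assumes UTF-8 text
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        /* In the mapper - use the data as needed */
    }
}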
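The pattern in the mapper above (read the side file once, keep it in memory, consult it for every record) can also be sketched in plain Java without Hadoop. This is only an illustration under assumptions: the class name MetadataIndex and the tab-separated key/value layout are hypothetical, since the question does not say what the metadata looks like.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: loads a small side file into memory once. */
class MetadataIndex {
    private final Map<String, String> entries = new HashMap<>();

    /** Reads "key<TAB>value" lines (assumed layout); other lines are skipped. */
    static MetadataIndex load(Path file) throws IOException {
        MetadataIndex index = new MetadataIndex();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                index.entries.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return index;
    }

    /** In-memory lookup, no file I/O; returns null when the key is absent. */
    String lookup(String key) {
        return entries.get(key);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the locally cached file a task would receive.
        Path tmp = Files.createTempFile("metadata", ".txt");
        Files.write(tmp, Arrays.asList("a\t1", "b\t2"), StandardCharsets.UTF_8);

        MetadataIndex index = MetadataIndex.load(tmp); // once, as in setup()
        System.out.println(index.lookup("a"));         // per record, as in map()
        System.out.println(index.lookup("missing"));

        Files.delete(tmp);
    }
}
```

In a real job the load call would sit in setup() and the lookups in map(), so each task pays the file read once rather than once per record.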
Note that the distributed cache can hold more than plain text files. You can use archives (zip, tar, ...) and even full Java classes (jar files).
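Also note that in newer Hadoop implementations the DistributedCache class is deprecated and the distributed cache API is found in the Job class itself: files are registered with Job.addCacheFile(URI) and read back in the task via context.getCacheFiles(). Refer to this API and this answer.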
Answers 2
In Hadoop, you can use the distributed cache to avoid repeated HDFS reads in MR jobs: the files are copied to each task node once per job and then read locally. https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html
The NameNode (master node) maintains the namespace for all directories and files in the cluster (persisted on disk) and keeps the block locations in memory, so it is already fast at locating the data blocks associated with a file. https://twiki.opensciencegrid.org/bin/view/Documentation/HadoopUnderstanding