MapReduce programming
I've written this code in Java; it collects various information about photos and writes the results to a text file. I would like to convert this program to work with the MapReduce model, but I am a newbie at MapReduce programming. Any help would be very much appreciated! Thank you
import java.io.*;
import java.util.*;
import java.net.*;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import com.aetrion.flickr.people.User;
import com.aetrion.flickr.photos.Photo;
import com.aetrion.flickr.photos.PhotoList;
import com.aetrion.flickr.photos.PhotosInterface;
import com.aetrion.flickr.photos.SearchParameters;
import com.aetrion.flickr.photosets.PhotosetsInterface;
import com.aetrion.flickr.test.TestInterface;
import com.aetrion.flickr.people.PeopleInterface;
import com.aetrion.flickr.groups.*;
import com.aetrion.flickr.groups.pools.*;
import com.aetrion.flickr.*;

public class example2 {

    public example2() {
    }

    /**
     * @param args
     * @throws FlickrException
     * @throws SAXException
     * @throws IOException
     * @throws ParserConfigurationException
     */
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException, SAXException,
            FlickrException, ParserConfigurationException {
        FileWriter out = new FileWriter("photos.txt");

        // Set API key and host, then initialize the Flickr object with them
        String key = "apikey";
        String svr = "www.flickr.com";
        REST rest = new REST();
        rest.setHost(svr);
        Flickr flickr = new Flickr(key, rest);
        Flickr.debugStream = false;

        // Initialize the SearchParameters object, which stores the search criteria
        SearchParameters searchParams = new SearchParameters();
        searchParams.setSort(SearchParameters.INTERESTINGNESS_DESC);
        searchParams.setGroupId("group_id");

        // Initialize the PhotosInterface object
        PhotosInterface photosInterface = flickr.getPhotosInterface();
        // Execute the search (up to 500 results, first page)
        PhotoList photoList = photosInterface.search(searchParams, 500, 1);

        if (photoList != null) {
            // Walk the search results
            for (int i = 0; i < photoList.size(); i++) {
                Photo photo = (Photo) photoList.get(i);
                System.out.print(photo.getId() + "\t");
                out.write(photo.getId() + "\t");
                System.out.print(photo.getOwner().getId() + "\t");
                out.write(photo.getOwner().getId() + "\t");
                Photo photo1 = photosInterface.getPhoto(photo.getId());
                if (photo1.getGeoData() != null) {
                    System.out.print("latitude=" + photo1.getGeoData().getLatitude() + "\t");
                    out.write(photo1.getGeoData().getLatitude() + "\t");
                    System.out.print("longitude=" + photo1.getGeoData().getLongitude() + "\t");
                    out.write(photo1.getGeoData().getLongitude() + "\t");
                } else {
                    // No geo data: write placeholders to keep the columns aligned
                    System.out.print("null\tnull\t");
                    out.write("null\tnull\t");
                }
                System.out.println("");
                out.write("\n");
            }
        }
        out.close();
    }
}
I'm not sure this is a good use case for Hadoop, unless you have tons of search results to process and the processing accounts for a significant portion of the overall running time. The search itself can't be performed in parallel: only the processing in your for loop.
If you want to process one search in parallel in Hadoop, you will first have to perform the search outside Hadoop** and output the results to a text file--a list of IDs, for instance. Then write a mapper that takes an ID, fetches the photo, and does the processing you currently do in your for loop, emitting a string with the fetched attributes (which you are currently printing to System.out). Hadoop will run this mapper individually for every ID in your list of results.
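A minimal sketch of such a mapper (new-API Hadoop; the Flickr setup mirrors the question's code, and the choice of emitted attributes is just an example, not a fixed recipe):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.aetrion.flickr.Flickr;
import com.aetrion.flickr.REST;
import com.aetrion.flickr.photos.Photo;
import com.aetrion.flickr.photos.PhotosInterface;

public class IdFetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private PhotosInterface photosInterface;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // One Flickr client per map task, not per record
            REST rest = new REST();
            rest.setHost("www.flickr.com");
            photosInterface = new Flickr("apikey", rest).getPhotosInterface();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line is one photo ID from the pre-computed search results
        String photoId = line.toString().trim();
        Photo photo;
        try {
            photo = photosInterface.getPhoto(photoId);
        } catch (Exception e) {
            return; // skip photos that fail to fetch rather than failing the job
        }
        StringBuilder attrs = new StringBuilder();
        attrs.append(photo.getOwner().getId()).append('\t');
        if (photo.getGeoData() != null) {
            attrs.append(photo.getGeoData().getLatitude()).append('\t')
                 .append(photo.getGeoData().getLongitude());
        }
        // Key: photo ID; value: tab-separated attributes, as in the original loop
        context.write(new Text(photoId), new Text(attrs.toString()));
    }
}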
I don't imagine this is going to be worth it, unless there is some other processing you are planning on doing. Some alternatives to consider:
- Use map-reduce to perform lots of different searches in parallel. Your program would be essentially unchanged--it would just run inside a map function instead of the main() loop. Alternatively, your search could happen in the mapper, emitting the results, and your processing could happen in the reducer. Your input would be a list of search terms.
- Forget about map-reduce, and just run the processing in parallel using a thread pool. Check out the various Executors in java.util.concurrent; a sketch follows this list.
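A minimal sketch of that thread-pool alternative, assuming you keep the question's search code and only move the per-photo work into tasks. processPhoto is a hypothetical stand-in for the body of the for loop; note that a shared FileWriter is not thread-safe, so either synchronize writes or have each task return its result for writing at the end:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.aetrion.flickr.photos.Photo;
import com.aetrion.flickr.photos.PhotoList;

public class ParallelFetch {

    static void processPhoto(Photo photo) {
        // hypothetical: fetch geo data and format attributes, as in the question's loop
    }

    static void processAll(PhotoList photoList) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // 8 concurrent fetches
        for (int i = 0; i < photoList.size(); i++) {
            final Photo photo = (Photo) photoList.get(i);
            pool.submit(new Runnable() {
                public void run() {
                    processPhoto(photo);
                }
            });
        }
        pool.shutdown();                          // no new tasks after this point
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all fetches to finish
    }
}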
** An alternative, hackish way to make the whole thing run "inside" Hadoop would be to run the search inside a map function, emitting the results one by one. Use an input file that has one line of text--a dummy value--so your mapper runs just once. Then do the image fetching in a reducer instead of the mapper, roughly as sketched below.
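Under that setup, the fetching reducer could look roughly like this (the single-run mapper would emit one photo ID per result with an empty value; the attribute formatting is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import com.aetrion.flickr.Flickr;
import com.aetrion.flickr.REST;
import com.aetrion.flickr.photos.Photo;
import com.aetrion.flickr.photos.PhotosInterface;

public class FetchReducer extends Reducer<Text, Text, Text, Text> {
    private PhotosInterface photosInterface;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Same Flickr client setup as in the question's main()
            REST rest = new REST();
            rest.setHost("www.flickr.com");
            photosInterface = new Flickr("apikey", rest).getPhotosInterface();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void reduce(Text photoId, Iterable<Text> ignored, Context context)
            throws IOException, InterruptedException {
        Photo photo;
        try {
            photo = photosInterface.getPhoto(photoId.toString());
        } catch (Exception e) {
            return; // skip unfetchable photos rather than failing the task
        }
        String geo = (photo.getGeoData() != null)
                ? photo.getGeoData().getLatitude() + "\t" + photo.getGeoData().getLongitude()
                : "null\tnull";
        context.write(photoId, new Text(photo.getOwner().getId() + "\t" + geo));
    }
}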
Update:
If you have a bunch of different Group IDs to search, then you can use the first "alternative" approach.
Like you suggested, use the Group IDs and API keys as input. Put one pair on each line, separated by a tab or something else you can easily parse. You will also have to split them up into different files if you want them to run in different mappers. If you only have as many Group IDs as nodes, you will probably just want to put one line in each file. Use TextInputFormat for your Hadoop job. The line with the Group ID and API key will be the value--use value.toString().split("\t") to separate it into the two parts.
Then, run your whole search inside the mapper. For each result, use context.write(key, value) (or output.collect(key, value), depending on your version) to write a photo ID as the key and the string with your attributes as the value. Both of these will have to be converted into Hadoop's Text objects.
I'm not going to give wholesale code for this--it should be very easy to just adapt the Hadoop MapReduce tutorial. The only real differences:
- Use job.setOutputValueClass(Text.class), and change where it says IntWritable in the mapper class signature:

public static class Map extends Mapper<LongWritable, Text, Text, Text> {
- Just disable the reducer. Take out the reducer class, and change this:

job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);

into this:

job.setMapperClass(Map.class);
job.setNumReduceTasks(0);
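For concreteness, here is a rough sketch of the mapper half under the assumptions above (tab-separated groupId/apiKey input lines, search parameters copied from the question, and a hypothetical formatAttributes helper holding the per-photo logic from the original loop):

// Nested inside the job class from the Hadoop tutorial; needs the Flickr
// imports from the question plus org.apache.hadoop.io.LongWritable,
// org.apache.hadoop.io.Text, and org.apache.hadoop.mapreduce.Mapper.
public static class Map extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line is "groupId<TAB>apiKey"
        String[] parts = line.toString().split("\t");
        String groupId = parts[0];
        String apiKey = parts[1];
        try {
            // Same setup as the question's main(), now once per input line
            REST rest = new REST();
            rest.setHost("www.flickr.com");
            PhotosInterface photos = new Flickr(apiKey, rest).getPhotosInterface();
            SearchParameters params = new SearchParameters();
            params.setSort(SearchParameters.INTERESTINGNESS_DESC);
            params.setGroupId(groupId);
            PhotoList results = photos.search(params, 500, 1);
            for (int i = 0; i < results.size(); i++) {
                Photo photo = (Photo) results.get(i);
                // Key and value must both be Hadoop Text objects
                context.write(new Text(photo.getId()),
                              new Text(formatAttributes(photo))); // hypothetical helper
            }
        } catch (Exception e) {
            throw new IOException(e); // surface API failures so Hadoop retries the task
        }
    }
}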
If you have specific questions about getting this to work, feel free to ask. Do put some research effort into it first, though.
I don't agree with Tim Yates' answer. Your search can be parallelized very well. My approach would be the following:
- Implement a Mapper that takes a chunk of your search queries as input (you have to chunk them yourself, because these things aren't SequenceFiles), then runs the queries and writes the results to the filesystem. The key would be the ID and the value would be your additional information.
- Implement a reducer that does whatever you want with the data (the output of the first step).
I've already done this with the YouTube API, so it parallelizes well. BUT you have to watch out for quota limits. You can handle them quite well with Thread.sleep(PERIOD_OF_TIME); for example:
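Inside the mapper's loop over its chunk of queries (runQuery and QUERY_INTERVAL_MS are placeholders; choose an interval that matches the API's published quota):

// Crude throttling between successive API calls to stay under the quota
for (String query : queryChunk) {           // queryChunk: this mapper's queries
    runQuery(query);                        // placeholder for one API search call
    try {
        Thread.sleep(QUERY_INTERVAL_MS);    // e.g. 1000 ms; tune to the quota
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        break;
    }
}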
This only applies if you have a bunch of search queries; if you have a single user entering a search string, MapReduce isn't the (optimal) way to go.