Download tarball from repository
I am currently working on a project for scraping source code from SourceForge. I would like to download the tarball from the code repository.
An example link is given below: http://wurfl.cvs.sourceforge.net/viewvc/wurfl/?view=tar
The problem I faced is that I am unable to use conventional APIs such as URLConnection, HttpClient, HtmlUnit, or Jsoup to download the file. The specified link does not contain a filename or extension, which makes the download even more complicated.
Can you suggest a way to download a set of tarball links, passed as parameters, to my disk? I was able to download the file using wget. Is there a way I can do it programmatically in Java on Windows?
Before you go any further with your efforts, carefully read the SourceForge Terms of Use page. If you don't understand the ToS, contact SourceForge and ask them whether you are allowed to do what you are proposing.
The problem I faced while downloading is that I am unable to use conventional URL, HTTP, HtmlUnit, Jsoup APIs, etc. to download the file.
Your assumption is incorrect.
You CAN use APIs such as the standard HttpURLConnection API or the Apache HttpClient APIs to do this kind of thing. If it is not working, it is because
- you are doing something the wrong way (e.g. you haven't configured your Java app to use your local HTTP proxy), or
- SourceForge are using some technical means to stop you doing this; see the ToS.
If you post some details on what is happening when you try these approaches, maybe we can help you.
(HtmlUnit and Jsoup are probably inappropriate because they target HTML content.)
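As a rough illustration of the HttpURLConnection route, here is a minimal sketch that streams a response body to disk and optionally goes through an HTTP proxy. The class name TarballDownloader, the helper names, the timeouts, and the output file names are all assumptions made for this example. It does not bypass any restriction the server applies, so the ToS caveat above still holds.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class TarballDownloader {

    // Describes an HTTP proxy; host and port are whatever your network uses (hypothetical here).
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Streams the response body of `link` into `target` and returns the number of bytes written.
    // Pass Proxy.NO_PROXY for a direct connection.
    static long download(String link, Path target, Proxy proxy) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(link).toURL().openConnection(proxy);
        conn.setConnectTimeout(15_000); // timeouts are arbitrary choices for this sketch
        conn.setReadTimeout(60_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + link);
            }
            try (InputStream in = conn.getInputStream()) {
                return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is treated as one tarball link.
        // The ".tar.gz" suffix is an assumption about what the server returns.
        for (int i = 0; i < args.length; i++) {
            Path target = Paths.get("tarball-" + i + ".tar.gz");
            long bytes = download(args[i], target, Proxy.NO_PROXY);
            System.out.println(args[i] + " -> " + target + " (" + bytes + " bytes)");
        }
    }
}
```

Run it with the links as command-line arguments. If you sit behind a proxy, replace Proxy.NO_PROXY with httpProxy(host, port) using your own proxy settings.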
The specified link does not contain any filename or extension, this makes the download process even more complicated.
You can get the source filename and / or content type from the response headers. Refer to the HTTP specifications for details.
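To make that concrete, here is a small sketch of picking a local file name from the response headers. It looks for a plain filename parameter in Content-Disposition and otherwise falls back to the content type. The class and method names are hypothetical, the content-type-to-extension mapping is an assumption (it treats gzip responses as .tar.gz because the link asks for a tarball), and the extended filename*= form is deliberately not handled.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResponseFileName {

    // Matches filename=value or filename="value"; does not match the extended filename*= form.
    private static final Pattern FILENAME =
            Pattern.compile("filename\\s*=\\s*\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE);

    // Chooses a file name from the two header values; either header may be null.
    static String fromHeaders(String contentDisposition, String contentType, String baseName) {
        if (contentDisposition != null) {
            Matcher m = FILENAME.matcher(contentDisposition);
            if (m.find()) {
                String name = m.group(1).trim();
                // Keep only the last path segment so a header cannot point outside the target folder.
                name = name.substring(Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\')) + 1);
                if (!name.isEmpty()) {
                    return name;
                }
            }
        }
        // Drop parameters such as "; charset=..." before comparing the media type.
        String type = contentType == null
                ? ""
                : contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        switch (type) {
            case "application/gzip":
            case "application/x-gzip":
                return baseName + ".tar.gz"; // assumption: a gzip body here is a compressed tarball
            case "application/x-tar":
                return baseName + ".tar";
            default:
                return baseName + ".bin";
        }
    }

    public static void main(String[] args) {
        System.out.println(fromHeaders("attachment; filename=\"wurfl.tar.gz\"", null, "download"));
        System.out.println(fromHeaders(null, "application/x-gzip", "wurfl"));
    }
}
```

With an HttpURLConnection named conn, the two header values would come from conn.getHeaderField("Content-Disposition") and conn.getContentType().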
If you really DO want to go ahead and perhaps violate SourceForge's ToS, then this may help.
You need wget.exe, as you wanted.
ProcessBuilder pb = new ProcessBuilder("wget.exe", "--no-proxy", "http://wurfl.cvs.sourceforge.net/viewvc/wurfl/?view=tar");
Process p = pb.start();
This will work as long as wget.exe is on your PATH or in the directory the program is run from.
You may also want to check whether the file DOES exist, in which case you would do something along the lines of:
// Requires: import java.io.BufferedReader; import java.io.InputStreamReader;
String url = "http://wurfl.cvs.sourceforge.net/viewvc/wurfl/?view=tar";
// First pass: --spider makes wget check the URL without saving anything
// (this assumes the server answers such a check the same way as a real download).
ProcessBuilder check = new ProcessBuilder("wget.exe", "--no-proxy", "--spider", "--server-response", url);
check.redirectErrorStream(true); // wget writes its log to stderr; merge it into stdout
Process p = check.start();
StringBuilder sb = new StringBuilder();
// Read the output before calling waitFor() so a full pipe buffer cannot block wget.
try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n');
    }
}
int exitValue = p.waitFor();
System.out.println(sb);
if (exitValue != 0 || sb.indexOf("404") != -1) {
    // means that the file does not exist
    System.out.println("File does not exist, or access is denied");
} else if (sb.indexOf("200") != -1) {
    // file exists, download it
    System.out.println("File exists, downloading...");
    Process download = new ProcessBuilder("wget.exe", "--no-proxy", url).inheritIO().start();
    download.waitFor(); // block until wget has finished writing the file
}
But I recommend NOT scraping SourceForge, unless it's your own code that you are scraping (I did that once as an updater program). If you do, and my example helps, please kindly don't mention me. =]
Hope I helped!