Performance of mathematical calculations on a text file Java
I'm taking in a text file with around 60,000 lines of point coordinates, (I expect to scale up soon) and performing Mahalanobis distance from each point to every other point, and outputting the result as a text file. That means my results will be nearly 3,600,000,000 lines long. My program creates around 60,000 lines every 1 or two seconds.
Am I correct in thinking my code is not able to be multithreaded? Is there a better way to code this algorithm? How do people deal with processes like these?
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
public class Coord {
public int a,b,c,d,e,f;
public static void main(String[] args) throws IOException {
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("/Users/evanlivingston/2a.txt", true)));
Scanner sc = new Scanner(new File("/Users/evanlivingston/1.txt"));
List<Coord> coords = new ArrayList<Coord>();{
// for each line in the file
while(sc.hasNextLine()) {
String[] numstrs = sc.nextLine().split("\\s+");
Coord c = new Coord();
c.a = Integer.parseInt(numstrs[1]);
c.b = Integer.parseInt(numstrs[2]);
c.c = Integer.parseInt(numstrs[3]);
c.d = Integer.parseInt(numstrs[4]);
c.e = Integer.parseInt(numstrs[5]);
c.f = Integer.parseInt(numstrs[6]);
coords.add(c);
}
// now you have all coords in memory
int counter = 0; {
for(int i=0; i<coords.size(); i++ )
for( int j=0; j<coords.size(); j++, counter++ )
{
Coord c1 = coords.get(i);
Coord c2 = coords.get(j);
double foo = ((c1.a - c2.a) * (c1.a - c2.a)) *1 ;
double goo = ((c1.b - c2.b) * (c1.b - c2.b)) *1 ;
double hoo = ((c1.c - c2.c) * (c1.c - c2.c)) *2 ;
double joo = ((c1.d - c2.d) * (c1.d - c2.d)) *2 ;
double koo = ((c1.e - c2.e) * (c1.e - c2.e)) *4 ;
double loo = ((c1.f - c2.f) * (c1.f - c2.f)) *4 ;
double zoo = Math.sqrt开发者_Python百科(foo + goo + hoo + joo + koo + loo);
out.println(counter + "; " + i + " " + j + " " + zoo);
System.out.println(counter + "; " + i + " " + j + " " + zoo);
}
out.flush();
out.close();
}
}
}
}
My input file looks like
0 0 0 0 0 0 0
1 0 0 0 0 0 1
....
59318 12 2 12 2 12 2
The first number is a place holder. It's a list of all combinations with replacement limited to the amounts you see in the last line.
It seems now as though the calculations will take about 16 hours, that still seems too long. Not to mention I estimate the final text output to be around 120 GBs.
Your code is very inefficient. You re-read the file second time on every line(!!!) in the file. Disk IO is very slow.
What you should do is to load the file into a parsed memory structure (an array of doubles), and then do a nested loop over it.
Am I correct in thinking my code is not able to be multithreaded?
You're incorrect. This task would benefit a lot from threading. But your first priority is to get rid of repetitive IO. I'd guess the performance would be good enough then.
UPDATE to UPDATE
Rewrote your class to multiple threads (4 by default). Downside: lines in the output file are not written in order, though by using unix sort utility you may sort it after the computation, if needed. Both A->B and B->A are still calculated as I couldn't come up with a simple way to store the result of A->B short of using Java 64bit and installing some 64G of RAM.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
public class Coord {
public int a, b, c, d, e, f;
private static class CoordsThread extends Thread {
private int start;
private int end;
private List<Coord> coords;
private PrintWriter out;
public CoordsThread(int start, int end, List<Coord> list, PrintWriter out) {
this.start = start;
this.end = end;
this.coords = list;
this.out = out;
// last block can be shorter
if( this.end > this.coords.size() ) this.end = this.coords.size();
}
public void run() {
System.out.println("started thread "+getName()+" for ["+start+";"+end+")");
for (int i = start; i < end; i++) {
for (int j = 0; j < coords.size(); j++ ) {
Coord c1 = coords.get(i);
Coord c2 = coords.get(j);
double foo = ((c1.a - c2.a) * (c1.a - c2.a)) * 1;
double goo = ((c1.b - c2.b) * (c1.b - c2.b)) * 1;
double hoo = ((c1.c - c2.c) * (c1.c - c2.c)) * 2;
double joo = ((c1.d - c2.d) * (c1.d - c2.d)) * 2;
double koo = ((c1.e - c2.e) * (c1.e - c2.e)) * 4;
double loo = ((c1.f - c2.f) * (c1.f - c2.f)) * 4;
double zoo = Math.sqrt(foo + goo + hoo + joo + koo + loo);
synchronized (out) {
out.println(i*coords.size()+j + "; " + i + " " + j + " " + zoo);
}
}
}
System.out.println("completed thread "+getName());
}
}
public static void main(String[] args) throws Exception {
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("2.txt")));
Scanner sc = new Scanner(new File("1.txt"));
List<Coord> coords = new ArrayList<Coord>();
// for each line in the file
while (sc.hasNextLine()) {
String[] numstrs = sc.nextLine().split("\\s+");
Coord c = new Coord();
c.a = Integer.parseInt(numstrs[1]);
c.b = Integer.parseInt(numstrs[2]);
c.c = Integer.parseInt(numstrs[3]);
c.d = Integer.parseInt(numstrs[4]);
c.e = Integer.parseInt(numstrs[5]);
c.f = Integer.parseInt(numstrs[6]);
coords.add(c);
}
System.out.println("total lines read: "+coords.size());
int threadsCount = 4;
List<Thread> ths = new ArrayList<Thread>();
int blockSize = coords.size()/threadsCount+1;
for( int i=0; i<threadsCount; ++i ) {
CoordsThread ct = new CoordsThread(i*blockSize, (i+1)*blockSize, coords, out);
ct.setName("Block"+i);
ths.add(ct);
}
for (Thread th : ths) {
th.start();
}
for (Thread th : ths) {
th.join();
}
out.flush();
out.close();
}
}
You are doing lots of repetitive IO, very expensive, more expensive by orders of magnitude than any calculations you are doing.
Also your problem domain fits into a map / reduce scenario very nicely, which is not only easy to multi-thread but you should be able to distribute the calculations over multiple machines as well.
You are reading the file 1.txt
too many times. Read it once, store it in an array of type int[][]
.
Also, try to increase the size of the BufferedWriter
instance.
Also, let the Scanner
instance work on a BufferedInputstream
with a proper character set.
精彩评论