How To Mark a String in a File?
I have a text file. It is structured as follows:
#1{1,12,345,867} #2{123, 3243534, 2132131231} #3{234, 35345} #4{}
... (each entry ends with "\n")
That is only an example. In practice my strings of the form #number{number,number,...,number} can be very long.
Here is a template of a constructor of a class which works with this file:
public Submatrix(String matrixFilePath, int startPos, int endPos) throws FileNotFoundException{
}
As you can see, the submatrix is determined by startPos and endPos, the numbers of the first and last strings of the matrix.
My question is: how can I count strings to reach the right one? My file can contain billions of strings. Should I call LineNumberReader.readLine() billions of times?
I would be tempted to read each line sequentially until I reached the desired line. However, since the lines are numbered in the file and delimited with newlines, you can treat the file as random access and employ various strategies. For example, you can use a variant of binary search to quickly find the starting line. You can also estimate the average line length from the first N lines and use it to make a more accurate guess at the starting location, and so on.
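A minimal sketch of the binary-search idea, in Python. It assumes every line of the file is an entry of the form #number{...}, that the numbers increase from line to line, and that there are no blank lines. The names find_line, _line_after and LINE_NO are made up for this illustration. It shows plain binary search over byte offsets, not the average-line-length refinement.

```python
import os
import re

LINE_NO = re.compile(rb"#(\d+)\{")   # the "#<number>{" prefix of an entry


def _number(line):
    # Assumption: every line is a well-formed entry, so the match never fails
    return int(LINE_NO.match(line).group(1))


def _line_after(f, pos):
    """Return the first complete line that starts strictly after byte offset pos."""
    f.seek(pos)
    f.readline()            # discard the rest of the line that pos falls in
    return f.readline()     # b"" at end of file


def find_line(path, target):
    """Return the raw bytes of entry number target, or None if it is absent."""
    with open(path, "rb") as f:
        # The line at offset 0 is never returned by _line_after, so check it first
        first = f.readline()
        if not first:
            return None
        if _number(first) == target:
            return first
        # Find the smallest offset whose following line is numbered >= target
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_after(f, mid)
            if not line or _number(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_after(f, lo)
        if line and _number(line) == target:
            return line
        return None
```

Each probe reads at most two lines, so the number of seeks grows with the logarithm of the file size in bytes rather than with the line number.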
I think the answer is yes, you read billions of lines using readLine, unless you think it's worth the trouble to use either
- the strategy outlined by GregS, that is, estimating the line length and using that to start reading somewhere near the correct line, or
- a separate index, either at the start of the file or in a separate file, which has a very predictable layout, something like
0000001 000000001024 0000002 000000001064 0000003 000000002010
That is, the line number and the starting position of that line in bytes, in a strictly defined format, which makes it possible to compute the position of an index entry like this:
I want to read line 3, so I find the index entry for line 3 by going to position (3-1) * 20 (plus one byte per entry if the entries are separated by a newline or space), and read
0000003 000000002010
, parse that and know that line 3 is at byte position 2010, seek to that position and start reading.
Calculating or maintaining the index might not be easy if it's in the main data file, as it would mean that you have to precalculate positions before you actually write the file. I think I would use a separate index file and either calculate the indices during writing, or have a separate utility that creates an index file for a given data file.
EDIT: Added example code to demonstrate my proposal.
I have made a smallish Python script which reads a data file and creates an index file. The index file contains the position of a line in the data file and is designed to be easily searchable.
This example script formats each index entry as 06d. Because an entry stores a byte offset, that is only good enough for data files of up to 999,999 bytes; for your file the width will have to be increased (don't forget to adjust INDEX_LENGTH as well). The script creates an index file and then uses that index file to read a given line out of the data file (for demonstration purposes; you would use Java for that part).
The script is called like:
python create_index.py data.txt data.idx 3
my example data file is:
#1{1,12,345,867}
#2{123, 3243534, 2132131231}
#3{234, 35345}
#4{}
and the script itself is:
import sys

# Usage: python this_script.py datafile indexfile lineno
# indexfile will be overwritten
# lineno is the data line which will be printed using the
# index file, as a demonstration
datafilename = sys.argv[1]
indexfilename = sys.argv[2]
lineno = int(sys.argv[3])

# Each entry stores a byte offset, so this format supports
# data files of up to 999999 bytes
INDEX_FORMAT = "%06d\n"
INDEX_LENGTH = 6 + 1  # +1 for newline

def create_indexfile():
    # Binary mode keeps byte offsets exact and avoids newline translation
    with open(indexfilename, "wb") as indexfile, open(datafilename, "rb") as f:
        # Write the offset of the first line
        indexfile.write((INDEX_FORMAT % 0).encode("ascii"))
        line = f.readline()
        while len(line) > 0:
            # After a readline, f.tell() is the offset of the next line
            indexfile.write((INDEX_FORMAT % f.tell()).encode("ascii"))
            line = f.readline()

# Retrieve the data of 1 line in the data file
# using the index file
def get_line():
    linepos = INDEX_LENGTH * (lineno - 1)
    with open(indexfilename, "rb") as indexfile:
        indexfile.seek(linepos)
        datapos = int(indexfile.readline())
    with open(datafilename, "rb") as datafile:
        datafile.seek(datapos)
        print(datafile.readline().decode(), end="")

if __name__ == '__main__':
    create_indexfile()
    get_line()
The index file needs to be rebuilt after every change to the data file. You can verify that you read the right data by comparing the line number in the data you read (#3{...}) with the requested line number, so it's fairly safe.
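The verification step could look roughly like the sketch below, which reads a range of lines (the startPos and endPos of the constructor in the question) and checks each one. It assumes the index layout produced by the script above (one fixed-width byte offset per line, INDEX_LENGTH bytes each). The names read_range and LINE_NO are hypothetical and not part of the original script.

```python
import re

INDEX_LENGTH = 6 + 1                 # width of one "%06d\n" index entry
LINE_NO = re.compile(rb"#(\d+)\{")   # the "#<number>{" prefix of an entry


def read_range(datafilename, indexfilename, start, end):
    """Return data lines start..end (1-based, inclusive) as raw bytes.

    Raises ValueError if a line read does not carry the expected number,
    which usually means the index no longer matches the data file.
    """
    # One index lookup gives the byte offset of the first wanted line
    with open(indexfilename, "rb") as indexfile:
        indexfile.seek(INDEX_LENGTH * (start - 1))
        datapos = int(indexfile.readline())

    lines = []
    with open(datafilename, "rb") as datafile:
        datafile.seek(datapos)
        # The remaining lines follow directly, so read them sequentially
        for expected in range(start, end + 1):
            line = datafile.readline()
            match = LINE_NO.match(line)
            if match is None or int(match.group(1)) != expected:
                raise ValueError(
                    "index is stale: expected line #%d, got %r" % (expected, line)
                )
            lines.append(line)
    return lines
```

Only the first line of the range needs an index lookup, so the cost is one seek in each file plus the lines actually wanted.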
Whether you choose to use it or not, I think the example is pretty clear and easy.
@extraneon
This is the class I want to use to represent a string #number{number, number,...}
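package logic;

public class DenominatedBinaryRow {
    private int sn;
    private BinaryRow row;

    public DenominatedBinaryRow(int sn, BinaryRow row) {
        this.sn = sn;
        this.row = row;
    }

    // Returns a new row numbered sn that holds this.row plus addend.row
    public DenominatedBinaryRow plus(int sn, DenominatedBinaryRow addend) {
        return new DenominatedBinaryRow(sn, this.row.plus(addend.row));
    }

    public int getSn() {
        return this.sn;
    }

    public BinaryRow getRow() {
        return this.row;
    }

    @Override
    public boolean equals(Object obj) {
        // Reject null and foreign types before casting
        if (!(obj instanceof DenominatedBinaryRow)) {
            return false;
        }
        DenominatedBinaryRow res = (DenominatedBinaryRow) obj;
        return this.getSn() == res.getSn() && this.getRow().equals(res.getRow());
    }

    // Equal objects must have equal hash codes
    @Override
    public int hashCode() {
        return 31 * this.sn + this.row.getRow().hashCode();
    }
}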
Maybe it would be more efficient to serialize it instead of converting the BinaryRow (its implementation is below) to a string? If I serialize many instances of it to a file, how will I deserialize just the instance I need? (I hope I understood your question correctly.)
package logic;

import java.util.*;

public class BinaryRow {
    private List<Integer> row;

    public BinaryRow() {
        this.row = new ArrayList<Integer>();
    }

    public List<Integer> getRow() {
        return this.row;
    }

    public void add(Integer arg) {
        this.getRow().add(arg);
    }

    public Integer get(int index) {
        return this.getRow().get(index);
    }

    public int size() {
        return this.getRow().size();
    }

    // Merges two rows that are already sorted in ascending order;
    // a value present in both rows is added to the result once
    public BinaryRow plus(BinaryRow addend) {
        BinaryRow result = new BinaryRow();
        int i = this.size();    // elements of this row not yet merged
        int j = addend.size();  // elements of addend not yet merged
        while (i > 0 && j > 0) {
            int a = this.get(this.size() - i);
            int b = addend.get(addend.size() - j);
            if (a < b) {
                result.add(a);
                i--;
            } else if (a > b) {
                result.add(b);
                j--;
            } else {
                result.add(a);
                i--;
                j--;
            }
        }
        // Copy whatever remains of either row
        for (int k = this.size() - i; k < this.size(); k++) {
            result.add(this.get(k));
        }
        for (int k = addend.size() - j; k < addend.size(); k++) {
            result.add(addend.get(k));
        }
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        // Reject null and foreign types before casting
        if (!(obj instanceof BinaryRow)) {
            return false;
        }
        // List.equals compares sizes and elements by value
        // (== and != on Integer objects would compare references)
        return this.getRow().equals(((BinaryRow) obj).getRow());
    }

    // Equal objects must have equal hash codes
    @Override
    public int hashCode() {
        return this.getRow().hashCode();
    }

    public long convertToDec() {
        long result = 0;
        for (Integer next : this.getRow()) {
            // 2^next as an exact long; assumes 0 <= next <= 62
            result += 1L << next;
        }
        return result;
    }
}
I am afraid that if you have to get to the x-th line, you will have to call readLine() x times. This means reading all the data until you reach that line. Any character could be a line end, so there is no way to go to the x-th line without reading every character before it.
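For comparison, the purely sequential approach is short to write. The sketch below is in Python, the language of the index script earlier in this thread; the function name read_lines_sequentially is made up for this illustration. It skips lines by reading and discarding them, so the cost still grows with the position of the first wanted line.

```python
from itertools import islice


def read_lines_sequentially(path, start, end):
    """Return lines start..end (1-based, inclusive) by scanning from the top."""
    with open(path, "rb") as f:
        # islice reads and discards the first start-1 lines,
        # then yields lines start..end and stops
        return list(islice(f, start - 1, end))
```

If the range runs past the end of the file, the result simply contains fewer lines than requested.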