Parse Text File Line by Line, Skipping Certain Lines
I have a file that looks like this (but is much bigger):
>some text
ABC
DEF
GHI
>some more text
JKL
MNO
PQR
I have been playing around with it in Java for some time and have been able to build arrays with the lines, etc. The lines with '>' are usually one line but sometimes could be 2, 3 or more lines. The lines that don't begin with '>' are all the same length in characters, but there may be 10, 20, 30 or more of these lines. I am at the point now where I want to create a string array, where each element contains the concatenation of the lines that don't begin with '>', like so:
array element 1 = ABCDEFGHI
array element 2 = JKLMNOPQR
I feel like I am close but need a small kick in the butt to get me going. I'm sure this is easy for a pro, but I am still new to Java.
Specific problem is related to other posts I made on this board. It's a FASTA file:
>3BHS_BOVIN (P14893) 3 beta-hydroxysteroid
AGWSCLVTGGGGFLGQRIICLLVEEKDLQEIRVLDKVFRPEVREEFSKLQSKIKLTLLEG
DILDEQCLKGACQGTSVVIHTASVIDVRNAVPRETIMNVNVKGTQLLLEACVQASVPVFI
>41_BOVIN (Q9N179) Protein 4.1
MHCKVSLLDDTVYECVVEKHAKGQDLLKRVCEHLNLLEEDYFGLAIWDNATSKTWLDSAK
EIKKQVRGVPWNFTFNVKFYPPDPAQLTEDITRYYLCLQLRQDIVSGRLPCSFATLALLG
SYTIQSELGDYDPELHGADYVSDFKLAPNQTKELEEKVMELHKSYRSMTPAQADLEFLEN
>5NTD_BOVIN (Q05927) 5'-nucleotidase
MNPGAARTPALRILPLGALLWPAARPWELTILHTNDVHSRLEQTSEDSSKCVNASRCVGG
VARLATKVHQIRRAEPHVLLLDAGDQYQGTIWFTVYKGTEVAHFMNALGYESMALGNHEF
DNGVEGLIDPLLKEVNFPILSANIKAKGPLASKISGLYSPYKILTVGDEVVGIVGYTSKE
TPFLSNPGTNLVFEDEITALQPEVDKLKTLNVNKIIALGHSGFEVDKLIAQKVKGVDVVV
I ultimately need the sequences in their own array element so that I can manipulate them later.
Assuming you can iterate over the lines:
List<String> array = new ArrayList<String>();
StringBuilder buf = new StringBuilder();
for (String line : lines) {
    if (line.startsWith(">")) {
        if (buf.length() > 0) {
            array.add(buf.toString());
            buf.setLength(0);
        }
    } else {
        buf.append(line);
    }
}
if (buf.length() > 0) { // Add the final text element(s).
    array.add(buf.toString());
}
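If it helps, here is the same loop wired up to a reader so it can be run as is. This is only a sketch: the class name FastaSequences and the method name read are made up for illustration, and it assumes every line that starts with '>' is a header and everything else is sequence data. Wrapping the IOException in an UncheckedIOException (Java 8+) is a convenience choice on my part, not something the question requires.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
class FastaSequences {

    // Returns one concatenated string per record; header lines ('>') are dropped.
    static List<String> read(Reader source) {
        List<String> sequences = new ArrayList<String>();
        StringBuilder buf = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Header line: flush the sequence collected so far, if any.
                    // Consecutive header lines find an empty buffer and add nothing.
                    if (buf.length() > 0) {
                        sequences.add(buf.toString());
                        buf.setLength(0);
                    }
                } else {
                    buf.append(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Add the final record, which has no header after it.
        if (buf.length() > 0) {
            sequences.add(buf.toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        String sample = ">some text\nABC\nDEF\nGHI\n>some more text\nJKL\nMNO\nPQR\n";
        List<String> sequences = read(new StringReader(sample));
        // Convert to a plain String[] if an array is really needed.
        String[] asArray = sequences.toArray(new String[0]);
        for (String s : asArray) {
            System.out.println(s);
        }
    }
}
```

For a real file you would pass something like new FileReader(path) instead of the StringReader. With the sample above it prints ABCDEFGHI and then JKLMNOPQR.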
Try this. I didn't bother with proper variable names. It assumes the first line starts with '>'; if it doesn't, parsed.get(...) is called on an empty list and throws. It's probably not optimised either, but it should give you an idea of how this is possible.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;

public class Parse {
    public static void main(String[] args) throws IOException {
        String lala = ">some text\r\n" +
                "ABC\r\n" +
                "DEF\r\n" +
                "GHI\r\n" +
                ">some more text\r\n" +
                "JKL\r\n" +
                "MNO\r\n" +
                "PQR";
        // Read the input into a list of lines.
        ArrayList<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader( new StringReader( lala ) );
        String line;
        while( ( line = in.readLine() ) != null ) {
            lines.add( line );
        }
        ArrayList<String> parsed = new ArrayList<String>();
        for( String s : lines ) {
            if( s.startsWith(">") ) {
                // Header line: start a new entry, unless the last one is still
                // empty (which happens when a header spans several lines).
                if( parsed.isEmpty() || !parsed.get( parsed.size() - 1 ).isEmpty() ) {
                    parsed.add("");
                }
            } else {
                // Sequence line: append it to the current entry.
                String current = parsed.get( parsed.size() - 1 );
                parsed.set( parsed.size() - 1, current + s );
            }
        }
        for( String s : parsed ) {
            System.out.println( s );
        }
    }
}
The above will output:
ABCDEFGHI
JKLMNOPQR
Another way you could do it: in the 'in.readLine()' loop, check for the '>' and, if the line starts with one, append a '<' to the end of that string before adding it to 'lines'. You can then use a regex to pull the other lines back out later.
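A rough sketch of that marker idea, in case it is not obvious how the regex part would go. The class name MarkerRegexSketch and the method name extract are hypothetical, and it assumes the header text itself never contains '<' or '>' (otherwise the pattern would pick up part of a header as a sequence).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, named for illustration only.
class MarkerRegexSketch {

    static List<String> extract(List<String> lines) {
        // Join everything into one string, closing each header line with '<',
        // e.g. ">some text<ABCDEFGHI>some more text<JKLMNOPQR".
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
            joined.append(line);
            if (line.startsWith(">")) {
                joined.append('<');
            }
        }
        // A sequence is whatever sits between a '<' and the next '>' (or the end).
        // The '+' skips the empty gap between consecutive header lines.
        List<String> sequences = new ArrayList<String>();
        Matcher m = Pattern.compile("<([^<>]+)").matcher(joined);
        while (m.find()) {
            sequences.add(m.group(1));
        }
        return sequences;
    }
}
```

Calling extract on the eight sample lines from the question should give a list with ABCDEFGHI and JKLMNOPQR. The plain StringBuilder loop is simpler and does not depend on what the headers contain, so I would only use this if you need the joined string for something else anyway.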
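Something like this?
// Needs java.io.BufferedReader, java.io.FileReader, java.io.IOException,
// java.util.ArrayList and java.util.List.
List<String> lines = new ArrayList<String>();
// Open the file for reading; fileName is a placeholder for the path to your file
try {
    BufferedReader br = new BufferedReader(new FileReader(fileName));
    String thisLine;
    while ((thisLine = br.readLine()) != null) {
        // Keep only the lines that do not start with '>'
        if (!thisLine.startsWith(">")) {
            lines.add(thisLine);
        }
    }
    br.close();
} catch (IOException e) {
    System.err.println("Error: " + e);
}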
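Skipping the lines starting with '>' is easy:
while ((line = istream.readLine()) != null) {
    if (line.startsWith(">")) continue; // skip header lines
    // do normal concat to buffer
}
Moving on to the next buffer at lines starting with '>' is a bit more involved:
// buffer is a StringBuilder and sequences a List<String>, both declared before the loop
while ((line = istream.readLine()) != null) {
    if (line.startsWith(">")) {
        // append the current buffer to the list (if it is not empty) and create a new one
        if (buffer.length() > 0) {
            sequences.add(buffer.toString());
            buffer = new StringBuilder();
        }
        continue;
    }
    buffer.append(line); // normal concat to buffer
}
// after the loop, add the last buffer to the list if it is not empty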