Parse Text File Line by Line, Skipping Certain Lines

2023-03-12 19:49 Q&A author:

I have a file that looks like this (but is much bigger):

>some text
ABC
DEF
GHI
>some more text
JKL
MNO
PQR

I have been playing around with it in Java for some time and have been able to build arrays with the lines, etc. The lines with '>' are usually one line but sometimes could be 2, 3 or more lines. The lines that don't begin with '>' are the same length in characters, but there may be 10, 20, 30 or more of these lines. I am at the point now where I want to create a string array, where each string in the array contains the concatenated lines that don't begin with '>', like so:

array element 1 = ABCDEFGHI
array element 2 = JKLMNOPQR

I feel like I am close but need a small kick in the butt to get me going. I'm sure this is easy for a pro, but I am still new to Java.

The specific problem is related to other posts I made on this board. It's a FASTA file:

>3BHS_BOVIN (P14893) 3 beta-hydroxysteroid
AGWSCLVTGGGGFLGQRIICLLVEEKDLQEIRVLDKVFRPEVREEFSKLQSKIKLTLLEG
DILDEQCLKGACQGTSVVIHTASVIDVRNAVPRETIMNVNVKGTQLLLEACVQASVPVFI
>41_BOVIN (Q9N179) Protein 4.1 
MHCKVSLLDDTVYECVVEKHAKGQDLLKRVCEHLNLLEEDYFGLAIWDNATSKTWLDSAK
EIKKQVRGVPWNFTFNVKFYPPDPAQLTEDITRYYLCLQLRQDIVSGRLPCSFATLALLG
SYTIQSELGDYDPELHGADYVSDFKLAPNQTKELEEKVMELHKSYRSMTPAQADLEFLEN
>5NTD_BOVIN (Q05927) 5'-nucleotidase 
MNPGAARTPALRILPLGALLWPAARPWELTILHTNDVHSRLEQTSEDSSKCVNASRCVGG
VARLATKVHQIRRAEPHVLLLDAGDQYQGTIWFTVYKGTEVAHFMNALGYESMALGNHEF
DNGVEGLIDPLLKEVNFPILSANIKAKGPLASKISGLYSPYKILTVGDEVVGIVGYTSKE
TPFLSNPGTNLVFEDEITALQPEVDKLKTLNVNKIIALGHSGFEVDKLIAQKVKGVDVVV

I ultimately need the sequences in their own array element so that I can manipulate them later.

Assuming you can iterate over the lines:

List<String> array = new ArrayList<String>();
StringBuilder buf = new StringBuilder();
for (String line : lines) {
  if (line.startsWith(">")) {
    if (buf.length() > 0) {
      array.add(buf.toString());
      buf.setLength(0);
    }
  } else {
    buf.append(line);
  }
}
if (buf.length() > 0) { // Add the final text element(s).
  array.add(buf.toString());
}
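
For completeness, here is a minimal sketch of the same loop reading straight from a BufferedReader instead of a pre-built collection of lines. The class name FastaSequences and the method name readSequences are made up for illustration, and the sample input is the small example from the question. Trimming each sequence line is an extra assumption, added so that trailing whitespace does not end up inside a sequence.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class FastaSequences {
    // Returns one concatenated sequence per record; header lines (">...") are skipped.
    static List<String> readSequences(BufferedReader reader) throws IOException {
        List<String> sequences = new ArrayList<String>();
        StringBuilder buf = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(">")) {
                // A header closes the previous record, if it collected anything.
                // Consecutive header lines therefore do not create empty entries.
                if (buf.length() > 0) {
                    sequences.add(buf.toString());
                    buf.setLength(0);
                }
            } else {
                // Assumption: surrounding whitespace is not part of the sequence.
                buf.append(line.trim());
            }
        }
        // Add the final record, which has no header after it.
        if (buf.length() > 0) {
            sequences.add(buf.toString());
        }
        return sequences;
    }

    public static void main(String[] args) throws IOException {
        String sample = ">some text\nABC\nDEF\nGHI\n>some more text\nJKL\nMNO\nPQR\n";
        List<String> sequences = readSequences(new BufferedReader(new StringReader(sample)));
        // The question asks for a String array; toArray converts the list.
        String[] array = sequences.toArray(new String[0]);
        for (String s : array) {
            System.out.println(s);
        }
    }
}
```

For a real file, the StringReader would be swapped for a FileReader on the file's path. Collecting into a List first and converting at the end avoids having to know the number of records up front.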

Try this. I didn't bother with proper variable names. It also assumes the first line starts with a >. It's probably not optimised either, but it should give you an idea of how this can be done.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;


public class Parse {
    public static void main(String[] args) throws IOException {
        String lala = ">some text\r\n" + 
                "ABC\r\n" + 
                "DEF\r\n" + 
                "GHI\r\n" + 
                ">some more text\r\n" + 
                "JKL\r\n" + 
                "MNO\r\n" + 
                "PQR";

        ArrayList<String> lines = new ArrayList<String>();

        BufferedReader in = new BufferedReader( new StringReader( lala ) );

        String line;
        while( ( line = in.readLine() ) != null ) {
            lines.add( line );
        }

        ArrayList<String> parsed = new ArrayList<String>();

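        for( String s : lines ) {
            if( s.startsWith(">") ) {
                // Header line: start a new entry, unless the last entry is still
                // empty (a multi-line header would otherwise add blank entries).
                if( parsed.isEmpty() || !parsed.get( parsed.size() - 1 ).isEmpty() ) {
                    parsed.add("");
                }
            } else {
                // Sequence line: append it to the most recent entry.
                String current = parsed.get( parsed.size() - 1 );
                parsed.set( parsed.size() - 1, current + s );
            }
        }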

        for( String s : parsed ) {
            System.out.println( s );
        }
    }

}

The above will output:

ABCDEFGHI
JKLMNOPQR

Another interesting way to do it: in the 'in.readLine()' loop, check each line for a leading > and, if it is there, append a < to the end of that line before adding it to 'lines'. You can then use a regex to pull the other lines back out later.
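
One possible reading of that idea, as a rough sketch rather than the answerer's exact code: mark the end of every header line with <, join all lines into one string, and split on the pattern that matches a whole header. The class name RegexFastaSketch and the method name sequences are made up for illustration. It assumes that header text never contains < and that sequence lines never contain >, which holds for the sample data in the question but is not guaranteed in general.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class RegexFastaSketch {
    static List<String> sequences(List<String> lines) {
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
            joined.append(line);
            if (line.startsWith(">")) {
                // Mark where the header ends so the regex can find it later.
                joined.append('<');
            }
        }
        List<String> result = new ArrayList<String>();
        // Each header now looks like ">...<"; everything between headers is sequence data.
        for (String part : joined.toString().split(">[^<]*<")) {
            // Leading headers and multi-line headers produce empty pieces; drop them.
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }
}
```

The StringBuilder loop in the first answer is simpler and does not depend on which characters appear in the headers, so this is mainly useful as an illustration of the regex variant.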

Something like this?

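// Needs java.io.BufferedReader, java.io.FileReader, java.io.IOException,
// java.util.ArrayList and java.util.List.
List<String> lines = new ArrayList<String>();
// Open the file for reading; fileName is a placeholder for the path to your file.
try {
    BufferedReader br = new BufferedReader(new FileReader(fileName));
    String thisLine;
    while ((thisLine = br.readLine()) != null) {
        // Keep only non-empty lines that do not start with '>'.
        // startsWith is safe on empty lines, unlike charAt(0).
        if (!thisLine.isEmpty() && !thisLine.startsWith(">")) {
            lines.add(thisLine);
        }
    }
    br.close();
} catch (IOException e) {
    System.err.println("Error: " + e);
}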

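Skipping the lines starting with > is easy:

while ((line = istream.readLine()) != null) {
    // startsWith is safe on empty lines, unlike charAt(0)
    if (line.startsWith(">")) continue;

    buffer.append(line); // normal concat; buffer is assumed to be a StringBuilder
}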

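Moving on to a new buffer at each line starting with > is a bit more involved:

// buffer is assumed to be a StringBuilder and list a List<String>
while ((line = istream.readLine()) != null) {
    if (line.startsWith(">")) {
        // store the current buffer in the list (only if it is not empty)
        // and start a new one
        if (buffer.length() > 0) {
            list.add(buffer.toString());
            buffer = new StringBuilder();
        }
        continue;
    }

    buffer.append(line); // normal concat to the current buffer
}
// the last buffer has no header after it, so add it here
if (buffer.length() > 0) {
    list.add(buffer.toString());
}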
