How to tokenize an input file in Java
I am tokenizing a text file in Java. I want to read an input file, tokenize it, and write certain tokens to an output file. This is what I've done so far:
package org.apache.lucene.analysis;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StreamTokenizer;
class StringProcessing {
// Create BufferedReader class instance
public static void main(String[] args) throws IOException {
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader keyboardInput = new BufferedReader(input);
System.out.print("Please enter a java file name: ");
String filename = keyboardInput.readLine();
if (!filename.endsWith(".DAT")) {
System.out.println("This is not a DAT file.");
System.exit(0);
}
File File = new File(filename);
if (File.exists()) {
FileReader file = new FileReader(filename);
StreamTokenizer streamTokenizer = new StreamTokenizer(file);
int i = 0;
int numberOfTokensGenerated = 0;
while (i != StreamTokenizer.TT_EOF) {
i = streamTokenizer.nextToken();
numberOfTokensGenerated++;
}
// Output number of characters in the line
System.out.println("Number of tokens = " + numberOfTokensGenerated);
// Output tokens
for (int counter = 0; counter < numberOfTokensGenerated; counter++) {
char character = file.toString().charAt(counter);
if (character == ' ') { System.out.println(); } else { System.out.print(character); }
}
} else {
System.out.println("File does not exist!");
System.exit(0);
}
System.out.println("\n");
}//end main
}//end class
When I run this code, this is what I get:
Please enter a java file name: D://eclipse-java-helios-SR1-win32/LexractData.DAT
Number of tokens = 129
java.io.FileReader@19821fException in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 25
at java.lang.String.charAt(Unknown Source)
at org.apache.lucene.analysis.StringProcessing.main(StringProcessing.java:40)
The input file will look like this:
-K1 Account
--Op1 withdraw
---Param1 an
----Type Int
---Param2 amount
----Type Int
--Op2 deposit
---Param1 an
----Type Int
---Param2 Amount
----Type Int
--CA1 acNo
---Type Int
-K2 CheckAccount
--SC Account
--CA1 credit_limit
---Type Int
-K3 Customer
--CA1 name
---Type String
-K4 Transaction
--CA1 date
---Type Date
--CA2 time
---Type Time
-K5 CheckBook
-K6 Check
-K7 BalanceAccount
--SC Account
I just want to read the strings that start with -K1, -K2, -K3, and so on. Can anyone help me?
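The problem is with this line:
char character = file.toString().charAt(counter);
file is a reference to a FileReader, which does not override toString(), so the call falls through to Object.toString(). That returns the class name followed by @ and a hash code (here java.io.FileReader@19821f, which is 25 characters long), not the contents of the file. The loop runs once per token counted, so it soon asks for index 25, the 26th character, and charAt throws the StringIndexOutOfBoundsException.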
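To make the toString() behaviour concrete, here is a minimal sketch. It uses a StringReader in place of the FileReader so that no file is needed (neither class overrides toString()); the class name ToStringDemo and the helper defaultToString are made up for this illustration.

```java
import java.io.StringReader;

public class ToStringDemo {
    // Rebuilds the text that Object.toString() produces:
    // class name, '@', then the hash code in hexadecimal.
    public static String defaultToString(Object o) {
        return o.getClass().getName() + "@" + Integer.toHexString(o.hashCode());
    }

    public static void main(String[] args) {
        // Assumption: StringReader stands in for FileReader here.
        StringReader reader = new StringReader("-K1 Account\n--Op1 withdraw\n");

        // Prints something like java.io.StringReader@<hex>; the hex part varies per run.
        System.out.println(reader.toString());

        // The two strings are identical, and neither contains the text being read.
        System.out.println(reader.toString().equals(defaultToString(reader)));
    }
}
```

The string is a short label for the object, so indexing into it with a token count taken from the file will run off the end.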
To read the file's contents, wrap the FileReader in a BufferedReader and append each line returned by readLine() to a StringBuilder:
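FileReader fr = new FileReader(filename);
BufferedReader br = new BufferedReader(fr);
StringBuilder sb = new StringBuilder();
String s;
while ((s = br.readLine()) != null) {
    // readLine() strips the line terminator, so add one back to keep lines separate
    sb.append(s).append('\n');
}
br.close(); // also closes the underlying FileReader
// Now use sb.toString() instead of file.toString()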
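Since the question only needs the lines that begin with -K1, -K2, and so on, the same readLine() loop can filter as it goes instead of collecting everything. The following is a rough sketch, not a definitive solution: the class name KLineFilter, the method keepKLines, and the pattern (one hyphen, K, one or more digits, then whitespace) are assumptions based on the sample input in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class KLineFilter {
    // Assumed shape of a wanted line: "-K", digits, whitespace, then anything.
    private static final Pattern K_LINE = Pattern.compile("^-K\\d+\\s.*");

    // Reads every line from the reader and keeps those matching K_LINE.
    public static List<String> keepKLines(BufferedReader reader) throws IOException {
        List<String> kept = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (K_LINE.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // A few lines from the question's input, inlined so the sketch runs without a file.
        String sample = "-K1 Account\n--Op1 withdraw\n---Param1 an\n-K2 CheckAccount\n--SC Account\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            for (String line : keepKLines(reader)) {
                System.out.println(line);
            }
        }
    }
}
```

To run it against the real file, pass new BufferedReader(new FileReader(filename)) instead of the StringReader.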
If you want to tokenize the input file, the obvious choice is a Scanner. The Scanner class reads a given input stream and can return either tokens or other scanned types (scanner.nextInt(), scanner.nextLine(), etc.).
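import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class ScannerExample {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the Scanner (and the file) when done
        try (Scanner in = new Scanner(new File("filename.dat"))) {
            while (in.hasNext()) { // hasNext() is a method, so it needs parentheses
                String s = in.next(); // get the next whitespace-delimited token in the file
                // Now s contains a token from the file
            }
        }
    }
}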
Check out Oracle's documentation of the Scanner class for more info.
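Applied to the input in the question, a Scanner can look for the -K markers token by token and take the token that follows each one. This is only a sketch under assumptions: the class name KTokenScanner and the method classNames are invented here, and it assumes every -K marker is followed by exactly one name token, as in the sample file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class KTokenScanner {
    // Returns the token that follows each "-K<digits>" marker.
    public static List<String> classNames(Scanner in) {
        List<String> names = new ArrayList<>();
        while (in.hasNext()) {
            String token = in.next();
            // String.matches() tests the whole token, so "--Op1" or "-Kx" do not qualify.
            if (token.matches("-K\\d+") && in.hasNext()) {
                names.add(in.next());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Inlined sample so the sketch runs without a file;
        // use new Scanner(new File(...)) for real input.
        String sample = "-K1 Account\n--Op1 withdraw\n---Param1 an\n-K2 CheckAccount\n--SC Account\n";
        try (Scanner in = new Scanner(sample)) {
            System.out.println(classNames(in)); // prints [Account, CheckAccount]
        }
    }
}
```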
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.StringTokenizer;

public class FileTokenize {
    public static void main(String[] args) throws IOException {
        final var lines = Files.readAllLines(Path.of("myfile.txt"));
        // try-with-resources closes the writer even if an exception is thrown
        try (FileWriter writer = new FileWriter("output.txt")) {
            for (String data : lines) {
                StringTokenizer token = new StringTokenizer(data);
                while (token.hasMoreTokens()) {
                    // write each whitespace-delimited token on its own line
                    writer.write(token.nextToken() + "\n");
                }
            }
        }
    }
}
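The code above writes every token to the output file, whereas the question asks only for the -K entries. A possible variation, sketched here with java.nio.file.Files and a stream, copies just the lines that start with -K. The class name KLineCopier, the method copyKLines, and the temporary file names are assumptions for illustration, and the startsWith("-K") test relies on the sample input, where every other line starts with at least two hyphens.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class KLineCopier {
    // Copies only the lines beginning with "-K" from input to output.
    public static void copyKLines(Path input, Path output) throws IOException {
        List<String> kept;
        // Files.lines keeps the file open, so close the stream with try-with-resources.
        try (Stream<String> lines = Files.lines(input)) {
            kept = lines.filter(line -> line.startsWith("-K"))
                        .collect(Collectors.toList());
        }
        Files.write(output, kept); // writes one list element per line
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real input and output paths.
        Path input = Files.createTempFile("lexract", ".DAT");
        Path output = Files.createTempFile("k-lines", ".txt");
        Files.write(input, List.of("-K1 Account", "--Op1 withdraw", "-K2 CheckAccount", "--SC Account"));

        copyKLines(input, output);
        System.out.println(Files.readAllLines(output)); // prints [-K1 Account, -K2 CheckAccount]
    }
}
```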