
Separating ASCII text from binary content in a file

I have a file that contains both ASCII text and binary content. I would like to extract the text without having to parse the binary content, since the binary content is 180MB. Can I simply extract the text for further manipulation? What would be the best way of going about it?

The ASCII is at the very beginning of the file.


There are 4 libraries to read FITS files in Java here:

Java

nom.tam.fits classes

A Java FITS library has been developed which provides efficient -- at least for Java -- I/O for FITS images and binary tables. The Java libraries support all basic FITS formats and gzip compressed files. Support for access to data subsets is included and the HIERARCH convention may be used.

eap.fits

Includes an applet and application for viewing and editing FITS files. Also includes a general purpose package for reading and writing FITS data. It can read PGP encrypted files if the optional PGP jar file is available.

jfits

The jfits library supports FITS images and ASCII and binary tables. In-line modification of keywords and data is supported.

STIL

A pure java general purpose table I/O library which can read and write FITS binary tables amongst other table formats. It is efficient and can provide fast sequential or random read access to FITS tables much larger than physical memory. There is no support for FITS images.


I am not aware of any Java classes that will read the ASCII characters and ignore the rest, but the easiest thing I can come up with here is to use the strings utility (assuming you are on a Unix-based system).

SYNOPSIS
strings [ - ] [ -a ] [ -o ] [ -t format ] [ -number ] [ -n number ] [--] [file ...]

DESCRIPTION
Strings looks for ASCII strings in a binary file or standard input. Strings is useful for identifying random object files and many other things. A string is any sequence of 4 (the default) or more printing characters ending with a newline or a null. Unless the - flag is given, strings looks in all sections of the object files except the (__TEXT,__text) section. If no files are specified standard input is read.

You could then pipe the output to another file and do whatever you want with it.
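If the strings utility is not available (on Windows, say), the same idea can be approximated in Java. The following is only a rough sketch of the behaviour described above, not a reimplementation of strings: it collects every run of at least minLen printable ASCII bytes and ignores the "ending with a newline or a null" condition. The class name PrintableRuns and the method find are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, loosely modelled on what the strings utility does.
class PrintableRuns {

    // Returns every run of at least minLen printable ASCII bytes (' ' to '~').
    static List<String> find(InputStream in, int minLen) throws IOException {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= ' ' && b <= '~') {
                current.append((char) b);
            } else {
                // A non-printable byte ends the current run.
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Do not lose a run that reaches the end of the input.
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo data; for a real file, pass a BufferedInputStream
        // wrapped around a FileInputStream instead.
        byte[] data = {'h', 'e', 'l', 'l', 'o', 0, 'a', 'b', 1, 'w', 'o', 'r', 'l', 'd', '!'};
        for (String run : find(new ByteArrayInputStream(data), 4)) {
            System.out.println(run);
        }
    }
}
```

Note that this scans the whole input, so on a 180MB file it does far more work than is needed when the text is known to sit at the start.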

Edit: with the additional information that all the ASCII comes at the beginning, it would be a little easier to extract the text programmatically; still, running strings is faster than writing code.


Assuming you can tell where the end of the ASCII content is, just read characters from the file until you find the end of it, and close the file.


Supposing that there is some token which divides the file into the ASCII and binary components (say, "#END#" on a line all by itself), you can do something like the following:

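import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// ...

public static void main(String[] args) {
  // try-with-resources closes the file even if reading fails.
  try (BufferedReader b = new BufferedReader(new InputStreamReader(
      new FileInputStream("object.bin"), StandardCharsets.US_ASCII))) {

    String s;
    // Compare strings with equals(), not !=, and also stop at end of file (null).
    while ((s = b.readLine()) != null && !s.equals("#END#")) {
      // ASCII contents parsed here.
      System.out.println(s);
    }
  } catch (IOException e) {
    System.err.println("kablammo! " + e.getMessage());
  }
}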


Have a method that checks whether a particular character meets your criteria (here, I've covered characters that are found on the keyboard). Once you hit a character for which the method returns false, you know you've hit the binary. Note that valid ASCII characters may also form part of the binary so you may end up with a few extra characters at the end.

static boolean isAsciiCharacter(char c) {
    return (c >= ' ' && c <= '~') ||
            c == '\n' ||
            c == '\r';
}
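To show how such a check might be used, here is a minimal sketch that reads bytes from the start of a stream and stops at the first one that fails the test. It applies the same ranges as isAsciiCharacter to raw byte values. The class name AsciiPrefix and the method readAsciiPrefix are invented for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper built around the character test shown above.
class AsciiPrefix {

    // Same ranges as isAsciiCharacter, applied to a byte value in 0..255.
    static boolean isAsciiByte(int b) {
        return (b >= ' ' && b <= '~') || b == '\n' || b == '\r';
    }

    // Reads until the first byte that fails the test (or end of stream)
    // and returns everything read before it.
    static String readAsciiPrefix(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && isAsciiByte(b)) {
            text.append((char) b);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo data; for a real file, wrap a FileInputStream
        // in a BufferedInputStream so single-byte reads stay cheap.
        byte[] data = {'H', 'i', '\n', 0x00, (byte) 0xFF, 'x'};
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            System.out.print(readAsciiPrefix(in));
        }
    }
}
```

Because reading stops at the first failing byte, only the text portion of the file is ever touched; the 180MB of binary data is never read.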


The first 2880 bytes of a FITS file are ASCII header data, representing 36 80-column "card images". There are no line terminator characters, just a 36x80 ASCII array, padded out with blanks if necessary. There may be additional 2880-byte ASCII headers preceding the binary data; you'd have to parse the first set of headers to know how much ASCII to expect.
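If you do want to read the header by hand, the block layout described above could be walked roughly as follows. This sketch assumes that the primary header ends with a card containing only the keyword END, which is my understanding of the FITS standard; check it against the specification before relying on it. It handles only the first header, not any extension headers later in the file. The class name FitsHeaderCards and the method readPrimaryHeader are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads 80-character cards from 2880-byte header blocks.
class FitsHeaderCards {
    static final int CARD_SIZE = 80;
    static final int BLOCK_SIZE = 2880; // 36 cards per block

    // Reads whole blocks until a card consisting only of "END" is found
    // (assumption: this is how the header is terminated).
    static List<String> readPrimaryHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> cards = new ArrayList<>();
        byte[] block = new byte[BLOCK_SIZE];
        while (true) {
            // readFully throws EOFException if the file ends mid-block.
            data.readFully(block);
            for (int off = 0; off < BLOCK_SIZE; off += CARD_SIZE) {
                String card = new String(block, off, CARD_SIZE, StandardCharsets.US_ASCII);
                cards.add(card);
                if (card.trim().equals("END")) {
                    return cards; // blank padding after END is skipped
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny fake header in memory: two cards padded to one block.
        StringBuilder header = new StringBuilder();
        header.append(String.format("%-80s", "SIMPLE  =                    T"));
        header.append(String.format("%-80s", "END"));
        while (header.length() < BLOCK_SIZE) {
            header.append(' ');
        }
        byte[] bytes = header.toString().getBytes(StandardCharsets.US_ASCII);
        for (String card : readPrimaryHeader(new ByteArrayInputStream(bytes))) {
            System.out.println(card.trim());
        }
    }
}
```

Since the method always consumes whole 2880-byte blocks, the stream is left positioned at the first byte after the header when it returns.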

But I heartily endorse Oscar Reyes' advice to use an existing package to decode FITS files! Two of the packages he mentioned are hosted by NASA's Goddard Space Flight Center, who are also responsible for maintaining the FITS format. That's about as definitive a source as you can get.
