Java regular expression on JPG carving
I'm having a few problems using regular expressions in Java. I'm attempting to search through an ISO file, and carve out any JPG images, if there are any in there.
At the moment, I'm having success with locating EXIF information within the JPG, using the following regular expression:
Pattern imageRegex = Pattern.compile("\\x45\\x78\\x69\\x66"); //Exif regex
This works fine and I can then file carve out the EXIF information.
However, if I use this regex:
Pattern imageRegex = Pattern.compile("\\xff\\xd8\\xff"); //JPG header regex
Java fails to find any matches. I can confirm that there are JPGs present within the ISO file.
I'm reading in 200 bytes of the file at a time into a byte array and then converting that to a string to be regex'd.
Can anyone advise why this is happening? It's rather confusing.
Or can anyone advise a better way of approaching the issue of file carving JPGs using regular expressions in Java?
Any advice would be greatly appreciated.
"I'm reading in 200 bytes of the file at a time into a byte array and then converting that to a string to be regex'd."
Maybe all the JPEG headers are split across the N*200 borders.
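To illustrate the boundary problem, here is a minimal sketch that simulates chunked searching over an in-memory array. The class and method names (ChunkBoundaryDemo, countChunked) are hypothetical, and the 2-byte overlap is an assumption that follows from the header being 3 bytes long: a header straddling a boundary is missed when each chunk is searched on its own, and found when the last two bytes of the previous chunk are carried over.

```java
final class ChunkBoundaryDemo {
    // JPEG start-of-image marker followed by the first byte of the next marker.
    private static final byte[] HEADER = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    // Counts occurrences of HEADER that lie entirely within buf[from, to).
    static int countMatches(byte[] buf, int from, int to) {
        int count = 0;
        for (int i = from; i + HEADER.length <= to; i++) {
            if (buf[i] == HEADER[0] && buf[i + 1] == HEADER[1] && buf[i + 2] == HEADER[2]) {
                count++;
            }
        }
        return count;
    }

    // Searches data one chunk at a time. Each search window is extended
    // backwards by `overlap` bytes, mimicking a reader that keeps the tail
    // of the previous chunk. With overlap = 0 the chunks are independent.
    static int countChunked(byte[] data, int chunkSize, int overlap) {
        int count = 0;
        for (int start = 0; start < data.length; start += chunkSize) {
            int from = Math.max(0, start - overlap);
            int to = Math.min(data.length, start + chunkSize);
            count += countMatches(data, from, to);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = new byte[400];
        // Place a header so that it straddles the 200-byte boundary.
        data[199] = (byte) 0xFF;
        data[200] = (byte) 0xD8;
        data[201] = (byte) 0xFF;
        System.out.println(countChunked(data, 200, 0)); // prints 0 (header missed)
        System.out.println(countChunked(data, 200, 2)); // prints 1 (header found)
    }
}
```

An overlap of two bytes cannot double-count a match, because a 3-byte header can never fit entirely inside the carried-over region.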
Anyway, this is a rather unconventional (and inefficient) way of searching binary data. Why don't you just go through the input stream until you find the header?
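A minimal sketch of that idea, assuming you only need the offsets of each 0xff 0xd8 0xff sequence. The class and method names (JpegHeaderScanner, findJpegHeaders) are made up for illustration. It remembers the previous two bytes, so chunk boundaries never come into play:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class JpegHeaderScanner {
    // Returns the offset of every FF D8 FF sequence found in the stream.
    static List<Long> findJpegHeaders(InputStream in) throws IOException {
        List<Long> offsets = new ArrayList<>();
        int prev2 = -1; // byte read two steps ago (-1 = none yet)
        int prev1 = -1; // byte read one step ago
        long pos = 0;   // offset of the byte currently being examined
        int b;
        while ((b = in.read()) != -1) {
            // read() returns 0..255, so no sign issues when comparing.
            if (prev2 == 0xFF && prev1 == 0xD8 && b == 0xFF) {
                offsets.add(pos - 2);
            }
            prev2 = prev1;
            prev1 = b;
            pos++;
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {0x00, (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE1, 0x45};
        System.out.println(findJpegHeaders(new ByteArrayInputStream(sample))); // prints [1]
    }
}
```

For a real ISO file you would presumably wrap a FileInputStream in a BufferedInputStream before passing it in, since reading a byte at a time from an unbuffered file stream is slow.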
If you're reading in a byte array and converting it to a string, it's possible that string encoding issues are biting you in the rear. It so happens that the EXIF pattern you're looking for is all ASCII-compatible:
0x45 0x78 0x69 0x66
E    x    i    f
but the JPEG header isn't:
0xff 0xd8 0xff
You'd do well to follow Jakub's advice and skip the regular expressions.
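If you want to see the encoding effect for yourself, here is a small sketch that decodes both byte sequences and lists the resulting code points. The class and method names (HeaderDecodingDemo, describe) are hypothetical, and the choice of UTF-8 and ISO-8859-1 as the two charsets to compare is an assumption; your platform default may be something else.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HeaderDecodingDemo {
    // Decodes the bytes with the given charset and lists each resulting
    // char as U+XXXX, separated by spaces.
    static String describe(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) decoded.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] exif = {0x45, 0x78, 0x69, 0x66};
        byte[] jpeg = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

        // ASCII bytes survive decoding unchanged in both charsets.
        System.out.println(describe(exif, StandardCharsets.UTF_8));
        System.out.println(describe(exif, StandardCharsets.ISO_8859_1));

        // The JPEG header bytes are not valid UTF-8, so the decoder
        // substitutes the replacement character U+FFFD for them.
        System.out.println(describe(jpeg, StandardCharsets.UTF_8));
        // ISO-8859-1 maps every byte to the code point of the same value.
        System.out.println(describe(jpeg, StandardCharsets.ISO_8859_1));
    }
}
```

Once the header bytes have been turned into replacement characters, no pattern written in terms of \xff and \xd8 can match them.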
Using regex to match binary sequences is rarely appropriate; I wonder if you are well aware of the conceptual differences between binary data and strings in Java (as opposed to, say, C).
A JPEG file is binary data (a sequence of bytes). To use it with a regex pattern, you must have it in Java as a String (a sequence of characters). The two are fundamentally different entities, and converting from one to the other requires a charset encoding to be specified. Further, when you write \x45 inside a pattern, you do not mean (as you seem to believe) "the byte with binary value 0x45" (this would not make sense, because we are not dealing with bytes) but "the Unicode code point 0x45".
It's true that in several common charset encodings (in particular UTF-8, and ISO-8859-1 and its variants) a byte in the ASCII range (0 to 127) is converted to the code point with that same value. But for other encodings (such as UTF-16) or other values (in the 128-255 range) that's not necessarily true. In particular, it's not true for UTF-8 in the 128-255 range. It is true for ISO-8859-1, but you should not rely on this "coincidence" (if you can call it a coincidence).
In your scenario, I'd say that if you specify ISO-8859-1 encoding you will probably get what you expect. But it would still smell bad.
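For what it's worth, a rough sketch of that ISO-8859-1 route might look like the following. The class and method names (Latin1RegexCarver, findHeaders) are hypothetical, and it assumes the data to search is already in a single byte array rather than split into 200-byte pieces.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Latin1RegexCarver {
    // Same pattern as in the question: code points U+00FF U+00D8 U+00FF.
    private static final Pattern JPEG_HEADER = Pattern.compile("\\xff\\xd8\\xff");

    // Decodes the bytes as ISO-8859-1, so that byte n becomes char n and
    // string indexes line up with byte offsets, then runs the regex.
    static List<Integer> findHeaders(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Matcher matcher = JPEG_HEADER.matcher(text);
        List<Integer> offsets = new ArrayList<>();
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = {0x45, (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(findHeaders(sample)); // prints [1]
    }
}
```

This works only because ISO-8859-1 happens to map each byte to the code point with the same value; it also doubles the memory used, since every byte becomes a 16-bit char.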
Exercise: try to predict/understand what this code prints:
public class EncodingExercise {
    public static void main(String[] args) throws Exception {
        // Two bytes: one in the ASCII range, one above it.
        byte[] b = { 0x30, (byte) 0xb2 };
        // Decode the same bytes with two different charsets.
        String x = new String(b, "ISO-8859-1");
        System.out.println(x.matches(".*\\x30.*"));
        System.out.println(x.matches(".*\\xb2.*"));
        String x2 = new String(b, "UTF-8");
        System.out.println(x2.matches(".*\\x30.*"));
        System.out.println(x2.matches(".*\\xb2.*"));
    }
}
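Answer: it prints true, true, true, false (one per line). With ISO-8859-1, each byte maps to the code point of the same value, so both patterns match. With UTF-8, 0x30 still decodes to "0", but the lone byte 0xb2 is not a valid UTF-8 sequence and is decoded to the replacement character U+FFFD, so \xb2 does not match.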