How are hex sequences translated to assembly without ambiguity?
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
A sequence like the one above can be segmented in various ways, and each segment can be translated into a corresponding assembly instruction, yet each binary executable has only one DEFINITE assembly. What is the mathematical principle that avoids ambiguity?
UPDATE
The answer with the most votes doesn't actually answer my question at all.
Knowing your starting point.
In other words, given a specific starting byte of an instruction, it is unambiguous where the instruction ends, thus giving you the starting byte of the next instruction and allowing you to continue. Given an arbitrary block of memory it is impossible to break it up into individual instructions without knowing where the first instruction begins.
From a more mathematical perspective, there is no valid instruction whose bytes are a prefix of another valid instruction. So if ab is valid, then you know that ab cd cannot be valid, so ab must be one instruction and cd is the start of the next instruction.
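A minimal sketch of that property in Python, assuming a toy two-entry instruction set (the byte patterns happen to match the first instructions of the question's sequence, but this is nowhere near real x86 decoding): because no pattern is a prefix of another, the greedy left-to-right match below is the only possible segmentation.

# Toy prefix-free "instruction set": no entry is a prefix of another.
TOY_ISA = {
    bytes([0x8B, 0xEC]): "mov ebp, esp",
    bytes([0x56]):       "push esi",
}

def decode(stream: bytes):
    """Greedy left-to-right match; prefix-freeness makes it unambiguous."""
    out, i = [], 0
    while i < len(stream):
        for pattern, mnemonic in TOY_ISA.items():
            if stream.startswith(pattern, i):
                out.append(mnemonic)
                i += len(pattern)
                break
        else:
            raise ValueError(f"no instruction starts at offset {i}")
    return out

print(decode(bytes([0x8B, 0xEC, 0x56])))   # ['mov ebp, esp', 'push esi']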
If I understand your question correctly, you're trying to understand why
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
could be split, e.g., as
8BEC 568BF4 68007040 00FF 15BC 8240
rather than, say,
8B EC568B F4 68007040 00FF 15BC 8240
This is entirely specified by the ISA of your architecture. That document describes exactly how instructions are uniquely constructed from a series of bytes.
For the ISA to be well formed, a single series of bytes can correspond to at most a single series of decoded instructions (might be less, if there are invalid instructions).
To get a bit more concrete, let's take the x86 example: if you want to know what each byte corresponds to, have a look here.
You'll see that, e.g. an instruction starting with 00 is an add (additional parameters are in the next byte, with a specific encoding).
You'll also see that some values are actually prefixes that modify the following instruction (0F - prefix to extend the opcode space, 26, 2E, 36, 3E, 64, 65, 66, 67, F0, F2, F3), and that some of them take different meaning based on the exact following instruction. Those are not opcodes, but they can alter the encoding of the arguments of the opcode, or introduce a completely new opcode space (e.g. SSE uses 0F).
Overall, the x86 encoding is very complex; be thankful for disassemblers.
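A minimal sketch of what "table-driven" decoding looks like, with heavy simplifications of my own: only a handful of one-byte opcodes, prefixes merely collected (0F in reality opens a second opcode map), and no ModRM/SIB or immediate parsing at all.

# Drastically simplified sketch of the first step of x86 decoding:
# strip prefix bytes, then dispatch on the opcode byte.
PREFIXES = {0x0F, 0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65, 0x66, 0x67, 0xF0, 0xF2, 0xF3}
ONE_BYTE_OPCODES = {                    # tiny excerpt of the one-byte opcode map
    0x00: "add r/m8, r8",
    0x56: "push esi",
    0x68: "push imm32",
    0x8B: "mov r32, r/m32",
    0xFF: "group 5 (call/jmp/push ...)",
}

def classify(code: bytes, i: int = 0):
    prefixes = []
    while code[i] in PREFIXES:          # real hardware treats 0F specially;
        prefixes.append(hex(code[i]))   # here it is lumped in with prefixes
        i += 1
    return prefixes, ONE_BYTE_OPCODES.get(code[i], "unknown")

print(classify(bytes([0x66, 0x8B, 0xEC])))   # (['0x66'], 'mov r32, r/m32')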
First of all you have to distinguish between RISC and CISC architectures.
In a RISC architecture you usually have instructions of the same size, so ambiguity cannot arise. Your CPU will fetch, for example, 4 bytes for every instruction, and since it has to start from somewhere (your CPU doesn't just have a sequence like the one you presented; it certainly has a starting point), once it has the right alignment no problem can occur.
What happens with a CISC instruction set is essentially the same: starting from the entry point of the program, it fetches instructions according to your opcodes. It doesn't need to know how to mathematically resolve ambiguities, because it will never find itself not knowing how long the next instruction is or where the last one finished.
So asking how to separate every instruction is like asking how to separate every word in
thepenisonthetable
There's no mathematical proof, but you know which letters belong together and which combinations are not meaningful. The previous sentence contains "son", but you know that it comes from "is on". You wouldn't be able to say so without having a meaningful phrase, but your CPU only executes meaningful programs, so what's the point?
So if the CPU could work on the previous sentence, it would find the first meaningful instruction "the", then "pen", "is", "on", and the "son" could never be recognized anyway.
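To make the analogy concrete, here is a small sketch (the word list and the longest-match rule are my own simplification, not something a CPU actually does): starting from the beginning of the phrase, a greedy scan recovers "the pen is on the table", and "son" is never even considered because the decoder has already moved past those letters.

WORDS = ["the", "pen", "is", "on", "table", "son"]    # toy dictionary

def segment(text: str):
    """Greedy left-to-right segmentation from a known starting point."""
    out, i = [], 0
    while i < len(text):
        for w in sorted(WORDS, key=len, reverse=True):   # prefer longest match
            if text.startswith(w, i):
                out.append(w)
                i += len(w)
                break
        else:
            raise ValueError(f"cannot segment at position {i}")
    return out

print(segment("thepenisonthetable"))
# ['the', 'pen', 'is', 'on', 'the', 'table']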
EDIT:
To be clear, in CISC architectures the only constraint you need in order to avoid ambiguity is that no instruction is a prefix of another. Let's assume a finite alphabet composed of the letters a-z instead of hex digits (just for practical purposes).
If the program counter points to
abbcbcaabdeffabd
you can have that abb is a whole instruction. In that case ab couldn't be a valid instruction, otherwise the CPU wouldn't know where to stop; at the same time abbc can't be an instruction either, or it would create the same problem. Continuing on, you can have for example that ca is the next instruction, while c couldn't be, and neither could cbc.
You can extend this argument to the whole string. You will see that, if the CPU finds itself in a state in which the next byte of the binary is the FIRST byte of an instruction, and no instruction is a prefix of another instruction, then in the next state the program counter will again point to the FIRST byte of the next, correct, instruction.
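A tiny sketch of that constraint (the candidate instruction sets are invented for illustration): checking that no instruction is a prefix of another is a simple pairwise test.

def is_prefix_free(instructions):
    """True if no instruction is a proper prefix of another:
    the condition required here for unambiguous decoding."""
    return not any(a != b and b.startswith(a)
                   for a in instructions for b in instructions)

print(is_prefix_free({"abb", "ca", "bd", "eff"}))        # True  -> decodable
print(is_prefix_free({"ab", "abb", "ca", "bd", "eff"}))  # False -> ambiguous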
If you open a binary in a hex editor, copy a portion of data and paste it into a disassembler, you will very probably not copy a complete instruction. Let's use your example: on Windows XP 32-bit SP3 English, if I assemble 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
I'll get:
Hex dump Command
8BEC mov ebp,esp
56 push esi
8BF4 mov esi,esp
68 00704000 push 407000
FF15 BC824000 call dword ptr ds:[4082bc]
As you can see, it assembled completely differently than in the other guys' answers below...
Now let's pretend that instead of assembling
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
you added a C0 opcode at the beginning:
C0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
Now check below what a single byte did in our instructions:
Hex dump Command
C08B EC568BF4 68 ror byte ptr ds:[ebx+f48b56ec],68
0070 40 add byte ptr ds:[eax+40],dh
00FF add bh,bh
15 BC824000 adc eax,4082bc
As you can see it was completely modified!
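If you want to reproduce this yourself, a disassembler library makes it easy. The sketch below uses Capstone (my choice for illustration, not something this answer relies on; the 0x401000 base address is arbitrary) and appends the trailing 00 so the final call has its complete 4-byte address, as in the listings above.

# pip install capstone
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

code = bytes.fromhex("8BEC568BF46800704000FF15BC824000")   # question's bytes + trailing 00
md = Cs(CS_ARCH_X86, CS_MODE_32)

for label, buf in (("original", code), ("with a leading C0", b"\xC0" + code)):
    print(f"--- {label} ---")
    for insn in md.disasm(buf, 0x401000):
        print(f"{insn.bytes.hex().upper():<22} {insn.mnemonic} {insn.op_str}")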
The processor knows where to start and which arguments the instruction takes from the opcode. In the first example our first opcode was 8B, so the processor knows it is followed by another byte. If this byte is EC, the instruction ends there and it means mov ebp, esp.
In our second example it starts with C0, which can also be followed by another byte, meaning a different instruction. Then C08B is the instruction and EC568BF4 68 is the argument.
Imagine that inside the processor there is a huge (but nanoscale) circuit that behaves like a chain of ifs (conditions): depending on the value of the hex code (or opcode), it knows "where to go" and "how to behave".
The sequence you listed shows exactly one number. In binary, it's 100010111110110001010110100010111111010001101000000000000111000001000000000000001111111100010101101111001000001001000000. In decimal, it's 726522768938664460674442126658667072. These are all just different ways of writing exactly the same value. A particular architecture's ISA will divide the bits into fields and assign them meaning. Most processors have easy-to-get manuals that describe the meaning assigned to each of the bits in those fields.
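As a small illustration of "dividing the bits into fields" (field layout and register numbering as in the Intel manuals; this snippet only handles the one case shown): the second byte of your sequence, EC, is the ModRM byte of the 8B (mov r32, r/m32) opcode, and splitting it into its 2/3/3-bit fields names the two registers.

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

modrm = 0xEC                  # second byte of the question's sequence
mod = (modrm >> 6) & 0b11     # 0b11 -> register-direct operand
reg = (modrm >> 3) & 0b111    # 0b101 -> ebp (destination of mov r32, r/m32)
rm  =  modrm       & 0b111    # 0b100 -> esp

print(f"mod={mod:02b} reg={REG32[reg]} rm={REG32[rm]}")
# mod=11 reg=ebp rm=esp  =>  8B EC decodes to "mov ebp, esp"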
Don't confuse linearly trying to disassemble with the execution order of the code. The binary is decoded in execution order, starting from a known location. Other than intentional hacks for various reasons, there is no ambiguity.
Try writing a disassembler for a variable-word-length instruction set. At the end of the day it has to be done in execution order, and even then you can only disassemble some of the program, since some branches are based on addresses computed at run time. Modern compiler-generated code is much better than older hand-assembled code. In an old standup arcade game, for example, there are conditional branches preceded by an instruction that guarantees that only one of the conditions is met (why was that in there? we will never know), and the data that follows the conditional branch resembles an opcode in such a way that you run into a collision with other opcodes.
Old DOS programs trying to defeat disassemblers would use self-modifying code: one instruction would modify another instruction one or two instructions ahead. If single-stepped, that modification would happen, but if run at full speed the instruction was already fetched into the pipeline and the modified/broken one in memory was not used. Pretty neat trick.
Anyway, the answer to your question is do not look at the bytes in linear order, look at them in execution order starting at the addresses defined by the reset and other vectors in the vector table.
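A very rough sketch of "decode in execution order": decode_one below is a hypothetical single-instruction decoder I'm assuming exists, and a real tool would also have to give up on branch targets computed at run time, as described above.

def disassemble_in_execution_order(memory, entry_points, decode_one):
    """Follow the code from known start addresses (reset/interrupt vectors,
    program entry point) instead of scanning the bytes linearly.
    decode_one(memory, addr) is assumed to return
    (length, mnemonic, branch_targets, falls_through)."""
    seen, worklist, listing = set(), list(entry_points), {}
    while worklist:
        addr = worklist.pop()
        while addr not in seen and addr < len(memory):
            seen.add(addr)
            length, mnemonic, targets, falls_through = decode_one(memory, addr)
            listing[addr] = mnemonic
            worklist.extend(targets)     # queue branch/call destinations
            if not falls_through:        # unconditional jmp/ret ends this path
                break
            addr += length
    return listing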
It sounds like the answer to your question is the somewhat flippant "Know your starting point", but maybe you want something a little more verbose. Given your string:
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
AND a starting point (Let's say the 8B is your starting point) there is only one possible interpretation of the bytes.
So let's say one operation is 8B EC 56 8B (depending on your operation length, etc.), then the NEXT operation is F4 68... In this case, it's impossible for the machine to try to interpret an operation 56 8B F4 68 because no operation ended at just that point.
Now, if your start point was the 56, then you can get that group but cannot get either of the ones mentioned previously.
The layout of your memory is very specific and start points/jump points are exact and unforgiving--they are required as surely as the code itself.
There might also be some clues elsewhere about what is a valid starting address. There is always a reset vector address, and there are usually interrupt vector addresses, which all must be valid start points for blocks of code. More usefully, if you come across a jump or call instruction elsewhere which references an address in the block you are trying to decode, then that gives you another start address.
I think I see your worry, and as far as I know it's correct: if the program counter gets off by one and that causes invalid or unintended instructions to be executed, the program probably crashes. True, and the same happens if you run into a data block and try to execute that. At least the latter can be avoided by using a Harvard architecture, where code and data are in separate memory spaces and may have different bit widths.
Maybe you will find it interesting to think about the other direction: how would you have to design your code to make it easy for others to segment? You could require the most significant bit of the byte starting a sequence to be zero, and those in the middle of a sequence to be one, like UTF-8 does it. Then if you start from a random position – assuming you know where the bytes are – it is easy to find the next sequence. Going one step further, how would you code a pure bit stream such that the start of a sequence is easy to find? How many bits would be wasted by such a coding?
Since you asked about the maths, I think the relevant topics are “Coding Theory”, “Variable-length codes” or “Prefix codes”.
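Here is a minimal sketch of the UTF-8-style idea above (a made-up encoding, not any real ISA): reserve the most significant bit of every byte, clear it on the first byte of an instruction and set it on continuation bytes. A reader dropped at a random byte offset resynchronizes by scanning forward for the next byte whose top bit is clear, at the cost of one "wasted" bit per byte.

def encode(instructions):
    """instructions: list of byte strings whose bytes fit in 7 bits.
    The first byte of each keeps MSB = 0, the rest get MSB = 1."""
    out = bytearray()
    for ins in instructions:
        out.append(ins[0] & 0x7F)                  # start byte: MSB clear
        out.extend(b | 0x80 for b in ins[1:])      # continuation: MSB set
    return bytes(out)

def next_start(stream, pos):
    """From an arbitrary position, find the next instruction boundary."""
    while pos < len(stream) and stream[pos] & 0x80:
        pos += 1
    return pos

blob = encode([b"\x0B\x2C", b"\x16", b"\x0B\x34\x68"])
print(next_start(blob, 1))    # 2 -> the single-byte instruction after the first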
How do you find a gene in a sequence of base pairs?