Is there any faster way to parse than by walking each byte?
Is there any faster way to parse a text than by walking each byte of the text?
I wonder if there is any special CPU (x86/x64) instruction for string operations, used by the string library, that could somehow be used to optimize the parsing routine.
For example, an instruction for finding a token in a string that could be run by the hardware instead of looping over each byte until a token is found.
*edited -> note: I'm asking more about algorithms than about CPU architecture, so my real question is: is there any special algorithm or specific technique that could optimize the string manipulation routine, given the current CPU architecture?
The x86 had a few string instructions, but they fell out of favor on modern processors, because they became slower than more primitive instructions which do the same thing.
The processor world is moving more and more towards RISC, i.e., simpler instruction sets.
Quote from Wikipedia (emphasis mine):
The first highly (or tightly) pipelined x86 implementations, the 486 designs from Intel, AMD, Cyrix, and IBM, supported every instruction that their predecessors did, but achieved maximum efficiency only on a fairly simple x86 subset that resembled only a little more than a typical RISC instruction set (i.e. without typical RISC load-store limitations).
This is still true on today's x86 processors.
You could get marginally better performance by processing four bytes at a time, assuming each "token" in the text were four-byte-aligned. Obviously this isn't true for most text... so it's better to stick with byte-by-byte scanning.
Yes, there are special CPU instructions; and the run-time library, which implements functions like strchr, might be written in assembly.
One technique that can be faster than walking bytes is to walk double-words, i.e. to process data 32 bits at a time.
The problem with walking bigger-than-the-smallest-addressable-memory-unit chunks, in the context of strings, is one of alignment.
You add code at the beginning and end of your function (before and after your loop) to handle the uneven/unaligned byte[s]. (Shrug.) It makes your code faster: not simpler. The following, for example, is some source code which claims to be an improved version of strchr. It uses special CPU instructions, but it's not simple (and has extra code for the unaligned bytes):
PATCH: Optimize strchr/strrchr with SSE4.2 -- "This patch adds SSE4.2 optimized strchr/strrchr. It can speed up strchr/strrchr by up to 2X on Intel Core i7"
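To make the idea concrete without SSE, here is a minimal sketch of a word-at-a-time strchr using the classic "does this word contain a given byte" bit trick. The name strchr_word is made up for illustration, and a real library version (like the patched one above) handles many more details:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch only: assumes 32-bit words; handles the unaligned prefix
       byte by byte, then scans aligned words, then finishes byte by byte. */
    static const char *strchr_word(const char *s, char c)
    {
        const uint32_t ones  = 0x01010101u;
        const uint32_t highs = 0x80808080u;
        const uint32_t mask  = (uint8_t)c * ones;   /* c repeated in every byte */

        /* Unaligned prefix: plain byte walk until we reach a word boundary. */
        while ((uintptr_t)s % sizeof(uint32_t) != 0) {
            if (*s == c)    return s;
            if (*s == '\0') return NULL;
            s++;
        }

        /* Aligned middle: a word w contains a zero byte iff
           (w - 0x01010101) & ~w & 0x80808080 is nonzero. XORing with
           mask turns "contains byte c" into "contains a zero byte". */
        for (;;) {
            uint32_t w;
            memcpy(&w, s, sizeof w);
            uint32_t has_nul = (w - ones) & ~w & highs;
            uint32_t x = w ^ mask;
            uint32_t has_c = (x - ones) & ~x & highs;
            if (has_nul | has_c)
                break;                      /* the hit is somewhere in this word */
            s += sizeof(uint32_t);
        }

        /* Pin down the exact byte. */
        while (*s != '\0') {
            if (*s == c) return s;
            s++;
        }
        return (c == '\0') ? s : NULL;
    }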
While (some) processors do have string instructions, they're of little use in producing faster code. First of all, as @zildjohn01 noted, they're often slower than other instructions with current processors. More importantly, it rarely makes much difference anyway -- if you're scanning much text at all, the bottleneck will usually be the bandwidth from the memory to the CPU, so essentially nothing you do with changing instructions is likely to make a significant difference in any case.
That said, especially if the token you're looking for is long, a better algorithm may be useful. A Boyer-Moore search (or variant) can avoid looking at some of the text, which can give a substantial improvement.
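As a rough sketch, here is the Boyer-Moore-Horspool variant (the simplified "bad character" form). The function name horspool_search is made up; the point is that on a mismatch the pattern can be shifted by several positions, so for long patterns many text bytes are never examined at all:

    #include <limits.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch only: returns a pointer to the first match of pat in text,
       or NULL if there is none. */
    static const char *horspool_search(const char *text, size_t tlen,
                                       const char *pat, size_t plen)
    {
        size_t shift[UCHAR_MAX + 1];
        size_t i;

        if (plen == 0)   return text;
        if (plen > tlen) return NULL;

        /* A byte that does not occur in the pattern lets us skip plen bytes. */
        for (i = 0; i <= UCHAR_MAX; i++)
            shift[i] = plen;
        /* Bytes that do occur (except the last one) allow a smaller skip. */
        for (i = 0; i + 1 < plen; i++)
            shift[(unsigned char)pat[i]] = plen - 1 - i;

        i = 0;
        while (i + plen <= tlen) {
            if (memcmp(text + i, pat, plen) == 0)
                return text + i;
            /* Shift by the skip for the text byte under the pattern's last position. */
            i += shift[(unsigned char)text[i + plen - 1]];
        }
        return NULL;
    }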
Well, you have to look at everything to know everything about the text, at some level. Arguably, you could have some sort of structured text, which gives you further information about where to walk at each point in a sort of n-ary space partition. But then, the "parsing" has partially been done back when the file was created. Starting from 0 information, you will need to touch every byte to know everything.
Yes. As your edit specifically asks for algorithms, I'd like to add an example of that.
You're going to know the language that you are parsing, and you can use that knowledge when building a parser. Take for instance a language in which every token must be at least two characters, but any length of whitespace can occur between tokens. Now when you're scanning through whitespace for the next token, you can skip every second character. Only when you hit the first non-whitespace do you need to back up one character.
Example:
01234567890123456789
FOO        BAR
When scanning for the next token, you'd probe 4,6,8,10 and 12. Only when you see the A do you look back to 11 to find B. You never looked at 3,5,7 and 9.
This is a special case of the Boyer–Moore string search algorithm.
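A sketch of that scanner, assuming (as above) that tokens are at least two characters long and are separated only by whitespace. next_token_start is a made-up name, and pos is the index just past the previous token:

    #include <ctype.h>
    #include <stddef.h>

    /* Sketch only: returns the index where the next token starts,
       or len if no token follows. */
    static size_t next_token_start(const char *s, size_t len, size_t pos)
    {
        /* If a token starts at position p, then p+1 is also non-whitespace,
           so it is enough to probe pos+1, pos+3, pos+5, ... */
        size_t probe = pos + 1;
        while (probe < len && isspace((unsigned char)s[probe]))
            probe += 2;
        if (probe >= len)
            return len;
        /* We are inside the token, possibly on its second character;
           back up one step if the previous character is non-whitespace. */
        if (!isspace((unsigned char)s[probe - 1]))
            return probe - 1;
        return probe;
    }

With the text above, calling next_token_start(s, 14, 3) probes positions 4, 6, 8, 10 and 12, then backs up to 11, exactly as described.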
Even though this question is long gone, I have another answer for you.
The fastest way to parse is to not parse at all. Huh?!
By that I mean, statistically, most source code text, and most code and data generated from that source code, do not change from compile to compile.
Therefore you can use a technique called incremental compilation. The first time you compile source, you divide it into coarse-grained chunks, for example at the boundaries of global declarations and function bodies. You have to persist the chunks' source, or their signatures or checksums, plus the information about the boundaries, what they compiled to, etc.
The next time you have to recompile the same source, given the same environment, you can gallop over the source code looking for changes. One easy way is to compare (longword at a time) the current code to the persisted snapshot of the code you saved from last time. As long as the longwords match, you skip through the source; instead of parsing and compiling, you reuse the snapshotted compilation results for that section from last time.
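A rough sketch of that galloping comparison, assuming the current source and last build's snapshot are both in memory. first_difference is a made-up name; a real incremental compiler would map the returned offset back to chunk boundaries:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch only: compare longword at a time and return the offset of the
       first byte that differs (or the shorter length if one buffer is a
       prefix of the other). */
    static size_t first_difference(const char *cur, size_t cur_len,
                                   const char *old, size_t old_len)
    {
        size_t len = cur_len < old_len ? cur_len : old_len;
        size_t i = 0;

        /* Gallop over the common prefix eight bytes at a time. */
        while (i + sizeof(uint64_t) <= len) {
            uint64_t a, b;
            memcpy(&a, cur + i, sizeof a);
            memcpy(&b, old + i, sizeof b);
            if (a != b)
                break;              /* the change is somewhere in this word */
            i += sizeof(uint64_t);
        }
        /* Pin down the exact byte within the last word (or the tail). */
        while (i < len && cur[i] == old[i])
            i++;
        return i;
    }

Chunks that lie entirely before the first difference can simply reuse the snapshotted compilation results from last time.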
As seen in "C#'88", QuickC 2.0, and VC++ 4.0 Incremental Recompilation.
Happy hacking!
Is there any faster way to parse a text than by walking each byte of the text?
Yes, sometimes you can skip over data. The Boyer-Moore string search algorithm is based on this idea.
Of course, if the result of the parse operation somehow needs to contain all the information in the text, then there is no way around the fact that you have to read everything. If you want to avoid CPU load, you could build hardware which processes data with direct memory access, I guess.