How are text editors generally implemented?
This question is probably going to make me sound pretty clueless. That's because I am.
I'm just thinking, if I were hypothetically interested in designing my own text editor GUI control, widget, or whatever you want to call it (which I'm not), how would I even do it?
The temptation to a novice such as myself would be to store the content of the text editor in the form of a string, which seems quite costly (not that I'm too familiar with how string implementations differ between one language/platform and the next; but I know that in .NET, for example, they're immutable, so frequent manipulation such as what you'd need to support in a text editor would be magnificently wasteful, constructing one string instance after another in very rapid succession).
Presumably some mutable data structure containing text is used instead; but figuring out what this structure might look like strikes me as a bit of a challenge. Random access would be good (I would think, anyway—after all, don't you want the user to be able to jump around to anywhere in the text?), but then I wonder about the cost of, say, navigating to somewhere in the middle of a huge document and starting to type immediately. Again, the novice approach (say you store t开发者_C百科he text as a resizeable array of characters) would lead to very poor performance, I'm thinking, as with every character typed by the user there would be a huge amount of data to "shift" over.
So if I had to make a guess, I'd suppose that text editors employ some sort of structure that breaks the text down into smaller pieces (lines, maybe?), which individually comprise character arrays with random access, and which are themselves randomly accessible as discrete chunks. Even that seems like it must be a rather monstrous oversimplification, though, if it is even remotely close to begin with.
Of course I also realize that there may not be a "standard" way that text editors are implemented; maybe it varies dramatically from one editor to another. But I figured, since it's clearly a problem that's been tackled many, many times, perhaps a relatively common approach has surfaced over the years.
Anyway, I'm just interested to know if anyone out there has some knowledge on this topic. Like I said, I'm definitely not looking to write my own text editor; I'm just curious.
One technique that's common (especially in older editors) is called a split buffer. Basically, you "break" the text into everything before the cursor and everything after the cursor. Everything before goes at the beginning of the buffer. Everything after goes at the end of the buffer.
When the user types in text, it goes into the empty space in between without moving any data. When the user moves the cursor, you move the appropriate amount of text from one side of the "break" to the other. Typically there's a lot of moving around a single area, so you're usually only moving small amounts of text at a time. The biggest exception is if you have a "go to line xxx" kind of capability.
Charles Crowley has written a much more complete discussion of the topic. You might also want to look at The Craft of Text Editing, which covers split buffers (and other possibilities) in much greater depth.
A while back, I wrote my own text editor in Tcl (actually, I stole the code from somewhere and extended it beyond recognition, ah the wonders of open source).
As you mentioned, doing string operations on very, very large strings can be expensive. So the editor splits the text into smaller strings at each newline ("\n" or "\r" or "\r\n"). So all I'm left with is editing small strings at line level and doing list operations when moving between lines.
The other advantage of this is that it is a simple and natural concept to work with. My mind already considers each line of text to be separate reinforced by years of programming where newlines are stylistically or syntactically significant.
It also helps that the use case for my text editor is as a programmers editor. For example, I implemented syntax hilighting but not word/line wrap. So in my case there is a 1:1 map between newlines in text and the lines drawn on screen.
In case you want to have a look, here's the source code for my editor: http://wiki.tcl.tk/16056
It's not a toy BTW. I use it daily as my standard console text editor unless the file is too large to fit in RAM (Seriously, what text file is? Even novels, which are typically 4 to 5 MB, fit in RAM. I've only seen log files grow to hundreds of MB).
Depending on the amount of text that needs to be in the editor at one time, a one string for the entire buffer approach would probably be fine. I think Notepad does this-- ever notice how much slower it gets to insert text in a large file?
Having one string per line in a hash table seems like a good compromise. It would make navigation to a particular line and delete/paste efficient without much complexity.
If you want to implement an undo function, you'll want a representation that allows you to go back to previous versions without storing 30 copies of the entire file for 30 changes, although again that would probably be fine if the file was sufficiently small.
The simplest way would be to use some kind of string buffer class provided by the language. Even a simple array of char objects would do at a pinch.
Appending, replacing and seeking text are then relatively quick. Other operations are potentially more time-consuming, of course, with insertion of a sequence of characters at the start of the buffer being one of the more expensive actions.
However, this may be perfectly acceptable performance-wise for a simple use case.
If the cost of insertions and deletions is particularly significant, I'd be tempted to optimise by creating a buffer wrapper class that internally maintained a list of buffer objects. Any action (except simple replacement) that didn't occur at the tail of an existing buffer would result in the buffer concerned being split at the relevant point, so the buffer could be modified at its tail. However, the outer wrapper would maintain the same interface as a simple buffer, so that I didn't have to rewrite e.g. my search action.
Of course, this simple approach would quickly end up with an extremely fragmented buffer, and I'd consider having some kind of rule to coalesce the buffers when appropriate, or to defer splitting a buffer in the case of e.g. a single character insertion. Maybe the rule would be that I'd only ever have at most 2 internal buffers, and I'd coalesce them before creating a new one – or when something asked me for a view of the whole buffer at once. Not sure.
Point is, I'd start simple but access the mutable buffer through a carefully chosen interface, and play with the internal implementation if profiling showed me I needed to.
However, I definitely wouldn't start with immutable String objects!
精彩评论