What is the Fastest Way to Check for a Keyword in a List of Keywords in Delphi?
I have a small list of keywords. What I'd really like to do is akin to:
case MyKeyword of
'CHIL': (code for CHIL);
'HUSB': (code for HUSB);
'WIFE': (code for WIFE);
'SEX': (code for SEX);
else (code for everything else);
end;
Unfortunately the CASE statement can't be used like that for strings.
I could use the straight IF THEN ELSE IF construct, e.g.:
if MyKeyword = 'CHIL' then (code for CHIL)
else if MyKeyword = 'HUSB' then (code for HUSB)
else if MyKeyword = 'WIFE' then (code for WIFE)
else if MyKeyword = 'SEX' then (code for SEX)
else (code for everything else);
but I've heard this is relatively inefficient.
What I had been doing instead is:
P := pos(' ' + MyKeyword + ' ', ' CHIL HUSB WIFE SEX ');
case P of
1: (code for CHIL);
6: (code for HUSB);
11: (code for WIFE);
17: (code for SEX);
else (code for everything else);
end;
This, of course is not the best programming style, but it works fine for me and up to now didn't make a difference.
So what is the best way to rewrite this in Delphi so that it is both simple, understandable but also fast?
(For reference, I am using Delphi 2009 with Unicode strings.)
Followup:
Toby recommended I simply use the If Then Else construct. Looking back at my examples that used a CASE statement, I can see how that is a viable answer. Unfortunately, my inclusion of the CASE inadvertently hid my real question.
I actually don't care which keyword it is. That is just a bonus if the particular method can identify it like the POS method can. What I need is to know whether or not the keyword is in the set of keywords.
So really I want to know if there is anything better than:
if pos(' ' + MyKeyword + ' ', ' CHIL HUSB WIFE SEX ') > 0 then
The If Then Else equivalent does not seem better in this case being:
if (MyKeyword = 'CHIL') or (MyKeyword = 'HUSB') or (MyKeyword = 'WIFE')
or (MyKeyword = 'SEX') then
In Barry's comment to Kornel's question, he mentions the TDictionary Generic. I've not yet picked up on the new Generic collections and it looks like I should delve into them. My question here would be whether they are built for efficiency and how would using TDictionary compare in looks and开发者_如何学JAVA in speed to the above two lines?
In later profiling, I have found that the concatenation of strings as in: (' ' + MyKeyword + ' ') is VERY expensive time-wise and should be avoided whenever possible. Almost any other solution is better than doing this.
type TKeyword = ( ECHIL, EHUSB, EWIFE, ESEX )
const TKeyNames : array[TKeyword] of string = ( 'CHIL', 'HUSB', 'WIFE', 'SEX' );
Key : TKeyword
case Key of
ECHIL : (code for CHIL);
EHUSB : (code for HUSB);
EWIFE : (code for WIFE);
ESEX : (code for SEX);
else (code for everything else);
end;
In general, don't use strings as "keys", use enumerations -- they're safer, and you get much of a speed increase.
Unfortunately Delphi (as far as I know) has no standard hashtable implementation that'd be easy to use, you can always roll up your own however.
BTW, code for SEX
sounds much more fun than "will code for beer" :P
You can use a const table (which must be alpha-sorted) and a quick binary sort. It's very efficient, and doesn't require any hashing.
Here is the function to use:
function IsKeyWord(const KeyWords: array of string; const aToken: String): Boolean;
// aToken must be already uppercase
var First, Last, I, Compare: Integer;
begin
First := Low(Keywords);
Last := High(Keywords);
Result := False;
while First <= Last do
begin
I := (First + Last) shr 1;
Compare := CompareStr(Keywords[i],aToken);
if Compare = 0 then
begin
Result := True;
break;
end
else
if Compare < 0 then
First := I + 1 else
Last := I - 1;
end;
end;
And here, some examples of keywords:
const
PASCALKEYWORDS: array[0..100] of string =
('ABSOLUTE', 'ABSTRACT', 'AND', 'ARRAY', 'AS', 'ASM', 'ASSEMBLER',
'AUTOMATED', 'BEGIN', 'CASE', 'CDECL', 'CLASS', 'CONST', 'CONSTRUCTOR',
'DEFAULT', 'DESTRUCTOR', 'DISPID', 'DISPINTERFACE', 'DIV', 'DO',
'DOWNTO', 'DYNAMIC', 'ELSE', 'END', 'EXCEPT', 'EXPORT', 'EXPORTS',
'EXTERNAL', 'FAR', 'FILE', 'FINALIZATION', 'FINALLY', 'FOR', 'FORWARD',
'FUNCTION', 'GOTO', 'IF', 'IMPLEMENTATION', 'IN', 'INDEX', 'INHERITED',
'INITIALIZATION', 'INLINE', 'INTERFACE', 'IS', 'LABEL', 'LIBRARY',
'MESSAGE', 'MOD', 'NAME', 'NEAR', 'NIL', 'NODEFAULT', 'NOT', 'OBJECT',
'OF', 'OR', 'OUT', 'OVERRIDE', 'PACKED', 'PASCAL', 'PRIVATE', 'PROCEDURE',
'PROGRAM', 'PROPERTY', 'PROTECTED', 'PUBLIC', 'PUBLISHED', 'RAISE',
'READ', 'READONLY', 'RECORD', 'REGISTER', 'REINTRODUCE', 'REPEAT', 'RESIDENT',
'RESOURCESTRING', 'SAFECALL', 'SET', 'SHL', 'SHR', 'STDCALL', 'STORED',
'STRING', 'STRINGRESOURCE', 'THEN', 'THREADVAR', 'TO', 'TRY', 'TYPE',
'UNIT', 'UNTIL', 'USES', 'VAR', 'VARIANT', 'VIRTUAL', 'WHILE', 'WITH', 'WRITE',
'WRITEONLY', 'XOR');
DFMKEYWORDS: array[0..4] of string = (
'END', 'FALSE', 'ITEM', 'OBJECT', 'TRUE');
CKEYWORDS: array[0..47] of string = (
'ASM', 'AUTO', 'BREAK', 'CASE', 'CATCH', 'CHAR', 'CLASS', 'CONST', 'CONTINUE',
'DEFAULT', 'DELETE', 'DO', 'DOUBLE', 'ELSE', 'ENUM', 'EXTERN', 'FLOAT', 'FOR',
'FRIEND', 'GOTO', 'IF', 'INLINE', 'INT', 'LONG', 'NEW', 'OPERATOR', 'PRIVATE',
'PROTECTED', 'PUBLIC', 'REGISTER', 'RETURN', 'SHORT', 'SIGNED', 'SIZEOF',
'STATIC', 'STRUCT', 'SWITCH', 'TEMPLATE', 'THIS', 'THROW', 'TRY', 'TYPEDEF',
'UNION', 'UNSIGNED', 'VIRTUAL', 'VOID', 'VOLATILE', 'WHILE');
CSHARPKEYWORDS : array[0..86] of string = (
'ABSTRACT', 'AS', 'BASE', 'BOOL', 'BREAK', 'BY3', 'BYTE', 'CASE', 'CATCH', 'CHAR',
'CHECKED', 'CLASS', 'CONST', 'CONTINUE', 'DECIMAL', 'DEFAULT', 'DELEGATE', 'DESCENDING',
'DO', 'DOUBLE', 'ELSE', 'ENUM', 'EVENT', 'EXPLICIT', 'EXTERN', 'FALSE', 'FINALLY',
'FIXED', 'FLOAT', 'FOR', 'FOREACH', 'FROM', 'GOTO', 'GROUP', 'IF', 'IMPLICIT',
'IN', 'INT', 'INTERFACE', 'INTERNAL', 'INTO', 'IS', 'LOCK', 'LONG', 'NAMESPACE',
'NEW', 'NULL', 'OBJECT', 'OPERATOR', 'ORDERBY', 'OUT', 'OVERRIDE', 'PARAMS',
'PRIVATE', 'PROTECTED', 'PUBLIC', 'READONLY', 'REF', 'RETURN', 'SBYTE',
'SEALED', 'SELECT', 'SHORT', 'SIZEOF', 'STACKALLOC', 'STATIC', 'STRING',
'STRUCT', 'SWITCH', 'THIS', 'THROW', 'TRUE', 'TRY', 'TYPEOF', 'UINT', 'ULONG',
'UNCHECKED', 'UNSAFE', 'USHORT', 'USING', 'VAR', 'VIRTUAL', 'VOID', 'VOLATILE',
'WHERE', 'WHILE', 'YIELD');
And it's very easy to use:
if IsKeyWord(PASCALKEYWORDS,UpperCase('BEGIN')) then
....
You can easily modify the IsKeyWord() function to return the index of the token if you need it.
function KeyWordIndex(const KeyWords: array of string; const aToken: String): integer;
// aToken must be already uppercase
var First, Last, Compare: Integer;
begin
First := Low(Keywords);
Last := High(Keywords);
while First <= Last do
begin
result := (First + Last) shr 1;
Compare := CompareStr(Keywords[result],aToken);
if Compare = 0 then
exit else
if Compare < 0 then
First := result + 1 else
Last := result - 1;
end;
result := -1; // not found
end;
Mostly I use the IndexText function from StrUtils for that purpose. It is similar to your pos approach, but the return value is independent of the indiviual length of the strings. As a gimmick it is also case insensitive (use IndexStr if you don't want this).
function IndexText(const AText: string; const AValues: array of string): Integer; overload;
The comment to these functions actually mentions the case construct:
{ AnsiMatchText & AnsiIndexText provide case like function for dealing with strings }
Your series of if
statements to check whether the input is any of the given keywords could be shortened by checking single characters, to bail out as soon as possible. Your example
if MyKeyword = 'CHIL' then (code for CHIL)
else if MyKeyword = 'HUSB' then (code for HUSB)
else if MyKeyword = 'WIFE' then (code for WIFE)
else if MyKeyword = 'SEX' then (code for SEX)
else (code for everything else);
could be replaced by
KW := kwOther;
if MyKeyword <> '' then begin
case MyKeyword[1] of
'C': if MyKeyword = 'CHIL' then KW := kwCHIL;
'H': if MyKeyword = 'HUSB' then KW := kwHUSB;
'S': if MyKeyword = 'SEX' then KW := kwSEX;
'W': if MyKeyword = 'WIFE' then KW := kwWIFE;
end;
end;
case KW of
kwCHIL: {code for CHIL};
kwHUSB: {code for HUSB};
kwSEX: {code for SEX};
kwWIFE: {code for WIFE};
else
{code for everything else};
end;
For case-insensitive keywords the case
would check for upper and lower case, and the comparison would use AnsiCompareText()
. If you have several keywords with the same first letter you could nest these case
statements, but readability would probably suffer soon for very little speed gain.
Taking this to the maximum you could implement a state machine that uses a PChar
to calculate the next state, which would branch to the else
case as soon as the first non-matching character is found. It would be faster than any solution involving hashes.
I think the
if x then
else
is by far the best solution. Your own solution is very inelegant and for the miniscule improvement in efficiency its not worth it. The if then else construct is perfect for what you want.
For cleanest code, it's best to use case with enumerations, or if..else with strings, as others have suggested. There are a few solutions off the beaten track, though, if you really want to go there.
One is to use a string hash map, which is like a list "indexed" by strings. Values in the list would be procedure pointers to the code you wish to execute for each string. All the procedures would have to have the same exact parameters - and you'd have to write the hash map yourself, or find one you can use e.g. in JVCL.
// prepare the hash map
myHashMap[ 'CHIL' ] := @procToCallForChil;
myHashMap[ 'HUSB' ] := @procToCallForHusb;
// use the hash map
type
TPrototypeProc = procedure( your-parameters-here );
var
procToCall : TPrototypeProc;
begin
@procToCall := myHashMap[ 'CHIL' ];
procToCall( your-parameters-here );
end;
Two, and this is something weird I've tried out once: if and only if your identifier strings fit in 4 characters (as in all your examples), and they are ansi strings (not unicode strings), you can map strings to integers directly. A 32-bit integer is equivalent in size to a 4-byte string, so you can do:
function FourCharStringToInteger( const s : ansistring ) : integer;
begin
assert(( length( s ) = 4 ), 'String to convert must be exactly 4 characters long!');
move( s[1], result, 4 );
end;
Any string identifier that's shorter than 4 characters would have to be padded with spaces, and the strings are case-sensitive ('chil' and 'CHIL' yield different values). To use this with a case statement, you'd have to precompute the values, which may or may not be suitable for your purpose:
id_CHIL := FourCharStringToInteger( 'CHIL' );
And finally you can have your case statement:
case id of
id_CHIL : ...
id_HUSB : ...
end;
This is "special care" code, and might only make a difference if you have a hundreds or more of your identifier strings - it really shouldn't be done at all :) It is certainly easy to break. I agree though that case statements are neater than interminable processions of ...else... blocks.
disclaimer: answer is based on the updated problem statement, i.e. simply checking whether a string matches or not.
If you really want to go for that last bit of performance, some extra information about your data sets could help.
- how many keywords are we talking about? (what order of magnitude)
- is the set of keywords fixed?
- is there a lot of repetition in the input set? (i.e. the same X keywords often repeat)
- what is the expected hit/miss ratio like? Do you expect to match one keyword for every thousand input words, or do you expect almost every word to be found?
For example,
- for a small amount of keywords (let's say about 20, depending on the implementation), the overhead of hashing will become important.
- If If you get can get a perfect hashing scheme (see here for an example in C), you can get rid of any chaining or similar scheme, shaving off some important cycles. Then again, this would require both your keywords and input set to be known up front, which isn't very likely.
- if there is a lot of repetition in the keywords (and a large hash collection to match against), a small local cache of the last X words could help.
- if you expect a lot of blatant misses (or your hash algorithm is very inefficient ;P) a trie could be more efficient than a hash table.
Most of these are a bit extreme for common performance tuning tasks though. I'd probably profile standard "hashed set" implementations (delphi generics, jcl, etc.) first to see which one performs best on your problem set.
You could also switch to a more object-oriented approach and have something like
TCommand = class
public
procedure Execute; virtual; abstract;
end;
TCommandClass = class of TCommand;
and have a factory create the commands for you
TCommandFactory = class
public
procedure RegisterKeyWord (const Keyword : String; CmdClass : TCommandClass);
function CreateCommand (const Keyword : String) : TCommand;
end;
The calling code would then look as easy as:
CommandFactory.CreateCommand (Keyword).Execute;
That way you have localized and encapsulated all keyword strings in a simple factory class, which makes later changes to the input language very easy. Using this command-based approach has other advantages such as easy extensibility.
I know that this maybe interpreted as not an answer to your question, because it's not about how fast you can do it. But it's another approach that might be worth thinking about.
精彩评论