Best practice for function to handle 1-256 bytes
I have some functions that are designed to handle 1-256 bytes. They run on an embedded C platform where passing a byte is much faster and more compact than passing an int (one instruction versus three). Which of the following is the preferred way of coding them:
- Accept an int, early-exit if zero, and otherwise copy the LSB of the count value to an unsigned char and use that in a do {} while(--count); loop (a parameter value of 256 will get converted to 0, but will run 256 times)
- Accept an unsigned char, early-exit if zero, and have a special version of the function for 256 bytes (those cases will be known in advance).
- Accept an unsigned char, and run 256 times if it's zero.
- Have a function like the above, but call it via wrapper functions that behave as (0-255) and (256 only).
- Have a function like the above, but call it via wrapper macros that behave as (0-255) and (256 only).
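For concreteness, here is a rough sketch of the first option in the list. The name read_bytes and the plain source buffer standing in for the SPI flash are assumptions made for illustration only; a real routine would clock bytes through the SPI port inside the loop.

```c
#include <stdint.h>

/* Hypothetical stand-in for the flash read: copies n bytes (1..256)
   from src to dst. The int-sized parameter keeps the interface
   intuitive; the loop itself runs on an 8-bit counter. */
void read_bytes(uint8_t *dst, const uint8_t *src, unsigned int n)
{
    uint8_t count;

    if (n == 0)
        return;              /* nothing requested */

    count = (uint8_t)n;      /* low byte only: 256 becomes 0 */
    do {
        *dst++ = *src++;
    } while (--count);       /* a starting value of 0 wraps to 255, giving 256 passes */
}
```

Note that this sketch does not guard against values above 256: only the low byte is used, so 257 would copy a single byte.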
It is expected that the inner loop of the function will probably represent 15%-30% of processor execution time when the system is busy; it will sometimes be used for small numbers of bytes, and sometimes for large ones. The memory chip used by the function has a per-transaction overhead, and I prefer to have my memory-access function do the start-transaction/do-stuff/end-transaction sequence internally.
The most efficient code would be to simply accept an unsigned char and regard a parameter value of 0 as a request to do 256 bytes, relying on the caller to avoid any accidental attempts to read 0 bytes. That seems a bit dangerous, though. Have others dealt with such issues on embedded systems? How were they handled?
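One way to keep the zero-means-256 core while limiting the danger is the wrapper-function arrangement from the list above. The sketch below is only an illustration: read_core, read_upto_255 and read_256 are made-up names, and a memory-to-memory copy stands in for the actual flash transaction.

```c
#include <stdint.h>

/* Core routine: a count of 0 means 256 bytes.
   Intended to be reached only through the wrappers below. */
void read_core(uint8_t *dst, const uint8_t *src, uint8_t count)
{
    do {
        *dst++ = *src++;
    } while (--count);       /* 0 wraps to 255, so 0 yields 256 passes */
}

/* Behaves as 0..255: a zero count does nothing. */
void read_upto_255(uint8_t *dst, const uint8_t *src, uint8_t count)
{
    if (count != 0)
        read_core(dst, src, count);
}

/* Behaves as 256 only: the caller asks for a full 256 bytes explicitly. */
void read_256(uint8_t *dst, const uint8_t *src)
{
    read_core(dst, src, 0);
}
```

With this split, an accidental zero can only reach the core through read_256, where 256 bytes is what the caller asked for.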
EDIT The platform is a PIC18Fxx (128K code space; 3.5K RAM), connecting to an SPI flash chip; reading 256 bytes when fewer are expected would potentially overrun read buffers in the PIC. Writing 256 bytes instead of 0 would corrupt data in the flash chip. The PIC's SPI port is limited to one byte every 12 instruction times if one doesn't check busy status; it will be slower if one does. A typical write transaction requires sending 4 bytes in addition to the data to be received; a read requires an extra byte for "SPI turnaround" (the fastest way to access the SPI port is to read the last byte just before sending the next one).
The compiler is HiTech PICC-18std.
I've generally liked HiTech's PICC-16 compilers; HiTech seems to have diverted their energies away from the PICC-18std product toward their PICC-18pro line, which has even slower compilation times, seems to require the use of 3-byte 'const' pointers rather than two-byte pointers, and has its own ideas about memory allocation. Maybe I should look more at PICC-18pro, but when I tried compiling my project on an eval version of it, it didn't work and I didn't figure out exactly why (perhaps something about variable layout not agreeing with my asm routines), so I just kept using PICC-18std.
Incidentally, I just discovered that PICC-18 particularly likes do {} while(--bytevar); and particularly dislikes do {} while(--intvar); I wonder what's going through the compiler's "mind" when it generates the latter?
do { local_test++; --lpw; } while(lpw);

;newflashpic.c: 792: do
;newflashpic.c: 793: {
0144A8 2AD9   incf fsr2l,f,c
;newflashpic.c: 795: } while(--lpw);
0144AA 0E00   movlw low ?_var_test
0144AC 6EE9   movwf fsr0l,c
0144AE 0E01   movlw high ?_var_test
0144B0 6EEA   movwf fsr0h,c
0144B2 06EE   decf postinc0,f,c
0144B4 0E00   movlw 0
0144B6 5AED   subwfb postdec0,f,c
0144B8 50EE   movf postinc0,w,c
0144BA 10ED   iorwf postdec0,w,c
0144BC E1F5   bnz l242
The compiler loads a pointer to the variable, not even using the LFSR instruction (which would take two words) but a combination of MOVLW/MOVWF (taking four). Then it uses this pointer to do the decrement and compare. While I'll admit that do{}while(--wordvar); cannot yield as nice code as do{}while(wordvar--); the code is better than what the latter format actually generates. Doing a separate decrement and while-test (e.g. while (--lpw,lpw)) yields sensible code, but it seems a bit ugly. The post-decrement operator could yield the best code for a down-counting loop:
decf _lpw
btfss _STATUS,0 ; Skip next inst if carry (i.e. wasn't zero)
decf _lpw+1
bc loop ; Carry will be clear only if lpw was zero
but it instead generates worse code than --lpw. The best code would be for an up-counting loop:
infsnz _lpw
incfsz _lpw+1
bra loop
but the compiler doesn't generate that.
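For reference, the up-counting idea can at least be expressed in C by negating the count once and incrementing until the counter wraps to zero. This is only a sketch with a made-up function name, and whether PICC-18 would turn it into the infsnz/incfsz sequence is untested.

```c
#include <stdint.h>

/* Runs the loop body n times for n in 1..65535 by counting a negated
   copy of n up to zero. A value of 0 gives 65536 passes.
   Returns the number of passes so the behavior can be checked. */
unsigned long count_up(uint16_t n)
{
    uint16_t lpw = (uint16_t)(0u - n);   /* two's-complement negation of n */
    unsigned long passes = 0;

    do {
        passes++;                        /* loop body goes here */
    } while (++lpw);                     /* stops when lpw wraps to zero */

    return passes;
}
```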
EDIT 2 Another approach I might use: allocate a global 16-bit variable for the number of bytes, and write the functions so that the counter is always zeroed before exit. Then if only an 8-bit value is required, it would only be necessary to load 8 bits. I'd use macros for stuff so they could be tweaked for best efficiency. On the PIC, using |= on a variable which is known to be zero is never slower than using =, and is sometimes faster. For example, intvar |= 15 or intvar |= 0x300 would be two instructions (each case only has to bother with one byte of the result and can ignore the other); intvar |= 4 (or any power of 2) is one instruction. Obviously on some other processors, intvar = 0x300 would be faster than intvar |= 0x300; if I use a macro it could be tweaked as appropriate.
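A minimal sketch of that global-counter idea follows. The names xfer_count, SET_COUNT8, SET_COUNT16 and copy_counted are hypothetical, and the copy loop is a placeholder for the real transfer.

```c
#include <stdint.h>

/* Hypothetical global byte counter. Invariant: it is zero whenever
   no transfer is in progress. */
uint16_t xfer_count;

/* Both macros rely on xfer_count being zero beforehand, so OR acts as
   assignment. On another processor they could be redefined with '='. */
#define SET_COUNT8(n)   (xfer_count |= (uint8_t)(n))   /* only the low byte is touched */
#define SET_COUNT16(n)  (xfer_count |= (uint16_t)(n))

/* Copies xfer_count bytes and leaves the counter at zero on exit. */
void copy_counted(uint8_t *dst, const uint8_t *src)
{
    while (xfer_count) {
        *dst++ = *src++;
        --xfer_count;
    }
    /* xfer_count is zero here, which restores the invariant */
}
```

A caller would write SET_COUNT8(5) or SET_COUNT16(256) immediately before calling copy_counted.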
Your inner function should copy count + 1 bytes, e.g.,
do { /* copy one byte */ } while (count-- != 0);
If the post-decrement is slow, other alternatives are:
... /* copy one byte */
while (count != 0) { /* copy one byte */; count -= 1; }
or
for (;;) { /* copy one byte */; if (count == 0) break; count -= 1; }
The caller/wrapper can do:
if (count > 0 && count <= 256) inner((uint8_t)(count-1));
or
if (((unsigned)(count - 1)) < 256u) inner((uint8_t)(count-1));
if it's faster in your compiler.
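Putting the pieces together, a sketch might look like the following. The wrapper name copy_checked and its return value are my own additions for illustration, and the byte copy is a stand-in for whatever the real inner loop does.

```c
#include <stdint.h>

/* Copies count_minus_1 + 1 bytes, i.e. 1..256 bytes. */
void inner(uint8_t *dst, const uint8_t *src, uint8_t count_minus_1)
{
    do {
        *dst++ = *src++;                 /* copy one byte */
    } while (count_minus_1-- != 0);      /* test the old value, then decrement */
}

/* Hypothetical wrapper: accepts 1..256 and rejects everything else.
   Returns 1 if the copy was performed, 0 if the count was out of range. */
int copy_checked(uint8_t *dst, const uint8_t *src, unsigned int count)
{
    /* count == 0 wraps to a huge unsigned value, so one compare covers both bounds */
    if ((unsigned int)(count - 1u) < 256u) {
        inner(dst, src, (uint8_t)(count - 1u));
        return 1;
    }
    return 0;
}
```

The caller still passes the natural count; only the wrapper knows about the minus-one encoding.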
FWIW, I'd choose some variant of option #1. The function's interface remains sensible, intuitive, and seems less likely to be called incorrectly (you might want to think about what you want to do if a value larger than 256 is passed in - a debug-build-only assertion might be appropriate).
I don't think the minor 'hack'/micro-optimization to loop the correct number of times using an 8-bit counter would really be a maintenance problem, and it seems you've done considerable analysis to justify it.
I wouldn't argue against wrappers if someone preferred them, but I'd personally lean toward option 1 ever-so-slightly.
However, I would argue against having the public interface require the caller to pass in a value one less than they wanted to read.
If an int parameter costs 3 instructions and a char parameter costs 1, you could pass an extra char parameter for the extra 1 bit you're missing. It seems pretty silly that your (presumably 16-bit) int takes more than twice as many instructions as an 8-bit char.
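A sketch of that two-parameter idea is below, with made-up names; it assumes the caller sets count_hi only for a count of exactly 256, in which case count_lo must be 0.

```c
#include <stdint.h>

/* count_lo: low 8 bits of the byte count.
   count_hi: nonzero only when the count is 256 (count_lo must then be 0).
   Copies 0..256 bytes from src to dst. */
void read_split(uint8_t *dst, const uint8_t *src,
                uint8_t count_lo, uint8_t count_hi)
{
    if (count_lo == 0 && count_hi == 0)
        return;                  /* a genuine zero-length request */

    do {
        *dst++ = *src++;
    } while (--count_lo);        /* count_lo == 0 here means 256 passes */
}
```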