Fastest factorial implementation with 64-bit result in assembly

2023-01-06 16:26 问答作者：

This is not homework, just something I though of. So, straight computing factorial is not exactly fast; memoization can help, but if the result is to fit into 32 or 64 bits, then the factorial only can work for inputs 0 through 12 and 20 respectively. So ... we might as well use a lookup table:

n   n!
0   1       
1   1       
2   2       
3   6       
4   24      
5   120     
6   720     
7   5040        
8   40320       
9   362880      
10  3628800     
11  39916800        
12  479001600  开发者_如何转开发     
13  6227020800  2^32=   4294967296
14  87178291200     
15  1.30767E+12     
16  2.09228E+13     
17  3.55687E+14     
18  6.40237E+15     
19  1.21645E+17     
20  2.4329E+18      
        2^64=   1.84467E+19

So, suppose I want to have an inline C++ factorial function which uses inline assembly, with a 32 bit or 64 bit unsigned integer expected as a result. If the input is either negative or large enough to cause overflow, the output should be 0. How can this be done in assembly such that it consumes the least amount of cycles? This code will run on a 64-bit Intel/AMD architecture. If feasible, I am interested in improving the worst case scenario, so 20! should not take a lot longer to compute than 0! - hopefully there is a binary search approach. Hopefully there is a clever trick for doing if (n == 0 || n == 1) { return 1; }. Also, if the output needs to be 32 bit, then I think assembly instructions can contain both code and data in them. My assembly knowledge is weak. Please let me know if the question does not make a whole lot of sense.

Being able to use the function in C++ would be nice - makes it a more realistic problem. If, for instance, calling a function is expensive, then trying to save 1-2 clock cycles in the body of the assembly will not help much.

I have cleverly built a lookup table in assembly. Just in case you're rusty on your assembly, The routine expects the parameter to be in the ecx register. I verify that it is in range then read the lookup table's values into the eax and edx registers. If the value is out of range, I just xor the eax and edx registers with themselves (this forces them to 0). Unfortunately, since it's an assembly routine the compiler will be unable to inline the code. But, I'm sure the few cycles I saved by writing my awesome assembly routine will more than make up for any gain by inlining.

factorial:
    xorl    %eax, %eax
    xorl    %edx, %edx
    cmpl    $20, %ecx
    ja  .TOOBIG
    movl    CSWTCH.1(,%ecx,8), %eax
    movl    CSWTCH.1+4(,%ecx,8), %edx
.TOOBIG:

LOOKUP_TABLE:
    .section    .rodata
    .align 32
    .type   CSWTCH.1, @object
    .size   CSWTCH.1, 168
CSWTCH.1:
    .long   1
    .long   0
    .long   1
    .long   0
    .long   2
    .long   0
    .long   6
    .long   0
    .long   24
    .long   0
    .long   120
    .long   0
    .long   720
    .long   0
    .long   5040
    .long   0
    .long   40320
    .long   0
    .long   362880
    .long   0
    .long   3628800
    .long   0
    .long   39916800
    .long   0
    .long   479001600
    .long   0
    .long   1932053504
    .long   1
    .long   1278945280
    .long   20
    .long   2004310016
    .long   304
    .long   2004189184
    .long   4871
    .long   -288522240
    .long   82814
    .long   -898433024
    .long   1490668
    .long   109641728
    .long   28322707
    .long   -2102132736
    .long   566454140

The lookup table is difficult to maintain, so I've included the script I used to build it

static constexpr uint64_t const_factorial(uint32_t i) {
    return (i==0)? 1: (i * const_factorial(i-1));
}

uint64_t factorial(uint32_t i) {
    switch(i) {
        case 0: return const_factorial(0);
        case 1: return const_factorial(1);
        case 2: return const_factorial(2);
        case 3: return const_factorial(3);
        case 4: return const_factorial(4);
        case 5: return const_factorial(5);
        case 6: return const_factorial(6);
        case 7: return const_factorial(7);
        case 8: return const_factorial(8);
        case 9: return const_factorial(9);
        case 10: return const_factorial(10);
        case 11: return const_factorial(11);
        case 12: return const_factorial(12);
        case 13: return const_factorial(13);
        case 14: return const_factorial(14);
        case 15: return const_factorial(15);
        case 16: return const_factorial(16);
        case 17: return const_factorial(17);
        case 18: return const_factorial(18);
        case 19: return const_factorial(19);
        case 20: return const_factorial(20);
        default: return 0;
    }
}

Just in case you missed it in my poor attempt at humor. The C++ compiler is more than capable at correctly optimizing your code. As you can see I didn't need to do anything fancy with lookup tables, binary search trees, or hashes. Just a simple switch statement and the compiler did the rest.

It's been a while since I flexed my assembly muscles so I'll just offer some general advice.

Since you know in advance exactly how many and what size all the items will be, just make a contiguous array of values (either hard-coded or pre-calculated). After validating the input to the function (< 0 or > 12/20), you can then use a simple offset addressing to retrieve the appropriate value. This will work in O(1) time.

Update from the year 2021. Having C++17 at hand.

I guess that there is no faster way than the below. Assembler is not needed.

Because the number of factorials that will fit into an unsigned 64 bit value is very low (21), a compile time constexpr array will use mainly only 21*8 = 168 bytes.

168 bytes

That number is that low that we can build easily a compile time constexpr std::array and stop all further considerations.

Really everything can be done at compile time.

We will first define the default approach for calculation a factorial as a constexpr function:

constexpr unsigned long long factorial(unsigned long long n) noexcept {
    return n == 0ull ? 1 : n * factorial(n - 1ull);
}

With that, factorials can easily be calculated at compile time. Then, we fill a std::array with all factorials. We use also a constexpr and make it a template with a variadic parameter pack.

We use std::integer_sequence to create a factorials for indices 0,1,2,3,4,5, ....

That is straigtforward and not complicated:

template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
    return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
};

This function will be fed with an integer sequence 0,1,2,3,4,... and return a std::array<unsigned long long, ...> with the corresponding factorials.

We know that we can store maximum 21 values. And therefore we make a next function, that will call the above with the integer sequence 1,2,3,4,...,20,21, like so:

constexpr auto generateArray()noexcept {
    return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}

And now, finally,

constexpr auto Factorial = generateArray();

will give us a compile-time std::array<unsigned long long, 21> with the name Factorial containing all factorials. And if we need the i'th factorial, then we can simply write Factorial[i]. There will be no calculation at runtime.

I do not think that there is a faster way to calculate a factorial.

Please see the complete program below:

#include <iostream>
#include <array>
#include <utility>
// ----------------------------------------------------------------------
// All the below will be calculated at compile time
// constexpr factorial function
constexpr unsigned long long factorial(unsigned long long n) noexcept {
    return n == 0ull ? 1 : n * factorial(n - 1ull);
}
// We will automatically build an array of factorials at compile time
// Generate a std::array with n elements 
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
    return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
};
// Max index for factorials for an 64bit unsigned value 
constexpr size_t MaxIndexFor64BitValue = 21;

// Generate the required number of elements
constexpr auto generateArray()noexcept {
    return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
// This is an constexpr array of all factorials numbers
constexpr auto Factorial = generateArray();

// All the above was compile time
// ----------------------------------------------------------------------

// Test function
int main() {
    for (size_t i{}; i < MaxIndexFor64BitValue; ++i)
        std::cout << i << '\t' << Factorial[i] << '\n';
    return 0;
}

Developed, compiled and tested with Microsoft Visual Studio Community 2019, Version 16.8.2

Additionally compiled and tested with gcc 10.2 and clang 11.0.1

Language: C++17

Who says that your assembly version is going to be any faster than the C++ version anyway. In fact, who says it will even match in speed? I'd bet $100 you never even manage to make it as fast as the compiler does.

On popular demand, performancewise it's fabled to be a binary search, not a hashtable (std C++ doesn't have that I believe).

#include <map>

void main()
{
    std::map<int, BigIntThing> factMap;
    // insert all elements here, probably fancier ways to do this
    factMap.insert( 1 );
    factMap.insert( 1 );
    factMap.insert( 2 );
    // ....
    // to access, say 15!
    BigIntThing factMap[15]; // I think the index is right >_<
}

That's it. A std::map is ordered, so if your BigIntThing has a comparison operator all is good. There should be a way to get this const and/or static and/or global to get it compiled in the way you want.

If you're only working with numbers from 0-19, a hash table or binary tree is overkill. Just create an unsigned int[20] and then query the index:

const unsigned int FACTORIALS[20] = {1,1,2,6,24,120,etc..};

unsigned int factorial(unsigned int num) {
    if(num >= 0 && num <= 19) {
        return FACTORIALS[num];
    }
    else {
        throw // some sort of exception
    }
}

You could probably use templates to build the array too.

gcc's Answer

...which probably beat's yours, was compiled from:

uint64_t answers[] = {
    1ULL,
    1ULL,
    2ULL,
    6ULL,
    24ULL,
    ...
    2432902008176640000ULL,
};

uint64_t factorial(unsigned int i) {
    if(i >= sizeof(answers) / sizeof(*answers))
        return 0;
    else
        return answers[i];
}

...and the assembly...

factorial:
    cmpl    $20, %edi
    movl    $0, %eax
    ja  .L3
    movslq  %edi,%eax
    movq    answers(,%rax,8), %rax
.L3:
    rep
    ret
answers:
    .quad 1
    .quad 1
    ...

...which seems to be the first 64-bit assembler answer up...

继续阅读：assembly factorial inline-assembly x86-64

Fastest factorial implementation with 64-bit result in assembly

gcc's Answer

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

gcc's Answer

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？