Speed-up and best practices: Using ets for per-module pre-computed data

Posted 2023-03-09 15:03

(Please forgive me for asking more than one question in a single thread; I think they are related.)

Hello, I would like to know what best practices exist in Erlang regarding per-module precompiled data.

Example: I have a module that relies heavily on very complex regular expressions that are known a priori. The documentation for re:compile/2 says: “Compiling once and executing many times is far more efficient than compiling each time one wants to match”. Since re's mp() datatype is not specified, and so cannot be embedded at compile time if you want a target-independent beam file, one has to compile the regex at runtime. (Note: re:compile/2 is only an example. Any complex function worth memoizing would fit my question.)

An Erlang module can have an -on_load(F/A) attribute, naming a function that is executed once when the module is loaded. I could therefore compile my regexes in this function and save the result in a new ets table named ?MODULE.

Updated after Dan's answer.

My questions are:

If I understand ets correctly, its data is stored in another process (unlike the process dictionary), and retrieving a value from an ets table is quite expensive. (Please correct me if I am wrong!) Should the content of the ets table be copied to the process dictionary for speed? (Remember: the data is never updated.)
Are there any considerable drawbacks to putting all the data into the ets table or process dictionary as one record (instead of many table items)?

Working example:

-module(memoization).
-export([is_ipv4/1, fillCacheLoop/0]).
-record(?MODULE, { re_ipv4 = re_ipv4() }).
-on_load(fillCache/0).

fillCacheLoop() ->
    receive
        { replace, NewData, Callback, Ref } ->
            true = ets:insert(?MODULE, [{ data, {self(), NewData} }]),
            Callback ! { on_load, Ref, ok },
            ?MODULE:fillCacheLoop();
        purge ->
            ok
    end
.
fillCache() ->
    Callback = self(),
    Ref = make_ref(),
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() ->
        case catch ets:lookup(?MODULE, data) of
            [{data, {TableOwner,_} }] ->
                TableOwner ! { replace, #?MODULE{}, self(), Ref },
                receive
                    { on_load, Ref, Result } ->
                        Callback ! { on_load, Ref, Result }
                end,
                ok;
            _ ->
                ?MODULE = ets:new(?MODULE, [named_table, {read_concurrency,true}]),
                true = ets:insert_new(?MODULE, [{ data, {self(), #?MODULE{}} }]),
                Callback ! { on_load, Ref, ok },
                fillCacheLoop()
        end
    end),
    receive
        { on_load, Ref, Result } ->
            unlink(Pid),
            Result;
        { 'EXIT', Pid, Result } ->
            Result
    after 1000 ->
        error
    end
.

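is_ipv4(Addr) ->
    %% Cache the record in the process dictionary on first use, so later
    %% calls in this process skip the ETS lookup and its copy.
    Data = case get({?MODULE, data}) of
        undefined ->
            [{data, {_,Result} }] = ets:lookup(?MODULE, data),
            put({?MODULE, data}, Result),
            Result;
        SomeDatum -> SomeDatum
    end,
    re:run(Addr, Data#?MODULE.re_ipv4)
.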

re_ipv4() ->
    {ok, Result} = re:compile("^0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])$"),
    Result
.

You have another option. You can precompute the regular expression's compiled form and refer to it directly. One way to do this is to use a module designed specifically for this purpose such as ct_expand: http://dukesoferl.blogspot.com/2009/08/metaprogramming-with-ctexpand.html

You can also roll your own by generating a module on the fly with a function to return this value as a constant (taking advantage of the constant pool): http://erlang.org/pipermail/erlang-questions/2011-January/056007.html
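As a rough illustration of the roll-your-own approach, the sketch below builds a tiny module at runtime whose single function returns the term as a literal. The names const_gen, store/2 and the generated value/0 are hypothetical, chosen only for this example. It also assumes the term can be written as a literal: erl_parse:abstract/1 cannot represent pids, ports, references or funs, so whether a given mp() value qualifies may depend on the OTP release.

```erlang
-module(const_gen).
-export([store/2]).

%% Compile and load a module Mod whose only function, Mod:value/0,
%% returns Term as a literal held in the module's constant pool.
%% Assumption: Term is representable as a literal.
store(Mod, Term) when is_atom(Mod) ->
    Forms = [{attribute, 1, module, Mod},
             {attribute, 1, export, [{value, 0}]},
             {function, 1, value, 0,
              [{clause, 1, [], [], [erl_parse:abstract(Term)]}]}],
    {ok, Mod, Bin} = compile:forms(Forms, [return_errors]),
    %% Drop any old version left over from a previous store/2 call.
    code:purge(Mod),
    {module, Mod} = code:load_binary(Mod, atom_to_list(Mod) ++ ".erl", Bin),
    ok.
```

After const_gen:store(my_const, Term), any process can call my_const:value() and gets the term without it being copied onto its heap.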

~~Or you could even run re:compile in a shell and copy and paste the result into your code. Crude but effective.~~ This wouldn't be portable in case the implementation changes.

To be clear: all of these take advantage of the constant pool to avoid recomputing every time. But of course, this is added complexity and it has a cost.

Coming back to your original question: the problem with the process dictionary is that, well, it can only be used by its own process. Are you certain this module's functions will only be called by the same process? Even ETS tables are tied to the process that creates them (ETS is not itself implemented using processes and message passing, though) and will die if that process dies.
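A minimal sketch of that limitation, using a hypothetical module pdict_demo: a value stored with put/2 in one process is simply not there when another process calls get/1.

```erlang
-module(pdict_demo).
-export([demo/0]).

%% Returns {ValueSeenByCaller, ValueSeenByOtherProcess}.
demo() ->
    put(cached_re, some_compiled_value),
    Parent = self(),
    %% The spawned process has its own, empty process dictionary.
    spawn(fun() -> Parent ! {seen, get(cached_re)} end),
    receive
        {seen, Other} -> {get(cached_re), Other}
    end.
```

Calling pdict_demo:demo() returns {some_compiled_value, undefined}, so every process that calls the module would have to fill its own cache.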

ETS isn't implemented in a process and doesn't have its data in a separate process heap, but it does have its data in a separate area outside of all processes. This means that when reading from or writing to ETS tables, data must be copied to or from the calling process. How costly this is depends, of course, on the amount of data being copied. This is one reason why we have functions like ets:match_object and ets:select, which allow more complex selection rules before data is copied.
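To make the copying point concrete, here is a small sketch with a hypothetical module ets_select_demo and made-up table contents. It contrasts ets:lookup/2, which copies the whole stored tuple, with ets:lookup_element/3 and ets:select/2, which copy only the part that is asked for.

```erlang
-module(ets_select_demo).
-export([demo/0]).

demo() ->
    T = ets:new(demo, [set]),
    %% Each row carries a small value and a large, unrelated payload.
    true = ets:insert(T, [{ipv4, <<"pattern-1">>, lists:seq(1, 1000)},
                          {ipv6, <<"pattern-2">>, lists:seq(1, 1000)}]),
    %% lookup/2 copies the whole tuple, including the 1000-element list.
    [{ipv4, _, _}] = ets:lookup(T, ipv4),
    %% lookup_element/3 copies only element 2 of the matching tuple.
    P = ets:lookup_element(T, ipv4, 2),
    %% select/2 copies only what the match spec result names ('$1').
    [P] = ets:select(T, [{{ipv4, '$1', '_'}, [], ['$1']}]),
    true = ets:delete(T),
    P.
```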

One benefit of keeping your data in an ETS table is that it can be reached by all processes not just the process which owns the table. This can make it more efficient than keeping your data in a server. It also depends on what type of operations you want to do on your data. ETS is just a data store and provides limited atomicity. In your case that is probably no problem.

You should definitely keep your data in separate records, one for each compiled regular expression, as it will greatly increase the access speed. You can then fetch exactly the re you are after; otherwise you will get them all and then have to search for the one you want. That rather defeats the point of putting them in ETS.
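One way this could look, as a sketch only: one row per regular expression, keyed by a name. The module re_table, the keys digits and word, and the two patterns are invented for the example.

```erlang
-module(re_table).
-export([init/0, match/2]).

%% Create the table and insert one {Name, CompiledRe} row per pattern.
%% The calling process becomes the owner of the table.
init() ->
    ?MODULE = ets:new(?MODULE, [named_table, set, protected,
                                {read_concurrency, true}]),
    Patterns = [{digits, "^[0-9]+$"}, {word, "^[a-z]+$"}],
    lists:foreach(fun({Name, Source}) ->
                      {ok, MP} = re:compile(Source),
                      true = ets:insert(?MODULE, {Name, MP})
                  end, Patterns),
    ok.

%% Fetch only the requested compiled pattern, then run it.
match(Name, Subject) ->
    MP = ets:lookup_element(?MODULE, Name, 2),
    re:run(Subject, MP, [{capture, none}]).
```

With this layout re_table:match(digits, "123") copies a single compiled pattern out of the table instead of a record holding all of them.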

While you can build ETS tables in on_load functions, it is not a good idea. This is because an ETS table is owned by a process and is deleted when that process dies, and you never really know in which process the on_load function is called. You should also avoid doing things in on_load which can take a long time, as the module is not considered to be loaded until the function has completed.
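A common alternative, sketched here under assumptions rather than taken from the answer, is to let a long-lived process own the table, for instance a gen_server that would normally sit under a supervisor. The module name re_cache_owner, the table name re_cache and the single pattern are hypothetical.

```erlang
-module(re_cache_owner).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% The table is created in the server process, so it lives exactly as
%% long as the server does; readers access it directly by name.
init([]) ->
    Tab = ets:new(re_cache, [named_table, protected,
                             {read_concurrency, true}]),
    {ok, MP} = re:compile("^[0-9]+$"),
    true = ets:insert(Tab, {digits, MP}),
    {ok, Tab}.

%% The server only owns the table; it handles no real requests.
handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
```

If the server crashes and is restarted, init/1 rebuilds the table, which is cheap here because the data never changes.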

Generating a parse transform to statically insert the result of compiling your re's directly into your code is a cool idea, especially if your re's are really that statically defined. As is the idea of dynamically generating, compiling and loading a module into your system. Again if your data is that static you could generate this module at compile time.

mochiglobal implements this by compiling a new module to store your constant(s). The advantage here is that the memory is shared across processes, whereas in ets it is copied on every read and in the process dictionary it is local to that one process.

https://github.com/mochi/mochiweb/blob/master/src/mochiglobal.erl
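On OTP 21.2 and later, the persistent_term module offers a similar trade-off without generating a module: terms are stored once and read without copying, while updates are expensive. This is not what mochiglobal does internally, only a comparable option; the module re_pt and the key {re_pt, digits} below are hypothetical names for the sketch.

```erlang
-module(re_pt).
-export([init/0, is_digits/1]).

%% Compile once and publish the result for all processes.
%% Meant to run once at startup, since replacing a persistent term
%% is costly for the whole node.
init() ->
    {ok, MP} = re:compile("^[0-9]+$"),
    persistent_term:put({?MODULE, digits}, MP).

%% Reading does not copy the stored term onto the caller's heap.
is_digits(Subject) ->
    MP = persistent_term:get({?MODULE, digits}),
    re:run(Subject, MP, [{capture, none}]) =:= match.
```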
