Erlang word counter
I'm attempting to make a multi-threaded(?) program in Erlang that:
- Reads in a large file (600 MB)
- Sends messages containing the lines read from the file to a group of spawned processes
- The spawned processes process the words and store them in a hashtree
- Each spawned process then sends its hashtree back to the master
- The master prints the tree.
So far, I think I have the first three done... I can't figure out how to test my tree by printing out each key-hash pair each time one is inserted.
Master thread:
-module(lineread).
-export([read/1]).

read(Wordlength) ->
    {ok, Input} = file:open("/home/ml/datasets/tweets/first60kTweets.txt", [read]),
    Threads = makeThreads(Wordlength),
    read_lines(Input, Threads).

read_lines(Input, Threads) ->
    case file:read_line(Input) of
        eof ->
            file:close(Input);
        {ok, Line} ->
            send_to_all(Threads, Line),
            read_lines(Input, Threads)
    end.

send_to_all(Threads, Line) ->
    lists:foreach(fun(Pid) ->
                      Pid ! {line, Line}
                  end, Threads).

makeThreads(NumThreads) ->
    [spawn(counter, run, [N]) || N <- lists:seq(1, NumThreads)].
Other thread:
-module(counter).
-export([run/1]).

%% entry point for the code
run(K) ->
    loop(K, gb_trees:empty()).

%% loops over received messages
loop(Size, Tree) ->
    receive
        {line, Line} ->
            %%io:format("~p~n", [Line]),
            Splits = re:split(Line, " "),
            NewTree = build_tree(Splits, Tree),
            loop(Size, NewTree);
        {char, Char} ->
            io:format("~p", [Char]),
            loop(Size, Tree)
    end.
%% puts the data into a Tree...
build_tree([], Tree) ->
    Tree;
build_tree([W|R], Tree) ->
    case gb_trees:is_defined(W, Tree) of
        true ->
            I = gb_trees:get(W, Tree),
            NewTree = gb_trees:update(W, I + 1, Tree),
            %% print the key and its new count on every update
            io:format("~p: ~p~n", [W, I + 1]),
            build_tree(R, NewTree);
        false ->
            NewTree = gb_trees:insert(W, 1, Tree),
            %% print the key the first time it is inserted
            io:format("~p: 1~n", [W]),
            build_tree(R, NewTree)
    end.
Thanks for your help.
I see a lot of things wrong with your program, but to answer your question: you need to learn how to use receive so that your processes can talk to each other. I'd recommend reading the first few chapters here - specifically this one. That being said, here are my comments about the code:
- It would be helpful to know what the end goal is. The code doesn't make a lot of sense, particularly because Size is never used, so all the processes behave identically.
- It's unclear why you are spawning anything at all. Spawning W processes and sending each line to all of them is a trap - it will end up being much slower than not spawning, because of the copying overhead (the text has to be copied into each process's heap when the message is sent). You will end up sending something like W × 600 MB × 2 between them. Yikes.
- You may want a dict or an orddict instead of a gb_tree. See dict:update_counter/3, and see this for more details. Using an ets table may also make sense, but again, I'm not sure what the goal is.
- string:tokens/2 should be used instead of re:split/2 in this case - you don't need the overhead of re for simple whitespace splitting.
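The dict:update_counter/3 approach suggested above can be sketched as a simple fold (the module and function names here are mine, for illustration only):

```erlang
-module(wordcount_dict).
-export([count/1]).

%% Fold the word list into a dict, bumping each word's counter.
%% dict:update_counter/3 creates the key with the increment value
%% when it is absent, so no is_defined/insert dance is needed.
count(Words) ->
    lists:foldl(fun(W, D) -> dict:update_counter(W, 1, D) end,
                dict:new(), Words).
```

Compared to the gb_trees version, there is no separate lookup-then-update step per word.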
I'm not sure if I'm following correctly, but I guess you are trying to get a word count.
You can add a new message clause to your loop that looks like this (note that io:format/2 takes its arguments in a list, and that NewTree is not bound in this clause, so it has to be Tree):

{print} ->
    io:format("~p~n", [gb_trees:to_list(Tree)]),
    io:format("Words: ~p~n", [gb_trees:size(Tree)]),
    loop(Size, Tree);
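More generally, the receive-based request/reply pattern the master needs in order to collect the trees back could look roughly like this; the {get_tree, From} and {tree, Tree} message shapes are made up for illustration, and each worker would need a matching receive clause:

```erlang
-module(collect).
-export([gather/1]).

%% Ask every worker for its tree, then wait for one reply per worker.
%% Each worker is assumed to handle {get_tree, From} by sending
%% {tree, Tree} back to From -- these message names are illustrative.
gather(Pids) ->
    [Pid ! {get_tree, self()} || Pid <- Pids],
    [receive {tree, Tree} -> Tree end || _ <- Pids].
```

The master would then print each returned tree, e.g. with gb_trees:to_list/1.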
Also, I do not understand why you send each line to all processes. Is this just for practice, or are you trying to build something that counts words fast? It won't.
A few suggestions: you can also use string:tokens/2 to split words. http://www.erlang.org/doc/man/string.html
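The difference between the two splitting functions shows up on a line with repeated spaces: string:tokens/2 drops empty fields, while re:split/2 keeps them. A tiny demo (the module name is mine):

```erlang
-module(split_demo).
-export([compare/1]).

%% Return both splits of the same line for comparison.
%% string:tokens/2 yields strings and skips empty tokens;
%% re:split/2 yields binaries and keeps empty fields as <<>>.
compare(Line) ->
    {string:tokens(Line, " "), re:split(Line, " ")}.
```

For "a  b" (two spaces), the first element is ["a","b"] while the second is [<<"a">>,<<>>,<<"b">>], so the gb_tree in the question ends up with spurious empty-binary keys.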
I would use the word process instead of thread, since Erlang claims to create lightweight processes rather than threads: http://www.erlang.org/doc/efficiency_guide/processes.html
The biggest question is: what is the purpose of this exercise? Is it real work, or only some sort of learning? If it is meant to be a real product, you are not starting well. Your first question should be where the data will come from, and then you should try to identify where the bottleneck will be.

In your case you are talking about 600 MB. That is not so much data; you can keep it all in memory. If you have to read it from some storage, your solution will depend strongly on that storage, because from the CPU's point of view 600 MB of words is about 60M words if you assume an average word size of 10 bytes, or 100M words if you assume an average of 6 bytes. How fast can you count 100M words? If you use a k/v solution based on something really fast, you can do it in about 1 s on one commodity CPU core. Sounds crazy? No, it is easier than it looks at first sight. If you base your solution on something like a Judy array or your own handcrafted trie implementation, you can achieve a result like this relatively easily, if you are not scared of coding in C.

So how fast can you feed this beast from the data storage? With commodity disks you can read the data in about 6 s. You can tune it to do better, but reading in a time comparable to the processing will be a challenge. Why, then, are you trying to make it parallel?

The above is only true if you really have to read the data from storage. It is not true, for example, when you are reading data that was stored by some other process just before. In that case the data will very likely still be in caches and everything behaves very differently: you can access the data randomly, whereas the reading timings above can only be achieved by sequential reads from classical disks. SSDs behave differently, but their overall throughput is not an order of magnitude better. So only if you can feed your CPUs fast enough is the effort of parallelizing worth it; then the biggest issue will be how to feed and partition the data.
For inspiration you can look at Tim Bray's Wide Finder project. But keep in mind that the first Wide Finder was not measured with flushed IO caches, so it was a case where the data could be accessed randomly from caches. Even in this case you can use Erlang, but don't forget that one of its strengths is interfacing with other languages, especially C, and here I will point you at NIFs.
If the purpose is only practicing Erlang, my question is: why trees? Why not ets:update_counter/3? If you think ets is not fast enough, try the process dictionary instead of gb_trees. It is not as dirty when you use it in a process under your own control. Consider reading the whole file into memory, but work with a binary. Use the binary module intensively, and so on.
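The ets:update_counter route mentioned above could be sketched like this, using the 4-argument ets:update_counter with a default object (available since OTP 18; the module name is mine):

```erlang
-module(wordcount_ets).
-export([count/1]).

%% Insert-or-increment each word in one call: the {2, 1} op bumps
%% the counter stored in tuple position 2, and {W, 0} is inserted
%% first whenever the word is not yet in the table.
count(Words) ->
    Tab = ets:new(words, [set]),
    lists:foreach(
      fun(W) -> ets:update_counter(Tab, W, {2, 1}, {W, 0}) end,
      Words),
    ets:tab2list(Tab).
```

Unlike gb_trees or dict, the ets table is mutable and lives outside the process heap, so there is no new tree value to thread through the loop.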