Best R data structure to return table value counts

The following function returns a data.frame with two columns:

# con is an open DBI connection
fetch_count_by_day <- function(con) {
  q <- "SELECT t, count(*) AS count FROM data GROUP BY t"
  dbGetQuery(con, q)    # Returns a data frame
}

t is a DATE column, so output looks like:

       t       count
1 2011-09-22     1438
...

All I'm really interested in is whether any records already exist for a given date, but I will also use the count as a sanity check.

In C++ I'd return a std::map<std::string,int> or std::unordered_map<std::string,int> (*). In PHP I'd use an associative array with the date as the key.

What is the best data structure in R? Is it a 2-column data.frame? My first thought was to turn the t column into rownames:

...
d=dbGetQuery(con,q)
rownames(d)=d[,1]
d$t=NULL

But data.frame rownames are not unique, so conceptually it does not quite fit. I'm also not sure if it makes using it any quicker.

(Any and all definitions of "best": quickest, least memory, code clarity, least surprise for experienced R developers, etc. Maybe there is one solution for all; if not then I'd like to understand the trade-offs and when to choose each alternative.)

*: (for C++) If benchmarking showed this was a bottleneck, I might convert the datestamp to a YYYYMMDD integer and use std::unordered_map<int,int>; knowing the data only covers a few years I might even use a block of memory with one int per day between min(t) and max(t) (wrapping all that in a class).


Contingency tables are actually arrays (or matrices) and can be created very easily. The dimnames hold the values, and the array/matrix at its "core" holds the count data. The table() and tapply() functions are natural creators. You access the counts with "[", and use dimnames() followed by "[" to get the row and column names. I would say it is wiser to use the "Date" class for dates than to store them in "character" vectors.
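
As a rough sketch of the idea above (not necessarily the only or best way), here is how a table built from raw dates, or a named vector built from the already-aggregated data frame, could be used for the "does this date exist, and how many rows" lookup. The variable names and the sample dates and counts are made up for illustration; `d` merely stands in for what dbGetQuery() would return.

```r
# Hypothetical raw data: one Date per record (values invented for this sketch).
dates <- as.Date(c("2011-09-22", "2011-09-22", "2011-09-23", "2011-09-25"))

# table() counts occurrences. The result is a 1-d array: its dimnames hold
# the dates (as character strings) and its cells hold the counts.
tab <- table(dates)

# dimnames() followed by "[[" gives the row names, i.e. the dates present.
days <- dimnames(tab)[[1]]

# Existence check and count lookup for one date.
key <- format(as.Date("2011-09-22"))
exists_22 <- key %in% days
n_22 <- as.vector(tab[key])

# If the counts are already aggregated in SQL, a named vector gives the
# same kind of lookup. `d` is a stand-in for the query result.
d <- data.frame(
  t     = as.Date(c("2011-09-22", "2011-09-23", "2011-09-25")),
  count = c(1438L, 1502L, 977L)
)
counts <- setNames(d$count, format(d$t))

exists_24 <- "2011-09-24" %in% names(counts)
n_23 <- unname(counts["2011-09-23"])
```

Note that the names are character strings in both cases, so a Date has to go through format() (or as.character()) before it is used as a key. Indexing the named vector with a date that is not present returns NA rather than raising an error, which is why the explicit %in% check is shown.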
