
Improve performance of matrix/table aggregation and search

There is a product-feature matrix with thousands of rows (products) and hundreds of columns (features). The values are binary: they show whether a product has a given feature or not. So it could be a table of 40,000 rows and 900 columns.

Product-feature matrix:

pr  f1  f2  f3  ...  fn
01   0   1   1        1
02   0   0   0        0
03   1   0   1        0
04   1   0   1        0
...

First, I have to find the products that have a given set of features Q, e.g. Q=(f1=1, f5=1, f27=1). Simply put: find blue, hatchback, 3-door cars.

Result 1

Given Q=(f1=1, f5=1, f27=1)

Relevant products: 03, 04, 08...

Second, and most important, I have to find how many of the products that have the feature set Q also have a feature f_i (for each i = 1..n). In other words, we select the rows that satisfy Q and count the 1s in each column (a SUM aggregation). E.g. how many blue, hatchback, 3-door cars also have a diesel engine, a gasoline engine, or xenon lights.

Result 2

Given Q=(f1=1, f5=1, f27=1)

sum f2 = 943

sum f3 = 543

sum f4 = 7

sum f6 = 432

....

Of course it is possible to solve this task with an RDBMS, but it's not very efficient: in the general case it requires a full scan, both for finding the products and for aggregating each column. At least I don't know how to build an effective B-tree index for this task. An Oracle bitmap index could help, but I can't use Oracle.

Currently we are using MySQL for this task, but it's not showing good results. To reduce the number of columns we actually use an integer representation: features are grouped and stored as integers in columns rather than as boolean values.
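To make that representation concrete, here is a minimal sketch of the packing idea as I understand it (my own illustration, not our actual schema): one int column holds up to 32 binary features, and a query becomes a bit-mask test.

public class PackedFeatures {
    public static void main(String[] args) {
        // One int column packs 32 binary features; feature i lives at bit i.
        int packed = (1 << 1) | (1 << 2) | (1 << 5); // a product with f1, f2, f5
        // A query Q = (f1=1, f5=1) becomes a bit mask; the product matches
        // when every bit of the mask is present.
        int q = (1 << 1) | (1 << 5);
        System.out.println((packed & q) == q); // prints true
    }
}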

It's possible to treat this matrix as a sparse binary matrix, and it's not a big problem to store it completely in memory. I'm wondering whether algorithms for sparse matrices and vector spaces (SVD, matrix-vector multiplication, and so on) could be applied. But those would probably help in finding the products that satisfy the vector Q, not in the aggregation. The problem lies more in the time of the aggregation than in space.

Probably it's also possible to store the matrix as a multi-linked list, which would help both to find products and to aggregate each column.
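As an illustration of that column-wise idea, here is a minimal sketch (my own, with made-up data, not a tested solution): each column becomes a sorted array of the product ids that have the feature, the search for Q is an intersection of the Q columns, and each per-column sum in Result 2 is the size of one more intersection.

import java.util.Arrays;

public class SparseColumnSketch {
    // Each feature column stored sparsely: a sorted array of the product ids
    // that have the feature (the column's "posting list").
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }

    public static void main(String[] args) {
        int[] f1 = {3, 4, 8, 15};     // products that have f1
        int[] f5 = {3, 4, 9, 15};     // products that have f5
        int[] q = intersect(f1, f5);  // products matching Q = (f1=1, f5=1)
        System.out.println(Arrays.toString(q)); // [3, 4, 15]
        // Result 2: "sum f_i" is then simply intersect(q, fi).length for each column i.
    }
}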

Finally, the question is how to approach this task. What is the most efficient algorithm to find the products with a given set of features and then count how many of them have each additional feature (aggregate by column)?


You could arrange your data by column, i.e. have one BitSet per column, listing the cars/rows which have that feature.

This allows you to take advantage of the fast AND/OR operations BitSet provides.

With this layout you should be able to get under two microseconds both for the search and for the count of each feature.

For a 40,000 * 900 dataset, the following prints

average search time 1396 ns.
average count time 1234 ns.

This should be a few orders of magnitude faster than you can get with an RDBMS. Even with one million rows, each operation takes less than 50 microseconds.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ProductFeatureSearch {
    public static void main(String... args) {
        final int rows = 40 * 1000;
        final int cols = 900;

        // One BitSet per feature column; bit j is set when product j has that feature.
        List<BitSet> features = new ArrayList<BitSet>();
        features.add(new BitSet(rows)); // columns 0 and 1 stay empty:
        features.add(new BitSet(rows)); // j % 0 is undefined and j % 1 is always 0.
        for (int i = 2; i < cols; i++) {
            final BitSet bs = new BitSet(rows);
            for (int j = 0; j < rows; j++)
                bs.set(j, j % i == 0); // synthetic data: product j has feature i iff i divides j
            features.add(bs);
        }

        // perform the search, Q = (f2=1, f5=1, f7=1)
        int[] factors = new int[]{2, 5, 7};
        BitSet matches = new BitSet();
        long runs = 1000 * 1000;
        {
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                // perform lookup.
                lookup(matches, features, factors);
            }
            long avgTime = (System.nanoTime() - start) / runs;
            System.out.println("average search time " + avgTime + " ns.");
        }
        {
            long start = System.nanoTime();
            int count9 = 0;
            BitSet scratch = new BitSet(rows); // sized for rows (it holds row matches), not cols
            for (int i = 0; i < runs; i++) {
                final int index = 9;
                final BitSet feature = features.get(index);
                count9 = countMatches(scratch, matches, feature);
            }
            long avgTime = (System.nanoTime() - start) / runs;
            System.out.println("average count time " + avgTime + " ns.");
        }
    }

    // Count how many of the matching products also have the given feature.
    private static int countMatches(BitSet scratch, BitSet matches, BitSet feature) {
        scratch.clear(); // recycle the scratch set to avoid allocating per call
        scratch.or(matches);
        scratch.and(feature);
        return scratch.cardinality();
    }

    // AND together the requested feature columns to find the products matching Q.
    private static void lookup(BitSet matches, List<BitSet> data, int[] factors) {
        matches.clear();
        matches.or(data.get(factors[0]));
        for (int i = 1, factorsLength = factors.length; i < factorsLength; i++)
            matches.and(data.get(factors[i]));
    }
}
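Note that the benchmark above counts a single column (index 9), while the question's Result 2 needs that count for every column. That is just the same AND/cardinality step in a loop. A minimal sketch reusing the countMatches helper above (countAll is my name, not part of the original answer), to be added alongside the other methods:

    // Sketch: compute "sum f_i" for every column by intersecting the match
    // set with each feature column and taking the cardinality.
    private static int[] countAll(BitSet matches, List<BitSet> features) {
        int[] counts = new int[features.size()];
        BitSet scratch = new BitSet();
        for (int i = 0; i < features.size(); i++)
            counts[i] = countMatches(scratch, matches, features.get(i));
        return counts;
    }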


Please have a look at this example I did a while ago. It follows what jaydee correctly outlined, but in more detail and against 125 million poster_category rows (think car_features), with a runtime of 0.02 seconds. Your worst case of 40K+ rows * 900 columns = 36 million rows is small by comparison!

Rewriting mysql select to reduce time and writing tmp to disk

select count(*) from category
count(*)
========
500,000


select count(*) from poster
count(*)
========
1,000,000

select count(*) from poster_category
count(*)
========
125,675,688

select count(*) from poster_category where cat_id = 623
count(*)
========
342,820

explain
select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

id  select_type table   type    possible_keys   key     key_len ref                         rows
==  =========== =====   ====    =============   ===     ======= ===                         ====
1   SIMPLE      c       const   PRIMARY         PRIMARY 3       const                       1   
1   SIMPLE      p       index   PRIMARY         name    257     null                        32  
1   SIMPLE      pc      eq_ref  PRIMARY         PRIMARY 7       const,foo_db.p.poster_id    1   

select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

0.021: Query OK


If I understand your current solution, you have one table with a row for each product and a column for each feature. This is quite an inefficient way to solve the problem.

How about three tables:

"products" (product_ref, product_name), indexed on product_ref (a list of products)

"features" (feature_ref, description), indexed on feature_ref (a list of possible features)

and

"productfeatures" (product_ref, feature_ref), indexed on (product_ref, feature_ref) and on (feature_ref, product_ref) (each row represents one feature of one product)

You can then perform a join between the tables:

select *
from products t1
join productfeatures t2 on t1.product_ref = t2.product_ref
join productfeatures t3 on t1.product_ref = t3.product_ref
where t2.feature_ref = 45
  and t3.feature_ref = 67;

and so on, with one additional join per feature in Q.
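The join above covers the first requirement (finding the matching products). For the second one (Result 2), a grouped self-join over the same schema should work. A sketch of the idea, not part of the original answer: among the products having features 45 and 67, count how many also have each other feature.

select t4.feature_ref, count(*) as product_count
from productfeatures t2
join productfeatures t3 on t3.product_ref = t2.product_ref
join productfeatures t4 on t4.product_ref = t2.product_ref
where t2.feature_ref = 45
  and t3.feature_ref = 67
group by t4.feature_ref;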
