
Improve performance of matrix/table aggregation and search

There is a product-feature matrix with thousands of rows (products) and hundreds of columns (features). The values are binary: they show whether a product has a given feature or not. So it could be a table of 40,000 rows and 900 columns.

Product-feature matrix:

pr  f1  f2  f3  ...  fn
01   0   1   1        1
02   0   0   0        0
03   1   0   1        0
04   1   0   1        0
...

First, I have to find the products that have a given set of features Q, e.g. Q=(f1=1, f5=1, f27=1). Simply put: find blue, hatchback, 3-door cars.

Result 1

Given Q=(f1=1, f5=1, f27=1)

Relevant products: 03, 04, 08...

Second, and most important, I have to find how many of the products that have the feature set Q also have a feature f_i (for each i = 1..n). In other words, we select the rows that satisfy Q and count the 1s in each column (a SUM aggregation). E.g. how many blue, hatchback, 3-door cars also have a diesel engine, a gasoline engine, or xenon lights.

Result 2

Given Q=(f1=1, f5=1, f27=1)

sum f2 = 943

sum f3 = 543

sum f4 = 7

sum f6 = 432

....

Of course it is possible to solve this task with an RDBMS, but it's not very efficient: in the general case it requires a full scan, both for finding the products and for aggregating each column. At least I don't know how to build an effective B-tree index for this task. An Oracle bitmap index could help, but I can't use Oracle.

Currently we are using MySQL for this task, but it's not showing good results. To reduce the number of columns we actually use an integer representation: features are grouped and stored as integers in columns rather than as boolean values.
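To make that representation concrete, here is a minimal sketch of the packing idea as I understand it (my own illustration, not our actual schema): one int column holds up to 32 binary features, and a query becomes a bit-mask test.

public class PackedFeatures {
    public static void main(String[] args) {
        // One int column packs 32 binary features; feature i lives at bit i.
        int packed = (1 << 1) | (1 << 2) | (1 << 5); // a product with f1, f2, f5
        // A query Q = (f1=1, f5=1) becomes a bit mask; the product matches
        // when every bit of the mask is present.
        int q = (1 << 1) | (1 << 5);
        System.out.println((packed & q) == q); // prints true
    }
}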

It's possible to treat this matrix as a sparse binary matrix, and it's not a big problem to store it completely in memory. I'm wondering whether algorithms for sparse matrices and vector spaces (SVD, matrix-vector multiplication, and so on) could be applied. But those would probably help in finding the products that satisfy the vector Q, not in the aggregation. The problem lies more in the time of the aggregation than in space.

Probably it's also possible to store the matrix as a multi-linked list, which would help both to find products and to aggregate each column.
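As an illustration of that column-wise idea, here is a minimal sketch (my own, with made-up data, not a tested solution): each column becomes a sorted array of the product ids that have the feature, the search for Q is an intersection of the Q columns, and each per-column sum in Result 2 is the size of one more intersection.

import java.util.Arrays;

public class SparseColumnSketch {
    // Each feature column stored sparsely: a sorted array of the product ids
    // that have the feature (the column's "posting list").
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }

    public static void main(String[] args) {
        int[] f1 = {3, 4, 8, 15};     // products that have f1
        int[] f5 = {3, 4, 9, 15};     // products that have f5
        int[] q = intersect(f1, f5);  // products matching Q = (f1=1, f5=1)
        System.out.println(Arrays.toString(q)); // [3, 4, 15]
        // Result 2: "sum f_i" is then simply intersect(q, fi).length for each column i.
    }
}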

Finally, the question is how to approach this task. What is the most efficient algorithm to find the products with a given set of features and then count how many of them have each additional feature (aggregate by column)?


You could arrange your data by column, i.e. have one BitSet per column, listing the cars/rows which have that feature.

This allows you to take advantage of the fast AND/OR operations BitSet provides.

With this layout you should be able to get under two microseconds both for the search and for the count of each feature.

For a 40,000 * 900 dataset, the following prints

average search time 1396 ns.
average count time 1234 ns.

This should be a few orders of magnitude faster than you can get with an RDBMS. Even with one million rows, each operation takes less than 50 microseconds.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ProductFeatureSearch {
    public static void main(String... args) {
        final int rows = 40 * 1000;
        final int cols = 900;

        // One BitSet per feature column; bit j is set when product j has that feature.
        List<BitSet> features = new ArrayList<BitSet>();
        features.add(new BitSet(rows)); // columns 0 and 1 stay empty:
        features.add(new BitSet(rows)); // j % 0 is undefined and j % 1 is always 0.
        for (int i = 2; i < cols; i++) {
            final BitSet bs = new BitSet(rows);
            for (int j = 0; j < rows; j++)
                bs.set(j, j % i == 0); // synthetic data: product j has feature i iff i divides j
            features.add(bs);
        }

        // perform the search, Q = (f2=1, f5=1, f7=1)
        int[] factors = new int[]{2, 5, 7};
        BitSet matches = new BitSet();
        long runs = 1000 * 1000;
        {
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                // perform lookup.
                lookup(matches, features, factors);
            }
            long avgTime = (System.nanoTime() - start) / runs;
            System.out.println("average search time " + avgTime + " ns.");
        }
        {
            long start = System.nanoTime();
            int count9 = 0;
            BitSet scratch = new BitSet(rows); // sized for rows (it holds row matches), not cols
            for (int i = 0; i < runs; i++) {
                final int index = 9;
                final BitSet feature = features.get(index);
                count9 = countMatches(scratch, matches, feature);
            }
            long avgTime = (System.nanoTime() - start) / runs;
            System.out.println("average count time " + avgTime + " ns.");
        }
    }

    // Count how many of the matching products also have the given feature.
    private static int countMatches(BitSet scratch, BitSet matches, BitSet feature) {
        scratch.clear(); // recycle the scratch set to avoid allocating per call
        scratch.or(matches);
        scratch.and(feature);
        return scratch.cardinality();
    }

    // AND together the requested feature columns to find the products matching Q.
    private static void lookup(BitSet matches, List<BitSet> data, int[] factors) {
        matches.clear();
        matches.or(data.get(factors[0]));
        for (int i = 1, factorsLength = factors.length; i < factorsLength; i++)
            matches.and(data.get(factors[i]));
    }
}
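Note that the benchmark above counts a single column (index 9), while the question's Result 2 needs that count for every column. That is just the same AND/cardinality step in a loop. A minimal sketch reusing the countMatches helper above (countAll is my name, not part of the original answer), to be added alongside the other methods:

    // Sketch: compute "sum f_i" for every column by intersecting the match
    // set with each feature column and taking the cardinality.
    private static int[] countAll(BitSet matches, List<BitSet> features) {
        int[] counts = new int[features.size()];
        BitSet scratch = new BitSet();
        for (int i = 0; i < features.size(); i++)
            counts[i] = countMatches(scratch, matches, features.get(i));
        return counts;
    }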


Please have a look at this example I did a while ago. It follows what jaydee correctly outlined, but in more detail and against 125 million poster_category rows (think car_features), with a runtime of 0.02 seconds. Your worst case of 40K+ rows * 900 columns = 36 million rows is small by comparison!

Rewriting mysql select to reduce time and writing tmp to disk

select count(*) from category
count(*)
========
500,000


select count(*) from poster
count(*)
========
1,000,000

select count(*) from poster_category
count(*)
========
125,675,688

select count(*) from poster_category where cat_id = 623
count(*)
========
342,820

explain
select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

id  select_type table   type    possible_keys   key     key_len ref                         rows
==  =========== =====   ====    =============   ===     ======= ===                         ====
1   SIMPLE      c       const   PRIMARY         PRIMARY 3       const                       1   
1   SIMPLE      p       index   PRIMARY         name    257     null                        32  
1   SIMPLE      pc      eq_ref  PRIMARY         PRIMARY 7       const,foo_db.p.poster_id    1   

select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

0.021: Query OK


If I understand your current solution, you have one table with a row for each product and a column for each feature. This is quite an inefficient way to solve the problem.

How about three tables:

"products" (product_ref, product_name), indexed on product_ref (a list of products)

"features" (feature_ref, description), indexed on feature_ref (a list of possible features)

and

"productfeatures" (product_ref, feature_ref), indexed on (product_ref, feature_ref) and on (feature_ref, product_ref) (each row represents one feature of one product)

You can then perform a join between the tables:

select *
from products t1
join productfeatures t2 on t1.product_ref = t2.product_ref
join productfeatures t3 on t1.product_ref = t3.product_ref
where t2.feature_ref = 45
  and t3.feature_ref = 67;

and so on, with one additional join per feature in Q.
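The join above covers the first requirement (finding the matching products). For the second one (Result 2), a grouped self-join over the same schema should work. A sketch of the idea, not part of the original answer: among the products having features 45 and 67, count how many also have each other feature.

select t4.feature_ref, count(*) as product_count
from productfeatures t2
join productfeatures t3 on t3.product_ref = t2.product_ref
join productfeatures t4 on t4.product_ref = t2.product_ref
where t2.feature_ref = 45
  and t3.feature_ref = 67
group by t4.feature_ref;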
