Linq2sql: efficient way to get random elements with weight?
But let's say I have an integer weight where, for example, an element with weight 10 has a 10 times higher probability of being selected than an element with weight 1.
var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderBy(e => /* ???? */ * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
How do I solve getting random records in LINQ when I need the result to be weighted?
I need something like Random Weighted Choice in T-SQL, but in LINQ, and not limited to getting only one record.
If I didn't have the weighted requirement, I'd use the NEWID approach below; can I adapt it somehow?
partial class DataContext
{
    [Function(Name = "NEWID", IsComposable = true)]
    public Guid Random()
    {
        throw new NotImplementedException();
    }
}
...
var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderBy(e => db.Random())
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
My first idea was the same as Ron Klein's - create a weighted list and select randomly from that.
Here's a LINQ extension method to create the weighted list from the normal list, given a lambda function that knows the weight property of the object.
Don't worry if you don't get all the generics stuff right away... The usage below should make it clearer:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication1
{
    public class Item
    {
        public int Weight { get; set; }
        public string Name { get; set; }
    }

    public static class Extensions
    {
        // Repeats each element 'weight' times, so a plain uniform pick over the
        // result becomes a weighted pick over the original list.
        public static IEnumerable<T> Weighted<T>(this IEnumerable<T> list, Func<T, int> weight)
        {
            foreach (T t in list)
                for (int i = 0; i < weight(t); i++)
                    yield return t;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Item> list = new List<Item>();
            list.Add(new Item { Name = "one", Weight = 5 });
            list.Add(new Item { Name = "two", Weight = 1 });

            Random rand = new Random(0);
            list = list.Weighted<Item>(x => x.Weight).ToList();

            // Pick 20 items uniformly from the expanded list; "one" should show
            // up roughly five times as often as "two".
            for (int i = 0; i < 20; i++)
            {
                int index = rand.Next(list.Count());
                Console.WriteLine(list.ElementAt(index).Name);
            }
            Console.ReadLine();
        }
    }
}
As you can see from the output, the results are both random and weighted as you require.
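Mapped onto the query from the question, the same extension could be used roughly like this. This is only a sketch: it assumes SeoPriority is an int, that the filtered WorkTypes set is small enough to pull into memory, and that at least one row matches; like the console demo above, the naive pick may return the same row more than once.

// Sketch only: pull the filtered rows into memory, expand them with the
// Weighted() extension above, then pick 'take' random entries client-side.
var rand = new Random();

var candidates = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .Select(e => new { e.DescriptionText, e.HumanId, e.SeoPriority })
    .ToList();                                     // leave LINQ-to-SQL here

var weighted = candidates
    .Weighted(x => x.SeoPriority)                  // each row appears Weight times
    .ToList();

var ws = Enumerable.Range(0, take)
    .Select(_ => weighted[rand.Next(weighted.Count)])
    .ToArray();                                    // duplicates possible, as in the demo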
I'm assuming that the weight is an integer. Here's an approach which joins to a dummy table to increase the row-count per the weight; first, let's prove it just in T-SQL:
SET NOCOUNT ON
--DROP TABLE [index]
--DROP TABLE seo
CREATE TABLE [index] ([key] int not null) -- names for fun ;-p
CREATE TABLE seo (url varchar(10) not null, [weight] int not null)
INSERT [index] values(1) INSERT [index] values(2)
INSERT [index] values(3) INSERT [index] values(4)
INSERT [index] values(5) INSERT [index] values(6)
INSERT [index] values(7) INSERT [index] values(8)
INSERT [index] values(9) INSERT [index] values(10)
INSERT [seo] VALUES ('abc',1) INSERT [seo] VALUES ('def',2)
INSERT [seo] VALUES ('ghi',1) INSERT [seo] VALUES ('jkl',3)
INSERT [seo] VALUES ('mno',1) INSERT [seo] VALUES ('mno',1)
INSERT [seo] VALUES ('pqr',2)
DECLARE @count int, @url varchar(10)
SET @count = 0
DECLARE @check_rand TABLE (url varchar(10) not null)
-- test it lots of times to check distribution roughly matches weights
WHILE @count < 11000
BEGIN
SET @count = @count + 1
SELECT TOP 1 @url = [seo].[url]
FROM [seo]
INNER JOIN [index] ON [index].[key] <= [seo].[weight]
ORDER BY NEWID()
-- this to check distribution
INSERT @check_rand VALUES (@url)
END
SELECT ISNULL(url, '(total)') AS [url], COUNT(1) AS [hits]
FROM @check_rand
GROUP BY url WITH ROLLUP
ORDER BY url
This outputs something like:
url hits
---------- -----------
(total) 11000
abc 1030
def 1970
ghi 1027
jkl 2972
mno 2014
pqr 1987
Showing that we have the correct overall distribution. Now let's bring that into LINQ-to-SQL; I've added the two tables to a data-context (you will need to create something like the [index] table to do this) - my DBML:
<Table Name="dbo.[index]" Member="indexes">
<Type Name="index">
<Column Name="[key]" Member="key" Type="System.Int32" DbType="Int NOT NULL" CanBeNull="false" />
</Type>
</Table>
<Table Name="dbo.seo" Member="seos">
<Type Name="seo">
<Column Name="url" Type="System.String" DbType="VarChar(10) NOT NULL" CanBeNull="false" />
<Column Name="weight" Type="System.Int32" DbType="Int NOT NULL" CanBeNull="false" />
</Type>
</Table>
Now we'll consume this; in the partial class for the data-context, add a compiled-query (for performance) in addition to the Random method:
partial class MyDataContextDataContext
{
    [Function(Name = "NEWID", IsComposable = true)]
    public Guid Random()
    {
        throw new NotImplementedException();
    }

    public string GetRandomUrl()
    {
        return randomUrl(this);
    }

    static readonly Func<MyDataContextDataContext, string>
        randomUrl = CompiledQuery.Compile(
            (MyDataContextDataContext ctx) =>
                (from s in ctx.seos
                 from i in ctx.indexes
                 where i.key <= s.weight
                 orderby ctx.Random()
                 select s.url).First());
}
This LINQ-to-SQL query is very similar to the key part of the T-SQL we wrote; let's test it:
using (var ctx = CreateContext()) {
    // show sample query
    ctx.Log = Console.Out;
    Console.WriteLine(ctx.GetRandomUrl());
    ctx.Log = null;

    // check distribution
    var counts = new Dictionary<string, int>();
    for (int i = 0; i < 11000; i++) // obviously a bit slower than inside db
    {
        if (i % 100 == 1) Console.WriteLine(i); // show progress
        string s = ctx.GetRandomUrl();
        int count;
        if (counts.TryGetValue(s, out count)) {
            counts[s] = count + 1;
        } else {
            counts[s] = 1;
        }
    }
    Console.WriteLine("(total)\t{0}", counts.Sum(p => p.Value));
    foreach (var pair in counts.OrderBy(p => p.Key)) {
        Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}
This runs the query once to show the TSQL is suitable, then (like before) 11k times to check the distribution; output (not including the progress updates):
SELECT TOP (1) [t0].[url]
FROM [dbo].[seo] AS [t0], [dbo].[index] AS [t1]
WHERE [t1].[key] <= [t0].[weight]
ORDER BY NEWID()
-- Context: SqlProvider(Sql2008) Model: AttributedMetaModel Build: 3.5.30729.4926
which doesn't look too bad at all - it has both tables, the range condition, and the TOP 1, so it is doing something very similar; data:
(total) 11000
abc 939
def 1893
ghi 1003
jkl 3104
mno 2048
pqr 2013
So again, we've got the right distribution, all from LINQ-to-SQL. Sorted?
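If you need several rows rather than one, as the question asks, the same shape of query can be reused. The method below is not from the original answer, just a sketch added to the same partial class: it streams the weighted, NEWID-ordered rows and keeps the first take distinct urls (take is assumed as a parameter).

// Sketch: same weighted join and NEWID() ordering as GetRandomUrl, but return
// 'take' distinct urls. The random ordering happens server-side; duplicates
// from the weight expansion are skipped client-side while streaming.
public IList<string> GetRandomUrls(int take)
{
    var ordered = from s in seos
                  from i in indexes
                  where i.key <= s.weight
                  orderby Random()
                  select s.url;

    var result = new List<string>();
    foreach (var url in ordered)          // streams in NEWID() order
    {
        if (!result.Contains(url))
            result.Add(url);
        if (result.Count == take)
            break;
    }
    return result;
}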
Your suggested solution, as it seems from the question, is bound to Linq/Linq2Sql.
If I understand correctly, your main goal is to fetch at most X records from the database, that have a weight of more than 0. If the database holds more than X records, you'd like to choose from them using the record's weight, and have a random result.
If all is correct so far, my solution is to clone each record by its weight: if a record's weight is 5, make sure you have it 5 times. This way the random choice takes into account the weight.
However, cloning the records creates, well, duplicates. So you can't just take X records; you should keep taking records until you have X distinct records.
So far I described a general solution, not related to the implementation.
I think it's harder to implement my solution using only Linq2Sql. If the total record count in the DB is not huge, I suggest reading the entire table and doing the cloning and random selection outside SQL Server.
If the total count is huge, I suggest you take, say, 100,000 records (or fewer) chosen at random (via Linq2Sql) and apply the implementation above. I believe it's random enough.
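A rough in-memory sketch of that idea, using the WorkTypes names from the question (take is an assumption, and the duplicate check relies on anonymous types comparing by value):

// Sketch of the clone-by-weight idea described above, done outside SQL Server.
var rand = new Random();

// Pull the candidate rows out of the database first (cap this if the table is huge).
var rows = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .Select(e => new { e.DescriptionText, e.HumanId, e.SeoPriority })
    .ToList();

// "Clone" each row by its weight...
var expanded = rows
    .SelectMany(r => Enumerable.Repeat(r, r.SeoPriority))
    .ToList();

// ...then keep drawing until we have 'take' distinct rows (bounded, so we
// don't loop forever if there aren't enough distinct rows).
var results = rows.Take(0).ToList();      // empty list of the same element type
int attempts = 0;
while (expanded.Count > 0 && results.Count < take && attempts++ < expanded.Count * 10)
{
    var candidate = expanded[rand.Next(expanded.Count)];
    if (!results.Contains(candidate))     // anonymous types compare by value
        results.Add(candidate);
}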
Try using the RAND() SQL function - it'll give you a float between 0 and 1.
The downside is that I'm not sure whether it would cause a full table scan on the SQL Server side, i.e. whether the resulting query and its execution would be optimized so that once the top n records are found, the rest of the table is ignored.
var rand = new Random();

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderByDescending(e => rand.Next() * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
The reason the GUID (NEWID) function was used in the SQL example you are looking at is simply that SQL Server's RAND function is only evaluated once per statement, so it is useless in a randomising SELECT.
But as you're using LINQ, a quick and dirty solution is to create a Random object and replace your order-by clause.
Random rand = new Random(DateTime.Now.Millisecond);

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderByDescending(e => rand.Next(10) * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
The rand.Next(10) assumes your SeoPriority scales from 0 to 10.
It's not 100% accurate, but it's close; adjusting the Next value can tweak it.
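If SeoPriority doesn't top out at 10, the bound for Next can be taken from the data instead. A sketch of the same idea follows; the AsEnumerable call is an assumption here so that Next() is evaluated per row in memory rather than being folded into the generated SQL:

// Sketch: same quick-and-dirty weighting, but the Next() bound comes from the
// data, and the ordering runs client-side so Next() varies per row.
Random rand = new Random(DateTime.Now.Millisecond);
int maxPriority = db.WorkTypes.Max(e => e.SeoPriority);

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .Select(e => new { e.DescriptionText, e.HumanId, e.SeoPriority })
    .AsEnumerable()                                        // order in memory
    .OrderByDescending(e => rand.Next(maxPriority + 1) * e.SeoPriority)
    .Select(e => new { e.DescriptionText, e.HumanId })
    .Take(take)
    .ToArray();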