Linq2sql: efficient way to get random elements with weight?
But let's say I have an integer weight where, for example, an element with weight 10 has a 10 times higher probability of being selected than an element with weight 1.
var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderBy(e => /* ???? */ * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
How do I solve getting random records in LINQ when I need the result to be weighted?
I need something like Random Weighted Choice in T-SQL, but in LINQ, and not limited to getting only one record.
If I didn't have the weighted requirement, I'd use the NEWID approach below; can I adapt it somehow?
partial class DataContext
{
    [Function(Name = "NEWID", IsComposable = true)]
    public Guid Random()
    {
        throw new NotImplementedException();
    }
}
...
var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderBy(e => db.Random())
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
My first idea was the same as Ron Klein's - create a weighted list and select randomly from that.
Here's a LINQ extension method to create the weighted list from the normal list, given a lambda function that knows the weight property of the object.
Don't worry if you don't get all the generics stuff right away... The usage below should make it clearer:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication1
{
    public class Item
    {
        public int Weight { get; set; }
        public string Name { get; set; }
    }

    public static class Extensions
    {
        // Repeats each element 'weight' times, so a plain uniform pick over the
        // result becomes a weighted pick over the original list.
        public static IEnumerable<T> Weighted<T>(this IEnumerable<T> list, Func<T, int> weight)
        {
            foreach (T t in list)
                for (int i = 0; i < weight(t); i++)
                    yield return t;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Item> list = new List<Item>();
            list.Add(new Item { Name = "one", Weight = 5 });
            list.Add(new Item { Name = "two", Weight = 1 });

            Random rand = new Random(0);
            list = list.Weighted<Item>(x => x.Weight).ToList();

            // Pick 20 items uniformly from the expanded list; "one" should show
            // up roughly five times as often as "two".
            for (int i = 0; i < 20; i++)
            {
                int index = rand.Next(list.Count());
                Console.WriteLine(list.ElementAt(index).Name);
            }
            Console.ReadLine();
        }
    }
}
As you can see from the output, the results are both random and weighted as you require.
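Mapped onto the query from the question, the same extension could be used roughly like this. This is only a sketch: it assumes SeoPriority is an int, that the filtered WorkTypes set is small enough to pull into memory, and that at least one row matches; like the console demo above, the naive pick may return the same row more than once.

// Sketch only: pull the filtered rows into memory, expand them with the
// Weighted() extension above, then pick 'take' random entries client-side.
var rand = new Random();

var candidates = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .Select(e => new { e.DescriptionText, e.HumanId, e.SeoPriority })
    .ToList();                                     // leave LINQ-to-SQL here

var weighted = candidates
    .Weighted(x => x.SeoPriority)                  // each row appears Weight times
    .ToList();

var ws = Enumerable.Range(0, take)
    .Select(_ => weighted[rand.Next(weighted.Count)])
    .ToArray();                                    // duplicates possible, as in the demo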
I'm assuming that the weight is an integer. Here's an approach which joins to a dummy table to increase the row-count per the weight; first, let's prove it just in T-SQL:
SET NOCOUNT ON
--DROP TABLE [index]
--DROP TABLE seo
CREATE TABLE [index] ([key] int not null) -- names for fun ;-p
CREATE TABLE seo (url varchar(10) not null, [weight] int not null)
INSERT [index] values(1) INSERT [index] values(2)
INSERT [index] values(3) INSERT [index] values(4)
INSERT [index] values(5) INSERT [index] values(6)
INSERT [index] values(7) INSERT [index] values(8)
INSERT [index] values(9) INSERT [index] values(10)
INSERT [seo] VALUES ('abc',1) INSERT [seo] VALUES ('def',2)
INSERT [seo] VALUES ('ghi',1) INSERT [seo] VALUES ('jkl',3)
INSERT [seo] VALUES ('mno',1) INSERT [seo] VALUES ('mno',1)
INSERT [seo] VALUES ('pqr',2)
DECLARE @count int, @url varchar(10)
SET @count = 0
DECLARE @check_rand TABLE (url varchar(10) not null)
-- test it lots of times to check distribution roughly matches weights
WHILE @count < 11000
BEGIN
SET @count = @count + 1
SELECT TOP 1 @url = [seo].[url]
FROM [seo]
INNER JOIN [index] ON [index].[key] <= [seo].[weight]
ORDER BY NEWID()
-- this to check distribution
INSERT @check_rand VALUES (@url)
END
SELECT ISNULL(url, '(total)') AS [url], COUNT(1) AS [hits]
FROM @check_rand
GROUP BY url WITH ROLLUP
ORDER BY url
This outputs something like:
url hits
---------- -----------
(total) 11000
abc 1030
def 1970
ghi 1027
jkl 2972
mno 2014
pqr 1987
Showing that we have the correct overall distribution. Now let's bring that into LINQ-to-SQL; I've added the two tables to a data-context (you will need to create something like the [index] table to do this) - my DBML:
<Table Name="dbo.[index]" Member="indexes">
<Type Name="index">
<Column Name="[key]" Member="key" Type="System.Int32" DbType="Int NOT NULL" CanBeNull="false" />
</Type>
</Table>
<Table Name="dbo.seo" Member="seos">
<Type Name="seo">
<Column Name="url" Type="System.String" DbType="VarChar(10) NOT NULL" CanBeNull="false" />
<Column Name="weight" Type="System.Int32" DbType="Int NOT NULL" CanBeNull="false" />
</Type>
</Table>
Now we'll consume this; in the partial class for the data-context, add a compiled-query (for performance) in addition to the Random method:
partial class MyDataContextDataContext
{
    [Function(Name = "NEWID", IsComposable = true)]
    public Guid Random()
    {
        throw new NotImplementedException();
    }

    public string GetRandomUrl()
    {
        return randomUrl(this);
    }

    static readonly Func<MyDataContextDataContext, string>
        randomUrl = CompiledQuery.Compile(
            (MyDataContextDataContext ctx) =>
                (from s in ctx.seos
                 from i in ctx.indexes
                 where i.key <= s.weight
                 orderby ctx.Random()
                 select s.url).First());
}
This LINQ-to-SQL query is very similar to the key part of the T-SQL we wrote; let's test it:
using (var ctx = CreateContext()) {
    // show sample query
    ctx.Log = Console.Out;
    Console.WriteLine(ctx.GetRandomUrl());
    ctx.Log = null;

    // check distribution
    var counts = new Dictionary<string, int>();
    for (int i = 0; i < 11000; i++) // obviously a bit slower than inside db
    {
        if (i % 100 == 1) Console.WriteLine(i); // show progress
        string s = ctx.GetRandomUrl();
        int count;
        if (counts.TryGetValue(s, out count)) {
            counts[s] = count + 1;
        } else {
            counts[s] = 1;
        }
    }
    Console.WriteLine("(total)\t{0}", counts.Sum(p => p.Value));
    foreach (var pair in counts.OrderBy(p => p.Key)) {
        Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}
This runs the query once to show the TSQL is suitable, then (like before) 11k times to check the distribution; output (not including the progress updates):
SELECT TOP (1) [t0].[url]
FROM [dbo].[seo] AS [t0], [dbo].[index] AS [t1]
WHERE [t1].[key] <= [t0].[weight]
ORDER BY NEWID()
-- Context: SqlProvider(Sql2008) Model: AttributedMetaModel Build: 3.5.30729.4926
which doesn't look too bad at all - it has both tables, the range condition, and the TOP 1, so it is doing something very similar; data:
(total) 11000
abc 939
def 1893
ghi 1003
jkl 3104
mno 2048
pqr 2013
So again, we've got the right distribution, all from LINQ-to-SQL. Sorted?
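If you need several rows rather than one, as the question asks, the same shape of query can be reused. The method below is not from the original answer, just a sketch added to the same partial class: it streams the weighted, NEWID-ordered rows and keeps the first take distinct urls (take is assumed as a parameter).

// Sketch: same weighted join and NEWID() ordering as GetRandomUrl, but return
// 'take' distinct urls. The random ordering happens server-side; duplicates
// from the weight expansion are skipped client-side while streaming.
public IList<string> GetRandomUrls(int take)
{
    var ordered = from s in seos
                  from i in indexes
                  where i.key <= s.weight
                  orderby Random()
                  select s.url;

    var result = new List<string>();
    foreach (var url in ordered)          // streams in NEWID() order
    {
        if (!result.Contains(url))
            result.Add(url);
        if (result.Count == take)
            break;
    }
    return result;
}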
Your suggested solution, as it seems from the question, is bound to Linq/Linq2Sql.
If I understand correctly, your main goal is to fetch at most X records from the database, that have a weight of more than 0. If the database holds more than X records, you'd like to choose from them using the record's weight, and have a random result.
If all is correct so far, my solution is to clone each record by its weight: if a record's weight is 5, make sure you have it 5 times. This way the random choice takes into account the weight.
However, cloning the records creates, well, duplicates. So you can't just take X records; you should keep taking records until you have X distinct records.
So far I described a general solution, not related to the implementation.
I think it's harder to implement my solution using only Linq2Sql. If the total record count in the DB is not huge, I suggest reading the entire table and doing the cloning and random selection outside SQL Server.
If the total count is huge, I suggest you take, say, 100,000 records (or fewer) chosen at random (via Linq2Sql) and apply the implementation above. I believe it's random enough.
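A rough in-memory sketch of that idea, using the WorkTypes names from the question (take is an assumption, and the duplicate check relies on anonymous types comparing by value):

// Sketch of the clone-by-weight idea described above, done outside SQL Server.
var rand = new Random();

// Pull the candidate rows out of the database first (cap this if the table is huge).
var rows = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .Select(e => new { e.DescriptionText, e.HumanId, e.SeoPriority })
    .ToList();

// "Clone" each row by its weight...
var expanded = rows
    .SelectMany(r => Enumerable.Repeat(r, r.SeoPriority))
    .ToList();

// ...then keep drawing until we have 'take' distinct rows (bounded, so we
// don't loop forever if there aren't enough distinct rows).
var results = rows.Take(0).ToList();      // empty list of the same element type
int attempts = 0;
while (expanded.Count > 0 && results.Count < take && attempts++ < expanded.Count * 10)
{
    var candidate = expanded[rand.Next(expanded.Count)];
    if (!results.Contains(candidate))     // anonymous types compare by value
        results.Add(candidate);
}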
Try using the RAND() SQL function - it'll give you a float between 0 and 1.
The downside is that I'm not sure whether it would cause a full table scan on the SQL Server side, i.e. whether the resulting query and its execution would be optimized so that once the top n records are found, the rest of the table is ignored.
var rand = new Random();

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderByDescending(e => rand.Next() * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
The reason the GUID (NEWID) function was used in the SQL example you are looking at is simply that SQL Server's RAND function is only evaluated once per statement, so it is useless in a randomising SELECT.
But as you're using LINQ, a quick and dirty solution is to create a Random object and replace your order-by clause.
Random rand = new Random(DateTime.Now.Millisecond);

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderByDescending(e => rand.Next(10) * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
The rand.Next(10) assumes your SeoPriority scales from 0 to 10.
It's not 100% accurate, but it's close; adjusting the Next value can tweak it.
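If SeoPriority doesn't top out at 10, the bound for Next can be taken from the data instead. A sketch of the same idea follows; the AsEnumerable call is an assumption here so that Next() is evaluated per row in memory rather than being folded into the generated SQL:

// Sketch: same quick-and-dirty weighting, but the Next() bound comes from the
// data, and the ordering runs client-side so Next() varies per row.
Random rand = new Random(DateTime.Now.Millisecond);
int maxPriority = db.WorkTypes.Max(e => e.SeoPriority);

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .Select(e => new { e.DescriptionText, e.HumanId, e.SeoPriority })
    .AsEnumerable()                                        // order in memory
    .OrderByDescending(e => rand.Next(maxPriority + 1) * e.SeoPriority)
    .Select(e => new { e.DescriptionText, e.HumanId })
    .Take(take)
    .ToArray();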