
SQL huge selection of IDs - How to make it faster?

I have an array with a huge number of IDs that I would like to select from the DB.

The usual approach would be to do select blabla from xxx where yyy IN (ids) OPTION (RECOMPILE). (The OPTION (RECOMPILE) is needed because SQL Server is not intelligent enough to see that caching a plan for this one-off query is a huge waste of memory.)

However, SQL Server is horrible at this type of query when the number of IDs is high; the parser it uses is simply too slow. Let me give an example:

SELECT * FROM table WHERE id IN (288525, 288528, 288529,<about 5000 ids>, 403043, 403044) OPTION (RECOMPILE)

Time to execute: ~1100 msec (this returns approx. 200 rows in my example)

Versus:

SELECT * FROM table WHERE id BETWEEN 288525 AND 403044 OPTION (RECOMPILE)

Time to execute: ~80 msec (this returns approx. 50,000 rows in my example)

So even though I get 250 times more data back, it executes 14 times faster...

So I built this function to take my list of IDs and build a predicate that is a reasonable compromise between the two: something that doesn't return 250 times as much data, yet still gets the benefit of faster query parsing.

  // Maximum gap between consecutive IDs that will be bridged with a BETWEEN,
  // i.e. how many unwanted rows we are willing to fetch to keep the SQL short.
  private const int MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH = 5;

  // Consumes keys (which must be sorted and unique) from startindex onward and
  // returns a predicate of BETWEEN ranges plus an IN list for the leftovers.
  // startindex is advanced past the consumed keys so the caller can loop.
  public static string MassIdSelectionStringBuilder(
       List<int> keys, ref int startindex, string colname)
  {
     const int maxlength = 63000; // stay well below the batch size limit
     if (keys.Count - startindex == 1)
     {
        // Only one key left: a simple equality is cheapest.
        string idstring = String.Format("{0} = {1}", colname, keys[startindex]);
        startindex++;
        return idstring;
     }
     StringBuilder sb = new StringBuilder(maxlength + 1000);
     List<int> individualkeys = new List<int>(256);
     int min = keys[startindex++];
     int max = min;
     sb.Append("(");
     const string betweenAnd = "{0} BETWEEN {1} AND {2}\n";
     // individualkeys.Count * 8 estimates the size of the trailing IN list.
     for (; startindex < keys.Count && sb.Length + individualkeys.Count * 8 < maxlength; startindex++)
     {
        int key = keys[startindex];
        if (key > max + MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH)
        {
           // Gap too large: flush the current run as a range or a single key.
           if (min == max)
              individualkeys.Add(min);
           else
           {
              if (sb.Length > 2)
                 sb.Append(" OR ");
              sb.AppendFormat(betweenAnd, colname, min, max);
           }
           min = max = key;
        }
        else
        {
           max = key; // extend the current run
        }
     }
     // Flush the final run.
     if (min == max)
        individualkeys.Add(min);
     else
     {
        if (sb.Length > 2)
           sb.Append(" OR ");
        sb.AppendFormat(betweenAnd, colname, min, max);
     }
     // Emit all leftover single keys as one IN list.
     if (individualkeys.Count > 0)
     {
        if (sb.Length > 2)
           sb.Append(" OR ");
        string[] individualkeysstr = new string[individualkeys.Count];
        for (int i = 0; i < individualkeys.Count; i++)
           individualkeysstr[i] = individualkeys[i].ToString();
        sb.AppendFormat("{0} IN ({1})", colname, String.Join(",", individualkeysstr));
     }
     sb.Append(")");
     return sb.ToString();
  }
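
For a concrete picture of what it produces: a sorted, de-duplicated key list such as {1, 2, 3, 10, 20, 21}, with the gap tolerance of 5, comes out as a predicate along these lines:

(id BETWEEN 1 AND 3
 OR id BETWEEN 20 AND 21
 OR id IN (10))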

It is then used like this:

 List<int> keys; // sorted and made unique beforehand
 ...
 for (int i = 0; i < keys.Count;)
 {
    string idstring = MassIdSelectionStringBuilder(keys, ref i, "id");
    string sqlstring = string.Format("SELECT * FROM table WHERE {0} OPTION (RECOMPILE)", idstring);
    // ...execute sqlstring and append the results...
 }

However, my question is... Does anyone know of a better/faster/smarter way to do this?


In my experience the fastest way was to pack the numbers in binary format into an image parameter. I was sending up to 100K IDs this way, which works just fine:

Mimicking a table variable parameter with an image

Yet it was a while ago. The following article by Erland Sommarskog is up to date:

Arrays and Lists in SQL Server
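
For reference, the server-side half of that trick looks roughly like this. A minimal sketch, assuming the client packs each ID as a 4-byte big-endian integer into a varbinary(max) parameter (the procedure name is made up for illustration; xxx is the table from the question):

CREATE PROCEDURE dbo.SelectByPackedIds @ids varbinary(max) AS
BEGIN
   DECLARE @t TABLE (Id int PRIMARY KEY);
   DECLARE @i int, @len int;
   SELECT @i = 1, @len = DATALENGTH(@ids);
   WHILE @i <= @len - 3
   BEGIN
      -- SUBSTRING works on varbinary; CAST reads the 4 bytes as a big-endian int
      INSERT INTO @t (Id) VALUES (CAST(SUBSTRING(@ids, @i, 4) AS int));
      SET @i = @i + 4;
   END;
   SELECT x.* FROM dbo.xxx AS x INNER JOIN @t AS ids ON x.id = ids.Id;
END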


If the list of IDs were in another table that was indexed, this would execute a whole lot faster using a simple INNER JOIN.

If that isn't possible, then try creating a TABLE variable like so:

DECLARE @tTable TABLE
(
   Id int PRIMARY KEY
)

Store the IDs in the table variable first, then INNER JOIN to your table xxx. I have had limited success with this method, but it's worth a try.
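
A minimal sketch of the rest of that flow, assuming SQL Server 2008+ for the multi-row VALUES syntax (the table name xxx comes from the question; the IDs are just examples):

INSERT INTO @tTable (Id) VALUES (288525), (288528), (288529) -- ...and so on

SELECT x.*
FROM xxx AS x
INNER JOIN @tTable AS t ON x.id = t.Id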


You're using (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH) as the check to determine whether to do a range fetch instead of an individual fetch. It appears that's not the best way to do that.

Let's consider the four ID sequences {2, 7}, {2, 8}, {1, 2, 7}, and {1, 2, 8}. They translate into:

ID BETWEEN 2 AND 7
ID IN (2, 8)
ID BETWEEN 1 AND 7
ID BETWEEN 1 AND 2 OR ID IN (8)

The decision to fetch and filter the IDs 3-6 now depends only on the difference between 2 and 7/8. However, it does not take into account whether 2 is already part of a range or an individual ID.

I think the proper criterion is how many individual IDs you save. Converting two individuals into a range has a net benefit of 2 * Cost(individual) - Cost(range), whereas extending a range has a net benefit of Cost(individual) - Cost(range extension).


Adding RECOMPILE is not a good idea. SQL Server does not cache your query results, but it does cache the execution plan, which is what makes repeated queries faster. If you add RECOMPILE, every execution carries the overhead of compiling the query. Try creating a stored procedure, putting the query there, and calling it from your code, as stored procedures are always precompiled.


Another dirty idea, similar to Neil's:

  • Have an indexed view which holds the IDs alone, based on your business condition
  • Then you can join the view with your actual table to get the desired result.
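
A hedged sketch of that idea (the view name and the business condition are illustrative assumptions; note that indexed views require WITH SCHEMABINDING, two-part names, and a unique clustered index, and on some editions the query needs a NOEXPAND hint to use the view's index):

CREATE VIEW dbo.vWantedIds WITH SCHEMABINDING AS
   SELECT id FROM dbo.xxx WHERE someBusinessFlag = 1;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vWantedIds ON dbo.vWantedIds (id);
GO
SELECT x.* FROM dbo.xxx AS x INNER JOIN dbo.vWantedIds AS v ON x.id = v.id;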


The efficient way to do this is to:

  1. Create a temporary table to hold the IDs
  2. Call a SQL stored procedure with a string parameter holding all the comma-separated IDs
  3. The SQL stored procedure uses a loop with CHARINDEX() to find each comma, then SUBSTRING to extract the string between two commas and CONVERT to make it an int, and uses INSERT INTO @Temporary VALUES ... to insert it into the temporary table
  4. INNER JOIN the temporary table or use it in an IN (SELECT ID from @Temporary) subquery

Every one of these steps is extremely fast because a single string is passed, no compilation is done during the loop, and no substrings are created except the actual id values.

No recompilation is done at all when this is executed as long as the large string is passed as a parameter.

Note that in the loop you must track the prior and current comma positions in two separate variables.
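
A minimal sketch of that procedure (names are illustrative; assumes the list is non-empty and uses a table variable, as the @Temporary in step 4 suggests):

CREATE PROCEDURE dbo.SelectByIdList @idList varchar(max) AS
BEGIN
   DECLARE @Temporary TABLE (Id int PRIMARY KEY);
   DECLARE @prev int, @curr int;
   SET @prev = 0;                        -- position of the prior comma
   SET @curr = CHARINDEX(',', @idList);  -- position of the current comma
   WHILE @curr > 0
   BEGIN
      INSERT INTO @Temporary VALUES (CONVERT(int, SUBSTRING(@idList, @prev + 1, @curr - @prev - 1)));
      SET @prev = @curr;
      SET @curr = CHARINDEX(',', @idList, @prev + 1);
   END;
   -- the last ID has no trailing comma
   INSERT INTO @Temporary VALUES (CONVERT(int, SUBSTRING(@idList, @prev + 1, LEN(@idList) - @prev)));
   SELECT x.* FROM xxx AS x INNER JOIN @Temporary AS t ON x.id = t.Id;
END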


Off the cuff here - does incorporating a derived table help performance at all? I am not set up to test this fully; I just wonder if this would optimize to use BETWEEN and then filter the unneeded rows out:

Select * from
( SELECT *
  FROM dbo.table
  WHERE ID between <lowerbound> and <upperbound>) as range
where ID in (
    1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213,
    1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221,
    1222, 1223, 1224, 1225, 1226, 1227, 1228, <...>,
    1230, 1231
)
