
SQL huge selection of IDs - How to make it faster?

I have an array with a huge number of IDs that I would like to select from the DB.

The usual approach would be to do select blabla from xxx where yyy IN (ids) OPTION (RECOMPILE). (The OPTION (RECOMPILE) is needed because SQL Server is not intelligent enough to see that caching a plan for this one-off query is a huge waste of memory.)

However, SQL Server is horrible at this type of query when the number of IDs is high; the parser it uses is simply too slow. Let me give an example:

SELECT * FROM table WHERE id IN (288525, 288528, 288529,<about 5000 ids>, 403043, 403044) OPTION (RECOMPILE)

Time to execute: ~1100 msec (this returns approx. 200 rows in my example)

Versus:

SELECT * FROM table WHERE id BETWEEN 288525 AND 403044 OPTION (RECOMPILE)

Time to execute: ~80 msec (this returns approx. 50,000 rows in my example)

So even though I get 250 times more data back, it executes 14 times faster...

So I built this function to take my list of IDs and build a predicate that is a reasonable compromise between the two: something that doesn't return 250 times as much data, yet still gets the benefit of faster query parsing.

  // Maximum gap between consecutive IDs that will be bridged with a BETWEEN,
  // i.e. how many unwanted rows we are willing to fetch to keep the SQL short.
  private const int MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH = 5;

  // Consumes keys (which must be sorted and unique) from startindex onward and
  // returns a predicate of BETWEEN ranges plus an IN list for the leftovers.
  // startindex is advanced past the consumed keys so the caller can loop.
  public static string MassIdSelectionStringBuilder(
       List<int> keys, ref int startindex, string colname)
  {
     const int maxlength = 63000; // stay well below the batch size limit
     if (keys.Count - startindex == 1)
     {
        // Only one key left: a simple equality is cheapest.
        string idstring = String.Format("{0} = {1}", colname, keys[startindex]);
        startindex++;
        return idstring;
     }
     StringBuilder sb = new StringBuilder(maxlength + 1000);
     List<int> individualkeys = new List<int>(256);
     int min = keys[startindex++];
     int max = min;
     sb.Append("(");
     const string betweenAnd = "{0} BETWEEN {1} AND {2}\n";
     // individualkeys.Count * 8 estimates the size of the trailing IN list.
     for (; startindex < keys.Count && sb.Length + individualkeys.Count * 8 < maxlength; startindex++)
     {
        int key = keys[startindex];
        if (key > max + MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH)
        {
           // Gap too large: flush the current run as a range or a single key.
           if (min == max)
              individualkeys.Add(min);
           else
           {
              if (sb.Length > 2)
                 sb.Append(" OR ");
              sb.AppendFormat(betweenAnd, colname, min, max);
           }
           min = max = key;
        }
        else
        {
           max = key; // extend the current run
        }
     }
     // Flush the final run.
     if (min == max)
        individualkeys.Add(min);
     else
     {
        if (sb.Length > 2)
           sb.Append(" OR ");
        sb.AppendFormat(betweenAnd, colname, min, max);
     }
     // Emit all leftover single keys as one IN list.
     if (individualkeys.Count > 0)
     {
        if (sb.Length > 2)
           sb.Append(" OR ");
        string[] individualkeysstr = new string[individualkeys.Count];
        for (int i = 0; i < individualkeys.Count; i++)
           individualkeysstr[i] = individualkeys[i].ToString();
        sb.AppendFormat("{0} IN ({1})", colname, String.Join(",", individualkeysstr));
     }
     sb.Append(")");
     return sb.ToString();
  }
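
For a concrete picture of what it produces: a sorted, de-duplicated key list such as {1, 2, 3, 10, 20, 21}, with the gap tolerance of 5, comes out as a predicate along these lines:

(id BETWEEN 1 AND 3
 OR id BETWEEN 20 AND 21
 OR id IN (10))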

It is then used like this:

 List<int> keys; // sorted and made unique beforehand
 ...
 for (int i = 0; i < keys.Count;)
 {
    string idstring = MassIdSelectionStringBuilder(keys, ref i, "id");
    string sqlstring = string.Format("SELECT * FROM table WHERE {0} OPTION (RECOMPILE)", idstring);
    // ...execute sqlstring and append the results...
 }

However, my question is... Does anyone know of a better/faster/smarter way to do this?


In my experience the fastest way was to pack the numbers in binary format into an image parameter. I was sending up to 100K IDs this way, which works just fine:

Mimicking a table variable parameter with an image

Yet it was a while ago. The following article by Erland Sommarskog is up to date:

Arrays and Lists in SQL Server
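
For reference, the server-side half of that trick looks roughly like this. A minimal sketch, assuming the client packs each ID as a 4-byte big-endian integer into a varbinary(max) parameter (the procedure name is made up for illustration; xxx is the table from the question):

CREATE PROCEDURE dbo.SelectByPackedIds @ids varbinary(max) AS
BEGIN
   DECLARE @t TABLE (Id int PRIMARY KEY);
   DECLARE @i int, @len int;
   SELECT @i = 1, @len = DATALENGTH(@ids);
   WHILE @i <= @len - 3
   BEGIN
      -- SUBSTRING works on varbinary; CAST reads the 4 bytes as a big-endian int
      INSERT INTO @t (Id) VALUES (CAST(SUBSTRING(@ids, @i, 4) AS int));
      SET @i = @i + 4;
   END;
   SELECT x.* FROM dbo.xxx AS x INNER JOIN @t AS ids ON x.id = ids.Id;
END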


If the list of IDs were in another table that was indexed, this would execute a whole lot faster using a simple INNER JOIN.

If that isn't possible, then try creating a TABLE variable like so:

DECLARE @tTable TABLE
(
   Id int PRIMARY KEY
)

Store the IDs in the table variable first, then INNER JOIN to your table xxx. I have had limited success with this method, but it's worth a try.
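
A minimal sketch of the rest of that flow, assuming SQL Server 2008+ for the multi-row VALUES syntax (the table name xxx comes from the question; the IDs are just examples):

INSERT INTO @tTable (Id) VALUES (288525), (288528), (288529) -- ...and so on

SELECT x.*
FROM xxx AS x
INNER JOIN @tTable AS t ON x.id = t.Id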


You're using (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH) as the check to determine whether to do a range fetch instead of an individual fetch. It appears that's not the best way to do that.

Let's consider the four ID sequences {2, 7}, {2, 8}, {1, 2, 7}, and {1, 2, 8}. They translate into:

ID BETWEEN 2 AND 7
ID IN (2, 8)
ID BETWEEN 1 AND 7
ID BETWEEN 1 AND 2 OR ID IN (8)

The decision to fetch and filter the IDs 3-6 now depends only on the difference between 2 and 7/8. However, it does not take into account whether 2 is already part of a range or an individual ID.

I think the proper criterion is how many individual IDs you save. Converting two individuals into a range has a net benefit of 2 * Cost(individual) - Cost(range), whereas extending a range has a net benefit of Cost(individual) - Cost(range extension).


Adding RECOMPILE is not a good idea. SQL Server does not cache your query results, but it does cache the execution plan, which is what makes repeated queries faster. If you add RECOMPILE, every execution carries the overhead of compiling the query. Try creating a stored procedure, putting the query there, and calling it from your code, as stored procedures are always precompiled.


Another dirty idea, similar to Neil's:

  • Have an indexed view which holds the IDs alone, based on your business condition
  • Then you can join the view with your actual table to get the desired result.
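
A hedged sketch of that idea (the view name and the business condition are illustrative assumptions; note that indexed views require WITH SCHEMABINDING, two-part names, and a unique clustered index, and on some editions the query needs a NOEXPAND hint to use the view's index):

CREATE VIEW dbo.vWantedIds WITH SCHEMABINDING AS
   SELECT id FROM dbo.xxx WHERE someBusinessFlag = 1;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vWantedIds ON dbo.vWantedIds (id);
GO
SELECT x.* FROM dbo.xxx AS x INNER JOIN dbo.vWantedIds AS v ON x.id = v.id;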


The efficient way to do this is to:

  1. Create a temporary table to hold the IDs
  2. Call a SQL stored procedure with a string parameter holding all the comma-separated IDs
  3. The SQL stored procedure uses a loop with CHARINDEX() to find each comma, then SUBSTRING to extract the string between two commas and CONVERT to make it an int, and uses INSERT INTO @Temporary VALUES ... to insert it into the temporary table
  4. INNER JOIN the temporary table or use it in an IN (SELECT ID from @Temporary) subquery

Every one of these steps is extremely fast because a single string is passed, no compilation is done during the loop, and no substrings are created except the actual id values.

No recompilation is done at all when this is executed as long as the large string is passed as a parameter.

Note that in the loop you must track the prior and current comma positions in two separate variables.
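
A minimal sketch of that procedure (names are illustrative; assumes the list is non-empty and uses a table variable, as the @Temporary in step 4 suggests):

CREATE PROCEDURE dbo.SelectByIdList @idList varchar(max) AS
BEGIN
   DECLARE @Temporary TABLE (Id int PRIMARY KEY);
   DECLARE @prev int, @curr int;
   SET @prev = 0;                        -- position of the prior comma
   SET @curr = CHARINDEX(',', @idList);  -- position of the current comma
   WHILE @curr > 0
   BEGIN
      INSERT INTO @Temporary VALUES (CONVERT(int, SUBSTRING(@idList, @prev + 1, @curr - @prev - 1)));
      SET @prev = @curr;
      SET @curr = CHARINDEX(',', @idList, @prev + 1);
   END;
   -- the last ID has no trailing comma
   INSERT INTO @Temporary VALUES (CONVERT(int, SUBSTRING(@idList, @prev + 1, LEN(@idList) - @prev)));
   SELECT x.* FROM xxx AS x INNER JOIN @Temporary AS t ON x.id = t.Id;
END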


Off the cuff here - does incorporating a derived table help performance at all? I am not set up to test this fully; I just wonder if this would optimize to use BETWEEN and then filter the unneeded rows out:

Select * from
( SELECT *
  FROM dbo.table
  WHERE ID between <lowerbound> and <upperbound>) as range
where ID in (
    1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213,
    1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221,
    1222, 1223, 1224, 1225, 1226, 1227, 1228, <...>,
    1230, 1231
)
