mysql index optimization for a table with multiple indexes that index some of the same columns

2022-12-25 16:05 问答作者：

I have a table that stores some basic data about visitor sessions on third party web sites. This is its structure:

id, site_id, unixtime, unixtime_last, ip_address, uid

There are four indexes: id, site_id/unixtime, site_id/ip_address, and site_id/uid

There are many different types of ways that we query this table, and all of them are specific to the site_id. The index with unixtime is used to display the list of visitors for a given date or time range. The other two are used to find all visits from an IP address or a "uid" (a unique cookie value created for each visitor), as well as determining if this is a new visitor or a returning visitor.

Obviously storing site_id inside 3 indexes is inefficient for both write speed and storage, but I see no way around it, since I need to be able to quickly query this data for a given specific site_id.

Any ideas on making this more efficient?

I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct? Because I considered having the site_id being the second column of the index for both ip_address and uid but I think that would make the index less efficient since the IP and UID are going to vary more than the site ID will, because we only have about 8000 unique sites per database server, but millions of unique visitors across all ~8000 sites on a daily basis.

I've also considered removing site_id from the IP and UID indexes completely, since the chances of the same visitor going to multiple sites that share the same database server are quite small, but in cases where this does happen, I fear it could be quite slow to determine if this is a new visitor to this site_id or not. The query would be something like:

select id from sessions where uid = 'value' and site_id = 123 limit 1

... so if this visitor had visited this site before, it would only need to find one row with this site_id before it stopped. This wouldn't be super fast necessarily, but acceptably fast. But say we have a site that gets 500,000 visitors a day, and a particular visitor loves this site and goes there 10 times a day. Now they happen to hit another site on the same database server for the first time. The above query could take quite a long time to search through all of the potentially thousands of rows for this UID, scattered all over the disk, since it wouldn't be finding one for this site ID.

Any insight on making this as efficient as possible would be appreciated :)

Update - this is a MyISAM table with MySQL 5.0. My concerns are both with performance as well as storage s开发者_开发百科pace. This table is both read and write heavy. If I had to choose between performance and storage, my biggest concern is performance - but both are important.

We use memcached heavily in all areas of our service, but that's not an excuse to not care about the database design. I want the database to be as efficient as possible.

I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct?

There is one important property of B-tree indices you need to be aware of: It is possible (efficient) to search for an arbitrary prefix of the full key, but not a suffix. If you have an index site_ip(site_id, ip), and you ask for where ip = 1.2.3.4, MySQL will not use the site_ip index. If you instead had ip_site(ip, site_id), then MySQL would be able to use the ip_site index.

The is a second property of B-tree indices you should be aware of as well: they are sorted. A b-tree index can be used for queries like where site_id < 40.

There is also an important property of disk drives to keep in mind: sequential reads are cheap, seeks are not. If there are any columns used that are not in the index, MySQL must read the row from the table data. That's generally a seek, and slow. So if MySQL believes it'd wind up reading even a small percent of the table like this, it'll instead ignore the index. One big table scan (a sequential read) is usually faster than random reads of even a few percent of the rows in a table.

The same, by the way, applies to seeks through an index. Finding a key in a B-tree actually potentially requires a few seeks, so you'll find that WHERE site_id > 800 AND ip = '1.2.3.4' may not use the site_ip index, becuase each site_id requires several index seeks to find the start of the 1.2.3.4 records for that site. The ip_site index, however, would be used.

Ultimately, you're going to have to make liberal use of benchmarking and EXPLAIN to figure out the best indices for your database. Remember, you can freely add and drop indices as needed. Non-unique indices are not part of your data model; they are merely an optimization.

PS: Benchmark InnoDB as well, it often has better concurrent performance. Same with PostgreSQL.

First of all, if you are using ip as a string than change it to INT UNSIGNED column and use INET_ATON(expr) and INET_NTOA(expr) function to deal with this. Indexing on integer value is more efficient than indexing on strings of variable length.

Well indexes trade storage for performance. Its hard if you want both. Its hard to optimize this any further without know all the queries you run and their quantities per interval.

What you have will work. If you're running into a bottleneck, you'll need to find out whether its cpu,ram,disk and/or network and adjust accordingly. Its hard and wrong to prematurely optimize.

You probably want to switch to innodb if you have any updates, other wise myisam is good for insert/select. Also since your row size is small, you could look into mysql cluster (nbd). There is also an archive engine that can help with storage requirements but partitioning in 5.1 is probably a better thing to look into.

Flipping the order of your index doesn't make any sense, if these indexes are already used in all of your queries.

but it's more efficient to have the left-most column of an index be the one with the least variance - correct?

not sure but I haven't heard this before. Doesn't seem true to me for this application. The index order matters for sorting and by having multiple unique 1st most index fields, allows more possible queries to use index.

继续阅读：indexing

mysql index optimization for a table with multiple indexes that index some of the same columns

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？