
SQL Server to PostgreSQL - Migration and design concerns

Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:

I have an Articles table:

CREATE TABLE [dbo].[Articles](
    [server_ref] [int] NOT NULL,
    [article_ref] [int] NOT NULL,
    [article_title] [varchar](400) NOT NULL,
    [category_ref] [int] NOT NULL,
    [size] [bigint] NOT NULL
)
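For reference, a direct PostgreSQL translation of this table might look like the following (a sketch; the lowercase table name and the type mappings are my assumptions — `int`, `varchar(400)`, and `bigint` all exist as-is in PostgreSQL):

```sql
-- Hypothetical PostgreSQL equivalent of the SQL Server table above.
-- PostgreSQL has no [dbo] schema brackets; unquoted names fold to lowercase.
CREATE TABLE articles (
    server_ref    int          NOT NULL,
    article_ref   int          NOT NULL,
    article_title varchar(400) NOT NULL,
    category_ref  int          NOT NULL,
    size          bigint       NOT NULL
);
```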

Data (comma-delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.

Importing:

  • Indexes are disabled on the Articles table.
  • For each dumped text file
    • Data is BULK copied to a temporary table.
    • Temporary table is updated.
    • Old data for the server is dropped from the Articles table.
    • Temporary table data is copied to Articles table.
    • Temporary table dropped.
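In PostgreSQL terms, the per-file steps above could be sketched roughly as follows (file path, server number, and staging-table name are illustrative, not from the post; `COPY` is PostgreSQL's analogue of `BULK INSERT`):

```sql
-- Sketch of one file's import for server 33, assuming CSV input.
BEGIN;
CREATE TEMP TABLE staging (LIKE articles INCLUDING DEFAULTS);
COPY staging FROM '/import/server_33.csv' WITH (FORMAT csv);
-- ... any clean-up/UPDATE work on staging goes here ...
DELETE FROM articles WHERE server_ref = 33;
INSERT INTO articles SELECT * FROM staging;
DROP TABLE staging;  -- temp tables are also dropped automatically at session end
COMMIT;
```

Note that the answer below argues against this delete-and-reload pattern on an indexed PostgreSQL table because of MVCC bloat; the sketch only mirrors the existing process.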

Once this process is complete for all servers the indexes are built and the new database is copied to a web server.

I am reasonably happy with this process but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better, e.g. SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%' has been satisfactory, but I want to improve the speed of searching. Obviously the "LIKE" is my problem here. Suggestions? SELECT * FROM Articles WHERE article_title LIKE '%criteria%' is horrendous.

Partitioning is a SQL Server Enterprise feature, but $$$, so its availability is one of the many exciting prospects of PostgreSQL. What performance hit will be incurred by the import process (drop data, insert data) and by building indexes? Will the database grow by a huge amount?

The database currently stands at 200 GB and will grow. Copying this across the network is not ideal but it works. I am putting thought into changing the hardware structure of the system. The thought process of having an import server and a web server is so that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) can present reports. Maybe reducing the system down to one server would work to skip the copying across the network stage. This one server would have two versions of the database: one with the indexes for delivering reports and the other without for importing new data. The databases would swap daily. Thoughts?

This is a fantastic system, and believe it or not there is some method to my madness in giving it a big shake-up.

UPDATE: I am not looking for help with relational databases, but hoping to bounce ideas around with data warehouse experts.


I am not a data warehousing expert, but a couple of pointers.

It seems like your data can be easily partitioned. See the PostgreSQL documentation on partitioning for how to split data into separate physical tables. This lets you manage data at your natural per-server granularity.
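As a sketch, on PostgreSQL 10+ a declarative-partitioning variant of the Articles table, partitioned by list on server_ref, could look like this (names are illustrative; older versions would use table inheritance instead):

```sql
-- Parent table holds no rows itself; each server gets its own partition.
CREATE TABLE articles (
    server_ref    int          NOT NULL,
    article_ref   int          NOT NULL,
    article_title varchar(400) NOT NULL,
    category_ref  int          NOT NULL,
    size          bigint       NOT NULL
) PARTITION BY LIST (server_ref);

-- One partition per reporting server, created as servers come online.
CREATE TABLE articles_33 PARTITION OF articles FOR VALUES IN (33);
```

Queries that filter on server_ref (like the one in the question) are then pruned to a single partition.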

You can use PostgreSQL's transactional DDL to avoid some copying. The process will then look something like this for each input file:

  1. Create a new table to store the data.
  2. Use COPY to bulk load data into the table.
  3. Create any necessary indexes and do any processing that is required.
  4. In a single transaction, drop the old partition, rename the new table, and add it as a partition.
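The steps above can be sketched as follows for one server's data on PostgreSQL 10+ declarative partitioning (file path and names are assumptions; on older versions the swap would use NO INHERIT/INHERIT instead of DETACH/ATTACH):

```sql
-- Steps 1–3: build and index the replacement outside the critical section.
CREATE TABLE articles_33_new (LIKE articles INCLUDING ALL);
COPY articles_33_new FROM '/import/server_33.csv' WITH (FORMAT csv);
CREATE INDEX ON articles_33_new (article_title);

-- Step 4: the only part that locks the live table; DDL is transactional,
-- so the swap is all-or-nothing and fast (metadata only).
BEGIN;
ALTER TABLE articles DETACH PARTITION articles_33;
DROP TABLE articles_33;
ALTER TABLE articles ATTACH PARTITION articles_33_new FOR VALUES IN (33);
ALTER TABLE articles_33_new RENAME TO articles_33;
COMMIT;
```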

If you do it like this, you can swap out the partitions on the go if you want to. Only the last step requires locking the live table, and it's a quick DDL metadata update.

Avoid deleting and reloading data in an indexed table - that will lead to considerable table and index bloat due to the MVCC mechanism PostgreSQL uses. If you just swap out the underlying table, you get a nice compact table and indexes. If your queries have any data locality on top of the partitioning, either order your input data on that, or, if that's not possible, use PostgreSQL's CLUSTER functionality to reorder the data physically.
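A minimal CLUSTER example, assuming category_ref is the locality column and using illustrative names (CLUSTER rewrites the whole table and takes an exclusive lock, so it fits best in the off-line import phase):

```sql
-- Physically reorder one partition by category_ref using an index.
CREATE INDEX articles_33_category_idx ON articles_33 (category_ref);
CLUSTER articles_33 USING articles_33_category_idx;
```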

To speed up the text searches, use a GIN full-text index if its constraints are acceptable (it can only match at word boundaries), or a trigram index (supplied by the pg_trgm extension module) if you need to search for arbitrary substrings.
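Both options sketched (index names are mine; creating an index on a partitioned parent requires PostgreSQL 11+, otherwise create it per partition):

```sql
-- Option 1: full-text search, matches at word boundaries only.
CREATE INDEX articles_title_fts ON articles
    USING gin (to_tsvector('english', article_title));
-- Query form:
--   SELECT * FROM articles
--    WHERE to_tsvector('english', article_title) @@ to_tsquery('criteria');

-- Option 2: trigram index, accelerates LIKE '%criteria%' directly.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX articles_title_trgm ON articles
    USING gin (article_title gin_trgm_ops);
```

With the trigram index in place, the original LIKE '%criteria%' query can use the index instead of a full scan.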
