
SQL Server to PostgreSQL - Migration and design concerns

Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:

I have an Articles table:

CREATE TABLE [dbo].[Articles](
    [server_ref] [int] NOT NULL,
    [article_ref] [int] NOT NULL,
    [article_title] [varchar](400) NOT NULL,
    [category_ref] [int] NOT NULL,
    [size] [bigint] NOT NULL
)
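For reference, a direct PostgreSQL translation of this table might look like the following (a sketch; the lowercase table name and the type mappings are my assumptions — `int`, `varchar(400)`, and `bigint` all exist as-is in PostgreSQL):

```sql
-- Hypothetical PostgreSQL equivalent of the SQL Server table above.
-- PostgreSQL has no [dbo] schema brackets; unquoted names fold to lowercase.
CREATE TABLE articles (
    server_ref    int          NOT NULL,
    article_ref   int          NOT NULL,
    article_title varchar(400) NOT NULL,
    category_ref  int          NOT NULL,
    size          bigint       NOT NULL
);
```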

Data (comma-delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.

Importing:

  • Indexes are disabled on the Articles table.
  • For each dumped text file
    • Data is BULK copied to a temporary table.
    • Temporary table is updated.
    • Old data for the server is dropped from the Articles table.
    • Temporary table data is copied to Articles table.
    • Temporary table dropped.
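In PostgreSQL terms, the per-file steps above could be sketched roughly as follows (file path, server number, and staging-table name are illustrative, not from the post; `COPY` is PostgreSQL's analogue of `BULK INSERT`):

```sql
-- Sketch of one file's import for server 33, assuming CSV input.
BEGIN;
CREATE TEMP TABLE staging (LIKE articles INCLUDING DEFAULTS);
COPY staging FROM '/import/server_33.csv' WITH (FORMAT csv);
-- ... any clean-up/UPDATE work on staging goes here ...
DELETE FROM articles WHERE server_ref = 33;
INSERT INTO articles SELECT * FROM staging;
DROP TABLE staging;  -- temp tables are also dropped automatically at session end
COMMIT;
```

Note that the answer below argues against this delete-and-reload pattern on an indexed PostgreSQL table because of MVCC bloat; the sketch only mirrors the existing process.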

Once this process is complete for all servers the indexes are built and the new database is copied to a web server.

I am reasonably happy with this process but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better, e.g. SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%' has been satisfactory, but I want to improve the speed of searching. Obviously the "LIKE" is my problem here. Suggestions? SELECT * FROM Articles WHERE article_title LIKE '%criteria%' is horrendous.

Partitioning is a SQL Server Enterprise feature, but $$$, so its availability is one of the many exciting prospects of PostgreSQL. What performance hit will be incurred by the import process (drop data, insert data) and by building indexes? Will the database grow by a huge amount?

The database currently stands at 200 GB and will grow. Copying this across the network is not ideal but it works. I am putting thought into changing the hardware structure of the system. The thought process of having an import server and a web server is so that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) can present reports. Maybe reducing the system down to one server would work to skip the copying across the network stage. This one server would have two versions of the database: one with the indexes for delivering reports and the other without for importing new data. The databases would swap daily. Thoughts?

This is a fantastic system, and believe it or not there is some method to my madness in giving it a big shake-up.

UPDATE: I am not looking for help with relational databases, but hoping to bounce ideas around with data warehouse experts.


I am not a data warehousing expert, but a couple of pointers.

It seems like your data can be easily partitioned. See the PostgreSQL documentation on partitioning for how to split data into separate physical tables. This lets you manage data at your natural per-server granularity.
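As a sketch, on PostgreSQL 10+ a declarative-partitioning variant of the Articles table, partitioned by list on server_ref, could look like this (names are illustrative; older versions would use table inheritance instead):

```sql
-- Parent table holds no rows itself; each server gets its own partition.
CREATE TABLE articles (
    server_ref    int          NOT NULL,
    article_ref   int          NOT NULL,
    article_title varchar(400) NOT NULL,
    category_ref  int          NOT NULL,
    size          bigint       NOT NULL
) PARTITION BY LIST (server_ref);

-- One partition per reporting server, created as servers come online.
CREATE TABLE articles_33 PARTITION OF articles FOR VALUES IN (33);
```

Queries that filter on server_ref (like the one in the question) are then pruned to a single partition.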

You can use PostgreSQL's transactional DDL to avoid some copying. The process will then look something like this for each input file:

  1. Create a new table to store the data.
  2. Use COPY to bulk load data into the table.
  3. Create any necessary indexes and do any processing that is required.
  4. In a single transaction, drop the old partition, rename the new table, and add it as a partition.
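The steps above can be sketched as follows for one server's data on PostgreSQL 10+ declarative partitioning (file path and names are assumptions; on older versions the swap would use NO INHERIT/INHERIT instead of DETACH/ATTACH):

```sql
-- Steps 1–3: build and index the replacement outside the critical section.
CREATE TABLE articles_33_new (LIKE articles INCLUDING ALL);
COPY articles_33_new FROM '/import/server_33.csv' WITH (FORMAT csv);
CREATE INDEX ON articles_33_new (article_title);

-- Step 4: the only part that locks the live table; DDL is transactional,
-- so the swap is all-or-nothing and fast (metadata only).
BEGIN;
ALTER TABLE articles DETACH PARTITION articles_33;
DROP TABLE articles_33;
ALTER TABLE articles ATTACH PARTITION articles_33_new FOR VALUES IN (33);
ALTER TABLE articles_33_new RENAME TO articles_33;
COMMIT;
```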

If you do it like this, you can swap out the partitions on the go if you want to. Only the last step requires locking the live table, and it's a quick DDL metadata update.

Avoid deleting and reloading data in an indexed table - that will lead to considerable table and index bloat due to the MVCC mechanism PostgreSQL uses. If you just swap out the underlying table, you get a nice compact table and indexes. If your queries have any data locality on top of the partitioning, either order your input data on that, or, if that's not possible, use PostgreSQL's CLUSTER functionality to reorder the data physically.
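A minimal CLUSTER example, assuming category_ref is the locality column and using illustrative names (CLUSTER rewrites the whole table and takes an exclusive lock, so it fits best in the off-line import phase):

```sql
-- Physically reorder one partition by category_ref using an index.
CREATE INDEX articles_33_category_idx ON articles_33 (category_ref);
CLUSTER articles_33 USING articles_33_category_idx;
```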

To speed up the text searches, use a GIN full-text index if its constraints are acceptable (it can only match at word boundaries), or a trigram index (supplied by the pg_trgm extension module) if you need to search for arbitrary substrings.
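Both options sketched (index names are mine; creating an index on a partitioned parent requires PostgreSQL 11+, otherwise create it per partition):

```sql
-- Option 1: full-text search, matches at word boundaries only.
CREATE INDEX articles_title_fts ON articles
    USING gin (to_tsvector('english', article_title));
-- Query form:
--   SELECT * FROM articles
--    WHERE to_tsvector('english', article_title) @@ to_tsquery('criteria');

-- Option 2: trigram index, accelerates LIKE '%criteria%' directly.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX articles_title_trgm ON articles
    USING gin (article_title gin_trgm_ops);
```

With the trigram index in place, the original LIKE '%criteria%' query can use the index instead of a full scan.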
