开发者

Parallelism or PLINQ with Postgres and Npgsql

I have a Posrgres 9.04 database table with over 12,000,000 rows.

I need a program to read each row, do some calculations and lookups (against a 2nd table), then write a new row in a 3rd table with the results of these calculations. When done, the 3rd table will have the same number of rows as the 1st table.

Executing serially on a Core i7 720QM processor takes more than 24 hours. It only taxes one of my 8 cores (4 physical cores, but 8 visible to Windows 7 via HTT).

I want to speed this up with parallelism. I thought I could use PLINQ and Npgsql:

NpgsqlDataReader records = new NpgsqlCommand("SELECT * FROM table", conn).ExecuteReader();
var single_record = from row in records.AsParallel()
             select row;

However, I get an error for records.AsParallel(): Could not find an implementation of the query pattern for source type 'System.Linq.ParallelQuery'. 'Select' not found. Consider explicitly specifying the type of the range variable 'row'.

I've done a lot of Google searches, and I'm just coming up more confused. NpgsqlDataReader inherits from System.Data.Common.开发者_如何学编程DbDataReader, which in turn implements IEnumerable, which has the AsParallel extension, so seems like the right stuff is in place to get this working?

It's not clear to me what I could even do to explicitly specify the type of the range variable. It appears that best practice is not to specify this.

I am open to switching to a DataSet, presuming that's PLINQ compatible, but would rather avoid if possible because of the 12,000,000 rows.

Is this even something achievable with Npgsql? Do I need to use Devart's dotConnect for PostgreSQL instead?

UPDATE: Just found http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae, which led me to try this:

foreach(IDataRecord arrest in
            from row in arrests.AsParallel().Cast <IDataRecord>()
            select row)

So far no errors in the IDE, but is this a proper way of constructing this?


This is indeed the solution:

foreach(IDataRecord arrest in
        from row in arrests.AsParallel().Cast <IDataRecord>()
        select row)

This solution was inspired by what I found at http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae#1956768e-9403-4671-a196-8dfb3d7070e3. It's not clear to me why the cast and type specification is needed, but it works.

EDIT: While this doesn't cause syntax or runtime errors, it in fact does not make things run in parallel. Everything is still serialized. See PLINQ on ConcurrentQueue isn't multithreading for a superior solution.


You should consider using Greenplum. It's trivial to accomplish this in a Greenplum database. The free version isn't gimped in any way and it's postgresql at its core.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜