开发者

Is it an error of PostgreSQL SQL engine and how to avoid (workaround) it?

I'm parsing text documents and inserting them into PostgreSQL DB. My code is written in Java and开发者_如何转开发 I use JDBC for DB connectivity. I have encountered very strange error when adding data to DB - it seems that at unpredictable moment (different number of iteration of main loop) Postgres does not see rows just added to tables and fails to perform updates properly.

Maybe I am doing something wrong, so maybe there is a way to correct my code? Or is it serious error of PostgreSQL and I should post it at PostgreSQL home page (as bug report)?

Here are the details of what I'm doing and what is going wrong. I've simplified my code to isolate the error - simplified version doesn't parse any text but I simulate it with generated words. Source files are included (java and sql) at the end of my question.

In simplified example of my problem I have one-threaded code, one JDBC Connection, 3 tables and few SQL statements involved (full Java sources are less than 90 lines).

Main loop works for "documents" - 20 words with subsequent doc_id (integer).

  1. Buffer table spb_word4obj is cleared for doc_id to be just inserted.
  2. Words are inserted into buffer table (spb_word4obj),
  3. Then unique new words are inserted into table spb_word
  4. And finally - document's words are inserted into spb_obj_word - with word bodies replaced by word-ids from spb_word (references).

While iterating this loop for some time (e.g. 2,000 or 15,000 iterations - it is unpredictable) it fails with SQL error - cannot insert null word_id into spb_word. It gets more strange as repeating this very last iteration by hand gives no error. It seems like PostgreSQL have some issue with record insertion and statement execution speed - it loses some data or makes it is visible for subsequent statement after small delay.

Sequence of generated words is repeatable - every time code is run it generates the same sequence of words, but iteration number when code fails is different every time.

Here is my sql code to create tables:

create sequence spb_word_seq;

create table spb_word (
  id bigint not null primary key default nextval('spb_word_seq'),
  word varchar(410) not null unique
);

create sequence spb_obj_word_seq;

create table spb_obj_word (
  id int not null primary key default nextval('spb_obj_word_seq'),
  doc_id int not null,
  idx int not null,
  word_id bigint not null references spb_word (id),
  constraint spb_ak_obj_word unique (doc_id, word_id, idx)
);

create sequence spb_word4obj_seq;

create table spb_word4obj (
  id int not null primary key default nextval('spb_word4obj_seq'),
  doc_id int not null,
  idx int not null,
  word varchar(410) not null,
  word_id bigint null references spb_word (id),
  constraint spb_ak_word4obj unique (doc_id, word_id, idx),
  constraint spb_ak_word4obj2 unique (doc_id, word, idx)
);

And here goes Java code - it may just be executed (it has static main method).

package WildWezyrIsAstonished;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StrangePostgresBehavior {

    private static final String letters = "abcdefghijklmnopqrstuvwxyząćęłńóśźż";
    private static final int llen = letters.length();
    private Connection conn;
    private Statement st;
    private int wordNum = 0;

    public void runMe() throws Exception {
        Class.forName("org.postgresql.Driver");

        conn = DriverManager.getConnection("jdbc:postgresql://localhost:5433/spb",
                "wwspb", "*****");
        conn.setAutoCommit(true);
        st = conn.createStatement();

        st.executeUpdate("truncate table spb_word4obj, spb_word, spb_obj_word");

        for (int j = 0; j < 50000; j++) {

            try {
                if (j % 100 == 0) {
                    System.out.println("j == " + j);
                }

                StringBuilder sb = new StringBuilder();

                for (int i = 0; i < 20; i++) {
                    sb.append("insert into spb_word4obj (word, idx, doc_id) values ('"
                            + getWord() + "'," + i + "," + j + ");\n");
                }
                st.executeUpdate("delete from spb_word4obj where doc_id = " + j);

                st.executeUpdate(sb.toString());

                st.executeUpdate("update spb_word4obj set word_id = w.id "
                        + "from spb_word w "
                        + "where w.word = spb_word4obj.word and doc_id = " + j);

                st.executeUpdate("insert into spb_word (word) "
                        + "select distinct word from spb_word4obj "
                        + "where word_id is null and doc_id = " + j);

                st.executeUpdate("update spb_word4obj set word_id = w.id "
                        + "from spb_word w "
                        + "where w.word = spb_word4obj.word and "
                        + "word_id is null and doc_id = " + j);

                st.executeUpdate("insert into spb_obj_word (word_id, idx, doc_id) "
                        + "select word_id, idx, doc_id from spb_word4obj "
                        + "where doc_id = " + j);
            } catch (Exception ex) {
                System.out.println("error for j == " + j);
                throw ex;
            }
        }
    }

    private String getWord() {
        int rn = 3 * (++wordNum + llen * llen * llen);
        rn = (rn + llen) / (rn % llen + 1);
        rn = rn % (rn / 2 + 10);

        StringBuilder sb = new StringBuilder();
        while (true) {
            char c = letters.charAt(rn % llen);
            sb.append(c);
            rn /= llen;
            if (rn == 0) {
                break;
            }
        }

        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        new StrangePostgresBehavior().runMe();
    }
}

So again: is it me doing something wrong (what exactly?) or is it serious flaw in PosgreSQL SQL Engine (than - is there a way for work-around)?

I've tested above on Windows Vista box with: Java 1.6 / PostgreSQL 8.3.3 and 8.4.2 / JDBC PostgreSQL drivers postgresql-8.2-505.jdbc3 and postgresql-8.4-701.jdbc4. All combinations lead to error described above. To be sure that it is not something with my machine I've tested in in similar environment on other machine.


UPDATE: I've turned on Postgres logging - as suggested by Depesz. Here are latest sql statements that were executed:

2010-01-18 16:18:51 CETLOG:  execute <unnamed>: delete from spb_word4obj where doc_id = 1453
2010-01-18 16:18:51 CETLOG:  execute <unnamed>: insert into spb_word4obj (word, idx, doc_id) values ('ouc',0,1453)
2010-01-18 16:18:51 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('rbjb',1,1453)
2010-01-18 16:18:51 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('pvr',2,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('gal',3,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('cai',4,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('żjg',5,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('egf',6,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('śne',7,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('ęęd',8,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('lnd',9,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('cbd',10,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('dąc',11,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('łrc',12,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('zmł',13,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('zxo',14,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('oćj',15,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('zlh',16,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('lńf',17,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('cóe',18,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: 
  insert into spb_word4obj (word, idx, doc_id) values ('uge',19,1453)
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: update spb_word4obj set word_id = w.id from spb_word w where w.word = spb_word4obj.word and doc_id = 1453
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: insert into spb_word (word) select distinct word from spb_word4obj where word_id is null and doc_id = 1453
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: update spb_word4obj set word_id = w.id from spb_word w where w.word = spb_word4obj.word and word_id is null and doc_id = 1453
2010-01-18 16:18:52 CETLOG:  execute <unnamed>: insert into spb_obj_word (word_id, idx, doc_id) select word_id, idx, doc_id from spb_word4obj where doc_id = 1453
2010-01-18 16:18:52 CETERROR:  null value in column "word_id" violates not-null constraint
2010-01-18 16:18:52 CETSTATEMENT:  insert into spb_obj_word (word_id, idx, doc_id) select word_id, idx, doc_id from spb_word4obj where doc_id = 1453

Now - code to check what is wrong in table spb_word4obj:

select * 
from spb_word4obj w4o left join spb_word w on w4o.word = w.word
where w4o.word_id is null

and it shows that two words: 'gal', 'zxo' caused the problem. But... they are found in spb_word table - just freshly inserted with sql statements from log (included above).

So - it is not issue with JDBC driver, it is rather Postgres itself?


UPDATE2: If I eliminate polish national chars (ąćęłńóśźż) from generated words, there is no error - code performs all 50,000 iterations. I've tested it few times. So, for this line:

    private static final String letters = "abcdefghijklmnopqrstuvwxyz";

there is no error, everything seems to be fine, but with this line (or with original line in full source above):

    private static final String letters = "ąćęłńóśźżjklmnopqrstuvwxyz";

I get error described above.


UPDATE3: I've just posted similar question without Java usage - fully ported to pure plpgsql, look here: Why this code fails in PostgreSQL and how to fix it (work-around)? Is it Postgres SQL engine flaw?. Now I know that it is not related to Java - it is Postgres alone problem.


My further investigation of the problem revealed that the problem is related to pure Postgres SQL, I developed pure plpgsql version which is one-to-one port of the code above. Restated question for pure plpgsql is here: Why this code fails in PostgreSQL and how to fix it (work-around)? Is it Postgres SQL engine flaw?.

So - it is not Java/JDBC related problem.

Furthermore, I've managed to simplify test code - now it uses one table. Simplified problem was posted on pgsql-bugs mailing list: http://archives.postgresql.org/pgsql-bugs/2010-01/msg00182.php. It is confirmed to occur on other machines (not only mine).

Here is workaround: change database collation from polish to standard 'C'. With 'C' collation there is no error. But without polish collation polish words are sorted incorrectly (with respect to polish national characters), so problem should be fixed in Postgres itself.


Turn on query logging (log_statement = all) in postgresql.conf, and check the queries.

My bet is that it's the problem of driver (JDBC).


If you are trying to index natural language documents with Postgres (for all I can see you are trying to build an inverted index on the words of the documents), I would recommend you to take a look at Full text search in Postgres instead.


If that is not an option, then check your encoding settings:

  • database (In Postgres you cannot actually change the database encoding - you will have to recreate the database from scratch.)
  • JDBC driver/Postgres client setting (sorry can't remember the details),
  • and your Java source (editor)

I suggest to set them all to UTF-8.

If that still did not help, then I suspect some kind of escaping/encoding issue between the source of the data (your Java source code file) and the destination of the data (the database).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜