Using Python's oursql to store data in a MySQL database is crawling. Why?

I am parsing a huge (I mean really huge) XML file. It contains a few million article entries like this one:

<article key="journals/cgf/HaeglerWAGM10" mdate="2010-11-12">
  <author>Simon Haegler</author>
  <author>Peter Wonka</author>
  <author>Stefan Müller Arisona</author>
  <author>Luc J. Van Gool</author>
  <author>Pascal Müller</author>
  <title>Grammar-based Encoding of Facades.</title>
  <pages>1479-1487</pages>
  <year>2010</year>
  <volume>29</volume>
  <journal>Comput. Graph. Forum</journal>
  <number>4</number>
  <ee>http://dx.doi.org/10.1111/j.1467-8659.2010.01745.x</ee>
  <url>db/journals/cgf/cgf29.html#HaeglerWAGM10</url>
</article>

I step through the file and parse the articles with lxml. If I run the code without storing the items in my database (commenting out `populate_database()`), it handles about 1000 entries in ~3 seconds. But if I activate the storage (uncommenting `populate_database()`), it drops to about 10 entries per second. Is this normal? I remember parsing the file once upon a time and the database was not such a bottleneck. But I had a different approach... (looking through my files to find it...)

Here is the function that makes my head ache. I commented out those three `cursor.execute` calls and the code was racing again. So it seems that something is wrong with MySQL, or something is wrong with the executes (my noobish guesses). Any advice?

def add_paper(paper, cursor):
    # Build "(?, ?, ..., ?)" -- one placeholder per column for the oursql query.
    questionmarks = str(('?',)*len(paper)).replace("'", "")
    keys, values = paper.keys(), paper.values()
    # Build "(mdate, title, ..., some_key)" from the dict keys.
    keys = str(tuple(keys)).replace("'", "")
    query_paper = '''INSERT INTO dblp2.papers {0} VALUES {1};'''.\
                    format(keys, questionmarks)
    values = tuple(v.encode('utf8') for v in values)
    cursor.execute(query_paper, values)
    paper_id = cursor.lastrowid
    return paper_id

def populate_database(paper, authors, cursor):
    paper_id = add_paper(paper, cursor)
    query_author = """INSERT INTO dblp2.authors
                      (name) VALUES (?) ON DUPLICATE KEY UPDATE
                      id=LAST_INSERT_ID(id)"""
    query_link_table = """INSERT INTO dblp2.author_paper
                          (author_id, paper_id) VALUES (?, ?)"""
    for author in authors:
        cursor.execute(query_author, (author.encode('utf8'),))
        author_id = cursor.lastrowid
        cursor.execute(query_link_table, (author_id, paper_id))

I added profiling output from cProfile:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.001    0.001   15.666   15.666 <string>:1(<module>)
1    0.000    0.000    0.000    0.000 __init__.py:49(normalize_encoding)
1    0.000    0.000    0.000    0.000 __init__.py:71(search_function)
510    0.002    0.000    0.002    0.000 _elementpath.py:222(_build_path_iterator)
510    0.005    0.000    0.008    0.000 _elementpath.py:260(iterfind)
408    0.005    0.000    0.017    0.000 _elementpath.py:270(find)
102    0.003    0.000    0.011    0.000 _elementpath.py:285(findall)
10    0.000    0.000    0.000    0.000 _elementpath.py:70(xpath_tokenizer)
5    0.000    0.000    0.000    0.000 _elementpath.py:85(prepare_child)
987    0.009    0.000    0.013    0.000 _elementpath.py:87(select)
1    0.000    0.000    0.000    0.000 codecs.py:77(__new__)
1    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
1    0.000    0.000    0.000    0.000 utf_8.py:33(getregentry)
102    0.008    0.000    5.601    0.055 xml2db.py:25(add_paper)
680    0.003    0.000    0.006    0.000 xml2db.py:31(<genexpr>)
102    0.005    0.000   15.468    0.152 xml2db.py:36(populate_database)
477    0.003    0.000    0.013    0.000 xml2db.py:45(clean_parse)
101    0.002    0.000    0.005    0.000 xml2db.py:52(clear_element)
103    0.019    0.000    0.024    0.000 xml2db.py:57(extract_paper_elements)
1    0.017    0.017   15.557   15.557 xml2db.py:63(fast_iter)
1    0.004    0.004   15.665   15.665 xml2db.py:89(main)
1    0.000    0.000    0.000    0.000 {__import__}
1    0.000    0.000    0.000    0.000 {_codecs.utf_8_decode}
1    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x8245fc0}
5    0.000    0.000    0.000    0.000 {built-in method findall}
1    0.000    0.000    0.000    0.000 {hasattr}
2    0.000    0.000    0.000    0.000 {isinstance}
515    0.001    0.000    0.001    0.000 {iter}
107    0.000    0.000    0.000    0.000 {len}
477    0.010    0.000    0.010    0.000 {lxml.etree.strip_tags}
5    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
101    0.002    0.000    0.002    0.000 {method 'clear' of 'lxml.etree._Element' objects}
1    0.000    0.000    0.000    0.000 {method 'cursor' of 'oursql.Connection' objects}
1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
778    0.007    0.000    0.007    0.000 {method 'encode' of 'str' objects}
5    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}
516   15.544    0.030   15.544    0.030 {method 'execute' of 'oursql.Cursor' objects}
408    0.004    0.000    0.023    0.000 {method 'find' of 'lxml.etree._Element' objects}
102    0.001    0.000    0.012    0.000 {method 'findall' of 'lxml.etree._Element' objects}
103    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
2    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
204    0.001    0.000    0.001    0.000 {method 'get' of 'lxml.etree._Element' objects}
100    0.000    0.000    0.000    0.000 {method 'getparent' of 'lxml.etree._Element' objects}
201    0.001    0.000    0.001    0.000 {method 'getprevious' of 'lxml.etree._Element' objects}
510    0.004    0.000    0.004    0.000 {method 'iterchildren' of 'lxml.etree._Element' objects}
1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
102    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
204    0.001    0.000    0.001    0.000 {method 'replace' of 'str' objects}
1    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
1    0.000    0.000    0.000    0.000 {method 'translate' of 'str' objects}
102    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
2    0.000    0.000    0.000    0.000 {time.time}


It seems to me you are running a whole bunch of individual INSERT statements. Turning on logging on the MySQL server should show you a long series of statements like this:

....
INSERT INTO dblp2.authors (name) VALUES (a) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (b) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (c) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (d) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (e) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (f) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
....
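
If you want to watch those statements go by, MySQL's general query log will show them. A minimal sketch, assuming MySQL 5.1+ and an account with the SUPER privilege (the connection details below are placeholders):

import oursql

conn = oursql.connect(host='localhost', user='root', passwd='secret', db='dblp2')
cursor = conn.cursor()
# Both variables are dynamic in MySQL 5.1+; older servers need the
# --general_log / --general_log_file options at server startup instead.
cursor.execute("SET GLOBAL general_log_file = '/tmp/mysql-general.log'")
cursor.execute("SET GLOBAL general_log = 'ON'")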

You want to run either parameterized/multi-row insert statements, or package everything into a CSV file and do a bulk insert.

Bulk insert is faster, and it supports replacing duplicate rows (check the docs). If you want a true update, use multiple inserts or try this temp table idea.
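
A rough sketch of the CSV route, assuming Python 2 (as in the question) and a unique key on dblp2.authors(name); the file path is a placeholder, and the LOAD DATA statement is shown as a comment because MySQL does not allow it to be prepared (and oursql sends queries as prepared statements), so it is probably simplest to run from the mysql console:

import csv

def dump_authors_csv(authors, csv_path='/tmp/authors.csv'):
    # Write every author name into one CSV file for a later bulk load.
    # (Python 2 csv wants a binary file handle, hence 'wb'.)
    with open(csv_path, 'wb') as f:
        writer = csv.writer(f, lineterminator='\n')
        for author in authors:
            writer.writerow([author.encode('utf8')])
    return csv_path

# The bulk load itself is one statement.  REPLACE overwrites rows that clash
# on a unique key instead of failing.  Use LOAD DATA LOCAL INFILE instead if
# the file lives on the client machine and the server allows it:
#
#   LOAD DATA INFILE '/tmp/authors.csv'
#       REPLACE INTO TABLE dblp2.authors
#       FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
#       LINES TERMINATED BY '\n'
#       (name);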

Batched insert statements would be sort of like this (oursql's executemany can collapse them into a single multi-row insert):

cursor.executemany(query_author, [(author.encode('utf8'),) for author in authors])

which should give you entries in the log like this:

INSERT INTO dblp2.authors (name) VALUES (a), (b), (c), (d), (e), (f) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)

You may have some issues with the ids for the second table, because a batched insert does not hand back one lastrowid per author.
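
One way around that, sketched here under the assumption that dblp2.authors has a unique key on name: insert the whole batch of authors first, then select their ids back in a single query and build the link rows from the result.

def link_authors_to_paper(authors, paper_id, cursor):
    names = [author.encode('utf8') for author in authors]
    # Insert (or touch) every author in one executemany() call.
    cursor.executemany(
        "INSERT INTO dblp2.authors (name) VALUES (?) "
        "ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)",
        [(name,) for name in names])
    # A batched insert does not report one lastrowid per author, so fetch
    # the ids explicitly.
    placeholders = ', '.join(['?'] * len(names))
    cursor.execute(
        "SELECT id FROM dblp2.authors WHERE name IN ({0})".format(placeholders),
        names)
    author_ids = [row[0] for row in cursor.fetchall()]
    # Write all link rows in one round trip as well.
    cursor.executemany(
        "INSERT INTO dblp2.author_paper (author_id, paper_id) VALUES (?, ?)",
        [(author_id, paper_id) for author_id in author_ids])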


Every database provides an EXPLAIN command for inspecting execution plans.

In addition, Python has a profiler for finding out which code is slow.

So first: do your analysis, and if you don't get any further, come back.
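
For reference, the profiler in question is cProfile, which the output in the question already comes from; it can be run without touching the code (the module and function names here are taken from that profile output):

# Profile the whole script from the command line, sorted by cumulative time:
#   python -m cProfile -s cumulative xml2db.py
#
# Or profile a single call from the interpreter:
import cProfile
import xml2db

cProfile.run('xml2db.main()', sort='cumulative')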
