Problem reading files greater than 1GB with XMLReader

2023-01-10 07:33 问答作者：

Is there a maximum file size the XMLReader can handle?

I'm trying to process an XML feed about 3GB large. There are certainly no PHP errors as the script runs fine and successfully loads to the database after it's been run.

The script also runs fine with smaller test feeds - 1GB and below. However, when processing larger feeds the script stops reading the XML File after about 1GB and continues running the rest of the script.

Has anybody experie开发者_如何学编程nced a similar problem? and if so how did you work around it?

Thanks in advance.

I had same kind of problem recently and I thought to share my experience.

It seems that problem is in the way PHP was compiled, whether it was compiled with support for 64bit file sizes/offsets or only with 32bit.

With 32bits you can only address 4GB of data. You can find a bit confusing but good explanation here: http://blog.mayflower.de/archives/131-Handling-large-files-without-PHP.html

I had to split my files with Perl utility xml_split which you can find here: http://search.cpan.org/~mirod/XML-Twig/tools/xml_split/xml_split

I used it to split my huge XML file into manageable chunks. The good thing about the tool is that it splits XML files over whole elements. Unfortunately its not very fast.

I needed to do this one time only and it suited my needs, but I wouldn't recommend it repetitive use. After splitting I used XMLReader on smaller files of about 1GB in size.

Splitting up the file will definitely help. Other things to try...

adjust the memory_limit variable in php.ini. http://php.net/manual/en/ini.core.php
rewrite your parser using SAX -- http://php.net/manual/en/book.xml.php . This is a stream-oriented parser that doesn't need to parse the whole tree. Much more memory-efficient but slightly harder to program.

Depending on your OS, there might also be a 2gb limit on the RAM chunk that you can allocate. Very possible if you're running on a 32-bit OS.

It should be noted that PHP in general has a max file size. PHP does not allow for unsigned integers, or long integers, meaning you're capped at 2^31 (or 2^63 for 64 bit systems) for integers. This is important because PHP uses an integer for the file pointer (your position in the file as you read through), meaning it cannot process a file larger than 2^31 bytes in size.

However, this should be more than 1 gigabyte. I ran into issues with two gigabytes (as expected, since 2^31 is roughly 2 billion).

I've run into a similar issue when parsing large documents. What I wound up doing is breaking the feed into smaller chunks using filesystem functions, then parsing those smaller chunks... So if you have a bunch of <record> tags that you are parsing, parse them out with string functions as a stream, and when you get a full record in the buffer, parse that using the xml functions... It sucks, but it works quite well (and is very memory efficient, since you only have at most 1 record in memory at any one time)...

Do you get any errors with

libxml_use_internal_errors(true);
libxml_clear_errors();

// your parser stuff here....    
$r = new XMLReader(...);
// ....


foreach( libxml_get_errors() as $err ) {
   printf(". %d %s\n", $err->code, $err->message);
}

when the parser stops prematurely?

Using WindowsXP, NTFS as filesystem and php 5.3.2 there was no problem with this test script

<?php
define('SOURCEPATH', 'd:/test.xml');

if ( 0 ) {
  build();
}
else {
  echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
  timing('read');
}

function timing($fn) {
  $start = new DateTime();
  echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
  $fn();
  $end = new DateTime();
  echo 'end: ', $start->format('Y-m-d H:i:s'), "\n";
  echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
}

function read() {
  $cnt = 0;
  $r = new XMLReader;
  $r->open(SOURCEPATH);
  while( $r->read() ) {
    if ( XMLReader::ELEMENT === $r->nodeType ) {
      if ( 0===++$cnt%500000 ) {
        echo '.';
      }
    }
  }
  echo "\n#elements: ", $cnt, "\n";
}

function build() {
  $fp = fopen(SOURCEPATH, 'wb');

  $s = '<catalogue>';
  //for($i = 0; $i < 500000; $i++) {
  for($i = 0; $i < 60000000; $i++) {
    $s .= sprintf('<item>%010d</item>', $i);
    if ( 0===$i%100000 ) {
      fwrite($fp, $s);
      $s = '';
      echo $i/100000, ' ';
    }
  }

  $s .= '</catalogue>';
  fwrite($fp, $s);
  flush($fp);
  fclose($fp);
}

output:

filesize: 1,380,000,023
start: 2010-08-07 09:43:31
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:43:31
diff: 07:31

(as you can see I screwed up the output of the end-time but I don't want to run this script another 7+ minutes ;-))

Does this also work on your system?

As a side-note: The corresponding C# test application took only 41 seconds instead of 7,5 minutes. And my slow harddrive might have been the/one limiting factor in this case.

filesize: 1.380.000.023
start: 2010-08-07 09:55:24
........................................................................................................................

#elements: 60000001

end: 2010-08-07 09:56:05
diff: 00:41

and the source:

using System;
using System.IO;
using System.Xml;

namespace ConsoleApplication1
{
  class SOTest
  {
    delegate void Foo();
    const string sourcepath = @"d:\test.xml";
    static void timing(Foo bar)
    {
      DateTime dtStart = DateTime.Now;
      System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
      bar();
      DateTime dtEnd = DateTime.Now;
      System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
      TimeSpan s = dtEnd.Subtract(dtStart);
      System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
    }

    static void readTest()
    {
      XmlTextReader reader = new XmlTextReader(sourcepath);
      int cnt = 0;
      while (reader.Read())
      {
        if (XmlNodeType.Element == reader.NodeType)
        {
          if (0 == ++cnt % 500000)
          {
            System.Console.Write('.');
          }
        }
      }
      System.Console.WriteLine("\n#elements: " + cnt + "\n");
    }

    static void Main()
    {
      FileInfo f = new FileInfo(sourcepath);
      System.Console.WriteLine("filesize: {0:N0}", f.Length);
      timing(readTest);
      return;
    }
  }
}

继续阅读：file max php size xmlreader

Problem reading files greater than 1GB with XMLReader

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？