开发者

how to read the multiple xml contents in a file using php

I'm dealing with this kind of XML sequence file can you any one suggest me to parse this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<name>ccccc</name>
开发者_Go百科<document-id>
<country>US</country>
<doc-number>D0629997</doc-number>
<kind>S1</kind>
<date>20110104</date>
</document-id>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<name>dddd</name>
<document-id>
<country>US</country>
<doc-number>D0629998</doc-number>
<kind>S2</kind>
<date>20110104</date>
</document-id>


That's not a valid XML file. It looks like two files in one, but even then it is invalid. Assuming those are two separate files, you could try "tidying" them first. Assuming $xml is a string containing the xml contents:

$xml = tidy_repair_string($xml, array(
    'output-xml' => true,
    'input-xml' => true
)); 

Then you could use SimpleXml on it:

$xml = new SimpleXmlElement($xml);


I know where this XML file has come from and I find it quite strange that Google would provide some invalid XML (unless they are simply just hosting this file that they got from somewhere else). This suggestion for parsing it worked for me: How to parse an xml file with multiple xml declaration using PHP? (A concatenation of several XML files)


That file contains a sequence of XML documents concatenated to each other. You need to register a PHP streamwrapper that transparently divides the file for you, then you can process each document individually and even in a streaming fashion. Example:

stream_wrapper_register('xmlseq', 'XMLSequenceStream');

$path = "xmlseq://zip://ipg140107.zip#ipg140107.xml";

while (XMLSequenceStream::notAtEndOfSequence($path)) {
    $reader = new XMLReader();
    $reader->open($path);
    // just consume the whole document
    while ($reader::next()) {
        XMLReaderNode::dump($reader);
    }
}

XMLSequenceStream::clean();    

That stream-wrapper is part of the XMLReaderIterator library and works as well with SimpleXMLElement or DOMDocument albeit for larger files XMLReader is a better fit.

For the file I've taken in my example (http://storage.googleapis.com/patents/grant_full_text/2014/ipg140107.zip from https://www.google.com/googlebooks/uspto-patents-grants-text.html), the overall element-structure counting elements of the different trees in that sequence for example is:

\-us-patent-grant (473)
  |-us-bibliographic-data-grant (473)
  | |-publication-reference (473)
  | | \-document-id (473)
  | |   |-country (473)
  | |   |-doc-number (473)
  | |   |-kind (473)
  | |   \-date (473)
  | |-application-reference (473)
  | | \-document-id (473)
  | |   |-country (473)
  | |   |-doc-number (473)
  | |   \-date (473)
  | |-us-application-series-code (473)
  | |-us-term-of-grant (470)
  | | |-length-of-grant (450)
  | | |-disclaimer (18)
  | | | \-text (18)
  | | \-us-term-extension (20)
  | |-classification-locarno (450)
  | | |-edition (450)
  | | \-main-classification (450)
  | |-classification-national (473)
  | | |-country (473)
  | | |-main-classification (473)
  | | \-further-classification (143)
  | |-invention-title (473)
  | | \-i (12)
  | |-us-references-cited (458)
  | | \-us-citation (11000)
  | |   |-patcit (10265)
  | |   | \-document-id (10265)
  | |   |   |-country (10265)
  | |   |   |-doc-number (10265)
  | |   |   |-kind (9884)
  | |   |   |-name (9811)
  | |   |   \-date (10264)
  | |   |-category (10999)
  | |   |-classification-national (6309)
  | |   | |-country (6309)
  | |   | \-main-classification (6309)
  | |   |-nplcit (735)
  | |   | \-othercit (735)
  | |   |   |-sub (281)
  | |   |   |-i (7)
  | |   |   \-sup (1)
  | |   \-classification-cpc-text (1)
  | |-number-of-claims (472)
  | |-us-exemplary-claim (472)
  | |-us-field-of-classification-search (472)
  | | \-classification-national (8991)
  | |   |-country (8991)
  | |   |-main-classification (8991)
  | |   \-additional-info (1205)
  | |-figures (472)
  | | |-number-of-drawing-sheets (472)
  | | \-number-of-figures (472)
  | |-us-parties (472)
  | | |-us-applicants (472)
  | | | \-us-applicant (765)
  | | |   |-addressbook (765)
  | | |   | |-last-name (573)
  | | |   | |-first-name (573)
  | | |   | |-address (765)
  | | |   | | |-city (765)
  | | |   | | |-country (765)
  | | |   | | \-state (423)
  | | |   | \-orgname (192)
  | | |   \-residence (765)
  | | |     \-country (765)
  | | |-inventors (472)
  | | | \-inventor (969)
  | | |   \-addressbook (969)
  | | |     |-last-name (969)
  | | |     |-first-name (969)
  | | |     \-address (969)
  | | |       |-city (969)
  | | |       |-country (969)
  | | |       \-state (519)
  | | \-agents (429)
  | |   \-agent (500)
  | |     \-addressbook (500)
  | |       |-orgname (361)
  | |       |-address (500)
  | |       | \-country (500)
  | |       |-last-name (139)
  | |       \-first-name (139)
  | |-assignees (385)
  | | \-assignee (391)
  | |   |-addressbook (390)
  | |   | |-orgname (386)
  | |   | |-role (390)
  | |   | |-address (390)
  | |   | | |-city (355)
  | |   | | |-country (390)
  | |   | | \-state (192)
  | |   | |-last-name (4)
  | |   | \-first-name (4)
  | |   |-orgname (1)
  | |   \-role (1)
  | |-examiners (472)
  | | |-primary-examiner (472)
  | | | |-last-name (472)
  | | | |-first-name (472)
  | | | \-department (472)
  | | \-assistant-examiner (65)
  | |   |-last-name (65)
  | |   \-first-name (65)
  | |-us-related-documents (65)
  | | |-continuation-in-part (16)
  | | | \-relation (16)
  | | |   |-parent-doc (16)
  | | |   | |-document-id (16)
  | | |   | | |-country (16)
  | | |   | | |-doc-number (16)
  | | |   | | \-date (16)
  | | |   | |-parent-status (11)
  | | |   | \-parent-grant-document (5)
  | | |   |   \-document-id (5)
  | | |   |     |-country (5)
  | | |   |     |-doc-number (5)
  | | |   |     \-date (2)
  | | |   \-child-doc (16)
  | | |     \-document-id (16)
  | | |       |-country (16)
  | | |       \-doc-number (16)
  | | |-continuation (21)
  | | | \-relation (21)
  | | |   |-parent-doc (21)
  | | |   | |-document-id (21)
  | | |   | | |-country (21)
  | | |   | | |-doc-number (21)
  | | |   | | \-date (21)
  | | |   | |-parent-status (16)
  | | |   | \-parent-grant-document (5)
  | | |   |   \-document-id (5)
  | | |   |     |-country (5)
  | | |   |     |-doc-number (5)
  | | |   |     \-date (2)
  | | |   \-child-doc (21)
  | | |     \-document-id (21)
  | | |       |-country (21)
  | | |       \-doc-number (21)
  | | |-division (32)
  | | | \-relation (32)
  | | |   |-parent-doc (32)
  | | |   | |-document-id (32)
  | | |   | | |-country (32)
  | | |   | | |-doc-number (32)
  | | |   | | \-date (32)
  | | |   | |-parent-grant-document (24)
  | | |   | | \-document-id (24)
  | | |   | |   |-country (24)
  | | |   | |   |-doc-number (24)
  | | |   | |   \-date (1)
  | | |   | \-parent-status (8)
  | | |   \-child-doc (32)
  | | |     \-document-id (32)
  | | |       |-country (32)
  | | |       \-doc-number (32)
  | | \-related-publication (9)
  | |   \-document-id (9)
  | |     |-country (9)
  | |     |-doc-number (9)
  | |     |-kind (9)
  | |     \-date (9)
  | |-priority-claims (140)
  | | \-priority-claim (182)
  | |   |-country (182)
  | |   |-doc-number (182)
  | |   \-date (182)
  | |-us-sir-flag (1)
  | |-classifications-ipcr (23)
  | | \-classification-ipcr (24)
  | |   |-ipc-version-indicator (24)
  | |   | \-date (24)
  | |   |-classification-level (24)
  | |   |-section (24)
  | |   |-class (24)
  | |   |-subclass (24)
  | |   |-main-group (24)
  | |   |-subgroup (24)
  | |   |-symbol-position (24)
  | |   |-classification-value (24)
  | |   |-action-date (24)
  | |   | \-date (24)
  | |   |-generating-office (24)
  | |   | \-country (24)
  | |   |-classification-status (24)
  | |   \-classification-data-source (24)
  | |-us-botanic (21)
  | | |-latin-name (21)
  | | \-variety (21)
  | \-classifications-cpc (1)
  |   \-main-cpc (1)
  |     \-classification-cpc (1)
  |       |-cpc-version-indicator (1)
  |       | \-date (1)
  |       |-section (1)
  |       |-class (1)
  |       |-subclass (1)
  |       |-main-group (1)
  |       |-subgroup (1)
  |       |-symbol-position (1)
  |       |-classification-value (1)
  |       |-action-date (1)
  |       | \-date (1)
  |       |-generating-office (1)
  |       | \-country (1)
  |       |-classification-status (1)
  |       |-classification-data-source (1)
  |       \-scheme-origination-code (1)
  |-drawings (472)
  | \-figure (3033)
  |   \-img (3033)
  |-description (472)
  | |-description-of-drawings (472)
  | | |-p (3955)
  | | | |-figref (4478)
  | | | |-b (86)
  | | | \-i (6)
  | | \-heading (22)
  | |-heading (162)
  | \-p (340)
  |   |-figref (15)
  |   |-b (250)
  |   |-i (146)
  |   |-ul (96)
  |   | \-li (444)
  |   |   |-ul (215)
  |   |   | \-li (273)
  |   |   |   |-ul (199)
  |   |   |   | \-li (1192)
  |   |   |   |   |-i (1219)
  |   |   |   |   |-b (1)
  |   |   |   |   |-sup (10)
  |   |   |   |   \-sub (2)
  |   |   |   \-i (11)
  |   |   |-sup (2)
  |   |   \-i (26)
  |   |-tables (15)
  |   | \-table (15)
  |   |   \-tgroup (49)
  |   |     |-colspec (175)
  |   |     |-thead (15)
  |   |     | \-row (27)
  |   |     |   \-entry (51)
  |   |     \-tbody (49)
  |   |       \-row (291)
  |   |         \-entry (997)
  |   |           \-sup (28)
  |   \-sup (2)
  |-us-claim-statement (472)
  |-claims (472)
  | \-claim (476)
  |   \-claim-text (476)
  |     |-figref (1)
  |     |-claim-text (5)
  |     |-claim-ref (4)
  |     \-i (15)
  \-abstract (22)
    \-p (22)
      |-i (27)
      \-ul (2)
        \-li (2)
          \-ul (2)
            \-li (11)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜