Extracting dates from HTML meta data in FAST-ESP
During document processing I want to extract all dates from the HTML meta data and then identify the latest one, which will be used to populate a date field (dtgeneric1).
<meta name="OriginalPublicationDate" content="2010/04/21 12:06:36" />
<meta name="LastModificationDate" content="2010/04/22 14:10:16" />
+ other non-date meta data
Inspection using spy stages shows that our pipeline already adds meta_* attributes, but the meta data names will differ between documents from different sources.
#### ATTRIBUTE meta_originalpublicationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/21 12:06:36
#### ATTRIBUTE meta_lastmodificationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/22 14:10:16
+ other non-date meta attributes
Ideally we would like to pass all the meta_* attributes to a Python stage and use that to work out which are dates and which is the largest, but there seems to be no way of specifying "all meta attributes" as input.
Has anyone done something similar, and can you offer any advice on the best way to do this?
Thanks
Neil
I suppose a custom stage would do the job: one that takes all the needed date attributes as input, compares them to find the newest date, and outputs that most recent value.
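The comparison part of such a stage could look roughly like the sketch below. It is plain Python and deliberately leaves out the docproc wiring: it assumes you can somehow get the attributes into a dict of name to value, that the values use the "YYYY/MM/DD HH:MM:SS" format shown in the question, and that the names start with "meta_". The names parse_date, latest_meta_date and DATE_FORMAT are made up for illustration, and the code is written for a current Python, so it may need adjusting for the interpreter bundled with ESP.

```python
from datetime import datetime

# Assumption: every date uses the format seen in the sample meta tags.
DATE_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_date(value):
    """Return a datetime if value matches DATE_FORMAT, otherwise None."""
    try:
        return datetime.strptime(str(value).strip(), DATE_FORMAT)
    except ValueError:
        return None


def latest_meta_date(attributes, prefix="meta_"):
    """Return the newest date among the prefixed attributes, or None."""
    dates = []
    for name, value in attributes.items():
        if not name.startswith(prefix):
            continue  # not a meta attribute
        parsed = parse_date(value)
        if parsed is not None:  # silently skip non-date meta values
            dates.append(parsed)
    return max(dates) if dates else None


if __name__ == "__main__":
    # Hypothetical input mirroring the spy output in the question.
    attrs = {
        "meta_originalpublicationdate": "2010/04/21 12:06:36",
        "meta_lastmodificationdate": "2010/04/22 14:10:16",
        "meta_author": "not a date",
    }
    print(latest_meta_date(attrs))  # prints 2010-04-22 14:10:16
```

Because non-matching values are skipped rather than rejected, the function does not need to know the attribute names in advance, which is the part that varies between your sources. If other date formats turn up, parse_date is the only place that has to change.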