How to extract strings wrapped in HTML from a column in an SQL table into a new table?
I have a column on an SQL Server 2005 table called BIO - the data in the BIO column is formatted like this:
<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006– </A>; Postdoctoral Fellow, Toronto Western Hosp. 2000–06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. & Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>
I need to extract values from each of the anchor tags ie:
<A name=SN>AARTS</A>
I would need to have AARTS in a column called SN in the result set
This is what I have so far...
SELECT CONTACT_ID
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=SN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=SN>', [BIO])) - CHARINDEX('<A name=SN>', [BIO])-11))) AS 'SN'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=GN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=GN>', [BIO])) - CHARINDEX('<A name=GN>', [BIO])-11))) AS 'GN'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=HO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=HO>', [BIO])) - CHARINDEX('<A name=HO>', [BIO])-11))) AS 'HO'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=OC>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=OC>', [BIO])) - CHARINDEX('<A name=OC>', [BIO])-11))) AS 'OC'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PO>', [BIO])) - CHARINDEX('<A name=PO>', [BIO])-11))) AS 'PO'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BD>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BD>', [BIO])) - CHARINDEX('<A name=BD>', [BIO])-11))) AS 'BD'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PA>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PA>', [BIO])) - CHARINDEX('<A name=PA>', [BIO])-11))) AS 'PA'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BY>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BY>', [BIO])) - CHARINDEX('<A name=BY>', [BIO])-11))) AS 'BY'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=ED>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=ED>', [BIO])) - CHARINDEX('<A name=ED>', [BIO])-11))) AS 'ED'
FROM [cww].[dbo].[Contacts]
ORDER BY CONTACT_ID
The results I get from that look like this:
CONTACT_ID SN GN HO OC PO DB PA BY ED
3 AARON Raymond Leonard B.Sc. business coach, professional speaker, real estate entrepreneur D>AARON
5 AATAMI Pita C.Q. business executive; Kuujjuaq
7 ABBOTT Anthony C. P.C. lawyer Montreal
8 ABBOTT Elizabeth M.A. historian Ottawa
9 ABBOTT (Caroline) Louise D>ABBOTT writer, photographer, filmmaker Montreal
I can keep going and manually add all of the substrings for each differently named anchor but the problem with this is that I do not know all of 'names' that are used in the anchors and there are 22000+ records in this table that I would have to look through to make sure I catch them all. As well, not all BIOs have all the anchors so if you look at the result for 'ABBOTT (Caroline) Louise' she doesn't have an 'HO' anchor so it returns incorrect data 'D>ABBOTT' and I haven't seen this yet with the limited results I'm bringing up but some records have multiple anchors such as 2 'HO's which I imagine will cause problems..
One last problem is that not all anchor names are 2 letters so the 11 I'm using in the charindex would be wrong for those ones..
Is there a better way to do this? Any help would be appreciated.
UPDATE - I've added CASE statements to remove incorrect data when the anchor name doesn't exist for the current record.
SELECT CONTACT_ID
,'SN' =
CASE
WHEN CHARINDEX('<A name=SN>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=SN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=SN>', [BIO])) - CHARINDEX('<A name=SN>', [BIO])-11)))
END
,'GN' =
CASE
WHEN CHARINDEX('<A name=GN>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=GN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=GN>', [BIO])) - CHARINDEX('<A name=GN>', [BIO])-11)))
END
,'HO' =
CASE
WHEN CHARINDEX('<A name=HO>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=HO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=HO>', [BIO])) - CHARINDEX('<A name=HO>', [BIO])-11)))
END
,'OC' =
CASE
WHEN CHARINDEX('<A name=OC>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=OC>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=OC>', [BIO开发者_运维百科])) - CHARINDEX('<A name=OC>', [BIO])-11)))
END
,'PO' =
CASE
WHEN CHARINDEX('<A name=PO>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PO>', [BIO])) - CHARINDEX('<A name=PO>', [BIO])-11)))
END
,'BD' =
CASE
WHEN CHARINDEX('<A name=BD>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BD>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BD>', [BIO])) - CHARINDEX('<A name=BD>', [BIO])-11)))
END
,'PA' =
CASE
WHEN CHARINDEX('<A name=PA>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PA>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PA>', [BIO])) - CHARINDEX('<A name=PA>', [BIO])-11)))
END
,'BY' =
CASE
WHEN CHARINDEX('<A name=BY>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BY>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BY>', [BIO])) - CHARINDEX('<A name=BY>', [BIO])-11)))
END
,'ED' =
CASE
WHEN CHARINDEX('<A name=ED>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=ED>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=ED>', [BIO])) - CHARINDEX('<A name=ED>', [BIO])-11)))
END
--INTO [cww].[dbo].[BioDetails]
FROM [cww].[dbo].[Contacts]
ORDER BY CONTACT_ID
I don't know how you could do this purely in T-SQL.
If you can retrieve the CONTACT_ID and BIO columns into an application, you could iterate over the result set, parse the BIO data as XML, then use XPath to get the name attribute value and the anchor body, building a map of the data to be inserted into your new table. Since you don't know all the different names that could exist, you'll probably need to recreate the table each time it's run, so store names found in a Set and after iterating over all the rows use the Set to generate your create table
statement.
The DB code is pure fantasy, but here's a snippet showing how you could do it using the XOM XML library for Java. I'm not positive this would work since your attribute values aren't quoted, but you might be able to find a parser that isn't too picky, and I'm sure you could do something similar in .NET.
ResultSet results = db.query("select CONTACT_ID, BIO from [cww].[dbo].[Contacts]");
Set<String> newTableColumns = new Set<String>();
newTableColumns.put("CONTACT_ID");
List<Map<String,String> > dataToInsert = new ArrayList<Map<String,String> >();
Builder parser = new Builder();
for (ResultRow resultRow : results) { // iterate over the result set
Map<String,String> rowDataToInsert = new HashMap<String,String>();
rowData.put("CONTACT_ID", resultRow.get("CONTACT_ID"));
// parse the BIO data as an XML document
Document doc = parser.build(resultRow.get("BIO"), "");
// query the document using XPath
Nodes namedAnchors = doc.query("//a[@name]");
for (int nItr = 0; nItr < namedAnchors.size(); nItr++) {
Element anchor = (Element) namedAnchors.get(nItr);
String name = anchor.getAttributeValue("name");
String anchorBody = anchor.getValue();
newTableColumns.put(name);
rowDataToInsert.put(name, anchorBody);
}
// we've stored all the anchor data from this row, so put it away
dataToInsert.add(rowDataToInsert);
}
// create your table
db.createTable("NEW_TABLE_NAME", newTableColumns);
// insert into your new table
db.batchInsert("NEW_TABLE_NAME", dataToInsert);
I'd create a table with contact_id, anchor type and value, parse the data and add records to it, then spit it out with a crosstab. That would allow you to easily find what all the anchor types are, allow for multiple/no anchor types, etc.
You may also need either multiple passes on the data, or consider using tools that are not primarily set based (like SQL is), eg c#.
精彩评论