Lua Pattern Matching to "fix" HTML code
I have a lot of badly formatted HTML which I am trying to fix using Lua. For example:
<p class='heading'>my useful information</p>
<p class='body'>lots more text</p>
which I want to replace with
<h2>my useful information</h2>
<p class='body'>lots more text</p>
What I am trying to use is the following Lua function, which is passed the whole HTML page. However, I have two problems. First, I want gsub to pass the replace function the whole match, including the top and tail; I will then replace the top and tail and return the string. Second, my inner replace function can't see the top and tail fields.
Sorry if this is an obvious one, but I am still learning Lua.
function topandtailreplace(str,top,tail,newtop,newtail)
    local strsearch = top..'(.*)'..tail
    function replace(str)
        str = string.gsub(str,top,newtop)
        str = string.gsub(str,tail,newtail)
        return str
    end
    local newstr = str:gsub(strsearch,replace())
    return newstr
end
This seems to work:
s=[[
<p class='heading'>my useful information</p>
<p class='body'>lots more text</p>
]]
s=s:gsub("<p class='heading'>(.-)</p>","<h2>%1</h2>")
print(s)
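The same idea can probably be folded back into the general function from the question. The version below is only a sketch and makes a few assumptions: that top and tail contain no Lua pattern magic characters (such as `-`, `.` or `%`), and that the text between them should be kept unchanged. It differs from the original in three ways. It uses the non-greedy `(.-)` instead of `(.*)`, so each match stops at the first tail. It passes `replace` to gsub without parentheses, because `replace()` calls the function immediately instead of handing it over. And it relies on the inner function being a closure, which means it can already see the outer function's parameters. Note that when a pattern contains a capture, gsub passes the capture to the function rather than the whole match; with no captures it passes the whole match.

```lua
-- Sketch: swap a top/tail delimiter pair around the shortest possible body.
-- Assumes top and tail contain no pattern magic characters.
local function topandtailreplace(str, top, tail, newtop, newtail)
    -- "(.-)" is non-greedy: each match ends at the first tail after a top.
    local strsearch = top .. "(.-)" .. tail

    -- Closure: newtop and newtail from the enclosing function are visible here.
    -- gsub calls this once per match with the captured inner text.
    local function replace(inner)
        return newtop .. inner .. newtail
    end

    -- Pass the function itself (no parentheses).
    -- gsub returns two values; only the new string is kept.
    local newstr = str:gsub(strsearch, replace)
    return newstr
end

local s = [[
<p class='heading'>my useful information</p>
<p class='body'>lots more text</p>
]]

local fixed = topandtailreplace(s, "<p class='heading'>", "</p>", "<h2>", "</h2>")
print(fixed)
```

If top or tail might contain magic characters, they would need to be escaped with `%` before being concatenated into the pattern.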
You could use an HTML parsing library with a DOM tree, for example lua-gumbo:
luarocks install gumbo
The following example would do what you want:
local gumbo = require "gumbo"
local input = [[
<p class='heading'>my useful information</p>
<p class='body'>lots more text</p>
]]
local document = assert(gumbo.parse(input))
local headings = assert(document:getElementsByClassName("heading"))
local heading1 = assert(headings[1])
local textnode = assert(heading1.childNodes[1])
local new_h2 = assert(document:createElement("h2"))
heading1.parentNode:insertBefore(new_h2, heading1)
new_h2:appendChild(textnode)
heading1:remove()
io.write(document:serialize(), "\n")