Convert HTML to DOCX in C# [closed]
I want to convert an HTML page to DOCX in C#. How can I do it?
My solution uses Html2OpenXml together with DocumentFormat.OpenXml (Html2OpenXml is available as a NuGet package) to provide an elegant solution for ASP.NET MVC.
WordHelper.cs
using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using HtmlToOpenXml; // older releases of the library used NotesFor.HtmlToOpenXml

public static class WordHelper
{
    // Converts an HTML string into the bytes of a .docx document, entirely in memory.
    public static byte[] HtmlToWord(string html)
    {
        using (MemoryStream generatedDocument = new MemoryStream())
        {
            using (WordprocessingDocument package = WordprocessingDocument.Create(
                generatedDocument, WordprocessingDocumentType.Document))
            {
                MainDocumentPart mainPart = package.MainDocumentPart;
                if (mainPart == null)
                {
                    mainPart = package.AddMainDocumentPart();
                    new Document(new Body()).Save(mainPart);
                }

                // Parse the HTML and append the resulting elements to the body.
                HtmlConverter converter = new HtmlConverter(mainPart);
                Body body = mainPart.Document.Body;
                var paragraphs = converter.Parse(html);
                for (int i = 0; i < paragraphs.Count; i++)
                {
                    body.Append(paragraphs[i]);
                }
                mainPart.Document.Save();
            }

            // Read the stream only after the package is disposed, so every part is written.
            return generatedDocument.ToArray();
        }
    }
}
Controller
[HttpPost]
[ValidateInput(false)]
public FileResult Demo(CkEditorViewModel viewModel)
{
    return File(WordHelper.HtmlToWord(viewModel.CkEditorContent),
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
}
I'm using CKEditor to generate HTML for this sample.
The code below does the same thing as Luis's code, but is a bit more readable and adapted to an ASP.NET MVC application:
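// Requires: using Microsoft.Office.Interop.Word; (for WdSaveFormat)
var word = new Microsoft.Office.Interop.Word.Application();
word.Visible = false;
var filePath = Server.MapPath("~/MyFiles/Html2PdfTest.html");
var savePathPdf = Server.MapPath("~/MyFiles/Html2PdfTest.pdf");
var wordDoc = word.Documents.Open(FileName: filePath, ReadOnly: false);
wordDoc.SaveAs2(FileName: savePathPdf, FileFormat: WdSaveFormat.wdFormatPDF);
// Close the document and quit Word so no WINWORD.EXE process is left running.
wordDoc.Close(SaveChanges: false);
word.Quit();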
You can also save in other formats, such as DOCX, by replacing the Open and SaveAs2 calls with:
var savePathDocx = Server.MapPath("~/MyFiles/Html2PdfTest.docx");
var wordDoc = word.Documents.Open(FileName: filePath, ReadOnly: false);
wordDoc.SaveAs2(FileName: savePathDocx, FileFormat: WdSaveFormat.wdFormatXMLDocument);
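Use the following code to convert. It saves as PDF; for DOCX, use WdSaveFormat.wdFormatXMLDocument and a .docx file name instead: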
// Requires: using Microsoft.Office.Interop.Word; (for WdSaveFormat)
Microsoft.Office.Interop.Word.Application word =
    new Microsoft.Office.Interop.Word.Application();
word.Visible = false;
Object oMissing = System.Reflection.Missing.Value;
Object filepath = "c:\\page.html";
Object confirmconversion = System.Reflection.Missing.Value;
Object readOnly = false;
Object saveto = "c:\\doc.pdf";
Object oallowsubstitution = System.Reflection.Missing.Value;
// Open the HTML file directly; no blank document needs to be created first.
Microsoft.Office.Interop.Word.Document wordDoc =
    word.Documents.Open(ref filepath, ref confirmconversion,
        ref readOnly, ref oMissing,
        ref oMissing, ref oMissing, ref oMissing, ref oMissing,
        ref oMissing, ref oMissing, ref oMissing, ref oMissing,
        ref oMissing, ref oMissing, ref oMissing, ref oMissing);
object fileFormat = WdSaveFormat.wdFormatPDF;
wordDoc.SaveAs(ref saveto, ref fileFormat, ref oMissing, ref oMissing, ref oMissing,
    ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
    ref oMissing, ref oMissing, ref oMissing, ref oallowsubstitution, ref oMissing,
    ref oMissing);
// Close the document and quit Word so no WINWORD.EXE process is left running.
wordDoc.Close(ref oMissing, ref oMissing, ref oMissing);
word.Quit(ref oMissing, ref oMissing, ref oMissing);
The OpenXML SDK allows you to programmatically build docx documents:
OpenXml SDK Download
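As a rough illustration of building a document programmatically, here is a minimal sketch that writes a .docx containing one paragraph. It assumes the DocumentFormat.OpenXml NuGet package is referenced; the class name DocxBuilder and the method CreateDocx are made up for this example.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the SDK.
public static class DocxBuilder
{
    // Creates a .docx file at the given path containing a single paragraph of plain text.
    public static void CreateDocx(string path, string text)
    {
        using (WordprocessingDocument package =
            WordprocessingDocument.Create(path, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = package.AddMainDocumentPart();

            // Document > Body > Paragraph > Run > Text is the smallest useful tree.
            mainPart.Document = new Document(
                new Body(
                    new Paragraph(
                        new Run(
                            new Text(text)))));
            mainPart.Document.Save();
        }
    }
}
```

Calling DocxBuilder.CreateDocx("hello.docx", "Hello, world") should produce a file that Word can open. Note that this only builds content by hand; the SDK itself does not convert HTML.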
You might consider using altChunk. See, amongst others, adding images to openxml doc created from altchunk
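For reference, a minimal altChunk sketch could look like the following. It assumes the DocumentFormat.OpenXml NuGet package; the class name AltChunkHelper and the chunk id "htmlChunk1" are arbitrary choices for this example. The HTML is stored inside the package as-is, and Word converts it when the file is opened, so other DOCX readers may not show the content.

```csharp
using System.IO;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the SDK.
public static class AltChunkHelper
{
    // Wraps an HTML string in a .docx package as an altChunk and returns the file bytes.
    // The HTML should ideally be a complete document that declares its charset as UTF-8.
    public static byte[] HtmlToDocx(string html)
    {
        using (var output = new MemoryStream())
        {
            using (WordprocessingDocument package = WordprocessingDocument.Create(
                output, WordprocessingDocumentType.Document))
            {
                MainDocumentPart mainPart = package.AddMainDocumentPart();
                mainPart.Document = new Document(new Body());

                // The id links the AltChunk element to the embedded HTML part.
                const string chunkId = "htmlChunk1";
                AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
                    AlternativeFormatImportPartType.Html, chunkId);

                using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
                {
                    chunk.FeedData(htmlStream);
                }

                mainPart.Document.Body.Append(new AltChunk { Id = chunkId });
                mainPart.Document.Save();
            }

            // Read the stream only after the package has been disposed.
            return output.ToArray();
        }
    }
}
```

The returned bytes can be sent from an MVC action with the same content type used for any other .docx download.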
If you don't want to rely on Word to convert the HTML, you could try docx4j-ImportXHTML for .NET; see this walkthrough.
Aspose.Words for .NET is a commercial component allowing you to achieve this.
MigraDoc can help. Other options are Visual Studio Tools for Office or automating Office via COM.
Microsoft does not recommend running Office applications on a web server. However, this can be done fairly easily using the Open XML SDK 2.5.
All you really have to do is split the HTML on "<" and ">", then pass each part through a switch to identify whether it is an HTML tag or not.
Then, for each part, you can start converting the HTML into "Run" and "RunProperties" elements; the non-HTML text is simply placed into "Text" elements.
It sounds harder than it is... and yes, I have no idea why there isn't code available to do exactly this.
Things to keep in mind: the two formats do not convert cleanly into each other, so if you focus on the cleanest code possible, you will run into issues where the formatting itself becomes messy.
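The splitting approach described above might be sketched roughly as follows. This is a deliberately naive illustration, not a real HTML parser: it handles only bold and italic tags, ignores every other tag, and puts everything into one paragraph. The class name NaiveHtmlRuns is made up, and the DocumentFormat.OpenXml NuGet package is assumed.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper illustrating the "split on tags, then switch" idea.
public static class NaiveHtmlRuns
{
    // Converts a flat HTML fragment into a single paragraph of runs.
    // Only <b>, <strong>, <i> and <em> are recognized; other tags are skipped.
    public static Paragraph ToParagraph(string html)
    {
        var paragraph = new Paragraph();
        int bold = 0, italic = 0;

        // The capture group keeps the tags themselves in the split result.
        foreach (string part in Regex.Split(html, "(<[^>]+>)"))
        {
            if (part.Length == 0) continue;

            if (part[0] == '<' && part[part.Length - 1] == '>')
            {
                bool closing = part.StartsWith("</", StringComparison.Ordinal);
                // Take the tag name only, dropping brackets, slashes and attributes.
                string name = part.Trim('<', '>', '/', ' ').Split(' ')[0].ToLowerInvariant();
                switch (name)
                {
                    case "b":
                    case "strong":
                        bold += closing ? -1 : 1;
                        break;
                    case "i":
                    case "em":
                        italic += closing ? -1 : 1;
                        break;
                }
                continue;
            }

            // Plain text: emit a run carrying the formatting that is currently open.
            var run = new Run();
            var props = new RunProperties();
            if (bold > 0) props.Append(new Bold());
            if (italic > 0) props.Append(new Italic());
            if (props.HasChildren) run.Append(props);
            run.Append(new Text(WebUtility.HtmlDecode(part))
            {
                Space = SpaceProcessingModeValues.Preserve
            });
            paragraph.Append(run);
        }
        return paragraph;
    }
}
```

The resulting paragraph can be appended to a document body. Nested block elements, tables, images and CSS would all need extra cases, which is where this approach gets messy in practice.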
You may consider using PHPDocX, which offers a very convenient tool to convert HTML files and/or HTML strings into WordML.
It has plenty of options, among them:
- You can filter, using CSS-style selectors, which chunks of HTML should be inserted into the Word document.
- You may choose whether to download the images or leave them as external links.
- It parses HTML forms.
- You may use native Word styles for tables and paragraphs, overriding the original CSS.
- It transforms HTML anchors into Word bookmarks.
- etc.
I hope you find it useful :-)