
Combining HTML docs with different charsets

I saved an MS Word document using the "Save As" option "Web Page, Filtered". I want to insert the generated HTML and CSS into an HTML5 document that has my header, menu, footer, etc. My first question concerns the charset and header info:

MS-Word generated HTML (Saved as "Web Page, Filtered"):

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 12 (filtered)">

My HTML5 template:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8" /> 

The main issue I see is the two different character sets (UTF-8 vs. windows-1252). Additionally, I am guessing the meta tag name=Generator content="Microsoft Word 12 (filtered)" will not be a problem and perhaps can just be removed (?).

I can sort out the CSS with one exception. I do not know what the '@' symbol means. Example:

@font-face
    {font-family:"Book Antiqua";
    panose-1:2 4 6 2 5 3 5 3 3 4;}

I looked through the document and do not see "font-face" IDs or classes. So I am guessing this might change all of the fonts in the document. This might be a problem (if true); as stated, the new document will have my menu, header, footer, etc.


You should not copy and paste anything that MS Office pukes out into a website, mostly because your code becomes a big mess and it will most likely only look right in IE. This is just my experience after I got a lot of "Your website is broken!!!" complaints when someone pasted MS Word "HTML" into Joomla pages. Anyway, the charset on your website should be UTF-8. Your @font-face looks broken to me. I only know it in a slightly different syntax:

@font-face {
  font-family: "Awesomefont";
  src: url("fonts/awesome.ttf");
}

This alone won't do anything until you apply "Awesomefont" somewhere else:

h1 { font-family: "Awesomefont"; }
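
As a rough illustration of the point above: a Word-generated @font-face rule that names a font but has no src descriptor gives the browser nothing to load, so it can be dropped before the stylesheet is merged into the template. The following is only a sketch in Python and not part of the original answer. The function name drop_srcless_font_faces and the regular expressions are assumptions made for this example, and it assumes the rule contains no nested braces.

```python
import re

# Matches one @font-face rule. Assumes there are no nested braces in the block.
FONT_FACE = re.compile(r"@font-face\s*\{[^{}]*\}", re.IGNORECASE)


def drop_srcless_font_faces(css: str) -> str:
    """Remove @font-face rules that declare no src descriptor."""

    def keep_or_drop(match: re.Match) -> str:
        block = match.group(0)
        # Look for a "src:" descriptor anywhere inside the rule.
        has_src = re.search(r"(?:^|[;{\s])src\s*:", block, re.IGNORECASE)
        return block if has_src else ""

    return FONT_FACE.sub(keep_or_drop, css)


# The rule from the question, followed by an ordinary rule that uses the font.
word_css = '''@font-face
    {font-family:"Book Antiqua";
    panose-1:2 4 6 2 5 3 5 3 3 4;}
h1 {font-family:"Book Antiqua";}
'''

cleaned = drop_srcless_font_faces(word_css)
print(cleaned)
```

Ordinary rules such as the h1 rule are left untouched, so "Book Antiqua" is still requested wherever it is applied; it simply resolves to a locally installed font or a fallback.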


Here is a set of PowerShell scripts that will clean Word-filtered HTML and correctly tag superscripts and subscripts about 95% of the time. (No, you can't get better than that; Word is made for print.)

https://github.com/suzumakes/replaceit

This also changes the characters that M$ barfs out in windows-1252 to their appropriate UTF-8 counterparts. It removes all the styling and classes so that you can drop the HTML straight into your template with minimal fuss. Depending on how crazy the person who made your Word doc went with justifying text and funky layouts, you may have just a few minutes of cleanup, or you may have to fix M$'s propensity to insert soft hyphens all over the place.

Instructions are there in the ReadMe and if you happen to encounter any additional characters that need to be caught or come up with any tweaks/improvements, I'd be happy to see your pull request.
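
For readers who only need the charset step, the conversion described above can be sketched in a few lines of Python. This is not how the linked PowerShell scripts work (their internals are not shown here). It is a minimal illustration with a hypothetical function name, word_html_to_utf8, and an assumed regular expression for Word's old Content-Type meta tag.

```python
import re

# Matches Word's <meta http-equiv=Content-Type ...> charset declaration,
# with or without quotes around the attribute value.
OLD_META = re.compile(
    r"<meta\s+http-equiv=[\"']?Content-Type[\"']?[^>]*>\s*", re.IGNORECASE
)


def word_html_to_utf8(raw: bytes) -> bytes:
    """Decode windows-1252 bytes, drop the old charset meta tag, return UTF-8."""
    # Strict decoding raises UnicodeDecodeError on the few bytes that
    # windows-1252 leaves undefined (for example 0x81).
    text = raw.decode("cp1252")
    # The HTML5 template already declares <meta charset="UTF-8" />, so the
    # windows-1252 declaration must not be carried over.
    text = OLD_META.sub("", text)
    return text.encode("utf-8")


# Curly quotes (0x93, 0x94), an en dash (0x96), and e-acute (0xE9) in windows-1252.
sample = (
    b'<meta http-equiv=Content-Type content="text/html; charset=windows-1252">\n'
    b"<p>\x93quoted\x94 \x96 caf\xe9</p>"
)

converted = word_html_to_utf8(sample)
print(converted.decode("utf-8"))
```

In practice the input would be read from the saved file in binary mode and the result written back in binary mode, so that no second, implicit re-encoding happens along the way.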
