Junk code that may be generated when copying from Word into an HTML editor

In addition to windowtext, a lot of other junk code is generated when Word content is copied into HTML. Here are the names of some common ones:

MsoNormal: This is the style name that Word uses by default and is one of the most common types of junk code.
MsoListParagraph: The style name used for list paragraphs.
MsoTitle: The style name used for the title.
MsoHeader: The name of the style used for the header.
MsoFooter: The style name used for the footer.
MsoTable: The style name used for the table.
MsoCommentText: The style name used for comments.
MsoBodyText: The style name used for the body text.
These names, which appear in the class attributes of the pasted elements, often lead to messy styling, so it is best to remove as many of them as possible when handling the HTML.
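
As a rough illustration of that removal step, the sketch below drops every class name beginning with "Mso" and keeps any other classes. It uses only Python's standard re module. The function name strip_mso_classes is hypothetical (it does not come from Word or from any library), and a regular expression is only an approximation of HTML parsing, so treat it as a starting point rather than a complete cleaner.

```python
import re

# Matches a class attribute in its quoted forms and in the unquoted
# form (class=MsoNormal) that Word-generated HTML often uses.
CLASS_ATTR = re.compile(
    r'\sclass\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))',
    re.IGNORECASE,
)


def strip_mso_classes(html: str) -> str:
    """Remove class names starting with 'Mso'; keep any other classes."""

    def fix(match: re.Match) -> str:
        value = match.group(1) or match.group(2) or match.group(3) or ""
        kept = [name for name in value.split() if not name.startswith("Mso")]
        if not kept:
            # Nothing left, so drop the whole attribute.
            return ""
        return ' class="' + " ".join(kept) + '"'

    return CLASS_ATTR.sub(fix, html)


print(strip_mso_classes('<p class="MsoNormal note">Hi</p>'))
```

Because the pattern works on raw text, it would also rewrite a literal " class=..." that happens to appear in the body text; if that matters, a parser-based approach is the safer choice.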

There are also some other names, such as:

mso-padding-alt
mso-margin-top-alt
mso-margin-bottom-alt
mso-list
mso-list-level
mso-tab-count
mso-layout-grid-align
mso-char-indent-count
mso-header-margin
mso-footer-margin
mso-paper-source
mso-page-orientation
These names also come from the code generated after editing in Word and copying to HTML.

There are also some other common style names in Word, such as:

mso-spacerun: styles automatically generated when multiple spaces are entered in Word
mso-list: list styles
mso-bidi-font-size, mso-ascii-font-family, mso-hansi-font-family, mso-ansi-language, mso-fareast-language, mso-bidi-language: These properties are mainly used to handle multilingual text and may affect page layout and font rendering.
Note that different versions of Word may generate different HTML, which can result in different style names and junk code. When dealing with Word-generated HTML, it is best to inspect the actual output first and then decide how to clean it.
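
One way to remove the mso- properties named above is to filter the declarations inside each style attribute. The sketch below is a minimal version using only Python's standard library. The function name strip_mso_declarations is hypothetical, and the sketch assumes declarations are separated by semicolons with no semicolons inside values (for example inside a data: URL), which may not hold for every document.

```python
import html
import re

# Matches style="..." or style='...'; DOTALL because Word often wraps
# long style attributes across several lines.
STYLE_ATTR = re.compile(
    r'\sstyle\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def strip_mso_declarations(markup: str) -> str:
    """Drop every 'mso-*' declaration from inline style attributes."""

    def fix(match: re.Match) -> str:
        # Decode entities first so '&quot;' is not split at its semicolon.
        value = html.unescape(match.group(2))
        kept = []
        for decl in value.split(";"):
            decl = decl.strip()
            if not decl:
                continue
            prop = decl.split(":", 1)[0].strip().lower()
            if prop.startswith("mso-"):
                continue
            kept.append(decl)
        if not kept:
            # Only mso- declarations were present: remove the attribute.
            return ""
        # Re-escape so quotes inside values cannot end the attribute early.
        return ' style="' + html.escape("; ".join(kept), quote=True) + '"'

    return STYLE_ATTR.sub(fix, markup)


print(strip_mso_declarations('<p style="color:red;mso-spacerun:yes">x</p>'))
```

The attribute is always written back with double quotes, and any quotes inside the value (such as a quoted font name) are emitted as entities so the result stays well-formed.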

Here is a more comprehensive list, including some common, uncommon, and special style property names:

windowtext
mso-padding-alt
mso-margin-alt
mso-table-anchor-horizontal
mso-table-anchor-vertical
mso-padding-alt-top
mso-padding-alt-right
mso-padding-alt-bottom
mso-padding-alt-left
mso-margin-alt-top
mso-margin-alt-right
mso-margin-alt-bottom
mso-margin-alt-left
mso-table-layout-alt
mso-border-alt
mso-border-alt-top
mso-border-alt-right
mso-border-alt-bottom
mso-border-alt-left
mso-border-alt-colspan
mso-border-alt-rowspan
mso-cellspacing
mso-cellpadding
mso-yfti-tbllook
mso-yfti-relative-size
mso-yfti-font-family
mso-yfti-font-size
mso-yfti-font-weight
mso-yfti-font-style
mso-yfti-font-color
mso-yfti-rowanchor
mso-yfti-firstrow
mso-yfti-lastrow
mso-yfti-trowgranularity
mso-yfti-trowautofit
mso-yfti-rowheight
mso-yfti-wrap
mso-hansi-font-family
mso-bidi-font-family
mso-ansi-font-size
mso-bidi-font-size
mso-ansi-font-style
mso-bidi-font-style
mso-ansi-font-weight
mso-bidi-font-weight
mso-ansi-font-color
mso-bidi-font-color
mso-font-kerning
mso-font-charset
mso-generic-font-family
mso-font-format
mso-font-pitch
mso-font-signature
mso-ascii-font-family
mso-hansi-theme-font
mso-ascii-theme-font
mso-bidi-theme-font
mso-theme-font
mso-theme-font-major
mso-theme-font-minor
mso-bidi-language
mso-ansi-language
mso-language
mso-no-proof
mso-spacerun
mso-style-locked
mso-style-priority
mso-background-source
mso-pattern
mso-protection
mso-position-horizontal
mso-position-horizontal-relative
mso-position-vertical
mso-position-vertical-relative
mso-width-percent
mso-height-percent
mso-horizontal-position-percent
mso-vertical-position-percent
mso-ignore
mso-number-format
mso-layout-grid-align
mso-layout-grid-mode
mso-layout-grid-type
mso-line-height-rule
mso-list
mso-list-template
mso-list-id
mso-list-type
mso-outline-level
mso-list-level
mso-list-level-text
mso-list-level-tab-stop
mso-list-level-number-position
mso-list-level-tab-stop-position
mso-list-level-align
mso-list-level-text-indent
mso-list-level-number-indent
mso-list-level-previous
mso-list-level-following
mso-list-indent
mso-list-hang

Some other junk code includes:

mso-padding-alt
mso-table-anchor
mso-char-indent
mso-pagination
mso-para-margin
mso-border-alt
These are also generated when content from Word and other editors is copied into HTML, and they should be removed.

Copying from Word also produces the following garbage elements:

The <v:…> elements are non-standard tags from Microsoft Office products, known as "VML" (Vector Markup Language) tags, which are used to create vector graphics in Office applications. These tags are often present in exported HTML and can cause page rendering issues. If you need to display these graphics in HTML, you can use third-party libraries for parsing and rendering.

In addition to elements such as <v:…>, Office products may also generate the following garbage elements when their content is copied to HTML:

<o:…>: Office XML elements
<w:…>: WordML elements
<m:…>: Office Math elements (Word's own equation markup, distinct from standard MathML)
<st1:…>: Legacy Word smart tag elements
<st2:…>: Legacy Word smart tag elements
These elements can carry a lot of unnecessary styles and code, so the HTML page should be cleaned up and optimized to improve its performance and maintainability.
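
A simple way to clean up the namespaced elements listed above is to delete their opening and closing tags while keeping any text between them. The sketch below does this with Python's standard re module. The function name strip_office_tags and the exact prefix list are assumptions for illustration, and the pattern assumes no attribute value contains a ">" character.

```python
import re

# Opening, closing, and self-closing tags whose name carries one of the
# Office prefixes discussed above (o, v, w, m, st1, st2).
OFFICE_TAG = re.compile(
    r'</?(?:o|v|w|m|st1|st2):[A-Za-z][^>]*>',
    re.IGNORECASE,
)


def strip_office_tags(html: str) -> str:
    """Remove Office-namespaced tags but keep the text they wrap."""
    return OFFICE_TAG.sub("", html)


print(strip_office_tags('<p>Text<o:p></o:p></p>'))
print(strip_office_tags('<st1:place w:st="on">Paris</st1:place>'))
```

Keeping the inner text is a deliberate choice: smart tags usually wrap real content, so removing the whole element would delete words from the document. Elements whose content is itself junk would need separate handling.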

Beyond the garbage elements mentioned above, special Word features such as the formula editor, SmartArt, and charts also generate their own junk code when copied to HTML. Some Word formats and styles (such as paragraph styles, list styles, and table styles) may likewise produce redundant code when converted to HTML.

Other elements that may appear in Word copy-pasted HTML include:

<o:p>: Used to mark the beginning and end of a paragraph.
<w:br>: Used to indicate a newline character.
<w:tab>: Used to represent tab characters.
<w:pict>: Used to represent a picture.
<w:smartTag>: Used to represent smart tags (such as autolinks, spell check, dates, etc.).
<w:hyperlink>: Used to indicate a hyperlink.
These elements may affect the layout and presentation of the web page, so watch for them in later processing.

These are the relatively common garbage elements; others exist but are seen less often. In general, copying from Word to HTML generates a lot of redundant code. If it needs to be cleaned up, handle it according to what the pasted HTML actually contains.

There are also some uncommon garbage elements, such as:

<o:p>: A placeholder in an Office document that marks a blank line at the end of a paragraph.
: Before Office 2003 it was used to store additional information, now it is used to store custom XML data.
<m:oMathPara> and <m:oMath>: Elements for Office Math formulas. These elements appear frequently in Office output but are rarely used elsewhere.
These elements appear relatively rarely. If you only need to deal with the common junk elements, you can ignore them.

Most of these elements start with the prefix o, w, or m. There are also other, less common tags and style properties, such as:

<v:textbox>
<v:image>
mso-spacerun:
mso-tab-count:
mso-hide:
mso-ignore:
mso-element:
mso-comment-text:
These tags and properties are also junk code that may be generated when content from Office products is copied into HTML.

In addition to the above mentioned, there are some uncommon ones, such as:

<o:p> and <w:WordDocument>: These elements are tags that Microsoft Word uses in HTML output.
<m:math> and <m:oMath>: These elements are the tags that Microsoft Word uses to convert mathematical equations into HTML.
Note that although these elements may appear in HTML when Word content is copied, they are not necessarily "junk elements", and some may be useful. When processing HTML copied from Word, filter and clean it according to what the content actually needs.
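
Before deciding what to filter, it can help to see which Office-specific tags and properties a pasted fragment actually contains. The sketch below counts them using only Python's standard library. The function name inventory_office_markup is hypothetical, and the patterns are a rough text scan rather than a real HTML parse.

```python
import re
from collections import Counter

# Opening tags with a namespace prefix, e.g. <o:p> or <m:oMath>.
PREFIXED_TAG = re.compile(r'<([A-Za-z][A-Za-z0-9]*:[A-Za-z][A-Za-z0-9]*)')
# Property names such as mso-list that are followed by a colon.
MSO_PROPERTY = re.compile(r'\b(mso-[a-z-]+)\s*:', re.IGNORECASE)


def inventory_office_markup(html: str):
    """Return (tag_counts, property_counts) for Office-specific markup."""
    tags = Counter(name.lower() for name in PREFIXED_TAG.findall(html))
    props = Counter(name.lower() for name in MSO_PROPERTY.findall(html))
    return tags, props


sample = '<p style="mso-list:l0 level1 lfo1">x<o:p></o:p><m:oMath>y</m:oMath></p>'
tags, props = inventory_office_markup(sample)
print(tags.most_common())
print(props.most_common())
```

With a report like this, elements that carry real content (for example equations) can be excluded from cleanup, while purely decorative ones are removed.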

Origin blog.csdn.net/snans/article/details/129251290