How to convert HTML to Markdown

What is Turndown

Turndown is a JavaScript library that converts HTML to Markdown. It is often used to turn rich-text content from web pages or other HTML sources into Markdown for display or storage on other platforms.

To use Turndown in a Node.js project, first add it as a dependency. Turndown can be installed using npm or yarn as follows:

Install using npm:

npm install turndown

Install using yarn:

yarn add turndown

Or load it in the browser with a script tag:

<script src="https://unpkg.com/turndown/dist/turndown.js"></script>

Once Turndown is installed, you can use it in your project. Here's a simple example:

import TurndownService from 'turndown';

const turndownService = new TurndownService();

const html = '<h1>Hello, World!</h1><p>This is a <em>sample</em> HTML document.</p>';
const markdown = turndownService.turndown(html);

console.log(markdown);

To use Turndown without installing anything, load it directly in an HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <!-- Load the Turndown library -->
  <script src="https://cdn.jsdelivr.net/npm/turndown/dist/turndown.js"></script>

  <title>Document</title>
</head>
<body>
  <script>
    // Create a TurndownService instance
    const turndownService = new TurndownService();

    // HTML content to convert
    const htmlContent = '<h1>Hello, World!</h1><p>This is a <em>sample</em> HTML document.</p>';

    // Convert it with Turndown
    const markdown = turndownService.turndown(htmlContent);

    // Print the Markdown
    console.log(markdown);
  </script>
  
</body>
</html>

Options

Turndown accepts configuration options for finer control over the HTML to Markdown conversion. Options are passed to the TurndownService constructor. The following are some commonly used ones:

  1. headingStyle (heading style): Sets the style of generated Markdown headings. Valid values are "setext" (default) and "atx". The "setext" style underlines the heading text with equals signs (level 1) or hyphens (level 2), while the "atx" style prefixes the heading with one pound sign per heading level.
const turndownService = new TurndownService({ headingStyle: 'atx' });
  2. hr (horizontal rule): Sets the string used for a Markdown horizontal rule. The default is * * *.
const turndownService = new TurndownService({ hr: '- - -' });
  3. bulletListMarker (unordered list marker): Sets the marker used for unordered list items. Valid values are "*" (default), "-", and "+".
const turndownService = new TurndownService({ bulletListMarker: '-' });
  4. codeBlockStyle (code block style): Sets how code blocks are written. Valid values are "indented" (default) and "fenced".
const turndownService = new TurndownService({ codeBlockStyle: 'fenced' });
  5. fence (code fence): Sets the delimiter for fenced code blocks; it only applies when codeBlockStyle is "fenced". Valid values are "```" (default) and "~~~".
const turndownService = new TurndownService({ codeBlockStyle: 'fenced', fence: '~~~' });
  6. emDelimiter (emphasis delimiter): Sets the delimiter used for emphasis (italics). Valid values are "_" (default) and "*".
const turndownService = new TurndownService({ emDelimiter: '*' });
  7. strongDelimiter (strong delimiter): Sets the delimiter used for bold text. Valid values are "**" (default) and "__".
const turndownService = new TurndownService({ strongDelimiter: '__' });
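Several options can be passed in a single constructor call. The following is a minimal sketch, assuming the turndown package is installed and loaded with CommonJS require; the options object and the sample HTML are illustrative only. The require call is wrapped in try/catch so the snippet still runs where the package is missing.

```javascript
// Options collected in one object and passed to the constructor together.
const options = {
  headingStyle: 'atx',       // "# Heading" instead of underlined headings
  hr: '- - -',               // string used for <hr>
  bulletListMarker: '-',     // marker for unordered list items
  codeBlockStyle: 'fenced',  // fenced code blocks instead of indented ones
  fence: '~~~',              // delimiter used for fenced code blocks
  emDelimiter: '*',          // *emphasis*
  strongDelimiter: '__',     // __strong__
};

// Illustrative input, not taken from any real page.
const html =
  '<h1>Title</h1><ul><li>one</li><li><strong>two</strong></li></ul><hr>';

// Assumption: turndown is installed (npm install turndown).
let TurndownService = null;
try {
  TurndownService = require('turndown');
} catch (err) {
  console.log('turndown is not installed; skipping the conversion');
}

let markdown = null;
if (TurndownService) {
  markdown = new TurndownService(options).turndown(html);
  console.log(markdown);
}
```

Keeping the options in a separate object makes it easy to reuse the same settings for every TurndownService instance in a project.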

API

Turndown provides a small set of API methods and options for customizing the HTML to Markdown conversion. Here are the most commonly used ones:

  1. turndown(html: string): This is the main method of Turndown, used to convert a given HTML string to Markdown. It returns a Markdown string.

  2. addRule(key: string, rule: Rule): Allows you to add custom rules to handle the conversion of HTML elements. key is the unique identifier of the rule, and rule is an object containing the rule definition.

  3. keep(filter: string | string[] | KeepFilterFunction): Allows you to specify which HTML elements should be left as raw HTML without conversion. You can pass a tag name, an array of tag names, or a custom function that decides which elements to keep.

  4. remove(filter: string | string[] | RemoveFilterFunction): Allows you to specify which HTML elements should be completely removed from the output Markdown. You can pass a tag name, an array of tag names, or a custom function that decides which elements to remove.

  5. use(plugins: Plugin | Plugin[]): Allows you to load Turndown plugins, which can add additional conversion rules and functionality. Plugins are a way to extend the functionality of Turndown.

  6. keepReplacement: a constructor option rather than a method. It is a function (content, node) that returns the output for kept elements. By default it returns the element's outer HTML, surrounded by blank lines when the element is block-level.

  7. Rule order: the official documentation does not list addRuleBefore or addRuleAfter methods. Instead, precedence is fixed: rules added with addRule are checked before the built-in CommonMark rules, which are checked before keep filters and then remove filters.
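The methods above can be combined as in the following sketch. It assumes the turndown package is installed and loaded with CommonJS require. The strikethrough rule follows the shape of a rule object (a filter plus a replacement function); strikethroughPlugin and configure are hypothetical helper names introduced for illustration and are not part of Turndown.

```javascript
// A rule is a plain object: `filter` selects elements by tag name,
// `replacement` receives the already-converted content and returns Markdown.
const strikethroughRule = {
  filter: ['del', 's', 'strike'],
  replacement: (content) => '~~' + content + '~~',
};

// A plugin is a function that receives the TurndownService instance.
// (Hypothetical plugin, written here only to demonstrate use().)
function strikethroughPlugin(service) {
  service.addRule('strikethrough', strikethroughRule);
}

// Hypothetical helper that applies all customizations to a service.
function configure(service) {
  service.use(strikethroughPlugin);     // registers the rule via the plugin
  service.keep(['ins']);                // leave <ins> as raw HTML
  service.remove(['script', 'style']);  // drop these elements entirely
  return service;
}

// Assumption: turndown is installed (npm install turndown).
let TurndownService = null;
try {
  TurndownService = require('turndown');
} catch (err) {
  console.log('turndown is not installed; skipping the conversion');
}

if (TurndownService) {
  const service = configure(new TurndownService());
  const html = '<p><del>old</del> <ins>new</ins></p><script>track()</script>';
  console.log(service.turndown(html));
}
```

Wrapping the rule in a plugin function keeps the customization reusable: the same function can be passed to use() on any TurndownService instance.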

Other tools can also convert HTML to Markdown, such as html-to-markdown, showdown, and remark; choose one according to project needs and personal preference. Note that Marked works in the opposite direction, converting Markdown to HTML.

This article only briefly introduces Turndown. More details can be found in the official documentation: https://github.com/domchristie/turndown


Origin blog.csdn.net/xielinrui123/article/details/133719499