RSS Can: Convert website information stream to RSS feed (3)

In this third article, let's talk about converting structured data into an RSS feed that readers can subscribe to.

Preface

In the first two articles, "RSS Can: Using Golang to Achieve Better RSS Hub Service (1)" and "RSS Can: Using V8 to Make Golang Applications Dynamic (2)", we used dynamic configuration to organize the information on a website into structured data.

In this article, let's briefly talk about how to turn that structured data into subscribable RSS feeds, so that a website's data can be "connected" to our RSS reader.

RSS format standard

Before talking about the code, whether you are a developer or a user of RSS products, it is well worth understanding the RSS format standards.

There are three well-known families of "RSS" format standards on the Internet: Atom, RSS, and JSON Feed. The third appeared while RSS was already in decline, so it has few implementations and little mindshare. As a result, the formats supported by major applications concentrate on the first two: RSS and Atom.

TL;DR: if you are a content provider and want your content to reach as many people as possible across various RSS clients, choosing the widely supported RSS 2.0 gives very good compatibility. If you are a reader who cares about tracking article updates and a better reading experience, then when a website offers several feed formats at once, you may as well pick the Atom feed first.

Of course, in this article we will use an open source library to output the data organized in the previous two articles in all three formats (it costs nothing extra anyway).

Key advantages of the Atom format over RSS 2.0

If you don't plan to do detailed development around "RSS" and only want to know how to use it, you can skip this section.

  1. The ability to mark whether the HTML content in a field has been escaped or encoded, which makes the data easier for developers to use when rendering.
  2. There is no longer any need to mix the "body" and the "abstract" of the content in the description field: a new summary field distinguishes the "abstract" from the "body", and non-text content can be added to the body.
  3. "RSS" exists in several variants; Atom is more stable and consistent.
  4. It provides a namespace that conforms to XML standards, can use XML's built-in attributes to describe relative addresses and to tell subscribers the content language, and supports XML Schema, none of which RSS 2.0 has.
  5. Each item has a unique ID, so subscribers can track updates to specific content.
  6. There is a single, clear specification for expressing time, which is convenient for programs to process.
  7. application/atom+xml is a MIME media type registered with IANA, making it a standard; the application/rss+xml type used by RSS has not yet been standardized.

Convert data to RSS feed format using Go

There are many packages in the Go ecosystem that can generate RSS feeds. I chose gorilla/feeds, which has a ten-year maintenance history, even though on the 9th of this month the maintenance team announced that all repositories in the open source organization would be archived and no longer maintained.

However, RSS is an "old and stable" protocol and gorilla/feeds has been proven over a long time, so it still suits our needs. In addition, for projects like this that are inactive or no longer maintained, Go's package management can help us manage the code and maintain our own changes, which we will cover in a later article.

General Use of Gorilla Feeds

Let's first look at how to use Gorilla Feeds to generate feeds. First, import the package:

import (
	"time"
	"github.com/gorilla/feeds"
)

The time package is imported here as well because I don't want to bother creating time data by hand. Different RSS formats have different requirements for time, so it is probably better to discuss time handling in a follow-up article.

Let's take previously published articles as an example and write some mock data to test RSS feed generation:

now := time.Now()
feed := &feeds.Feed{
	Title:       "苏洋博客",
	Link:        &feeds.Link{Href: "https://soulteary.com/"},
	Description: "醉里不知天在水,满船清梦压星河。",
	Author:      &feeds.Author{Name: "soulteary", Email: "soulteary@gmail.com"},
	Created:     now,
}

feed.Items = []*feeds.Item{
	{
		Title:       "RSS Can:借助 V8 让 Golang 应用具备动态化能力(二)",
		Link:        &feeds.Link{Href: "https://soulteary.com/2022/12/13/rsscan-make-golang-applications-with-v8-part-2.html"},
		Description: "继续聊聊之前做过的一个小东西的踩坑历程,如果你也想高效获取信息,或许这个系列的内容会对你有用。",
		Author:      &feeds.Author{Name: "soulteary", Email: "soulteary@qq.com"},
		Created:     now,
	},
	{
		Title:       "RSS Can:使用 Golang 实现更好的 RSS Hub 服务(一)",
		Link:        &feeds.Link{Href: "https://soulteary.com/2022/12/12/rsscan-better-rsshub-service-build-with-golang-part-1.html"},
		Description: "聊聊之前做过的一个小东西的踩坑历程,如果你也想高效获取信息,或许这个系列的内容会对你有用。这个事情涉及的东西比较多,所以我考虑拆成一个系列来聊,每篇的内容不要太长,整理负担和阅读负担都轻一些。本篇是系列第一篇内容。",
		Author:      &feeds.Author{Name: "soulteary", Email: "[email protected]"},
		Created:     now,
	},
	{
		Title:       "在搭载 M1 及 M2 芯片 MacBook设备上玩 Stable Diffusion 模型",
		Link:        &feeds.Link{Href: "https://soulteary.com/2022/12/10/play-the-stable-diffusion-model-on-macbook-devices-with-m1-and-m2-chips.html"},
		Description: "本篇文章,我们聊了如何使用搭载了 Apple Silicon 芯片(M1 和 M2 CPU)的 MacBook 设备上运行 Stable Diffusion 模型。",
		Created:     now,
	},
	{
		Title:       "使用 Docker 来快速上手中文 Stable Diffusion 模型:太乙",
		Link:        &feeds.Link{Href: "https://soulteary.com/2022/12/09/use-docker-to-quickly-get-started-with-the-chinese-stable-diffusion-model-taiyi.html"},
		Description: "本篇文章,我们聊聊如何使用 Docker 快速运行中文 Stable Diffusion 模型:太乙。 ",
		Created:     now,
	},
}

Then, with a few simple calls, the data can be "converted" into the results we need:

atom, err := feed.ToAtom()
if err != nil {
	log.Fatal(err)
}

rss, err := feed.ToRss()
if err != nil {
	log.Fatal(err)
}

json, err := feed.ToJSON()
if err != nil {
	log.Fatal(err)
}

fmt.Println(atom, "\n", rss, "\n", json)

Put the above code into a function that can be called for testing (for example, main). After the program runs, we will see results similar to the following:

<?xml version="1.0" encoding="UTF-8"?><feed xmlns="http://www.w3.org/2005/Atom">
  <title>苏洋博客</title>
  <id>https://soulteary.com/</id>
  <updated>2022-12-14T12:29:55+08:00</updated>
  <subtitle>醉里不知天在水,满船清梦压星河。</subtitle>
  <link href="https://soulteary.com/"></link>
  <author>
    <name>soulteary</name>
    <email>soulteary@gmail.com</email>
  </author>
  <entry>
    <title>RSS Can:借助 V8 让 Golang 应用具备动态化能力(二)</title>
    <updated>2022-12-14T12:29:55+08:00</updated>
    <id>tag:soulteary.com,2022-12-14:/2022/12/13/rsscan-make-golang-applications-with-v8-part-2.html</id>
    <link href="https://soulteary.com/2022/12/13/rsscan-make-golang-applications-with-v8-part-2.html" rel="alternate"></link>
    <summary type="html">继续聊聊之前做过的一个小东西的踩坑历程,如果你也想高效获取信息,或许这个系列的内容会对你有用。</summary>
    <author>
      <name>soulteary</name>
      <email>soulteary@qq.com</email>
    </author>
  </entry>
...
...
</feed> 

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>苏洋博客</title>
    <link>https://soulteary.com/</link>
    <description>醉里不知天在水,满船清梦压星河。</description>
    <managingEditor>soulteary@gmail.com (soulteary)</managingEditor>
    <pubDate>Wed, 14 Dec 2022 12:29:55 +0800</pubDate>
    <item>
      <title>RSS Can:借助 V8 让 Golang 应用具备动态化能力(二)</title>
      <link>https://soulteary.com/2022/12/13/rsscan-make-golang-applications-with-v8-part-2.html</link>
      <description>继续聊聊之前做过的一个小东西的踩坑历程,如果你也想高效获取信息,或许这个系列的内容会对你有用。</description>
      <author>soulteary</author>
      <pubDate>Wed, 14 Dec 2022 12:29:55 +0800</pubDate>
    </item>
    <item>
      <title>RSS Can:使用 Golang 实现更好的 RSS Hub 服务(一)</title>
      <link>https://soulteary.com/2022/12/12/rsscan-better-rsshub-service-build-with-golang-part-1.html</link>
      <description>聊聊之前做过的一个小东西的踩坑历程,如果你也想高效获取信息,或许这个系列的内容会对你有用。这个事情涉及的东西比较多,所以我考虑拆成一个系列来聊,每篇的内容不要太长,整理负担和阅读负担都轻一些。本篇是系列第一篇内容。</description>
      <author>soulteary</author>
      <pubDate>Wed, 14 Dec 2022 12:29:55 +0800</pubDate>
    </item>
...
...
  </channel>
</rss>

{
  "version": "https://jsonfeed.org/version/1",
  "title": "苏洋博客",
  "home_page_url": "https://soulteary.com/",
  "description": "醉里不知天在水,满船清梦压星河。",
  "author": {
    "name": "soulteary"
  },
  "items": [
    {
      "id": "",
      "url": "https://soulteary.com/2022/12/13/rsscan-make-golang-applications-with-v8-part-2.html",
      "title": "RSS Can:借助 V8 让 Golang 应用具备动态化能力(二)",
      "summary": "继续聊聊之前做过的一个小东西的踩坑历程,如果你也想高效获取信息,或许这个系列的内容会对你有用。",
      "date_published": "2022-12-14T12:29:55.50867+08:00",
      "author": {
        "name": "soulteary"
      }
    },
...
...
  ]
}

The output above includes all three formats mentioned earlier, which covers the subscription needs of most RSS clients.

Connecting the website's information

In the previous article, we parsed the target website through dynamic configuration and converted the information on the site into a Go data structure. Now that we understand how Gorilla Feeds outputs the RSS formats, we only need to "connect" the two to get a feed of the site's news.

First, make some adjustments to the function of "parsing website information according to configuration" mentioned above:

func getWebsiteDataWithConfig(config define.JavaScriptConfig) (result define.BodyParsed) {
	doc := network.GetRemoteDocument("https://36kr.com/", "utf-8")
	if doc.Body == "" {
		return result
	}

	return parser.ParsePageByGoQuery(doc, func(document *goquery.Document) []define.InfoItem {
		var items []define.InfoItem
		document.Find(config.ListContainer).Each(func(i int, s *goquery.Selection) {
			var item define.InfoItem

			title := strings.TrimSpace(s.Find(config.Title).Text())
			author := strings.TrimSpace(s.Find(config.Author).Text())
			// Named dateTime so it does not shadow the time package.
			dateTime := strings.TrimSpace(s.Find(config.DateTime).Text())
			category := strings.TrimSpace(s.Find(config.Category).Text())
			description := strings.TrimSpace(s.Find(config.Description).Text())

			href, _ := s.Find(config.Link).Attr("href")
			link := strings.TrimSpace(href)

			item.Title = title
			item.Author = author
			item.Date = dateTime
			item.Category = category
			item.Description = description
			item.Link = link
			items = append(items, item)
		})
		return items
	})
}

When the above function runs normally, you can get an array containing structured data.

Next, write a simple function that calls Gorilla Feeds to generate the RSS feed we need:

func generateFeeds(data define.BodyParsed) {
	now := time.Now()

	rssFeed := &feeds.Feed{
		Title:   "36Kr",
		Link:    &feeds.Link{Href: "https://36kr.com/"},
		Created: now,
	}

	for _, data := range data.Body {
		feedItem := feeds.Item{
			Title:       data.Title,
			Author:      &feeds.Author{Name: data.Author},
			Description: data.Description,
			Link:        &feeds.Link{Href: data.Link},
			// Time handling is fiddly; we will cover it in a follow-up article.
			Created: now,
		}
		rssFeed.Items = append(rssFeed.Items, &feedItem)
	}

	atom, err := rssFeed.ToAtom()
	if err != nil {
		log.Fatal(err)
	}

	rss, err := rssFeed.ToRss()
	if err != nil {
		log.Fatal(err)
	}

	json, err := rssFeed.ToJSON()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(atom, "\n", rss, "\n", json)
}

Finally, adjust the calling function of the program so that we can test and print the RSS generation result to the terminal log:

func main() {
	// Stop early if the configuration file cannot be read.
	jsApp, err := os.ReadFile("./config/config.js")
	if err != nil {
		fmt.Println(err)
		return
	}
	inject := string(jsApp)

	jsConfig, err := javascript.RunCode(inject, "JSON.stringify(getConfig());")
	if err != nil {
		fmt.Println(err)
		return
	}

	config, err := parser.ParseConfigFromJSON(jsConfig)
	if err != nil {
		fmt.Println(err)
		return
	}
	data := getWebsiteDataWithConfig(config)
	generateFeeds(data)
}

Run the program with go run main.go, and we get the expected result:

<?xml version="1.0" encoding="UTF-8"?><feed xmlns="http://www.w3.org/2005/Atom">
  <title>36Kr</title>
  <id>https://36kr.com/</id>
  <updated>2022-12-14T13:41:37+08:00</updated>
  <link href="https://36kr.com/"></link>
  <entry>
    <title>iOS 16.2来了,这7个新功能值得关注</title>
    <updated>2022-12-14T13:41:37+08:00</updated>
    <id>tag:,2022-12-14:/p/2043412066405640</id>
    <link href="/p/2043412066405640" rel="alternate"></link>
    <summary type="html">Apple 画的饼终于来了。</summary>
    <author>
      <name>少数派</name>
    </author>
  </entry>
  <entry>
    <title>如何更好地思考:人只能获得自己认知内的成就</title>
    <updated>2022-12-14T13:41:37+08:00</updated>
    <id>tag:,2022-12-14:/p/2018320727015942</id>
    <link href="/p/2018320727015942" rel="alternate"></link>
    <summary type="html">5个原则,让你成为一个更好的思考者。</summary>
    <author>
      <name>神译局</name>
    </author>
  </entry>
...
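
One detail worth noticing in the output above: the entry links are site-relative (/p/...), and the generated IDs start with "tag:," because no host could be derived from them. Feed readers generally work best with absolute URLs. A minimal sketch of resolving a scraped href against the site address with the standard net/url package follows; the helper name absoluteLink is hypothetical, and where to call it (for example, when filling item.Link) is left as an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteLink resolves href against base, so a site-relative path
// such as "/p/1" becomes a full URL. Absolute hrefs pass through unchanged.
func absoluteLink(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	link, err := absoluteLink("https://36kr.com/", "/p/2043412066405640")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(link) // https://36kr.com/p/2043412066405640
}
```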

Now that we have data in formats that RSS clients can use, let's take the last step toward "RSS subscription": start a simple web service and expose the data above at an accessible URL.

Use Gin to handle RSS web services

Gin is an excellent HTTP web framework. It is not necessarily the fastest framework among all frameworks in the Go ecosystem, but it is definitely among the best in terms of community activity and ease of use.

Start a simple web service with Gin

Gin wraps the capabilities of net/http and provides a simple API for starting a web service, as in the following code of less than 20 lines:

package main

import (
  "net/http"

  "github.com/gin-gonic/gin"
)

func main() {
  r := gin.Default()
  r.GET("/ping", func(c *gin.Context) {
    c.JSON(http.StatusOK, gin.H{
      "message": "pong",
    })
  })
  r.Run()
}

After the code above runs, a web service starts at the default address http://localhost:8080. When we visit /ping in the browser, the server responds with a JSON message containing pong.

Make RSS subscription data interface

As mentioned above, since there is no cost to generate RSS in different formats, we can support them all and respond to requests from various RSS clients.

When actually providing services, we need to output different data according to the RSS format type requested by the client. Therefore, we need to adjust the function we used to generate the RSS feed above so that it supports generating content according to the type in the request parameter:

func generateFeeds(data define.BodyParsed, rssType string) string {
	now := time.Now()

	rssFeed := &feeds.Feed{
		Title:   "36Kr",
		Link:    &feeds.Link{Href: "https://36kr.com/"},
		Created: now,
	}

	for _, data := range data.Body {
		feedItem := feeds.Item{
			Title:       data.Title,
			Author:      &feeds.Author{Name: data.Author},
			Description: data.Description,
			Link:        &feeds.Link{Href: data.Link},
			// Time handling is fiddly; we will cover it in a follow-up article.
			Created: now,
		}
		rssFeed.Items = append(rssFeed.Items, &feedItem)
	}

	var rss string
	var err error

	switch rssType {
	case "RSS":
		rss, err = rssFeed.ToRss()
	case "ATOM":
		rss, err = rssFeed.ToAtom()
	case "JSON":
		rss, err = rssFeed.ToJSON()
	default:
		rss = ""
	}

	if err != nil {
		fmt.Println(err)
		return ""
	}

	return rss
}

With the generation function adjusted, let's write a simple route handler that calls it and outputs RSS feeds in different formats depending on the API request path:

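// RSSType and data are assumed to be defined elsewhere in the program:
// RSSType binds the :type path parameter, data holds the parsed website content.
route := gin.Default()
route.GET("/:type/", func(c *gin.Context) {
	var rssType RSSType
	if err := c.ShouldBindUri(&rssType); err != nil {
		// err.Error() is used because an error value does not serialize to useful JSON.
		c.JSON(http.StatusNotFound, gin.H{"msg": err.Error()})
		return
	}

	var response string
	var mimetype string
	switch strings.ToUpper(rssType.Type) {
	case "RSS":
		mimetype = "application/rss+xml"
		response = generateFeeds(data, "RSS")
	case "ATOM":
		mimetype = "application/atom+xml"
		response = generateFeeds(data, "ATOM")
	case "JSON":
		mimetype = "application/feed+json"
		response = generateFeeds(data, "JSON")
	default:
		// Unknown feed type: answer 404 instead of an empty 200 response.
		c.JSON(http.StatusNotFound, gin.H{"msg": "unsupported feed type"})
		return
	}
	c.Data(http.StatusOK, mimetype, []byte(response))
})

route.Run(":8080")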

Start the service and visit any of http://localhost:8080/rss, http://localhost:8080/atom, or http://localhost:8080/json, and you can see the feed data in the browser.
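
The path-to-format dispatch can also be checked without starting a server. The sketch below deliberately uses the standard net/http and net/http/httptest packages instead of Gin so that it has no third-party dependencies; feedMIME, feedHandler, and statusAndType are hypothetical names, and the response body is a placeholder rather than a real call to generateFeeds.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// feedMIME maps a feed type from the request path to its Content-Type.
func feedMIME(feedType string) (string, bool) {
	switch strings.ToUpper(feedType) {
	case "RSS":
		return "application/rss+xml", true
	case "ATOM":
		return "application/atom+xml", true
	case "JSON":
		return "application/feed+json", true
	}
	return "", false
}

// feedHandler writes a placeholder body; a real handler would
// generate the feed content here.
func feedHandler(w http.ResponseWriter, r *http.Request) {
	feedType := strings.Trim(r.URL.Path, "/")
	mime, ok := feedMIME(feedType)
	if !ok {
		http.NotFound(w, r)
		return
	}
	w.Header().Set("Content-Type", mime)
	fmt.Fprintf(w, "placeholder %s feed", strings.ToLower(feedType))
}

// statusAndType sends an in-process GET request and reports
// the status code and Content-Type of the response.
func statusAndType(path string) (int, string) {
	rec := httptest.NewRecorder()
	feedHandler(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Header().Get("Content-Type")
}

func main() {
	for _, p := range []string{"/rss", "/atom", "/json", "/unknown"} {
		code, contentType := statusAndType(p)
		fmt.Println(p, code, contentType)
	}
}
```

Keeping the type-to-MIME mapping in one small function makes it easy to reject unknown types before any feed is generated.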

Many RSS subscription tools, such as Reeder, can automatically detect RSS feeds from tags in a web page.

To make testing in Reeder easier, we can write the feed addresses above into an HTML page and "bind" that page to the root path /:

const hello = `<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<title>RSS Feed Discovery.</title>
	<link rel="alternate" type="application/rss+xml" title="RSS 2.0 Feed" href="http://localhost:8080/rss">
	<link rel="alternate" type="application/atom+xml" title="RSS Atom Feed" href="http://localhost:8080/atom">
	<link rel="alternate" type="application/feed+json" title="RSS JSON Feed" href="http://localhost:8080/json">
</head>
<body>
	RSS Feed Discovery.
</body>
</html>`

route.GET("/", func(c *gin.Context) {
	c.Data(http.StatusOK, "text/html", []byte(hello))
})

Re-run the program. When we enter http://127.0.0.1:8080 in Reeder, it will tell us that three feeds have been found. Because the data in the three feeds is the same, you can choose any one of them (Atom is recommended).
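
To get a feel for what such autodiscovery involves, here is a rough sketch that scans a page for link tags with rel="alternate". It relies on the lenient HTML mode of the standard encoding/xml decoder, so it only suits small, tidy pages like the one above; how Reeder actually detects feeds is not known here, and the feedLink type and discoverFeeds function are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// feedLink is one <link rel="alternate"> entry found in a page.
type feedLink struct {
	Type string
	Href string
}

// discoverFeeds walks the tokens of an HTML page and collects
// the type and href of every <link rel="alternate"> tag.
func discoverFeeds(page string) ([]feedLink, error) {
	d := xml.NewDecoder(strings.NewReader(page))
	d.Strict = false                 // tolerate HTML that is not well-formed XML
	d.AutoClose = xml.HTMLAutoClose  // treat <meta>, <link>, ... as self-closing
	d.Entity = xml.HTMLEntity        // accept HTML entities such as &nbsp;

	var found []feedLink
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return found, nil
		}
		if err != nil {
			return found, err
		}
		el, ok := tok.(xml.StartElement)
		if !ok || el.Name.Local != "link" {
			continue
		}
		var rel string
		var l feedLink
		for _, a := range el.Attr {
			switch a.Name.Local {
			case "rel":
				rel = a.Value
			case "type":
				l.Type = a.Value
			case "href":
				l.Href = a.Value
			}
		}
		if rel == "alternate" {
			found = append(found, l)
		}
	}
}

func main() {
	page := `<!DOCTYPE html><html><head><meta charset="UTF-8">
<link rel="alternate" type="application/rss+xml" href="http://localhost:8080/rss">
<link rel="alternate" type="application/atom+xml" href="http://localhost:8080/atom">
</head><body>RSS Feed Discovery.</body></html>`

	links, err := discoverFeeds(page)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, l := range links {
		fmt.Println(l.Type, l.Href)
	}
}
```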

Verify RSS feed validity with Reeder

Click the "Subscribe" button, and the information from the website will appear in Reeder's information list.

RSS information list obtained by RSS client

So far, we have a first solution to the problem raised in the first article: information sources that RSS subscription tools cannot subscribe to. As for the "keyword screening" and "NLP content summary aggregation" mentioned in the previous two articles, we will expand on them in later articles.

Other: a hidden memory leak risk

In the previous article, in order to safely run external JavaScript code that might contain an "infinite loop", we used the following code:

duration := time.Since(start)
select {
case val := <-vals:
	fmt.Fprintf(os.Stderr, "cost time: %v\n", duration)
	return val, nil
case err := <-errs:
	return nil, err
case <-time.After(JS_EXECUTE_TIMEOUT):
	vm := ctx.Isolate()
	vm.TerminateExecution()
	err := <-errs
	fmt.Fprintf(os.Stderr, "execution timeout: %v\n", duration)
	time.Sleep(JS_EXECUTE_THORTTLING)
	return nil, err
}

Today @Etran from our tinkering group pointed out a hidden memory leak: the timer created by time.After() may fire only after we have already received data from vals, so until then the timer is not released.

So, how to solve this problem? Correcting the code is simple:

duration := time.Since(start)
timeout := time.NewTimer(define.JS_EXECUTE_TIMEOUT)

select {
case val := <-vals:
	if !timeout.Stop() {
		<-timeout.C
	}
	fmt.Fprintf(os.Stderr, "cost time: %v\n", duration)
	return val, nil
case err := <-errs:
	return nil, err
case <-timeout.C:
	timeout.Stop()
	vm := ctx.Isolate()
	vm.TerminateExecution()
	err := <-errs
	fmt.Fprintf(os.Stderr, "execution timeout: %v\n", duration)
	time.Sleep(define.JS_EXECUTE_THORTTLING)
	return nil, err
}

Finally

While writing this article, I reviewed the development history of RSS and the career of its core figure, Dave Winer, and tried to describe from my own perspective the wonderful moments in the long history of RSS.

When the article was about to be published, I changed my mind. Perhaps the story of RSS should be published at the end of this series.

–EOF




This article is licensed under "Attribution 4.0 International (CC BY 4.0)". You are welcome to reprint or reuse it, but you must credit the source.

Author of this article: Su Yang

Created: December 14, 2022
Word count: 11361
Reading time: 23 minutes
Link to this article: https://soulteary.com/2022/12/14/rsscan-convert-website-information-stream-to-rss-feed-part-3.html

Origin blog.csdn.net/soulteary/article/details/128318495