Coping with Juejin turning on CDN anti-hotlinking: notes from crawling the images in Markdown files

What are the benefits of writing articles in Markdown?

  • Markdown is a plain-text format (file suffix .md). It is simple to write, needs no attention to layout, and produces clean, elegant articles
  • Markdown is open by nature: once an article is written, it can be published on any platform that supports Markdown. Platforms in China that support it include Juejin (掘金), Zhihu (知乎, imported as a document), and Jianshu (简书, originally the nicest to use, but recently going downhill)
  • On GitHub, the well-known code hosting platform, the standard README.md in every repository is a typical Markdown file

I like to write original articles in Markdown on Juejin or Jianshu, then copy and paste them into GitBook (provided GitBook has already been linked to GitHub), which publishes them to a GitHub repository. Because the content is appealing, the repository has picked up a wave of stars on GitHub (a star is the equivalent of a like).

But recently Juejin, Jianshu, and other platforms suddenly announced that images stored on their sites may no longer be hotlinked: requests for those images that come from other sites always return 404. Jianshu blocked hotlinking outright; Juejin issued a bulletin granting a one-week grace period.

What to do?

I had to save the Markdown documents locally, use a crawler to download the images they reference to local disk, upload those images to a GitHub repository (GitHub repositories accept image uploads and do not block hotlinking), and then replace the original image addresses in the saved documents with the addresses of the images in the GitHub repository.
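
The find-and-replace part of this plan can be tried on a single string before touching any files. The sketch below is only an illustration: the helper names find_image_urls and rewrite_image_urls, the sample text, and the file name a.png are made up for this example, and the pattern only handles plain ![alt](address) links (a link with a title after the address would need more care).

```python
import re

# Three groups: the "![alt](" prefix, the image address, and the closing ")"
IMAGE_LINK = re.compile(r"(!\[.*?\]\()(.*?)(\))")


def find_image_urls(md_text):
    """Return the address of every image link found in the Markdown text."""
    return [match.group(2) for match in IMAGE_LINK.finditer(md_text)]


def rewrite_image_urls(md_text, url_map):
    """Replace each image address that has an entry in url_map."""
    def swap(match):
        old_url = match.group(2)
        # Addresses without a mapping are left untouched
        return match.group(1) + url_map.get(old_url, old_url) + match.group(3)
    return IMAGE_LINK.sub(swap, md_text)


# Hypothetical sample data, for illustration only
sample = "Intro ![logo](https://example.com/a.png) and ![local](img/b.png)"
url_map = {
    "https://example.com/a.png":
        "https://raw.githubusercontent.com/zhaoolee/GraphBed/master/images/a.png",
}

print(find_image_urls(sample))
print(rewrite_image_urls(sample, url_map))
```

Rewriting only the address group, rather than calling str.replace on the whole document, avoids touching the same address where it appears in ordinary text.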

First, create a new repository on GitHub named GraphBed to store the images

  • Clone the repository into the local /Users/lijianzhao/github folder
cd /Users/lijianzhao/github
git clone https://github.com/zhaoolee/GraphBed.git

Make sure you have permission to push to GitHub from this folder. For how to add that permission, see https://www.jianshu.com/p/7167122783b5

Clone the GitHub repository that holds the articles' .md files to the local machine (the StarsAndClown repository is used as the example here)

git clone https://github.com/zhaoolee/StarsAndClown.git

Write the Python script md_images_upload.py

This script:

  • Searches the current directory for all .md files and downloads every image they reference into the /Users/lijianzhao/github/GraphBed/images directory;
  • Once the images are downloaded, automatically pushes everything in the /Users/lijianzhao/github/GraphBed/images directory to GitHub;
  • Replaces the original image addresses with the new GitHub addresses;
  • Done.
import os
import imghdr  # note: deprecated in Python 3.11 and removed in 3.13
import re
import requests
import shutil
import git
import hashlib
import uuid

# GitHub user name
user_name = "zhaoolee"
# Repository name
github_repository = "GraphBed"

# Location of the git repository on this machine
git_repository_folder = "/Users/lijianzhao/github/GraphBed"

# Folder inside the git repository that stores the images
git_images_folder = "/Users/lijianzhao/github/GraphBed/images"

# Directories to ignore
ignore_dir_list = [".git"]

# Request headers
headers = {
    # Set a browser User-Agent (dress the wolf in sheep's clothing)
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}


# Build a file name from the md5 digest of the given url
def create_name(src_name):
    src_name = src_name.encode("utf-8")
    s = hashlib.md5()
    s.update(src_name)
    return s.hexdigest()

# Collect every md file under the given directory
def get_md_files(md_dir):
    md_files = []
    for root, dirs, files in sorted(os.walk(md_dir)):
        for file in files:
            # Only keep files ending in .md
            if file.endswith(".md"):
                file_path = os.path.join(root, file)
                print(file_path)
                # Skip files that live inside an ignored directory
                path_parts = os.path.normpath(file_path).split(os.sep)
                if not any(ignore_dir in path_parts for ignore_dir in ignore_dir_list):
                    md_files.append(file_path)
    return md_files

# Download an image from the network
def get_http_image(image_url):
    file_uuid_name = create_name(image_url)
    image_data = requests.get(image_url, headers=headers).content
    # Write the data to a temporary file that has no extension yet
    tmp_new_image_path_and_name = os.path.join(git_images_folder, file_uuid_name)
    with open(tmp_new_image_path_and_name, "wb+") as f:
        f.write(image_data)
    # Detect the image type to choose the extension
    img_type = imghdr.what(tmp_new_image_path_and_name)
    if img_type is None:
        img_type = ""
    else:
        img_type = "." + img_type
    # New name with the extension appended
    new_image_path_and_name = tmp_new_image_path_and_name + img_type
    # Rename the image
    os.rename(tmp_new_image_path_and_name, new_image_path_and_name)

    # Build the GitHub address of the image
    new_image_url = "https://raw.githubusercontent.com/" + user_name + "/" + github_repository + "/master/" + os.path.basename(git_images_folder) + "/" + os.path.basename(new_image_path_and_name)
    image_info = {
        "image_url": image_url,
        "new_image_url": new_image_url
    }
    print(image_info)

    return image_info


# Copy a local image into the image folder
def get_local_image(image_url):
    image_info = {"image_url": "", "new_image_url": ""}
    try:
        # Create a unique file name
        file_uuid_name = uuid.uuid4().hex
        # Take the image type from the file extension
        img_type = image_url.split(".")[-1]
        # New image name with its extension
        image_name = file_uuid_name + "." + img_type
        # New image path and name
        new_image_path_and_name = os.path.join(git_images_folder, image_name)
        shutil.copy(image_url, new_image_path_and_name)
        # Build the GitHub address of the image
        new_image_url = "https://raw.githubusercontent.com/" + user_name + "/" + github_repository + "/master/" + os.path.basename(git_images_folder) + "/" + image_name
        # Image information
        image_info = {
            "image_url": image_url,
            "new_image_url": new_image_url
        }
        print(image_info)
        return image_info
    except Exception as e:
        print(e)

    return image_info
    
# Crawl the images referenced by a single md file
def get_images_from_md_file(md_file):
    image_info_list = []
    with open(md_file, "r") as f:
        md_content = f.read()
    image_urls = re.findall(r"!\[.*?\]\((.*?)\)", md_content)
    for image_url in image_urls:
        # Local image
        if not image_url.startswith("http"):
            image_info = get_local_image(image_url)
            image_info_list.append(image_info)
        # Network image; the shields.io svg badges are not crawled
        elif not image_url.startswith("https://img.shields.io"):
            try:
                image_info = get_http_image(image_url)
                image_info_list.append(image_info)
            except Exception as e:
                print(image_url, "could not be crawled, skipping!", e)
    # Swap each original address for its new GitHub address
    for image_info in image_info_list:
        # A failed local copy leaves the entry empty, so skip it
        if image_info["image_url"]:
            md_content = md_content.replace(image_info["image_url"], image_info["new_image_url"])

    print("After replacement:", md_content)

    with open(md_file, "w") as f:
        f.write(md_content)


def git_push_to_origin():
    # Commit the images and push them to the GitHub repository
    repo = git.Repo(git_repository_folder)
    print("Repository opened", repo)
    index = repo.index
    index.add(["images/"])
    print("add succeeded")
    index.commit("Add new images")
    print("commit succeeded")
    # Get the remote repository
    remote = repo.remote()
    print("Remote repository", remote)
    remote.push()
    print("push succeeded")

def main():
    # Create the image folder if it does not exist yet
    os.makedirs(git_images_folder, exist_ok=True)
    # Collect every md file under the current directory
    md_files = get_md_files("./")

    # Process the md files one by one
    for md_file in md_files:
        # Crawl the images referenced by this md file
        get_images_from_md_file(md_file)

    git_push_to_origin()


if __name__ == "__main__":
    main()
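
The script picks the file extension with imghdr, a standard-library module that was deprecated in Python 3.11 and removed in Python 3.13. On newer interpreters the same idea can be approximated by checking the first bytes of the downloaded data. The sketch below is an assumption-laden stand-in, not part of the original script: sniff_image_extension is a made-up helper name, and it only recognises the four formats listed.

```python
def sniff_image_extension(data: bytes) -> str:
    """Guess a file extension from the leading bytes of image data.

    Returns "" when the format is not recognised, mirroring how the
    script leaves the name without an extension in that case.
    """
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    # WebP files are RIFF containers with "WEBP" at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    return ""


# Hand-made byte strings standing in for downloaded image data
print(sniff_image_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sniff_image_extension(b"GIF89a" + b"\x00" * 8))
print(sniff_image_extension(b"not an image"))
```

Because the check works on bytes already in memory, it could run before the file is written, which would remove the need for the temporary file and the rename.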
Several optimization points:
  • Local images referenced by md files are supported too (from now on you can write Markdown files locally and, when finished, run the script above: the local images the md files reference are uploaded to GitHub automatically, and the local references are replaced with the images' online GitHub addresses)
  • To prevent duplicate image names, images were at first renamed with a uuid. It later turned out that uuid names cause the same network image to be crawled and stored again on every run, so network images are now named after the md5 digest of their url, which keeps the same image from being stored under different names
  • Local images still get uuid names to prevent clashes (people tend to reuse common names such as 001.png and 002.png)
  • The type of each crawled image is detected, and the file extension is filled in automatically
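
The difference between the two naming schemes can be seen in a few lines. This is a small illustration rather than code from the script: name_from_url and name_random are made-up helper names, and the url is a placeholder.

```python
import hashlib
import uuid


def name_from_url(url):
    """Deterministic name: the same url always gives the same digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def name_random():
    """Random name: every call gives a new value."""
    return uuid.uuid4().hex


url = "https://example.com/pic.png"  # placeholder address

# Crawling the same url twice maps to one file name, so the second
# run overwrites the first file instead of storing a duplicate
print(name_from_url(url) == name_from_url(url))

# Two random names differ, so each run would store another copy
print(name_random() == name_random())
```

The md5 digest is used here only to derive a stable file name, not for security. Two different urls that serve the same image still produce two files, since the name depends on the address and not on the image bytes.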

Instructions

  1. Install python3

For installation instructions, see "Building a Python environment for data mining"

  2. Put the script md_images_upload.py into the /Users/lijianzhao/github/GraphBed directory (you can use a directory of your own, but then the parameters in the first few lines of the script must be changed to match)
  3. Install the dependencies from the command line (the git module that the script imports is provided by the GitPython package)
pip3 install requests
pip3 install gitpython
  4. On the command line, change into /Users/lijianzhao/github/GraphBed
cd /Users/lijianzhao/github/GraphBed
  5. Run the script
python3 md_images_upload.py
3203841-e31ef9e0eafca502.gif

This was my second run of the replacement, so in the animation above the original image addresses are already GitHub addresses, which shows that the first run of the script completed the replacement successfully.

The images also display correctly
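
A quick way to confirm that a run replaced everything is to list the image addresses that still point somewhere other than GitHub. The sketch below is a hypothetical check, not part of the original script: unmigrated_image_urls is a made-up name and the sample text is invented. It reuses the script's link pattern and its rule of leaving img.shields.io badges alone.

```python
import re

GITHUB_RAW_PREFIX = "https://raw.githubusercontent.com/"
SKIPPED_PREFIX = "https://img.shields.io"  # badges the script never crawls


def unmigrated_image_urls(md_text):
    """Return image addresses that were not replaced with GitHub ones."""
    urls = re.findall(r"!\[.*?\]\((.*?)\)", md_text)
    return [
        url for url in urls
        if not url.startswith((GITHUB_RAW_PREFIX, SKIPPED_PREFIX))
    ]


# Invented sample: one migrated image, one leftover, one badge
sample = (
    "![ok](https://raw.githubusercontent.com/zhaoolee/GraphBed/master/images/x.png)\n"
    "![old](https://example.com/old.png)\n"
    "![badge](https://img.shields.io/badge/a-b-green.svg)\n"
)

print(unmigrated_image_urls(sample))
```

An empty list for every md file means no hotlinked images remain.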


Reproduced from: https://www.jianshu.com/p/01b418642014

Origin blog.csdn.net/weixin_33743248/article/details/91079533