[GitHub repo introduction] Code Farmer Weekly: weekly featured picks, organized by category

The following is excerpted from my GitHub repo that organizes Code Farmer Weekly's weekly featured picks by category; stars are welcome.

A few words up front

As one of the earliest subscribers to Code Farmer Weekly, I can't claim to have gone through its growth, but I did witness it. Code Farmer Weekly is basically all solid, practical content. From the first issue's weekly picks onward, I have genuinely read every issue's featured list. However, as the weekly picks accumulated, the total became quite large; at the time of writing there are 280 issues. So I wondered: could they be categorized, so that people can make use of these carefully curated resources more efficiently?

What I want to do

For this repo, what I want to do boils down to the following two points:

  • Get all the content from the first issue of Code Farmer Weekly up to the latest issue, and present it organized by category
  • Handle future updates for each new issue with as little human intervention as possible

The goal was set. Considering that I'm a programmer, and adhering to the principle that whatever a machine can do for me I shouldn't do laboriously by hand, and that this isn't a particularly difficult task, I set off down the road of tinkering.

How I thought about it

I figured that, since I had written a fair amount of code either way, not talking through my reasoning would make my technical path hard to retrace. So this part records the thinking behind all the code I wrote for this repo from the very beginning. It has nothing to do with running the code, but I think it helps a lot in understanding it.

All beginnings are hard

Actually, if you approach this from scratch, the idea is very intuitive; thinking in a stream-of-consciousness way basically leads you to idea one. But as I thought it through more deeply and got hands-on, idea two turned out to be the better choice.

Idea one:

  • Use a crawler to fetch the content of each issue of the weekly, one title and link at a time
  • Check every link, mainly to remove broken ones
  • For each link, find its category keyword according to the rules in the configuration file
  • Write it into the corresponding md file according to its category

Idea two:

  • Use a crawler to fetch the content of each issue of the weekly, writing all titles and links into a text file in one pass
  • Read that text file and classify the entries according to the category keywords in the configuration file
  • Once classified, produce all the categorized md files
  • Process all the md files, removing dead links, to form the final md files

Why I went with idea two mainly comes down to the following points:

  • In idea two, each step is relatively independent; each file handles one small, simple function: keep it simple, stupid
  • In this process, the most time-consuming part is probing whether each link is still valid. With idea one, you basically have to wait for everything to finish before you have final files you can git push; with idea two, you can push the files that are already complete while the rest are still being processed

Then it gets harder

I have always adhered to pragmatism: software is only a tool for materializing ideas, and the point is not virtuosity but meeting the need. Fine, I can already go from zero to a full set of markdown files; the next question is how to update. First, let's be clear about what is already in place at update time:

  • A number of well-categorized md files
  • A readme.md file that explains what this whole thing does

My main task when updating is this: a new issue comes out, and the update has to be done properly, producing well-formatted md with a minimum of effort. So my idea is:

  • Fetch the content of the latest issue.
  • Run the same process on it as before.
  • Merge the result into the existing md files by file name.

A word about the category keyword file

This repo has one core file, which is described in detail in a later section, but the problem I faced at the beginning was how to classify at all: which keywords to look for? I thought of word segmentation, but it really wasn't as easy to use as I had hoped; I can't say it was useless, it just served as a reference. So I took the segmentation results and, most importantly, added genuine "artificial intelligence" (that is, human intelligence) to roughly come up with keywords for each category. This basically only needs to be done once, because the categories don't change drastically unless our language itself changes. So there is no tool for this step in the code, but rest assured, the category configuration json file is there.
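As an illustration of that semi-automatic first step, here is a minimal sketch of mining candidate keywords from the collected titles by word segmentation. It assumes the jieba library, which the article does not actually name, and takes a plain list of title strings; the real keyword list was ultimately hand-curated as described above.

import collections
import jieba

def suggest_keywords(titles, top_n=30):
    """Count segmented words across titles to surface candidate category keywords."""
    counter = collections.Counter()
    for title in titles:
        # jieba.cut segments Chinese text; English words mostly pass through intact
        words = (w.strip().lower() for w in jieba.cut(title))
        counter.update(w for w in words if len(w) > 1)
    return counter.most_common(top_n)

if __name__ == "__main__":
    samples = ["深入理解 JVM 内存模型", "Modern C++ tricks you should know"]  # made-up titles
    print(suggest_keywords(samples))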

In addition, while doing the classification I found I had quite a feel for it. So, allow me to brag a little: I also started applying actual artificial intelligence, real AI this time, to derive a more reasonable and efficient classification for computer topics from a sample of nearly 20,000 items. Not for any other reason than this: whenever I buy computer books online, I find the classification isn't all that reasonable; maybe one day I can produce a set of "industry" standards?

How I did it

As for the concrete implementation, following my initial thinking, I wrote these files:

  • GetAllTitles.py -> This code crawls the weekly's content; in more formal terms, it is the crawler part
  • ExtractMD.py -> This takes the content the crawler saved, formats it, and organizes it into markdown files by category based on the configuration file; this is where the main content gets presented
  • Erase404.py -> The name isn't entirely accurate, since in the code I actually keep the links that respond with 200; as the name suggests, it weeds out links that may have fallen into disrepair or whose content has been moved away
  • MergeFiles.py -> Written specifically for updates; its function is simple: merge the latest issue's markdown files into the existing, already-categorized big markdown files
  • OmitDup.py -> Added later; its purpose is to remove duplicate entries in the markdown files
  • category.json -> The legendary classification configuration file, combining automatic word segmentation with human intelligence

All the files are completely standalone, simple, and deliberately dumb in functionality, with no interdependencies, so none of them affects another's behaviour. Personally, I really like this design philosophy of not making things more complicated than necessary. Since the code has no comments, let me say a bit about how each of these files is actually used.

GetAllTitles.py

Since I brought in the command-line gem click, the script is easier both to write and to use. Running python3 GetAllTitles.py --help shows that the script has three sub-commands:

  • new -> fetch the content of every issue of Code Farmer Weekly, from the first issue to the latest
  • update -> specify a starting and ending issue number and fetch the content of that range; if the starting issue is not given, it starts from the first issue, and if the ending issue is not given, it fetches up to the latest
  • latest -> fetch the content of the latest issue

For the crawler part, I tried several approaches. Why use selenium instead of the libraries people usually reach for when crawling? The reason is that I started with requests, but the weekly's domain has a certificate problem, so requests raises an error. I tried several fixes, hovered over the problem without managing to solve it, and simply switched to another framework.
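A minimal sketch of how the new sub-command could be wired up with click and selenium; the other two sub-commands would follow the same pattern. The issue URL pattern and the CSS selector below are placeholders of my own, not the weekly's real page structure, and the actual GetAllTitles.py may differ in many details.

import click
from selenium import webdriver
from selenium.webdriver.common.by import By

ISSUE_URL = "https://example.com/weekly/{n}"   # placeholder, not the real issue URL pattern

def fetch_issue(driver, n):
    """Return (title, link) pairs from one issue's page; the selector is a guess."""
    driver.get(ISSUE_URL.format(n=n))
    return [(a.text, a.get_attribute("href"))
            for a in driver.find_elements(By.CSS_SELECTOR, "a.post-title")]

@click.group()
def cli():
    """Crawl the weekly and dump 'issue $ url $ title' lines."""

@cli.command()
@click.option("--fname", default="allLists.txt",
              help="The output file name for all content.")
def new(fname):
    """Fetch every issue from the first to the latest."""
    driver = webdriver.Chrome()
    try:
        with open(fname, "w", encoding="utf-8") as out:
            n = 1
            while True:
                pairs = fetch_issue(driver, n)
                if not pairs:                 # crude stop condition: an empty issue page
                    break
                for title, url in pairs:
                    out.write(f"{n} $ {url} $ {title}\n")
                n += 1
    finally:
        driver.quit()

if __name__ == "__main__":
    cli()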

In fact, what each sub-command does can be seen with --help; for example, python3 GetAllTitles.py new --help shows the following:

Usage: GetAllTitles.py new [OPTIONS]
Options:
   --fname TEXT  The output file name for all content, default file name is allLists.txt.
   --help        Show this message and exit.

Simply put, what happens if you run python3 GetAllTitles.py new? In theory, if you leave it unattended for a while, you will find a file named allLists.txt appear in your code directory, with contents in this format:

issue number $ url $ title

Why use $ as the separator? For one, it expresses a yearning for the finer things; for another, and more importantly, my first choice was the most common comma, but in practice I found that many titles themselves contain commas, which broke the formatting of the categorized markdown files in a later step. So I switched to the very popular, ahem, wealthy-looking $ symbol.
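To make the format concrete, a line like the following (with made-up values) splits cleanly on $ even though the title itself contains a comma:

line = "280 $ https://toutiao.io/posts/example $ Profiling, tracing, and other dark arts"
issue, url, title = [field.strip() for field in line.split("$")]
print(issue)   # 280
print(title)   # Profiling, tracing, and other dark arts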

The other two sub-commands work in much the same way, so I won't repeat them.

ExtractMD.py

The flow of this file is quite clear (a rough sketch follows the list):

  • Read the saved crawler output file and pick out the titles
  • Read the filters and rejecters from category.json; why these two gadgets exist is explained in the next section
  • Format the entries by category into strings and store them as markdown files
  • Record any content not assigned to a category; it is automatically saved to a file called uncategorized1.md, and manually reviewing those articles and updating category.json with the results makes the classification configuration more accurate and reasonable with each round
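Here is that sketch, under a few assumptions: input lines use the issue number $ url $ title format described above, category.json has the structure shown in the next section, and the markdown line layout is my own invention. The real ExtractMD.py may differ.

import json

def load_categories(path="category.json"):
    with open(path, encoding="utf-8") as f:
        return json.load(f)["root"]

def pick_category(title, categories):
    """Return the first category whose filters match the title and whose rejecters do not."""
    low = title.lower()
    for cat in categories:
        if any(r in low for r in cat["rejecters"]):
            continue
        if any(k in low for k in cat["filters"]):
            return cat
    return None

def extract(fname="allLists.txt"):
    categories = load_categories()
    buckets = {}                                   # output file name -> list of markdown lines
    with open(fname, encoding="utf-8") as f:
        for raw in f:
            # split at most twice, in case a title itself contains a "$"
            issue, url, title = [p.strip() for p in raw.split("$", 2)]
            cat = pick_category(title, categories)
            out = cat["fileName"] if cat else "uncategorized1.md"
            buckets.setdefault(out, []).append(f"* [{title}]({url}) (issue {issue})\n")
    for name, lines in buckets.items():
        with open(name, "w", encoding="utf-8") as f:
            f.writelines(lines)

if __name__ == "__main__":
    extract()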

Running python3 ./ExtractMD.py --help shows the following help:

Usage: ExtractMD.py [OPTIONS]
Options:
  --fname TEXT    The raw file name for crawling file,default is allList.txt.
  --filters TEXT  Input the filers that you need, seperate by ',', if no keywords, means use all filters by default.
  --help          Show this message and exit.

The filters are the category names you want to extract, such as "c++, java, big data" and so on; these names can be looked up in category.json, described in the next section. If you don't pass this parameter, it tries to classify against every category listed in category.json.

category.json

This should count as the core component of the repo: the program classifies entries according to this json file. Here is an excerpt for the convenience of description:

{
    "root":[
        {
            "keywords":"C++",
            "filters":["c++","c 语言"],
            "rejecters":[],
            "fileName":"CppLinks.md"
        },
        {
            "keywords":"Java",
            "filters":["java","jar","jvm","jdk"],
            "rejecters":["招聘","bjarne","javascript"],
            "fileName":"JavaLinks.md"
        }
    ]
}

There are four main elements, and they work as follows:

  • keywords: the category name; it is not used directly in the code, it mainly serves as a label for people
  • filters: if the title contains any one of these keywords, it falls into this category
  • rejecters: if the title contains any of these keywords, it must not belong to this category
  • fileName: the name of the markdown file that is ultimately produced for this category

The one worth extra explanation is rejecters; there is really no way around it. Take Java as an example: the keyword jar belongs to the Java category, but the name of the famous father of C++, bjarne, happens to contain jar. I felt exact whole-word matching wasn't necessary, so I use the rejecter mechanism instead.
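Assuming plain substring matching on the lowercased title, which is what the jar-inside-bjarne example implies, the rule plays out like this:

title = "An interview with Bjarne Stroustrup".lower()
java = {"filters": ["java", "jar", "jvm", "jdk"],
        "rejecters": ["招聘", "bjarne", "javascript"]}

hits_filter = any(k in title for k in java["filters"])      # True: "jar" occurs inside "bjarne"
hits_rejecter = any(r in title for r in java["rejecters"])  # True: "bjarne" is present
belongs_to_java = hits_filter and not hits_rejecter         # False: the rejecter wins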

It should be stressed that the classification here carries strong personal preference and the limitations of my own experience; you are very welcome to propose additions to it.

For this json file I have some ideas about extending it, which I'll discuss all together in the last section.

Erase404.py

This script removes urls whose content is unreachable. Counting from the first issue of the weekly, six years have already passed, back when QR-code payment was only starting to become common, so presumably many links have fallen into disrepair, or their authors deleted their dark history on a whim.

The method is very simple: use the requests library to get each url's response code. But, as I said in the first section, the weekly's domain has a certificate problem with the requests library, which raises a certificate error. What to do? I chose the easiest way: if a url contains the weekly's own domain name, which is toutiao.io, I assume by default that the page is reachable. Running python3 ./Erase404.py --help shows the following:

Usage: Erase404.py [OPTIONS]
Options:
   --folder TEXT  The folder that contains markdown files to be processed, default is current folder.
   --help         Show this message and exit.

There is only one parameter: the folder containing the markdown files to be processed, defaulting to the current folder. The processed files are placed in a new folder called filtered under that folder. Since this counts as an IO-intensive operation, multi-threading improves its speed considerably.
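A minimal sketch of the check described here, assuming one GET request per link, the toutiao.io shortcut, a naive regular expression for pulling urls out of markdown lines, and a thread pool over files; the real Erase404.py may differ.

import concurrent.futures
import os
import re
import requests

LINK = re.compile(r"\((https?://[^)]+)\)")       # naive markdown link extractor (an assumption)

def is_alive(url):
    if "toutiao.io" in url:                      # requests trips over this domain's certificate,
        return True                              # so treat it as reachable by default
    try:
        return requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        return False

def filter_file(folder, name):
    """Copy one markdown file into folder/filtered, dropping lines whose link is dead."""
    out_dir = os.path.join(folder, "filtered")
    os.makedirs(out_dir, exist_ok=True)
    kept = []
    with open(os.path.join(folder, name), encoding="utf-8") as f:
        for line in f:
            m = LINK.search(line)
            if m is None or is_alive(m.group(1)):
                kept.append(line)
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        f.writelines(kept)

def erase404(folder="."):
    names = [n for n in os.listdir(folder) if n.endswith(".md")]
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(lambda n: filter_file(folder, n), names))

if __name__ == "__main__":
    erase404()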

MergeFiles.py

This file simply merges files with the same name from two folders. The purpose is to make it easy, at update time, to automatically merge the freshly generated, already-categorized markdown files into the existing markdown files. Running python3 ./MergeFiles.py --help shows the following (a rough sketch of the merge appears after the help text):

Usage: MergeFiles.py [OPTIONS]
Options:
  --src TEXT  Sorce folder that contains markdown files to be merged.
  --dst TEXT  Destination folder that contains existing,categorized markdown files.
  --help      Show this message and exit.
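A minimal sketch of the merge, appending each same-named markdown file from the source folder onto the one in the destination folder; the real MergeFiles.py may handle formatting and ordering differently.

import os

def merge_folders(src, dst):
    """Append each src/NAME.md onto dst/NAME.md, creating the destination file if needed."""
    for name in os.listdir(src):
        if not name.endswith(".md"):
            continue
        with open(os.path.join(src, name), encoding="utf-8") as f:
            new_content = f.read()
        with open(os.path.join(dst, name), "a", encoding="utf-8") as f:
            f.write(new_content)

if __name__ == "__main__":
    merge_folders("latest", ".")    # folder names here are made up for illustration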

OmitDup.py

This file was added later, because I sometimes found duplicate entries in the markdown files built earlier. My guess is that when the editors curate articles, it is hard to remember what was already picked before. For example, one article was first selected in issue 16 and then again in issue 36, a gap of 20 issues; given the weekly's publishing frequency, that is at least five months, and expecting anyone to remember content from that long ago is clearly unrealistic. Deduplication could in fact have been done from the start, but that is hindsight talking, since I only discovered the duplicates after a large pile of files had already been tidied up. As I said earlier, keeping to the keep-it-simple-stupid principle, I chose to put this function in its own file.

Usage is very simple; python3 ./OmitDup.py --help shows the following:

Usage: OmitDup.py [OPTIONS]
Options:
  --folder TEXT  the folder name to omit duplications of markdowns.
  --help         Show this message and exit.

The program searches the given folder path for all files ending in md and then deduplicates each of them; it is, of course, multi-threaded.
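A minimal sketch of that per-file deduplication, keeping the first occurrence of each line and fanning the files out over a thread pool; the real OmitDup.py may differ.

import concurrent.futures
import os

def dedup_file(path):
    """Rewrite one markdown file in place, keeping only the first occurrence of each line."""
    seen = set()
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip() and line in seen:
                continue
            seen.add(line)
            kept.append(line)
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)

def omit_dup(folder="."):
    paths = [os.path.join(folder, n) for n in os.listdir(folder) if n.endswith(".md")]
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(dedup_file, paths))

if __name__ == "__main__":
    omit_dup()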

How to link them together

Here I'll give the Raspberry Pi a strong plug. The four files above can obviously be automated; only the final git push needs a bit more care. This kind of quiet data gathering plus light processing suits the Raspberry Pi very well: you only need the cron program on linux to run these scripts on a schedule, saving both time and electricity.
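As a sketch only, a crontab entry on the Raspberry Pi might look like the following; the path and schedule are made up for illustration and are not from the repo:

# fetch the latest issue every Monday at 03:00 and log the output
0 3 * * 1  cd /home/pi/weekly-repo && python3 GetAllTitles.py latest >> cron.log 2>&1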


Source: www.cnblogs.com/ZXYloveFR/p/12000635.html