Three platforms:
China University MOOC: https://www.icourse163.org/university/view/all.htm
XuetangX (School Online): http://www.xuetangx.com/partners
CNMOOC (Good University Online): https://www.cnmooc.org/school/view/list.mooc
Idea: obtain the university names from the three sources, clean and deduplicate them, then count the result.
Steps:
First, decide whether to collect the data by hand or with a crawler.
(Do not do by hand what suits a crawler, and do not write a crawler for what suits manual work.)
Suitable for manual collection: one-time tasks with a small amount of uncomplicated data.
Suitable for a crawler: periodic collection, crawling many sites, data that is complex but regularly structured.
This article uses the manual method.
For China University MOOC:
Save the HTML source of the page as text in the file icourses.txt.
Write icourses.py, which extracts the university information and prints the university names separated by spaces.
    fi = open("D:/icourses.txt", "r")
    ls = []
    for line in fi:
        if "alt" in line:
            ls_temp = line.split('"')
            uName = ls_temp[-2]  # the last quoted value on the line, i.e. the alt text
            # Skip entries such as the organizing committee of the national
            # undergraduate mathematical modeling contest, whose name contains "学生" ("student")
            if "学生" in uName:
                continue
            if "大学" in uName or "学院" in uName:  # "university" or "college/institute"
                ls.append(uName)
    str_ls = " ".join(ls)  # join the names into one space-separated string
    print(str_ls)
    print(len(ls))
    fi.close()

The result is: 287
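To see why index -2 picks out the university name, it helps to look at what splitting on double quotes does to a single image tag. The sample line below is made up for illustration (the real markup in icourses.txt may differ); the only assumption is that the alt attribute is the last quoted value on the line.

```python
# Hypothetical line from the saved HTML; the real markup may differ.
sample_line = '<img class="u-img" src="logo.png" alt="北京大学">\n'

pieces = sample_line.split('"')
# Splitting on '"' alternates between text outside and inside the quotes.
# The last piece is whatever follows the final closing quote (here '>\n'),
# so the second-to-last piece is the last quoted value: the alt text.
name = pieces[-2]

print(pieces)
print(name)
```

If some other quoted attribute followed alt on the same line, index -2 would return that attribute's value instead, so it is worth spot-checking a few lines of the saved file before trusting the count.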
For XuetangX (School Online):
This time we take a different approach:
Copy the page content directly from the browser and save it in the file xuetangx.txt.
Write xuetangx.py, which extracts the university information and prints the university names separated by spaces.
The text in this file is quite messy and contains a lot of duplicate information.
    fi = open("D:/xuetangx.txt", "r")
    U = set()  # empty set; adding a name that is already present has no effect
    for line in fi:
        if "慕课" in line:  # skip lines that mention "MOOC"
            continue
        if "大学" in line or "学院" in line:  # "university" or "college/institute"
            U.add(line.strip("\n"))
    print(" ".join(U))  # one space-separated string
    print(len(U))
    fi.close()

The result is: 166
Note that the data in this file is not very regular (text copied from the page is less well structured than the HTML source), so a list is no longer appropriate; a set is used instead because it removes duplicates automatically.
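The difference between collecting names into a list and into a set can be seen on a small sample. The lines below are invented for illustration and are not taken from xuetangx.txt.

```python
# Illustrative lines, as they might appear in text copied from a web page.
lines = ["清华大学\n", "北京大学\n", "清华大学\n", "慕课平台\n", "清华大学\n"]

as_list = []
as_set = set()
for line in lines:
    if "大学" in line or "学院" in line:
        name = line.strip("\n")
        as_list.append(name)  # a list keeps every occurrence
        as_set.add(name)      # adding an existing element changes nothing

print(len(as_list))
print(len(as_set))
print(sorted(as_set))
```

A set has no defined order, so the joined output can differ from run to run; sorting the names first gives a stable result.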
For CNMOOC (Good University Online):
Copy the page content directly from the browser and save it in the file cnmooc.txt.
Write cnmooc.py, which extracts the university information and prints the university names separated by spaces.
    fi = open("D:/cnmooc.txt", "r", encoding="utf-8")
    U = set()  # empty set; adding a name that is already present has no effect
    for line in fi:
        if "大学" in line or "学院" in line:  # "university" or "college/institute"
            U.add(line.strip("\n"))
    print(" ".join(U))  # one space-separated string
    print(len(U))
    fi.close()

Result: 101
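The XuetangX and CNMOOC scripts differ only in the file name, the encoding, and the excluded keyword, so the shared logic could be factored into one function. This is only a sketch and not part of the original scripts: the name extract_names and its parameters are invented here, and the function takes any iterable of lines so that it can be tried without the actual files.

```python
def extract_names(lines, keywords=("大学", "学院"), exclude=()):
    """Return the set of stripped lines that contain any keyword
    and none of the excluded substrings."""
    names = set()
    for line in lines:
        if any(word in line for word in exclude):
            continue
        if any(word in line for word in keywords):
            names.add(line.strip("\n"))
    return names


# Made-up sample standing in for the contents of a copied text file.
sample = ["上海交通大学\n", "慕课大学计划\n", "课程列表\n", "上海交通大学\n"]
names = extract_names(sample, exclude=("慕课",))
print(sorted(names))
```

Because an open file is itself an iterable of lines, the same function could be passed the object returned by open("D:/cnmooc.txt", "r", encoding="utf-8") directly.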
Remaining problem: combine the results of the three programs and compute the final count.
Summarizing the results: manual or automated?
If a program will only be run a few times, there is no need to pursue perfection; doing part of the work by hand is fine.
If making everything fully automatic would make the program very complicated, the effort outweighs the benefit.
So we first copy the results of the three programs by hand into a new program as its input:
    # Each string is the space-separated output of one program (truncated here)
    ic = "Peking ..."
    xt = "Chinese entrepreneurs Institute Beijing Institute of Physical ..."
    cm = "Tsing Hua University ..."
    U = set()

    U |= set(ic.split())  # split ic into a list, convert it to a set, and union it into U
    U |= set(xt.split())  # same for xt
    U |= set(cm.split())  # same for cm

    ls = list(U)  # convert the set to a list so that it can be sorted
    ls.sort()
    print(" ".join(ls))
    print(len(ls))

Result: 352
Note: if any entry in the result looks doubtful, review it by hand and remove it.
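The merge step can be tried on short strings. The sketch below also shows one possible way to drop entries that a manual review rejects; all sample names and the rejected set are assumptions made up for illustration, not data from the three sites.

```python
# Made-up stand-ins for the three space-separated result strings.
ic = "北京大学 清华大学 复旦大学"
xt = "清华大学 示例培训学院"
cm = "复旦大学 上海交通大学 清华大学"

U = set()
for text in (ic, xt, cm):
    U |= set(text.split())  # union: names already in U are not added again

# Hypothetical outcome of a manual review: entries judged not to be universities.
rejected = {"示例培训学院"}
U -= rejected  # set difference removes them

ls = sorted(U)  # sorted() returns a new sorted list
print(" ".join(ls))
print(len(ls))
```

Keeping the rejected names in a separate set leaves a record of what was removed by hand, which makes the final count easier to re-check later.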
The main purpose of this example is to practice the various operations on text data,
including counting, deduplication, and other text manipulation.