How many Chinese universities participate in building MOOCs?

Three platforms:
	China University MOOC: https://www.icourse163.org/university/view/all.htm
	XuetangX (School Online): http://www.xuetangx.com/partners
	CNMOOC (Good University Online): https://www.cnmooc.org/school/view/list.mooc
    

  

Idea: collect the university names from the three sources, clean and deduplicate them, then count them to get the result.

Steps:
First, decide whether to collect the data with a crawler or by hand.
(Do not do by hand what suits a crawler, and do not write a crawler for what suits manual work.)
Suitable for manual work: one-off tasks with a small amount of uncomplicated data.
Suitable for a crawler: periodic collection, many sites to crawl, data that is complex but regularly structured.
This article uses the manual approach.

For China University MOOC:
Save the page's HTML source as text in the file icourses.txt.
Write icourses.py to extract the university information and print the university names separated by spaces.

 

# Keywords appear here in English translation; the page text itself is presumably Chinese.
fi = open("D:/icourses.txt", "r")
ls = []
for line in fi:
    if "alt" in line:                  # only lines that carry an alt attribute
        ls_temp = line.split('"')
        uName = ls_temp[-2]            # text inside the last pair of double quotes
        if "students" in uName:        # drop entries such as "National mathematical contest in modeling Organizing Committee for the students"
            continue
        if "University" in uName or "Institute" in uName:
            ls.append(uName)
str_ls = " ".join(ls)                  # join the names into one space-separated string
print(str_ls)
print(len(ls))
fi.close()

The result is: 287
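To see why ls_temp[-2] picks out the name, here is a minimal sketch run on made-up lines. The markup in sample_lines is hypothetical and only assumes that alt is the last quoted attribute on the line; the real page source may look different. The helper name extract_names is also just for illustration.

```python
# Hypothetical sample lines; the real page markup may differ.
sample_lines = [
    '<img class="logo" src="a.png" alt="Example University">',
    '<img class="logo" src="b.png" alt="Example Institute of Technology">',
    '<img class="logo" src="c.png" alt="Modeling Contest Committee for students">',
    '<div class="footer">no image here</div>',
]


def extract_names(lines):
    """Return the alt text of lines that look like university logos."""
    names = []
    for line in lines:
        if "alt" not in line:
            continue
        # Splitting on the double quote leaves the last quoted value
        # in the second-to-last position of the resulting list.
        parts = line.split('"')
        name = parts[-2]
        if "students" in name:  # skip organizing committees and the like
            continue
        if "University" in name or "Institute" in name:
            names.append(name)
    return names


names = extract_names(sample_lines)
print(" ".join(names))
print(len(names))
```

If a line has another quoted attribute after alt, index -2 would return that attribute instead, so this trick depends on the layout of the saved file.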

For XuetangX (School Online):
Here we use a different approach:
Copy the page content directly from the browser and save it in the file xuetangx.txt.
Write xuetangx.py to extract the university information and print the university names separated by spaces.
This time the file is very messy and contains a lot of duplicate information.

 

# Keywords appear here in English translation; the page text itself is presumably Chinese.
fi = open("D:/xuetangx.txt", "r")
U = set()                              # empty set; adding a name that is already present changes nothing
for line in fi:
    if "MOOC" in line:                 # skip lines that mention MOOC
        continue
    if "university" in line or "college" in line:
        U.add(line.strip("\n"))
print(" ".join(U))                     # space-separated string of names
print(len(U))
fi.close()

The result is: 166
Note that the data in this file is not very regular (text copied directly from the page is less regular than the HTML source),
so a list is no longer suitable; a set is used instead because it removes duplicates automatically.
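A small sketch of the difference between the two containers, using invented lines in place of the copied page text. The names and the as_list / as_set variables are illustrative only.

```python
# Invented stand-in for text copied from the browser, with one repeated name.
copied_lines = [
    "Example University\n",
    "Some menu text\n",
    "Example University\n",
    "Example College\n",
]

as_list = []
as_set = set()
for line in copied_lines:
    if "University" in line or "College" in line:
        name = line.strip("\n")
        as_list.append(name)  # a list keeps every occurrence
        as_set.add(name)      # a set keeps each distinct name once

print(len(as_list))    # 3
print(len(as_set))     # 2
print(sorted(as_set))  # sets are unordered, so sort for a stable listing
```

Because a set has no defined order, the joined output of the script can list the names in a different order from run to run; sorting first avoids that.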

 

 

 

For CNMOOC (Good University Online):
Copy the page content directly from the browser and save it in the file cnmooc.txt.
Write cnmooc.py to extract the university information and print the university names separated by spaces.

# Keywords appear here in English translation; the page text itself is presumably Chinese.
fi = open("D:/cnmooc.txt", "r", encoding="utf-8")
U = set()                              # empty set; adding a name that is already present changes nothing
for line in fi:
    if "university" in line or "college" in line:
        U.add(line.strip("\n"))
print(" ".join(U))                     # space-separated string of names
print(len(U))
fi.close()

Results: 101
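The last two scripts differ only in the file name, the encoding and the skipped keyword, so they could share one helper. The following is only a sketch of that idea, not the author's code: collect_names, its parameters and the sample file contents are all assumptions, and it uses strip() rather than strip("\n"), which also trims surrounding spaces.

```python
import os
import tempfile


def collect_names(path, keywords, skip_words=(), encoding="utf-8"):
    """Return the set of distinct lines that contain any of the keywords."""
    names = set()
    # The with statement closes the file even if an error occurs.
    with open(path, "r", encoding=encoding) as fi:
        for line in fi:
            text = line.strip()
            if any(word in text for word in skip_words):
                continue
            if any(word in text for word in keywords):
                names.add(text)
    return names


# Build a small throwaway file so the sketch runs without D:/cnmooc.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as fo:
        fo.write(
            "Example University\n"
            "Example University\n"
            "MOOC courses of Example College\n"
            "Example College\n"
        )
    names = collect_names(
        path, keywords=("University", "College"), skip_words=("MOOC",)
    )

print(sorted(names))
print(len(names))
```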

 

 

Remaining problem: merge the results of the three programs and produce the final count.
Should this be manual or automated?
If a program will only be run a few times, there is no need to pursue perfection; doing part of the work by hand is fine.
Making the program very complicated just to be fully automatic would cost more than it gains.
So we copy the results of the three programs by hand into a new program as its input:

ic = "Peking ..."
xt = "Chinese entrepreneurs Institute Beijing Institute of Physical ..."
cm = "Tsing Hua University ..."
U = set()

U |= set(ic.split())   # split ic into a list, convert it to a set, and union it into U
U |= set(xt.split())   # same for xt
U |= set(cm.split())   # same for cm

ls = list(U)           # convert the set to a list so that it can be sorted
ls.sort()              # sort the names
print(" ".join(ls))
print(len(ls))

Result: 352
Note: if you have doubts about the result, review the list by hand and remove any entries that do not belong.
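The three individual counts add up to 287 + 166 + 101 = 554, so taking the union removed 554 - 352 = 202 repeated entries. A minimal sketch of the same bookkeeping on invented data follows. The short names UnivA to UnivE are placeholders without spaces, because split() separates on whitespace and would break a name that contains a space.

```python
# Placeholder outputs of the three scripts (invented, space-separated names).
ic = "UnivA UnivB UnivC"
xt = "UnivB UnivD"
cm = "UnivA UnivD UnivE"

platforms = {
    "ic": set(ic.split()),
    "xt": set(xt.split()),
    "cm": set(cm.split()),
}

# Union of all three sets: each university is counted once.
merged = set().union(*platforms.values())

# Sum of the individual counts, which counts shared universities repeatedly.
total_listed = sum(len(names) for names in platforms.values())

print(" ".join(sorted(merged)))
print(len(merged))                 # 5
print(total_listed - len(merged))  # 3 repeated entries removed by the union
```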

 

 

This example is mainly meant to give practice with common operations on text data,
including counting, deduplication and other text processing.

 


Origin www.cnblogs.com/zach0812/p/11275656.html