Crawling program information from the CCTV website

1. Data crawling

Tool introduction

This article uses Selenium, a browser automation framework with Python bindings, for dynamic crawling. Selenium is a tool for automated testing of web applications. Its tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera, and others. Its main functions are browser compatibility testing (checking that an application works well on different browsers and operating systems) and functional testing (building regression tests that verify software functions and user requirements). It also supports recording actions automatically and generating test scripts in languages such as .NET, Java, and Perl.

Analyze web pages

Website: https://tv.cctv.com/

To open the CCTV homepage, you first need to download the driver that matches the browser already installed on your computer. I use Google Chrome, so I installed the Chrome driver. With the driver in place, you can launch the browser through the webdriver module of Python's Selenium library and load the CCTV homepage from its URL.

from selenium import webdriver
import requests
import time
# Start the Google Chrome browser
browser = webdriver.Chrome('D://Baidu installation package//chromedriver_win32//chromedriver.exe')
# First-level page
browser.get('https://tv.cctv.com/')

Getting information

 

After opening the homepage, you will find that the program list to be crawled is not on it. You have to jump to another page through the program list entry on the homepage. This entry is the third li under class="nav", and the link in the a tag under that li leads to the playbill page.


# Find the location of the CCTV program list on the homepage (the third item matched by 'nav li')
input1 = browser.find_elements_by_class_name('nav li')[2]
input2 = input1.find_element_by_css_selector('a')
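The find_element_by_* and find_elements_by_* helpers used throughout this article belong to Selenium 3; they were removed in Selenium 4.3. The sketch below shows how the same two steps might look with the Selenium 4 locator API. It assumes Selenium 4.6 or later, where a matching driver is located automatically when no path is given. The function name find_playbill_url is made up for illustration, and the '.nav li' selector is taken from the code above and may no longer match the live site.

```python
def find_playbill_url(home_url='https://tv.cctv.com/'):
    """Open the CCTV homepage and return the href of the third nav item."""
    # Imported here so the module can be loaded even where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Assumes Selenium >= 4.6, which finds a suitable driver on its own.
    browser = webdriver.Chrome()
    try:
        browser.get(home_url)
        # Third <li> under the element with class "nav" (index 2).
        nav_items = browser.find_elements(By.CSS_SELECTOR, '.nav li')
        link = nav_items[2].find_element(By.CSS_SELECTOR, 'a')
        return link.get_attribute('href')
    finally:
        # Always close the browser, even if a lookup fails.
        browser.quit()
```

Calling find_playbill_url() would open a Chrome window, so it is left uncalled here.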

Getting second-level page information

 

After entering the playbill page, the page source shows that the playbill information to be crawled sits in the td tags under the tr tags under class="r", so it can be crawled dynamically in the same way.


		# Find how many programs the channel has
		leng = result.find_element_by_class_name('r')
		leng1 = leng.find_elements_by_css_selector('tr')
		# Crawl the information of each program (broadcast time, program name, broadcast status)
		for i in range(len(leng1)):
			# Crawl the information of the i-th program
			leng2 = leng.find_elements_by_css_selector('tr')[i]
			input3 = leng2.find_elements_by_css_selector('td')[0]
			input3_1 = leng2.find_elements_by_css_selector('td')[1]
			input3_2 = leng2.find_elements_by_css_selector('td')[2].find_elements_by_css_selector('span')
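To see the tr/td traversal without a browser, the sketch below applies the same idea to a static HTML string with Python's built-in html.parser. This is a swapped-in technique, not the Selenium approach used in this article. The sample markup is invented for illustration and is only assumed to resemble the playbill table (class="r", one tr per program, td cells for time, name, and status); the real page may differ.

```python
from html.parser import HTMLParser


class PlaybillParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell texts per <tr>
        self._cell = None    # text pieces of the <td> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td' and self.rows:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a <td> (including nested <span>).
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self.rows[-1].append(''.join(self._cell))
            self._cell = None


# Hypothetical markup, not copied from the real site.
SAMPLE = '''
<div class="r"><table>
<tr><td>06:00</td><td>Morning News</td><td><span>Replay</span></td></tr>
<tr><td>19:00</td><td>Evening News</td><td><span>Live</span></td></tr>
</table></div>
'''

parser = PlaybillParser()
parser.feed(SAMPLE)
for time_text, name, status in parser.rows:
    print(time_text, name, status)
```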

The CCTV channel pictures are crawled in the same way:

input1_1 = result.find_element_by_class_name('l')
input1_0 = input1_1.find_elements_by_css_selector('li')
# Crawl the channel picture of each CCTV channel
for s in range(len(input1_0)):
	input1_00 = input1_1.find_elements_by_css_selector('li')[s]
	input1_2 = input1_00.find_element_by_tag_name('img').get_attribute('src')

Click event

The steps above retrieve all the program information of the first channel only; the listings of the other channels cannot be found in the page until those channels are selected. We therefore use a click event to select each channel in turn, so that the complete playbill information can be crawled.

# Click the j-th channel
leng0 = result.find_elements_by_css_selector('li')[j].click()
# Sleep for 0.5 seconds after clicking
time.sleep(0.5)
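A fixed half-second sleep works only if the page always updates within that time. A common alternative is to poll until a condition holds, which is the idea behind Selenium's own WebDriverWait. The sketch below is a generic, browser-free version of that idea; wait_until and its parameters are names chosen for this example and are not part of Selenium.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)


# Stand-in for "the program table has been refreshed": true on the third check.
checks = []


def table_ready():
    checks.append(1)
    return len(checks) >= 3


wait_until(table_ready, timeout=2.0, interval=0.01)
print('ready after', len(checks), 'checks')
```

In the crawler, the condition could be a function that checks whether the rows under class="r" have changed after the click.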

2. Data storage

This article uses two storage methods: one for pictures and one for text information.

Storage of text information

			# Check how many status fields the program has
			if len(input3_2) != 1:
				input3_3_1 = input3_2[0]
				input3_3_2 = input3_2[1]
				# Write to txt file
				with open("YS_cctv.txt","a+",encoding='utf-8') as f:
					f.write("Time:")
					f.write(input3.text)
					f.write("Program:")
					f.write(input3_1.text)
					f.write(input3_3_2.text)
					f.write(input3_3_1.text)
					f.write('\n')
					print("Time:",input3.text,"Program:",input3_1.text,input3_3_2.text,input3_3_1.text)
			else:
				input3_3_1 = input3_2[0]
				with open("YS_cctv.txt","a+",encoding='utf-8') as f:
					f.write("Time:")
					f.write(input3.text)
					f.write("Program:")
					f.write(input3_1.text)
					f.write(input3_3_1.text)
					f.write('\n')
					print("Time:",input3.text,"Program:",input3_1.text,input3_3_1.text)
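The two branches above differ only in how many status fields they write. One way to avoid the duplication is to build the line first and write it once. The helper names format_record and append_record below are invented for this sketch, and plain strings stand in for the .text values of the Selenium elements.

```python
import os
import tempfile


def format_record(time_text, program, statuses):
    """Build one output line; statuses is a list with one or more strings."""
    return 'Time:' + time_text + 'Program:' + program + ''.join(statuses) + '\n'


def append_record(path, time_text, program, statuses):
    """Append one record to the text file, creating the file if needed."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(format_record(time_text, program, statuses))


# Demonstration with made-up values in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'YS_cctv.txt')
append_record(demo_path, '06:00', 'Morning News', ['Replay'])
append_record(demo_path, '19:00', 'Evening News', ['Live', 'HD'])

with open(demo_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```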

Storage of pictures

		name=str(s+1)+'.jpg'
		erwer=requests.get(input1_2).content
		# Write the picture into the folder
		with open("YS_cctv/"+name,"wb") as f:
			f.write(erwer)
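open() fails if the YS_cctv folder does not exist yet, so it may help to create it first. The sketch below separates saving from downloading so that it can be tried without network access; save_image is a hypothetical helper, and the bytes passed to it stand in for requests.get(url).content.

```python
import os
import tempfile


def save_image(folder, index, data):
    """Write image bytes to <folder>/<index + 1>.jpg and return the path."""
    os.makedirs(folder, exist_ok=True)   # no error if the folder already exists
    path = os.path.join(folder, str(index + 1) + '.jpg')
    with open(path, 'wb') as f:
        f.write(data)
    return path


# Demonstration with placeholder bytes instead of a real download.
demo_folder = os.path.join(tempfile.mkdtemp(), 'YS_cctv')
saved = save_image(demo_folder, 0, b'\xff\xd8 fake jpeg bytes')
print(os.path.basename(saved))
```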

3. Complete crawling code

from selenium import webdriver
import requests
import time

# Start the Google Chrome browser
browser = webdriver.Chrome('D://Baidu installation package//chromedriver_win32//chromedriver.exe')
# First-level page
browser.get('https://tv.cctv.com/')
# Find the location of the CCTV program list on the homepage (the third item matched by 'nav li')
input1 = browser.find_elements_by_class_name('nav li')[2]
input2 = input1.find_element_by_css_selector('a')
browser2 = webdriver.Chrome('D://Baidu installation package//chromedriver_win32//chromedriver.exe')
# Follow the link found above to enter the playbill section (the second-level page)
browser2.get(input2.get_attribute('href'))
result = browser2.find_element_by_class_name('channel_con')
# Find the li tags and count how many channels there are
leng3 = result.find_elements_by_css_selector('li')
# Crawl the program list of all channels
for j in range(len(leng3)):
	leng0 = result.find_elements_by_css_selector('li')[j].click()
	# Sleep for 0.5 seconds after clicking
	time.sleep(0.5)
	# Crawl the CCTV program list
	def cctv_program():
		# Find how many programs the channel has
		leng = result.find_element_by_class_name('r')
		leng1 = leng.find_elements_by_css_selector('tr')
		# Crawl the information of each program (broadcast time, program name, broadcast status)
		for i in range(len(leng1)):
			# Crawl the information of the i-th program
			leng2 = leng.find_elements_by_css_selector('tr')[i]
			input3 = leng2.find_elements_by_css_selector('td')[0]
			input3_1 = leng2.find_elements_by_css_selector('td')[1]
			input3_2 = leng2.find_elements_by_css_selector('td')[2].find_elements_by_css_selector('span')
			# Check how many status fields the program has (if there is only one, write just that one; otherwise write both)
			if len(input3_2) != 1:
				input3_3_1 = input3_2[0]
				input3_3_2 = input3_2[1]
				# Write to txt file
				with open("YS_cctv.txt","a+",encoding='utf-8') as f:
					f.write("Time:")
					f.write(input3.text)
					f.write("Program:")
					f.write(input3_1.text)
					f.write(input3_3_2.text)
					f.write(input3_3_1.text)
					f.write('\n')
					print("Time:",input3.text,"Program:",input3_1.text,input3_3_2.text,input3_3_1.text)
			else:
				input3_3_1 = input3_2[0]
				with open("YS_cctv.txt","a+",encoding='utf-8') as f:
					f.write("Time:")
					f.write(input3.text)
					f.write("Program:")
					f.write(input3_1.text)
					f.write(input3_3_1.text)
					f.write('\n')
					print("Time:",input3.text,"Program:",input3_1.text,input3_3_1.text)
		print("----------------------------------------------------------------------")
	cctv_program()
# Crawl the CCTV channel pictures
def cctv_picture():
	input1_1 = result.find_element_by_class_name('l')
	input1_0 = input1_1.find_elements_by_css_selector('li')
	# Crawl the channel picture of each CCTV channel
	for s in range(len(input1_0)):
		input1_00 = input1_1.find_elements_by_css_selector('li')[s]
		input1_2 = input1_00.find_element_by_tag_name('img').get_attribute('src')
		name=str(s+1)+'.jpg'
		erwer=requests.get(input1_2).content
		# Write the picture into the folder
		with open("YS_cctv/"+name,"wb") as f:
			f.write(erwer)
cctv_picture()
browser.close()
browser2.close()

Screenshot of the saved results:

 

4. Data visualization analysis

Visualizing the data requires a csv file, so the file-writing code needs to change from txt to csv. The code is as follows:

import csv

with open("YS_cctv.csv","a+",newline="") as f:
	writer = csv.writer(f)
	writer.writerow([input3.text,input3_1.text,input3_3_1.text,input3_3_2.text])
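The analysis code later in this article selects columns by name, so the csv file needs a header row, which the snippet above does not write. A minimal sketch of writing the header once follows. The column names 'Time', 'Program', and 'Whether to broadcast' and the sample rows are assumptions made for illustration; the names must match whatever the analysis code uses.

```python
import csv
import os
import tempfile

HEADER = ['Time', 'Program', 'Whether to broadcast']   # assumed column names


def append_row(path, row):
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(HEADER)
        writer.writerow(row)


# Demonstration with made-up rows in a temporary directory.
demo_csv = os.path.join(tempfile.mkdtemp(), 'YS_cctv.csv')
append_row(demo_csv, ['06:00', 'Morning News', 'Replay'])
append_row(demo_csv, ['19:00', 'Evening News', 'Live broadcast'])

with open(demo_csv, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```

A file written this way would be read back with encoding='utf-8' rather than 'gbk'.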

The information saved to the csv file is as follows:

 

Time period-amount of programs broadcast

To explore how many CCTV programs are broadcast in different time periods, we can divide the day into twelve periods for analysis: '0-2', '2-4', '4-6', '6-8', '8-10', '10-12', '12-14', '14-16', '16-18', '18-20', '20-22', and '22-0'.

import pandas as pd
import matplotlib.pyplot as plt

# The column name used below must match the header row of the csv file
df=pd.read_csv(r'D:/python file/YS_cctv.csv',encoding='gbk')
# Total number of programs broadcast from 00:00 to 02:00
a=len(df[(df['Time'] >='00:00') & (df['Time'] <='02:00')])
# Total number of programs broadcast from 02:00 to 04:00
b=len(df[(df['Time'] >'02:00') & (df['Time'] <='04:00')])
# Total number of programs broadcast from 04:00 to 06:00
c=len(df[(df['Time'] >'04:00') & (df['Time'] <='06:00')])
# Total number of programs broadcast from 06:00 to 08:00
d=len(df[(df['Time'] >'06:00') & (df['Time'] <='08:00')])
# Total number of programs broadcast from 08:00 to 10:00
e=len(df[(df['Time'] >'08:00') & (df['Time'] <='10:00')])
# Total number of programs broadcast from 10:00 to 12:00
f=len(df[(df['Time'] >'10:00') & (df['Time'] <='12:00')])
# Total number of programs broadcast from 12:00 to 14:00
g=len(df[(df['Time'] >'12:00') & (df['Time'] <='14:00')])
# Total number of programs broadcast from 14:00 to 16:00
h=len(df[(df['Time'] >'14:00') & (df['Time'] <='16:00')])
# Total number of programs broadcast from 16:00 to 18:00
i=len(df[(df['Time'] >'16:00') & (df['Time'] <='18:00')])
# Total number of programs broadcast from 18:00 to 20:00
j=len(df[(df['Time'] >'18:00') & (df['Time'] <='20:00')])
# Total number of programs broadcast from 20:00 to 22:00
k=len(df[(df['Time'] >'20:00') & (df['Time'] <='22:00')])
# Total number of programs broadcast from 22:00 to 23:59
l=len(df[(df['Time'] >'22:00') & (df['Time'] <='23:59')])
labels ='0-2','2-4','4-6','6-8','8-10','10-12','12-14','14-16','16-18','18-20','20-22','22-0' # Labels
sizes = [a,b,c,d,e,f,g,h,i,j,k,l] # Size of each wedge
# plt.pie() draws a pie chart
# sizes must be the first parameter; autopct='%1.1f%%' shows each share as a percentage with one decimal place
# startangle=90 starts the first wedge at the top; wedges are drawn counterclockwise
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.show()
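The twelve filters above compare time strings directly, which works because zero-padded 'HH:MM' strings sort in chronological order. The same counts can be sketched more compactly by computing the two-hour slot from the hour. The function name time_slot and the sample times are invented for this example. Note that here a boundary time such as '02:00' is assigned to the slot it starts ('2-4'), which differs from the <= comparisons above.

```python
from collections import Counter

LABELS = ['0-2', '2-4', '4-6', '6-8', '8-10', '10-12',
          '12-14', '14-16', '16-18', '18-20', '20-22', '22-0']


def time_slot(time_text):
    """Map an 'HH:MM' string to its two-hour slot label."""
    hour = int(time_text.split(':')[0])
    return LABELS[hour // 2]


# Made-up start times standing in for the time column of the csv file.
times = ['00:30', '01:59', '02:00', '13:15', '22:40', '23:59']
counts = Counter(time_slot(t) for t in times)
sizes = [counts[label] for label in LABELS]   # missing slots count as 0
print(sizes)
```

The resulting sizes list can be passed to plt.pie or plt.bar in place of [a, b, ..., l].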

Visualization with the matplotlib library. Pie chart:

Bar graph:

# Bar chart
plt.bar(['0-2','2-4','4-6','6-8','8-10','10-12','12-14','14-16','16-18','18-20','20-22','22-0'],[a,b,c,d,e,f,g,h,i,j,k,l])
plt.legend()
plt.xlabel('Time period')
plt.ylabel('Amount of broadcast programs')
plt.title('Time period-Amount of broadcast programs')
plt.show()

 

Line chart:

# Line chart (x_data: the time-period labels, y_data: the program counts)
plt.plot(x_data,y_data)
plt.show()

 

 

Broadcasting method-amount of programs broadcast

To explore how many CCTV programs fall under each broadcast status, we can count the programs marked as replay, live, or not yet started. By default matplotlib cannot display Chinese characters. The solution is to configure the font dynamically:

plt.rcParams['font.sans-serif']=['SimHei'] 
plt.rcParams['axes.unicode_minus'] = False #negative sign display

The last line is used to solve the problem that the minus sign cannot be displayed normally after changing to Chinese font.

# The column name and status values used below must match the contents of the csv file
df=pd.read_csv(r'D:/python file/YS_cctv.csv',encoding='gbk')
# Total number of programs whose broadcast status is live
a=len(df[(df['Whether to broadcast'] =='Live broadcast')])
# Total number of programs whose broadcast status is not started
b=len(df[(df['Whether to broadcast'] =='Not started')])
# Total number of programs whose broadcast status is replay
c=len(df[(df['Whether to broadcast'] =='Replay')])
labels ='Live broadcast','Not started','Replay' # Labels
sizes = [a,b,c] # Size of each wedge
# The first value is 0.1, so the first wedge is offset from the center by 0.1 of the radius
explode = (0.1, 0, 0)
# plt.pie() draws a pie chart
# sizes must be the first parameter; autopct='%1.1f%%' shows each share as a percentage with one decimal place
# startangle=0 starts the first wedge on the positive x-axis; wedges are drawn counterclockwise
plt.pie(sizes, labels=labels,explode=explode,autopct='%1.1f%%', startangle=0)
plt.show()
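Counting one category at a time with a separate filter gets repetitive as the number of categories grows. As a sketch, collections.Counter can tally every distinct status in one pass, and the percentages that autopct prints can be checked by hand. The status strings below are made-up sample data.

```python
from collections import Counter

# Made-up values standing in for the broadcast-status column.
statuses = ['Replay', 'Replay', 'Live broadcast', 'Not started',
            'Not started', 'Not started', 'Replay', 'Not started']

counts = Counter(statuses)
total = sum(counts.values())

# Share of each status in percent, rounded to one decimal like '%1.1f%%'.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

labels = list(counts)                       # category names, first-seen order
sizes = [counts[name] for name in labels]   # counts in the same order
print(labels, sizes, shares)
```

With pandas, calling value_counts() on the status column gives the same tally.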

Visualization with the matplotlib library. Pie chart:

Bar graph:

import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus'] = False # display the minus sign correctly
# The column name and status values used below must match the contents of the csv file
df=pd.read_csv(r'D:/python file/YS_cctv.csv',encoding='gbk')
# Total number of programs whose broadcast status is live
a=len(df[(df['Whether to broadcast'] =='Live broadcast')])
# Total number of programs whose broadcast status is not started
b=len(df[(df['Whether to broadcast'] =='Not started')])
# Total number of programs whose broadcast status is replay
c=len(df[(df['Whether to broadcast'] =='Replay')])
plt.bar(['Live broadcast','Not started','Replay'],[a,b,c])
plt.legend()
plt.xlabel('Broadcast method')
plt.ylabel('Broadcast program volume')
plt.title('Broadcast method-broadcast program volume')
plt.show()

 

 

Line chart:

Keywords summarized as a word cloud:

That concludes this walkthrough of crawling CCTV program information. If you found it interesting, please comment and like below; your support is the author's greatest motivation. If you find shortcomings in this article, you are welcome to point them out, and more content may be added later.


Origin blog.csdn.net/weixin_43881394/article/details/112363416