[Li Mu Machine Learning] 1. Data acquisition + web scraping

Finding datasets

1. Datasets commonly used in papers
2. Machine learning competition websites + user-submitted datasets
3. Search engines
Academic datasets: already heavily processed, of moderate difficulty, and tied to common models; not well suited to real applications.
Competition datasets: closer to application data; they have had some preprocessing and are relatively clean, but they concentrate on the more popular topics.
Raw datasets: flexible, but require preprocessing.

Data fusion

The full data may be stored in several different places (tables), which are combined with a table join.
Factors to consider when joining: values may have been entered incorrectly, and the same quantity may be recorded in different units.
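
As a rough illustration of a table join with a unit fix, here is a minimal sketch in plain Python. The two tables, the column names (`house_id`, `area`, `unit`, `price`) and the helpers `to_sqm` and `join_tables` are all made up for this example; a real pipeline would more likely use a dataframe library.

```python
SQFT_PER_SQM = 10.7639  # square feet in one square metre

# Hypothetical tables stored in two different places, keyed by house_id.
listings = [
    {"house_id": 1, "area": 100.0, "unit": "sqm"},
    {"house_id": 2, "area": 1500.0, "unit": "sqft"},
    {"house_id": 3, "area": 80.0, "unit": "sqm"},
]
sales = [
    {"house_id": 1, "price": 500_000},
    {"house_id": 2, "price": 650_000},
]


def to_sqm(area, unit):
    """Convert an area to square metres so every row uses one unit."""
    if unit == "sqm":
        return area
    if unit == "sqft":
        return area / SQFT_PER_SQM
    raise ValueError(f"unknown unit: {unit}")


def join_tables(left, right, key):
    """Inner join: keep only the rows whose key appears in both tables."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


merged = join_tables(listings, sales, "house_id")
for row in merged:
    row["area_sqm"] = round(to_sqm(row["area"], row["unit"]), 1)

print(merged)
```

House 3 has no matching sale, so the inner join drops it; whether to drop or keep such rows is a choice that depends on the task.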

Artificially generated data

GAN: Unsupervised image generation

Data augmentation: apply small changes to existing data (for images, e.g. flips or crops) to get additional training samples.
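
A minimal sketch of image-style augmentation, assuming an image is just a 2-D list of pixel values. The helper names (`horizontal_flip`, `random_crop`, `augment`) and the choice of a flip plus a crop are illustrative assumptions; image libraries offer many more transformations.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2-D image (a list of rows of pixel values)."""
    return [row[::-1] for row in image]


def random_crop(image, height, width, rng):
    """Cut a height x width window at a random position inside the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image, rng):
    """Return one randomly changed copy; the original image is not modified."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, len(image) - 1, len(image[0]) - 1, rng)


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
rng = random.Random(0)  # fixed seed so the run is repeatable
augmented = [augment(image, rng) for _ in range(4)]
for sample in augmented:
    print(sample)
```

Each augmented sample keeps the label of the image it came from, which is why only label-preserving changes are used.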

Web scraping

Goal: extract the data of interest from web pages.

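from selenium import webdriver

url = "https://example.com"  # placeholder (assumption); replace with the page to scrape

chrome_options = webdriver.ChromeOptions()  # Chrome launch options
chrome_options.add_argument("--headless")  # run without a graphical interface
chrome = webdriver.Chrome(options=chrome_options)  # start Chrome
chrome.get(url)  # load the page; get() itself returns None
page = chrome.page_source  # HTML of the page as the browser sees it
chrome.quit()  # close the browser when done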

Selenium is a browser-automation tool with a Python API, and webdriver runs Chrome in the background. To keep a crawler from being blocked:
1. Make the requests look like a person browsing the page rather than a machine.
2. Rotate through a large number of IP addresses, fetching pages from different IPs.
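
Once the HTML is fetched, the data still has to be pulled out of it. Below is a minimal sketch using only the standard library's `html.parser`; the `LinkExtractor` class and the sample HTML are made up for this example. In practice the HTML string would come from the browser (with Selenium, the driver's `page_source` attribute), and many people use a dedicated parsing library instead.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href and text of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Stand-in for the HTML a browser would return (made up for this example).
html = """
<ul>
  <li><a href="/house/1">3 bed, 2 bath</a> <span class="price">$500,000</span></li>
  <li><a href="/house/2">2 bed, 1 bath</a> <span class="price">$350,000</span></li>
</ul>
"""

parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```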

Data annotation

Figure: data labeling process diagram

Semi-supervised learning

Only a small part of the data is labeled; semi-supervised learning uses both the labeled and the unlabeled data.
Assumptions of semi-supervised learning:
1. Continuity: two samples that are similar are likely to have the same label.
2. Clustering: the data has cluster structure, and samples in the same cluster are likely to share a label; different clusters may still have the same label.
3. Manifold assumption: the intrinsic complexity of the data may be much lower than the dimensionality in which it is observed, so dimensionality reduction can yield cleaner data.

Self-training

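A minimal sketch of the self-training loop as it is commonly described: train on the labeled data, predict labels for the unlabeled pool, keep only the confident predictions as pseudo-labels, and retrain. The toy nearest-centroid model, the 2-D points, the confidence formula and the 0.5 threshold are all assumptions chosen to keep the example small, not part of the original notes.

```python
def centroids(points, labels):
    """Mean point of each class: a tiny nearest-centroid 'model'."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}


def predict(model, point):
    """Return (label, confidence) using the two nearest class centroids."""
    dists = sorted(
        (((point[0] - cx) ** 2 + (point[1] - cy) ** 2) ** 0.5, label)
        for label, (cx, cy) in model.items()
    )
    (d1, best), (d2, _) = dists[0], dists[1]
    # 1.0 when the point sits on a centroid, 0.0 when it is equally far from both.
    confidence = 1.0 - 2.0 * d1 / (d1 + d2) if d1 + d2 > 0 else 0.0
    return best, confidence


def self_train(points, labels, unlabeled, threshold=0.5, max_rounds=10):
    """Repeatedly add confident pseudo-labels to the training set and refit."""
    points, labels, pool = list(points), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(points, labels)
        confident, rest = [], []
        for p in pool:
            label, conf = predict(model, p)
            (confident if conf >= threshold else rest).append((p, label))
        if not confident:
            break  # nothing new to learn from
        for p, label in confident:
            points.append(p)
            labels.append(label)
        pool = [p for p, _ in rest]
    return centroids(points, labels), pool


labeled_points = [(0.0, 0.0), (4.0, 4.0)]
labeled_classes = ["a", "b"]
unlabeled_points = [(0.5, 0.2), (0.3, 0.8), (3.6, 4.2), (4.4, 3.5), (2.0, 2.0)]

final_model, remaining = self_train(labeled_points, labeled_classes, unlabeled_points)
print(final_model)
print(remaining)
```

The point midway between the two classes never reaches the threshold, so it stays unlabeled; in a real project such low-confidence samples are the ones worth sending to human labelers.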

Origin blog.csdn.net/weixin_48983346/article/details/126447761