Finding a dataset
1. Common datasets used in papers
2. Machine learning competition websites with user-submitted datasets
3. Search engines
Academic datasets: heavily preprocessed; moderate difficulty and tied to common models; not well suited to real applications
Competition datasets: closer to application data; partly preprocessed, so relatively clean; focused on currently popular problems
Raw datasets: flexible, but all preprocessing is up to you
Data Fusion
The data for one task may be spread across different places, so the tables must be joined.
Factors to consider: values may be entered incorrectly, or the same quantity may be recorded in different units.
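The join-plus-unit-normalization step can be sketched with pandas (the table and column names here are invented for illustration):

```python
import pandas as pd

# Two hypothetical tables holding parts of the same records: one keyed
# table of users, one table of orders with prices recorded in cents.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["a", "b", "c"]})
orders = pd.DataFrame({"user_id": [1, 1, 3], "price_cents": [250, 1000, 499]})

# Normalize units before fusing: convert cents to dollars.
orders["price_usd"] = orders["price_cents"] / 100

# Join on the shared key; a left join keeps users with no orders.
merged = users.merge(orders[["user_id", "price_usd"]], on="user_id", how="left")
```

A left join (rather than inner) makes missing matches visible as NaN, which is often the first place data-entry errors show up.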
Artificially generated data
GAN: unsupervised image generation
Data augmentation: apply small changes to existing data (e.g., flips, crops, noise)
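A minimal augmentation sketch, assuming float images in [0, 1] with shape HxWxC (the specific transforms and magnitudes are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Toy augmentation: random horizontal flip plus a small brightness
    shift. Assumes a float image in [0, 1], shape HxWxC."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    out = np.clip(out + rng.uniform(-0.1, 0.1), 0.0, 1.0)  # brightness jitter
    return out

img = rng.random((32, 32, 3))  # a random stand-in for a real image
aug = augment(img)
```

Each call produces a slightly different copy, so one labeled image yields many training examples with the same label.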
Web page data scraping
Goal: Extract interesting data from web pages
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()  # configure Chrome's launch options
chrome_options.add_argument("--headless")  # no graphical interface needed
chrome = webdriver.Chrome(options=chrome_options)  # launch Chrome
chrome.get(url)  # navigate to the target page (url is the page to scrape)
page = chrome.page_source  # the rendered HTML of the page
Selenium is a Python tool; WebDriver drives a real Chrome browser in the background. To avoid being blocked:
1. Make the traffic look like a human browsing the page rather than a machine
2. Rotate IP addresses at scale, fetching pages through many different IPs
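Once the rendered HTML is in hand, extracting the interesting fields is a parsing job. A minimal sketch using the standard library's HTMLParser, where the `<span class="price">` markup is a made-up example of a target field:

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text inside <span class="price"> tags (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

html = '<div><span class="price">$249.00</span><span class="price">$19.99</span></div>'
parser = PriceExtractor()
parser.feed(html)
```

In practice `html` would be the `page_source` returned by Selenium above.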
Data annotation
(Figure: data labeling process flowchart)
Semi-supervised learning
Only a small part of the data is labeled; the model learns from both the labeled and the unlabeled data.
Assumptions of semi-supervised learning:
1. Continuity assumption: two samples that are similar are likely to have the same label
2. Cluster assumption: the data form clusters; samples in the same cluster are likely to share a label, while samples in different clusters likely have different labels
3. Manifold assumption: the data actually lie on a manifold of much lower dimension than the observed space, so dimensionality reduction can recover a cleaner representation
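A toy illustration of the first two assumptions: give each unlabeled point the label of its nearest labeled neighbor (pure NumPy, on synthetic clusters invented for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated clusters (cluster assumption). Only one point per
# cluster is labeled; the rest get pseudo-labels from their nearest
# labeled neighbor (continuity assumption).
cluster_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
cluster_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
X = np.vstack([cluster_a, cluster_b])  # 40 unlabeled points

labeled_X = np.array([[0.0, 0.0], [5.0, 5.0]])  # one labeled point per cluster
labeled_y = np.array([0, 1])

# Distance from every unlabeled point to every labeled point,
# then take the label of the closest one.
dists = np.linalg.norm(X[:, None, :] - labeled_X[None, :, :], axis=-1)
pseudo_labels = labeled_y[dists.argmin(axis=1)]
```

Real semi-supervised methods (self-training, label propagation) iterate this idea, but the principle is the same: similarity and cluster structure let two labels annotate forty points.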