Practical Machine Learning Notes (5): Data Transformation

Enterprise Development 2023-04-09 17:26:01

Data Transformation

Data is transformed into forms suitable for ML algorithms
We focus on the methods used in this step for various data types

1. Normalization for Real Value Columns

Min-max normalization: linearly map to a new min a and max b
$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}(b - a) + a \tag{1}$
Z-score normalization: 0 mean, 1 standard deviation
$x_i' = \frac{x_i - \operatorname{mean}(x)}{\operatorname{std}(x)} \tag{2}$
Decimal scaling: divide by a power of 10 so that every scaled value has magnitude below 1
$x_i' = \frac{x_i}{10^j}, \quad \text{smallest } j \text{ s.t. } \max(|x'|) < 1 \tag{3}$
Log scaling
$x_i' = \log(x_i) \tag{4}$
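
The four formulas above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are made up for this note, the standard deviation is taken as the population standard deviation, the constraint in (3) is read as applying to the scaled values, and j is assumed to be non-negative.

```python
import math
import statistics


def min_max(xs, a=0.0, b=1.0):
    """Eq. (1): linearly map xs onto the interval [a, b]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for a constant column")
    return [(x - lo) / (hi - lo) * (b - a) + a for x in xs]


def z_score(xs):
    """Eq. (2): shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population std: an assumption of this sketch
    if std == 0:
        raise ValueError("z-score normalization is undefined for a constant column")
    return [(x - mean) / std for x in xs]


def decimal_scaling(xs):
    """Eq. (3): divide by 10**j for the smallest j >= 0 with max(|x'|) < 1."""
    peak = max(abs(x) for x in xs)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs]


def log_scaling(xs):
    """Eq. (4): natural logarithm; every value must be strictly positive."""
    return [math.log(x) for x in xs]
```

For example, `min_max([1, 2, 3])` gives `[0.0, 0.5, 1.0]`, and `decimal_scaling([-991, 45, 120])` uses j = 3 and gives `[-0.991, 0.045, 0.12]`.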

2. Image Transformations

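(Pay attention to how the data is stored, its resolution, whether it can be used directly for training, loading efficiency, and similar issues.)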

Our earlier web-scraping example would collect about 15 TB of images in a year
Downsampling and cropping
- reduce image size
- a low resolution is often sufficient for ML
- be aware of image quality (e.g. JPEG compression)
Image whitening
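
As a rough illustration of cropping and downsampling, the sketch below works on a grayscale image stored as a nested list of pixel rows. The function names and the box-filter (block average) downsampling are choices made for this example; a real pipeline would normally use an image library instead. Whitening is not covered here.

```python
def center_crop(img, out_h, out_w):
    """Keep the central out_h x out_w window of a 2-D grayscale image."""
    h, w = len(img), len(img[0])
    if out_h > h or out_w > w:
        raise ValueError("crop size exceeds image size")
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return [row[left:left + out_w] for row in img[top:top + out_h]]


def downsample(img, factor):
    """Average each factor x factor block of pixels into one output pixel."""
    h, w = len(img), len(img[0])
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    out = []
    for r in range(0, h, factor):
        out_row = []
        for c in range(0, w, factor):
            block = [img[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# A 6x6 test image whose pixel value encodes its position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
cropped = center_crop(image, 4, 4)  # 4x4 window from the middle
small = downsample(cropped, 2)      # 2x2 image, 4x fewer pixels
```

Downsampling by a factor of 2 cuts the pixel count by 4, which is where most of the storage saving comes from.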

3. Video Transformations

(Encoding and decoding are widely used. Encoding saves storage space, but the decoded frames need 10x or more space.)

Average video length
Preprocess to balance storage, quality, and loading speed
We often use short video clips (<10 sec)
Decode a playable video and sample a sequence of frames
- best for loading, but takes about 10x more space
- computation may be cheaper than storage (decode on the fly instead of storing frames)
- other image transformations can be applied to the frames
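
Sampling a sequence of frames from a decoded clip can be sketched as picking evenly spaced frame indices. The function name and the choice of taking the middle frame of each equal-length segment are assumptions of this example, not something the notes prescribe.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over a decoded clip."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        # Clip is shorter than the request: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Split the clip into equal segments and take the middle frame of each.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

For a 10-second clip at 30 frames per second, `sample_frame_indices(300, 5)` returns `[30, 90, 150, 210, 270]`, so only 5 of the 300 decoded frames need to be stored or loaded.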

4. Text Transformations

Stemming and lemmatization: a word -> a common base form
- E.g. am, are, is --> be
- E.g. car, cars, car's, cars' --> car
Tokenization: text -> a list of tokens (the smallest units seen by ML algorithms)
- By word: text.split(" ")
- By char: list(text)
- By subword:
  - unigram, wordpiece, …
  - e.g. "a new gpu!" --> "a", "new", "gp", "##u", "!"
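
The transformations above can be sketched in a few lines of Python. Everything here is illustrative: the `LEMMAS` table is a toy lookup standing in for a real lemmatizer, the vocabulary is a hypothetical one chosen to reproduce the "a new gpu!" example, and the subword function uses greedy longest-match-first splitting in the style of WordPiece rather than any particular library's implementation.

```python
import re

# Toy lookup table standing in for a real lemmatizer (hypothetical).
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car"}


def lemmatize(word):
    """Map a word to its base form; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


def tokenize_words(text):
    """Word-level tokens: split on single spaces."""
    return text.split(" ")


def tokenize_chars(text):
    """Character-level tokens."""
    return list(text)


def tokenize_wordpiece(text, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split; non-initial pieces get '##'."""
    tokens = []
    # Separate words from punctuation before splitting into subwords.
    for word in re.findall(r"\w+|[^\w\s]", text):
        pieces, start = [], 0
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                candidate = word[start:end] if start == 0 else "##" + word[start:end]
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                # No vocabulary entry matches: the whole word is unknown.
                pieces = [unk]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


vocab = {"a", "new", "gp", "##u", "!"}  # hypothetical vocabulary
subwords = tokenize_wordpiece("a new gpu!", vocab)
```

With this vocabulary, `subwords` is `["a", "new", "gp", "##u", "!"]`, while `tokenize_words("a new gpu!")` keeps the punctuation attached and gives `["a", "new", "gpu!"]`.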

5. Summary

Data transformation converts data into formats preferred by ML algorithms, and balances data size, quality, and loading speed
Tabular: normalize real-valued features
Images: cropping, downsampling, whitening
Videos: clipping, sampling frames
Text: stemming, lemmatization, tokenization

Reposted from blog.csdn.net/jerry_liufeng/article/details/123430644
