Data Transformation
- Data are transformed into forms appropriate for ML algorithms
- We focus on the methods used at this step for various data types
1. Normalization for Real Value Columns
- Min-max normalization: linearly map to a new minimum $a$ and maximum $b$

$$x_i' = \frac{x_i - \min_x}{\max_x - \min_x}(b-a) + a \tag{1}$$

- Z-score normalization: zero mean, unit standard deviation

$$x_i' = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)} \tag{2}$$

- Decimal scaling

$$x_i' = \frac{x_i}{10^j}, \quad \text{smallest } j \text{ s.t. } \max(|x'|) < 1 \tag{3}$$

- Log scaling

$$x_i' = \log(x_i) \tag{4}$$
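Equations (1) to (4) can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are made up for this note, the z-score uses the population standard deviation, and decimal scaling restricts $j$ to non-negative integers; all three are assumptions the notes do not specify.

```python
import math
import statistics


def min_max(xs, a=0.0, b=1.0):
    """Eq. (1): linearly map xs onto the range [a, b]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant column")
    return [(x - lo) / (hi - lo) * (b - a) + a for x in xs]


def z_score(xs):
    """Eq. (2): zero mean, unit standard deviation (population std assumed)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant column")
    return [(x - mu) / sigma for x in xs]


def decimal_scaling(xs):
    """Eq. (3): divide by 10**j, smallest non-negative j with max(|x'|) < 1."""
    peak = max(abs(x) for x in xs)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs], j


def log_scaling(xs):
    """Eq. (4): natural logarithm; every value must be strictly positive."""
    return [math.log(x) for x in xs]


print(min_max([1.0, 2.0, 3.0]))  # [0.0, 0.5, 1.0]
```

In practice the same operations are usually applied column by column with an array library; the list-based version above only keeps the arithmetic visible.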
2. Image Transformations
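(Note: pay attention to how the data are stored, their resolution, whether they can be used directly for training, and how efficiently they load.)
- The web scraping described earlier would collect 15 TB of images in a year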
- Downsampling and cropping
- Reduce image size
- Be aware of low resolution
- Image quality (JPEG compression)
- Image whitening
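The cropping and downsampling steps can be illustrated without any imaging library. The sketch below treats a grayscale image as a list of rows; the function names, the center-crop policy, and block averaging as the downsampling filter are all assumptions chosen for illustration. A real pipeline would normally rely on an image library, and whitening is not covered here.

```python
def center_crop(img, height, width):
    """Return the central height x width window of a 2-D image (list of rows)."""
    h, w = len(img), len(img[0])
    if height > h or width > w:
        raise ValueError("crop size exceeds image size")
    top, left = (h - height) // 2, (w - width) // 2
    return [row[left:left + width] for row in img[top:top + height]]


def downsample(img, factor):
    """Shrink by an integer factor, averaging each factor x factor block."""
    h, w = len(img) // factor, len(img[0]) // factor
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            block = [img[r * factor + dr][c * factor + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# A 4x4 test image whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(center_crop(image, 2, 2))  # [[5, 6], [9, 10]]
print(downsample(image, 2))      # [[2.5, 4.5], [10.5, 12.5]]
```

Downsampling by a factor of 2 cuts the pixel count to a quarter, which is the storage saving the bullets refer to, at the cost of resolution.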
3. Video Transformations
(Encoding and decoding are widely used here: encoding saves storage space, but the decoded data need 10x or more space at run time.)
- Average video length
- Preprocessing balances storage, quality, and loading speed
- We often use short video clips (< 10 sec)
- Decode a playable video and sample a sequence of frames
- Best for loading, but takes 10x more space
- computation may be cheaper than storage
- can apply other image transformations to the frames
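To make the frame-sampling step concrete, the sketch below picks which frames of a decoded clip to keep and estimates how much space the kept frames take uncompressed. Uniform sampling, 8-bit RGB frames, and both function names are assumptions for illustration; the notes only say that a sequence of frames is sampled, and the decoding itself (done with a video library) is not shown.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples evenly spaced frame indices from a clip of num_frames."""
    if not 0 < num_samples <= num_frames:
        raise ValueError("num_samples must be between 1 and num_frames")
    step = num_frames / num_samples
    # Take the frame at the middle of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def decoded_size_bytes(width, height, num_frames, channels=3):
    """Uncompressed size of the frames, assuming one byte per channel."""
    return width * height * channels * num_frames


# A 10-second clip at 30 fps has 300 frames; keep 5 of them.
print(sample_frame_indices(300, 5))    # [30, 90, 150, 210, 270]
print(decoded_size_bytes(224, 224, 8)) # 1204224
```

Storing only the sampled frames avoids decoding at training time, which is the trade-off between computation and storage mentioned in the list.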
4. Text Transformations
- Stemming and lemmatization: a word -> a common base form
- E.g. am, are, is --> be
- E.g. car, cars, car's, cars' --> car
- Tokenization: text -> a list of tokens (the smallest units seen by ML algorithms)
- By word: text.split(" ")
- By char: list(text) (in Python, text.split("") raises ValueError: empty separator)
- By subwords:
- Unigram, WordPiece, …
- E.g. "a new gpu!" --> "a", "new", "gp", "##u", "!"
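The three tokenization levels can be compared side by side. The sketch below uses greedy longest-match-first splitting, which is how WordPiece-style tokenizers segment a word at inference time. The vocabulary is hypothetical and chosen only so that the "a new gpu!" example above is reproduced; the function names and the regex used to separate punctuation are likewise assumptions.

```python
import re


def tokenize_words(text):
    """Word-level tokens: split on single spaces."""
    return text.split(" ")


def tokenize_chars(text):
    """Character-level tokens."""
    return list(text)


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry matches at this position
        tokens.append(piece)
        start = end
    return tokens


def tokenize_subwords(text, vocab):
    """Split into words and punctuation, then apply wordpiece to each."""
    tokens = []
    for word in re.findall(r"\w+|[^\w\s]", text):
        tokens.extend(wordpiece(word, vocab))
    return tokens


VOCAB = {"a", "new", "gp", "##u", "!"}  # hypothetical toy vocabulary
print(tokenize_words("a new gpu!"))            # ['a', 'new', 'gpu!']
print(tokenize_subwords("a new gpu!", VOCAB))  # ['a', 'new', 'gp', '##u', '!']
```

Real subword vocabularies are learned from a corpus; only the segmentation step is shown here.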
5. Summary
- Data transformation converts data into formats preferred by ML algorithms while balancing data size, quality, and loading speed
- Tabular: normalize real-valued features
- Images: cropping, downsampling, whitening
- Videos: clipping, sampling frames
- Text: stemming, lemmatization, tokenization