Practical Machine Learning Notes (5): Data Transformation

Data Transformation

  • Data are transformed into forms appropriate for ML algorithms
  • We focus on methods in a particular step for various data types

1. Normalization for Real Value Columns

  • Min-max normalization: linearly map to a new min a and max b
    $$x_i' = \frac{x_i - \min_x}{\max_x - \min_x}(b - a) + a \tag{1}$$

  • Z-score normalization: 0 mean, 1 standard deviation
    $$x_i' = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)} \tag{2}$$

  • Decimal scaling: divide by a power of 10 so that every scaled value has magnitude below 1
    $$x_i' = \frac{x_i}{10^j}, \quad \text{smallest } j \text{ s.t. } \max(|x'|) < 1 \tag{3}$$

  • Log scaling
    $$x_i' = \log(x_i) \tag{4}$$
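
The four normalizations above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the z-score uses the population standard deviation (the notes do not say which variant is meant), and decimal scaling assumes j is a non-negative integer.

```python
import math
import statistics


def min_max(xs, a=0.0, b=1.0):
    """Eq. (1): linearly map the values to the range [a, b]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for a constant column")
    return [(x - lo) / (hi - lo) * (b - a) + a for x in xs]


def z_score(xs):
    """Eq. (2): shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-score normalization is undefined for a constant column")
    return [(x - mean) / std for x in xs]


def decimal_scaling(xs):
    """Eq. (3): divide by 10^j for the smallest j >= 0 with max(|x'|) < 1."""
    peak = max(abs(x) for x in xs)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs], j


def log_scale(xs):
    """Eq. (4): natural log of each value; every value must be positive."""
    return [math.log(x) for x in xs]
```

For example, `decimal_scaling([-986, 217, 5])` picks j = 3, because 986 / 10^3 is the first scaled peak below 1. Log scaling is mainly useful for heavy-tailed positive columns, where it compresses large values.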

2. Image Transformations

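(Pay attention to how the data is stored, its resolution, whether it can be used directly for training, and how efficiently it loads.)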

  • Our earlier web-scraping example would scrape 15 TB of images in a year
  • Downsampling and cropping
    • reduce image size
    • use a lower resolution
    • mind image quality (e.g. JPEG compression)
  • Image whitening
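
Downsampling and cropping can be sketched on a grayscale image stored as a list of rows. This is only an illustration under simplifying assumptions: the function names are hypothetical, downsampling is done by averaging non-overlapping blocks, and a real pipeline would normally use an image library rather than nested lists. Whitening is not covered here.

```python
def downsample(image, factor):
    """Shrink a 2-D grayscale image by averaging factor x factor blocks.

    Rows and columns that do not fill a whole block are dropped.
    """
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [
                image[top + dy][left + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def center_crop(image, crop_h, crop_w):
    """Keep the central crop_h x crop_w window of a 2-D image."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size is larger than the image")
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]
```

Downsampling by a factor of 2 keeps a quarter of the pixels, which is where most of the storage saving comes from.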

3. Video Transformations

(Encoding and decoding are used a lot. Encoding saves storage space, but the decoded data needs more than 10x the space at run time.)

  • Average video length
  • Preprocessing to balance storage, quality, and loading speed
  • We often use short video clips (<10 sec)
  • Decode a playable video, sample a sequence of frames
    • best for loading, but 10x more space
    • computation may be cheaper than storage
    • can apply other image transformations to the frames
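
Sampling a sequence of frames from a decoded clip can be sketched as choosing evenly spaced frame indices. This is one possible strategy, not necessarily the one used in the course: the function name is hypothetical, and the sketch takes the middle frame of each equal-length segment.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over num_frames frames."""
    if not 0 < num_samples <= num_frames:
        raise ValueError("num_samples must be between 1 and num_frames")
    step = num_frames / num_samples
    # Split the clip into num_samples equal segments and take each midpoint.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

For a 10-second clip at 30 frames per second (300 frames), asking for 5 samples gives frames 30, 90, 150, 210, and 270. Storing only those frames trades a small amount of decoding work for far less space than keeping every decoded frame.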

4. Text Transformations

  • Stemming and lemmatization: a word -> a common base form

    • E.g. am, are, is --> be
    • E.g. car, cars, car's, cars' --> car
  • tokenization: text -> a list of tokens (smallest unit to ML algorithms)

    • By word: text.split(" ")
    • By char: list(text) in Python (text.split("") raises ValueError there)
    • By subwords:
      • unigram,wordpiece,…
      • e.g. "a new gpu!" --> "a", "new", "gp", "##u", "!"
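
The three tokenization granularities above can be sketched as follows. The subword part is a simplified greedy longest-match-first split in the style of WordPiece. The toy vocabulary, the "[UNK]" token, and all function names are assumptions made for illustration; a real vocabulary is learned from a corpus.

```python
import re

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"a", "new", "gp", "##u", "!"}


def tokenize_words(text):
    """Word-level tokens: split on single spaces."""
    return text.split(" ")


def tokenize_chars(text):
    """Character-level tokens."""
    return list(text)


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize_subwords(text, vocab):
    """Split text into words and punctuation, then into subword pieces."""
    words = re.findall(r"\w+|[^\w\s]", text)
    return [piece for word in words for piece in wordpiece(word, vocab)]
```

With this toy vocabulary, `tokenize_subwords("a new gpu!", TOY_VOCAB)` reproduces the split shown above, while `tokenize_words` leaves the punctuation attached as "gpu!".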

5. Summary

  • Data transformation converts data into formats preferred by ML algorithms and balances data size, quality, and loading speed
  • Tabular: normalize real-valued features
  • Images: cropping, downsampling, whitening
  • Videos: clipping, sampling frames
  • Text: stemming, lemmatization, tokenization

Reposted from blog.csdn.net/jerry_liufeng/article/details/123430644