Binning, Transforming, Encoding, Scaling, and Shuffling for Feature Engineering

AWS Certified Machine Learning Specialty

by Taeyoon.Kim.DS 2023. 11. 10. 20:29

https://www.udemy.com/course/aws-machine-learning/learn/lecture/16573022#overview

This post summarizes several feature engineering techniques used in data processing and machine learning, with what each one is for and where it applies:

Binning: Converts numerical data into categorical data by grouping values into ranges, for example grouping ages by decade.
Quantile Binning: Chooses the bin edges from the data's distribution so that each bin holds roughly the same number of samples.
Data Transformation: Applies a function to a feature so that it better suits the algorithm, such as a logarithmic transform for data with an exponential trend. YouTube's feature engineering, which uses the squares and square roots of features, is cited as an example.
Encoding: Converts data into the representation a model requires, which is especially common in deep learning. One-hot encoding is a common method in which each category gets its own column, set to 1 for the matching category and 0 otherwise.
Scaling and Normalizing: Some models work best when features are normally distributed around zero, and most need features scaled to comparable ranges so that large-valued features do not dominate. scikit-learn's MinMaxScaler in Python is one tool for this.
Shuffling: Randomizes the order of the training data to remove bias that comes from the order in which the data was collected, which often improves model performance.

Feature engineering requires carefully balancing two goals: keeping enough information for the model while avoiding too many features (the curse of dimensionality).

## Binning

import pandas as pd

# Sample data
data = {'age': [25, 37, 19, 45, 55]}
df = pd.DataFrame(data)

# Define bins
bins = [0, 20, 30, 40, 50, 60]

# Binning using cut function
df['age_bin'] = pd.cut(df['age'], bins)

print(df)

   age   age_bin
0   25  (20, 30]
1   37  (30, 40]
2   19   (0, 20]
3   45  (40, 50]
4   55  (50, 60]
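
The example above uses fixed bin edges. Quantile binning, mentioned in the summary, can be sketched with `pd.qcut`. This is a minimal illustration, assuming a made-up list of ages that is skewed toward younger values; the variable names are hypothetical and not from the course.

```python
import pandas as pd

# Hypothetical, skewed sample: most ages are close together, two are much larger
ages = pd.Series([18, 19, 20, 21, 22, 23, 35, 60], name="age")

# Fixed-width bins: edges are chosen by hand, so bin counts can be uneven
fixed_bins = pd.cut(ages, bins=[0, 20, 40, 60])

# Quantile bins: edges are taken from the data's quartiles,
# so each of the 4 bins receives the same number of samples here
quantile_bins = pd.qcut(ages, q=4)

print(fixed_bins.value_counts(sort=False))
print(quantile_bins.value_counts(sort=False))
```

With this sample, the fixed bins hold 3, 4, and 1 values, while each quantile bin holds 2. Ties or repeated values can make quantile bins slightly uneven in real data.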

## Transforming

import numpy as np

# Sample data
df['income'] = [30000, 45000, 60000, 80000, 100000]

# Logarithmic transformation
df['log_income'] = np.log(df['income'])

print(df)

   age   age_bin  income  log_income
0   25  (20, 30]   30000   10.308953
1   37  (30, 40]   45000   10.714418
2   19   (0, 20]   60000   11.002100
3   45  (40, 50]   80000   11.289782
4   55  (50, 60]  100000   11.512925
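
The summary also mentions squares and square roots of features as transformations. A minimal sketch of that idea follows, reusing the same income values in a separate, hypothetical DataFrame (`income_df` and the new column names are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Same income values as above, kept in a separate frame for illustration
income_df = pd.DataFrame({"income": [30000, 45000, 60000, 80000, 100000]})

# Add the square and the square root next to the raw feature,
# so a model can pick up super-linear or sub-linear relationships
income_df["income_squared"] = income_df["income"] ** 2
income_df["income_sqrt"] = np.sqrt(income_df["income"])

print(income_df)
```

Whether these extra columns help depends on the model and the data, so they are usually compared against the raw feature during validation.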

## Encoding (One-Hot Encoding)

from sklearn.preprocessing import OneHotEncoder

# Sample data
df['gender'] = ['Male', 'Female', 'Female', 'Male', 'Female']

# One-hot encoding (scikit-learn >= 1.2 uses sparse_output; older releases used sparse=False)
encoder = OneHotEncoder(sparse_output=False)
encoded_cols = encoder.fit_transform(df[['gender']])

# Creating a DataFrame with encoded columns
encoded_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(['gender']))

# Concatenating with original data
df = pd.concat([df, encoded_df], axis=1)

print(df)

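   age   age_bin  income  log_income  gender  gender_Female  gender_Male
0   25  (20, 30]   30000   10.308953    Male            0.0          1.0
1   37  (30, 40]   45000   10.714418  Female            1.0          0.0
2   19   (0, 20]   60000   11.002100  Female            1.0          0.0
3   45  (40, 50]   80000   11.289782    Male            0.0          1.0
4   55  (50, 60]  100000   11.512925  Female            1.0          0.0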
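
For quick experiments, the same one-hot columns can be produced with pandas alone. The sketch below uses `pd.get_dummies` as an alternative to `OneHotEncoder`; it is not the method shown in the course, and the `people` DataFrame is a hypothetical stand-in for the data above.

```python
import pandas as pd

# Hypothetical stand-in for the gender column used above
people = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male", "Female"]})

# get_dummies creates one 0/1 column per category, named <column>_<category>
encoded = pd.get_dummies(people, columns=["gender"], dtype=int)

# Keep the original column next to the encoded ones for comparison
result = pd.concat([people, encoded], axis=1)

print(result)
```

`OneHotEncoder` is generally the better fit inside a training pipeline, because the fitted encoder remembers the categories and can apply the same mapping to new data.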

## Scaling

from sklearn.preprocessing import MinMaxScaler

# Sample data
df['height'] = [5.0, 5.5, 6.0, 6.2, 5.8]

# Min-Max scaling
scaler = MinMaxScaler()
df['scaled_height'] = scaler.fit_transform(df[['height']])

print(df)

   age   age_bin  income  log_income  gender  gender_Female  gender_Male  \
0   25  (20, 30]   30000   10.308953    Male            0.0          1.0
1   37  (30, 40]   45000   10.714418  Female            1.0          0.0
2   19   (0, 20]   60000   11.002100  Female            1.0          0.0
3   45  (40, 50]   80000   11.289782    Male            0.0          1.0
4   55  (50, 60]  100000   11.512925  Female            1.0          0.0

   height  scaled_height
0     5.0       0.000000
1     5.5       0.416667
2     6.0       0.833333
3     6.2       1.000000
4     5.8       0.666667
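
Min-max scaling maps values into the range 0 to 1. When a model prefers features centered around zero, standardization is a common alternative. The sketch below uses scikit-learn's `StandardScaler` on the same height values; the `heights` DataFrame and column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same height values as above, kept in a separate frame for illustration
heights = pd.DataFrame({"height": [5.0, 5.5, 6.0, 6.2, 5.8]})

# StandardScaler subtracts the mean and divides by the (population) standard deviation
scaler = StandardScaler()
heights["standardized_height"] = scaler.fit_transform(heights[["height"]]).ravel()

# The same result computed by hand, to show what the scaler does
mean = heights["height"].mean()
std = heights["height"].std(ddof=0)
heights["manual"] = (heights["height"] - mean) / std

print(heights)
```

The standardized column has a mean of 0 and a standard deviation of 1. Standardizing rescales the data but does not change the shape of its distribution.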

## Shuffling

from sklearn.utils import shuffle

# Shuffling the DataFrame
shuffled_df = shuffle(df)

print(shuffled_df)

   age   age_bin  income  log_income  gender  gender_Female  gender_Male  \
1   37  (30, 40]   45000   10.714418  Female            1.0          0.0
0   25  (20, 30]   30000   10.308953    Male            0.0          1.0
2   19   (0, 20]   60000   11.002100  Female            1.0          0.0
4   55  (50, 60]  100000   11.512925  Female            1.0          0.0
3   45  (40, 50]   80000   11.289782    Male            0.0          1.0

   height  scaled_height
1     5.5       0.416667
0     5.0       0.000000
2     6.0       0.833333
4     5.8       0.666667
3     6.2       1.000000
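
Because `shuffle` is random, the row order shown above will differ from run to run. A minimal sketch of a reproducible shuffle follows, assuming a small hypothetical DataFrame with a `label` column; `random_state=42` is an arbitrary seed.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical data: a feature column and a label column
samples = pd.DataFrame({"age": [25, 37, 19, 45, 55], "label": [0, 1, 0, 1, 1]})

# A fixed random_state gives the same order on every run
shuffled_a = shuffle(samples, random_state=42)
shuffled_b = shuffle(samples, random_state=42)

# pandas-only alternative: sample all rows without replacement,
# then reset the index so it runs 0..n-1 again
shuffled_c = samples.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a)
print(shuffled_c)
```

Both approaches move whole rows, so each feature value stays paired with its label.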

License: Attribution, NonCommercial, NoDerivatives
