All about TIKA

A Quick Summary of Data Preprocessing

TikaToka

Nov 28, 2024 · 8m

Thanks to Kaggle Learn

Learn Data Cleaning Tutorials

Master efficient workflows for cleaning real-world, messy data.

kaggle.com

Please keep in mind that this is at the level of a data analysis assignment, not something meant for production use.

Handling missing values

Counting

missing_values_count = nfl_data.isnull().sum()

Dropping

nfl_data.dropna()

Filling

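# replace every missing value with 0
subset_nfl_data.fillna(0)
# or: fill each gap with the next value in the same column, then 0 for the rest
subset_nfl_data.bfill(axis=0).fillna(0)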
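To see the three steps side by side, here is a minimal sketch on a made-up frame. The DataFrame `df` and its values are hypothetical (not the NFL data from the tutorial), and the backward fill is written with `bfill()`.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (hypothetical, for illustration only)
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, np.nan],
})

# Count missing values per column
missing_values_count = df.isnull().sum()

# Drop every row that contains at least one missing value
dropped = df.dropna()

# Fill each gap with the next valid value in the same column,
# then fill whatever is still missing (trailing gaps) with 0
filled = df.bfill(axis=0).fillna(0)
```

In this toy frame every row has at least one gap, so `dropna()` leaves nothing behind. That is a good reason to count the missing values first and decide between dropping and filling afterwards.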

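Scaling

Changes the range of the data (for example, putting dollar and yen amounts, which sit on very different scales, onto a common one).

Mainly used with distance-based methods such as SVM and KNN.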

from mlxtend.preprocessing import minmax_scaling

# min-max scale the data between 0 and 1
# (column_names is a list with the names of the columns to scale)
scaled_data = minmax_scaling(original_data, columns=column_names)
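Min-max scaling itself is just `(x - min) / (max - min)` per column. The sketch below does it directly in pandas so the arithmetic is visible; the `prices` frame and the `minmax_scale` helper are hypothetical names for illustration, not part of mlxtend.

```python
import pandas as pd

# Hypothetical prices in two currencies with very different ranges
prices = pd.DataFrame({
    "usd": [1.0, 2.0, 3.0],
    "jpy": [100.0, 250.0, 400.0],
})

def minmax_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the range [0, 1]: (x - min) / (max - min)."""
    return (frame - frame.min()) / (frame.max() - frame.min())

scaled = minmax_scale(prices)
```

After scaling, both columns run from 0 to 1, so neither dominates a distance calculation. A column with a single constant value would divide by zero here and come out as NaN, so it needs separate handling.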

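Normalization

Changes the shape of the data's distribution so that it is closer to a normal distribution.

Mainly used with algorithms that expect roughly normally distributed input (k-Means, PCA, CNN, RNN, GAN, ...)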

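from scipy import stats
# normalize the exponential data with boxcox (values must be strictly positive)
# boxcox returns a (transformed data, fitted lambda) tuple
normalized_data, fitted_lambda = stats.boxcox(original_data)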
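For intuition about what Box-Cox does, here is a small NumPy-only sketch of the transform with a lambda chosen by hand. `boxcox_transform` and the sample array are hypothetical; `stats.boxcox` differs in that it estimates lambda from the data when none is given.

```python
import numpy as np

def boxcox_transform(x: np.ndarray, lmbda: float) -> np.ndarray:
    """Box-Cox transform for strictly positive x with a fixed lambda."""
    if np.any(x <= 0):
        raise ValueError("Box-Cox needs strictly positive values")
    if lmbda == 0:
        # lambda = 0 is defined as the natural log
        return np.log(x)
    return (x ** lmbda - 1) / lmbda

# Right-skewed sample (hypothetical data): each value doubles the previous one
skewed = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# With lambda = 0 the doubling steps become equally spaced
log_like = boxcox_transform(skewed, 0.0)

# With lambda = 1 the data is only shifted by 1; the shape does not change
shifted = boxcox_transform(skewed, 1.0)
```

Smaller lambda values pull in a long right tail more strongly, which is why the transform is handy for exponential-looking data.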

Working with dates

Converting to datetime

import pandas as pd
landslides['date_parsed'] = pd.to_datetime(landslides['date'])

Extracting information

# date parts
landslides['date_parsed'].dt.year
landslides['date_parsed'].dt.month
landslides['date_parsed'].dt.day
landslides['date_parsed'].dt.weekday

# time parts
landslides['date_parsed'].dt.hour
landslides['date_parsed'].dt.minute
landslides['date_parsed'].dt.second
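Putting the parsing and the extraction together, a minimal sketch might look like this. The `raw` frame and its two date strings are made up, and the month/day/two-digit-year format is an assumption about how the dates are written; adjust `format` to match the real column.

```python
import pandas as pd

# Hypothetical date strings in month/day/two-digit-year form
raw = pd.DataFrame({"date": ["3/2/07", "12/25/15"]})

# An explicit format avoids ambiguity between month-first and day-first dates
raw["date_parsed"] = pd.to_datetime(raw["date"], format="%m/%d/%y")

# Pull individual parts out through the .dt accessor
years = raw["date_parsed"].dt.year.tolist()
months = raw["date_parsed"].dt.month.tolist()
days = raw["date_parsed"].dt.day.tolist()
weekdays = raw["date_parsed"].dt.weekday.tolist()  # Monday is 0, Sunday is 6
```

The `.dt` accessor only works once the column has a datetime dtype, so the `to_datetime` step has to come first.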

Fixing typos

import fuzzywuzzy
from fuzzywuzzy import fuzz, process
import charset_normalizer

# get the top 10 closest matches to "south korea"
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")
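If fuzzywuzzy is not installed, the same idea (score each candidate, keep the ones above a threshold) can be sketched with `difflib` from the standard library. Note that this swaps in `SequenceMatcher.ratio()` for fuzzywuzzy's `token_sort_ratio`, so the scores will not be identical; `similarity`, `closest_matches`, and the sample list are hypothetical names for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Similarity score from 0 to 100 (difflib's ratio, not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def closest_matches(target: str, candidates: list[str], min_ratio: int = 90) -> list[str]:
    """Return the candidates whose cleaned-up form scores at least min_ratio."""
    return [
        c for c in candidates
        # strip whitespace and lowercase before comparing
        if similarity(target, c.strip().lower()) >= min_ratio
    ]

# Hypothetical country values with inconsistent spelling
countries = ["south korea", "southkorea", " South Korea", "germany"]
close = closest_matches("south korea", countries)
```

The threshold matters: too low and unrelated values get merged, too high and real typos slip through, so it is worth looking at the scores before replacing anything.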
