ddaddo_data.log

The Concept of Machine Learning

Wed, 11 Aug 2021 08:28:05 GMT

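1) The concept of machine learning

An umbrella term for algorithmic techniques that learn patterns from data and predict outcomes without the application itself being modified
Applies a range of mathematical techniques to strengthen statistical reliability and minimize prediction error, so that the model recognizes patterns in the data on its own and produces reliable predictions
The biggest weakness of machine learning is that it depends heavily on the data

2) Types of machine learning

1. Supervised learning

Classification, regression, recommendation systems, visual/speech detection and recognition, text analysis, NLP

2. Unsupervised learning

Clustering, dimensionality reduction

3. Reinforcement learning

Usually treated as its own category, separate from supervised and unsupervised learning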

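31. Clustering, k-Means

Sat, 31 Jul 2021 04:25:18 GMT

### 1) k-Means

Uses the distance to each cluster's center as the measure of similarity between data points
Given k clusters, each data point in the vector space is assigned to the cluster whose center is nearest
Model performance depends on the value of k
A larger k generally fits the data more closely, but if k is too large the clusters stop being meaningful and the analysis loses its value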
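
The assignment rule described above can be sketched in plain Python. This is a minimal illustration under assumptions, not scikit-learn's implementation: the helper names `nearest_centroid` and `kmeans_step` and the toy points are hypothetical, and the distance is assumed to be Euclidean.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans_step(points, centroids):
    """One k-means iteration: assign every point, then move each center to the mean of its members."""
    labels = [nearest_centroid(p, centroids) for p in points]
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # An empty cluster keeps its previous center in this sketch.
            new_centroids.append(old)
            continue
        dim = len(old)
        new_centroids.append(tuple(sum(m[d] for m in members) / len(members) for d in range(dim)))
    return labels, new_centroids

# Hypothetical toy data: two obvious groups and two rough starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels, centroids = kmeans_step(points, centroids)
print(labels, centroids)
```

Repeating `kmeans_step` until the labels stop changing gives the basic algorithm; the choice of k is simply the length of the starting `centroids` list.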
### 2) Data preprocessing
```
import pandas as pd
import matplotlib.pyplot as plt
```

uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv'
df = pd.read_csv(uci_path, header=0)

X = df.iloc[:,:]

from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X)

Standardization (each column rescaled to zero mean and unit variance)
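
What the scaling step does to a single column can be sketched without scikit-learn. The helper name `standardize` and the sample values are hypothetical; the population standard deviation is used because that is what `StandardScaler` divides by.

```python
import statistics

def standardize(values):
    """Z-score each value: subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof = 0)
    return [(v - mean) / std for v in values]

# Hypothetical column of spending amounts.
fresh = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(fresh)
print(scaled)
```

After this step every column has mean 0 and standard deviation 1, so a column with large raw numbers no longer dominates the distance calculation in k-Means.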

### 3) Model training and prediction

from sklearn import cluster

kmeans = cluster.KMeans(init = 'k-means++', n_clusters = 5, n_init = 10)

Create 5 clusters

kmeans.fit(X)

Train the model

cluster_label = kmeans.labels_

Cluster label assigned to each row

df['Cluster'] = cluster_label
df.head()

![](https://images.velog.io/images/ddaddo_data/post/9f7ca750-0c09-4115-8f73-06636c7ab11b/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-31%20123841.jpg)
### 4) Visualization

df.plot(kind = 'scatter', x ='Grocery', y= 'Frozen', c ='Cluster', cmap ='Set1', colorbar = False, figsize = (10,10))

Scatter plot with Grocery on the x-axis and Frozen on the y-axis

Drawn without a colorbar

df.plot(kind = 'scatter', x = 'Milk', y = 'Delicassen', c = 'Cluster', cmap = 'Set1', colorbar = True, figsize = (10,10))

Scatter plot with Milk on the x-axis and Delicassen on the y-axis

Drawn with a colorbar

plt.show()

![](https://images.velog.io/images/ddaddo_data/post/2bb8a0f0-7943-4c89-b278-bcc0fafd6cf1/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-31%20124352.jpg)

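30. Classification

Fri, 30 Jul 2021 13:47:29 GMT

### 1) KNN

Finds the k neighbors in the existing data whose attributes are most similar to the new observation
Predicts by assigning the class that is most common among those nearest neighbors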
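
The two steps above (find the k closest points, then vote) can be written out directly. This is a sketch under assumptions: `knn_predict` and the toy data are hypothetical, distance is Euclidean, and ties are not handled specially.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: label 0 near the origin, label 1 far from it.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5)]
train_y = [0, 0, 0, 1, 1]

pred_a = knn_predict(train_X, train_y, (1.2, 1.4))
pred_b = knn_predict(train_X, train_y, (8.5, 9.0))
print(pred_a, pred_b)
```

Because the prediction depends on raw distances, features are usually put on a common scale before fitting, which is why the code that follows standardizes X.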
#### 1-1. Data preprocessing
```
import pandas as pd
import seaborn as sns
```

df = sns.load_dataset('titanic')
pd.set_option('display.max_columns', 15)

rdf = df.drop(['deck', 'embark_town'], axis = 1)
rdf = rdf.dropna(subset = ['age'], how = 'any', axis = 0)

Drop every row that has no value in the age column

most_freq = rdf['embarked'].value_counts(dropna=True).idxmax()

Find the most frequent embarkation port in the embarked column

rdf['embarked'] = rdf['embarked'].fillna(most_freq)

Replace the NaN values with that most frequent value, 'S'

ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]

onehot_sex = pd.get_dummies(ndf['sex'])

Turn the sex values, female and male, into dummy-variable columns

ndf = pd.concat([ndf, onehot_sex], axis = 1)

onehot_embarked = pd.get_dummies(ndf['embarked'], prefix = 'town')

Turn the embarked values into dummy-variable columns

Adds the prefix 'town' to the new column names

ndf = pd.concat([ndf, onehot_embarked], axis = 1)

ndf.drop(['sex', 'embarked'], axis = 1, inplace = True)

Remove the original sex and embarked columns
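
The dummy-variable step can be illustrated without pandas. The helper `one_hot` below is hypothetical and only mimics the idea behind `pd.get_dummies` (one 0/1 column per category, optional prefix); it is not how pandas implements it.

```python
def one_hot(values, prefix=None):
    """Return (column_names, rows); each row has a 1 in the column of its own category."""
    categories = sorted(set(values))
    names = [f"{prefix}_{c}" if prefix else str(c) for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

# Hypothetical embarked values for four passengers.
names, rows = one_hot(["S", "C", "S", "Q"], prefix="town")
print(names)
for row in rows:
    print(row)
```

Each original value becomes exactly one 1 among the new columns, which lets distance-based and linear models use a categorical feature without implying an order between categories.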

#### 1-2. Train/test split

X = ndf[['pclass', 'age', 'sibsp', 'parch', 'female', 'male']]
y = ndf['survived']

from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

#### 1-3. Model training, prediction, and evaluation

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)

Train the model

y_hat = knn.predict(X_test)

Predict on the test set

from sklearn import metrics

knn_report = metrics.classification_report(y_test, y_hat)

Reports the model's performance metrics (precision, recall, f1-score)

print(knn_report)

![](https://images.velog.io/images/ddaddo_data/post/7aa19b6d-936c-437d-8ec3-f4d46253028b/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-30%20222824.jpg)
### 2) SVM
* Learns from the coordinates of the training data in vector space together with the correct class label for each data point
#### 2-1. Data preprocessing

import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
pd.set_option('display.max_columns', 15)

rdf = df.drop(['deck', 'embark_town'], axis = 1)
rdf = rdf.dropna(subset = ['age'], how = 'any', axis = 0)

Drop every row that has no value in the age column

most_freq = rdf['embarked'].value_counts(dropna=True).idxmax()

Find the most frequent embarkation port in the embarked column

rdf['embarked'] = rdf['embarked'].fillna(most_freq)

Replace the NaN values with that most frequent value, 'S'

ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]

onehot_sex = pd.get_dummies(ndf['sex'])

Turn the sex values, female and male, into dummy-variable columns

ndf = pd.concat([ndf, onehot_sex], axis = 1)

onehot_embarked = pd.get_dummies(ndf['embarked'], prefix = 'town')

Turn the embarked values into dummy-variable columns

Adds the prefix 'town' to the new column names

ndf = pd.concat([ndf, onehot_embarked], axis = 1)

ndf.drop(['sex', 'embarked'], axis = 1, inplace = True)

Remove the original sex and embarked columns

#### 2-2. Train/test split

X = ndf[['pclass', 'age', 'sibsp', 'parch', 'female', 'male', 'town_C', 'town_Q', 'town_S']]
y = ndf['survived']

from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

#### 2-3. Model training, prediction, and evaluation

from sklearn import svm

svm_model = svm.SVC(kernel = 'rbf')

svm_model.fit(X_train, y_train)

Train the model

y_hat = svm_model.predict(X_test)

Predict on the test set

from sklearn import metrics

svm_matrix = metrics.confusion_matrix(y_test, y_hat)

Confusion matrix of actual vs. predicted classes

print(svm_matrix)
print('\n')

svm_report = metrics.classification_report(y_test, y_hat)

Reports the model's performance metrics (precision, recall, f1-score)

print(svm_report)

![](https://images.velog.io/images/ddaddo_data/post/5c124e1a-a122-4403-bfdb-598bf3906890/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-30%20222836.jpg)
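### 3) Decision Tree
* Uses a tree structure
* Each branching point (node) holds one attribute of the data being analyzed
* At each node, the attribute that best separates the target values is chosen and placed there
* New branches are created from the values that attribute takes
* The process repeats until entropy falls below a set level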
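
How entropy is used to pick "the attribute that best separates the target values" can be sketched as follows. The function names and the toy labels are hypothetical; this shows the standard Shannon entropy and information gain, not scikit-learn's internal code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent node minus the size-weighted entropy of its child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Hypothetical node with two survivors (1) and two non-survivors (0).
parent = [1, 1, 0, 0]
perfect_split = [[1, 1], [0, 0]]   # an attribute that separates the classes completely
useless_split = [[1, 0], [1, 0]]   # an attribute that tells us nothing

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect, gain_useless)
```

A tree built with `criterion='entropy'` prefers, at each node, the split with the largest gain; `max_depth` in the code that follows is an additional stopping rule.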
#### 3-1. Data preprocessing

import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
pd.set_option('display.max_columns', 15)

rdf = df.drop(['deck', 'embark_town'], axis = 1)
rdf = rdf.dropna(subset = ['age'], how = 'any', axis = 0)

Drop every row that has no value in the age column

most_freq = rdf['embarked'].value_counts(dropna=True).idxmax()

Find the most frequent embarkation port in the embarked column

rdf['embarked'] = rdf['embarked'].fillna(most_freq)

Replace the NaN values with that most frequent value, 'S'

ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]

onehot_sex = pd.get_dummies(ndf['sex'])

Turn the sex values, female and male, into dummy-variable columns

ndf = pd.concat([ndf, onehot_sex], axis = 1)

onehot_embarked = pd.get_dummies(ndf['embarked'], prefix = 'town')

Turn the embarked values into dummy-variable columns

Adds the prefix 'town' to the new column names

ndf = pd.concat([ndf, onehot_embarked], axis = 1)

ndf.drop(['sex', 'embarked'], axis = 1, inplace = True)

Remove the original sex and embarked columns

#### 3-2. Train/test split

X = ndf[['pclass', 'age', 'sibsp', 'parch', 'female', 'male']]
y = ndf['survived']

from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

#### 3-3. Model training, prediction, and evaluation

from sklearn import tree

tree_model = tree.DecisionTreeClassifier(criterion='entropy', max_depth = 5)

tree_model.fit(X_train, y_train)

Train the model

y_hat = tree_model.predict(X_test)

Predict on the test set

from sklearn import metrics

tree_report = metrics.classification_report(y_test, y_hat)

Reports the model's performance metrics (precision, recall, f1-score)

print(tree_report)

```

29. Regression analysis

Thu, 29 Jul 2021 08:22:38 GMT

### 1) Simple regression analysis

An algorithm that finds the probabilistic, statistical relationship between two variables paired one-to-one

A typical form of supervised learning

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('./auto-mpg.csv', header = None)
df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'name']

pd.set_option('display.max_columns', 10)

df['horsepower'] = df['horsepower'].replace('?', np.nan)
df.dropna(subset=['horsepower'], axis = 0, inplace = True)
df['horsepower'] = df['horsepower'].astype('float')

ndf = df[['mpg', 'cylinders', 'horsepower', 'weight']]
ndf.plot(kind = 'scatter', x = 'weight', y = 'mpg', c = 'coral', s = 10, figsize = (10,5))

Draw a scatter plot

x-axis is weight, y-axis is mpg, points are coral-colored with size 10, figure size is (10, 5)

plt.show()

![](https://images.velog.io/images/ddaddo_data/post/41b9f374-d735-45b0-9cb4-5b6b6cce0722/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20112948.jpg)

fig = plt.figure(figsize = (10,5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

sns.regplot(x = 'weight', y = 'mpg', data = ndf, ax = ax1)

With the regression line

regplot() draws a scatter plot of the two variables

sns.regplot(x = 'weight', y = 'mpg', data = ndf, ax = ax2, fit_reg = False)

Without the regression line

plt.show()

![](https://images.velog.io/images/ddaddo_data/post/1e86c8b1-0b6c-4319-9eba-78deceb1f434/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20113234.jpg)

sns.jointplot(x = 'weight', y = 'mpg', data = ndf)

Scatter plot with histograms

No regression line

sns.jointplot(x = 'weight', y = 'mpg', kind = 'reg', data = ndf)

Scatter plot with histograms

With a regression line

plt.show()

![](https://images.velog.io/images/ddaddo_data/post/41f3245f-7ab4-448a-9ddf-7b86b32856f8/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20114805.jpg)

grid_ndf = sns.pairplot(ndf)

Plot every pairwise combination of the 4 variables in ndf

plt.show()

![](https://images.velog.io/images/ddaddo_data/post/30569903-fb86-4bab-9f68-0d14d9a99d86/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20115013.jpg)

X = ndf[['weight']]

Independent variable: weight

y = ndf['mpg']

Dependent variable: mpg

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

Split the data into test : train = 3 : 7

lr = LinearRegression()

Create the simple regression model object

lr.fit(X_train, y_train)

Fit the model on the train data

print('Slope : ', lr.coef_)
print('y-intercept : ', lr.intercept_)

y_hat = lr.predict(X)

Store the values predicted from X in y_hat

plt.figure(figsize = (10, 5))
ax1 = sns.distplot(y, hist = False, label = "y")

Actual y values

ax2 = sns.distplot(y_hat, hist = False, label ="y_hat", ax = ax1)

Predicted y values

plt.show()

![](https://images.velog.io/images/ddaddo_data/post/fa602539-7447-4879-9be1-18d282e54091/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20115912.jpg)
![](https://images.velog.io/images/ddaddo_data/post/1bb2d4e4-b5ef-4cf2-ae7f-fdfa1787734f/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20115924.jpg)
### 2) Polynomial regression analysis
* A polynomial function can express a more complex, curved regression line
* Uses a polynomial of degree 2 or higher

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('./auto-mpg.csv', header = None)
df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'name']

pd.set_option('display.max_columns', 10)

df['horsepower'] = df['horsepower'].replace('?', np.nan)
df.dropna(subset=['horsepower'], axis = 0, inplace = True)
df['horsepower'] = df['horsepower'].astype('float')

ndf = df[['mpg', 'cylinders', 'horsepower', 'weight']]

X = ndf[['weight']]

Independent variable: weight

y = ndf['mpg']

Dependent variable: mpg

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

Split the data into test : train = 3 : 7

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree = 2)

Use degree-2 terms

X_train_poly = poly.fit_transform(X_train)

Expand the X_train data into degree-2 polynomial features

pr = LinearRegression()
pr.fit(X_train_poly, y_train)

X_test_poly = poly.transform(X_test)
y_hat_test = pr.predict(X_test_poly)

fig = plt.figure(figsize = (10, 5))
ax = fig.add_subplot(1,1,1)

ax.plot(X_train, y_train, 'o', label = 'Train Data')

Draw the train data as o markers

ax.plot(X_test, y_hat_test, 'r+', label = 'Predicted Value')

Draw the predicted values as + markers

ax.legend(loc = 'best')

plt.xlabel('weight')
plt.ylabel('mpg')
plt.show()

![](https://images.velog.io/images/ddaddo_data/post/001dc4e2-79f0-4f1a-acc5-9db00c0c96ae/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20125237.jpg)

X_poly = poly.transform(X)
y_hat = pr.predict(X_poly)

plt.figure(figsize = (10,5))
ax1 = sns.distplot(y, hist = False, label = "y")

Actual y values

ax2 = sns.distplot(y_hat, hist = False, label = "y_hat", ax = ax1)

Predicted y values

plt.show()

![](https://images.velog.io/images/ddaddo_data/post/780368ed-b3ba-43cd-a53b-43e8d36537ac/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20125707.jpg)
### 3) Multiple regression analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('./auto-mpg.csv', header = None)
df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'name']

pd.set_option('display.max_columns', 10)

df['horsepower'] = df['horsepower'].replace('?', np.nan)
df.dropna(subset=['horsepower'], axis = 0, inplace = True)
df['horsepower'] = df['horsepower'].astype('float')

ndf = df[['mpg', 'cylinders', 'horsepower', 'weight']]

X = ndf[['cylinders', 'horsepower', 'weight']]

Independent variables: cylinders, horsepower, weight

y = ndf['mpg']

Dependent variable: mpg

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

Split the data into test : train = 3 : 7

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(X_train, y_train)

print('Coefficients a of the X variables : ', lr.coef_)
print('\n')
print('Constant term b : ', lr.intercept_)

![](https://images.velog.io/images/ddaddo_data/post/459dab80-ac57-4f3b-9075-ad8d40957eaa/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-29%20171659.jpg)

y_hat = lr.predict(X_test)

plt.figure(figsize = (10,5))

ax1 = sns.distplot(y_test, hist= False, label = "y_test")

Actual y values

ax2 = sns.distplot(y_hat, hist=False, label = "y_hat", ax = ax1)

Predicted y values

plt.show()

```

Moving from simple to polynomial to multiple regression, the predicted distribution still leans to one side, but its peak becomes somewhat less sharp

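28. Machine learning overview

Tue, 27 Jul 2021 01:14:25 GMT

1) What is machine learning?

The process by which a machine learns from data on its own and works out the relationships between different variables

2) Supervised vs. unsupervised learning

Supervised learning : the correct answers (labels) are fed to the algorithm together with the other data, e.g. regression, classification models
Unsupervised learning : the algorithm finds hidden patterns in the data on its own, without labels, e.g. cluster analysis

3) The machine learning process

Clean the data
Split the data (into train data and test data)
Prepare the algorithm
Train the model (using the train data)
Predict (using the test data)
Evaluate the model
Put the model to use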
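
The steps above can be walked through end to end with a tiny one-feature regression in plain Python. Everything here is an assumption made for illustration: the noiseless toy relation y = 2x + 1, the helper `fit_line`, and the 70/30 split ratio.

```python
import random

# 1) Clean/prepare the data: a hypothetical noiseless relation y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]

# 2) Split into train and test data (70% / 30%).
random.seed(10)
random.shuffle(data)
cut = len(data) * 7 // 10
train, test = data[:cut], data[cut:]

# 3) Prepare the algorithm: ordinary least squares for a single feature.
def fit_line(pairs):
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# 4) Train the model on the train data only.
slope, intercept = fit_line(train)

# 5) Predict on the held-out test data.
predictions = [slope * x + intercept for x, _ in test]

# 6) Evaluate: mean squared error on the test data.
mse = sum((p - y) ** 2 for p, (_, y) in zip(predictions, test)) / len(test)
print(slope, intercept, mse)
```

Step 7, putting the model to use, would simply mean calling `slope * x + intercept` on new x values once the test error is acceptable.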

27. Pivot

Tue, 27 Jul 2021 01:07:08 GMT

1) Pivot table

import pandas as pd
import seaborn as sns

pd.set_option('display.max_columns', 10)
pd.set_option('display.max_colwidth', 20)

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]

pdf1 = pd.pivot_table(df,
                     index = 'class', # column used for the row index
                     columns = 'sex', # column used for the column index
                     values = 'age', # column that supplies the data
                     aggfunc = 'mean') # aggregation function

# mean age by class and sex

print(pdf1.head())

pdf2 = pd.pivot_table(df,
                     index = 'class', # column used for the row index
                     columns = 'sex', # column used for the column index
                     values = 'survived', # column that supplies the data
                     aggfunc = ['mean', 'sum']) # aggregation functions

# mean and sum of survived by class and sex

print(pdf2.head())

pdf3 = pd.pivot_table(df,
                     index = ['class', 'sex'], # columns used for the row index
                     columns = 'survived', # column used for the column index
                     values = ['age', 'fare'], # columns that supply the data
                     aggfunc = ['mean', 'max']) # aggregation functions

# mean and max of age and fare by class, sex, and survived

print(pdf3.xs('First'))
# print only First class

print(pdf3.xs(('First', 'female')))
# print only the result for First class and female

print(pdf3.xs('male', level = 'sex'))
# select the rows whose sex level in the row index is male

print('\n')

print(pdf3.xs(('Second', 'male'), level = [0, 'sex']))
# select the rows where class is Second and sex is male

print('\n')

print(pdf3.xs('mean', axis = 1))
# select the columns under mean (the columns holding averages)

print('\n')

print(pdf3.xs(('mean', 'age'), axis = 1))
# select the columns under mean and age (returns the average age)

print('\n')

print(pdf3.xs(1, level = 'survived', axis = 1))
# select only the columns where survived is 1

print('\n')

print(pdf3.xs(('max', 'fare', 0), level = [0,1,2], axis = 1))
# select only the column for max, fare, and survived equal to 0
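
The column structure that `xs` navigates above is easier to see on a very small table. The toy DataFrame below is a hypothetical stand-in for the titanic columns, so the numbers are made up; the calls themselves are the same `pd.pivot_table` and `xs` used above.

```python
import pandas as pd

# Hypothetical toy data standing in for the titanic columns.
toy = pd.DataFrame({
    "class": ["First", "First", "Second", "Second"],
    "sex": ["female", "male", "female", "male"],
    "age": [30.0, 40.0, 20.0, 50.0],
    "fare": [80.0, 60.0, 20.0, 10.0],
})

table = pd.pivot_table(toy, index="class", columns="sex",
                       values=["age", "fare"], aggfunc=["mean", "max"])

# The columns form a 3-level MultiIndex: (aggregation, value column, sex).
print(table.columns.nlevels)

mean_part = table.xs("mean", axis=1)           # drops the aggregation level
mean_age = table.xs(("mean", "age"), axis=1)   # drops two levels at once
print(mean_age)
```

Passing a tuple to `xs` walks the levels from the outside in, which is why `('mean', 'age')` works without a `level` argument.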

26. MultiIndex

Mon, 26 Jul 2021 08:45:27 GMT

1) MultiIndex

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]

grouped = df.groupby(['class', 'sex'])
# group by class and sex
# 3 * 2 = 6 combinations

gdf = grouped.mean()
# mean of each numeric column per group

print(gdf)

print(gdf.loc[('First', 'female')])
# select only the values where class is First and sex is female

print(gdf.xs('male', level = 'sex'))
# select only the rows where sex is male

25. Group operations

Mon, 26 Jul 2021 08:34:38 GMT

1) Creating group objects (split step)

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]

grouped = df.groupby('class')
# split the DataFrame by class

for key, group in grouped:
    print('key : ', key) # the group's key, i.e. the class value
    print('number: ', len(group)) # number of rows in the group
    print(group.head()) # first 5 rows of the group
    print('\n')

average = grouped.mean(numeric_only = True)
# mean of each numeric column per group (the string column sex is skipped)

print(average)

group3 = grouped.get_group('Third')
# select only the Third class group

print(group3.head())

grouped_two = df.groupby(['class', 'sex'])

for key, group in grouped_two:
    print('key : ', key) # the group's key, a (class, sex) pair (3 * 2 = 6 combinations)
    print('number: ', len(group)) # number of rows in the group
    print(group.head()) # first 5 rows of the group
    print('\n')

average_two = grouped_two.mean()
# mean of each numeric column, grouped by class and sex
print(average_two)
print('\n')

group3f = grouped_two.get_group(('Third', 'female'))
# select only the group where class is Third and sex is female

print(group3f.head())

2) Group operation methods (apply - combine step)

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]

grouped = df.groupby('class')
# split the DataFrame by class

std_all = grouped[['age', 'fare', 'survived']].std()
# standard deviation of every numeric column per group, returned as a DataFrame

print(std_all)
print('\n')

std_fare = grouped.fare.std()
# standard deviation of the fare column per group, returned as a Series
print(std_fare)

def min_max(x):
    return x.max() - x.min()
# returns the range of each group (max minus min)

agg_minmax = grouped[['age', 'fare', 'survived']].agg(min_max)
print(agg_minmax)

agg_all = grouped.agg(['min', 'max'])
# apply the same functions, min and max, to every column

print(agg_all)
print('\n')

agg_sep = grouped.agg({'fare' : ['min', 'max'], 'age':'mean'})
# apply min and max to the fare column
# apply mean to the age column

print(agg_sep)

age_mean = grouped.age.mean()
# mean of the age column per group

print(age_mean)

def z_score(x):
    return (x-x.mean())/x.std()
# function that computes the z-score

age_zscore = grouped.age.transform(z_score)
# z-score of age computed within each group

print(age_zscore.loc[0:9])
# print the first 10 z-scores

grouped_filter = grouped.filter(lambda x : len(x) >= 200)
# keep only the groups that have at least 200 rows

print(grouped_filter.head())

agg_grouped = grouped.apply(lambda x : x.describe())
# summary statistics for each group

print(agg_grouped)

def z_score(x):
    return (x - x.mean()) / x.std()

age_zscore = grouped.age.apply(z_score)
# z-score of age computed within each group, this time via apply

print(age_zscore)

age_filter = grouped.apply(lambda x : x.age.mean() < 30)
# True for each group whose mean age is below 30

for x in age_filter.index:
    if age_filter[x]:
        age_filter_df = grouped.get_group(x)
        print(age_filter_df.head())
        print('\n')
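
The three group methods used above differ mainly in the shape of what they return, which a small example makes visible. The toy DataFrame is hypothetical; `agg`, `transform`, and `filter` are the same pandas GroupBy methods as in the code above.

```python
import pandas as pd

# Hypothetical toy data: group A has 3 rows, group B has 2.
toy = pd.DataFrame({
    "class": ["A", "A", "A", "B", "B"],
    "age": [10.0, 20.0, 30.0, 40.0, 60.0],
})
grouped = toy.groupby("class")

# agg: collapses each group to a single value (one row per group).
age_range = grouped["age"].agg(lambda s: s.max() - s.min())

# transform: returns one value per original row, aligned to the original index.
age_z = grouped["age"].transform(lambda s: (s - s.mean()) / s.std())

# filter: keeps or drops whole groups based on a condition.
big_groups = grouped.filter(lambda g: len(g) >= 3)

print(age_range)
print(age_z)
print(big_groups)
```

A rough rule of thumb: use `agg` for a summary table, `transform` when the result must be attached back to the original rows, and `filter` to discard groups.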

24. Combining DataFrames

Sun, 25 Jul 2021 04:36:49 GMT

1) Concatenating DataFrames

import pandas as pd

df1 = pd.DataFrame({'a' : ['a0', 'a1', 'a2', 'a3'],
                   'b' :  ['b0', 'b1', 'b2', 'b3'],
                   'c' :  ['c0', 'c1', 'c2', 'c3']},
                  index = [0,1,2,3])

df2 = pd.DataFrame({'a' : ['a2', 'a3', 'a4', 'a5'],
                   'b' :  ['b2', 'b3', 'b4', 'b5'],
                   'c' :  ['c2', 'c3', 'c4', 'c5'],
                    'd' :  ['d2', 'd3', 'd4', 'd5'] },
                  index = [2,3,4,5])

result1 = pd.concat([df1, df2])
# stack the 2 DataFrames on top of each other, along the row direction

print(result1)

result2 = pd.concat([df1, df2], ignore_index = True)
# stack the 2 DataFrames on top of each other, along the row direction
# ignore the existing row index and assign a new one

print(result2)

result3 = pd.concat([df1,df2], axis =1)
# place the 2 DataFrames side by side, along the column direction

print(result3)

result4 = pd.concat([df1, df2], axis = 1, join = 'inner')
# side by side, keeping only the row labels present in both DataFrames

print(result4)

sr1 = pd.Series(['e0', 'e1', 'e2', 'e3'], name = 'e')
sr2 = pd.Series(['f0', 'f1', 'f2'], name = 'f', index = [3,4,5])
sr3 = pd.Series(['g0', 'g1', 'g2', 'g3'], name = 'g')

result5 = pd.concat([df1, sr1], axis =1, sort = True)
# add column e to the df1 DataFrame

print(result5)

result6 = pd.concat([sr1,sr3], axis =0)
print(result6)
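
The effect of `axis`, `join`, and `ignore_index` shows up most clearly in the resulting index. The two small frames below are hypothetical and only partly overlap in their row labels.

```python
import pandas as pd

# Hypothetical frames whose row labels overlap only at 1 and 2.
left = pd.DataFrame({"a": ["a0", "a1", "a2"]}, index=[0, 1, 2])
right = pd.DataFrame({"d": ["d1", "d2", "d3"]}, index=[1, 2, 3])

outer = pd.concat([left, right], axis=1)                # union of row labels, NaN where missing
inner = pd.concat([left, right], axis=1, join="inner")  # only labels present in both frames
stacked = pd.concat([left, right], ignore_index=True)   # rows stacked, index renumbered

print(outer)
print(inner)
print(stacked)
```

With `axis=1` the alignment happens on row labels, so `join` decides which labels survive; with the default `axis=0` it is the column names that are aligned instead.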

2) Merging DataFrames

import pandas as pd

pd.set_option('display.max_columns', 10) # maximum number of columns to print
pd.set_option('display.max_colwidth', 20) # width of each printed column
pd.set_option('display.unicode.east_asian_width', True) # adjust widths for East Asian Unicode characters

df1 = pd.read_excel('./stock price.xlsx')
df2 = pd.read_excel('./stock valuation.xlsx')

merge_inner = pd.merge(df1,df2)
# merge df1 and df2 (intersection)

print(merge_inner)

merge_outer = pd.merge(df1,df2, how = 'outer', on = 'id')
# merge df1 and df2 (union)
# merged on the id column

print(merge_outer)

merge_left = pd.merge(df1,df2, how = 'left', left_on = 'stock_name', right_on = 'name')
# merge df1 and df2 (keeping every row of the left DataFrame)
# matches the stock_name column of df1 against the name column of df2

print(merge_left)

merge_right = pd.merge(df1,df2, how = 'right', left_on = 'stock_name', right_on = 'name')
# merge df1 and df2 (keeping every row of the right DataFrame)
# matches the stock_name column of df1 against the name column of df2

print(merge_right)

price = df1[df1['price'] < 50000]
# rows of df1 with a price below 50000

value = pd.merge(price, df2)
# rows of df1 with a price below 50000 that also exist in df2

print(value)

3) Joining DataFrames

The join() method combines two DataFrames on their row index
With the on = 'keys' option it can combine on a column instead

import pandas as pd

pd.set_option('display.max_columns', 10) # maximum number of columns to print
pd.set_option('display.max_colwidth', 20) # width of each printed column
pd.set_option('display.unicode.east_asian_width', True) # adjust widths for East Asian Unicode characters

df1 = pd.read_excel('./stock price.xlsx', index_col = 'id')
df2 = pd.read_excel('./stock valuation.xlsx', index_col = 'id')

df3 = df1.join(df2)
# join df1 and df2
# df1 is the base: rows of df1 are kept even when df2 has no match, since how = 'left' is the default

print(df3)

df4 = df1.join(df2, how = 'inner')
# intersection of df1 and df2
# keeps only the rows present in both df1 and df2

print(df4)
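
The `on` option mentioned at the top of this section can be sketched with two small frames. The data is a hypothetical stand-in for the two Excel files: here `id` is an ordinary column in the left frame and the index of the right frame.

```python
import pandas as pd

# Hypothetical stand-ins for the stock price and stock valuation files.
prices = pd.DataFrame({"id": [1, 2, 3], "price": [100, 200, 300]})
valuation = pd.DataFrame({"per": [10.0, 20.0]}, index=pd.Index([2, 3], name="id"))

# on='id': match the caller's id column against the other frame's index.
left_joined = prices.join(valuation, on="id")                 # how='left' is the default
inner_joined = prices.join(valuation, on="id", how="inner")   # only ids found in both

print(left_joined)
print(inner_joined)
```

Note that `on` refers to a column of the calling DataFrame only; the other DataFrame is always matched on its index, which is the main difference from `pd.merge`.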

23. Filtering

Sun, 25 Jul 2021 03:46:59 GMT

1) Boolean indexing

import seaborn as sns

titanic = sns.load_dataset('titanic')

mask1 = (titanic.age >=10) & (titanic.age <20)
# age 10 or older and under 20

df_teenage = titanic.loc[mask1, :]

print(df_teenage.head())

mask2 = (titanic.age <10) & (titanic.sex == 'female')
# female and under 10

df_female_under10 = titanic.loc[mask2, :]

print(df_female_under10.head())

mask3 = (titanic.age <10) | (titanic.age > 60)
# under 10 or over 60

df = titanic.loc[mask3, :]

print(df.head())

2) Using the isin() method

import seaborn as sns

titanic = sns.load_dataset('titanic')
isin_filter = titanic['sibsp'].isin([3,4,5])
# True for rows where sibsp is 3, 4, or 5

df_isin = titanic[isin_filter]
print(df_isin.head())
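
Both filtering styles above reduce to building a boolean Series and passing it to the DataFrame. The toy data is hypothetical; the example also shows that `isin` is shorthand for a chain of `==` comparisons joined with `|`.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"age": [5, 15, 25, 65], "sibsp": [0, 3, 1, 4]})

# Each comparison needs its own parentheses because & and | bind tighter than >= and <.
teen_mask = (toy["age"] >= 10) & (toy["age"] < 20)
young_or_old = (toy["age"] < 10) | (toy["age"] > 60)

teens = toy.loc[teen_mask]
extremes = toy.loc[young_or_old]

# isin() gives the same rows as the chained equality checks.
by_isin = toy[toy["sibsp"].isin([3, 4, 5])]
by_or = toy[(toy["sibsp"] == 3) | (toy["sibsp"] == 4) | (toy["sibsp"] == 5)]

print(teens)
print(extremes)
print(by_isin)
```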

22. Restructuring columns

Fri, 23 Jul 2021 08:30:05 GMT

1) Changing the column order

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[0:4, 'survived':'age']

columns = list(df.columns.values)
# turn the column names into a list

columns_sort = sorted(columns)
# sort the column names alphabetically

df_sorted = df[columns_sort]
print(df_sorted)

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[0:4, 'survived':'age']

columns = list(df.columns.values)
# turn the column names into a list

columns_reversed = list(reversed(columns))
# reverse the original order of the column names

df_reversed = df[columns_reversed]
print(df_reversed)

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[0:4, 'survived':'age']

columns = list(df.columns.values)
# turn the column names into a list

columns_custom = ['pclass', 'sex', 'age', 'survived']
df_customed = df[columns_custom]
print(df_customed)

2) Splitting a column

import pandas as pd

df = pd.read_excel('./주가데이터.xlsx')
print(df.head())
print('\n')

df['연월일'] = df['연월일'].astype('str')
# convert the 연월일 (year-month-day) column to strings

dates = df['연월일'].str.split('-')
# split each value on '-'

df['연'] = dates.str.get(0)
# element 0 of each list in dates (year)

df['월'] = dates.str.get(1)
# element 1 of each list in dates (month)

df['일'] = dates.str.get(2)
# element 2 of each list in dates (day)

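21. Function mapping

Fri, 23 Jul 2021 08:07:39 GMT

1) Mapping a function over individual elements

Calling apply() on a Series passes each element, one at a time, to the mapping function given as the argument and collects the return values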

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'fare']]

df['ten'] = 10

def add_10(n):
    return n+10

def add_two_obj(a,b):
    return a + b

sr1 = df['age'].apply(add_10)
print(sr1.head())
print('\n')

sr2 = df['age'].apply(add_two_obj, b = 10)
# a = value from the age column, b = 10

print(sr2.head())
print('\n')

sr3 = df['age'].apply(lambda x : add_10(x))
# x = value from the age column

print(sr3.head())

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'fare']]

df['ten'] = 10

def add_10(n):
    return n+10

def add_two_obj(a,b):
    return a + b

df_map = df.applymap(add_10)
# apply the function to every element of the DataFrame
# adds 10 to every value

print(df_map.head())

2) Mapping a function over Series objects

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'fare']]

def missing_value(series):
    return series.isnull()
# returns, element by element, whether each value in the Series is null

result = df.apply(missing_value, axis = 0)
# returns a DataFrame

print(result.head())

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'fare']]

def max_min(x):
    return x.max() - x.min()
# max - min for each column

result = df.apply(max_min)
# returns a Series with one value per column

print(result)

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'fare']]

df['ten'] = 10

def add_two_obj(a,b):
    return a + b

df['add'] = df.apply(lambda x: add_two_obj(x['age'], x['ten']), axis =1)
# a = age column, b = ten column
# returns the age value plus 10 for each row

print(df.head())

3) Mapping a function over DataFrame objects

import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'fare']]

def missing_value(x):
    return x.isnull()

def missing_count(x):
    return missing_value(x).sum()

def total_number_missing(x):
    return missing_count(x).sum()

result_df = df.pipe(total_number_missing)
# pipe passes the whole DataFrame to the function, which returns the total number of null values

print(result_df)

20. Time series data

Thu, 22 Jul 2021 02:54:58 GMT

Timestamp : records a specific point in time
Period : represents a fixed span between two points in time
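
The difference between the two types can be checked directly in pandas. The dates below are arbitrary examples chosen for illustration.

```python
import pandas as pd

ts = pd.Timestamp("2021-07-22 11:06")    # a single instant
month = pd.Period("2021-07", freq="M")   # the whole month of July 2021

# A Period has a start and an end; a Timestamp is just one point.
print(month.start_time)
print(month.end_time)

inside = month.start_time <= ts <= month.end_time   # does the instant fall in the span?
same_month = ts.to_period("M") == month             # converting the instant gives the same Period
print(inside, same_month)
```

Converting a Timestamp with `to_period` discards everything finer than the chosen frequency, which is what the Period-based columns later in this post rely on.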
### 1) Converting other data types to time series objects - Timestamp ver.
```
import pandas as pd
```

df = pd.read_csv('./stock-data.csv')
print(df.info())

![](https://images.velog.io/images/ddaddo_data/post/46dc0de6-1dbd-45e0-9339-a093fb494fba/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20110619.jpg)

df['new_Date'] = pd.to_datetime(df['Date'])

Convert to Timestamp with to_datetime

print(df.info())
print('\n')

df.set_index('new_Date', inplace = True)
df.drop('Date', axis = 1, inplace = True)

print(df.head())
print('\n')
print(df.info())

![](https://images.velog.io/images/ddaddo_data/post/4a7711ee-d605-4c90-819e-3b76d0ad9597/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20111111.jpg)
### 2) Converting other data types to time series objects - Period ver.

import pandas as pd

dates = ['2019-01-01', '2020-03-01', '2021-06-21']

ts_dates = pd.to_datetime(dates)
print(ts_dates)

pr_day = ts_dates.to_period(freq = 'D')

With freq 'D', each period spans 1 day

print(pr_day)

pr_month = ts_dates.to_period(freq = 'M')

With freq 'M', each period spans 1 month

print(pr_month)

pr_year = ts_dates.to_period(freq = 'A')

With freq 'A', each period spans 1 year

print(pr_year)

![](https://images.velog.io/images/ddaddo_data/post/1c9db7c9-33a5-4559-ba54-d3e2c7a3039e/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20111523.jpg)

### 3) Creating time series data - Timestamp array

import pandas as pd

ts_ms = pd.date_range(start = '2019-01-01', # start date
                      end = None, # end of the date range
                      periods = 6, # number of Timestamps to generate
                      freq = 'MS', # monthly, anchored at the first day of each month
                      tz = 'Asia/Seoul') # Asia/Seoul time zone

print(ts_ms)
print('\n')

ts_m = pd.date_range(start = '2019-01-01', # start date
                     periods = 3, # number of Timestamps to generate
                     freq = 'M', # monthly, anchored at the last day of each month
                     tz = 'Asia/Seoul') # Asia/Seoul time zone

print(ts_m)
print('\n')

ts_3m = pd.date_range(start = '2019-01-01', # start date
                      periods = 6, # number of Timestamps to generate
                      freq = '3M', # every 3 months, anchored at the last day of the month
                      tz = 'Asia/Seoul') # Asia/Seoul time zone

print(ts_3m)

![](https://images.velog.io/images/ddaddo_data/post/a81cf174-de96-40b3-980b-0fbe2ef7ce87/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20113110.jpg)
### 4) Creating time series data - Period array

import pandas as pd

pr_m = pd.period_range(start = '2019-01-01', # start date
                       end = None, # end of the date range
                       periods = 6, # number of Periods to generate
                       freq = 'M') # length of each period (month)

print(pr_m)
print('\n')

pr_h = pd.period_range(start = '2019-01-01', # start date
                       end = None, # end of the date range
                       periods = 6, # number of Periods to generate
                       freq = 'H') # length of each period (hour)

print(pr_h)
print('\n')

pr_2h = pd.period_range(start = '2019-01-01', # start date
                        end = None, # end of the date range
                        periods = 6, # number of Periods to generate
                        freq = '2H') # length of each period (2 hours)

print(pr_2h)

![](https://images.velog.io/images/ddaddo_data/post/021d10b9-86b0-47c4-b4a6-86d5c38c6876/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20113457.jpg)
### 5) Using time series data - splitting date components

import pandas as pd

df = pd.read_csv('./stock-data.csv')
df['new_Date'] = pd.to_datetime(df['Date'])

Convert to Timestamp with to_datetime

df['Year'] = df['new_Date'].dt.year

Extract only the year

df['Month'] = df['new_Date'].dt.month

Extract only the month

df['Day'] = df['new_Date'].dt.day

Extract only the day

df['Date_yr'] = df['new_Date'].dt.to_period(freq = 'A')

Show only the year

df['Date_m'] = df['new_Date'].dt.to_period(freq = 'M')

Show year-month

print(df.head())

![](https://images.velog.io/images/ddaddo_data/post/0796122c-01ad-4280-8489-049e76dc573e/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20114034.jpg)
### 6) Using time series data - working with a date index

import pandas as pd

df = pd.read_csv('./stock-data.csv')
df['new_Date'] = pd.to_datetime(df['Date'])

Convert to Timestamp with to_datetime

df.set_index('new_Date', inplace = True)

df_2018 = df.loc['2018']
df_2018

![](https://images.velog.io/images/ddaddo_data/post/d5831f96-f466-495d-be8d-413ab6b62cd5/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20114620.jpg)

df_201807 = df.loc['2018-07']
df_201807

![](https://images.velog.io/images/ddaddo_data/post/edabece1-5912-498b-9ae3-554a360878bc/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20114722.jpg)

df_ym_cols = df.loc['2018-07', 'Start':'High']

Select the Start through High columns for the rows dated 2018-07

print(df_ym_cols)

![](https://images.velog.io/images/ddaddo_data/post/2c4dd127-221f-447f-bb73-d60b0014296d/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20114902.jpg)

df_range = df['2018-06-30' : '2018-06-20']
df_range

![](https://images.velog.io/images/ddaddo_data/post/3835d067-9cca-4431-8fe9-670ca1a0ccaf/%ED%99%94%EB%A9%B4%20%EC%BA%A1%EC%B2%98%202021-07-22%20115047.jpg)

today = pd.to_datetime('2018-12-25')

Set the reference date

df['time_delta'] = today - df.index

Difference between the reference date and each row's date

df.set_index('time_delta', inplace = True)
df_180 = df['180 days':'189 days']

Keep only the rows whose difference is between 180 and 189 days, inclusive

print(df_180)

```

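19. Normalization

Thu, 22 Jul 2021 01:53:57 GMT

Differences in the relative magnitude of the numbers in each variable can change the result of a machine learning analysis
These relative differences in magnitude need to be removed

1) Normalization - dividing by the column maximum

: Expresses each value in a column as a ratio, by dividing by a common scale
: After normalization the data falls in the range 0~1 or -1~1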

import pandas as pd
import numpy as np

df = pd.read_csv('./auto-mpg.csv', header = None)
df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year',
          'origin', 'name']
df['horsepower'] = df['horsepower'].replace('?', np.nan)
df.dropna(subset = ['horsepower'], axis = 0, inplace = True)
df['horsepower'] = df['horsepower'].astype('float')

print(df.horsepower.describe())

df.horsepower = df.horsepower/abs(df.horsepower.max())
# divide the horsepower values by the column maximum (the normalization step)

print(df.horsepower.head())
print(df.horsepower.describe())

2) Normalization - dividing by the column's (max - min)

: Subtract the column minimum from each value (value - min), then divide by the column's (max - min)

import pandas as pd
import numpy as np

df = pd.read_csv('./auto-mpg.csv', header = None)
df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year',
          'origin', 'name']
df['horsepower'] = df['horsepower'].replace('?', np.nan)
df.dropna(subset = ['horsepower'], axis = 0, inplace = True)
df['horsepower'] = df['horsepower'].astype('float')

print(df.horsepower.describe())
print('\n')

min_x = df.horsepower - df.horsepower.min()
# 열의 값에서 최소값을 뺌

min_max = df.horsepower.max() - df.horsepower.min()
# 최대값에서 최소값을 뺌

df.horsepower = min_x / min_max
# (값 - 최소값) / (최대값 - 최소값)

print(df.horsepower.head())
print('\n')
print(df.horsepower.describe())
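
The two normalization formulas can be compared side by side on a tiny example. This is a minimal sketch, assuming a made-up `hp` Series (hypothetical values, not the auto-mpg data).

```python
import pandas as pd

# Hypothetical horsepower values.
hp = pd.Series([46.0, 100.0, 150.0, 230.0], name="horsepower")

# Method 1: divide by the largest absolute value in the column.
max_scaled = hp / hp.abs().max()

# Method 2: (value - min) / (max - min).
min_max_scaled = (hp - hp.min()) / (hp.max() - hp.min())

print(max_scaled.tolist())
print(min_max_scaled.tolist())
```

With method 1 the smallest value stays above 0 (here 46 / 230 = 0.2), while method 2 always maps the minimum to 0 and the maximum to 1.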

18. Handling Categorical Data

Wed, 21 Jul 2021 02:02:40 GMT

1) Binning

import pandas as pd
import numpy as np

df = pd.read_csv('./auto-mpg.csv', header = None)

df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight','acceleration','model year', 'origin', 'name']
df['horsepower'] = df['horsepower'].replace('?', np.nan)
# Replace ? with NaN

df.dropna(subset = ['horsepower'], axis = 0, inplace = True)
# Drop rows where horsepower is NaN

df['horsepower'] = df['horsepower'].astype('float')
# Convert object dtype to float

count , bin_dividers = np.histogram(df['horsepower'], bins = 3)
# Use np.histogram to get the list of edges that split the data into 3 bins
print(bin_dividers)

bin_name = ['저출력', '보통출력', '고출력']
# Names for the 3 bins (low output, medium output, high output)

df['hp_bin'] = pd.cut(x = df['horsepower'],
                     bins = bin_dividers,
                     labels = bin_name,
                     include_lowest = True) # include the first edge
# Use pd.cut to assign each value to one of the 3 bins

print(df[['horsepower', 'hp_bin']].head(10))
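
The binning steps above can be sketched on a small in-memory Series. The values and the English labels below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values.
hp = pd.Series([46, 60, 90, 110, 150, 200, 230], dtype="float")

# np.histogram returns the count per bin and the 4 edges of 3 equal-width bins.
counts, edges = np.histogram(hp, bins=3)

# pd.cut assigns each value to a labeled bin; include_lowest=True keeps the
# minimum value (46) inside the first bin.
labels = ["low", "medium", "high"]
binned = pd.cut(hp, bins=edges, labels=labels, include_lowest=True)

print(edges)
print(binned.tolist())
```

Without `include_lowest=True` the first interval would be open on the left, so the minimum value would fall outside every bin and become NaN.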

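2) Dummy Variables

A dummy variable takes the value 0 or 1 and indicates whether a given feature is present
1 if the feature is present
0 if the feature is not present

horsepower_dummies = pd.get_dummies(df['hp_bin'])
# Convert the categorical data in the hp_bin column to dummy variables (one-hot encoding)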
print(horsepower_dummies.head(10))
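
As a quick check of what one-hot encoding produces, here is a minimal sketch on a hypothetical categorical Series. The `dtype=int` argument is added so the result is 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical bin labels with an explicit category order.
hp_bin = pd.Series(
    pd.Categorical(["low", "high", "medium", "low"],
                   categories=["low", "medium", "high"])
)

# One column per category; exactly one 1 in every row.
dummies = pd.get_dummies(hp_bin, dtype=int)

print(dummies)
```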

3) One-hot encoding with sklearn

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
onehot_encoder = preprocessing.OneHotEncoder()

onehot_labeled = label_encoder.fit_transform(df['hp_bin'].head(15))
# Use label_encoder to convert the string categories to numeric categories

print(onehot_labeled)

onehot_reshaped = onehot_labeled.reshape(len(onehot_labeled),1)
# Reshape into a 2-D matrix

print(onehot_reshaped)

onehot_fitted = onehot_encoder.fit_transform(onehot_reshaped)
# Convert to a sparse matrix
# Each (row, column) entry printed marks a stored value of 1.0

print(onehot_fitted)

17. Data Standardization

Wed, 21 Jul 2021 01:36:48 GMT

1) Unit conversion

mpg column : miles per gallon
1 mile : 1.60934 km
1 gallon : 3.78541 L

import pandas as pd

df = pd.read_csv('./auto-mpg.csv', header = None)

df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight','acceleration','model year', 'origin', 'name']

mpg_to_kpl = 1.60934/3.78541
# Conversion factor from mpg to kpl
# (km per mile) / (liters per gallon)

df['kpl'] = df['mpg'] * mpg_to_kpl
# mpg * conversion factor

df['kpl'] = df['kpl'].round(2)

print(df.head(3))
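
The conversion factor follows from the two unit definitions above. Below is a minimal sketch with a few hypothetical mpg values; the constant names are introduced here for illustration.

```python
import pandas as pd

KM_PER_MILE = 1.60934
LITERS_PER_GALLON = 3.78541

# miles/gallon * (km/mile) / (L/gallon) = km/L
mpg_to_kpl = KM_PER_MILE / LITERS_PER_GALLON

# Hypothetical fuel economy values in miles per gallon.
mpg = pd.Series([18.0, 15.0, 36.0])
kpl = (mpg * mpg_to_kpl).round(2)

print(round(mpg_to_kpl, 4))
print(kpl.tolist())
```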

2) Type conversion (object -> float)

import pandas as pd

df = pd.read_csv('./auto-mpg.csv', header = None)

df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight','acceleration','model year', 'origin', 'name']

print(df.dtypes)
# Check the dtype of each column

print(df['horsepower'].unique())
# Check the unique values in the horsepower column

import numpy as np

df['horsepower'] = df['horsepower'].replace('?', np.nan)
# Replace ? with NaN

df.dropna(subset = ['horsepower'], axis = 0, inplace = True)
# Drop rows where horsepower is NaN

df['horsepower'] = df['horsepower'].astype('float')
# Convert object dtype to float

print(df['horsepower'].dtypes)
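
An alternative to the replace / dropna / astype sequence is `pd.to_numeric` with `errors='coerce'`, which turns any value that cannot be parsed into NaN in one step. This is a different technique from the one used above, sketched here on hypothetical strings.

```python
import pandas as pd

# Hypothetical raw column where '?' marks a missing value.
raw = pd.Series(["130.0", "?", "165.0"])

# Unparseable entries become NaN and the result is a float column.
hp = pd.to_numeric(raw, errors="coerce")

# Drop the rows that could not be converted.
hp = hp.dropna()

print(hp.dtype)
print(hp.tolist())
```

This also catches placeholder strings other than `?`, which may or may not be desirable depending on the data.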

3) Type conversion (int -> object)

print(df['origin'].unique())

df['origin'] = df['origin'].replace({1: 'USA', 2: 'EU', 3: 'JPN'})
# Convert integer data to string data

print(df['origin'].unique())
print(df['origin'].dtypes)

4) Type conversion (object -> category)

df['origin'] = df['origin'].astype('category')
print(df['origin'].dtypes)

5) Type conversion (category -> object)

df['origin'] = df['origin'].astype('str')
print(df['origin'].dtypes)

6) Type conversion (int -> category)

print(df['model year'].sample(3))

df['model year'] = df['model year'].astype('category')
print(df['model year'].sample(3))
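
The int -> string -> category chain can be followed on a small Series. This sketch uses hypothetical origin codes and `Series.map` in place of `replace`; both apply the same code-to-name mapping here.

```python
import pandas as pd

# Hypothetical origin codes.
origin = pd.Series([1, 3, 2, 1])

# Integer codes -> country names.
origin_named = origin.map({1: "USA", 2: "EU", 3: "JPN"})

# Names -> category dtype; the distinct values become the categories.
origin_cat = origin_named.astype("category")

print(origin_cat.dtype)
print(list(origin_cat.cat.categories))
```

A category column stores each distinct value once and keeps integer codes per row, which can save memory when a column has few distinct values.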

16. Handling Duplicate Data

Tue, 20 Jul 2021 12:52:02 GMT

1) Checking for duplicates

import pandas as pd

df = pd.DataFrame({'c1' : ['a', 'a','b','a','b'],
                  'c2' : [1,1,1,2,2],
                  'c3' : [1,1,2,2,2]})

print(df)
df_dup = df.duplicated()
# Flag rows that duplicate an earlier row

print(df_dup)

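2) Removing duplicates

df2 = df.drop_duplicates()

The subset option restricts which columns are compared when removing duplicates
Rows are judged to be duplicates based only on the columns listed in subset

df3 = df.drop_duplicates(subset = ['c2', 'c3'])
# Judge duplicates using only columns c2 and c3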
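
Running the same five-row frame end to end shows which rows each call keeps. This is a self-contained sketch that reuses the sample data from this section.

```python
import pandas as pd

df = pd.DataFrame({"c1": ["a", "a", "b", "a", "b"],
                   "c2": [1, 1, 1, 2, 2],
                   "c3": [1, 1, 2, 2, 2]})

# True for every row that repeats an earlier row in all columns.
dup_mask = df.duplicated()

# Drop full-row duplicates: only row 1 repeats row 0.
df2 = df.drop_duplicates()

# Compare on c2 and c3 only: rows 1 and 4 repeat rows 0 and 3.
df3 = df.drop_duplicates(subset=["c2", "c3"])

print(dup_mask.tolist())
print(df2.index.tolist())
print(df3.index.tolist())
```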

15. Handling Missing Data

Tue, 20 Jul 2021 12:11:37 GMT

**Missing data is not always marked as NaN**
**Such values must be replaced with NaN**

df.replace('?', np.nan, inplace = True)

1) Checking for missing data

import seaborn as sns

df = sns.load_dataset('titanic')
df.info()

The age, embarked, deck, and embark_town columns contain NULL values

nan_deck = df['deck'].value_counts(dropna = False)
# dropna = False is required to include missing values in the counts

print(nan_deck)

print(df.head().isnull())
# Check for NaN values in the first 5 rows
# isnull() returns True for NaN values
# notnull() returns False for NaN values

print(df.isnull().sum(axis = 0))
# Number of NaN values per column

2) Removing missing data

import seaborn as sns

df = sns.load_dataset('titanic')

missing_df = df.isnull()
# Boolean DataFrame that is True wherever a value is NaN

for col in missing_df.columns:
    missing_count = missing_df[col].value_counts()
    # Count the NaN and non-NaN values in each column
    try:
        print(col, ': ', missing_count[True])
        # Print the count if the column has NaN values
    except KeyError:
        print(col, ': ', 0)
        # Print 0 if the column has no NaN values

df_thresh = df.dropna(axis =1, thresh = 500)
# Keep only the columns with at least 500 non-NaN values
# (every column with fewer than 500 valid values is dropped)

print(df_thresh.columns)

df_age = df.dropna(subset = ['age'], how = 'any', axis = 0)
# Drop rows where age is NaN
# how = 'any' drops a row if any of the checked values is NaN
# how = 'all' drops a row only if all of the checked values are NaN
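
Because `thresh` counts valid values rather than missing ones, it is easy to read backwards. A minimal sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column b is mostly missing, column c has one gap.
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [1, np.nan, np.nan, np.nan],
                   "c": [1, 2, np.nan, 4]})

# NaN count per column.
nan_counts = df.isnull().sum(axis=0)

# thresh=3 keeps columns with at least 3 non-NaN values, so b (1 valid) goes.
kept = df.dropna(axis=1, thresh=3)

# Drop the rows where column c is NaN.
rows_kept = df.dropna(subset=["c"], how="any", axis=0)

print(nan_counts.to_dict())
print(list(kept.columns))
print(rows_kept.index.tolist())
```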

3) Replacing missing data - mean version

import seaborn as sns

df = sns.load_dataset('titanic')

mean_age = df['age'].mean(axis = 0)
# Mean of the age column (NaN excluded)

df['age'] = df['age'].fillna(mean_age)
# Replace NaN values in the age column with the mean

4) Replacing missing data - most frequent value version

import seaborn as sns

df = sns.load_dataset('titanic')

most_freq = df['embark_town'].value_counts(dropna = True).idxmax()
# Find the most frequent value in the embark_town column

df['embark_town'] = df['embark_town'].fillna(most_freq)
# Replace NaN values in the embark_town column with the most frequent value

5) Replacing missing data - neighbor version

import seaborn as sns

df = sns.load_dataset('titanic')

df['embark_town'] = df['embark_town'].ffill()
# Replace NaN values in the embark_town column with the value immediately before
# Use bfill() to replace NaN values with the value immediately after
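
The three replacement strategies give different results on the same gap. Here is a minimal sketch with hypothetical values, written with assignment and the `ffill()` method call instead of the `inplace` and `method=` arguments.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with one missing value each.
age = pd.Series([20.0, np.nan, 40.0])
town = pd.Series(["S", "C", None, "S"])

# Mean: NaN is skipped when computing the mean, then filled with it.
age_mean = age.fillna(age.mean())

# Most frequent value: value_counts ignores NaN by default.
most_freq = town.value_counts().idxmax()
town_mode = town.fillna(most_freq)

# Neighbor: carry the previous valid value forward.
town_ffill = town.ffill()

print(age_mean.tolist())
print(town_mode.tolist())
print(town_ffill.tolist())
```

Forward fill depends on row order, so it only makes sense when neighboring rows are related, for example in time-ordered data.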

14. Folium Library - Working with Maps

Mon, 19 Jul 2021 03:23:30 GMT

Installing Folium
Open the Anaconda Prompt and enter conda install -c conda-forge folium

1) Creating a map

import folium

seoul_map = folium.Map(location = [37.55, 126.98], zoom_start = 12)
# Given a latitude and longitude, the map is centered on that point
# zoom_start controls the zoom level

seoul_map.save('./seoul.html')
# Save the map

2) Applying a map style

import folium

seoul_map = folium.Map(location = [37.55, 126.98], tiles = 'Stamen Terrain', zoom_start = 12)
# Given a latitude and longitude, the map is centered on that point
# zoom_start controls the zoom level

seoul_map.save('./seoul2.html')
# Save the map

3) Displaying markers on the map

import pandas as pd
import folium

df = pd.read_excel('./서울지역 대학교 위치.xlsx')

seoul_map = folium.Map(location = [37.55, 126.98], tiles = 'Stamen Terrain', zoom_start = 12)

for name, lat, lng in zip(df.index, df.위도, df.경도):
    folium.Marker([lat, lng], popup = name).add_to(seoul_map)
# Mark each university's location with a Marker

seoul_map.save('./seoul_colleges.html')

import pandas as pd
import folium

df = pd.read_excel('./서울지역 대학교 위치.xlsx')

seoul_map = folium.Map(location = [37.55, 126.98], tiles = 'Stamen Terrain', zoom_start = 12)

for name, lat, lng in zip(df.index, df.위도, df.경도):
    folium.CircleMarker([lat, lng],
                        radius = 10,
                        color = 'brown',
                        fill = True,
                        fill_color = 'coral',
                        fill_opacity = 0.7,
                        popup = name).add_to(seoul_map)
# Mark each university's location with a CircleMarker
# popup = name shows the given value when the marker is clicked

seoul_map.save('./seoul_colleges.html')

+) Adjusting the popup box size

import pandas as pd
import folium

df = pd.read_excel('./서울지역 대학교 위치.xlsx')

seoul_map = folium.Map(location = [37.55, 126.98], tiles = 'Stamen Terrain', zoom_start = 12)

for name, lat, lng in zip(df.name, df.위도, df.경도):
    iframe = folium.IFrame(name)
    popup = folium.Popup(iframe, min_width=150, max_width=150)
    # Set the popup size

    folium.CircleMarker([lat, lng],
                        radius = 10,
                        color = 'brown',
                        fill = True,
                        fill_color = 'coral',
                        fill_opacity = 0.7,
                        popup = popup).add_to(seoul_map)
# Mark each university's location with a CircleMarker
# popup = popup shows the fixed-width popup built from name when the marker is clicked

seoul_map.save('./seoul_colleges.html')

4) Displaying a choropleth map over map regions

import pandas as pd
import folium
import json

file_path = './경기도인구데이터.xlsx'
df = pd.read_excel(file_path, index_col = '구분')
df.columns = df.columns.map(str)

geo_path = './경기도행정구역경계.json'
try:
    geo_data = json.load(open(geo_path, encoding = 'utf-8'))
except:
    geo_data = json.load(open(geo_path, encoding = 'utf-8-sig'))

g_map = folium.Map(location = [37.5502, 126.982],
                  tiles = 'Stamen Terrain', zoom_start = 9)

year = '2007'
# Select the year to display

folium.Choropleth(geo_data = geo_data, # region boundaries
                 data = df[year], # data to display
                 columns = [df.index, df[year]], # columns to use
                 fill_color = 'YlOrRd', fill_opacity = 0.7, line_opacity = 0.3,
                 threshold_scale = [10000, 100000, 300000, 500000, 700000],
                 key_on = 'feature.properties.name',).add_to(g_map)

g_map.save('./gyeonggi_population_'+year+'.html')
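
Two details of the boundary file handling can be checked with the standard library alone. First, the `utf-8-sig` codec reads files both with and without a byte order mark, so it can serve as a single encoding choice. Second, `key_on = 'feature.properties.name'` points at the `name` property of each GeoJSON feature, which is matched against the data's region names. The sketch below uses a tiny hypothetical GeoJSON document; the region names and the file name are placeholders, not the contents of the actual boundary file.

```python
import json
import os
import tempfile

# Hypothetical, minimal GeoJSON with two regions and no geometry.
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "Suwon"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Seongnam"}, "geometry": None},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "boundaries.json")

    # Writing with utf-8-sig puts a byte order mark at the start of the file.
    with open(path, "w", encoding="utf-8-sig") as f:
        json.dump(geo, f, ensure_ascii=False)

    # Reading with utf-8-sig strips the mark if it is present.
    with open(path, encoding="utf-8-sig") as f:
        geo_data = json.load(f)

# The values that 'feature.properties.name' refers to.
names = [feature["properties"]["name"] for feature in geo_data["features"]]
print(names)
```

Using `with` blocks also closes the file handles, which the bare `open()` calls do not do explicitly.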

+) 2017 version

13. Seaborn Library - Advanced Plotting Tools

Mon, 19 Jul 2021 02:15:30 GMT

1) Scatter plot with a regression line

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')
# Use the titanic dataset that ships with Seaborn

sns.set_style('darkgrid')
# Set the style theme

fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
# Create the figure and axes objects

sns.regplot(x = 'age',
           y = 'fare',
           data = titanic,
           ax = ax1)
# Draw the plot with a linear regression line

sns.regplot(x = 'age',
           y = 'fare',
           data = titanic,
           ax = ax2,
           fit_reg = False)
# fit_reg = False hides the linear regression line

plt.show()

2) Histogram / kernel density plot

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('darkgrid')
# Set the style theme

fig = plt.figure(figsize = (15, 5))
ax1 = fig.add_subplot(1,3,1)
ax2 = fig.add_subplot(1,3,2)
ax3 = fig.add_subplot(1,3,3)

sns.distplot(titanic['fare'], ax = ax1)
# Default

sns.distplot(titanic['fare'], hist = False, ax = ax2)
# hist = False (the histogram is not shown)

sns.distplot(titanic['fare'], kde = False, ax = ax3)
# kde = False (the kernel density curve is not shown)

ax1.set_title('titanic fare - hist/kde')
ax2.set_title('titanic fare - kde')
ax3.set_title('titanic fare - hist')

plt.show()

3) Heatmap

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

table = titanic.pivot_table(index = ['sex'], columns = ['class'], aggfunc = 'size')
# Use a pivot table to arrange the categorical variables with sex as rows and class as columns
# aggfunc = 'size' counts the number of rows in each group

sns.heatmap(table, # DataFrame
           annot = True, fmt = 'd', # show the values, integer format
           cmap = 'YlGnBu', # color map
           linewidth = .5, # width of the dividing lines
           cbar = False) # whether to show the color bar
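
The table passed to the heatmap is just a count of rows per (sex, class) pair. As a rough illustration of what that count means, the sketch below builds the same kind of table with `groupby(...).size().unstack()` on a hypothetical five-row frame; this is an equivalent way to count groups, not the call used above.

```python
import pandas as pd

# Hypothetical passengers.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "class": ["First", "First", "Third", "Third", "Third"],
})

# Count rows per (sex, class) group, then spread class across the columns.
table = df.groupby(["sex", "class"]).size().unstack(fill_value=0)

print(table)
```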

4) Scatter plots for categorical data

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('whitegrid')
# Set the style theme

fig = plt.figure(figsize = (15, 5))
ax1 = fig.add_subplot(1,3,1)
ax2 = fig.add_subplot(1,3,2)
ax3 = fig.add_subplot(1,3,3)

sns.stripplot(x = 'class',
             y = 'age',
             data = titanic,
             ax = ax1)
# Distribution of a discrete variable (points are not spread out, so they may overlap)

sns.swarmplot(x = 'class',
             y = 'age',
             data = titanic,
             ax = ax2)
# Distribution of a discrete variable (points are spread out so they do not overlap)

sns.swarmplot(x = 'class',
             y = 'age',
             data = titanic,
             ax = ax3,
             hue = 'sex')
# Distribution of a discrete variable (points are spread out so they do not overlap)
# The male/female values of the sex column are shown in different colors

plt.show()

5) Bar plot

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('whitegrid')
# Set the style theme

fig = plt.figure(figsize = (15, 5))
ax1 = fig.add_subplot(1,3,1)
ax2 = fig.add_subplot(1,3,2)
ax3 = fig.add_subplot(1,3,3)

sns.barplot(x = 'sex', y = 'survived', data = titanic, ax = ax1)
# Assign variables to the x and y axes
sns.barplot(x = 'sex', y = 'survived', hue = 'class', data = titanic, ax = ax2)
# Assign variables to the x and y axes / add the hue option (a different color per class)
sns.barplot(x = 'sex', y = 'survived', hue = 'class', dodge = False, data = titanic, ax = ax3)
# Assign variables to the x and y axes / add the hue option (a different color per class) / bars drawn at the same position (dodge = False)

6) Count plot

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('whitegrid')
# Set the style theme

fig = plt.figure(figsize = (15, 5))
ax1 = fig.add_subplot(1,3,1)
ax2 = fig.add_subplot(1,3,2)
ax3 = fig.add_subplot(1,3,3)

sns.countplot(x = 'class', palette = 'Set1', data = titanic, ax = ax1)
# Default
# The colors can be changed by choosing a different palette

sns.countplot(x = 'class', hue = 'who', palette = 'Set2', data = titanic, ax = ax2)
# A different color for each who value

sns.countplot(x = 'class', hue = 'who', palette = 'Set3', dodge = False, data = titanic, ax = ax3)
# A different color for each who value
# Bars drawn at the same position (dodge = False)

plt.show()

7) Box plot / violin plot

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('whitegrid')
# Set the style theme

fig = plt.figure(figsize = (15, 10))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

sns.boxplot(x = 'alive', y = 'age', data = titanic, ax = ax1)
# Box plot (default)

sns.boxplot(x = 'alive', y = 'age', hue = 'sex', data = titanic, ax = ax2)
# Box plot split by sex

sns.violinplot(x = 'alive', y = 'age', data = titanic, ax = ax3)
# Violin plot (default)

sns.violinplot(x = 'alive', y = 'age', hue = 'sex', data = titanic, ax = ax4)
# Violin plot split by sex

plt.show()

8) Joint plot

A joint plot shows a scatter plot by default, together with a histogram of each variable along the x and y axes
: Good for seeing at a glance both the relationship between two variables and how spread out the data is

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('whitegrid')
# Set the style theme

j1 = sns.jointplot(x = 'fare', y = 'age', data = titanic)
# Scatter plot
j2 = sns.jointplot(x = 'fare', y = 'age', kind = 'reg', data = titanic)
# Regression line
j3 = sns.jointplot(x = 'fare', y = 'age', kind = 'hex', data = titanic)
# Hexbin plot
j4 = sns.jointplot(x = 'fare', y = 'age', kind = 'kde', data = titanic)
# Kernel density plot

j1.fig.suptitle('titanic fare - scatter', size = 15)
j2.fig.suptitle('titanic fare - reg', size = 15)
j3.fig.suptitle('titanic fare - hex', size = 15)
j4.fig.suptitle('titanic fare - kde', size = 15)

plt.show()

9) Splitting the figure into a grid by condition

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('whitegrid')
# Set the style theme

g = sns.FacetGrid(data = titanic, col = 'who', row = 'survived')
# Split the grid by condition
# Columns follow the who column, rows follow the survived column

g = g.map(plt.hist, 'age')
# Apply the plot
# Draw a histogram of the age column for the passengers matching each condition

10) Distribution of two-variable data

The pairplot() function plots every pairwise combination of the columns of the DataFrame passed to it

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

sns.set_style('whitegrid')
# Set the style theme

titanic_pair = titanic[['age', 'pclass', 'fare']]
# Select 3 variables
g = sns.pairplot(titanic_pair)
# Draw the pairwise plots
# With 3 variables, a grid of 3*3 = 9 panels is created
