<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>hellobrown_11.log</title>
        <link>https://velog.io/</link>
        <description>냠냠 보드람 치킨</description>
        <lastBuildDate>Thu, 06 Apr 2023 01:23:17 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>hellobrown_11.log</title>
            <url>https://velog.velcdn.com/images/hellobrown_11/profile/f2b615d3-1e6c-44a6-b490-56125eef0475/social_profile.png</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. hellobrown_11.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/hellobrown_11" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[Python AI fundamentals]]></title>
            <link>https://velog.io/@hellobrown_11/python-AI-%EA%B8%B0%EC%B4%88%EC%A7%80%EC%8B%9D-%EC%9A%94%EC%95%BD-%EC%A0%95%EB%A6%AC</link>
            <guid>https://velog.io/@hellobrown_11/python-AI-%EA%B8%B0%EC%B4%88%EC%A7%80%EC%8B%9D-%EC%9A%94%EC%95%BD-%EC%A0%95%EB%A6%AC</guid>
            <pubDate>Thu, 06 Apr 2023 01:23:17 GMT</pubDate>
            <description><![CDATA[<h1 id="ai-python">AI Python</h1>
<blockquote>
<p>Much of this was typed by hand, so there may be typos.</p>
</blockquote>
<h2 id="데이터-불러오기--분석">Loading &amp; Analyzing Data</h2>
<h3 id="데이터-불러오기">Loading data</h3>
<ol>
<li>import</li>
</ol>
<pre><code class="language-python">import pandas as pd
import numpy as np
import matplotlib.pyplot as plt</code></pre>
<ol start="2">
<li>Reading files</li>
</ol>
<pre><code class="language-python">csv : pd.read_csv(&quot;filename.csv&quot;)
txt : pd.read_csv(&quot;filename.txt&quot;, sep=&quot;delimiter&quot;)
xlsx : pd.read_excel(&#39;filename.xlsx&#39;)
      # e.g. pd.read_excel(filename, engine=&#39;openpyxl&#39;) # pip install openpyxl
pickle : pd.read_pickle(&quot;filename.pkl&quot;)
# Saving
csv : df.to_csv(&#39;filename.csv&#39;, index=False)</code></pre>
<ol start="3">
<li><p>Merging data</p>
<ol>
<li><p>Appending dataframes that share the same columns (row-wise)</p>
<p><code>pd.concat()</code></p>
<pre><code class="language-python">df = pd.read_csv(&quot;onenavi_train.csv&quot;,sep=&quot;|&quot;)
df_eval = pd.read_csv(&quot;onenavi_evaluation.csv&quot;,sep=&quot;|&quot;)
df_total=pd.concat([df,df_eval],ignore_index=True)</code></pre>
</li>
<li><p>Merging on a key</p>
<p><code>pd.merge()</code></p>
<pre><code class="language-python">df_total=pd.merge(df_total,df_signal , on=&quot;RID&quot;)</code></pre>
</li>
</ol>
</li>
</ol>
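<p>As a minimal sketch of the two styles above, with tiny toy frames standing in for the onenavi CSVs (the values and the <code>signal</code> column here are illustrative):</p>

```python
import pandas as pd

# Toy frames standing in for the onenavi CSVs (hypothetical values)
df = pd.DataFrame({"RID": [1, 2], "ET": [100, 200]})
df_eval = pd.DataFrame({"RID": [3], "ET": [300]})

# Row-wise: frames share the same columns, so stack them with a fresh index
df_total = pd.concat([df, df_eval], ignore_index=True)

# Column-wise: join a second frame on the shared key column RID
df_signal = pd.DataFrame({"RID": [1, 2, 3], "signal": [5, 7, 9]})
df_total = pd.merge(df_total, df_signal, on="RID")
```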
<h3 id="분석">Analysis</h3>
<h4 id="시각화-기본">Visualization basics</h4>
<ol>
<li><p>Basic plotting workflow</p>
<ol>
<li><p>Set the figure size</p>
<p><code>plt.figure()</code> </p>
</li>
<li><p>Create the plot</p>
<p><code>plt.plot()</code></p>
</li>
<li><p>Display it</p>
<p><code>plt.show()</code></p>
</li>
</ol>
</li>
<li><p>Using a Korean font</p>
<pre><code class="language-python">import matplotlib.font_manager as fm
fm.findSystemFonts(fontpaths=None, fontext=&#39;ttf&#39;)
# Set a font found above as the default; here, NanumGothicCoding
plt.rc(&#39;font&#39;, family=&#39;NanumGothicCoding&#39;)
plt.rc(&#39;axes&#39;, unicode_minus=False) # keep the minus sign from breaking</code></pre>
</li>
</ol>
<h4 id="seaborn">Seaborn</h4>
<ol>
<li><p>import</p>
<pre><code class="language-python">!pip install seaborn

import seaborn as sns
import matplotlib.pyplot as plt</code></pre>
</li>
<li><p>Reference links</p>
<ul>
<li>Seaborn(<a href="https://seaborn.pydata.org/api.html">https://seaborn.pydata.org/api.html</a>)</li>
<li>Seaborn.CountChart(<a href="https://seaborn.pydata.org/generated/seaborn.countplot.html">https://seaborn.pydata.org/generated/seaborn.countplot.html</a>)</li>
<li>Seaborn.Distplot(<a href="https://seaborn.pydata.org/generated/seaborn.distplot.html?highlight=distplot#seaborn.distplot">https://seaborn.pydata.org/generated/seaborn.distplot.html?highlight=distplot#seaborn.distplot</a>) : histogram + kernel density</li>
<li>Seaborn.Boxplot(<a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot">https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot</a>)</li>
<li>Seaborn.Heatmap(<a href="https://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap">https://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap</a>)</li>
<li>Seaborn.Pairplot(<a href="https://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot">https://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot</a>) : pairwise histograms + scatter plots</li>
</ul>
</li>
<li><p>Setup before using sns</p>
<pre><code class="language-python"># Print the list of installed fonts
import matplotlib.font_manager as fm

fm.get_fontconfig_fonts()

font_list = [font.name for font in fm.fontManager.ttflist]
# font_list
sns.set(font=&quot;NanumGothicCoding&quot;, 
        rc={&quot;axes.unicode_minus&quot;:False}, # fix broken minus signs
        style=&#39;darkgrid&#39;)</code></pre>
</li>
<li><p>Drawing charts</p>
<ol>
<li>Scatter plot: <code>sns.scatterplot()</code></li>
<li>Categorical distributions: <code>sns.catplot()</code></li>
<li>Scatter plot with a regression line: <code>sns.lmplot()</code></li>
<li>Counts per category: <code>sns.countplot()</code><ol>
<li>Colors can be changed with <code>palette=&#39;spring&#39;</code></li>
<li>Example: <code>sns.countplot(data=df, x=&#39;MultipleLines&#39;,  hue=&#39;Churn&#39;)</code></li>
</ol>
</li>
<li>Scatter plot plus marginal distributions in one chart: <code>sns.jointplot()</code></li>
<li>Correlation check (requires continuous data): <code>sns.heatmap()</code><ol>
<li><code>plt.rc(&#39;axes&#39;, unicode_minus=False)</code></li>
</ol>
</li>
<li>Numeric summary chart showing the five-number summary: <code>sns.boxplot()</code></li>
<li>Also shows density: <code>sns.violinplot()</code></li>
<li>Histogram: <code>sns.histplot()</code><ol>
<li>Example: <code>sns.histplot(data=df, x=&#39;tenure&#39;, hue=&#39;Churn&#39;)</code></li>
<li>Alternative: <code>sns.kdeplot(data=df, x=&#39;tenure&#39;, hue=&#39;Churn&#39;)</code><ol>
<li>histplot bars can hide the bars behind them; kdeplot avoids that.</li>
</ol>
</li>
</ol>
</li>
<li>Pairwise distributions for the whole frame: <code>sns.pairplot(df_total)</code></li>
</ol>
</li>
</ol>
<h4 id="matplotlib">matplotlib</h4>
<ol>
<li>Theory</li>
</ol>
<pre><code class="language-python">plt.plot(data)
    # trend line; data carries both the x- and y-axis values
plt.scatter(x,y)
    # scatter plot
plt.hist(x)
    # good for frequencies, frequency densities, and probability distributions
plt.boxplot(x)
    # numeric summary of the data
    # five-number summary: minimum, Q1, Q2 (median), Q3, maximum
    # (looks a bit like a stock chart)
plt.bar(x,height)
    # summarizes numeric values per category</code></pre>
<ol>
<li>Examples</li>
</ol>
<pre><code class="language-python">plt.scatter(y=df[&quot;avg_bill&quot;], x=df[&quot;age&quot;])

plt.hist(df[&quot;A_bill&quot;], bins=20)

x = [5,3,7,10,9,5,3.5,8,6]
plt.boxplot(x=x)

df.boxplot(by=&quot;by_age&quot;, column=&quot;avg_bill&quot;, figsize=(16,7))

y=[5, 3, 7, 10, 9, 5, 3.5, 8]
x=list(range(len(y)))
plt.bar(x, y)

df2[[&#39;A_bill&#39;, &#39;B_bill&#39;]].plot(kind=&#39;bar&#39;, stacked=True)

df[&#39;Churn&#39;].value_counts().plot(kind=&#39;bar&#39;)</code></pre>
<h4 id="상관관계pandas-seaborn">Correlation (pandas, seaborn)</h4>
<ol>
<li><p>Correlation coefficients with pandas</p>
<p><code>df_total.corr()</code></p>
</li>
<li><p>Heatmap visualization with seaborn</p>
<pre><code class="language-python">sns.heatmap(df_total.corr(), annot=True, cmap=&quot;RdBu&quot;)
plt.show()</code></pre>
</li>
</ol>
<h3 id="추가-시각화-라이브러리">Additional visualization libraries</h3>
<h4 id="folium">Folium</h4>
<ul>
<li>import<ul>
<li><code>import folium</code></li>
</ul>
</li>
<li>Drawing a basic map<ul>
<li><code>map = folium.Map(location=[f_lat,f_lon], zoom_start=14)</code></li>
</ul>
</li>
<li>Drawing a map with a different tile style<ul>
<li><code>map_ST = folium.Map(location=[f_lat,f_lon], zoom_start=14, tiles=&#39;Stamen Terrain&#39;)</code></li>
</ul>
</li>
</ul>
<h5 id="지도-위-heatmap">Heatmap over a map</h5>
<ul>
<li><pre><code class="language-python">from folium.plugins import HeatMap
heat_data=np.array([ansan_map_1[&#39;lat&#39;],ansan_map_1[&#39;lon&#39;]])
heat_data=heat_data.transpose()

map = folium.Map(location=[f_lat,f_lon], zoom_start=13)
HeatMap(heat_data,min_opacity=0.2,max_val=1,max_zoom=25,radius=25).add_to(map)</code></pre>
</li>
</ul>
<h2 id="전처리">Preprocessing</h2>
<h3 id="탐색">Exploration</h3>
<ol>
<li><code>df.info()</code><ol>
<li>Shows dtypes, non-null counts, and more</li>
</ol>
</li>
<li><code>df.describe()</code><ol>
<li>Shows summary statistics</li>
</ol>
</li>
<li><code>df = df.rename(columns={...})</code><ol>
<li>Clean up column names by passing a dict of old-to-new names</li>
</ol>
</li>
<li><code>df = df.astype({&#39;column_name&#39;:float})</code><ol>
<li>Change the dtype</li>
<li>Int dtypes do not support NaN, so Python works with floats here</li>
</ol>
</li>
<li><code>df.select_dtypes(&#39;O&#39;)</code><ol>
<li>Extracts the columns whose Dtype is Object (filtered column-wise)</li>
<li><code>list(df_total.select_dtypes(&#39;O&#39;).columns)</code></li>
</ol>
</li>
<li><code>df[&#39;TotalCharges&#39;].str.contains(&quot;[^0-9.]&quot;)</code><ol>
<li>A way to check whether a string holds only digits (and dots)</li>
</ol>
</li>
</ol>
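<p>The exploration calls above can be tried on a tiny frame; this is a minimal sketch, and the column names simply echo the telco-style examples in the text:</p>

```python
import pandas as pd
import numpy as np

# Tiny frame with a blank string hiding in a numeric column
df = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " "], "tenure": [1, 34, 2]})

# Object (string) columns, and rows containing non-numeric characters
obj_cols = list(df.select_dtypes("O").columns)
bad = df["TotalCharges"].str.contains("[^0-9.]")

# Replace the blank with NaN, cast to float (NaN rules out int), then rename
df["TotalCharges"] = df["TotalCharges"].replace(" ", np.nan)
df = df.astype({"TotalCharges": float})
df = df.rename(columns={"TotalCharges": "total"})
```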
<h3 id="이상치결측치">Outliers/Missing values</h3>
<h4 id="결측치">Missing values</h4>
<ol>
<li>Finding missing values<ol>
<li><code>df_total.isnull().sum()</code></li>
</ol>
</li>
<li>Dropping missing values<ol>
<li><code>df.dropna()</code></li>
<li><code>df = df.dropna(axis=0)</code></li>
</ol>
</li>
<li>Filling missing values<ol>
<li><code>df.fillna(0)</code></li>
<li><code>df.fillna(method=&#39;pad/ffill&#39;)</code><ol>
<li><code>backfill/bfill</code>: fill with the value right after</li>
<li><code>pad/ffill</code>: fill with the value right before</li>
</ol>
</li>
<li><code>df[&#39;age&#39;] = df[&#39;age&#39;].replace(np.nan, df[&#39;age&#39;].median())</code><ol>
<li>Using replace<ol>
<li>The example above fills with the median</li>
</ol>
</li>
</ol>
</li>
<li><code>df.interpolate()</code><ol>
<li>Interpolation: fills gaps gradually, like a gradient</li>
<li>Given <code>1-NaN-NaN-7</code>, interpolation yields <code>1-3-5-7</code></li>
</ol>
</li>
</ol>
</li>
</ol>
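<p>A minimal sketch of the filling strategies above on a <code>1-NaN-NaN-7</code> series:</p>

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 7.0])

filled_zero = s.fillna(0)                    # constant fill
filled_ffill = s.ffill()                     # fill with the value right before
interp = s.interpolate()                     # gradient-like fill: 1-3-5-7
median_fill = s.replace(np.nan, s.median())  # median of [1, 7] is 4.0
```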
<h4 id="이상치">Outliers</h4>
<ul>
<li><p>When handling outliers, check the correlations between values first.</p>
</li>
<li><p>Checking correlations</p>
<ul>
<li><p>There is no fixed recipe; the steps below are a sensible default</p>
</li>
<li><pre><code class="language-python">import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df_total)
plt.show()</code></pre>
<ul>
<li>After running this, delete the outlier values</li>
</ul>
</li>
<li><pre><code class="language-python"># Check the data distribution
df_total.describe()</code></pre>
<ul>
<li>Review min/max to spot outliers</li>
</ul>
</li>
</ul>
</li>
<li><p>Delete data that is implausible given the meaning of the values</p>
<ul>
<li><p>Example</p>
<ul>
<li><pre><code class="language-python"># Build an average-speed column: speed is distance divided by time
df_total[&#39;PerHour&#39;]=(df_total[&#39;A_DISTANCE&#39;]/1000)/(df_total[&#39;ET&#39;]/3600)</code></pre>
</li>
<li><p>After computing the speed, remove rows whose speed is implausibly high</p>
<ul>
<li><pre><code class="language-python"># Keep only the data left after removing outliers
df_total=df_total[df_total[&#39;PerHour&#39;]&lt;=130]
df_total</code></pre>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
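<p>The speed-based outlier filter above, sketched on hypothetical toy trips (distances in metres, times in seconds; the values are made up):</p>

```python
import pandas as pd

df_total = pd.DataFrame({"A_DISTANCE": [5000, 2000, 10000],  # metres
                         "ET": [600, 20, 900]})              # seconds

# Average speed in km/h; the 2000 m / 20 s row works out to 360 km/h
df_total["PerHour"] = (df_total["A_DISTANCE"] / 1000) / (df_total["ET"] / 3600)

# Keep only plausible rows (at or below 130 km/h)
df_total = df_total[df_total["PerHour"] <= 130]
```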
<h4 id="중복-제거">Removing duplicates</h4>
<ul>
<li>Removing duplicates<ul>
<li><code>df.drop_duplicates()</code></li>
</ul>
</li>
</ul>
<h3 id="feature-engineering">Feature Engineering</h3>
<ul>
<li>The process of engineering features from the raw data to build the model input data</li>
</ul>
<h4 id="binning">Binning</h4>
<ol>
<li>cut<ol>
<li>Binning by value ranges</li>
<li><code>pd.cut()</code></li>
</ol>
</li>
<li>qcut<ol>
<li>Binning by equal counts per bin</li>
<li><code>pd.qcut()</code></li>
</ol>
</li>
</ol>
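<p>A quick sketch of the two binning calls with a hypothetical age series:</p>

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35, 45, 55])

# cut: fixed value ranges (here, hypothetical 20-year bins)
by_range = pd.cut(ages, bins=[0, 20, 40, 60], labels=["young", "mid", "old"])

# qcut: equal-count bins; each bin receives the same number of rows
by_count = pd.qcut(ages, q=2, labels=["low", "high"])
```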
<h4 id="scaling">Scaling</h4>
<h5 id="standardization">Standardization</h5>
<ul>
<li>Used when transforming a normal distribution into the standard normal distribution with mean 0 and variance 1</li>
<li>The situation can be checked with <code>object.describe()</code></li>
<li><code>Standardization_df = (cust_data_num - cust_data_num.mean())/cust_data_num.std()</code></li>
</ul>
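<p>The standardization formula above, checked on a toy numeric frame:</p>

```python
import pandas as pd

cust_data_num = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

# After standardizing, each column has mean 0 and (sample) std 1
Standardization_df = (cust_data_num - cust_data_num.mean()) / cust_data_num.std()
```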
<h5 id="normalizaiont">Normalization</h5>
<ul>
<li><p>Use <code>MinMaxScaler</code></p>
</li>
<li><pre><code class="language-python"># Scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
feature = pd.DataFrame(scaler.fit_transform(train_data))
feature.columns=columnNames</code></pre>
</li>
</ul>
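<p>The MinMaxScaler snippet above, run end to end on a toy frame (the column names are illustrative):</p>

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_data = pd.DataFrame({"a": [10.0, 20.0, 30.0], "b": [1.0, 5.0, 9.0]})

scaler = MinMaxScaler(feature_range=(0, 1))
feature = pd.DataFrame(scaler.fit_transform(train_data),
                       columns=train_data.columns)  # every column now spans [0, 1]
```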
<h4 id="encoding">Encoding</h4>
<h5 id="one-hot-encodingdummies">One-Hot Encoding(dummies)</h5>
<ul>
<li><p>A technique that encodes categories as True/False columns</p>
</li>
<li><pre><code class="language-python">dummy_fields = [&#39;WEEKDAY&#39;,&#39;HOUR&#39;,&#39;level1_pnu&#39;,&#39;level2_pnu&#39;]

for dummy in dummy_fields:
    dummies = pd.get_dummies(df_total[dummy], prefix=dummy, drop_first=False)
    df_total = pd.concat([df_total, dummies], axis=1)

df_total = df_total.drop(dummy_fields,axis=1)</code></pre>
<ul>
<li>Extract the dummies with <code>pd.get_dummies()</code>, append them with <code>pd.concat()</code>, then drop the original columns with <code>drop</code></li>
</ul>
</li>
<li><p><code>from sklearn.preprocessing import OneHotEncoder</code></p>
</li>
</ul>
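<p>The get_dummies loop above, on a toy frame with one categorical field:</p>

```python
import pandas as pd

df_total = pd.DataFrame({"WEEKDAY": ["Mon", "Tue", "Mon"], "ET": [1, 2, 3]})
dummy_fields = ["WEEKDAY"]

for dummy in dummy_fields:
    dummies = pd.get_dummies(df_total[dummy], prefix=dummy, drop_first=False)
    df_total = pd.concat([df_total, dummies], axis=1)

# Drop the original categorical columns once the dummies are appended
df_total = df_total.drop(dummy_fields, axis=1)
```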
<h5 id="label-encoding">Label Encoding</h5>
<ul>
<li><p>Difference from one-hot</p>
<ul>
<li>Label encoding maps the values <code>Korean, English, Math, Science, Social Studies</code> of a <code>subject</code> column to <code>0,1,2,3,4</code></li>
<li>The result is numeric, but it must never be averaged or used as a median; convert it with one-hot encoding instead,</li>
<li>i.e. into separate columns for subject 0, subject 1, subject 2, and so on</li>
</ul>
</li>
<li><p>How to use it</p>
<ul>
<li>Label encoding alone invites the misinterpretation described above</li>
<li><strong>One-hot encoding is required for linear algorithms; tree-based algorithms work with label encoding alone</strong></li>
</ul>
</li>
<li><pre><code>from sklearn.preprocessing import LabelEncoder</code></pre></li>
</ul>
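<p>A minimal LabelEncoder round trip for the subject example above (subject names translated; classes are sorted alphabetically by sklearn):</p>

```python
from sklearn.preprocessing import LabelEncoder

subjects = ["Korean", "English", "Math", "Science", "Korean"]

le = LabelEncoder()
codes = le.fit_transform(subjects)     # sorted classes, so English -> 0
decoded = le.inverse_transform(codes)  # back to the original strings
```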
<h2 id="머신러닝">Machine Learning</h2>
<h3 id="ml-종류별-예시---회귀">ML examples by type - Regression</h3>
<h4 id="linear-regression">Linear Regression</h4>
<ul>
<li><pre><code class="language-python">from sklearn.linear_model import LinearRegression
model = LinearRegression()
# the fit/predict lines below are reused across ML models
model.fit(X_train, y_train)
pred = model.predict(X_test)</code></pre>
</li>
</ul>
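<p>The fit/predict pattern above, on a tiny synthetic dataset (a hypothetical noiseless target y = 2x + 1, so the fit is exact):</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = 2.0 * X_train.ravel() + 1.0    # noiseless line

model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(np.array([[5.0]]))  # should land near 2*5 + 1 = 11
```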
<h3 id="ml-종류별-예시---분류">ML examples by type - Classification</h3>
<h4 id="logistic-regression">Logistic Regression</h4>
<ul>
<li><p>Binary classification uses two classes, 0 and 1, so a linear regression model is hard to apply to it directly</p>
</li>
<li><p>Applying the logistic function bends the fit into a logistic curve, making binary classification possible</p>
</li>
<li><pre><code class="language-python">from sklearn.linear_model import LogisticRegression 
model = LogisticRegression()</code></pre>
</li>
</ul>
<h3 id="ml-종류별-예시---회귀분류">ML examples by type - Regression/Classification</h3>
<h4 id="decision-tree">Decision Tree</h4>
<ul>
<li><pre><code class="language-python">from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=2)</code></pre>
</li>
<li><pre><code class="language-python">from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()</code></pre>
</li>
</ul>
<h4 id="k-nearest-neighbor">K-Nearest Neighbor</h4>
<ul>
<li><pre><code class="language-python">from sklearn.neighbors import KNeighborsClassifier 
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred=knn.predict(X_test)</code></pre>
</li>
<li><pre><code class="language-python">from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors = 3, weights = &quot;distance&quot;)</code></pre>
</li>
</ul>
<h3 id="ml-종류별-예시---앙상블">ML examples by type - Ensembles</h3>
<h4 id="random-forest">Random Forest</h4>
<ul>
<li><pre><code class="language-python">from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=50)</code></pre>
</li>
</ul>
<h4 id="xgboost">XGboost</h4>
<ul>
<li><p>Install</p>
<ul>
<li><code>!pip install xgboost</code></li>
</ul>
</li>
<li><pre><code class="language-python">from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=3, random_state=42)  # takes about 10 seconds
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)</code></pre>
</li>
</ul>
<h4 id="stacking">Stacking</h4>
<ul>
<li><p>Combines the predictions of the individual models to compute the final output</p>
</li>
<li><pre><code class="language-python">from sklearn.ensemble import StackingRegressor, StackingClassifier
stack_models = [
    (&#39;LogisticRegression&#39;, lg), 
    (&#39;KNN&#39;, knn), 
    (&#39;DecisionTree&#39;, dt),
]
stacking = StackingClassifier(stack_models, final_estimator=rfc, n_jobs=-1)
stacking.fit(X_train, y_train)   # takes about 1 minute 20 seconds
stacking_pred = stacking.predict(X_test)</code></pre>
</li>
</ul>
<h4 id="weighted-blending">Weighted Blending</h4>
<ul>
<li><p>Multiplies each prediction by a weight and sums them for the final output</p>
</li>
<li><pre><code class="language-python"># There is no dedicated function; as below, collect each model&#39;s predict output and compute the blend directly

final_outputs = {
    &#39;DecisionTree&#39;: dt_pred, 
    &#39;randomforest&#39;: rfc_pred, 
    &#39;xgb&#39;: xgb_pred, 
    &#39;lgbm&#39;: lgbm_pred,
    &#39;stacking&#39;: stacking_pred,
}

final_prediction=\
final_outputs[&#39;DecisionTree&#39;] * 0.1\
+final_outputs[&#39;randomforest&#39;] * 0.2\
+final_outputs[&#39;xgb&#39;] * 0.25\
+final_outputs[&#39;lgbm&#39;] * 0.15\
+final_outputs[&#39;stacking&#39;] * 0.3

final_prediction = np.where(final_prediction &gt; 0.5, 1, 0)</code></pre>
</li>
</ul>
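<p>The weighting step can be sketched with plain NumPy arrays standing in for the per-model predicted probabilities (the values and weights below are hypothetical):</p>

```python
import numpy as np

# Hypothetical per-model probabilities for 4 samples
dt_pred = np.array([0.2, 0.6, 0.9, 0.4])
rfc_pred = np.array([0.1, 0.7, 0.8, 0.5])
xgb_pred = np.array([0.3, 0.8, 0.95, 0.45])

# Weights should sum to 1; threshold the blend at 0.5 for class labels
final_prediction = dt_pred * 0.2 + rfc_pred * 0.3 + xgb_pred * 0.5
final_prediction = np.where(final_prediction > 0.5, 1, 0)
```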
<h3 id="ml-수행-전후-추가-작업">Extra steps before/after running ML</h3>
<h4 id="학습훈련-데이터-나누기">Splitting training/test data</h4>
<pre><code class="language-python">from sklearn.model_selection import train_test_split
X = df1.drop(&#39;termination_Y&#39;, axis=1).values
y = df1[&#39;termination_Y&#39;].values
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    stratify=y,
                                                    random_state=42)</code></pre>
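<p>A self-contained sketch of the stratified split on toy data:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)              # 10 samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # perfectly balanced labels

# stratify=y keeps the class ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```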
<h4 id="모델-분류기-성능-평가score">Evaluating classifier performance (score)</h4>
<ul>
<li><p>score </p>
<ul>
<li><code>model.score(X_test, y_test)</code></li>
</ul>
</li>
<li><p>Confusion matrix</p>
<ul>
<li><pre><code class="language-python">pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix 
confusion_matrix(y_test, pred) </code></pre>
</li>
</ul>
</li>
<li><p>Other metric values</p>
<ul>
<li><pre><code class="language-python">from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy_score(y_test, pred)  
precision_score(y_test, pred) 
recall_score(y_test, pred)  
f1_score(y_test, pred) </code></pre>
</li>
<li><pre><code class="language-python">from sklearn.metrics import classification_report
print(classification_report(y_test, lg_pred))</code></pre>
</li>
</ul>
</li>
</ul>
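<p>The metrics above, evaluated on a tiny hand-made prediction vector:</p>

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_test = [0, 0, 1, 1]
pred = [0, 1, 1, 1]  # one false positive, no false negatives

cm = confusion_matrix(y_test, pred)  # rows = actual, columns = predicted
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred)
rec = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
```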
<h4 id="모델-저장">Saving a model</h4>
<pre><code class="language-python">import pickle
import joblib
model.fit(train_x, train_y)
joblib.dump(model, &#39;{}_model.pkl&#39;.format(i))</code></pre>
<h2 id="딥러닝">Deep Learning</h2>
<h3 id="기본-가이드">Basic guide</h3>
<h4 id="팁">Tips</h4>
<ol>
<li><p>Choosing <code>activation</code></p>
<ul>
<li><p>If the final output layer predicts a single label column with two possible values (binary classification), use <code>sigmoid</code></p>
</li>
<li><p>If the label has two or more columns, use <code>softmax</code></p>
</li>
</ul>
</li>
<li><p>Choosing <code>loss</code></p>
<ul>
<li><p>Output activation is <code>sigmoid</code>: <code>binary_crossentropy</code></p>
</li>
<li><p>Output activation is <code>softmax</code>:</p>
<ul>
<li>One-hot encoded labels: <code>categorical_crossentropy</code></li>
<li>Labels not one-hot encoded: <code>sparse_categorical_crossentropy</code></li>
</ul>
</li>
</ul>
</li>
<li><p>Setting <code>metrics</code> to <code>acc</code> or <code>accuracy</code> lets you monitor accuracy during training.</p>
</li>
</ol>
<h5 id="예시">Examples</h5>
<ul>
<li>Compiling a multi-class model (Y one-hot encoded): <br>
model.compile(optimizer=&#39;adam&#39;, loss=&#39;categorical_crossentropy&#39;, metrics=[&#39;accuracy&#39;])</li>
<li>Compiling a multi-class model (Y not one-hot encoded): <br>
model.compile(optimizer=&#39;adam&#39;, loss=&#39;sparse_categorical_crossentropy&#39;, metrics=[&#39;accuracy&#39;])</li>
<li>Compiling a regression model: <br>
model.compile(optimizer=&#39;adam&#39;, loss=&#39;mse&#39;)</li>
</ul>
<h3 id="딥러닝-종류별-소스">Code by deep-learning type</h3>
<h4 id="dnn">DNN</h4>
<ul>
<li><pre><code class="language-python">import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout</code></pre>
</li>
<li><pre><code class="language-python">model = Sequential()
model.add(Dense(4, input_shape=(3,), activation=&#39;relu&#39;))
# hidden layer
model.add(Dropout(0.2))
model.add(Dense(4, activation=&#39;relu&#39;))
model.add(Dense(1, activation=&#39;sigmoid&#39;))
# a single output value usually pairs with sigmoid
model.compile(loss=&#39;binary_crossentropy&#39;, optimizer=&#39;adam&#39;, metrics=[&#39;acc&#39;])
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), ...)</code></pre>
</li>
</ul>
<h4 id="cnn">CNN</h4>
<ul>
<li><pre><code class="language-python">from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(12, kernel_size=(5,5), activation=&#39;relu&#39;, input_shape=(120, 60,1)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(12, kernel_size=(5,5), activation=&#39;relu&#39;))    
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(12, kernel_size=(4,4), activation=&#39;relu&#39;))    
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(128, activation=&#39;relu&#39;))
model.add(Dense(4, activation=&#39;softmax&#39;))</code></pre>
</li>
</ul>
<h4 id="lstm">LSTM</h4>
<ul>
<li><pre><code class="language-python">tf.keras.layers.LSTM(64)
tf.keras.layers.LSTM(64, return_sequences=True)</code></pre>
</li>
</ul>
<h3 id="활용-예시">Usage examples</h3>
<h4 id="callback">callback</h4>
<pre><code class="language-python">from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
# monitor val_loss and stop early if it fails to improve for 5 epochs
early_stop = EarlyStopping(monitor=&#39;val_loss&#39;, mode=&#39;min&#39;, 
                           verbose=1, patience=5)
# save the model every time val_loss reaches a new low
check_point = ModelCheckpoint(&#39;best_model.h5&#39;, verbose=1, 
                              monitor=&#39;val_loss&#39;, mode=&#39;min&#39;, save_best_only=True)       
history = model.fit(x=X_train_over, y=y_train_over, 
          epochs=50 , batch_size=32,
          validation_data=(X_test, y_test), verbose=1,
          callbacks=[early_stop, check_point])</code></pre>
<ul>
<li><p>val_loss should keep falling; if it fails to fall for 5 consecutive epochs, training stops early.</p>
</li>
<li><p>Example 1</p>
<ul>
<li><code>Epoch 00002: val_loss improved from 0.43748 to 0.43266, saving model to best_model.h5</code><ul>
<li>At epoch 2, val_loss dropped, so the model is saved</li>
</ul>
</li>
</ul>
</li>
<li><p>Example 2</p>
<ul>
<li><pre><code>Epoch 00037: val_loss did not improve from 0.42811
Epoch 00037: early stopping</code></pre><ul>
<li>At epoch 37, val_loss stopped improving, so training ended</li>
</ul>
</li>
</ul>
</li>
</ul>
<h4 id="성능시각화">Visualizing performance</h4>
<pre><code class="language-python">plt.plot(history.history[&#39;accuracy&#39;])
plt.plot(history.history[&#39;val_accuracy&#39;])
plt.title(&#39;Accuracy&#39;)
plt.xlabel(&#39;Epochs&#39;)
plt.ylabel(&#39;Acc&#39;)
plt.legend([&#39;acc&#39;, &#39;val_acc&#39;])
plt.show()</code></pre>
<h4 id="feature-engineering-1">Feature Engineering</h4>
<ul>
<li><p>For improving performance</p>
</li>
<li><p>Balancing the imbalanced Churn data</p>
<ul>
<li><p>OverSampling</p>
</li>
<li><p>UnderSampling</p>
<blockquote>
<p>Train after increasing or decreasing the number of samples</p>
<p>Forcibly inflating the data can cause overfitting</p>
</blockquote>
</li>
</ul>
</li>
<li><p>Example</p>
<ul>
<li><p><code>!pip install -U imbalanced-learn</code></p>
</li>
<li><pre><code class="language-python">from imblearn.over_sampling import SMOTE
# define SMOTE and run oversampling

smote = SMOTE(random_state=0)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
print(&#39;Train feature/label shapes before SMOTE: &#39;, X_train.shape, y_train.shape)
print(&#39;Train feature/label shapes after SMOTE: &#39;, X_train_over.shape, y_train_over.shape)
# &gt; running this grows the data set by roughly 1.5x
# label distribution after SMOTE: equal counts of 0 and 1
pd.Series(y_train_over).value_counts()</code></pre>
</li>
<li><pre><code class="language-python"># MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_over = scaler.transform(X_train_over)
X_test = scaler.transform(X_test)</code></pre>
<ul>
<li>Re-scaling</li>
</ul>
</li>
<li><p>Re-running the DNN raises accuracy to over 79%</p>
</li>
</ul>
</li>
</ul>
<hr>
<h2 id="기타-팁">Miscellaneous tips</h2>
<ul>
<li><p>object.drop</p>
<ul>
<li>data.drop(columns=[&#39;dev&#39;], axis=1)</li>
<li>data.drop(&#39;dev&#39;, axis=&#39;columns&#39;)<ul>
<li>The two lines above are equivalent</li>
</ul>
</li>
</ul>
</li>
<li><p>dataframe.loc</p>
<ul>
<li><pre><code>first_data_dropped =  data_dropped.loc[data_dropped.index &lt;= datetime.datetime(2019, 10,31)]
last_data_dropped = data_dropped.loc[data_dropped.index &gt; datetime.datetime(2019, 10,31)]</code></pre></li>
<li><pre><code>train = data_dropped.loc[:datetime.datetime(2019,10,31),:]
test = data_dropped.loc[datetime.datetime(2019,11,1):,:]</code></pre><ul>
<li>The two snippets above give the same result</li>
</ul>
</li>
</ul>
</li>
</ul>
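<p>The two date-based <code>loc</code> styles above can be checked on a small frame with a DatetimeIndex (toy dates around the same 2019-10-31 cut point):</p>

```python
import datetime
import pandas as pd

idx = pd.date_range("2019-10-29", periods=5, freq="D")  # 10-29 .. 11-02
data_dropped = pd.DataFrame({"v": range(5)}, index=idx)

# Boolean-mask style
first = data_dropped.loc[data_dropped.index <= datetime.datetime(2019, 10, 31)]
# Label-slice style: same cut point, same result
train = data_dropped.loc[:datetime.datetime(2019, 10, 31), :]
test = data_dropped.loc[datetime.datetime(2019, 11, 1):, :]
```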
]]></description>
        </item>
    </channel>
</rss>