bits_by_seng.log

[Basic Stats] 01.Hypothesis Testing

Sun, 12 Nov 2023 03:12:34 GMT

This series will be followed by a more advanced series on mathematical statistics, which I'm still getting the hang of!

This post is meant to be a refresher on the topic! If this is your first time with these topics, please check out more thorough materials.

Hypothesis Testing

Terminology

*Hypothesis:* A statement about something we are trying to investigate. It is normally based on a belief about the population.

*Null Hypothesis ($H_{0}$):* The original statement about the population that we want to test.

*Alternative Hypothesis ($H_{a}$):* The statement that contradicts the null hypothesis; it is what we conclude when we reject $H_{0}$.

*Type I Error:* Rejecting the null hypothesis although it is true. The probability of a Type I error is the significance level $a$, so $1 - a$ is the probability of correctly not rejecting a true $H_{0}$.

*Type II Error:* Failing to reject $H_{0}$ even though it is false.

*P-Value:* The probability, assuming $H_{0}$ is true, of observing a result at least as extreme as the one in the sample. It is a number between 0 and 1; it is not the probability that $H_{0}$ is true.

Two-sided Test vs. One-sided Test

	Two-tailed test	Left-tailed test	Right-tailed test
Sign in $H_{a}$ rejection region	$\neq$	<	>

One Sample Hypothesis Test

Assumptions about the population mean - population variance is KNOWN

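If we know the population variance, we can use the $z$-score regardless of the sample size (assuming the population is normally distributed, or the sample is large enough for the central limit theorem to apply).

$a = 0.05$, $z_{0} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

a) Two-tailed: $|z_{0}| \geq z_{a/2}$ Reject $H_{0}$
b) Right-tailed: $z_{0} \geq z_{a}$ Reject $H_{0}$
c) Left-tailed: $z_{0} \leq -z_{a}$ Reject $H_{0}$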

Otherwise we FAIL to reject $H_{0}$
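As a rough illustration of the decision rule above, here is a minimal sketch that uses only Python's standard library. The function name `z_test` and the sample numbers are made up for illustration; `statistics.NormalDist` supplies the standard normal CDF. Comparing the p-value with $a$ is equivalent to comparing $z_{0}$ with the critical value.

```python
from math import sqrt
from statistics import NormalDist


def z_test(x_bar, mu0, sigma, n, a=0.05, tail="two"):
    """One-sample z-test; returns (z0, p_value, reject_h0).

    tail: "two", "right" or "left" (cases a, b and c above).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z0 = (x_bar - mu0) / (sigma / sqrt(n))

    if tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z0)))
    elif tail == "right":
        p_value = 1 - std_normal.cdf(z0)
    elif tail == "left":
        p_value = std_normal.cdf(z0)
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")

    # p_value <= a is the same decision as z0 falling in the rejection region
    return z0, p_value, p_value <= a


# Hypothetical numbers: H0: mu = 100, known sigma = 15, n = 36, sample mean = 105
z0, p_value, reject = z_test(105, 100, 15, 36)
print(round(z0, 2), round(p_value, 4), reject)  # 2.0 0.0455 True
```

With these made-up numbers the two-tailed test rejects $H_{0}$ at $a = 0.05$, but only just.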

Assumptions about the population mean - the population variance is UNKNOWN

$a = 0.05$, $t_{0} = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t(n-1)$

When $n > 30$, the normal distribution is a reasonable approximation, because the t-distribution approaches the normal distribution as the degrees of freedom increase.

Assumptions about the population proportion

** Remember that the sampling distribution of $\hat{p}$ has a mean of $p$ and a standard deviation of $\sqrt{p(1-p)/n}$ **

$a = 0.05$, $z_{0} = \frac{\hat{p}-p}{\sqrt{p(1-p)/n}}$
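A small sketch of the proportion test, again with the standard library only. The function name `proportion_z_test` and the counts are hypothetical; the test shown is two-tailed, and $p$ in the formula is the value claimed under $H_{0}$ (called `p0` here).

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, a=0.05):
    """Two-tailed one-sample proportion z-test; returns (z0, p_value, reject_h0)."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)  # standard deviation of p_hat under H0
    z0 = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z0)))
    return z0, p_value, p_value <= a


# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5
z0, p_value, reject = proportion_z_test(60, 100, 0.5)
print(round(z0, 2), round(p_value, 4), reject)
```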

[ML] Data Preprocessing📚

Sun, 05 Nov 2023 05:12:06 GMT

Assigning numbers to non-number values using Label Encoder

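import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a', 'b'],
                   'B': [1, 2, 3, 1, 0]})

le = LabelEncoder()  # maps each distinct label to an integer (a -> 0, b -> 1, c -> 2)
df['le_A'] = le.fit_transform(df['A'])
df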

We can also go back from int to string:

le.inverse_transform(df['le_A'])

Feature Scaling

Feature scaling puts the features on comparable ranges, which helps gradient descent converge faster.

Min-max Scaling

$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$

df = pd.DataFrame({
    'A': [10, 20, -10, 0, 25],
    'B': [1, 2, 3, 1, 0]
})

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms.fit(df)

df_mms = mms.transform(df)

Standard Scaler (Z-score)

$x' = \frac{x - \mu}{\sigma}$

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(df)
df_ss = ss.transform(df)

Robust Scaler

$x' = \frac{x - Q_{2}}{Q_{3} - Q_{1}}$

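from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

mm = MinMaxScaler()
ss = StandardScaler()
rs = RobustScaler()

# Scale column 'A' only: each scaler returns an (n, 1) array,
# so flatten it before storing it as a single new column.
df_scaler = df.copy()
df_scaler['MinMax'] = mm.fit_transform(df[['A']]).ravel()
df_scaler['Standard'] = ss.fit_transform(df[['A']]).ravel()
df_scaler['Robust'] = rs.fit_transform(df[['A']]).ravel()

df_scaler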

In general there is very little difference in performance between the MinMaxScaler and the StandardScaler. The RobustScaler is more robust against outliers because it centers on the median (so the median maps to 0) and scales by the interquartile range rather than by the min/max or the standard deviation. If I am using ReLU as the activation function, for instance, I would lean towards the Min-Max Scaler, as it yields feature values in [0, 1].
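To see the effect of an outlier on each formula, here is a hand-rolled sketch using only the standard library instead of scikit-learn. The data (with the outlier 1000) is made up. The choice `method="inclusive"` is an assumption meant to mirror linear percentile interpolation, and `pstdev` is the population standard deviation, which is what StandardScaler uses.

```python
from statistics import mean, median, pstdev, quantiles

values = [10, 20, -10, 0, 25, 1000]  # hypothetical column with one outlier

lo, hi = min(values), max(values)
mu, sigma = mean(values), pstdev(values)
q1, q2, q3 = quantiles(values, n=4, method="inclusive")  # quartiles

min_max = [(x - lo) / (hi - lo) for x in values]
standard = [(x - mu) / sigma for x in values]
robust = [(x - q2) / (q3 - q1) for x in values]

for x, a, b, c in zip(values, min_max, standard, robust):
    print(f"{x:>5}  minmax={a:.3f}  standard={b:+.3f}  robust={c:+.3f}")

print("median of the robust-scaled values:", median(robust))
```

The outlier squeezes every other min-max value into a narrow band near 0, while the robust-scaled values of the ordinary points stay spread out around 0.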

Creating a Pipeline

import pandas as pd
red_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv"
white_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv"

red_wine = pd.read_csv(red_url, sep=";")
white_wine = pd.read_csv(white_url, sep=";")

red_wine['color'] = 1 
white_wine['color'] = 0 

wine = pd.concat([red_wine, white_wine])

X = wine.drop(['color'], axis = 1)
y = wine['color']

wine.head()

from sklearn.pipeline import Pipeline 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.preprocessing import StandardScaler 

estimators = [
    ('scaler', StandardScaler()), 
    ('clf', DecisionTreeClassifier())
]

pipe = Pipeline(estimators)

pipe.steps

[('scaler', StandardScaler()), ('clf', DecisionTreeClassifier())]

pipe.set_params(clf__max_depth=2)      # <step name>__<parameter>, with a double underscore
pipe.set_params(clf__random_state=13)

[EDA/Python] Playing with Pandas 📊🐼 Part 2

Sat, 21 Oct 2023 09:36:14 GMT

Applying Functions to Dataframes

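To be honest, I'm more comfortable in English. Since I write this blog for my own learning, whenever the Korean word doesn't come to mind I'll just write it in English. 🐼

apply() can be used to apply a function to a df or a Series.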

"Dataframe C"

display(C)
G = C.copy()  # without copy(), changes made to G would also modify C
G['year'] = G['year'].apply(lambda x: "'{:02d}".format(x % 100))  # e.g. 1999 -> '99
display(G)

"Dataframe G"

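That's how it's used. It can get considerably more complex, and more useful.

Creating a New Column from an Operation on Two Columns

First, don't forget the directions of axis = 0 and axis = 1: with apply(), axis = 0 passes each column to the function, while axis = 1 passes each row.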
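A tiny sketch of the two directions, using a made-up dataframe (`toy` is hypothetical and is not one of the dataframes in this post):

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({"cases": [1, 2, 3], "population": [10, 20, 30]})

# axis=0: the function receives each column (a Series running down the rows)
col_sums = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row (a Series running across the columns)
row_ratio = toy.apply(lambda row: row["cases"] / row["population"], axis=1)

print(col_sums)
print(row_ratio)
```

`col_sums` has one entry per column, and `row_ratio` has one entry per row.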

G['prevalence'] = G['cases'] / G['population']

Of course, the line above is the simplest way, but let's write a function that uses apply().

def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    F = G.copy()
    F['prevalence'] = F.apply(lambda row : row['cases']/row['population'], axis=1)

    return F
display(calc_prevalence(G))

[EDA/Python] Playing with Pandas 📊🐼 Part 1

Sat, 21 Oct 2023 09:14:47 GMT

🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼🐼

Tidy Data

Tidy data means that the data is in a form that suits its purpose. According to Hadley Wickham, R programming master and statistician, *tidy data is a 2-D table that satisfies the following conditions:*

1. each column represents a variable;
2. each row represents an observation;
3. each entry of the table represents a single value, which may come from either categorical (discrete) or continuous spaces.

A 'tidy' table is also sometimes called a 'tibble'.

import pandas as pd 
from io import StringIO
from IPython.display import display            # handy for showing plots or dataframes

A_csv = """country,year,cases
Afghanistan,1999,745
Brazil,1999,37737
China,1999,212258
Afghanistan,2000,2666
Brazil,2000,80488
China,2000,213766"""

with StringIO(A_csv) as fp:
    A = pd.read_csv(fp)
print("=== A ===")
display(A)

# Population table; the figures are assumed from the tidyr `table1` dataset
B_csv = """country,year,population
Afghanistan,1999,19987071
Brazil,1999,172006362
China,1999,1272915272
Afghanistan,2000,20595360
Brazil,2000,174504898
China,2000,1280428583"""

with StringIO(B_csv) as fp:
    B = pd.read_csv(fp)
print("\n=== B ===")
display(B)

The merge() function makes it easy to combine these two dfs.

C = A.merge(B, on=['country', 'year'])
print("\n=== C = merge(A, B) ===")
display(C)

Joins

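Simply put, they work as follows:

Inner-join(A,B) (default): keeps only the rows whose keys appear in both A and B and drops the rest
Outer-join(A,B): keeps the union of the rows, filling non-matching entries with NaN
Left-join(A,B): keeps every row of A, plus the matching rows of B
Right-join(A,B): the mirror image of the left-join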

with StringIO("""x,y,z
bug,1,d
rug,2,d
lug,3,d
mug,4,d""") as fp:
    D = pd.read_csv(fp)
print("=== D ===")
display(D)

with StringIO("""x,y,w
hug,-1,e
smug,-2,e
rug,-3,e
tug,-4,e
bug,1,e""") as fp:
    E = pd.read_csv(fp)
print("\n=== E ===")
display(E)

print("\n=== Outer-join (D, E) ===")
display(D.merge(E, on=['x', 'y'], how='outer'))

print("\n=== Left-join (D, E) ===")
display(D.merge(E, on=['x', 'y'], how='left'))

print("\n=== Right-join (D, E) ===")
display(D.merge(E, on=['x', 'y'], how='right'))


print("\n=== Inner-join (D, E) ===")
display(D.merge(E, on=['x', 'y']))

Easy, right~?

[EDA/Python] Row Major v.s. Column Major 📊

Sat, 21 Oct 2023 03:43:26 GMT

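What is this?

When using arrays with two or more dimensions, you need to pay attention to row-major versus column-major order.

Regardless of its number of dimensions, an array is always stored as a 1-D sequence in memory or on a storage device.

So how can a 2-D array be flattened into 1-D?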

Row-major

Row-major means the array is stored row by row.

That is, a 3 x 3 matrix is stored as follows.

[a11 a12 a13 a21 a22 a23 a31 a32 a33]

Let's write a function that converts the original (i, j) index into an index into the 1-D row-major list!

m = number of rows
n = number of columns
i = row index
j = column index

def linearize_rowmajor(i, j, m, n): # calculate `v`

    return i * n + j

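Easy, right?

Column-major

It is the same principle, stored column by column instead, so I will skip the explanation.

The column-major function looks like this.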

def linearize_colmajor(i, j, m, n): # calculate `u`

    return i + (j*m)
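A quick way to sanity-check both formulas is to flatten a small array with them. This sketch assumes m is the number of rows and n the number of columns, which is what the two formulas imply; the 2 x 3 array is a made-up example.

```python
def linearize_rowmajor(i, j, m, n):
    """Flat index of element (i, j) in an m-by-n array stored row by row."""
    return i * n + j


def linearize_colmajor(i, j, m, n):
    """Flat index of element (i, j) in an m-by-n array stored column by column."""
    return i + j * m


m, n = 2, 3  # 2 rows, 3 columns (hypothetical example)
A = [["a11", "a12", "a13"],
     ["a21", "a22", "a23"]]

row_major = [None] * (m * n)
col_major = [None] * (m * n)
for i in range(m):
    for j in range(n):
        row_major[linearize_rowmajor(i, j, m, n)] = A[i][j]
        col_major[linearize_colmajor(i, j, m, n)] = A[i][j]

print(row_major)  # ['a11', 'a12', 'a13', 'a21', 'a22', 'a23']
print(col_major)  # ['a11', 'a21', 'a12', 'a22', 'a13', 'a23']
```

Using a non-square array matters here: with a square one, mixing up m and n would go unnoticed.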

[EDA/Python] Numpy! Numpy What and Why? 📊

Sat, 21 Oct 2023 02:25:09 GMT

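What is NumPy?

I'm happy because the math I have been studying almost every day for the past year seems to be paying off. Since I recently started studying machine learning in earnest, I don't think I have really been blocked by the math (problems caused by being a few IQ points short, on the other hand, are frequent).

Anyway, NumPy is a library that makes operations on 'multidimensional arrays' easy. It is much faster than using plain lists or dictionaries. For 'gradient descent' in particular, using functions such as np.dot or np.matmul performs the matrix operations far faster than updating the parameters in a for loop. I will go into this in more detail in later machine learning posts.

Why Numpy?

'Vectorization' is very important for machine learning algorithms. Take a look at the code below.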

import numpy
import time
size = 1000000

list1 = range(size)
list2 = range(size)

array1 = numpy.arange(size)  
array2 = numpy.arange(size)

initialTime = time.time()
resultantList = [(a * b) for a, b in zip(list1, list2)]

print("Time taken by Lists :", 
      (time.time() - initialTime),
      "seconds")

initialTime = time.time()
resultantArray = array1 * array2

print("Time taken by NumPy Arrays :",
      (time.time() - initialTime),
      "seconds")

> Time taken by Lists : 1.1984527111053467 seconds
  Time taken by NumPy Arrays : 0.13434123992919922 seconds

Treating a list as a 'vectorized' array, like a matrix, produces the result much faster.

Basic Syntax

Creating a matrix

import numpy as np

B = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11]])

Checking the shape of B
```
print(B.shape)
```
```
> (3, 4)
```

Creating a 3 x 4 zero matrix and a 3 x 3 identity matrix

print(np.zeros((3,4)))
print(np.eye(3))
print(np.eye(3))

[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


### Indexing and Slicing 
![](https://velog.velcdn.com/images/bits_by_seng/post/93717366-0f2c-4ba1-bbdb-0cd2405651fd/image.png)

```python
Z = np.array([[0, 1, 2, 3, 4, 5],
              [10, 11, 12, 13, 14, 15],
              [20, 21, 22, 23, 24, 25],
              [30, 31, 32, 33, 34, 35],
              [40, 41, 42, 43, 44, 45],
              [50, 51, 52, 53, 54, 55]])

# Construct `Z_green`, `Z_red`, `Z_orange`, and `Z_cyan`:
Z_green = Z[(2,4), ::2]
Z_red = Z[:, 2]
Z_orange = Z[0, 3:5]
Z_cyan = Z[(4,5), 4:6]
```

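Nothing too difficult here; think of it as similar to list indexing. In terms of memory, results of basic slicing such as Z_red and Z_orange are just 'views': declaring a variable from a slice does not allocate new memory for the data. Indexing with a sequence of integers, as in Z_green and Z_cyan, returns a copy instead. If you want a new object from a slice, call Z[:, 2].copy().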
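One way to check whether an indexing result shares memory with the original array is `np.shares_memory`. This is a small sketch on a stand-in 6 x 6 array (built with `np.arange`, so it is not the Z from the figure):

```python
import numpy as np

Z = np.arange(36).reshape(6, 6)  # stand-in array for illustration

col_view = Z[:, 2]            # basic slice -> view
rows_copy = Z[(2, 4), ::2]    # integer-sequence index -> copy
col_copy = Z[:, 2].copy()     # explicit copy

print(np.shares_memory(Z, col_view))   # True
print(np.shares_memory(Z, rows_copy))  # False
print(np.shares_memory(Z, col_copy))   # False

col_view[0] = -1              # writing through the view changes Z as well
print(Z[0, 2])                # -1
```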

Indirect Addressing

You can also index using an array made up of a 'Boolean mask' or of 'indices'.

from numpy.random import default_rng 
rng = default_rng(12345) 

x = rng.integers(0, 20, 15) 
print(x)
> [13 4 15 6 4 15 12 13 19 7 16 6 11 11 4]

inds = np.array([3, 7, 7, 12])
print(x[inds])
> [6 13 13 11]

mask_mult_3 = (x > 0) & (x % 3 ==0) 
print("x:", x)
print("mask_mult_3:", mask_mult_3)
print("==> x[mask_mult_3]:", x[mask_mult_3]) 
>x: [13 4 15 6 4 15 12 13 19 7 16 6 11 11 4]
>mask_mult_3: [False False  True  True False  True  True False False False False  True
 False False False]
>==> x[mask_mult_3]: [15 6 15 12 6]

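Application

Let's write an algorithm that finds all primes up to 20. The Sieve of Eratosthenes can be written using numpy. Honestly it isn't necessary, and in a coding test I would probably just use a list.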

from math import sqrt
def sieve(n):

    is_prime = np.empty(n+1, dtype=bool) # the "sieve"

    # Initial values
    is_prime[0:2] = False # {0, 1} are _not_ considered prime
    is_prime[2:] = True # All other values might be prime

    m = int(sqrt(n)) + 1

    for i in range(2, m):
        if is_prime[i] == True:
            for j in range(i+i, n+1, i):
                is_prime[j] = False 

    return is_prime

# Prints your primes
print("==> Primes through 20:\n", np.nonzero(sieve(20))[0])
>==> Primes through 20:  
 [2 3 5 7 11 13 17 19]

[Algorithms/Python] Useful Math Algorithms Roundup, Part 1 📒

Mon, 16 Oct 2023 12:26:52 GMT

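Finding Primes with the Sieve of Eratosthenes

To decide whether a natural number n is prime, we can loop from 2 to n-1 and check whether any number divides it evenly. However, this naive approach has O(n) complexity per number, so it will exceed the time limit.

So we will use an algorithm called the Sieve of Eratosthenes.

Logic

The divisors of a number come in pairs that are symmetric around its square root, so we only need to check for divisors up to the square root.

This means it is enough to implement an algorithm that eliminates all multiples of the numbers up to the square root!

Let's implement it in Python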

def prime_list(n):
    # sieve[k] says whether k (0 <= k < n) is still considered prime; assume all are at first
    sieve = [True] * n

    m = int(n ** 0.5)
    for i in range(2, m + 1):
        if sieve[i]:  # i is prime
            for j in range(i + i, n, i):  # mark the multiples of i after i as not prime
                sieve[j] = False

    # Build the list of primes below n
    print(sieve)  # debug output
    return [i for i in range(2, n) if sieve[i]]

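For a single number, checking divisors only up to √n brings the cost down to roughly O(√n)! The sieve applies the same idea to find every prime below n in about O(n log log n), which is far more efficient than testing each number separately. Isn't that delightful?!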
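For comparison, the square-root argument applied to a single number could look like the sketch below. The function name `is_prime` is made up here; it uses plain trial division, not the sieve.

```python
def is_prime(n):
    """Trial division: test divisors only up to sqrt(n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:      # same as d <= sqrt(n), without floating point
        if n % d == 0:
            return False
        d += 1
    return True


print([k for k in range(20) if is_prime(k)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```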

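Prime Factorization Using the Same Idea as the Sieve of Eratosthenes

The same logic applies to prime factorization. n can have at most one prime factor larger than √n, so there is no need to keep trying divisors all the way until n becomes 1: once the divisor exceeds the square root of what remains, whatever remains (if greater than 1) is itself prime.

The code makes this clear.

Let's implement it in Python~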

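import sys

N = int(sys.stdin.readline().strip())
d = 2

# Trial division: only divisors with d * d <= N need to be tried
while d * d <= N:
    if N % d != 0:
        d += 1
    else:
        print(d)  # d is a prime factor
        N //= d
# Whatever is left above 1 is the last (largest) prime factor
if N > 1:
    print(N)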

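Finding the GCD & LCM with the Euclidean Algorithm

Given a and b, we could set i to min(a,b) and run a while loop doing i -= 1 until a % i == 0 and b % i == 0. But in the worst case the loop runs i times, so it is likely to exceed the time limit.

That's why we should use the Euclidean algorithm!

Logic

For two natural numbers a and b (a > b), if r is the remainder of a divided by b, then the greatest common divisor of a and b equals the greatest common divisor of b and r. Doesn't that smell strongly of recursion?!

Let's implement it in Python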

def gcd_u(a,b):
    bigger = max(a,b)
    smaller = min(a,b)
    if smaller == 0:
        return bigger
    return gcd_u(smaller, bigger % smaller)

Likewise, this saves a lot of time!

So how do we find the LCM?

The least common multiple of two numbers a and b equals the product of a and b divided by their greatest common divisor!
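As a minimal sketch of that relationship, assuming positive integers: `gcd_iter` is an iterative variant of the Euclidean algorithm (the name is made up here), and `lcm` divides before multiplying to keep the intermediate value small.

```python
def gcd_iter(a, b):
    """Euclidean algorithm, iterative form: gcd(a, b) = gcd(b, a % b)."""
    while b:
        a, b = b, a % b
    return a


def lcm(a, b):
    """lcm(a, b) = a * b / gcd(a, b), for positive integers."""
    return a // gcd_iter(a, b) * b


print(gcd_iter(48, 18), lcm(48, 18))  # 6 144
```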