<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>denver_almighty.log</title>
        <link>https://velog.io/</link>
        <description>For my future self, who will have forgotten</description>
        <lastBuildDate>Sun, 12 Feb 2023 06:09:36 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>denver_almighty.log</title>
            <url>https://velog.velcdn.com/images/denver_almighty/profile/ecd10afb-1c9e-4096-a2c3-225faddcdf20/social_profile.jpeg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. denver_almighty.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/denver_almighty" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[Kafka] Kafka Programming in Python]]></title>
            <link>https://velog.io/@denver_almighty/Kafka-Python-Kafka-%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D</link>
            <guid>https://velog.io/@denver_almighty/Kafka-Python-Kafka-%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D</guid>
            <pubDate>Sun, 12 Feb 2023 06:09:36 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. Environment</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Kafka : 3.3.1
Scala : 2.13</p>
</blockquote>
<br>

<h1 id="1-python으로-kafka-producer-consumer-만들기">1. Creating a Kafka Producer and Consumer in Python</h1>
<h2 id="producerpy">producer.py</h2>
<pre><code class="language-python">
from kafka import KafkaProducer

# create kafka producer instance
producer = KafkaProducer(bootstrap_servers = [&#39;localhost:9092&#39;])

# set topic name
producer.send(&#39;first-topic&#39;, b&#39;hello world&#39;)
# reset buffer
producer.flush()
</code></pre>
<h2 id="comsumerpy">consumer.py</h2>
<pre><code class="language-python">
from kafka import KafkaConsumer

# create kafka consumer instance
consumer = KafkaConsumer(&#39;first-topic&#39;, bootstrap_servers=[&#39;localhost:9092&#39;])

# print message
for msg in consumer:
    print(msg)</code></pre>
<pre><code class="language-bash">python consumer.py
python producer.py</code></pre>
<p>Start consumer.py first, then run producer.py in another window.
<img src="https://velog.velcdn.com/images/denver_almighty/post/c13cc02c-c011-4c49-b541-76f294a9a20c/image.png" alt=""></p>
<p>The topic name, offset (message arrival order), timestamp, payload, and so on are delivered with each message.</p>
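<p>Those fields can be illustrated with a plain-Python stand-in for kafka-python's ConsumerRecord — a sketch only, with made-up values and no broker needed:</p>

```python
from collections import namedtuple

# Stand-in for kafka-python's ConsumerRecord; the real object comes from
# iterating a KafkaConsumer. The values below are illustrative only.
ConsumerRecord = namedtuple(
    "ConsumerRecord", ["topic", "partition", "offset", "timestamp", "value"]
)

def describe(msg):
    # Format the fields a consumer typically inspects.
    return (f"topic={msg.topic} partition={msg.partition} "
            f"offset={msg.offset} ts={msg.timestamp} value={msg.value.decode()}")

msg = ConsumerRecord("first-topic", 0, 0, 1676182176000, b"hello world")
print(describe(msg))
# topic=first-topic partition=0 offset=0 ts=1676182176000 value=hello world
```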
]]></description>
        </item>
        <item>
            <title><![CDATA[[Kafka] Creating a Topic]]></title>
            <link>https://velog.io/@denver_almighty/Kafka-Topic-%EB%A7%8C%EB%93%A4%EA%B8%B0</link>
            <guid>https://velog.io/@denver_almighty/Kafka-Topic-%EB%A7%8C%EB%93%A4%EA%B8%B0</guid>
            <pubDate>Sun, 12 Feb 2023 05:44:28 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. Environment</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Kafka : 3.3.1
Scala : 2.13</p>
</blockquote>
<br>

<h1 id="1-topic-만들기">1. Creating a Topic</h1>
<h2 id="토픽-생성">Create a topic</h2>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/ac239401-374d-41bd-999b-7130da5f945d/image.png" alt=""></p>
<pre><code>bin/kafka-topics.sh --create --topic &lt;test-topic&gt; --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1</code></pre><h2 id="토픽-리스트-확인">List topics</h2>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/8d48e438-1fd5-4daa-aecb-54bdd1674f7b/image.png" alt=""></p>
<h2 id="producer-실행">Run the producer</h2>
<pre><code>./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic</code></pre><p>Messages typed here are posted to the topic.
<img src="https://velog.velcdn.com/images/denver_almighty/post/2a65937a-7aee-4bf7-b04d-b666a0daabdc/image.png" alt=""></p>
<h2 id="consumer-실행">Run the consumer</h2>
<pre><code>./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/3c78e41f-c45e-4425-ac82-80bfeec97523/image.png" alt=""></p>
<p>When a message is typed in the producer window on the left,
it appears in the consumer window on the right.</p>
<p><br><br></p>
<h1 id="2-컨슈머-그룹">2. Consumer Groups</h1>
<p>If no consumer group is specified, a unique consumer group is created automatically.</p>
<pre><code>./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/7c9992d5-644b-46c7-afc4-eef3521c03ea/image.png" alt=""></p>
<h2 id="그룹-지정-컨슈머-생성">Start a consumer with a group</h2>
<pre><code>./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --group test-group</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/03366987-221c-4a65-ad57-1bdc239bfa78/image.png" alt=""></p>
<h2 id="컨슈머-그룹-리스트">List consumer groups</h2>
<pre><code>./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/5c17f093-b15b-4a27-8ce6-529eb6ea3fb1/image.png" alt=""></p>
<h2 id="컨슈머-그룹-상세">Describe a consumer group</h2>
<pre><code>./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group test-group</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/83453537-0e1e-4348-903a-dbad930a7fd4/image.png" alt=""></p>
<p><br><br></p>
<h1 id="3-consumer-와-partition">3. Consumers and Partitions</h1>
<h2 id="1-파티션-1개짜리-토픽">1) Topic with one partition</h2>
<h3 id="1-producer-2-consumer">1 producer, 2 consumers</h3>
<p>With no consumer group specified, a unique group is created for each consumer (see section 2),
so when the producer sends a message, every consumer receives the same message. </p>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/3755e493-4a93-4d06-a089-884fb427a188/image.png" alt="">
<br></p>
<h3 id="1-producer-1-consumer-group2-consumer">1 producer, 1 consumer group (2 consumers)</h3>
<p>When the consumers are bound into one group, only one consumer receives each message.
<img src="https://velog.velcdn.com/images/denver_almighty/post/018f863d-699e-4e5a-8022-be76b2419cd8/image.png" alt=""></p>
<br>

<h3 id="2-producer-1-consumer-group-2-consumer">2 producers, 1 consumer group (2 consumers)</h3>
<p>Even when a second producer sends messages, only one consumer receives them.
<img src="https://velog.velcdn.com/images/denver_almighty/post/1e872fdc-4d5c-4bf2-8c78-4d1f01f0f861/image.png" alt=""></p>
<p>-&gt; Because the first-topic topic has only one partition,
that single partition is mapped to exactly one consumer in the group.
-&gt; To use consumer resources efficiently, configure multiple partitions.</p>
<h2 id="partition-2개짜리-topic">2) Topic with two partitions</h2>
<h3 id="토픽-리스트-확인-1">List topics</h3>
<pre><code>./bin/kafka-topics.sh --list --bootstrap-server localhost:9092</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/643ebd5f-6a71-4be1-8510-21ce59e788c0/image.png" alt=""></p>
<h3 id="partition-2개-짜리-토픽-생성">Create a topic with two partitions</h3>
<pre><code>./bin/kafka-topics.sh --create --topic topic-multi-partition --bootstrap-server localhost:9092 --partitions 2 --replication-factor 1</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/971da9b8-06f2-4410-adb0-8f34476e73a2/image.png" alt=""></p>
<h3 id="2-producer-1-consumer-group2-consumer">2 producers, 1 consumer group (2 consumers)</h3>
<pre><code># producer
./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic topic-multi-partition

# consumer
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic-multi-partition --group my-newgroup</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/21ac8e41-a9c3-468c-bec3-6d550193891a/image.png" alt=""></p>
<p>Messages are now distributed across the consumers.
In the screenshot, most messages from producer 1 go to consumer 1 and most from producer 2 go to consumer 2, but producer 2&#39;s messages can also reach consumer 1.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[ERROR] Kafka Zookeeper 실행 오류 : Classpath is empty. Please build the project first e.g. by running './gradlew jar -PscalaVersion=2.13.8']]></title>
            <link>https://velog.io/@denver_almighty/ERROR-Kafka-Zookeeper-%EC%8B%A4%ED%96%89-%EC%98%A4%EB%A5%98-Classpath-is-empty.-Please-build-the-project-first-e.g.-by-running-.gradlew-jar-PscalaVersion2.13.8</link>
            <guid>https://velog.io/@denver_almighty/ERROR-Kafka-Zookeeper-%EC%8B%A4%ED%96%89-%EC%98%A4%EB%A5%98-Classpath-is-empty.-Please-build-the-project-first-e.g.-by-running-.gradlew-jar-PscalaVersion2.13.8</guid>
            <pubDate>Sun, 15 Jan 2023 06:44:53 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. Environment</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Kafka : 3.3.1
Scala : 2.13</p>
</blockquote>
<h1 id="1-error">1. ERROR</h1>
<p>ZooKeeper fails to start.<img src="https://velog.velcdn.com/images/denver_almighty/post/8d7549c1-cf62-483d-8e61-ee6c377d976a/image.png" alt=""></p>
<pre><code class="language-bash">./bin/zookeeper-server-start.sh --daemon  ./config/zookeeper.properties
=&gt; 
Classpath is empty. Please build the project first e.g. by running &#39;./gradlew jar -PscalaVersion=2.13.8&#39;</code></pre>
<p>=&gt; Download the binary release, not the source archive.
<img src="https://velog.velcdn.com/images/denver_almighty/post/b2a670b7-25af-425d-ae52-8b3b6c818aab/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Snowflake] Earned Badge 1]]></title>
            <link>https://velog.io/@denver_almighty/Snowflake-Badge-1-%ED%9A%8D%EB%93%9D-1ked1qc5</link>
            <guid>https://velog.io/@denver_almighty/Snowflake-Badge-1-%ED%9A%8D%EB%93%9D-1ked1qc5</guid>
            <pubDate>Sun, 08 Jan 2023 07:28:51 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/denver_almighty/post/f94bb001-85bb-4b89-81d9-3470650b624d/image.png" alt=""></p>
<p>I completed the Snowflake webinar and hands-on Lab 1 and received a badge.
Through Lab 1 (data warehousing), where the schema is defined up front and queries are written in Snowflake SQL, I kept wondering how this differs from an ordinary database.
Lab 2 covers building an application that uses Snowflake as the backend, which seems to be where the real part begins.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[AWS] Auto-Stopping EC2 Instances on a Schedule]]></title>
            <link>https://velog.io/@denver_almighty/AWS-EC2-%EC%9D%B8%EC%8A%A4%ED%84%B4%EC%8A%A4-%EC%9E%90%EB%8F%99-%EC%A4%91%EC%A7%80-%EC%84%A4%EC%A0%95-ixzo5x8f</link>
            <guid>https://velog.io/@denver_almighty/AWS-EC2-%EC%9D%B8%EC%8A%A4%ED%84%B4%EC%8A%A4-%EC%9E%90%EB%8F%99-%EC%A4%91%EC%A7%80-%EC%84%A4%EC%A0%95-ixzo5x8f</guid>
            <pubDate>Sun, 08 Jan 2023 06:51:14 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/denver_almighty/post/2fedbdc2-7524-4ca1-9f6e-dafcc1a4a87c/image.png" alt=""> Image by 다락원</p>
<br>

<h1 id="0-배경">0. Background</h1>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/269914bc-2874-4190-880f-bac37420194e/image.png" alt=""></p>
<p>The AWS billing dashboard showed a larger charge than expected.
I must have forgotten to stop a test instance..
Paying for what I actually use is fine, but paying for idle instances I didn&#39;t use hurts ㅜㅜ
To avoid losing the next cow, let&#39;s mend the barn.
<br><br></p>
<h1 id="1-설정하기">1. Setup</h1>
<h2 id="1-policy-생성">1) Create a Policy</h2>
<pre><code class="language-json">{
    &quot;Version&quot;: &quot;2012-10-17&quot;,
    &quot;Statement&quot;: [
        {
            &quot;Sid&quot;: &quot;VisualEditor0&quot;,
            &quot;Effect&quot;: &quot;Allow&quot;,
            &quot;Action&quot;: [
                &quot;ec2:DescribeInstances&quot;,
                &quot;ec2:StartInstances&quot;,
                &quot;ec2:DescribeTags&quot;,
                &quot;logs:*&quot;,
                &quot;ec2:DescribeInstanceTypes&quot;,
                &quot;ec2:StopInstances&quot;,
                &quot;ec2:DescribeInstanceStatus&quot;
            ],
            &quot;Resource&quot;: &quot;*&quot;
        }
    ]
}</code></pre>
<h2 id="2-role-생성">2) Create a Role</h2>
<p>Create a Role for the Lambda service, attaching the policy created in 1).
<img src="https://velog.velcdn.com/images/denver_almighty/post/34a45e73-8af4-451c-ba6b-3ffdd52911a6/image.png" alt=""></p>
<h2 id="3-lambda-생성">3) Create the Lambda Function</h2>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/2f1d9ed0-791a-482b-8181-ae65c20dfdad/image.png" alt=""></p>
<p>Runtime : Python
Execution role : use an existing role -&gt; select the Role created in 2)</p>
<pre><code class="language-python">import boto3

region = &#39;ap-northeast-2&#39;
ec2 = boto3.resource(&#39;ec2&#39;, region_name=region)

def lambda_handler(event, context):
    # Get running instance list with tag AutoStop=True
    # instance-state-name : ( pending | running | shutting-down | terminated | stopping | stopped )
    instances = ec2.instances.filter(Filters=[
        {
            &#39;Name&#39;: &#39;instance-state-name&#39;, 
            &#39;Values&#39;: [&#39;running&#39;]
        }
        ,{
            &#39;Name&#39;: &#39;tag:AutoStop&#39;,
            &#39;Values&#39;:[&#39;True&#39;]
        }
    ])

    # Stop instance
    for instance in instances:
        id=instance.id
        # ec2.instances.filter(InstanceIds=[id]).start()
        ec2.instances.filter(InstanceIds=[id]).stop()
        print(&#39;Instance ID is stopped :- &#39;+instance.id)

    return &#39;success&#39;</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/4e7a74c9-1c93-44ba-93c3-4e4015fc3191/image.png" alt=""></p>
<h2 id="4-eventbridge-설정">4) Configure EventBridge</h2>
<h3 id="규칙-세부-정보-정의">Define rule details</h3>
<p>Create an EventBridge rule -&gt; choose Schedule as the rule type.</p>
<h3 id="일정-정의">Define the schedule</h3>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/9b7181c1-70ff-4bf5-98b8-5914ac710635/image.png" alt=""></p>
<p>When defining the cron pattern, check whether the displayed time is UTC or local time.</p>
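<p>EventBridge cron expressions are evaluated in UTC, so a schedule meant for 01:00 KST has to be written as 16:00 UTC. A quick sketch of the conversion (KST is a fixed UTC+9 offset with no DST):</p>

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9))  # Korea Standard Time, no DST

def kst_hour_to_utc_hour(hour_kst: int) -> int:
    # Any date works here, since the offset never changes.
    kst = datetime(2023, 1, 8, hour_kst, 0, tzinfo=KST)
    return kst.astimezone(timezone.utc).hour

# A daily 01:00 KST stop corresponds to cron(0 16 * * ? *) in EventBridge (UTC).
print(kst_hour_to_utc_hour(1))   # 16
```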
<h3 id="대상-선택">Select the target</h3>
<p>Select the Lambda function created in 3).
<img src="https://velog.velcdn.com/images/denver_almighty/post/5d2cb53e-73d8-445f-a9a8-3b64bcb91cda/image.png" alt=""></p>
<p>I tested by changing the EventBridge schedule.
The EC2 dashboard shows the instance has stopped,
and the CloudWatch log matches the test run&#39;s log.
<img src="https://velog.velcdn.com/images/denver_almighty/post/4890bd0e-5a2a-4802-a340-bf193187272b/image.png" alt=""> </p>
<p>Now that the test is done, set the schedule back to 01:00!
<img src="https://velog.velcdn.com/images/denver_almighty/post/557682b2-5392-4ac0-bfae-088c638fe9e3/image.png" alt=""></p>
<h3 id="외양간-고치기-끝-🐮🐮🐮">Barn mended 🐮🐮🐮</h3>
<p><br><br></p>
<h1 id="참고-자료">References</h1>
<p><a href="https://aws.amazon.com/ko/premiumsupport/knowledge-center/start-stop-lambda-eventbridge/">How do I stop and start Amazon EC2 instances at regular intervals using Lambda? (permissions there don&#39;t match)</a></p>
<p><a href="https://dheeraj3choudhary.com/aws-lambda-and-eventbridge-or-schedule-start-and-stop-of-ec2-instances-based-on-tags">Dheeraj Choudhary&#39;s Blog - AWS Lambda &amp; EventBridge | Schedule Start And Stop Of EC2 Instances Based On Tags...</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Python] Generating a Spotify API Token]]></title>
            <link>https://velog.io/@denver_almighty/Python-Spotify-API-token-%EC%83%9D%EC%84%B1</link>
            <guid>https://velog.io/@denver_almighty/Python-Spotify-API-token-%EC%83%9D%EC%84%B1</guid>
            <pubDate>Sun, 01 Jan 2023 10:09:43 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. Environment</h1>
<blockquote>
<p>Python : 3.9</p>
</blockquote>
<br>

<h1 id="1-토큰-생성">1. Generate a Token</h1>
<pre><code class="language-python">import requests
import base64
import json

client_id = &#39;&lt;Client ID&gt;&#39;
client_secret = &#39;&lt;Client Secret&gt;&#39;
endpoint = &#39;https://accounts.spotify.com/api/token&#39;


encoded = base64.b64encode(f&#39;{client_id}:{client_secret}&#39;.encode(&#39;utf-8&#39;)).decode(&#39;ascii&#39;)

headers = {&#39;Authorization&#39;: f&#39;Basic {encoded}&#39;}
payload = {&#39;grant_type&#39;: &#39;client_credentials&#39;}

response = requests.post(endpoint, data=payload, headers=headers)
# print(json.loads(response.text))
access_token = json.loads(response.text)[&#39;access_token&#39;]
print(access_token)
</code></pre>
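<p>The Basic auth header above is just base64 of <code>client_id:client_secret</code>. A self-contained check of that encoding, using dummy credentials and no network call:</p>

```python
import base64

# Dummy values for illustration; real credentials come from the Spotify dashboard.
client_id = "my-id"
client_secret = "my-secret"

encoded = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
print(encoded)

# Decoding round-trips to the original pair.
print(base64.b64decode(encoded).decode("utf-8"))  # my-id:my-secret
```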
]]></description>
        </item>
        <item>
            <title><![CDATA[[ERROR] (Not Solved) Airflow HttpSensor 400 Client Error: Bad Request for url]]></title>
            <link>https://velog.io/@denver_almighty/ERROR-Airflow-HttpSensor-400-Client-Error-Bad-Request-for-url</link>
            <guid>https://velog.io/@denver_almighty/ERROR-Airflow-HttpSensor-400-Client-Error-Bad-Request-for-url</guid>
            <pubDate>Sun, 01 Jan 2023 09:57:03 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. Environment</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Python : 3.9
Airflow : 2.5.0</p>
</blockquote>
<h1 id="1-code">1. Code</h1>
<pre><code class="language-python">from datetime import datetime

from airflow import DAG
from airflow.providers.http.sensors.http import HttpSensor

with DAG(
    dag_id=&#39;nft-pipeline&#39;,
    start_date=datetime(2023, 1, 1),
) as dag:

    is_api_available = HttpSensor(
        task_id = &#39;is_api_available&#39;,
        http_conn_id = &#39;spotify_api&#39;,
        headers = {
            # &#39;Accept&#39;: &#39;application/json&#39;,
            # &#39;Content-Type&#39;: &#39;application/json&#39;,
            &#39;Authorization&#39;: 
                &#39;Bearer &lt;MYTOKEN&gt;&#39;,
        },

        request_params = {
            &#39;q&#39;: &#39;BTS&#39;,
            &#39;type&#39;: &#39;artist&#39;,
            &#39;limit&#39;: &#39;1&#39;,
        },
        method=&quot;GET&quot;,
        endpoint=&#39;v1/search&#39;
    )</code></pre>
<p>curl and requests return results, but running the task raises a 400 error.
Neither HttpSensor nor SimpleHttpOperator works.
Adding response_check=False does not help either.
providers/http/hooks/http.py and requests/models.py only check the response code, so the cause is unclear.
Removing the headers from the task definition produces the same error.</p>
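<p>The response body, &quot;Only valid bearer authentication supported&quot;, hints that Spotify considers the Authorization header itself malformed. A hypothetical sanity check on the header string — a debugging sketch, not a confirmed fix (note that client-credentials tokens also expire after an hour, though an expired token normally returns 401, not 400):</p>

```python
def check_bearer_header(value: str) -> list:
    """Flag the usual formatting problems in an Authorization header value."""
    problems = []
    if not value.startswith("Bearer "):
        problems.append("missing 'Bearer ' prefix")
        return problems
    token = value[len("Bearer "):]
    if any(ch.isspace() for ch in token):
        problems.append("token contains whitespace (check line breaks in the DAG)")
    if "<" in token or ">" in token:
        problems.append("placeholder brackets were not replaced")
    return problems

# The DAG above splits the header string across two source lines; this check
# would catch a stray newline or an unreplaced <MYTOKEN> placeholder.
print(check_bearer_header("Bearer <MYTOKEN>"))
# ['placeholder brackets were not replaced']
```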
<h1 id="2-connection">2. Connection</h1>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/39ccfab0-a7b6-4485-838b-5d4edf7a7610/image.png" alt=""></p>
<h1 id="3-api-test">3. API Test</h1>
<h2 id="1-curl-결과">1) curl result</h2>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/f41ed904-e806-49b1-a101-64ab1d30ecb9/image.png" alt=""></p>
<h2 id="2-requests">2) Requests</h2>
<pre><code class="language-python">import requests

headers = {
    &#39;Accept&#39;: &#39;application/json&#39;,
    &#39;Content-Type&#39;: &#39;application/json&#39;,
    &#39;Authorization&#39;: &#39;Bearer &lt;MYTOKEN&gt;&#39;,
}

params = {
    &#39;q&#39;: &#39;BTS&#39;,
    &#39;type&#39;: &#39;artist&#39;,
    &#39;limit&#39;: &#39;1&#39;,
}

response = requests.get(&#39;https://api.spotify.com/v1/search&#39;, params=params, headers=headers)
print(response)</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/9ed5a1d6-6184-4519-8f07-08818be13037/image.png" alt=""></p>
<h1 id="4-error">4. Error</h1>
<pre><code>requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.spotify.com/v1/search?q=BTS&amp;type=artist&amp;limit=1</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/24c0282f-3b51-4ba9-a214-346196706e38/image.png" alt=""></p>
<pre><code class="language-bash">[2023-01-01 08:52:44,836] {http.py:122} INFO - Poking: v1/search
[2023-01-01 08:52:44,839] {base.py:73} INFO - Using connection ID &#39;spotify_api&#39; for task execution.
[2023-01-01 08:52:44,840] {http.py:150} INFO - Sending &#39;GET&#39; to url: https://api.spotify.com/v1/search
[2023-01-01 08:52:44,988] {http.py:163} ERROR - HTTP error: Bad Request
[2023-01-01 08:52:44,989] {http.py:164} ERROR - {
  &quot;error&quot;: {
    &quot;status&quot;: 400,
    &quot;message&quot;: &quot;Only valid bearer authentication supported&quot;
  }
}
[2023-01-01 08:52:44,989] {taskinstance.py:1772} ERROR - Task failed with exception
Traceback (most recent call last):
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 161, in check_response
    response.raise_for_status()
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/site-packages/requests/models.py&quot;, line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.spotify.com/v1/search?q=BTS&amp;type=artist&amp;limit=1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/sensors/base.py&quot;, line 199, in execute
    poke_return = self.poke(context)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/sensors/http.py&quot;, line 137, in poke
    raise exc
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/sensors/http.py&quot;, line 124, in poke
    response = hook.run(
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 151, in run
    return self.run_and_check(session, prepped_request, extra_options)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 204, in run_and_check
    self.check_response(response)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 165, in check_response
    raise AirflowException(str(response.status_code) + &quot;:&quot; + response.reason)
airflow.exceptions.AirflowException: 400:Bad Request
[2023-01-01 08:52:44,990] {taskinstance.py:1322} INFO - Marking task as FAILED. dag_id=nft-pipeline, task_id=is_api_available, execution_date=20230101T085244, start_date=, end_date=20230101T085244
Traceback (most recent call last):
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 161, in check_response
    response.raise_for_status()
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/site-packages/requests/models.py&quot;, line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.spotify.com/v1/search?q=BTS&amp;type=artist&amp;limit=1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File &quot;/home/ec2-user/.local/bin/airflow&quot;, line 8, in &lt;module&gt;
    sys.exit(main())
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/__main__.py&quot;, line 39, in main
    args.func(args)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/cli/cli_parser.py&quot;, line 52, in command
    return func(*args, **kwargs)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/utils/cli.py&quot;, line 108, in wrapper
    return f(*args, **kwargs)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py&quot;, line 576, in task_test
    ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/utils/session.py&quot;, line 75, in wrapper
    return func(*args, session=session, **kwargs)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py&quot;, line 1673, in run
    self._run_raw_task(
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/utils/session.py&quot;, line 72, in wrapper
    return func(*args, **kwargs)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py&quot;, line 1378, in _run_raw_task
    self._execute_task_with_callbacks(context, test_mode)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py&quot;, line 1524, in _execute_task_with_callbacks
    result = self._execute_task(context, task_orig)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py&quot;, line 1585, in _execute_task
    result = execute_callable(context=context)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/sensors/base.py&quot;, line 199, in execute
    poke_return = self.poke(context)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/sensors/http.py&quot;, line 137, in poke
    raise exc
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/sensors/http.py&quot;, line 124, in poke
    response = hook.run(
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 151, in run
    return self.run_and_check(session, prepped_request, extra_options)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 204, in run_and_check
    self.check_response(response)
  File &quot;/home/ec2-user/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py&quot;, line 165, in check_response
    raise AirflowException(str(response.status_code) + &quot;:&quot; + response.reason)
airflow.exceptions.AirflowException: 400:Bad Request</code></pre>
<h1 id="참고-자료">References</h1>
<p><a href="https://github.com/apache/airflow/blob/main/airflow/providers/http/hooks/http.py">Airflow : providers/http/hooks/http.py</a>
<a href="https://github.com/psf/requests/blob/main/requests/models.py">Python : requests/models.py</a>
<a href="https://airflow-apache.readthedocs.io/en/latest/_api/airflow/sensors/http_sensor/index.html">Airflow : HttpSensor</a>
<a href="https://curlconverter.com/python/">CurlConverter</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Airflow] Installing Airflow (pip)]]></title>
            <link>https://velog.io/@denver_almighty/Airflow-Airflow-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0pip</link>
            <guid>https://velog.io/@denver_almighty/Airflow-Airflow-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0pip</guid>
            <pubDate>Thu, 29 Dec 2022 15:08:45 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. Environment</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Python : 3.9
Airflow : 2.5.0</p>
</blockquote>
<br>

<h1 id="1-설치하기">1. Installation</h1>
<pre><code class="language-bash"># Check Python is 3.6+ and pip resolves to the anaconda3 path
pip --version

# Install
pip install apache-airflow

# An airflow directory is created under home
cd /home/ec2-user/airflow

# Initialize the metadata DB
airflow db init

# Run the webserver on port 8080
airflow webserver -p 8080

# In a new session: ssh port forwarding
ssh -i &quot;&lt;my_key.pem&gt;&quot; -L 8080:localhost:8080 ec2-user@&lt;my.instance.ip&gt;

# Create an admin account
airflow users create --role Admin --username &lt;USERNAME&gt; \
--password &lt;PASSWORD&gt; --firstname &lt;FIRSTNAME&gt; \
--lastname &lt;LASTNAME&gt; --email &lt;MYEMAIL&gt;
</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/f2e9e30c-87bd-4746-9af0-603b29c541ae/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/bc57acb9-7e79-470e-8f4a-4a8413438bb0/image.png" alt="">
The web server runs on Flask.</p>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/b61cd5a7-fa32-42c2-919d-157969fb39af/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/065edef4-ccfd-4367-bdd5-3d2ce950b7cb/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/00e40fae-362a-4c1c-9a96-1602b4bf17e1/image.png" alt=""></p>
<p>Since this is installed on AWS rather than locally, <a href="https://almighty-denver.tistory.com/entry/Python-EC2%EC%97%90-Jupyter-Notebook-%EC%8B%A4%ED%96%89%ED%95%98%EA%B8%B0ssh-%ED%8F%AC%ED%8A%B8%ED%8F%AC%EC%9B%8C%EB%94%A9">ssh port forwarding</a> is required.</p>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/e387019d-bd42-4e5b-8110-e995e2484984/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/e474aeae-d3d3-4102-aa97-7e51de9eae17/image.png" alt=""></p>
<br>

<hr>
<h1 id="참고-자료">References</h1>
<p><a href="https://velog.io/@denver_almighty/Python-EC2%EC%97%90-Jupyter-Notebook-%EC%8B%A4%ED%96%89%ED%95%98%EA%B8%B0ssh-%ED%8F%AC%ED%8A%B8%ED%8F%AC%EC%9B%8C%EB%94%A9">ssh port forwarding</a></p>
<hr>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Spark] Spark Streaming]]></title>
            <link>https://velog.io/@denver_almighty/Spark-Spark-Streaming</link>
            <guid>https://velog.io/@denver_almighty/Spark-Spark-Streaming</guid>
            <pubDate>Sun, 18 Dec 2022 11:21:19 GMT</pubDate>
            <description><![CDATA[<p>The Spark Streaming example from the Spark docs:
counting words received on localhost:9999.</p>
<h1 id="0-실행-환경">0. Environment</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Python : 3.9
Spark : 3.3.1
Scala : 2.12.15
Java : OpenJDK 64-Bit Server VM, 1.8.0_352</p>
</blockquote>
<h1 id="1-streaming-test">1. Streaming Test</h1>
<h2 id="1-1-streamingpy-생성">1-1. Create streaming.py</h2>
<pre><code>vi streaming.py</code></pre><pre><code class="language-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create SparkSession
spark = SparkSession \
    .builder \
    .appName(&quot;StructuredNetworkWordCount&quot;) \
    .getOrCreate()

# localhost:9999 streaming input -&gt; Create DataFrame
lines = spark \
    .readStream \
    .format(&quot;socket&quot;) \
    .option(&quot;host&quot;, &quot;localhost&quot;) \
    .option(&quot;port&quot;, 9999) \
    .load()


# Split input by &quot; &quot; as word
words = lines.select(
   explode(
       split(lines.value, &quot; &quot;)
   ).alias(&quot;word&quot;)
)

# Count words
wordCounts = words.groupBy(&quot;word&quot;).count()

# Print number of words
query = wordCounts \
    .writeStream \
    .outputMode(&quot;complete&quot;) \
    .format(&quot;console&quot;) \
    .start()

query.awaitTermination()



# Alternative: the same pipeline written with expr() (use in place of the block above)
words_df = lines.select(expr(&quot;explode(split(value, &#39; &#39;)) as word&quot;))
counts_df = words_df.groupBy(&quot;word&quot;).count()
word_count_query = counts_df.writeStream.format(&quot;console&quot;)\
                            .outputMode(&quot;complete&quot;)\
                            .option(&quot;checkpointLocation&quot;, &quot;.checkpoint&quot;)\
                            .start()
word_count_query.awaitTermination()
</code></pre>
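<p>The transformation inside the streaming query is easier to see without Spark. A plain-Python equivalent of explode(split(value, &quot; &quot;)) followed by groupBy(&quot;word&quot;).count() — an illustration only, not part of the Spark job:</p>

```python
from collections import Counter

def word_count(lines):
    # explode(split(value, " ")) -> one element per word;
    # Counter plays the role of groupBy("word").count().
    words = [w for line in lines for w in line.split(" ") if w]
    return dict(Counter(words))

print(word_count(["hello spark", "hello streaming"]))
# {'hello': 2, 'spark': 1, 'streaming': 1}
```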
<h2 id="1-2-streaming-실행">1-2. Run the streaming job</h2>
<pre><code class="language-bash">spark-submit streaming.py localhost 9999
</code></pre>
<h2 id="1-3-netcat실행">1-3. Run netcat</h2>
<pre><code class="language-bash"># In another session, run:
nc -lk 9999
# -&gt; then type some text</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/a64532f4-0659-4537-acea-25c2897a1884/image.png" alt=""><img src="https://velog.velcdn.com/images/denver_almighty/post/c71352bd-c6c1-4aac-8b40-308a40fbded2/image.png" alt=""></p>
<h2 id="1-4-결과">1-4. Results</h2>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/9a2b7e84-6c62-42a3-b351-518e2c1778d2/image.png" alt=""><img src="https://velog.velcdn.com/images/denver_almighty/post/79dc1d25-1106-4e6e-a3f0-21f6bdc11b39/image.png" alt=""></p>
<br>

<h1 id="2-readsteam-options">2. readStream Options</h1>
<pre><code class="language-python"># socket source (for testing: reads UTF-8 text; fault tolerance is not guaranteed)
spark.readStream \
    .format(&quot;socket&quot;) \
    .option(&quot;host&quot;, &quot;localhost&quot;) \
    .option(&quot;port&quot;, 9999) \
    .load()

# rate source (for testing: generates a given number of rows per second)

# kafka source
spark.readStream \
    .format(&quot;kafka&quot;) \
    .option(&quot;kafka.bootstrap.servers&quot;, &quot;localhost:9092&quot;) \
    .option(&quot;subscribe&quot;, &quot;topic1&quot;) \
    .load()

# file source
# supported formats: text, csv, json, orc, parquet
userSchema = StructType().add(&quot;name&quot;, &quot;string&quot;).add(&quot;age&quot;, &quot;integer&quot;)
csvDF = spark \
    .readStream \
    .option(&quot;sep&quot;, &quot;;&quot;) \
    .schema(userSchema) \
    .csv(&quot;/path/to/directory&quot;) </code></pre>
<h1 id="q">Q.</h1>
<ol>
<li><p>readStream(&quot;socket&quot;).option(&quot;host&quot;, HOST):
can HOST be something other than localhost?</p>
</li>
<li><p>Test with the Kafka and file sources</p>
</li>
<li><p>Resolve the checkpointLocation error</p>
<pre><code class="language-python">word_count_query = df.writeStream.format(&quot;console&quot;)\
                         .outputMode(&quot;complete&quot;)\
                         .option(&quot;checkpointLocation&quot;, &quot;.checkpoint&quot;)\
                         .start()</code></pre>
</li>
</ol>
<h1 id="참고-자료">References</h1>
<p><a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html">structured-streaming-programming-guide</a>
<a href="https://spark-korea.github.io/docs/structured-streaming-programming-guide.html">structured-streaming-programming-guide - KO</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Spark] Practicing SQL]]></title>
            <link>https://velog.io/@denver_almighty/Spark-SQL</link>
            <guid>https://velog.io/@denver_almighty/Spark-SQL</guid>
            <pubDate>Sun, 18 Dec 2022 07:44:16 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Python : 3.9
Spark : 3.3.1
Scala : 2.12.15
Java : OpenJDK 64-Bit Server VM, 1.8.0_352</p>
</blockquote>
<h1 id="1-sql-연습">1. SQL 연습</h1>
<pre><code class="language-python"># create data list
stockSchema = [&quot;name&quot;, &quot;ticker&quot;, &quot;country&quot;, &quot;price&quot;, &quot;currency&quot;]
stocks = [
    (&#39;Google&#39;, &#39;GOOGL&#39;, &#39;USA&#39;, 2984, &#39;USD&#39;), 
    (&#39;Netflix&#39;, &#39;NFLX&#39;, &#39;USA&#39;, 645, &#39;USD&#39;),
    (&#39;Amazon&#39;, &#39;AMZN&#39;, &#39;USA&#39;, 3518, &#39;USD&#39;),
    (&#39;Tesla&#39;, &#39;TSLA&#39;, &#39;USA&#39;, 1222, &#39;USD&#39;),
    (&#39;Tencent&#39;, &#39;0700&#39;, &#39;Hong Kong&#39;, 483, &#39;HKD&#39;),
    (&#39;Toyota&#39;, &#39;7203&#39;, &#39;Japan&#39;, 2006, &#39;JPY&#39;),
    (&#39;Samsung&#39;, &#39;005930&#39;, &#39;Korea&#39;, 70600, &#39;KRW&#39;),
    (&#39;Kakao&#39;, &#39;035720&#39;, &#39;Korea&#39;, 125000, &#39;KRW&#39;),
]

# create DataFrame (list to dataframe)
df = spark.createDataFrame(data=stocks, schema=stockSchema)

# create DataFrame (read csv file)
filename = &quot;/my/dir/filename.csv&quot;
# multiple files
filename = &quot;/my/dir/*.csv&quot;
df = spark.read.csv(f&quot;file:///{filename}&quot;, inferSchema=True, header=True)

# show data type
df.dtypes
&quot;&quot;&quot;
[(&#39;name&#39;, &#39;string&#39;),
 (&#39;ticker&#39;, &#39;string&#39;),
 (&#39;country&#39;, &#39;string&#39;),
 (&#39;price&#39;, &#39;bigint&#39;),
 (&#39;currency&#39;, &#39;string&#39;)]
&quot;&quot;&quot;

# describe() : print basic summary statistics
df.describe().show()
df.select(&quot;total_amount&quot;).describe().show()
&quot;&quot;&quot;
+-------+------------------+
|summary|      total_amount|
+-------+------------------+
|  count|           9344926|
|   mean|18.217332152376397|
| stddev|184.27259172356767|
|    min|            -647.8|
|    max|          398469.2|
+-------+------------------+
&quot;&quot;&quot;

# print DataFrame
df.show()
&quot;&quot;&quot;                                                        
+-------+------+---------+------+--------+
|   name|ticker|  country| price|currency|
+-------+------+---------+------+--------+
| Google| GOOGL|      USA|  2984|     USD|
|Netflix|  NFLX|      USA|   645|     USD|
| Amazon|  AMZN|      USA|  3518|     USD|
|  Tesla|  TSLA|      USA|  1222|     USD|
|Tencent|  0700|Hong Kong|   483|     HKD|
| Toyota|  7203|    Japan|  2006|     JPY|
|Samsung|005930|    Korea| 70600|     KRW|
|  Kakao|035720|    Korea|125000|     KRW|
+-------+------+---------+------+--------+
&quot;&quot;&quot;

# create a Spark Temporary View named &quot;stocks&quot;
df.createOrReplaceTempView(&quot;stocks&quot;)

# use SQL
spark.sql(&quot;select name from stocks&quot;)
&quot;&quot;&quot;
DataFrame[name: string]
&quot;&quot;&quot;
spark.sql(&quot;select price from stocks&quot;)
&quot;&quot;&quot;
DataFrame[price: bigint]
&quot;&quot;&quot;

# spark.sql(&quot;SQL&quot;).show() : show(n) prints n rows. default 20
spark.sql(&quot;select name from stocks&quot;).show()
&quot;&quot;&quot;
+-------+
|   name|
+-------+
| Google|
|Netflix|
| Amazon|
|  Tesla|
|Tencent|
| Toyota|
|Samsung|
|  Kakao|
+-------+
&quot;&quot;&quot;

spark.sql(&quot;select name, country from stocks where name like &#39;S%&#39;&quot;).show()
&quot;&quot;&quot;
+-------+-------+
|   name|country|
+-------+-------+
|Samsung|  Korea|
+-------+-------+
&quot;&quot;&quot;

# JOIN
spark.sql(&quot;select A.name, (A.price/B.eps) from A join B on A.name = B.name &quot;).show()

# explain(True)
spark.sql(&quot;select A.name, (A.price/B.eps) from A join B on A.name = B.name &quot;).explain()

# Datetime Format
# EEE : abbreviated day of week (e.g. Wed)
# EEEE : full day of week (e.g. Wednesday)
query = &quot;&quot;&quot;
SELECT 
    d.datetime,
    DATE_FORMAT(d.datetime, &#39;EEEE&#39;) AS day_of_week,
    COUNT(*) AS cnt
FROM
    df as d
GROUP BY
    d.datetime,
    day_of_week
&quot;&quot;&quot;

# DataFrame to pandas DataFrame
# pd_df can then be used like an ordinary pandas DataFrame with seaborn, matplotlib, etc.
pd_df = spark.sql(query).toPandas()

</code></pre>
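As a quick way to cross-check the EEE/EEEE day-of-week values outside Spark, Python's strftime offers the roughly equivalent %a/%A directives (a plain-Python illustration, not Spark's formatter):

```python
from datetime import datetime

d = datetime(2022, 12, 14)  # a Wednesday

# Spark's EEE roughly corresponds to Python's %a, EEEE to %A
print(d.strftime("%a"))  # Wed
print(d.strftime("%A"))  # Wednesday
```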
</br>
<h3 id="bigint">bigint</h3>
<p>Running df.dtypes shows that price has the type bigint.
bigint is the largest integer data type in SQL Server, stored in 8 bytes
(-9,223,372,036,854,775,808 ~ 9,223,372,036,854,775,807).</p>
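Those bounds are just the signed 64-bit integer range, which is easy to sanity-check in plain Python (nothing Spark- or SQL-specific here):

```python
# signed 64-bit (8-byte) integer range, matching the bigint bounds above
BIGINT_MIN = -(2 ** 63)
BIGINT_MAX = 2 ** 63 - 1

print(BIGINT_MIN)  # -9223372036854775808
print(BIGINT_MAX)  # 9223372036854775807
```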


<h3 id="sparksession">SparkSession</h3>
<p>pyspark.sql.SparkSession
The entry point for programming Spark with the Dataset and DataFrame API.
A SparkSession is used to create DataFrames, register DataFrames as tables, and read parquet files.
To create a SparkSession, use the builder pattern.</p>
<pre><code class="language-python">spark = SparkSession.builder \
    .master(&quot;local&quot;) \
    .appName(&quot;Word Count&quot;) \
    .config(&quot;spark.some.config.option&quot;, &quot;some-value&quot;) \
    .getOrCreate()</code></pre>
</br>

<h3 id="createorreplacetempview">createOrReplaceTempView</h3>
<p>DATAFRAME.createOrReplaceTempView(&quot;VIEW_NAME&quot;)
Creates or replaces a local temporary view (VIEW_NAME) from the DataFrame (DATAFRAME).
The lifetime of the temporary view is tied to the SparkSession used to create the DataFrame; when the session ends, the view is dropped.</p>
<h2 id="spark-function">Spark Function</h2>
<h3 id="date_truncdate-fmt">date_trunc(fmt, ts)</h3>
<p> : returns ts with everything below the fmt unit zeroed out.</p>
<blockquote>
<p> date_trunc(fmt, ts)
: Returns timestamp ts truncated to the unit specified by the format model fmt. fmt must be one of [&quot;YEAR&quot;, &quot;YYYY&quot;, &quot;YY&quot;, &quot;MON&quot;, &quot;MONTH&quot;, &quot;MM&quot;, &quot;DAY&quot;, &quot;DD&quot;, &quot;HOUR&quot;, &quot;MINUTE&quot;, &quot;SECOND&quot;, &quot;WEEK&quot;, &quot;QUARTER&quot;].</p>
</blockquote>
<pre><code class="language-sql">SELECT date_trunc(&#39;YEAR&#39;, &#39;2015-03-05T09:32:05.359&#39;);
-- -&gt; 2015-01-01T00:00:00
SELECT date_trunc(&#39;MM&#39;, &#39;2015-03-05T09:32:05.359&#39;);
-- -&gt; 2015-03-01T00:00:00
SELECT date_trunc(&#39;DD&#39;, &#39;2015-03-05T09:32:05.359&#39;);
-- -&gt; 2015-03-05T00:00:00
SELECT date_trunc(&#39;HOUR&#39;, &#39;2015-03-05T09:32:05.359&#39;);
-- -&gt; 2015-03-05T09:00:00</code></pre>
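Note that Spark's date_trunc takes the format unit first: date_trunc(fmt, ts). For cross-checking query results outside Spark, the same truncation semantics can be mimicked with datetime.replace; this is an illustrative plain-Python sketch, not Spark's implementation:

```python
from datetime import datetime

def date_trunc(fmt: str, ts: datetime) -> datetime:
    """Zero out everything below the unit given by fmt (a subset of Spark's units)."""
    if fmt in ("YEAR", "YYYY", "YY"):
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if fmt in ("MON", "MONTH", "MM"):
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if fmt in ("DAY", "DD"):
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if fmt == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported fmt: {fmt}")

ts = datetime.fromisoformat("2015-03-05T09:32:05.359")
print(date_trunc("YEAR", ts))  # 2015-01-01 00:00:00
print(date_trunc("HOUR", ts))  # 2015-03-05 09:00:00
```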
</br>

<h1 id="q">Q</h1>
<ul>
<li>If there are multiple files, create the DataFrame like this:<pre><code class="language-python">filename = &quot;*.csv&quot;
df = spark.read.csv(f&quot;file:///{filename}&quot;, inferSchema=True, header=True)</code></pre>
This is fine if the schemas are identical,
but if files with different schemas (call them A and B) are mixed together:<pre><code class="language-python">df.printSchema()
# -&gt; prints only A&#39;s schema

trips_df.select(&quot;B_COLUMN&quot;).show()
# -&gt; Column &#39;B_COLUMN&#39; does not exist.
</code></pre>
</li>
</ul>
</br>
<ul>
<li>spark.sql(&quot;QUERY&quot;) vs df.select(&quot;&quot;).describe().show()</li>
</ul>
</br></br>

<h1 id="참고-자료">참고 자료</h1>
<p><a href="https://www.dofactory.com/sql/bigint">bigint</a>
<a href="https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.SparkSession.html#pyspark.sql.SparkSession">Spark Docs : pyspark.sql.SparkSession</a>
<a href="https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.createOrReplaceTempView.html">Spark Docs : pyspark.sql.DataFrame.createOrReplaceTempView</a>
<a href="https://spark.apache.org/docs/2.3.0/api/sql/index.html">Spark Docs : Spark functions</a>
<a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Spark Docs : Datetime Pattern</a>
<a href="https://dbmstutorials.com/pyspark/spark-dataframe-format-timestamp.html">PySpark Datetime Format</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Python] EC2에 Jupyter Notebook 실행하기(ssh 포트포워딩)]]></title>
            <link>https://velog.io/@denver_almighty/Python-EC2%EC%97%90-Jupyter-Notebook-%EC%8B%A4%ED%96%89%ED%95%98%EA%B8%B0ssh-%ED%8F%AC%ED%8A%B8%ED%8F%AC%EC%9B%8C%EB%94%A9</link>
            <guid>https://velog.io/@denver_almighty/Python-EC2%EC%97%90-Jupyter-Notebook-%EC%8B%A4%ED%96%89%ED%95%98%EA%B8%B0ssh-%ED%8F%AC%ED%8A%B8%ED%8F%AC%EC%9B%8C%EB%94%A9</guid>
            <pubDate>Sat, 10 Dec 2022 13:40:33 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Red Hat 9.1
Python :3.9
Jupyter Notebook 6.4.12</p>
</blockquote>
<h1 id="1-jupyter-notebook-설정">1. Jupyter Notebook 설정</h1>
<p>Jupyter Notebook 설치 후 비밀번호를 설정한다.
꼭 필요한 과정은 아니다.
다만 비밀번호를 생성하지 않으면 실행 시 마다 생성되는 token 값으로 접속해야한다.</p>
<h2 id="1-1-jupyter-notebook-비밀번호-생성">1-1. Jupyter Notebook 비밀번호 생성</h2>
<pre><code class="language-python">python
&gt;&gt;&gt; from notebook.auth import passwd
&gt;&gt;&gt; passwd()
Enter password: # enter the password
Verify password: # enter the password again
&#39;&lt;argon2/sha ... PASSWORD_HASH&gt;&#39;
# after entering the password twice, the password hash is printed. Be sure to copy it!
&gt;&gt;&gt; exit()
</code></pre>
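For reference, the hash newer notebook versions print starts with argon2, while older versions used a sha1 scheme of the form algorithm:salt:digest. The sketch below illustrates that legacy format in plain Python; it is an approximation to explain the output, not the exact notebook implementation:

```python
import hashlib
import secrets

def legacy_notebook_passwd(passphrase: str, algorithm: str = "sha1") -> str:
    """Sketch of the legacy 'algorithm:salt:digest' notebook password hash."""
    salt = secrets.token_hex(6)  # 12 hex characters
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return f"{algorithm}:{salt}:{h.hexdigest()}"

print(legacy_notebook_passwd("my-password"))  # e.g. 'sha1:<salt>:<digest>'
```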
<h2 id="1-2-비밀번호-설정">1-2. 비밀번호 설정</h2>
<pre><code class="language-bash"># generate the config file
jupyter notebook --generate-config
# edit the config file
vi /home/ec2-user/.jupyter/jupyter_notebook_config.py
# paste the password hash generated above
c.NotebookApp.password = u&#39;&lt;PASSWORD_HASH&gt;&#39;</code></pre>
</br>

<h1 id="2-jupyter-notebook-실행">2. Jupyter Notebook 실행</h1>
<pre><code>jupyter notebook</code></pre><p>1번을 생략했다면 접속 URL을 출력하는데 복사해둔다.
<img src="https://velog.velcdn.com/images/denver_almighty/post/09172c1d-3a7f-4956-9147-d6dfda87af5f/image.png" alt="">
</br></p>
<h1 id="3-ssh-포트포워딩">3. SSH 포트포워딩</h1>
<pre><code class="language-bash"># 서버 8888포트를 로컬 &lt;LOCAL_PORT&gt; 포트로 포트포워딩
# (jupyter notebook 기본 포트 8888)
ssh -i &quot;&lt;key.pem&gt;&quot; -L &lt;LOCAL_PORT&gt;:localhost:8888 &lt;username&gt;@&lt;public_ip&gt;</code></pre>
<pre><code class="language-bash"># 실행중인 프로세스 확인
ps</code></pre>
<p>  <img src="https://velog.velcdn.com/images/denver_almighty/post/27980798-2e80-4cff-98ce-e4e9455b6cc5/image.png" alt=""></p>
<pre><code class="language-bash"># LISTEN 포트 확인
sudo lsof -i -P -n | grep LISTEN
# 58971 프로세스에서 8888 포트 LISTEN</code></pre>
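The same LISTEN check can be scripted with Python's standard socket module when lsof is unavailable; this is a generic convenience sketch, not part of the Jupyter setup itself:

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# once the SSH tunnel is up, this should report True:
# port_is_listening("localhost", 8888)
```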
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/a646be75-0c22-4e67-989a-f19f0eb6b24e/image.png" alt="">
</br></p>
<h1 id="4-jupyter-notebook-접속">4. Jupyter Notebook 접속</h1>
<pre><code>localhost:8888</code></pre><p>비밀번호 입력 창이 나오는데 1번에서 입력한 비밀번호를 입력한다.
<img src="https://velog.velcdn.com/images/denver_almighty/post/af12d7a6-d4c9-466a-a738-e476dca41e0f/image.png" alt=""></p>
<p>1번을 생략했다면 2번에서 복사한 URL로 접속한다.
<img src="https://velog.velcdn.com/images/denver_almighty/post/301ab4fc-3da4-4aea-8838-46da574ebecf/image.png" alt=""></p>
<p></br></br></br></p>
<h1 id="실패-기록">실패 기록</h1>
<p>Jupyter Notebook 설치, 실행 후 
http(s)://<PUBLIC_IP>:8888 로 접속하면 연결이 안됐다.
인바운드는 내 ip에서는 모두 허용이었다.
인증 키를 만들고,
비밀번호를 생성하고,
jupyter_notebook_config.py 에 설정하고,
접속하면 된다던데 안된다.</p>
<p>접속 시도하면 아래 로그가 발생한다.
https말고 http로 접속하라는데 그래도 안된다.</p>
<pre><code class="language-bash">handle: &lt;Handle BaseAsyncIOLoop._handle_events(8, 1)&gt;
Traceback (most recent call last):
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/asyncio/events.py&quot;, line 80, in _run
    self._context.run(self._callback, *self._args)
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/site-packages/tornado/platform/asyncio.py&quot;, line 189, in _handle_events
    handler_func(fileobj, events)
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/site-packages/tornado/netutil.py&quot;, line 276, in accept_handler
    callback(connection, address)
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/site-packages/tornado/tcpserver.py&quot;, line 288, in _handle_connection
    connection = ssl_wrap_socket(
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/site-packages/tornado/netutil.py&quot;, line 608, in ssl_wrap_socket
    context = ssl_options_to_context(ssl_options)
  File &quot;/opt/anaconda/anaconda3/lib/python3.9/site-packages/tornado/netutil.py&quot;, line 576, in ssl_options_to_context
    context.load_cert_chain(
ssl.SSLError: [SSL] PEM lib (_ssl.c:4065)</code></pre>
</br>

<h1 id="q">Q.</h1>
<p>ssh는 원격접속할 때나 썼는데 이렇게 포트포워딩은 처음 해봤다.
SSH 포트포워딩 알아보자</p>
<h1 id="참고-자료">참고 자료</h1>
<p><a href="https://docs.jupyter.org/en/latest/">Jupyter Notebook DOCS</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Kafka] Docker Compose로 Kafka 멀티 브로커 구성]]></title>
            <link>https://velog.io/@denver_almighty/Kafka-Docker-Compose%EB%A1%9C-Kafka-%EB%A9%80%ED%8B%B0-%EB%B8%8C%EB%A1%9C%EC%BB%A4-%EA%B5%AC%EC%84%B1</link>
            <guid>https://velog.io/@denver_almighty/Kafka-Docker-Compose%EB%A1%9C-Kafka-%EB%A9%80%ED%8B%B0-%EB%B8%8C%EB%A1%9C%EC%BB%A4-%EA%B5%AC%EC%84%B1</guid>
            <pubDate>Sat, 26 Nov 2022 11:51:53 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행환경">0. 실행환경</h1>
<blockquote>
<p>AWS EC2 t2.xlarge
OS : Ubuntu 22.04
Kafka : 
Docker Compose : v2.7.0</p>
</blockquote>
</br>

<h1 id="1-실행">1. 실행</h1>
<h2 id="1-docker-composeyml-생성">1) docker-compose.yml 생성</h2>
<pre><code>vi docker-compose.yml</code></pre><pre><code class="language-yml">version: &#39;2&#39;
services:
  zookeeper-1:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_SERVER_ID: 1
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
    ports:
      - &quot;22181:2181&quot;

  zookeeper-2:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_SERVER_ID: 2
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
    ports:
      - &quot;32181:2181&quot;

  zookeeper-3:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_SERVER_ID: 3
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
    ports:
      - &quot;42181:2181&quot;



  kafka-1:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper-1
      - zookeeper-2
      - zookeeper-3
    ports:
      - 29092:29092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:9092,PLAINTEXT_HOST://localhost:29092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1

  kafka-2:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper-1
      - zookeeper-2
      - zookeeper-3
    ports:
      - &quot;39092:39092&quot;
    environment:
      KAFKA_BROKER_ID: 2
      KAFKA_ZOOKEEPER_CONNECT: zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:9092,PLAINTEXT_HOST://localhost:39092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1

  kafka-3:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper-1
      - zookeeper-2
      - zookeeper-3
    ports:
      - &quot;49092:49092&quot;
    environment:
      KAFKA_BROKER_ID: 3
      KAFKA_ZOOKEEPER_CONNECT: zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-3:9092,PLAINTEXT_HOST://localhost:49092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
</code></pre>
<pre><code class="language-docker"># docker-compose configuration variables
depends_on : sets service startup order; this service starts only after the services listed in depends_on are up.
environment: sets environment variables

# Kafka configuration variables
KAFKA_BROKER_ID: must be unique.
KAFKA_ZOOKEEPER_CONNECT: specifies the zookeepers
KAFKA_ADVERTISED_LISTENERS: listeners advertised for external connections
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: maps listeners to security protocols. These values are matched key-value with KAFKA_ADVERTISED_LISTENERS.
KAFKA_INTER_BROKER_LISTENER_NAME: the listener name used between brokers inside Docker
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: replication factor for the transaction state log
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: minimum ISR (in-sync replicas) for the transaction state log</code></pre>
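KAFKA_LISTENER_SECURITY_PROTOCOL_MAP is literally a comma-separated list of NAME:PROTOCOL pairs. The hypothetical helper below (not part of Kafka or docker-compose) just makes that key-value pairing explicit:

```python
def parse_protocol_map(value: str) -> dict:
    """Parse a KAFKA_LISTENER_SECURITY_PROTOCOL_MAP-style string into a dict."""
    pairs = (item.split(":", 1) for item in value.split(","))
    return {name: proto for name, proto in pairs}

mapping = parse_protocol_map("PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT")
print(mapping)  # {'PLAINTEXT': 'PLAINTEXT', 'PLAINTEXT_HOST': 'PLAINTEXT'}
```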
<h2 id="2-docker-compose-실행">2) docker compose 실행</h2>
<pre><code>docker-compose -f docker-compose.yml up -d</code></pre><h2 id="3-토픽-생성">3) 토픽 생성</h2>
<pre><code>docker-compose exec kafka-1 kafka-topics --create --topic test-topic --bootstrap-server kafka-1:9092 --replication-factor 3 --partitions 2

=&gt; Created topic test-topic.
</code></pre><pre><code>--bootstrap-server &lt;service:port&gt; : 클라이언트가 접근하는 토픽 파티션의 메타데이터를 요청하기 위한 설정
--replication-factor : 토픽 복제 수
--partition: 토픽내에 파티션 수</code></pre><h2 id="4-토픽-확인">4) 토픽 확인</h2>
<pre><code>docker-compose exec kafka-1 kafka-topics --describe --topic test-topic --bootstrap-server kafka-1:9092 

=&gt;
Topic: test-topic    TopicId: zrU8TR3IQu2l24nQkYZ1jA    PartitionCount: 2    ReplicationFactor: 3    Configs:
    Topic: test-topic    Partition: 0    Leader: 3    Replicas: 3,1,2    Isr: 3,1,2
    Topic: test-topic    Partition: 1    Leader: 1    Replicas: 1,2,3    Isr: 1,2,3
</code></pre><pre><code>Leader : 파티션의 리더 브로커
Replicas : 데이터 복제
Isr : In sync replica (동기화된 복제본)</code></pre><h2 id="5-컨슈머-실행">5) 컨슈머 실행</h2>
<pre><code>docker-compose exec kafka-1 bash
[appuser@6e847b6b1748 ~]$ kafka-console-consumer --topic test-topic --bootstrap-server kafka-1:9092</code></pre><h2 id="6-producer-실행">6) producer 실행</h2>
<pre><code>$ docker-compose exec kafka-1 bash 
[appuser@6e847b6b1748 ~]$ kafka-console-producer --topic test-topic --broker-list kafka-1:9092</code></pre><p>producer
<img src="https://velog.velcdn.com/images/denver_almighty/post/9424545d-fb08-4623-95e4-5cdff20dba51/image.png" alt=""></p>
<p>consumer
<img src="https://velog.velcdn.com/images/denver_almighty/post/ec39d7b6-611d-4968-b627-aa57f86c5cc2/image.png" alt=""></p>
<h1 id="3-replication-수-변경">3. Replication 수 변경</h1>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/52e08879-5a84-4693-820c-e7ac8b4238f4/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[MongoDB] DB, Data CRUD 명령어 모음]]></title>
            <link>https://velog.io/@denver_almighty/MongoDB-DB-Data-CRUD-%EB%AA%85%EB%A0%B9%EC%96%B4-%EB%AA%A8%EC%9D%8C</link>
            <guid>https://velog.io/@denver_almighty/MongoDB-DB-Data-CRUD-%EB%AA%85%EB%A0%B9%EC%96%B4-%EB%AA%A8%EC%9D%8C</guid>
            <pubDate>Sat, 26 Nov 2022 08:18:42 GMT</pubDate>
            <description><![CDATA[<pre><code class="language-bash">mongosh
use admin
show dbs
&gt; 인증오류</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/f860b258-5fc8-4901-bdbe-7a10c09b4673/image.png" alt=""></p>
<h1 id="db-생성">DB 생성</h1>
<pre><code>mongosh admin -u &quot;USERNAME&quot; -p &quot;PW&quot;
show dbs</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/1a0c553e-0623-4cf8-8a93-c4ce4c4cc8e5/image.png" alt=""></p>
<pre><code>use test_db
show dbs
&gt; test_db가 안보인다
</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/99075669-6e4e-445f-97c0-3b104a6dd8d7/image.png" alt="">
</br></p>
<h1 id="데이터-추가">데이터 추가</h1>
<pre><code>db.collection.insert()
db.collection.insertOne({})
db.collection.insertMany([{},{}.....])</code></pre><pre><code>db.collection.insert({&lt;DATA&gt;})
db 
&gt; DB 이름 출력
show dbs</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/08c23351-78e0-43e9-a483-b36572665cfd/image.png" alt="">
<img src="https://velog.velcdn.com/images/denver_almighty/post/3003ce2f-f630-4516-854f-4d66817f1a01/image.png" alt="">
</br></p>
<h1 id="데이터-입력update">데이터 입력(Update)</h1>
<pre><code class="language-mongo">db.user.insert({&lt;DATA&gt;})</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/2872b7fd-31e9-4813-adc3-1e6b82fc592b/image.png" alt="">
<img src="https://velog.velcdn.com/images/denver_almighty/post/48a415d8-f30e-494f-922b-8d53f151e3a7/image.png" alt=""></p>
<br>

<h1 id="데이터-읽기">데이터 읽기</h1>
<pre><code>db.collection.find()</code></pre></br>

<h1 id="데이터-삭제">데이터 삭제</h1>
<pre><code>db.collection.deleteOne()
db.collection.deleteMany()</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/a46078a0-591a-4974-b356-18524c2faa00/image.png" alt=""></p>

<p><img src="https://velog.velcdn.com/images/denver_almighty/post/307adab6-8b19-4b16-bfd3-acff5397dd3a/image.png" alt=""></p>
</br>

<h1 id="db-삭제delete">DB 삭제(Delete)</h1>
<pre><code>db.dropDatabase()</code></pre><p><img src="https://velog.velcdn.com/images/denver_almighty/post/4b72fee9-dad3-4fa5-96fe-475b6d40240c/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[MongoDB] root(admin) 계정 생성하기]]></title>
            <link>https://velog.io/@denver_almighty/MongoDB-rootadmin-%EA%B3%84%EC%A0%95-%EC%83%9D%EC%84%B1%ED%95%98%EA%B8%B0</link>
            <guid>https://velog.io/@denver_almighty/MongoDB-rootadmin-%EA%B3%84%EC%A0%95-%EC%83%9D%EC%84%B1%ED%95%98%EA%B8%B0</guid>
            <pubDate>Sat, 26 Nov 2022 07:38:03 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS t2.xlarge
OS : Redhat 8.6
MongoDB Version : 6.0.3</p>
</blockquote>
</br> 

<h1 id="1-계정-생성">1. 계정 생성</h1>
<pre><code class="language-bash"># mongodb 실행
mongosh

# root권한가진 계정생성
db.createUser({user:&quot;USERNAME&quot;, pwd:&quot;PW&quot;, roles:[&quot;root&quot;]})

#로그인
mongosh admin -u USERNAME -p PW</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/a4773bbc-d5d3-4b52-9cc1-ea5755b41a04/image.png" alt="">
<img src="https://velog.velcdn.com/images/denver_almighty/post/5c49cd28-01a0-44f6-af1e-f6000111a53a/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[MongoDB] Redhat8에 MongoDB 설치하기]]></title>
            <link>https://velog.io/@denver_almighty/MongoDB-Redhat8%EC%97%90-MongoDB-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0</link>
            <guid>https://velog.io/@denver_almighty/MongoDB-Redhat8%EC%97%90-MongoDB-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0</guid>
            <pubDate>Sat, 26 Nov 2022 07:29:30 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS t2.xlarge
OS : Redhat 8.6
MongoDB Version : 6.0.3</p>
</blockquote>
</br>

<h1 id="1-설치하기">1. 설치하기</h1>
<h2 id="1-패키지-관리-시스템-yum-설정">1) 패키지 관리 시스템 (yum) 설정</h2>
<pre><code class="language-bash">vi /etc/yum.repos.d/mongodb-org-6.0.repo

# 아래 내용 입력
[mongodb-org-6.0]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/6.0/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-6.0.asc
</code></pre>
<h2 id="2-mongodb-패키지-설치">2) MongoDB 패키지 설치</h2>
<pre><code>sudo yum install -y mongodb-org
sudo yum install -y mongodb-org-6.0.3 mongodb-org-database-6.0.3 mongodb-org-server-6.0.3 mongodb-mongosh-6.0.3 mongodb-org-mongos-6.0.3 mongodb-org-tools-6.0.3

# 의도치 않은 업그레이드 방지(yum 업그레이드 시 패키지 업그레이드 방지)
vi /etc/yum.conf
# 아래 내용 추가
exclude=mongodb-org,mongodb-org-database,mongodb-org-server,mongodb-mongosh,mongodb-org-mongos,mongodb-org-tools</code></pre><h2 id="3-설정">3) 설정</h2>
<h3 id="3-1-ulimit-설정">3-1) ulimit 설정</h3>
<pre><code>Starting with MongoDB 4.4, a startup error is generated if the ulimit open-files value is below 64000.
On Redhat 8 the ulimit command is sufficient to configure the maximum process value, so no separate nproc setting is needed.</code></pre><h3 id="3-2-디렉토리-설정">3-2) 디렉토리 설정</h3>
<p>기본 디렉토리
데이터 : /var/lib/mongo
로그 : /var/log/mongodb</p>
<pre><code># 새 디렉토리 생성 (/my/mongodb/dir/)
mkdir /my/mongodb/dir/
vi /etc/mongod.conf
storage.dbPath=/my/mongodb/dir
systemLog.path=/my/mongodb/dir/mongod.log
sudo chown -R mongod:mongod /my/mongodb/dir/</code></pre><h3 id="3-3-selinux-구성">3-3) SELinux 구성</h3>
<pre><code>sudo yum install git make checkpolicy policycoreutils selinux-policy-devel
git clone https://github.com/mongodb/mongodb-selinux
cd mongodb-selinux
make
sudo make install</code></pre><h3 id="3-4-mongodconf-수정">3-4) mongod.conf 수정</h3>
<pre><code>vi /etc/mongod.conf</code></pre><p>아랫부분 net에 bindIp, security에 authorization 수정</p>
<ul>
<li>bindIp: IPs allowed to connect</li>
<li>authorization : require account authentication on connection</li>
</ul>
<pre><code># mongod.conf
# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

# Where and how to store data.
storage:
  dbPath: /var/lib/mongo
  journal:
    enabled: true
#  engine:
#  wiredTiger:

# how the process runs
processManagement:
  fork: true  # fork and run in background
  pidFilePath: /var/run/mongodb/mongod.pid  # location of pidfile
  timeZoneInfo: /usr/share/zoneinfo

# network interfaces
net:
  port: 27017
  bindIp : 0.0.0.0

security:
  authorization : enabled

#operationProfiling:

#replication:

#sharding:

## Enterprise-Only Options

#auditLog:

#snmp:</code></pre>
<h2 id="4-실행">4) 실행</h2>
<pre><code>sudo systemctl daemon-reload
sudo systemctl start mongod

mongosh</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/5d8cb263-3f4d-4346-96b1-f009a25b0936/image.png" alt=""></p>
</br>

<h3 id="실행-안될-때">실행 안될 때</h3>
<p><a href="https://velog.io/@denver_almighty/MongoDB-%EC%84%A4%EC%B9%98-%ED%9B%84-%EC%8B%A4%ED%96%89-%EC%95%88%EB%90%A8status14-status100">실행 안될 때(status=14 / status=100)</a></p>
<br>

<h1 id="참고">참고</h1>
<p><a href="https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-red-hat/">Install MongoDB Community Edition on Red Hat or CentOS</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[MongoDB] 설치 후 실행 안됨(status=14 , status=100)]]></title>
            <link>https://velog.io/@denver_almighty/MongoDB-%EC%84%A4%EC%B9%98-%ED%9B%84-%EC%8B%A4%ED%96%89-%EC%95%88%EB%90%A8status14-status100</link>
            <guid>https://velog.io/@denver_almighty/MongoDB-%EC%84%A4%EC%B9%98-%ED%9B%84-%EC%8B%A4%ED%96%89-%EC%95%88%EB%90%A8status14-status100</guid>
            <pubDate>Sat, 26 Nov 2022 07:24:12 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS t2.xlarge
OS : Redhat 8.6
MongoDB Version : 6.0.3</p>
</blockquote>
<h1 id="1-error-확인">1. Error 확인</h1>
<pre><code># check the service status
systemctl status mongod
# check the log
tail -50 /var/log/mongodb/mongod.log | grep error</code></pre><h2 id="1-status14">1) status=14</h2>
<blockquote>
<p>service status
(code=exited, status=14)</p>
</blockquote>
<blockquote>
<p>mongod.log
{&quot;t&quot;:{&quot;$date&quot;:&quot;2022-11-26T06:41:23.568+00:00&quot;},&quot;s&quot;:&quot;E&quot;,  &quot;c&quot;:&quot;NETWORK&quot;,  &quot;id&quot;:23024,   &quot;ctx&quot;:&quot;initandlisten&quot;,&quot;msg&quot;:&quot;Failed to unlink socket file&quot;,&quot;attr&quot;:{&quot;path&quot;:&quot;/tmp/mongodb-27017.sock&quot;,&quot;error&quot;:&quot;Operation not permitted&quot;}}</p>
</blockquote>
<h2 id="해결-방법">해결 방법</h2>
<pre><code class="language-bash">rm /tmp/mongodb-27017.sock
systemctl start mongod</code></pre>
<h2 id="2-status100">2) status=100</h2>
<blockquote>
<p>service status
ExecStart=/usr/bin/mongod $OPTIONS (code=exited, status=100)</p>
</blockquote>
<blockquote>
<p>mongod.log
{&quot;t&quot;:{&quot;$date&quot;:&quot;2022-11-26T06:51:14.023+00:00&quot;},&quot;s&quot;:&quot;E&quot;,  &quot;c&quot;:&quot;CONTROL&quot;,  &quot;id&quot;:20557,   &quot;ctx&quot;:&quot;initandlisten&quot;,&quot;msg&quot;:&quot;DBException in initAndListen, terminating&quot;,&quot;attr&quot;:{&quot;error&quot;:&quot;IllegalOperation: Attempted to create a lock file on a read-only directory: /var/lib/mongo&quot;}}</p>
</blockquote>
<h2 id="해결-방법-1">해결 방법</h2>
<pre><code class="language-bash">chown -R mongod:mongod /var/lib/mongo</code></pre>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Kafka] Podman compose으로 Kafka 실행]]></title>
            <link>https://velog.io/@denver_almighty/Kafka-Podman-compose%EC%9C%BC%EB%A1%9C-Kafka-%EC%8B%A4%ED%96%89</link>
            <guid>https://velog.io/@denver_almighty/Kafka-Podman-compose%EC%9C%BC%EB%A1%9C-Kafka-%EC%8B%A4%ED%96%89</guid>
            <pubDate>Sun, 20 Nov 2022 12:22:13 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS t2.xlarge
OS : Redhat 8.6
Kafka : 3.3
Zookeeper : 3.8</p>
</blockquote>
</br>

<h1 id="1-실행하기">1. 실행하기</h1>
<pre><code class="language-bash"># kafka docker-compose.yml 다운로드
curl -sSL https://raw.githubusercontent.com/bitnami/containers/main/bitnami/kafka/docker-compose.yml &gt; docker-compose.yml
# rename for podman-compose
mv docker-compose.yml podman-compose.yml

# podman compose 실행
podman-compose up</code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/7bfcec2a-b859-4ed9-a935-9a1495ed2cee/image.png" alt=""></p>
</br>

<h1 id="참고">참고</h1>
<p><a href="https://hub.docker.com/r/bitnami/kafka">Docker Hub - bitnami/kafka</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Podman] RHEL8에 Podman 설치하기 ( + Podman compose)]]></title>
            <link>https://velog.io/@denver_almighty/Podman-RHEL8%EC%97%90-Podman-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0</link>
            <guid>https://velog.io/@denver_almighty/Podman-RHEL8%EC%97%90-Podman-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0</guid>
            <pubDate>Sun, 20 Nov 2022 12:02:58 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS t2.xlarge
OS : Redhat 8.6
Podman : 4.2.0</p>
</blockquote>
</br>

<h1 id="1-redhat8에서-docker">1. Redhat8에서 Docker</h1>
<p>Installing Docker on RedHat 8 produces the following error.</p>
<blockquote>
<p>Errors during downloading metadata for repository &#39;docker-ce-stable&#39;</p>
</blockquote>
<p>The RHEL installation page in the Docker Docs says
&quot;Docker is currently only supported on Red Hat on s390x (IBM Z).&quot;
The <a href="https://access.redhat.com/discussions/6249651">Redhat Customer Portal</a> suggests
either installing the CentOS packages, or removing Podman and Buildah (which can conflict) before installing; the CentOS packages do install successfully.
Since I have already used Docker, this time I will try Kafka and Airflow with Podman.</p>
</br>

<h1 id="2-podman-설치하기">2. Podman 설치하기</h1>
<pre><code class="language-bash"># Podman 설치 (RHEL8)
sudo yum module enable -y container-tools:rhel8
sudo yum module install -y container-tools:rhel8</code></pre>
<pre><code class="language-bash"># podman-compose 설치
# Python3 설치되어있어야함(pip3)
pip3 install podman-compose --user

# 설치 확인
podman-compose --version
</code></pre>
</br>

<h1 id="참고">참고</h1>
<p><a href="https://podman.io/getting-started/installation">Podman Installation Instructions</a></p>
<p><a href="https://github.com/containers/podman-compose">Github - Podman Compose</a></p>
<p><a href="https://docs.podman.io/en/latest/markdown/podman-rm.1.html">Podman 명령어</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[OS] AWS EC2 root 비밀번호 생성]]></title>
            <link>https://velog.io/@denver_almighty/OS-AWS-EC2-root-%EB%B9%84%EB%B0%80%EB%B2%88%ED%98%B8-%EC%83%9D%EC%84%B1</link>
            <guid>https://velog.io/@denver_almighty/OS-AWS-EC2-root-%EB%B9%84%EB%B0%80%EB%B2%88%ED%98%B8-%EC%83%9D%EC%84%B1</guid>
            <pubDate>Sun, 20 Nov 2022 11:02:14 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS t2.xlarge
OS : Redhat 8.6</p>
</blockquote>
</br>

<h1 id="1-비밀번호-생성">1. 비밀번호 생성</h1>
<p>While running mysqld,
systemctl start/stop mysql asks for the root password.
<img src="https://velog.velcdn.com/images/denver_almighty/post/a24bad4a-00cf-45bf-a21f-c41a49e17db1/image.png" alt=""></p>
<p>I had only ever connected to the EC2 instance over SSH with a key, so I did not know the root password; as with any other account, it can be created with the command below.</p>
<pre><code class="language-bash">sudo passwd root </code></pre>
<p><img src="https://velog.velcdn.com/images/denver_almighty/post/256ad16f-9eab-4395-944f-8150bbe9448b/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[MySQL] Redhat8에 MySQL 설치하기]]></title>
            <link>https://velog.io/@denver_almighty/MySQL-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0</link>
            <guid>https://velog.io/@denver_almighty/MySQL-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0</guid>
            <pubDate>Sun, 20 Nov 2022 10:29:48 GMT</pubDate>
            <description><![CDATA[<h1 id="0-실행-환경">0. 실행 환경</h1>
<blockquote>
<p>AWS t2.xlarge
OS : Redhat 8.6
MySQL Version : 8.0.31</p>
</blockquote>
</br>

<h1 id="1-설치하기">1. 설치하기</h1>
<h2 id="1-mysql-다운로드">1) MySQL 다운로드</h2>
<pre><code class="language-bash"># 다운로드 
wget https://repo.mysql.com//mysql80-community-release-el8-4.noarch.rpm

# add the MySQL repository to YUM
sudo yum install mysql80-community-release-el8-4.noarch.rpm

# MySQL 설치
sudo yum install mysql-community-server</code></pre>
<h2 id="mysql-실행">MySQL 실행</h2>
<pre><code># start MySQL (if you run the command as non-root without sudo, you must enter the OS root password)
sudo systemctl start mysqld

# find the temporary password
sudo grep &#39;temporary password&#39; /var/log/mysqld.log

# connect to MySQL
mysql -uroot -p
-&gt; enter the temporary password</code></pre><h2 id="root-비밀번호-변경">root 비밀번호 변경</h2>
<pre><code class="language-sql">-- the password must contain at least one uppercase letter, one lowercase letter, one digit, and one special character, and be at least 8 characters long in total
mysql&gt; ALTER USER &#39;root&#39;@&#39;localhost&#39; IDENTIFIED BY &#39;MyNewPass4!&#39;;
</code></pre>
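Assuming the default validate_password MEDIUM policy described in the comment above, a candidate password can be pre-checked with a small Python sketch; this approximates, and does not replace, MySQL's actual validator:

```python
import re

def satisfies_policy(pw: str) -> bool:
    """Approximate MySQL's MEDIUM password policy: length >= 8 with
    at least one lowercase, one uppercase, one digit, one special char."""
    return (
        len(pw) >= 8
        and re.search(r"[a-z]", pw) is not None
        and re.search(r"[A-Z]", pw) is not None
        and re.search(r"[0-9]", pw) is not None
        and re.search(r"[^A-Za-z0-9]", pw) is not None
    )

print(satisfies_policy("MyNewPass4!"))  # True
print(satisfies_policy("password"))     # False
```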
</br>

<h1 id="참고">참고</h1>
<p>MySQL Download
<a href="https://dev.mysql.com/downloads/repo/yum/">https://dev.mysql.com/downloads/repo/yum/</a></p>
<p>MySQL Docs - Installing MySQL on Linux Using the MySQL Yum Repository
<a href="https://dev.mysql.com/doc/refman/8.0/en/linux-installation-yum-repo.html">https://dev.mysql.com/doc/refman/8.0/en/linux-installation-yum-repo.html</a></p>
]]></description>
        </item>
    </channel>
</rss>