data-traveler

2222

Sun, 17 Apr 2022 08:49:36 GMT

Adding a batch job

"keyword_matching_init.py"

import os

if __name__ == "__main__":
    data_path = "/home/ubuntu/src/script/curation/data"
    ...
    ...
    file_list = os.listdir(data_path)

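The code above drives the batch job: it reads the names of the files located under "data_path" and runs the batch based on them.

When the batch job fails, the usual cause is a file in data_path whose extension is not csv or xlsx. Hidden files may also be present, so check the directory directly from the command line (for example with ls -a).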
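
The same check can be scripted. A minimal sketch, assuming the batch accepts only .csv and .xlsx files; the helper name find_unexpected_files is hypothetical and not part of keyword_matching_init.py. os.listdir also returns hidden dot-files, so they are reported here as well.

```python
import os

ALLOWED_EXTENSIONS = {".csv", ".xlsx"}  # assumption: the only formats the batch accepts


def find_unexpected_files(data_path):
    """Return the names in data_path that the batch would likely fail on."""
    unexpected = []
    for name in sorted(os.listdir(data_path)):
        ext = os.path.splitext(name)[1].lower()
        # Hidden files are flagged even if they happen to end in .csv or .xlsx
        if name.startswith(".") or ext not in ALLOWED_EXTENSIONS:
            unexpected.append(name)
    return unexpected
```

Calling find_unexpected_files(data_path) before the batch starts gives a list that should be empty; subdirectories are reported too, since they have no allowed extension.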

Status of automatic grading for subjective (free-response) questions

Number of courses: 240, number of questions: 2820
Accuracy: 99.03 % (total graded: 34161 / graded correctly: 33832)
As of 2021-07-14
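
The accuracy figure follows directly from the two counts. A quick check using only the numbers quoted above; that the reported value was truncated rather than rounded is an assumption inferred from the arithmetic.

```python
import math

total_graded = 34161      # total number of graded answers (figure quoted above)
correctly_graded = 33832  # answers graded correctly (figure quoted above)

accuracy = correctly_graded / total_graded * 100

# Truncating to two decimals reproduces the reported 99.03 %;
# rounding to the nearest hundredth would give 99.04 % instead.
truncated = math.floor(accuracy * 100) / 100
rounded = round(accuracy, 2)

print(f"truncated: {truncated:.2f} %, rounded: {rounded:.2f} %")
```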

These figures can be checked by going to imply.hunet.co.kr, connecting via SQL, and querying the

subjective_question_verification

table.

-- number graded correctly
select count(*) from "subjective_question_verification"
where "__time" >= '2021-07-14'
and (correct_answer_rate < 10
or correct_answer_rate > 90)

-- total number graded
select count(*) from "subjective_question_verification"
where "__time" >= '2021-07-14'

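Caution

The reference values for the records that were written to the subjective_question_verification table early on have since changed, so the accurate approach is to apply a date condition and add the new counts to the totals that were already computed.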

3. GPT-3 Pricing

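When you first sign up for GPT-3, your account is in Trial (free trial period) status, which lets you use 300,000 tokens over three months. Once the three months are over, or once you use up the 300,000 tokens before then, you can continue to use GPT-3 models under the paid pricing policy.

One token is said to be roughly four English characters, so it is convenient to think of it as about one short word.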
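
For budgeting against the trial quota, the four-characters rule of thumb can be turned into a rough estimator. This is a sketch only: it is not the real tokenizer, and the function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the text, not the actual tokenizer


def estimate_tokens(text):
    """Rough token estimate for English text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def trial_tokens_left(used_tokens, trial_quota=300_000):
    """Tokens remaining in the trial quota mentioned above."""
    return max(trial_quota - used_tokens, 0)


prompt = "This is a test"
print(estimate_tokens(prompt))
```

Actual usage is billed on the tokenizer's real counts, so treat the estimate as an order-of-magnitude guide.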

4. GPT-3 Tutorial

At this point everything needed to use GPT-3 is in place. Next, let's install the library and see how to use GPT-3 from Python.

First, install the openai library in a virtual environment.

$ conda create -n gpt python=3.7
$ conda activate gpt
(gpt) $ pip install openai

Once the installation is complete, you can test it right away with a short piece of code.

import openai

openai.api_key = "SECRET_API_KEY"

prompt = "This is a test"

response = openai.Completion.create(engine="davinci", prompt=prompt, max_tokens=10)

print(response.choices[0].text)

------------------------------------------------------------------------------------
' of whether the programming works correctly.\n\nHere'

5. Teaching the model with examples

GPT-3 is a model whose performance improves when it is shown a few examples through few-shot learning. To use it more efficiently, create a GPT class and an Example class as shown below.

(Source: https://github.com/shreyashankar/gpt3-sandbox)

"""Creates the Example and GPT classes for a user to interface with the OpenAI
API."""

import openai
import uuid

def set_openai_key(key):
    """Sets OpenAI key."""
    openai.api_key = key

class Example:
    """Stores an input, output pair and formats it to prime the model."""
    def __init__(self, inp, out):
        self.input = inp
        self.output = out
        self.id = uuid.uuid4().hex

    def get_input(self):
        """Returns the input of the example."""
        return self.input

    def get_output(self):
        """Returns the intended output of the example."""
        return self.output

    def get_id(self):
        """Returns the unique ID of the example."""
        return self.id

    def as_dict(self):
        return {
            "input": self.get_input(),
            "output": self.get_output(),
            "id": self.get_id(),
        }

class GPT:
    """The main class for a user to interface with the OpenAI API.

    A user can add examples and set parameters of the API request.
    """
    def __init__(self,
                 engine='davinci',
                 temperature=0.5,
                 max_tokens=100,
                 input_prefix="input: ",
                 input_suffix="\n",
                 output_prefix="output: ",
                 output_suffix="\n\n",
                 append_output_prefix_to_query=False):
        self.examples = {}
        self.engine = engine
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.input_prefix = input_prefix
        self.input_suffix = input_suffix
        self.output_prefix = output_prefix
        self.output_suffix = output_suffix
        self.append_output_prefix_to_query = append_output_prefix_to_query
        self.stop = (output_suffix + input_prefix).strip()

    def add_example(self, ex):
        """Adds an example to the object.

        Example must be an instance of the Example class.
        """
        assert isinstance(ex, Example), "Please create an Example object."
        self.examples[ex.get_id()] = ex

    def delete_example(self, id):
        """Delete example with the specific id."""
        if id in self.examples:
            del self.examples[id]

    def get_example(self, id):
        """Get a single example."""
        return self.examples.get(id, None)

    def get_all_examples(self):
        """Returns all examples as a dict mapping each id to its example dict."""
        return {k: v.as_dict() for k, v in self.examples.items()}

    def get_prime_text(self):
        """Formats all examples to prime the model."""
        return "".join(
            [self.format_example(ex) for ex in self.examples.values()])

    def get_engine(self):
        """Returns the engine specified for the API."""
        return self.engine

    def get_temperature(self):
        """Returns the temperature specified for the API."""
        return self.temperature

    def get_max_tokens(self):
        """Returns the max tokens specified for the API."""
        return self.max_tokens

    def craft_query(self, prompt):
        """Creates the query for the API request."""
        q = self.get_prime_text(
        ) + self.input_prefix + prompt + self.input_suffix
        if self.append_output_prefix_to_query:
            q = q + self.output_prefix

        return q

    def submit_request(self, prompt):
        """Calls the OpenAI API with the specified parameters."""
        response = openai.Completion.create(engine=self.get_engine(),
                                            prompt=self.craft_query(prompt),
                                            max_tokens=self.get_max_tokens(),
                                            temperature=self.get_temperature(),
                                            top_p=1,
                                            n=1,
                                            stream=False,
                                            stop=self.stop)
        return response

    def get_top_reply(self, prompt):
        """Obtains the best result as returned by the API."""
        response = self.submit_request(prompt)
        return response['choices'][0]['text']

    def format_example(self, ex):
        """Formats the input, output pair."""
        return self.input_prefix + ex.get_input(
        ) + self.input_suffix + self.output_prefix + ex.get_output(
        ) + self.output_suffix

Once the gpt.py file has been created, prime the GPT-3 model that generates queries with a set of examples.

from gpt import GPT, Example

gpt = GPT(engine="davinci",
          temperature=0.5,
          max_tokens=100)

gpt.add_example(Example('Fetch unique values of DEPARTMENT from Worker table.',
                        'Select distinct DEPARTMENT from Worker;'))

gpt.add_example(Example('Print the first three characters of FIRST_NAME from Worker table.',
                        'Select substring(FIRST_NAME,1,3) from Worker;'))

gpt.add_example(Example('Find the position of the alphabet ("a") in the first name column "Amitabh" from Worker table.',
                        'Select INSTR(FIRST_NAME, BINARY"a") from Worker where FIRST_NAME = "Amitabh";'))

gpt.add_example(Example('Print the FIRST_NAME from Worker table after replacing "a" with "A".',
                        'Select REPLACE(FIRST_NAME, "a", "A") from Worker;'))

gpt.add_example(Example('Display the second highest salary from the Worker table.',
                        'Select max(Salary) from Worker where Salary not in (Select max(Salary) from Worker);'))

gpt.add_example(Example('Display the highest salary from the Worker table.',
                        'Select max(Salary) from Worker;'))

gpt.add_example(Example('Fetch the count of employees working in the department Admin.',
                        'SELECT COUNT(*) FROM worker WHERE DEPARTMENT = "Admin";'))

gpt.add_example(Example('Get all details of the workers whose SALARY lies between 100000 and 500000.',
                        'Select * from Worker where SALARY between 100000 and 500000;'))

gpt.add_example(Example('Get Salary details of the Workers',
                        'Select Salary from Worker'))

After GPT-3 has been primed with a handful of question-query examples like these, sending the model a question it has not seen makes it generate a query for that question.
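
No model weights change in this step: the examples are simply concatenated in front of the new question. The sketch below mirrors GPT.craft_query with its default prefixes so the resulting prompt can be inspected without calling the API. The function name build_few_shot_prompt and the sample question are hypothetical.

```python
def build_few_shot_prompt(examples, question,
                          input_prefix="input: ", input_suffix="\n",
                          output_prefix="output: ", output_suffix="\n\n"):
    """Assemble the few-shot prompt text without sending it anywhere."""
    primed = "".join(
        input_prefix + inp + input_suffix + output_prefix + out + output_suffix
        for inp, out in examples)
    return primed + input_prefix + question + input_suffix


examples = [
    ("Display the highest salary from the Worker table.",
     "Select max(Salary) from Worker;"),
]
# Hypothetical unseen question, chosen only for illustration
print(build_few_shot_prompt(examples,
                            "Display the lowest salary from the Worker table."))
```

With the classes above, gpt.get_top_reply(question) sends the equivalent text to the API, and the stop sequence "input:" keeps the model from continuing into another invented example.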

test

Sun, 17 Apr 2022 08:48:26 GMT

test environment setting

The environment setup for testing is as follows.

pip install flask_restplus
pip install Werkzeug==0.16.1
pip install celery

# upgrade if the installed version is too old
pip install celery --upgrade

# install the client used to connect to Redis from Python
pip install redis

celery --app celery_worker.celery worker --loglevel=info --pool=gevent --concurrency=10
celery worker -A celery_worker.celery --pool=gevent --concurrency=10

# when installing pycurl
yum install libcurl-devel
yum install gcc

What is Kafka?

Sun, 17 Apr 2022 06:10:21 GMT

1) What is Kafka?

Kafka is a distributed messaging system that implements the publish-subscribe model.

When building a data pipeline, Kafka is probably one of the systems considered most often.
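
The publish-subscribe idea can be illustrated with a toy in-memory broker. This is only a sketch of the model, not the Kafka client API: the class ToyBroker, the topic name, and the messages are all hypothetical, and nothing here is distributed or persistent.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker; not the Kafka client API."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only message log
        self.offsets = defaultdict(int)  # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log."""
        self.topics[topic].append(message)

    def poll(self, subscriber, topic):
        """Return messages the subscriber has not seen yet and advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(subscriber, topic)]
        self.offsets[(subscriber, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("page-views", {"user": 1, "page": "/home"})
broker.publish("page-views", {"user": 2, "page": "/cart"})

# Two independent subscribers each receive the full stream
print(broker.poll("analytics", "page-views"))
print(broker.poll("search-indexer", "page-views"))
```

The publisher never needs to know who reads the topic, which is the decoupling the rest of this post is about.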

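2) Background of Kafka

Kafka is a distributed messaging system developed at LinkedIn and released as open source in 2011.

[Problems with LinkedIn's previous system]

A single service was connected to too many systems. As a result the maintenance burden kept growing, and feature development itself was delayed.

First: real-time transaction processing and asynchronous processing happen at the same time, but with no unified transport layer, complexity inevitably increases.
Second: when different data systems have to be connected for integrated data analysis, integration is difficult if their data formats or processing methods differ. The data in the two systems can also diverge, which lowers trust in it.

[Past]

In the past there was no bus system that could withstand the load of all the events generated by many services. In the previous generation, data was fragmented across the company, which made comprehensive data analysis difficult.

[Present]

Now that the cloud era is fully under way, computing resources are no longer persistent. Kafka is a system that can scale out as the volume of data grows.

Data centralization: Kafka was developed with the goal of serving as the central platform for message delivery and building pipelines that connect not only every data system a company needs (Oracle, NoSQL, Hadoop) but also microservices, SaaS services, and more.

[After LinkedIn adopted Kafka]

All event and data flows generated by internal services are managed centrally. The data that Kafka provides made a variety of analyses possible.

Developers also no longer depend on multiple data systems; they only need to send data to Kafka, so they can focus on their core work.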

[Elasticsearch] Frequently used commands

Mon, 21 Mar 2022 10:38:46 GMT

Check pending tasks

Under normal conditions this returns an empty list
If there are pending tasks, it returns the list of them

GET _cluster/pending_tasks

Check hot threads

Lets you infer the cause of most problems, such as abnormal GC, high CPU usage, or a search backlog

GET _nodes/hot_threads?pretty

GET _nodes/node-01/hot_threads
GET _nodes/node-02/hot_threads
GET _nodes/node-03/hot_threads

Check cluster & node status

GET _cluster/health
GET /_cat/nodes

# curl commands
# without a certificate
curl -i http://x.x.x.x:9202
curl -XGET "x.x.x.x:9202/_cat/nodes?v"
curl http://x.x.x.x:9202/_cluster/health

# with a certificate
curl -i -k -u elastic https://x.x.x.x:9202
curl -k -u elastic -XGET "https://x.x.x.x:9202/_cat/nodes?v"
curl -k -u elastic https://x.x.x.x:9202/_cluster/health?pretty

Check index information

# list all indices
GET _cat/indices?v&s=index

# list specific indices
GET _cat/indices/hunet-app-b2b-2019-*?v&s=index

# filter indices by health
GET _cat/indices?health=yellow

# create an index
PUT travel-log
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "properties": {
      "start_date": {
        "type": "keyword"
      },
      "end_date": {
        "type": "keyword"
      },
      "place": {
        "type": "keyword"
      },
      "word": {
        "type": "keyword"
      }
    }
  }
}

# search data in an index
GET travel-log/_search
{
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "word": "여행" } },
        { "match_phrase": { "word": "제주도" } },   
        { "match_phrase": { "word": "바다" } }
      ]
    }
  }
}
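
When the word list changes often, the search body above can be generated in code. A minimal sketch: the helper name build_should_query is hypothetical, and actually sending the body (for example through an Elasticsearch client) is left out.

```python
import json


def build_should_query(field, phrases):
    """Build a bool/should query that matches any of the given phrases."""
    return {
        "query": {
            "bool": {
                "should": [{"match_phrase": {field: p}} for p in phrases]
            }
        }
    }


body = build_should_query("word", ["여행", "제주도", "바다"])
# ensure_ascii=False keeps the Korean terms readable in the printed JSON
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The same body shape is what the _delete_by_query request takes, so it is worth running it as a search first to see what would be deleted.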

# delete data in an index
POST travel-log/_delete_by_query?wait_for_completion=true
{
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "word": "여행" } },
        { "match_phrase": { "word": "제주도" } },   
        { "match_phrase": { "word": "바다" } }
      ]
    }
  }
}

Check template information

# list all templates
GET _cat/templates?v&s=name

# details of a specific template
GET _template/travel-log-template

# check the default template
GET _template/default

# delete a template
DELETE _template/travel-log-template

# create a template
PUT _template/travel-log-template
{
  "order": 2,
  "index_patterns": [
    "travel-log-*"
  ],
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "mapping": {
            "type": "keyword"
          },
          "match_mapping_type": "string"
        }
      }
    ]
  },
  "aliases": {}
}

# create a template with ILM settings
PUT _template/daily-log-template
{
  "order": 2,
  "index_patterns": [
    "daily-log-*"
  ],
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1",
      "lifecycle": {
        "name": "daily-log-lim",
        "rollover_alias": "daily-log"
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "mapping": {
            "type": "keyword"
          },
          "match_mapping_type": "string"
        }
      }
    ]
  }
}

Check shard allocation and force allocation

GET /_cluster/allocation/explain

POST /_cluster/reroute?retry_failed=true

Snapshots

# check snapshot repositories
GET /_snapshot
GET /_snapshot/*20210612
GET /_snapshot/_all

# check a snapshot and its start/end times
GET _snapshot/all_backup/all_backup_20210513

# snapshot progress
GET _snapshot/all_backup/all_backup_20210513/_status

# delete a snapshot (stops it if it is still running)
DELETE _snapshot/all_backup/all_backup_20210513

# list all snapshots in a repository
GET _snapshot/travel-log-20210714/_all

# restore
POST _snapshot/travel-log-20210714/travel-log-2021.07.28/_restore?wait_for_completion=false
{
  "indices": ["travel-2021.07.28"]
}

# check restore progress
GET _cat/recovery/lms-app-logging-audit-2020.07.28?v

Task

# check running tasks
GET _tasks
GET _cat/tasks?v
GET _tasks?nodes=node-1,node-2
GET _tasks/vIYMDSJ3TGCGFtcu3Btp6w:521843726
GET _cat/tasks?detailed
GET _tasks?actions=*reindex
GET _tasks?actions=*reindex&wait_for_completion=true&timeout=10s

# cancel a task
POST _tasks/vIYMDSJ3TGCGFtcu3Btp6w:521843726/_cancel

Task Management API | Elasticsearch Reference [6.8] | Elastic

Index open & close

POST travel-log/_close
POST travel-log/_open

[Logstash] Integrating Elasticsearch with an RDBMS

Mon, 21 Mar 2022 10:31:25 GMT

Kibana > Stack Management > Logstash Pipelines

input {
    jdbc {
        jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_connection_string => "jdbc:mysql://x.x.x.x:3011/Travel"
        jdbc_user => "id"
        jdbc_password => "pwd"
        statement_filepath => "/home/hunetdb/logstash/config/conf_sql/searchengine_travel.sql"
        schedule => "30 03,12 * * *"
        type => "search-engine-travel"
    }
}

filter {
    if [type] == "search-engine-travel" {
        mutate {
            remove_field => "message"
        }
        ruby {
            init => "require 'time'"
            code => "event.set('indexDate', Time.now.utc.getlocal.strftime('%Y.%m.%d'))"
        }
    }
}

output {
    if [type] == "search-engine-travel" {
        elasticsearch {
            hosts => ["x.x.x.10:9200", "x.x.x.11:9200", "x.x.x.12:9200"]
            index => "search-engine-travel-%{indexDate}"
            document_id => "%{travel_seq}"
            user => "elastic"
            password => "${elastic}"
            ssl => true
            ssl_certificate_verification => false
        }
    }
}

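input : configures which data to collect
- jdbc_driver_library : location of the JDBC driver to use
- jdbc_driver_class : class name of the JDBC driver to use
- jdbc_connection_string : the DB type, the host where the DB runs, and the DB name
  
  ex) jdbc:mysql://x.x.x.x:3011/Travel
  - DB type : mysql
  - DB host : x.x.x.x:3011
  - DB name : Travel
- jdbc_user & jdbc_password : the id and password that have been granted permission on the DB server
- statement_filepath : path to the file that contains the query or procedure call to run against the DB
filter : transforms the collected data
output : sends the collected data
- hosts : hosts of the Elasticsearch cluster that will store the data
- index : which index on the Elasticsearch host to store into (created if it does not exist)
- document_id : sets the document ID explicitly
  - if omitted, Elasticsearch generates a random value
  - if a document with the same ID exists it is updated; otherwise it is inserted
    
    How to keep Elasticsearch synchronized with a relational database using Logstash and JDBC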
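
The insert-or-update behaviour of document_id can be modelled with a plain dictionary. This is a toy sketch, not the Elasticsearch client: the function index_document and the sample rows are hypothetical, with travel_seq standing in for the key column used in the pipeline above.

```python
def index_document(index, doc_id, doc):
    """Toy model of indexing with an explicit document ID.

    Returns "created" for a new ID and "updated" when the ID already exists.
    """
    result = "updated" if doc_id in index else "created"
    index[doc_id] = doc  # the stored document is replaced as a whole, not merged
    return result


index = {}  # stands in for one Elasticsearch index
rows = [
    {"travel_seq": 1, "place": "Jeju"},
    {"travel_seq": 2, "place": "Busan"},
    {"travel_seq": 1, "place": "Jeju Island"},  # same key fetched again on a later run
]
for row in rows:
    print(row["travel_seq"], index_document(index, row["travel_seq"], row))
```

Because the scheduled query can return the same rows on every run, tying document_id to the table's key is what keeps the index from accumulating duplicates.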

[Elasticsearch] Node Start and Stop

Mon, 21 Mar 2022 10:03:15 GMT

When restarting an Elasticsearch node, following the procedure below keeps shards from being reallocated and makes the restart fast.

Shard Allocation Stop

Run the following command so that shards are not reallocated when the node is stopped

# run in Kibana Dev Tools
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}

Sync Flush

Synchronizes the stored segment state between primary and replica shards

# run in Kibana Dev Tools
POST _flush/synced

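Stop Node

When a node is stopped, the Elasticsearch cluster status changes to Yellow
Node start and stop are wrapped in the shell scripts start.sh and stop.sh, so running those files is all it takes to start or stop a node

# go to the Elasticsearch directory
$ cd /home/dohyung/elasticsearch

# stop Elasticsearch
$ sh stop.sh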

Start Work

Carry out the required work

ex) changing the JVM heap size, rolling upgrade, etc.

Start Node

# go to the Elasticsearch directory
$ cd /home/dohyung/elasticsearch

# start Elasticsearch
$ sh start.sh

Check Node

# run in Kibana Dev Tools
GET _cat/nodes

Shard Allocation Start

Run the following command so that unassigned shards can be allocated to the restarted nodes again

# run in Kibana Dev Tools
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

Cluster health check

Wait until the Elasticsearch cluster reaches Green status

# run in Kibana Dev Tools
GET _cat/health
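
The waiting step can be automated with a small polling loop. A sketch under stated assumptions: get_status is any callable you supply that returns the cluster status string (for example a wrapper around GET _cluster/health), and the function name, timeout, and interval are hypothetical choices.

```python
import time


def wait_for_green(get_status, timeout_s=600, interval_s=5, sleep=time.sleep):
    """Poll get_status() until it returns "green" or the timeout elapses.

    Returns True once the status is green, False if the timeout is reached first.
    """
    waited = 0
    while True:
        status = get_status()
        if status == "green":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Fake status source for illustration: yellow twice, then green
statuses = iter(["yellow", "yellow", "green"])
print(wait_for_green(lambda: next(statuses), sleep=lambda s: None))
```

Passing the status source and the sleep function in as arguments keeps the loop testable without a live cluster.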
