diane_at_work.log

Elasticsearch SocketTimeout

Mon, 10 Feb 2025 07:12:15 GMT

Symptom: a 30-second SocketTimeoutException occurs, and the request keeps timing out as it is retried once per node.

Inside the rest-client the query is sent to each node in turn, every attempt hits the socket timeout again, and CPU usage climbs.

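=> Conclusion: socketTimeout and search.timeout have to be tuned together. Keep socketTimeout > search.timeout, and note that a response does not necessarily arrive the moment search.timeout expires. To get a response back promptly, apply a strict limit on the number of documents to examine (terminate_after).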
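To get a feel for why this hurts, here is a rough sketch of the worst-case wait. It assumes (as in the symptom above, not as a statement about every client version) that each node is tried once and every attempt blocks for the full socketTimeout. `TimeoutBudget` and its method names are hypothetical helpers, not part of the Elasticsearch client.

```java
import java.time.Duration;

public class TimeoutBudget {
    /** Worst-case blocking time, assuming every node is tried once and each attempt hits the socket timeout. */
    public static Duration worstCaseWait(int nodeCount, Duration socketTimeout) {
        return socketTimeout.multipliedBy(nodeCount);
    }

    /** True when the client-side socket timeout is longer than the server-side search timeout. */
    public static boolean isConsistent(Duration socketTimeout, Duration searchTimeout) {
        return socketTimeout.compareTo(searchTimeout) > 0;
    }

    public static void main(String[] args) {
        Duration socketTimeout = Duration.ofSeconds(30);
        // Three nodes, 30 s each: the caller can be blocked for 90 s in total.
        System.out.println(worstCaseWait(3, socketTimeout).getSeconds());
        // 30 s socket timeout with a 10 s search timeout satisfies socketTimeout > search.timeout.
        System.out.println(isConsistent(socketTimeout, Duration.ofSeconds(10)));
    }
}
```

With three nodes and the default 30 seconds, a single stuck query can hold a caller thread for about a minute and a half, which fits the CPU and thread pile-up seen here.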

-----------Below is the ChatGPT conversation-----------

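You said: When an exception occurs, elasticsearch-rest-client iterates over the nodes and retries the query, right?

ChatGPT said: That's right! The Elasticsearch Low-Level REST Client (RestClient) retries a failed request against the other registered nodes by default.

🔹 When does it retry on another node?

- On a network failure (IOException)
- When a node is down (connection refused, ConnectException)
- On a socket timeout (SocketTimeoutException)

It does not retry, however, when a query timeout (e.g. search.timeout) occurs.

1️⃣ Node iteration (round robin)
When several nodes are registered, the Low-Level Client automatically sends a failed request to the next node. In other words, it cycles through the nodes round-robin.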

RestClient restClient = RestClient.builder(
    new HttpHost("node1.example.com", 9200, "http"),
    new HttpHost("node2.example.com", 9200, "http"),
    new HttpHost("node3.example.com", 9200, "http")
).build();

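With several nodes registered as above, a request that fails on one node is automatically retried on the next.

2️⃣ Customizing the retry behavior
Query timeouts (search.timeout) are not retried by default, so retrying on them has to be implemented manually.

✅ Detecting node failures with setFailureListener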

RestClient restClient = RestClient.builder(
    new HttpHost("node1.example.com", 9200, "http"),
    new HttpHost("node2.example.com", 9200, "http")
).setFailureListener(new RestClient.FailureListener() {
    @Override
    public void onFailure(Node node) {
        // getHost() is used because the node name is not set when nodes are built from HttpHost
        System.err.println("Node failed: " + node.getHost());
    }
}).build();

✔ Lets you log whenever a specific node fails.

3️⃣ Does the High-Level REST Client behave the same way? Not by itself.

RestHighLevelClient has no node-rotation logic of its own. It uses RestClient internally, though, so the Low-Level REST Client's retry strategy still applies.

RestHighLevelClient client = new RestHighLevelClient(
    RestClient.builder(
        new HttpHost("node1.example.com", 9200, "http"),
        new HttpHost("node2.example.com", 9200, "http")
    )
);

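✔ The Low-Level REST Client's node rotation applies, so the request is retried on another node when a network failure occurs.

4️⃣ A complete retry loop (including query timeouts)
To retry in cases the client does not cover, such as query timeouts, add manual retry logic.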

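import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

// Retries the search up to maxRetries times when the client throws an IOException.
public static SearchResponse executeWithRetry(RestHighLevelClient client, SearchRequest request, int maxRetries) {
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            return client.search(request, RequestOptions.DEFAULT);
        } catch (IOException e) {
            lastFailure = e;
            System.err.println("Request failed (attempt " + attempt + "): " + e.getMessage());
        }
    }
    // Reached only when every attempt failed (or maxRetries < 1).
    throw new RuntimeException("All retries failed", lastFailure);
}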

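✔ Retries up to maxRetries times when an IOException is thrown; which node each attempt goes to is still decided by the client.
✔ It can be extended to retry on a search timeout as well, for example by checking SearchResponse.isTimedOut() on the response.

✅ Summary
- The Elasticsearch Low-Level REST Client retries across nodes by default.
- It retries automatically on network failures (IOException) or when a node is down, but not on a query timeout.
- setFailureListener can be used to detect node failures.
- If you need a complete retry policy that also covers query timeouts, implement the retry manually. 🚀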

You said: What is the default sockettimeout setting in Elasticsearch?

ChatGPT said: The default socketTimeout of the Elasticsearch Low-Level REST Client (RestClient) is 30 seconds (30,000 ms).

🔹 Default values
Option    Default    Description
connectTimeout    1000 ms (1 s)    Maximum time to wait when connecting to the server
socketTimeout    30000 ms (30 s)    Maximum time to wait for a response after sending the request

🔹 How to change socketTimeout
```java
RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http"))
    .setRequestConfigCallback(requestConfigBuilder ->
        requestConfigBuilder
            .setSocketTimeout(60000)   // 60 seconds
            .setConnectTimeout(5000)   // 5-second connect timeout
    )
    .build();
```
✔ setSocketTimeout(60000) → waits up to 60 seconds for a response
✔ setConnectTimeout(5000) → fails if the connection is not established within 5 seconds

In short, socketTimeout defaults to 30 seconds and can be adjusted as needed. 🚀

You said:
How do sockettimeout and the query timeout differ?
ChatGPT said:
🔹 socketTimeout vs. query timeout (search timeout)
Option    Meaning
socketTimeout    Network-level timeout. The maximum time to wait for the server's response after sending the request.
Query timeout (search.timeout)    Time limit for the query running inside Elasticsearch. If it does not finish in time, partial results are returned or the request fails.
🔹 1. socketTimeout (network-level timeout)
The maximum time to wait for the Elasticsearch server's response after sending the request
Applies when the network is slow or the server is late to respond
Default: 30 seconds (30,000 ms)
How to set it:
```java
RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http"))
    .setRequestConfigCallback(requestConfigBuilder ->
        requestConfigBuilder
            .setSocketTimeout(60000)  // 60 seconds
            .setConnectTimeout(5000)  // 5-second connect timeout
    )
    .build();
```
✅ If socketTimeout is exceeded, a SocketTimeoutException is thrown and the client receives no response.

🔹 2. search.timeout (query execution timeout)
Time limit for the query running inside Elasticsearch
The query may be stopped if it does not complete within the given time
Not set by default (the query can run indefinitely)
How to set it:
```java
SearchRequest searchRequest = new SearchRequest("index_name");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
    .query(QueryBuilders.matchAllQuery())
    .timeout(TimeValue.timeValueSeconds(5)); // 5-second limit
searchRequest.source(searchSourceBuilder);
```

✅ If the query exceeds search.timeout, it may fail or return partial results.
✅ Even when search.timeout is exceeded, the client may still be waiting for the response within socketTimeout.

🔹 Summary of differences
Item    socketTimeout    search.timeout
Applies to    Network communication between client and server    The query running inside Elasticsearch
Default    30 seconds    Unlimited unless set
When it applies    While the client waits for a response    Internally, while the query runs
When exceeded    SocketTimeoutException (no response)    Partial results or failure

✔ Conclusion:

socketTimeout is how long the client waits for the server's response, and search.timeout is the maximum time Elasticsearch spends running the query. 🚀 Tuning the two together gives more stable timeout handling for Elasticsearch queries!

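You said: search.timeout has been exceeded, so why does the client still wait for a response? If search.timeout is exceeded within the socketTimeout window, shouldn't I get an error message?

ChatGPT said: Good question! The client keeps waiting even after search.timeout is exceeded because of how Elasticsearch works.

🔹 How search.timeout and socketTimeout relate
- search.timeout is the time limit for the query running inside Elasticsearch. When it is exceeded, the query may be stopped, partial results may be returned, or in some cases an error may be returned.
- socketTimeout is the maximum time the client waits for the server's response. When it is exceeded, the client throws a SocketTimeoutException.
In other words, exceeding search.timeout does not mean the client gets a response right away; it depends on how Elasticsearch handles the response.

🔹 Why the client keeps waiting after search.timeout is exceeded

Elasticsearch treats search.timeout as a best-effort limit, not a hard stop
- If a search exceeds search.timeout, Elasticsearch may not abort the request completely.
- Instead, it may return partial results, or return an error after some additional internal work.
- The client may be waiting during that time.

Elasticsearch keeps working after search.timeout
- Some complex queries (e.g. aggregations, scroll queries) may still try to collect partial data after search.timeout is exceeded.
- If the response is delayed as a result, the client keeps waiting within socketTimeout.

Elasticsearch returns an error response when search.timeout is exceeded
- If search.timeout is exceeded and Elasticsearch returns an error right away, the client receives the response immediately.
- In that case the client gets the response within socketTimeout, so there is no problem.

Example response (illustrative; a timed-out search commonly comes back as a normal response with "timed_out": true and partial hits instead):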

{
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "caused_by": {
      "type": "timeout_exception",
      "reason": "timed out after 5s"
    }
  },
  "status": 504
}

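When an error like this is returned, the client receives it within socketTimeout and can handle it.

Elasticsearch finishes internal work after search.timeout is exceeded
- In some cases, if data transfer between nodes or reduce work remains, the server may finish assembling the final response before returning it, even though search.timeout has been exceeded.
- The client keeps waiting within socketTimeout during that time.

🔹 How to get a response promptly when search.timeout is exceeded

To apply the limit strictly, add the terminate_after option

SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
    .query(QueryBuilders.matchAllQuery())
    .timeout(TimeValue.timeValueSeconds(5))  // 5-second search timeout
    .terminateAfter(100); // stop after collecting at most 100 documents per shard

- terminate_after makes each shard stop the query early once it has collected the configured number of documents.
- Used together with search.timeout, it gives much tighter control over how long a search runs.

Take care not to set socketTimeout shorter than search.timeout
- For example, with search.timeout = 10 s and socketTimeout = 5 s, the client times out first.
- In general, socketTimeout should be longer than search.timeout.

🔹 Conclusion
- Exceeding search.timeout does not mean Elasticsearch returns a response immediately.
- As long as socketTimeout has not been exceeded, the client may keep waiting for the response.
- To get a response promptly once search.timeout is exceeded, add an early-termination option such as terminate_after. 🚀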

You said: In a complex system, roughly how many seconds would socketTimeout be set to?

ChatGPT said: There are many factors to consider when setting socketTimeout in a complex system, but the general guidelines are as follows.

🔹 1. General socketTimeout guidelines
System type    Recommended socketTimeout    Notes
Ordinary search requests    30~60 s    Suitable for basic queries
Large data retrieval (scroll, aggregation)    60~120 s    When large data volumes make responses slow
ML-based analysis, complex statistical queries    120~300 s    Stable for heavy computation
Back-office batch systems (nightly batch jobs)    300~600 s    Batch work, so set generously
Real-time APIs (mobile, web, etc.)    1~5 s    Fast responses needed for user experience

🔹 2. Factors to consider when setting socketTimeout

✅ (1) Measure the average query response time
Use the Elasticsearch slow log to analyze how long queries take.
Example (configured as index settings):

index.search.slowlog.threshold.query.warn: 5s
index.search.slowlog.threshold.query.info: 3s
index.search.slowlog.threshold.query.debug: 1s

If the actual average response time is about 2 seconds, a socketTimeout of about 5 seconds is reasonable.

✅ (2) Consider network conditions and load
In a multi-node environment, include network latency in the setting.
If the network is unreliable, it is safer to leave some headroom.

✅ (3) Analyze the query characteristics
Simple searches (match, term) → socketTimeout 5~30 s
Complex aggregations → socketTimeout 60~120 s
Deep pagination (using scroll) → socketTimeout 120~300 s

✅ (4) Set it in combination with search.timeout
search.timeout (the query execution limit) and socketTimeout (how long the client waits) have to be aligned.
For example, with search.timeout = 10 s, set socketTimeout to about 15 s.

```java
// search.timeout = 10 seconds
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
    .query(QueryBuilders.matchAllQuery())
    .timeout(TimeValue.timeValueSeconds(10));

// socketTimeout = 15 seconds
RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http"))
    .setRequestConfigCallback(requestConfigBuilder ->
        requestConfigBuilder.setSocketTimeout(15000) // 15 seconds
    )
    .build();
```
🔹 3. Realistic socketTimeout examples
🔹 A. Real-time API (user search)
Requests must finish within 1~2 seconds
search.timeout = 2 s, socketTimeout = 3~5 s
```java
searchSourceBuilder.timeout(TimeValue.timeValueSeconds(2));
requestConfigBuilder.setSocketTimeout(5000);
```

🔹 B. Large data retrieval (dashboards, analysis)
When using aggregations or scroll
search.timeout = 60 s, socketTimeout = 90~120 s

searchSourceBuilder.timeout(TimeValue.timeValueSeconds(60));
requestConfigBuilder.setSocketTimeout(120000);

🔹 C. Batch jobs (nightly analysis, back office)
Bulk processing with time to spare
search.timeout = 300 s, socketTimeout = 600 s

searchSourceBuilder.timeout(TimeValue.timeValueMinutes(5));
requestConfigBuilder.setSocketTimeout(600000);

🔹 4. Conclusion: choosing a suitable socketTimeout
- Real-time APIs → 1~5 s (fast responses needed)
- Ordinary search requests → 30~60 s (keep the default)
- Dashboards and aggregations → 60~120 s
- Batch analysis jobs → 300~600 s (leave enough processing time)
🚀 In short, measure the average response time and choose socketTimeout in combination with search.timeout!

Handling the change in Elasticsearch's number_of_shards default

Tue, 17 Jan 2023 07:17:35 GMT

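What changed in the default

Up to 6.x, an index created without number_of_shards got 5 primary shards by default; from 7.x the default is 1.

Creating a template

Indices that are created automatically would from now on get a single shard, so to prevent that, make them keep using the previous value of 5.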

PUT _template/basic_shard_setting
{
  "index_patterns": ["name1-*","name2-*","name3-*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Template order

Looking up the basic_shard_setting template created above shows that its order is the default, 0.

But what if name1-*, name2-*, or name3-* already has a template with order 0 that also sets number_of_shards? Which one takes priority?

Existing template

PUT _template/name1_template
{
  "index_patterns": [
    "name1*"
  ],
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}

Create an index and check

PUT name1-test
GET name1-test
{
  "name1-test" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "5",
        "provided_name" : "name1-test",
        "creation_date" : "1673939210866",
        "number_of_replicas" : "1",
        "uuid" : "zXT3F8VbTSm5yGzcHU52dg",
        "version" : {
          "created" : "7170799"
        }
      }
    }
  }
}

basic_shard_setting was applied. Wait... how did it decide to apply that one?

Update name1_template and check again

PUT _template/name1_template
{
  "index_patterns": [
    "name1*"
  ],
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}

DELETE name1-test
PUT name1-test
GET name1-test


{
  "name1-test" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "3",
        "provided_name" : "name1-test",
        "creation_date" : "1673939428414",
        "number_of_replicas" : "1",
        "uuid" : "6ZsbX6MaTvmNb_9RGc0FRw",
        "version" : {
          "created" : "7170799"
        }
      }
    }
  }
}

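So when the template order is the same, the template created (or updated) later took precedence here. We plan to set number_of_shards only in basic_shard_setting, but when both templates define it, whichever one is updated last is the one that applies, so take care. Elasticsearch documents the merge order of matching templates with the same order value as non-deterministic, so giving the templates distinct order values is the safer fix.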
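The order rule itself can be sketched as follows: matching templates are applied from the lowest order to the highest, and higher-order settings overwrite lower-order ones. This is a simplified model for illustration only. `TemplateMerge` and `Template` are hypothetical names, and the tie-breaking for equal order values in this sketch (list order) is an artifact of the sketch, not Elasticsearch behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TemplateMerge {
    /** Simplified stand-in for a legacy index template (hypothetical type, not an Elasticsearch class). */
    public static final class Template {
        final String name;
        final int order;
        final Map<String, String> settings;

        public Template(String name, int order, Map<String, String> settings) {
            this.name = name;
            this.order = order;
            this.settings = settings;
        }
    }

    /** Applies templates from lowest to highest order; higher-order values overwrite earlier ones. */
    public static Map<String, String> merge(List<Template> matching) {
        List<Template> sorted = new ArrayList<>(matching);
        sorted.sort(Comparator.comparingInt((Template t) -> t.order));
        Map<String, String> merged = new LinkedHashMap<>();
        for (Template t : sorted) {
            merged.putAll(t.settings);
        }
        return merged;
    }

    public static void main(String[] args) {
        Template basic = new Template("basic_shard_setting", 0,
                Collections.singletonMap("number_of_shards", "5"));
        // Assumption for illustration: name1_template is given order 1 so that it always wins.
        Template name1 = new Template("name1_template", 1,
                Collections.singletonMap("number_of_shards", "3"));
        System.out.println(merge(Arrays.asList(basic, name1)).get("number_of_shards"));
    }
}
```

With distinct order values the result no longer depends on which template was saved last.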

Deleting Kafka data

Thu, 31 Mar 2022 01:30:00 GMT

Lower log.retention in the Kafka settings and restart the brokers one at a time

log.retention.hours=1
log.retention.ms=1000

Elasticsearch reindex: when reindexing several indices at once, the index names have to change

Wed, 29 Sep 2021 05:44:12 GMT

Reindexing while renaming the indices

Written as below, every index moves over to the new cluster as its own (renamed) index.

log-test-1 -> log-test2-1
log-test-2 -> log-test2-2

POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "remote": {
      "host": "http://ip:port",
      "username": "elastic",
      "password": "elastic"
    },
    "index": "log-test-*"
  },
  "dest": {
    "index": "log-test2",
    "op_type": "create"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'log-test2-' + (ctx._index.substring('log-test-'.length(), ctx._index.length()))"
  }
}

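The official ES docs give the guidance above, but it is confusing whether ctx._index refers to the remote (source) index or the dest index. Reading the script, the ctx._index on the right-hand side is the source document's index name (its 'log-test-' prefix is stripped), and the assignment sets the destination index.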
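Painless string methods follow Java's String API, so the rename expression can be tried out in plain Java. The sketch below only mirrors the string arithmetic of the script; `ReindexRename` and `rename` are hypothetical names, and nothing here talks to Elasticsearch.

```java
public class ReindexRename {
    /** Mirrors the Painless expression: strip the source prefix, then prepend the destination prefix. */
    public static String rename(String sourceIndex, String sourcePrefix, String destPrefix) {
        return destPrefix + sourceIndex.substring(sourcePrefix.length(), sourceIndex.length());
    }

    public static void main(String[] args) {
        System.out.println(rename("log-test-1", "log-test-", "log-test2-"));
        System.out.println(rename("log-test-2", "log-test-", "log-test2-"));
        // Pitfall: without the trailing hyphen in the destination prefix the suffix is glued on directly.
        System.out.println(rename("log-test-1", "log-test-", "log-test2"));
    }
}
```

The last call shows why the destination prefix needs its trailing hyphen: the stripped source prefix already took the hyphen with it.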

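Indices cannot be moved in one request while keeping their names

When the goal is to keep the index names, as below, everything instead ends up in the single index named in dest.index (log-test), contrary to the intent.

log-test-1 -> log-test
log-test-2 -> log-test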

POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "remote": {
      "host": "http://ip:port",
      "username": "elastic",
      "password": "elastic"
    },
    "index": "log-test-*"
  },
  "dest": {
    "index": "log-test",
    "op_type": "create"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'log-test-' + (ctx._index.substring('log-test-'.length(), ctx._index.length()))"
  }
}

https://discuss.elastic.co/t/reindex-multiple-indices-from-remote-with-the-same-name/171674/4

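Regarding the above: inside the ES source, the index set by the script is only applied when the index name has actually changed. So to reindex under the same names, it has to be done one index at a time... sigh.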

Applying Elasticsearch auto rollover

Mon, 22 Feb 2021 12:56:15 GMT

Let's look at how Elasticsearch rolls an index over automatically.

One thing to note: rollover is performed against an alias, not against an index.

The index name changes on every rollover, so the rollover rule is defined on an alias that stays the same across rollovers.

Creating the rolling policy

Create the policy in Kibana under Stack Management > Index Management.

Named rolling and created with 1 GB / 2 documents / 1 day as below, the index rolls over as soon as any one of the three conditions is reached.

(Without the Kibana UI, the API call below works too.)

PUT _ilm/policy/rolling
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "1gb",
            "max_docs": 2
          },
          "set_priority": {
            "priority": 100
          }
        },
        "min_age": "0ms"
      }
    }
  }
}
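The "whichever comes first" rule of this policy can be sketched as a simple OR over the three thresholds. This is an illustration under assumptions: the class and method names are hypothetical, and the use of greater-than-or-equal comparisons is my reading of the rule rather than something confirmed in these notes.

```java
import java.time.Duration;

public class RolloverCheck {
    // Thresholds taken from the policy above: 1 day, 1 GB per primary shard, 2 documents.
    static final Duration MAX_AGE = Duration.ofDays(1);
    static final long MAX_PRIMARY_SHARD_BYTES = 1024L * 1024L * 1024L;
    static final long MAX_DOCS = 2;

    /** True as soon as any one of the three conditions is met. */
    public static boolean shouldRollover(Duration age, long primaryShardBytes, long docCount) {
        return age.compareTo(MAX_AGE) >= 0
                || primaryShardBytes >= MAX_PRIMARY_SHARD_BYTES
                || docCount >= MAX_DOCS;
    }

    public static void main(String[] args) {
        // A young, tiny index that already holds 3 documents qualifies through max_docs alone.
        System.out.println(shouldRollover(Duration.ofSeconds(17), 4096, 3));
        // With a single document none of the conditions is met yet.
        System.out.println(shouldRollover(Duration.ofSeconds(17), 4096, 1));
    }
}
```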

Creating the index template

The template was set up as below.

PUT _template/rolling
{
  "order": 1,
  "version": 1,
  "index_patterns": [
    "rolling*"
  ],
  "settings": {
    "index": {
      "lifecycle": {
        "name": "rolling",
        "rollover_alias": "rolling-write-index"
      },
      "number_of_shards" : "1",
      "number_of_replicas" : "0"
    }
  }
}

The rolling lifecycle is applied to every index whose name starts with rolling,

and the alias targeted for rollover is rolling-write-index.

Creating the index and alias

PUT rolling-test-000001
{
  "aliases": {
    "rolling-write-index": { "is_write_index": true } 
  }
}

At first I had the alias in the template, but rollover failed with the error below, so I created the alias separately at index creation time.

        "type" : "illegal_argument_exception",
        "reason" : "Rollover alias [rolling-write-index] can point to multiple indices, found duplicated alias [[rolling-write-index]] in index template [rolling]",

As discussed again at the end, the alias must not go in the template; it has to be specified separately. When it is in the template, "is_write_index": true does not take effect.

Create 3 docs through the alias

POST rolling-write-index/_doc
{"id":1}
POST rolling-write-index/_doc
{"id":2}
POST rolling-write-index/_doc
{"id":3}

Checking ILM status

GET rolling-write-index/_ilm/explain

{
  "indices" : {
    "rolling-test-000001" : {
      "index" : "rolling-test-000001",
      "managed" : true,
      "policy" : "rolloing",
      "lifecycle_date_millis" : 1613977180668,
      "age" : "16.72s",
      "phase" : "hot",
      "phase_time_millis" : 1613977180704,
      "action" : "unfollow",
      "action_time_millis" : 1613977180755,
      "step" : "wait-for-follow-shard-tasks",
      "step_time_millis" : 1613977180766,
      "phase_execution" : {
        "policy" : "rolloing",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "1gb",
              "max_age" : "1d",
              "max_docs" : 2
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1613976026352
      }
    }
  }
}

It does not roll over right away; the step is "wait-for-follow-shard-tasks". (A step of "check-rollover-ready" means the rollover conditions have not been met yet.)

As quoted below, the default ILM poll interval is 10 minutes.

By default the poll interval for ILM is 10 minutes, if you want to test your policy (with a 1 minute rollover), you can lower that by changing the dynamic indices.lifecycle.poll_interval setting.

It can be lowered to 10 seconds like this.

PUT /_cluster/settings
{
  "transient" : {
    "indices.lifecycle.poll_interval" : "10s"
  }
}

Checking again 10 minutes later shows that it has rolled over to 000002.

GET rolling-write-index/_ilm/explain

{
  "indices" : {
    "rolling-test-000002" : {
      "index" : "rolling-test-000002",
      "managed" : true,
      "policy" : "rolloing",
      "lifecycle_date_millis" : 1613983439108,
      "age" : "1.75m",
      "phase" : "hot",
      "phase_time_millis" : 1613983439190,
      "action" : "unfollow",
      "action_time_millis" : 1613983439236,
      "step" : "wait-for-follow-shard-tasks",
      "step_time_millis" : 1613983439265,
      "phase_execution" : {
        "policy" : "rolloing",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "1gb",
              "max_age" : "1d",
              "max_docs" : 2
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1613976026352
      }
    }
  }
}

So the index rolls over, just as the name suggests.

At the moment of rollover, the alias on rolling-test-000001 becomes

"rolling-write-index": {
      "is_write_index": false
    }

and on rolling-test-000002 it becomes

"rolling-write-index": {
      "is_write_index": true
    }

So searching through rolling-write-index covers both indices, but indexing goes only to 000002. The 000001 index no longer grows.
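The numbering seen in the index names (000001 to 000002) can be reproduced with a small sketch: take the number after the last hyphen, add one, and zero-pad it to six digits. `RolloverName` is a hypothetical helper written for illustration, and it assumes the index name ends in a hyphen followed by digits.

```java
import java.util.Locale;

public class RolloverName {
    /** Increments the numeric suffix after the last hyphen and zero-pads it to 6 digits. */
    public static String next(String current) {
        int dash = current.lastIndexOf('-');
        long counter = Long.parseLong(current.substring(dash + 1));
        return current.substring(0, dash + 1) + String.format(Locale.ROOT, "%06d", counter + 1);
    }

    public static void main(String[] args) {
        System.out.println(next("rolling-test-000001"));
    }
}
```

This is why the first index has to be created with a numeric suffix such as -000001.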

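One caveat: the alias targeted by rollover must not go in the template.

At first I had put it in the template so that the alias was created along with the index, as below (the aliases block is shown commented out).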

PUT _template/rolling
{
  "order": 1,
  "version": 1,
  "index_patterns": [
    "rolling*"
  ],
#  "aliases": {
#    "rolling-write-index": {
#      "is_write_index": true
#    }
#  },
  "settings": {
    "index": {
      "lifecycle": {
        "name": "rolloing",
        "rollover_alias": "rolling-write-index"
      },
      "number_of_shards" : "1",
      "number_of_replicas" : "0"
    }
  }
}

In that case, however, rollover fails with the error below.

     "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "Rollover alias [rolling-write-index] can point to multiple indices, found duplicated alias [[rolling-write-index]] in index template [rolling]",

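The rollover alias referenced by the policy must not go in the template; create it once, separately, when creating the first index.

Addendum: manual rollover

As the mapping kept changing, I had been rolling over manually to pick up the new template.

POST rolling-write-index/_rollover

In that case, though, ILM later tries to roll over the index that was already rolled over manually, and fails with the error below.

index [rolling-test-000002] is not the write index for alias [rolling-write-index]

When an index is rolled over manually, it has to be marked complete by hand as below.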

PUT rolling-test-000002/_settings
{ "index.lifecycle.indexing_complete": true }

After marking it complete, retry as below to confirm that ILM works properly on the index.

POST rolling-test-000002/_ilm/retry

On success it returns "acknowledged": true.

How Elasticsearch computes the default score

Fri, 11 Dec 2020 10:58:32 GMT

The default Elasticsearch score is computed with the BM25 algorithm.

GET file/_search?explain=true
{
  "query": {
    "match": {
      "file.name": "test"
    }
  },
  "_source": "file.name", 
  "sort": [
    "_score"
  ]
}

Passing explain=true as a search parameter makes the response explain in detail how the score was calculated.

Look at the _explanation part below.

{
          "value" : 1.6962869,
          "description" : "weight(file.objtNm:test in 347841) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 1.6962869,
              "description" : "score(doc=347841,freq=30.0 = termFreq=30.0\n), product of:",
              "details" : [
                {
                  "value" : 0.94338167,
                  "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 18014.0,
                      "description" : "docFreq",
                      "details" : [ ]
                    },
                    {
                      "value" : 46272.0,
                      "description" : "docCount",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 1.7980919,
                  "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details" : [
                    {
                      "value" : 30.0,
                      "description" : "termFreq=30.0",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "parameter k1",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "parameter b",
                      "details" : [ ]
                    },
                    {
                      "value" : 102.28585,
                      "description" : "avgFieldLength",
                      "details" : [ ]
                    },
                    {
                      "value" : 728.0,
                      "description" : "fieldLength",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }

The score, 1.6962869, is the product of the two values in the details array.
```
details.value(score) = details.details[0].value * details.details[1].value
```
details.details[0].value is the IDF.
```
IDF = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
```
IDF (Inverse Document Frequency) is given by the formula above,

so the lower docFreq is, the larger the IDF.

In other words, the fewer documents in the index contain the keyword, the larger the IDF and the higher the score.

details.details[1].value is tfNorm.
```
tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
```
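tfNorm (the normalized term frequency) is given by the formula above. k1 and b are fixed tuning constants, so set them aside:

the higher freq (the number of times the keyword matches in the document) is, and

the smaller fieldLength (the length of the field in the matched document) is relative to avgFieldLength (the average field length),

the larger tfNorm becomes.

In other words, the more often the keyword repeats in the matched document, and the shorter that document's field is compared with the average field length, the higher the score.

Putting it together, BM25 computes the score as

score = IDF * tfNorm

The rarer the keyword is across all documents (that is, the more distinctive the word is to the matching documents) and the more often it repeats in the matched document, the higher the score.

BM25 scores more precisely than the older TF-IDF, so Lucene adopted it, and Elasticsearch has accordingly used BM25 as its default similarity since version 5.0.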
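As a cross-check, the two formulas from the explain output can be evaluated directly with the values shown there (docCount, docFreq, freq, k1, b, fieldLength, avgFieldLength). The sketch below is a plain re-computation in Java; `Bm25Explain` is a hypothetical class name, and it uses double precision, so the last digits may differ slightly from the float values Lucene reports.

```java
import java.util.Locale;

public class Bm25Explain {
    /** idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), natural logarithm. */
    public static double idf(double docCount, double docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). */
    public static double tfNorm(double freq, double k1, double b, double fieldLength, double avgFieldLength) {
        return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength));
    }

    public static void main(String[] args) {
        double idf = idf(46272.0, 18014.0);
        double tfNorm = tfNorm(30.0, 1.2, 0.75, 728.0, 102.28585);
        // The product should agree with the reported score of 1.6962869 to several decimals.
        System.out.println(String.format(Locale.ROOT, "idf=%.5f tfNorm=%.5f score=%.5f", idf, tfNorm, idf * tfNorm));
    }
}
```

Lowering fieldLength toward avgFieldLength in the tfNorm call raises the result, which is the length normalization described above.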

Elasticsearch 'mapper_parsing_exception'

Tue, 01 Dec 2020 08:27:59 GMT

Putting null into a join field causes a "failed to parse" exception.

{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "failed to parse"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "failed to parse",
    "caused_by" : {
      "type" : "illegal_state_exception",
      "reason" : "[null] expected START_OBJECT or VALUE_STRING but was: VALUE_NULL"
    }
  },
  "status" : 400
}

Consumer startOffset not taking effect

Thu, 03 Sep 2020 12:50:22 GMT

In the Spring Cloud Stream Kafka settings, startOffset: latest has no effect

    combine:
      destination: {action name}
      binder: kafka
      group: {group}
      consumer:
        batch-mode: true
        ackEachRecord: false
        startOffset: latest

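Even with this, it reads from earliest rather than latest. The cause: when a group name is specified, startOffset ends up as earliest (the binder documents startOffset as the starting offset for new groups). So it tried to read all of the several million records that had piled up.. @.@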

https://cloud.spring.io/spring-cloud-stream-binder-kafka/spring-cloud-stream-binder-kafka.html

https://github.com/spring-cloud/spring-cloud-stream-binder-kafka/issues/67

Our suggested workaround is to manage offsets using the Kafka consumer group management CLI tooling (http://kafka.apache.org/0102/documentation.html#basic_ops_consumer_group) between application restarts. (Also set the startOffset to earliest or latest so that the application can restart at the proper offset).

=> Use an anonymous group, or manage the offsets with the consumer group management tooling.

Reset the offsets to latest as below.

./kafka-consumer-groups.sh \
  --bootstrap-server {ip}:9092 \
  --group {group} \
  --topic {topic} \
  --reset-offsets --to-latest --execute

MySQL binlog settings

Thu, 27 Aug 2020 10:17:36 GMT

Where the binlog settings go

server-id={server id}                 # must differ between master and slave
log-bin={where the binlog is written}
binlog_format=ROW                     # format: STATEMENT, ROW, or MIXED
max_binlog_size=1G                    # size at which the binlog rolls over
expire_logs_days=2                    # retention period in days
relay_log={where the slave's relay log is written}

binlog_format

Setting the default format with this option makes every replication slave read events in that format.
ROW: makes row-based replication the default.
STATEMENT: makes statement-based replication the default.
MIXED: makes mixed-format replication the default.

Privileges for reading the binlog
```
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO {ID}@'{IP}' IDENTIFIED BY '{PWD}';
```
REPLICATION SLAVE: enables replication slaves to read binary log events from the master
REPLICATION CLIENT: enables the user to ask where master or slave servers are
(On MySQL 8.0, GRANT no longer accepts IDENTIFIED BY, so the user has to be created with CREATE USER first.)

Reading binlog files

Dumping a file as text

 mysqlbinlog /{path}/mysql-bin.000004 > binlog.sql

Viewing a specific date range as text

mysqlbinlog --verbose --database={DB name} --start-datetime="{start time}" --stop-datetime="{end time}" "{binlog path}" > binlog.sql

ex)./mysqlbinlog --verbose --database=iofficemetadb --start-datetime="2020-08-20 18:00:00" --stop-datetime="2020-08-20 20:00:00" "/mysqldata/3406/log/blog/mysql-bin.002185" >/home/ioffice/temp/binlog.sql

/home/ioffice/temp

Viewing a specific position range as text

mysqlbinlog --verbose --database=testdb --start-position=1 --stop-position=100000 {binlog path} > binlog.sql

Reading binlog files from a local MySQL install

# Windows local test
mysqlbinlog.exe -v --base64-output=DECODE-ROWS --stop-never --read-from-remote-server --user=root --password=root binlog.000003

mysqlbinlog: Error writing file '' (Errcode: 2 - No such file or directory)

# Apparently the log has to be written to a file; I just wanted to keep watching it in the cmd window...
mysqlbinlog.exe -v --base64-output=DECODE-ROWS --stop-never --read-from-remote-server --user=root --password=root binlog.000005 > ./binlog000005.txt

/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= '273990e1-e37e-11ea-9449-988389a1841d:13'/*!*/;
# at 1718
#200826 17:52:36 server id 1  end_log_pos 1803 CRC32 0x55fae6e5     Query    thread_id=3    exec_time=0    error_code=0
SET TIMESTAMP=1598431956/*!*/;
BEGIN
/*!*/;
# at 5308
#200827 13:02:44 server id 1  end_log_pos 5545 CRC32 0x8ab098af     Table_map: `{db name}`.`{table name}` mapped to number 113
# at 5545
#200827 13:02:44 server id 1  end_log_pos 6614 CRC32 0x1b7d9757     Update_rows: table id 113 flags: STMT_END_F
### UPDATE `iofficeefss0001`.`tb_file_f`
### WHERE
###   @1=133900000002300000
###   @2='181128_1339-배치테스트용_변경금지 - 01-(물리없음)-133900000002300000'
###   @3='TXT'
###   @4='대용량 테스트용 파일'
###   @5='111'
###   @6='001'
###   @7=1540660923
###   @8=147999346801600097
###   @9=NULL
###   @10=NULL
###   @11=1545941252
###   @12=NULL
###   @13=1111
###   @14=10
###   @15=147999346801600097
###   @16=1535390523
###   @17=1
###   @18=133900000002300110
###   @19=NULL
###   @20=NULL
###   @21='181128_1339-배치테스트용_변경금지 - 01-(물리없음)-133900000002300000'
###   @22=133900000000100000
###   @23=154337031460200427
###   @24='015'
###   @25=147927667597500027
###   @26='1132000000'
###   @27='Y'
###   @28='050'
###   @29=1556385723
###   @30='N'
###   @31='N'
###   @32='TAG,INFO,VAL'
###   @33=NULL
###   @34='EDTDEL'
###   @35=NULL
###   @36=NULL
###   @37=154337033663200430
###   @38='015'
###   @39=133900000000100000
###   @40=154597309410300287
###   @41=123456789012345670
###   @42=147927667597500027
###   @43=NULL
###   @44=NULL
###   @45=NULL
###   @46=NULL
###   @47='1132000000'
###   @48='050'
###   @49=1556385723
###   @50=NULL
###   @51=NULL
###   @52='W0'
###   @53='9'
###   @54='Y'
###   @55='0123456789'
###   @56='N'
###   @57='DOCT_NO'
###   @58='N'
###   @59='REG_DSTCT'
###   @60='TAG,INFO,VAL'
###   @61=NULL
###   @62=NULL
###   @63=NULL
###   @64='EDTDEL'
###   @65='PRSRV_RSN_CONTENT'
###   @66='IF_TGT_SYS'
###   @67=NULL
###   @68=NULL
###   @69='EFDSC10'
###   @70='Y'
###   @71=1543339323
###   @72=147999346801600097
###   @73=1545941252
###   @74=147999356147300023
### SET
###   @1=133900000002300000
###   @2='ioffice'
###   @3='TXT'
###   @4='대용량 테스트용 파일'
###   @5='111'
###   @6='001'
###   @7=1540660923
###   @8=147999346801600097
###   @9=NULL
###   @10=NULL
###   @11=1598500964
###   @12=NULL
###   @13=1111
###   @14=10
###   @15=147999346801600097
###   @16=1535390523
###   @17=1
###   @18=133900000002300110
###   @19=NULL
###   @20=NULL
###   @21='181128_1339-배치테스트용_변경금지 - 01-(물리없음)-133900000002300000'
###   @22=133900000000100000
###   @23=154337031460200427
###   @24='015'
###   @25=147927667597500027
###   @26='1132000000'
###   @27='Y'
###   @28='050'
###   @29=1556385723
###   @30='N'
###   @31='N'
###   @32='TAG,INFO,VAL'
###   @33=NULL
###   @34='EDTDEL'
###   @35=NULL
###   @36=NULL
###   @37=154337033663200430
###   @38='015'
###   @39=133900000000100000
###   @40=154597309410300287
###   @41=123456789012345670
###   @42=147927667597500027
###   @43=NULL
###   @44=NULL
###   @45=NULL
###   @46=NULL
###   @47='1132000000'
###   @48='050'
###   @49=1556385723
###   @50=NULL
###   @51=NULL
###   @52='W0'
###   @53='9'
###   @54='Y'
###   @55='0123456789'
###   @56='N'
###   @57='DOCT_NO'
###   @58='N'
###   @59='REG_DSTCT'
###   @60='TAG,INFO,VAL'
###   @61=NULL
###   @62=NULL
###   @63=NULL
###   @64='EDTDEL'
###   @65='PRSRV_RSN_CONTENT'
###   @66='IF_TGT_SYS'
###   @67=NULL
###   @68=NULL
###   @69='EFDSC10'
###   @70='Y'
###   @71=1543339323
###   @72=147999346801600097
###   @73=1598500964
###   @74=147999356147300023
# at 6614
#200827 13:02:44 server id 1  end_log_pos 6645 CRC32 0x4b5d9a7a     Xid = 833
COMMIT/*!*/;

SET @@SESSION.GTID_NEXT= '273990e1-e37e-11ea-9449-988389a1841d:13'/*!*/;

The GTID above is written only when gtid_mode is 'ON'.

There are various options, listed here:
- binlog options

The master's current binlog file and last position

SHOW MASTER STATUS;

  {
          "File": "binlog.000009",
          "Position": 5800,
          "Binlog_Do_DB": "",
          "Binlog_Ignore_DB": "",
          "Executed_Gtid_Set": "273990e1-e37e-11ea-9449-988389a1841d:1-43"
      }

Viewing events with a query
```
SHOW BINLOG EVENTS 
IN 'binlog.000009' 
FROM 4873 LIMIT 10;
```
If the FROM value does not exactly match the Pos of a binlog event, nothing is returned.

As shown below, the Position returned by SHOW MASTER STATUS is the End_log_pos of the last event.

  {
              "Log_name": "binlog.000009", # binlog file name
              "Pos": 4873, # start position
              "Event_type": "Anonymous_Gtid", # event type
              "Server_id": 1, 
              "End_log_pos": 4938, # end position
              "Info": "SET @@SESSION.GTID_NEXT= 'ANONYMOUS'" # the statement shown in the binlog after the "# at {Pos}" header line

          },

  # at 4873
#200831 18:30:30 server id 1  end_log_pos 4938 CRC32 0xc0d6daa3     Anonymous_GTID    last_committed=5    sequence_number=6    rbr_only=yes

  {
      "table": "LIMIT",
      "rows":
      [
          {
              "Log_name": "binlog.000009",
              "Pos": 4873,
              "Event_type": "Anonymous_Gtid",
              "Server_id": 1,
              "End_log_pos": 4938,
              "Info": "SET @@SESSION.GTID_NEXT= 'ANONYMOUS'"
          },
          {
              "Log_name": "binlog.000009",
              "Pos": 4938,
              "Event_type": "Query",
              "Server_id": 1,
              "End_log_pos": 5023,
              "Info": "BEGIN"
          },
          {
              "Log_name": "binlog.000009",
              "Pos": 5023,
              "Event_type": "Table_map",
              "Server_id": 1,
              "End_log_pos": 5260,
              "Info": "table_id: 108 (iofficeefss0002.tb_file_f)"
          },
          {
              "Log_name": "binlog.000009",
              "Pos": 5260,
              "Event_type": "Update_rows",
              "Server_id": 1,
              "End_log_pos": 5769,
              "Info": "table_id: 108 flags: STMT_END_F"
          },
          {
              "Log_name": "binlog.000009",
              "Pos": 5769,
              "Event_type": "Xid",
              "Server_id": 1,
              "End_log_pos": 5800,
              "Info": "COMMIT /* xid=302 */"
          }
      ]
  }
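
The Pos / End_log_pos relationship can be checked mechanically. The sketch below is plain JavaScript over the rows shown above, copied by hand; `isContiguous` is a hypothetical helper name, not part of MySQL or any client library. It confirms that each event starts where the previous one ended, and that the last End_log_pos equals the Position from SHOW MASTER STATUS.

```javascript
// Rows copied from the SHOW BINLOG EVENTS output above (only Pos / End_log_pos kept).
const events = [
  { type: 'Anonymous_Gtid', pos: 4873, end: 4938 },
  { type: 'Query',          pos: 4938, end: 5023 },
  { type: 'Table_map',      pos: 5023, end: 5260 },
  { type: 'Update_rows',    pos: 5260, end: 5769 },
  { type: 'Xid',            pos: 5769, end: 5800 },
];
const masterPosition = 5800; // "Position" from SHOW MASTER STATUS

// True when every event begins exactly at the previous event's End_log_pos.
function isContiguous(rows) {
  return rows.every((row, i) => i === 0 || rows[i - 1].end === row.pos);
}

const contiguous = isContiguous(events);
const lastEnd = events[events.length - 1].end;
console.log(contiguous, lastEnd === masterPosition); // true true
```

This also suggests why a FROM value has to be one of the listed Pos values: any other number points into the middle of an event.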

LEVEL-DB

Tue, 21 Jul 2020 09:26:23 GMT

What is LevelDB?

A fast and lightweight key-value DB library made by Google

key-value : NO-SQL

lightweight library : Embedded DB

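LevelDB features

It was built as a database for web applications.

Because data is stored locally, it holds up well in situations such as network loss.

It was built to implement Chrome's IndexedDB.

IndexedDB is a way to persistently store data inside a user's browser. Because it lets you build web applications with rich query abilities regardless of network availability, your web applications can work both online and offline. https://developer.mozilla.org/ko/docs/Web/API/IndexedDB_API/Using_IndexedDB

The Indexed Database API, or IndexedDB (formerly WebSimpleDB), is a web browser standard interface recommended by the W3C[1] for a transactional local database of JSON object collections with indices. Websites can collect and store persistent data in the database. The W3C issued its final recommendation for the IndexedDB interface on January 8, 2015.[2]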
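data is stored by key

Because it is an ordered map, iteration is cheap.

forward / backward iteration is supported
basic operations are put / get / delete
Users can create snapshot
Data is automatically compressed using the Snappy compression library

Snappy does not compress very tightly, but it is very fast:

about 250 MB/s compression and 500 MB/s decompression
Keys and values are byte arrays
Read/write performance is good overall,
but write performance drops when values are large

This is because the implementation copies keys and values at least twice. (?)

As expected, updates are expensive, because an update is effectively a delete plus a write.
Storage is an LSM tree (log-structured merge-tree), which is an alternative to the B-tree rather than a variant of it.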

Installation

Bindings exist for C++, Node, and Python 3; let's install the Node one. (https://www.npmjs.com/package/level)

npm install level
//level = levelup(node) + leveldown(C++)

A simple put to a file-backed store

const level = require('level')
// Opens (or creates) a file-backed store under ./mydb
const db = level("./mydb");

// Write a key, then read it back to confirm the write
const put = (key, value) => {
    db.put(key, value, function (err) {
        if (err) {
            return console.log(err);
        }
        db.get(key, function (err, value) {
            if (err) {
                return console.log(err);
            }
            console.log(value);
        });
    });
};

put("111", "ttt");

To write to memory instead, install the npm package memdown.

For the various extension modules, see here.

Writing with BATCH

One strength of LevelDB is fast batch execution.

db.batch(array[, options][, callback]) (array form)

batch() can be used for very fast bulk-write operations (both put and delete). The array argument should contain a list of operations to be executed sequentially, although as a whole they are performed as an atomic operation inside the underlying store.

The batch example has a Korean name in it? Funny.

const batch = () =>{
    var ops = [
        { type: 'del', key: 'father' },
        { type: 'put', key: 'name', value: 'Yuri Irsenovich Kim' },
        { type: 'put', key: 'dob', value: '16 February 1941' },
        { type: 'put', key: 'spouse', value: 'Kim Young-sook' },
        { type: 'put', key: 'occupation', value: 'Clown' }
    ]

    db.batch(ops, function (err) {
        if (err) return console.log('Ooops!', err)
        console.log('Great success dear leader!')

        // Read back only after the batch has been committed
        db.get('spouse', (err, value) =>{
            if (err) return console.log(err)
            console.log("wow~", value);
        })
    })
}

batch();

Great success dear leader!
wow~ Kim Young-sook

.... Turns out Yuri Irsenovich Kim is Kim Jong-il, lol
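
The "performed as an atomic operation" wording in the batch() description can be pictured with a toy model. The sketch below is plain JavaScript over an in-memory Map, not the level API; `applyBatch` is a hypothetical helper, and the only assumption is the `{ type, key, value }` operation shape shown above. Either every operation lands or none does.

```javascript
// Toy model of an atomic batch over an in-memory Map (not the level API).
function applyBatch(store, ops) {
  const draft = new Map(store); // work on a copy so a failure leaves `store` untouched
  for (const op of ops) {
    if (op.type === 'put') {
      draft.set(op.key, op.value);
    } else if (op.type === 'del') {
      draft.delete(op.key);
    } else {
      throw new Error(`unknown batch op type: ${op.type}`);
    }
  }
  return draft; // the caller swaps stores only when every op succeeded
}

let store = new Map([['father', 'x']]);
store = applyBatch(store, [
  { type: 'del', key: 'father' },
  { type: 'put', key: 'spouse', value: 'Kim Young-sook' },
]);
console.log(store.has('father'), store.get('spouse')); // false Kim Young-sook

// A bad op aborts the whole batch; the original store is not modified.
let rejected = false;
try {
  applyBatch(store, [{ type: 'put', key: 'k', value: 'v' }, { type: 'oops' }]);
} catch (err) {
  rejected = true;
}
console.log(rejected, store.has('k')); // true false
```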

Chaining BATCH

const batchChain = () => {
    db.batch()
        .del('father')
        .put('binna', 'what?')
        .put('태황', 'what??')
        .put('지원', {"type":"mannerism", "age":32})
        .put('다연', '퇴근하구싶어요')
        .write(function () {
            console.log('Done!')
        })
}

As above, batch operations can be chained.

So where is it used?

LevelDB is used as the backend database for Google Chrome's IndexedDB and is one of the supported backends for Riak.[8] Additionally, Bitcoin Core and go-ethereum stores the blockchain metadata using a LevelDB database.[9] Minecraft Bedrock Edition uses a modified version for chunk and entity data storage.[10] Autodesk AutoCAD 2016 also uses LevelDB.

It stores go-ethereum's blockchain metadata and Minecraft Bedrock Edition's chunk and entity data, and it is used by AutoCAD.

2. Developing a java filter plugin

Fri, 12 Jun 2020 08:58:51 GMT

Building the Logstash source

The Logstash source must be available locally before a filter plugin can be built.

logstash download
```
https://github.com/elastic/logstash.git
```
Build
```
gradlew.bat assemble
```

Check that the jar was created

Check that logstash-core-*.*.*.jar was created under {project_dir}/build/libs

Set the environment variable

set LS_SRC_HOME={logstash_core_project_dir}

Get the filter plugin example and write your own

filter - example download

 https://github.com/logstash-plugins/logstash-filter-java_filter_example

Add the following to C:\Users\SDS\.gradle\gradle.properties

```
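LOGSTASH_CORE_PATH={logstash_core_project_dir}/logstash-core/

#LOGSTASH_CORE_PATH=$LS_SRC_HOME/logstash-core/
#-> The line above is supposed to work too, but it did not for me, so I just set the absolute path.
#If you are behind a proxy, it has to be configured here as well.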
systemProp.proxySet=true

systemProp.http.proxyHost={IP}
systemProp.http.proxyPort={PORT}
systemProp.http.nonProxyHosts={hosts_to_exclude}|localhost

systemProp.https.proxyHost={IP}
systemProp.https.proxyPort={PORT}
systemProp.https.nonProxyHosts={hosts_to_exclude}|localhost
```

Edit build.gradle (pull in dependencies, rename)

I needed artifacts that were already deployed to our existing Maven repository, so I changed the Gradle settings as follows.

Add every Maven URL as below (exactly as in settings.xml)
```
repositories{
    maven {
        credentials {
            username '{id}'
            password '{password}'
        }
        url "{url}"
    }
}
```
Add the dependencies as below (exactly as in pom.xml)
```
compile '{groupID}:{artifactId}:{version (use + for RELEASE)}'
```
Rename
```
pluginInfo.pluginClass     = "{yourNewFilterName}"
pluginInfo.pluginName =  "{your_new_filter_name}"
```

Write the java plugin

Write the filter using the dependencies pulled in above.

Create the class with the name set in {pluginInfo.pluginClass} above.

gem build

gradlew.bat gem

A file like the following is produced

logstash-filter-your_new_name-1.00.gem

plugin install

I lost a lot of time here: on Windows, the backslashes in the path have to be replaced with forward slashes.

For Windows platforms: Substitute backslashes for forward slashes as appropriate in the command.

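Also, for reasons I don't understand, installing a local plugin kept failing with a message that it could not connect to https://rubygems.org, so I set the proxy in environment variables as below.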
```
HTTPS_PROXY=https://{IP}:{PORT}
HTTP_PROXY=http://{IP}:{PORT}
```
```
bin\logstash-plugin.bat install --no-verify --local {path}/logstash-filter-your_new_name-1.00.gem
```
```
Installation successful
```

test

input {
  kafka {
    bootstrap_servers => "{IP:PORT}"
    topics => "{ACTION_NAME}"
    group_id => "log-to-es-group"
    consumer_threads => 1
  }
}

filter {
    your_new_filter_name {}
}

output { 
  elasticsearch {
    hosts => ["{ES_HOST}"]
    index => "log-insert"
  }
  stdout { codec => rubydebug }
}

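The problem is that the kafka input converts the byte[] kafkaMessage to a String before handing it to the filter. I want to deserialize the serialized byte[] into a specific class, but because it arrives as a String, the path becomes Class -> byte[] -> String -> byte[] -> Class, which is wasteful.

In addition, the String -> byte[] -> Class conversion fails with the exception below. I read that getBytes("8859_1") converts without losing bytes, so I also set the encoding in the kafka input as shown below, but the conversion still fails.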

    codec => plain {
                   charset => "ISO-8859-1"
            }

invalid stream header: EFBFBDEF
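
The bytes in that message are a clue. `EF BF BD` is the UTF-8 encoding of U+FFFD, the replacement character, while a Java serialization stream starts with `AC ED 00 05`. One plausible reading (an assumption, not confirmed against the Logstash source) is that the payload had already been decoded as UTF-8 somewhere before the filter, so the invalid bytes were replaced and no later charset choice can bring them back. The Node sketch below reproduces the effect with Buffer alone:

```javascript
// First four bytes of a Java serialization stream: magic 0xACED, version 0x0005.
const header = Buffer.from([0xac, 0xed, 0x00, 0x05]);

// Lossy path: decode as UTF-8, then encode again.
// 0xAC and 0xED are not valid UTF-8 here, so each becomes U+FFFD (EF BF BD).
const viaUtf8 = Buffer.from(header.toString('utf8'), 'utf8');

// Lossless path: latin1 (ISO-8859-1) maps every byte to one code point and back.
const viaLatin1 = Buffer.from(header.toString('latin1'), 'latin1');

const utf8Hex = viaUtf8.toString('hex');
const latin1Hex = viaLatin1.toString('hex');
console.log(utf8Hex);   // efbfbdefbfbd0005
console.log(latin1Hex); // aced0005
```

The first four bytes of the lossy result are `EFBFBDEF`, the same value as in the exception. If this reading is right, the ISO-8859-1 round trip only helps when it is applied to the raw bytes, before any UTF-8 decoding has taken place.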

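I have been stuck on this for days. It is not easy to debug what happens inside the logstash input. Still, it looks like a handy service when you just want to push some stream into ES as-is. Let me write the flume version first and then decide.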

1. Installing and running logstash

Fri, 12 Jun 2020 08:46:32 GMT

Just try kafka message -> logstash -> ES

input {
  kafka {
    bootstrap_servers => "182.193.17.35:9092"
    topics => "EfssAction"
    group_id => "log-to-es-group"
    consumer_threads => 1
  }
}

filter {

}

output { 
  elasticsearch {
    hosts => ["182.193.17.33:9220"]
    index => "log-insert"
  }
  stdout { codec => rubydebug }
}

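Check the kafka-client version.

The kafka input plugin and output plugin have reportedly been merged into a single integration plugin.

The kafka-client version can be found under the path below. I checked that it was compatible.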
```
D:\ELK\logstash-7.7.0\vendor\bundle\jruby\2.5.0\gems\logstash-integration-kafka-10.0.1-java\vendor\jar-dependencies\org\apache\kafka\kafka-clients
```
Run logstash

Move to {logstash_install_dir} and run
```
bin\logstash.bat -f ./config/logstash.conf
```

Search

  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "log-insert",
        "_type" : "doc",
        "_id" : "EKtJmHIB5PmJ0m_JzGP1",
        "_score" : 1.0,
        "_source" : {
          "message" : "com.***.***.kafka.model.KafkaMessage...

The messages are indexed fine, but the sender wraps them in a specific java class.

To filter and process them properly, a java filter plugin has to be developed.

0. Moving logs to ES

Fri, 12 Jun 2020 08:45:33 GMT

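We currently send audit_log as kafka messages and store them in cassandra, but cassandra is hard to query in ways that fit our various requirements and gets little use (a well-designed cassandra schema would of course help, but that is not what we have now), so we decided to move to ES.
How can we keep the existing legacy path and additionally write to ES?
1. Track DB changes and feed them to logstash
  
  -> The problem: can each user action be identified? Each action would have to be derivable from the data change.
  
  -> PAD editing, history search and so on... cannot be derived. The device type, or whether a creation was a copy or new, cannot be distinguished.

2. Use logstash.
3. Add a flume sink that writes to ES.

For now, I'll do a POC of option 2.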

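logstash vs flume?

logstash ->

It is easy to get data into ES, but it is hard to use for a service like ours that pulls extra information from the DB, processes it, and has many complicated business requirements.
I really can't tell which one is better.

What about using the data?

Let's think about how to use the data.
- Anomalous action detection
- Usage forecasting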

maven import PKIX error

Fri, 05 Jun 2020 00:55:06 GMT

1. Open cmd

Go to C:\Program Files\Java\jdk1.8.0_251\bin, then shift + right-click > open a cmd window here (run as administrator)

2. Import the certificate

keytool -import -file "D:\{certificate}" -keystore "C:\Program Files\Java\jdk1.8.0_251\jre\lib\security\cacerts" -storepass "changeit"

keytool -import -file "D:\{certificate}" -keystore cacerts -storepass "changeit"

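3. Verify

keytool -list -v -keystore "C:\Program Files\Java\jdk1.8.0_251\jre\lib\security\cacerts" -storepass "changeit"

What if you don't have the certificate?

Export it from the browser: settings > certificates > export.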

Elasticsearch nested query: inner_hits mismatch when should and must are used together inside nested

Tue, 19 May 2020 05:31:19 GMT

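I ran into something odd during development.

I added a must clause for a mandatory condition to a nested query that previously consisted only of should clauses. The query appeared to work, but inner_hits did not come back correctly. Because the top-level query results were right, I mistakenly assumed everything was fine.

I'll explain with the sample data below.

mapping

The blogs index holds post docs. Each doc has a nested field recording whom the blog was shared with (sharedTo) and until when the share is valid (expiration). authority is the permission on that post.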

PUT blogs
{
  "mappings": {
    "_doc": {
      "properties": {
        "post": {
          "properties": {
            "postId": {
              "type": "keyword"
            },
            "contents": {
              "type": "text"
            }
          }
        },
        "sharedPeople": {
            "type" : "nested",
          "properties": {
            "expiration": {
              "type": "date"
            },
            "authority": {
              "type": "keyword"
            },
            "sharedTo":{
              "type":"keyword"
            }
          }
        }
      }
    }
  }
}

Add 2 docs

Doc 1 is shared with A, B, and C,

and doc 2 is shared with B and C.

PUT blogs/_doc/1
{
  "post": {
    "postId": 1,
    "contents": "hello"
  },
  "sharedPeople": [
    {
      "exiration": "2021-05-20",
      "authority": "1",
      "sharedTo": "A"
    },
        {
      "exiration": "2021-05-20",
      "authority": "2",
      "sharedTo": "B"
    }
    ,
    {
      "exiration": "2021-05-20",
      "authority": "3",
      "sharedTo": "C"
    }
  ]
}

PUT blogs/_doc/2
{
  "post": {
    "postId": 2,
    "contents": "world"
  },
  "sharedPeople": [
        {
      "exiration": "2021-05-20",
      "authority": "2",
      "sharedTo": "B"
    }
    ,
    {
      "exiration": "2021-05-20",
      "authority": "3",
      "sharedTo": "C"
    }
  ]
}

Query for the documents shared with A

Since a post may have been shared with the same person more than once, the query sorts by authority and takes the largest value as the inner_hit.

GET blogs/_search
{
  "query": {
    "bool": {
      "filter": {
        "nested": {
          "path": "sharedPeople",
          "query": {
            "bool": {
              "should": [
                {
                  "bool": {
                    "must": [
                      {
                        "terms": {
                          "sharedPeople.sharedTo": [
                            "A"
                          ]
                        }
                      }
                    ]
                  }
                }
              ]
            }
          },
          "inner_hits": {
            "name": "sharedInfo",
            "from": 0,
            "size": 1,
            "sort": [
              {
                "sharedPeople.authority": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "blogs",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "post" : {
            "postId" : 1,
            "contents" : "hello"
          },
          "sharedPeople" : [
            {
              "exiration" : "2021-05-20",
              "authority" : "1",
              "sharedTo" : "A"
            },
            {
              "exiration" : "2021-05-20",
              "authority" : "2",
              "sharedTo" : "B"
            },
            {
              "exiration" : "2021-05-20",
              "authority" : "3",
              "sharedTo" : "C"
            }
          ]
        },
        "inner_hits" : {
          "sharedInfo" : {
            "hits" : {
              "total" : 1,
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "blogs",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "sharedPeople",
                    "offset" : 0
                  },
                  "_score" : null,
                  "_source" : {
                    "exiration" : "2021-05-20",
                    "authority" : "1",
                    "sharedTo" : "A"
                  },
                  "sort" : [
                    "1"
                  ]
                }
              ]
            }
          }
        }
      }
    ]
  }

Adding the sharing period as an extra must condition

GET blogs/_search
{
  "query": {
    "bool": {
      "filter": {
        "nested": {
          "path": "sharedPeople",
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "sharedPeople.exiration": {
                      "from": "2020-05-20",
                      "to": null,
                      "include_lower": false,
                      "include_upper": true
                    }
                  }
                }
              ],
              "should": [
                {
                  "bool": {
                    "must": [
                      {
                        "terms": {
                          "sharedPeople.sharedTo": [
                            "A"
                          ]
                        }
                      }
                    ]
                  }
                }
              ]
            }
          },
          "inner_hits": {
            "name": "sharedInfo",
            "from": 0,
            "size": 1,
            "sort": [
              {
                "sharedPeople.authority": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "blogs",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "post" : {
            "postId" : 1,
            "contents" : "hello"
          },
          "sharedPeople" : [
            {
              "exiration" : "2021-05-20",
              "authority" : "1",
              "sharedTo" : "A"
            },
            {
              "exiration" : "2021-05-20",
              "authority" : "2",
              "sharedTo" : "B"
            },
            {
              "exiration" : "2021-05-20",
              "authority" : "3",
              "sharedTo" : "C"
            }
          ]
        },
        "inner_hits" : {
          "sharedInfo" : {
            "hits" : {
              "total" : 3,
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "blogs",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "sharedPeople",
                    "offset" : 2
                  },
                  "_score" : null,
                  "_source" : {
                    "exiration" : "2021-05-20",
                    "authority" : "3",
                    "sharedTo" : "C"
                  },
                  "sort" : [
                    "3"
                  ]
                }
              ]
            }
          }
        }
      }
    ]
  }

As shown above, inner_hits returns C's authority.

Within inner_hits, the should clause below was ignored.

              "should": [
                {
                  "bool": {
                    "must": [
                      {
                        "terms": {
                          "sharedPeople.sharedTo": [
                            "A"
                          ]
                        }
                      }
                    ]
                  }
                }
              ]

The odd part: if the clause had been ignored for the query as a whole, doc 2, which is not shared with A, should have been returned too. Yet the actual result contains only doc 1, while inner_hits returns C's authority instead of A's.

In ES versions before 7.0, minimum_should_match for should clauses in a filter context defaults to 1, but this default was not being applied to inner_hits. Indeed, running the query above on 6.8 prints the following warning that this behavior will go away.

#! Deprecation: Should clauses in the filter context will no longer automatically set the minimum should match to 1 in the next major version. You should group them in a [filter] clause or explicitly set [minimum_should_match] to 1 to restore this behavior in the next major version.

Let's set minimum_should_match explicitly, as below.

GET blogs/_search
{
  "query": {
    "bool": {
      "filter": {
        "nested": {
          "path": "sharedPeople",
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "sharedPeople.exiration": {
                      "from": "2020-05-20",
                      "to": null,
                      "include_lower": false,
                      "include_upper": true
                    }
                  }
                }
              ],
              "should": [
                {
                  "bool": {
                    "must": [
                      {
                        "terms": {
                          "sharedPeople.sharedTo": [
                            "A"
                          ]
                        }
                      }
                    ]
                  }
                }

              ],
              "minimum_should_match" : 1
            }
          },
          "inner_hits": {
            "name": "sharedInfo",
            "from": 0,
            "size": 1,
            "sort": [
              {
                "sharedPeople.authority": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "blogs",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "post" : {
            "postId" : 1,
            "contents" : "hello"
          },
          "sharedPeople" : [
            {
              "exiration" : "2021-05-20",
              "authority" : "1",
              "sharedTo" : "A"
            },
            {
              "exiration" : "2021-05-20",
              "authority" : "2",
              "sharedTo" : "B"
            },
            {
              "exiration" : "2021-05-20",
              "authority" : "3",
              "sharedTo" : "C"
            }
          ]
        },
        "inner_hits" : {
          "sharedInfo" : {
            "hits" : {
              "total" : 1,
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "blogs",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "sharedPeople",
                    "offset" : 0
                  },
                  "_score" : null,
                  "_source" : {
                    "exiration" : "2021-05-20",
                    "authority" : "1",
                    "sharedTo" : "A"
                  },
                  "sort" : [
                    "1"
                  ]
                }
              ]
            }
          }
        }
      }
    ]
  }

Now A's authority is returned correctly.
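
The difference between the two inner_hits results can be reproduced with a toy model. This is plain JavaScript, not how Elasticsearch evaluates queries; `matchesBool` and `innerHits` are hypothetical helpers, and the implicit case is modeled as minimum_should_match = 0 because that is how the observed inner_hits behaved.

```javascript
// Nested objects of doc 1, as indexed above.
const sharedPeople = [
  { exiration: '2021-05-20', authority: '1', sharedTo: 'A' },
  { exiration: '2021-05-20', authority: '2', sharedTo: 'B' },
  { exiration: '2021-05-20', authority: '3', sharedTo: 'C' },
];

// Toy bool matcher: every `must` has to pass, and at least
// `minimumShouldMatch` of the `should` predicates have to pass.
function matchesBool(obj, { must = [], should = [], minimumShouldMatch = 0 }) {
  const mustOk = must.every((p) => p(obj));
  const shouldHits = should.filter((p) => p(obj)).length;
  return mustOk && shouldHits >= minimumShouldMatch;
}

// Toy inner_hits: filter nested objects, sort by authority desc, keep `size`.
function innerHits(objects, query, size = 1) {
  return objects
    .filter((o) => matchesBool(o, query))
    .sort((a, b) => b.authority.localeCompare(a.authority))
    .slice(0, size);
}

const must = [(o) => o.exiration > '2020-05-20'];   // the range clause
const should = [(o) => ['A'].includes(o.sharedTo)]; // the terms clause

const implicit = innerHits(sharedPeople, { must, should });                        // should is optional
const explicit = innerHits(sharedPeople, { must, should, minimumShouldMatch: 1 }); // should is required
console.log(implicit[0].sharedTo, explicit[0].sharedTo); // C A
```

With the should clause optional, all three nested objects pass the must clause and the sort puts C on top. Once it is required, only A's object survives.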

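We are on 6.8. Did they mistakenly drop the implicit minimum_should_match for inner_hits there already? Since the implicit default of 1 goes away from 7.0 onward, I will make it a rule to always set minimum_should_match explicitly.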

Viewing the Cassandra schema

Mon, 18 May 2020 09:40:19 GMT

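The latest Cassandra version is currently 3.11.

the latest Apache Cassandra 3.11 release: 3.11.6 (pgp, sha256 and sha512), released on 2020-02-14.

The version I use for development is 2.2.8, so the queries below use the 2.x system tables. (From 3.0 the schema tables moved to the system_schema keyspace.)

Check the version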

select release_version 
from system.local;

List all keyspaces

select * 
from system.schema_keyspaces;

List the tables in a keyspace

select columnfamily_name 
from system.schema_columnfamilies 
where keyspace_name = 'audit';

List the columns of a table

select column_name, type, validator 
from system.schema_columns 
where keyspace_name = 'audit' 
and columnfamily_name = 'user_action';

moment().utcOffset()

Thu, 07 May 2020 09:31:26 GMT

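There was code like this..

moment().utcOffset(UserTimeOffset).toDate().getTime()

It turns out the following all return the same value

moment().utcOffset(360).toDate().getTime()
moment().utcOffset(240).toDate().getTime()
new Date().getTime()

utcOffset only controls which time zone the value is displayed in; it does not change the time value itself.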
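
The same point can be shown without moment. The sketch below uses only the built-in Date; `formatAtOffset` is a hypothetical helper that renders one instant as wall-clock time at a fixed UTC offset in minutes, roughly the display role that utcOffset plays. The instant is an arbitrary example value.

```javascript
// One fixed instant: 2020-05-07T09:31:26Z.
const instant = new Date(Date.UTC(2020, 4, 7, 9, 31, 26));

// Hypothetical helper: wall-clock text for `date` at a fixed UTC offset (minutes).
// It builds a shifted copy purely for formatting; `date` itself is never modified.
function formatAtOffset(date, offsetMinutes) {
  const shifted = new Date(date.getTime() + offsetMinutes * 60 * 1000);
  return shifted.toISOString().slice(0, 19).replace('T', ' ');
}

const at360 = formatAtOffset(instant, 360);
const at240 = formatAtOffset(instant, 240);
const epochMs = instant.getTime();

console.log(at360); // 2020-05-07 15:31:26
console.log(at240); // 2020-05-07 13:31:26
console.log(epochMs === Date.UTC(2020, 4, 7, 9, 31, 26)); // true: the instant is unchanged
```

The displayed strings differ by offset, but the epoch milliseconds that getTime() returns stay the same.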

KAFKA things I always forget

Thu, 19 Mar 2020 08:27:39 GMT

0. Start Kafka

./kafka-server-start.sh ../../config/server.properties

1. Create a topic

./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic {topicName}

2. List topics

./kafka-topics.sh --list --zookeeper localhost:2181

3. List groups

./kafka-consumer-groups.sh --list --bootstrap-server {IP}:9092

4. Read a topic from the console consumer

./kafka-console-consumer.sh --bootstrap-server {IP}:9092 --topic {topicName} --from-beginning

5. Check a group's offsets

./kafka-consumer-groups.sh --bootstrap-server {IP}:9092 --describe --group abc

6. Set a group's offset for a topic

./kafka-consumer-groups.sh \
--bootstrap-server {IP}:9092 \
--group {group} \
--topic {topic} \
--reset-offsets --to-latest --execute

Result

--to-offset n
--to-earliest
--shift-by n

The offset can also be set with options such as these.

7. Set a group's offset for a topic to a specific time

./kafka-consumer-groups.sh \
--bootstrap-server 182.193.17.35:9092 \
--group {group} \
--reset-offsets \
--topic {topic} \
--to-datetime 2020-12-02T00:00:00.000+09:00

With an old kafka client, offsets are stored in zookeeper rather than in kafka. In that case the same work has to be done in zookeeper.

flume source NPE

Wed, 18 Mar 2020 09:32:43 GMT

1. An error after changing the log level

flume.root.logger=DEBUG

After changing root.logger to debug, the error below occurred.

18 Mar 2020 08:58:16,193 DEBUG KafkaSource  - Event #: 0
18 Mar 2020 08:58:16,193 ERROR KafkaSource  - KafkaSource EXCEPTION, {}
java.lang.NullPointerException
        at org.apache.flume.instrumentation.MonitoredCounterGroup.increment(MonitoredCounterGroup.java:261)
        at org.apache.flume.instrumentation.kafka.KafkaSourceCounter.incrementKafkaEmptyCount(KafkaSourceCounter.java:49)
        at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:146)
        at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
        at java.lang.Thread.run(Thread.java:748)
18 Mar 2020 08:58:16,193 ERROR KafkaSource  - KafkaSource EXCEPTION, {}
java.lang.NullPointerException
        at org.apache.flume.instrumentation.MonitoredCounterGroup.increment(MonitoredCounterGroup.java:261)
        at org.apache.flume.instrumentation.kafka.KafkaSourceCounter.incrementKafkaEmptyCount(KafkaSourceCounter.java:49)
        at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:146)
        at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
        at java.lang.Thread.run(Thread.java:748)
18 Mar 2020 08:58:16,196 DEBUG KafkaSource  - Waited: 1002
18 Mar 2020 08:58:16,196 DEBUG KafkaSource  - Event #: 0
18 Mar 2020 08:58:16,196 ERROR KafkaSource  - KafkaSource EXCEPTION, {}
java.lang.NullPointerException
        at org.apache.flume.instrumentation.MonitoredCounterGroup.increment(MonitoredCounterGroup.java:261)
        at org.apache.flume.instrumentation.kafka.KafkaSourceCounter.incrementKafkaEmptyCount(KafkaSourceCounter.java:49)
        at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:146)
        at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
        at java.lang.Thread.run(Thread.java:748)
18 Mar 2020 08:58:16,196 DEBUG KafkaSource  - Waited: 1002
18 Mar 2020 08:58:16,196 DEBUG KafkaSource  - Event #: 0
18 Mar 2020 08:58:16,196 ERROR KafkaSource  - KafkaSource EXCEPTION, {}
java.lang.NullPointerException
        at org.apache.flume.instrumentation.MonitoredCounterGroup.increment(MonitoredCounterGroup.java:261)
        at org.apache.flume.instrumentation.kafka.KafkaSourceCounter.incrementKafkaEmptyCount(KafkaSourceCounter.java:49)
        at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:146)
        at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
        at java.lang.Thread.run(Thread.java:748)

2. KafkaSource.class

Following the error, I found code in KafkaSource.class that only runs when the log level is debug:

 if (!iterStatus) {
                if (log.isDebugEnabled()) {
                    this.counter.incrementKafkaEmptyCount();
                    log.debug("Returning with backoff. No more data to read");
                }

                return Status.BACKOFF;

incrementKafkaEmptyCount calls

 return increment(COUNTER_KAFKA_EMPTY)

and increment does

  protected long increment(String counter) {
    return counterMap.get(counter).incrementAndGet();
  }

counterMap.get(counter) returns null here, which causes the NullPointerException.
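
The failure shape is easy to reproduce outside Flume. The sketch below is a toy in JavaScript, not Flume code: `CounterGroup`, `incrementSafe`, and the counter names are hypothetical. A counter that was never registered is missing from the map, so dereferencing the lookup result fails. The guard shown is one generic way to avoid that, and is not a description of what the upstream fix did.

```javascript
// Toy counter group: only names registered up front exist in the map.
class CounterGroup {
  constructor(names) {
    this.counters = new Map(names.map((name) => [name, { value: 0 }]));
  }

  // Same shape as the failing code: assumes the counter was registered.
  increment(name) {
    const counter = this.counters.get(name); // undefined when never registered
    counter.value += 1;                      // throws TypeError in that case
    return counter.value;
  }

  // Guarded variant: register the counter lazily before incrementing.
  incrementSafe(name) {
    if (!this.counters.has(name)) this.counters.set(name, { value: 0 });
    return this.increment(name);
  }
}

// Hypothetical counter names, for illustration only.
const group = new CounterGroup(['registered.counter']);

let failed = false;
try {
  group.increment('unregistered.counter');
} catch (err) {
  failed = err instanceof TypeError;
}
const afterSafe = group.incrementSafe('unregistered.counter');
console.log(failed, afterSafe); // true 1
```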

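This is reportedly a fixed bug (https://issues.apache.org/jira/browse/FLUME-2578), and we seem to hit it because we use an old version of kafka-source.

        <dependency>
            <groupId>org.apache.flume.flume-ng-sources</groupId>
            <artifactId>flume-kafka-source</artifactId>
            <version>1.6.0</version>
        </dependency>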

...

log4j.logger.org.apache.flume.source = INFO

For now I set that logger to INFO.

Elasticsearch: a combined (children + filter) aggregation that counts the children matching a condition in a join relation

Wed, 18 Mar 2020 03:58:21 GMT

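There are docs in a join relation.

The mapping is below. comment is a child of post, and post holds an array of mentioned people. (Below version 7, the properties must be wrapped in a type such as _doc under mappings.)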

PUT blog
{
  "mappings": {
    "properties": {
      "post": {
        "properties": {
          "postId": {
            "type": "keyword"
          },
          "mentionedPeople": {
            "type": "keyword"
          }
        }
      },
      "comment": {
        "properties": {
          "commentId": {
            "type": "keyword"
          },
          "content": {
            "type": "text"
          }
        }
      },
      "join": {
        "type": "join",
        "relations": {
          "post": "comment"
        }
      }
    }
  }
}

The data is as follows.

To tell the data apart, every comment's content (comment.content) is "hello" except comment 26, which is "goodbye". The data was inserted as below.

### Insert 2 parent posts

PUT blog/_doc/1
{
  "post":{
    "postId" :1,
    "mentionedPeople":["a","b","c"]
  },
  "join":"post"
}
PUT blog/_doc/2
{
  "post":{
    "postId" :2,
    "mentionedPeople":["b","c","d","e"]
  },
  "join":"post"
}

### Insert 11 comments like the one below.
PUT blog/_doc/11?routing=1
{
  "comment": {
    "commentId": 11,
    "content": "hello"
  },
  "join": {
    "name": "comment",
    "parent": 1
  }
}
## Only the 6th comment of post 2 has content goodbye.
PUT blog/_doc/26?routing=2
{
  "comment": {
    "commentId": 26,
    "content": "goodbye"
  },
  "join": {
    "name": "comment",
    "parent": 2
  }
}

Let's count the comments whose content is hello, per mentioned person.

What I want: a post has several mentioned people and several comments, and I want to aggregate, per person, along mentioned person > post > comments that meet a condition.

That is, b is mentioned in posts 1 and 2 and so has all 11 comments, of which 10 have content hello; a is mentioned in post 1 and has 5 comments with content hello.
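
The intended numbers can be worked out by hand in plain JavaScript before involving Elasticsearch. In the sketch below, `countByPerson` is a hypothetical helper, and the comment ids other than 11 and 26 are assumptions (five comments on post 1, six on post 2, matching the totals described above).

```javascript
// Posts mirror the two parent docs.
const posts = [
  { postId: 1, mentionedPeople: ['a', 'b', 'c'] },
  { postId: 2, mentionedPeople: ['b', 'c', 'd', 'e'] },
];

// Comment ids are assumed: 11..15 on post 1, 21..26 on post 2, only 26 is "goodbye".
const comments = [];
for (let id = 11; id <= 15; id++) {
  comments.push({ commentId: id, parent: 1, content: 'hello' });
}
for (let id = 21; id <= 26; id++) {
  comments.push({ commentId: id, parent: 2, content: id === 26 ? 'goodbye' : 'hello' });
}

// Count comments per mentioned person, keeping only comments that pass `keep`.
function countByPerson(keep = () => true) {
  const counts = {};
  for (const post of posts) {
    const n = comments.filter((c) => c.parent === post.postId && keep(c)).length;
    for (const person of post.mentionedPeople) {
      counts[person] = (counts[person] || 0) + n;
    }
  }
  return counts;
}

const all = countByPerson();
const helloOnly = countByPerson((c) => c.content === 'hello');
console.log(all.b, helloOnly.b, helloOnly.a); // 11 10 5
```

The `keep` predicate has to be applied at the point where comments are counted. Filtering only which posts are selected would still count every comment under those posts.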

1. Aggregate by mentionedPeople -> then aggregate by comment

Because comment is a child, the children aggregation has to be used.

GET blog/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "inner_hits": {
        "_source": false,
        "size": 0
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "comment.content": "hello"
              }
            }
          ]
        }
      }
    }
  },
  "_source": false,
  "aggs": {
    "byworkspaceId": {
      "terms": {
        "field": "post.mentionedPeople"
      },
      "aggs": {
        "commentCount": {
          "children": {
            "type": "comment"
          },
          "aggs": {
            "commentHowMany": {
              "value_count": {
                "field": "comment.commentId"
              }
            }
          }
        }
      }
    }
  }
}

The result is below.

inner_hits shows 5 hello comments for post 1 and 5 for post 2, but the buckets show 11 for b and 11 for c (and 6 each for d and e), so the goodbye comment is not excluded.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      },
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      }
    ]
  },
  "aggregations": {
    "byworkspaceId": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "b",
          "doc_count": 2,
          "commentCount": {
            "doc_count": 11,
            "commentHowMany": {
              "value": 11
            }
          }
        },
        {
          "key": "c",
          "doc_count": 2,
          "commentCount": {
            "doc_count": 11,
            "commentHowMany": {
              "value": 11
            }
          }
        },
        {
          "key": "a",
          "doc_count": 1,
          "commentCount": {
            "doc_count": 5,
            "commentHowMany": {
              "value": 5
            }
          }
        },
        {
          "key": "d",
          "doc_count": 1,
          "commentCount": {
            "doc_count": 6,
            "commentHowMany": {
              "value": 6
            }
          }
        },
        {
          "key": "e",
          "doc_count": 1,
          "commentCount": {
            "doc_count": 6,
            "commentHowMany": {
              "value": 6
            }
          }
        }
      ]
    }
  }
}

2. Apply the filter inside the aggregation as well.

GET blog/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "inner_hits": {
        "_source": false,
        "size": 0
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "comment.content": "hello"
              }
            }
          ]
        }
      }
    }
  },
  "_source": false,
  "aggs": {
    "byworkspaceId": {
      "terms": {
        "field": "post.mentionedPeople",
        "size" : 5
      },
      "aggs": {
        "childCount": {
          "children": {
            "type": "comment"
          },
          "aggs": {
            "inner_filter": {
              "filter": {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "comment.content": "hello"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "commentHowMany": {
                  "value_count": {
                    "field": "comment.commentId"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Adding a filter aggregation inside the children aggregation excludes the goodbye comments, so the counts now come out right. The result is below.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      },
      {
        "_index": "blog",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "inner_hits": {
          "comment": {
            "hits": {
              "total": 5,
              "max_score": 0,
              "hits": []
            }
          }
        }
      }
    ]
  },
  "aggregations": {
    "byworkspaceId": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "b",
          "doc_count": 2,
          "childCount": {
            "doc_count": 11,
            "inner_filter": {
              "doc_count": 10,
              "commentHowMany": {
                "value": 10
              }
            }
          }
        },
        {
          "key": "c",
          "doc_count": 2,
          "childCount": {
            "doc_count": 11,
            "inner_filter": {
              "doc_count": 10,
              "commentHowMany": {
                "value": 10
              }
            }
          }
        },
        {
          "key": "a",
          "doc_count": 1,
          "childCount": {
            "doc_count": 5,
            "inner_filter": {
              "doc_count": 5,
              "commentHowMany": {
                "value": 5
              }
            }
          }
        },
        {
          "key": "d",
          "doc_count": 1,
          "childCount": {
            "doc_count": 6,
            "inner_filter": {
              "doc_count": 5,
              "commentHowMany": {
                "value": 5
              }
            }
          }
        },
        {
          "key": "e",
          "doc_count": 1,
          "childCount": {
            "doc_count": 6,
            "inner_filter": {
              "doc_count": 5,
              "commentHowMany": {
                "value": 5
              }
            }
          }
        }
      ]
    }
  }
}
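
To see why childCount.doc_count (11 for key b) and inner_filter.doc_count (10) differ, here is a minimal in-memory sketch in Java. It does not talk to Elasticsearch. The Post class, the matches helper and the sample data are all made up for illustration; the data is chosen only so that the totals line up with the response above, and the real index contents may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FilteredChildCount {
    // Hypothetical stand-in for a parent post and the content of its child comments.
    static final class Post {
        final List<String> mentionedPeople;
        final List<String> comments;

        Post(List<String> mentionedPeople, List<String> comments) {
            this.mentionedPeople = mentionedPeople;
            this.comments = comments;
        }
    }

    // Crude stand-in for a match query: true if the comment contains the term as a whole word.
    static boolean matches(String comment, String term) {
        return Arrays.asList(comment.toLowerCase().split("\\s+")).contains(term);
    }

    // Per mentioned person: [0] = all children (like childCount.doc_count),
    //                       [1] = children passing the filter (like inner_filter.doc_count).
    static Map<String, long[]> countChildren(List<Post> posts, String term) {
        Map<String, long[]> result = new TreeMap<>();
        for (Post post : posts) {
            long matching = post.comments.stream().filter(c -> matches(c, term)).count();
            if (matching == 0) {
                continue; // has_child: a parent without a matching comment is not a hit
            }
            for (String person : post.mentionedPeople) {
                long[] counts = result.computeIfAbsent(person, k -> new long[2]);
                counts[0] += post.comments.size();
                counts[1] += matching;
            }
        }
        return result;
    }

    // Made-up data: post 1 has five "hello" comments, post 2 has five "hello" and one "goodbye".
    static List<Post> sampleData() {
        List<String> second = new ArrayList<>(Collections.nCopies(5, "hello"));
        second.add("goodbye");
        return Arrays.asList(
                new Post(Arrays.asList("a", "b", "c"), Collections.nCopies(5, "hello")),
                new Post(Arrays.asList("b", "c", "d", "e"), second));
    }

    public static void main(String[] args) {
        countChildren(sampleData(), "hello").forEach((person, c) ->
                System.out.println(person + ": children=" + c[0] + ", matching=" + c[1]));
    }
}
```

The point of the sketch: the has_child query only decides which parents are hits, while the children aggregation then collects every child of those parents. The goodbye comment only drops out of the count once the filter is applied at the child level.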

3. Does it fetch everything at once? What about paging?

Note that the query above sets the terms size to 5, as shown below; if size is omitted, the default is 10.

  "aggs": {
    "byworkspaceId": {
      "terms": {
        "field": "post.mentionedPeople",
        "size" : 5
      },

But what if mentionedPeople contains 10,000 or 100,000 distinct people?

That would produce 10,000 or 100,000 buckets, and returning them all in a single response can cause memory problems.

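4. Adding a composite aggregation

The composite aggregation is meant for bucketing on combinations of sources, but it is also used to page through all buckets of an aggregation. Adding a composite as shown below makes it possible to fetch the aggregation page by page.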

GET blog/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "inner_hits": {
        "_source": false,
        "size": 0
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "comment.content": "hello"
              }
            }
          ]
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "my_compostie": {
      "composite": {
        "sources": [
          {
            "byMentionedPeople": {
              "terms": {
                "field": "post.mentionedPeople"
              }
            }
          }
        ],
        "size": 2
      },
      "aggs": {
        "childCount": {
          "children": {
            "type": "comment"
          },
          "aggs": {
            "inner_filter": {
              "filter": {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "comment.content": "hello"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "commentHowMany": {
                  "value_count": {
                    "field": "comment.commentId"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

The result of the query above contained buckets a and b.

Note the "after_key" in the result, shown below.


 "aggregations": {
    "my_compostie": {
      "after_key": {
        "byMentionedPeople": "b"
      },

On the next query, pass this map as-is in the composite's after parameter to continue from that point.

"aggs": {
    "my_compostie": {
      "composite": {
        "sources": [
          {
            "byMentionedPeople": {
              "terms": {
                "field": "post.mentionedPeople"
              }
            }
          }
        ],
        "size": 2,
        "after": {
          "byMentionedPeople": "b"
        }
      },

Asking for everything after b like this returns c and d.
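
The after_key handoff can be sketched as a loop. This is a minimal Java sketch, not the actual client code: CompositePager, Page and the inMemory fetcher are hypothetical names, and the fetcher is an in-memory stand-in for Elasticsearch so the loop can run on its own. With the high-level client, the fetcher would instead set the cursor via CompositeAggregationBuilder.aggregateAfter(...) and read the next one from the response's composite aggregation afterKey(), as far as I know.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Function;

class CompositePager {
    // One page of bucket keys plus the cursor for the next request (null if the page is empty).
    static final class Page {
        final List<String> keys;
        final Map<String, Object> afterKey;

        Page(List<String> keys, Map<String, Object> afterKey) {
            this.keys = keys;
            this.afterKey = afterKey;
        }
    }

    // Requests pages repeatedly, feeding each after_key back in, until an empty page comes back.
    static List<String> fetchAll(Function<Map<String, Object>, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        Map<String, Object> after = null; // the first request carries no "after"
        while (true) {
            Page page = fetchPage.apply(after);
            if (page.keys.isEmpty() || page.afterKey == null) {
                all.addAll(page.keys);
                break;
            }
            all.addAll(page.keys);
            after = page.afterKey;
        }
        return all;
    }

    // In-memory stand-in for the search call: up to `size` keys, in sorted order, after the cursor.
    static Function<Map<String, Object>, Page> inMemory(Collection<String> keys, int size) {
        TreeSet<String> sorted = new TreeSet<>(keys);
        return after -> {
            SortedSet<String> rest = sorted;
            if (after != null) {
                rest = sorted.tailSet((String) after.get("byMentionedPeople"), false);
            }
            List<String> pageKeys = new ArrayList<>();
            for (String key : rest) {
                if (pageKeys.size() == size) {
                    break;
                }
                pageKeys.add(key);
            }
            Map<String, Object> afterKey = null;
            if (!pageKeys.isEmpty()) {
                afterKey = Collections.<String, Object>singletonMap(
                        "byMentionedPeople", pageKeys.get(pageKeys.size() - 1));
            }
            return new Page(pageKeys, afterKey);
        };
    }

    public static void main(String[] args) {
        List<String> people = java.util.Arrays.asList("e", "c", "a", "d", "b");
        System.out.println(fetchAll(inMemory(people, 2)));
    }
}
```

Stopping on an empty page rather than on a short page keeps the loop simple, at the cost of one extra request at the end.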

5. Written with the Java high-level client

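// Imports assume the 6.x/7.x high-level REST client with the parent-join module on the classpath.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.join.ScoreMode;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.InnerHitBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.join.aggregations.ChildrenAggregationBuilder;
import org.elasticsearch.join.query.HasChildQueryBuilder;
import org.elasticsearch.join.query.JoinQueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSourceBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.TermsValuesSourceBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.FetchSourceContext;

// ESClient is my own helper that returns a RestHighLevelClient.
RestHighLevelClient metaEs = ESClient.getClient("localhost", 9200);

// has_child query: parents with at least one comment matching "hello".
BoolQueryBuilder shouldMatchHello = QueryBuilders.boolQuery()
        .should(QueryBuilders.matchQuery("comment.content", "hello"));
HasChildQueryBuilder hasChildQueryBuilder = JoinQueryBuilders.hasChildQuery("comment", shouldMatchHello, ScoreMode.None);
// innerHit() without an argument only reads the current setting; pass a builder to request inner_hits.
hasChildQueryBuilder.innerHit(new InnerHitBuilder().setSize(0).setFetchSourceContext(FetchSourceContext.DO_NOT_FETCH_SOURCE));

// Composite aggregation over post.mentionedPeople, two buckets per page.
List<CompositeValuesSourceBuilder<?>> sources = new ArrayList<>();
sources.add(new TermsValuesSourceBuilder("byMentionedPeople").field("post.mentionedPeople"));
CompositeAggregationBuilder compositeAggregation = new CompositeAggregationBuilder("my_composite", sources).size(2);

// children -> filter, so only comments matching "hello" are counted.
compositeAggregation.subAggregation(
        new ChildrenAggregationBuilder("childCount", "comment")
                .subAggregation(AggregationBuilders.filter("inner_filter", shouldMatchHello))
);

SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(hasChildQueryBuilder);
sourceBuilder.aggregation(compositeAggregation);
sourceBuilder.size(0);

SearchRequest searchRequest = new SearchRequest();
searchRequest.indices("blog");
searchRequest.source(sourceBuilder);

try {
    SearchResponse response = metaEs.search(searchRequest, RequestOptions.DEFAULT);
    System.out.println(searchRequest);
    System.out.println(response);
} catch (IOException e) {
    System.out.println(e);
}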

6. Actually

Actually, from the moment inner_filter was added, the value_count aggregation inside it became redundant.

The filter result already carries a doc_count. Anyway, the complicated requirement is solved.
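
As a quick check of that claim, here is a small Java sketch that reads both numbers from bucket b of the filtered response above. It assumes the response has already been parsed into nested maps (for example by a JSON library); BucketReader, longAt and sampleBucketB are hypothetical names, and the bucket is rebuilt by hand rather than fetched.

```java
import java.util.HashMap;
import java.util.Map;

class BucketReader {
    // Walks nested maps along the given path and returns the number found at the end.
    static long longAt(Map<String, Object> root, String... path) {
        Object current = root;
        for (String key : path) {
            if (!(current instanceof Map)) {
                throw new IllegalArgumentException("no object to look up '" + key + "' in");
            }
            current = ((Map<?, ?>) current).get(key);
        }
        if (!(current instanceof Number)) {
            throw new IllegalArgumentException("no number at " + String.join(".", path));
        }
        return ((Number) current).longValue();
    }

    // Bucket "b" from the filtered response, rebuilt by hand as nested maps.
    static Map<String, Object> sampleBucketB() {
        Map<String, Object> howMany = new HashMap<>();
        howMany.put("value", 10);

        Map<String, Object> innerFilter = new HashMap<>();
        innerFilter.put("doc_count", 10);
        innerFilter.put("commentHowMany", howMany);

        Map<String, Object> childCount = new HashMap<>();
        childCount.put("doc_count", 11);
        childCount.put("inner_filter", innerFilter);

        Map<String, Object> bucket = new HashMap<>();
        bucket.put("key", "b");
        bucket.put("doc_count", 2);
        bucket.put("childCount", childCount);
        return bucket;
    }

    public static void main(String[] args) {
        Map<String, Object> b = sampleBucketB();
        long viaDocCount = longAt(b, "childCount", "inner_filter", "doc_count");
        long viaValueCount = longAt(b, "childCount", "inner_filter", "commentHowMany", "value");
        System.out.println(viaDocCount + " vs " + viaValueCount);
    }
}
```

One caveat: value_count counts field values, so the two numbers only agree when every matching comment has exactly one comment.commentId, which appears to be the case here.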