2024-12-19 Copy TIL documents

rlaisqls · Dec 19, 2024 · a2cb5a8 · a2cb5a8
1 parent b2ec68b
commit a2cb5a8

Showing 3 changed files with 78 additions and 3 deletions.
diff --git a/src/content/docs/TIL/DevOps/AWS/Analytics/Athena.md b/src/content/docs/TIL/DevOps/AWS/Analytics/Athena.md
@@ -42,6 +42,58 @@ lastUpdated: 2024-03-13T15:17:56
 
 - For the full list of SerDes that AWS supports and how to use them, see the [official documentation](https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html).
 
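+### Configuring partition projection
+
+A partition, as the name suggests, groups data by some unit so that files in the relevant category can be found faster.
+
+Suppose files are stored under date-based paths such as `/{yyyy}/{MM}/{dd}`. Once partitions are configured, a query for December 2024 makes Athena skip the other folders and scan only the `/2024/12` path, so the query runs faster. As data volume grows, partitioning becomes essential for query performance.
+
+Partition metadata can be registered manually in the AWS Glue Data Catalog or another external catalog, or Athena can compute it on the fly at query time. When partitions change frequently, letting Athena compute them is more efficient, because no partition metadata has to be kept up to date.
+
+This feature is called partition projection.
+
+To apply partition projection, add settings in two places: `PARTITIONED BY` and `TBLPROPERTIES`.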
+
+- `PARTITIONED BY`
+
+  - Defines the names of the partition columns.
+
+- `TBLPROPERTIES`
+
+  - `'projection.enabled'='true'`: Enables projection.
+
+  - `'projection.{field-name}.type'='{type}'`: Sets the type of the column (field) to partition on.
+
+    - There are four types: enum, integer, date, and injected.
+
+    - Each type needs its own setting for the allowed partition values (for example, `range` for integer). See the [official documentation](https://docs.aws.amazon.com/ko_kr/athena/latest/ug/partition-projection-supported-types.html) for details.
+
+  - `'storage.location.template'='s3://...'`: Describes how the partition columns appear in the folder path.
+
+- The example below uses year, month, day, and hour as partition fields.
+
+```
+PARTITIONED BY ( 
+  `year` string, 
+  `month` string, 
+  `day` string, 
+  `hour` string)
+ROW FORMAT SERDE ...
+LOCATION ...
+TBLPROPERTIES (
+  'projection.enabled'='true', 
+  'projection.day.type'='integer',
+  'projection.day.range'='1, 31', 
+  'projection.hour.type'='integer', 
+  'projection.hour.range'='0, 23', 
+  'projection.month.type'='integer', 
+  'projection.month.range'='1, 12', 
+  'projection.year.type'='integer',
+  'projection.year.range'='2024, 2124', 
+  'storage.location.template'='s3://test-log-bucket/${year}/${month}/${day}/${hour}/'
+);
+```
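+
+As a rough sketch of how a query benefits from these settings, assume the fragment above belongs to a hypothetical table named `test_logs` (the table name is an assumption, not part of the original DDL). Filtering on the partition columns lets Athena derive the matching S3 prefixes from the template instead of listing every folder:
+
+```sql
+-- Hypothetical table name. The partition columns are declared as strings
+-- in PARTITIONED BY, so the filter values are quoted.
+SELECT count(*) AS row_count
+FROM test_logs
+WHERE year = '2024'
+  AND month = '12'
+  AND day = '19';
+-- hour is not filtered, so every hour in its projected range (0 to 23) is considered.
+```
+
+If the folder names are zero-padded (for example `/2024/12/05/`), the integer type would likely also need a `'projection.day.digits'='2'` style setting; whether that applies depends on how the files were written.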
+
 ### Creating a table
 
 After creating the table, run the REPAIR command to register the partition metadata.
@@ -55,4 +107,3 @@ reference
 
 - <https://aws.amazon.com/athena/faqs/?nc=sn&loc=6>
 - <https://docs.aws.amazon.com/athena/latest/ug/serde-reference.html>
-
diff --git a/...g/datadog/Anomaly detection Algorithms.md → ...adog/DatadogAnomalydetectionAlgorithms.md b/...g/datadog/Anomaly detection Algorithms.md → ...adog/DatadogAnomalydetectionAlgorithms.md
@@ -1,6 +1,6 @@
 ---
-title: 'Anomaly detection Algorithms'
-lastUpdated: 2024-12-19T02:24:25
+title: 'DatadogAnomalydetectionAlgorithms'
+lastUpdated: 2024-12-18T09:46:08
 ---
 Datadog trains on up to six weeks of data to set its anomaly detection baseline, then computes it with one of the three algorithms below.
 

diff --git a/src/content/docs/TIL/개발/압축 알고리즘.md b/src/content/docs/TIL/개발/압축 알고리즘.md
@@ -0,0 +1,24 @@
+---
+title: 'Compression algorithms'
+lastUpdated: 2024-12-19T20:01:28
+---
+- BZIP2: A format that uses the Burrows-Wheeler algorithm.
+
+- DEFLATE: A compression algorithm based on [LZSS](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Storer%E2%80%93Szymanski) and [Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding). [Deflate](https://en.wikipedia.org/wiki/Deflate) is relevant only for the Avro file format.
+
+- GZIP: A compression algorithm based on Deflate. For Hive tables in Athena engine versions 2 and 3, and for Iceberg tables in Athena engine version 2, GZIP is the default write compression format for files in the Parquet and text file storage formats. Files in the `tar.gz` format are not supported.
+
+- LZ4: A member of the Lempel-Ziv 77 (LZ77) family that also focuses on compression and decompression speed rather than maximum compression of data. LZ4 has the following framing formats:
+  - LZ4 Raw/Unframed: An unframed, standard implementation of the LZ4 block compression format. See the [LZ4 block format description](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).
+  - LZ4 framed: The usual framing implementation of LZ4. See the [LZ4 frame format description](https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md).
+  - LZ4 hadoop-compatible: The Apache Hadoop implementation of LZ4. It wraps LZ4 compression with the [BlockCompressorStream.java](https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/BlockCompressorStream.java) class.
+
+- LZO: A format that uses the Lempel–Ziv–Oberhumer algorithm, which focuses on high compression and decompression speed rather than maximum compression of data. LZO has two implementations:
+  - [Standard LZO](http://www.oberhumer.com/opensource/lzo/#abstract)
+  - LZO hadoop-compatible: A version that wraps the LZO algorithm with the [BlockCompressorStream.java](https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/BlockCompressorStream.java) class.
+
+- SNAPPY: A compression algorithm in the Lempel-Ziv 77 (LZ77) family. Snappy focuses on high compression and decompression speed rather than maximum compression of data.
+
+- ZLIB: Based on Deflate, ZLIB is the default write compression format for files in the ORC data storage format. For details, see the [zlib](https://github.com/madler/zlib) page on GitHub.
+
+- ZSTD: The [Zstandard real-time data compression algorithm](http://facebook.github.io/zstd/) is a fast compression algorithm that provides high compression ratios. The Zstandard (ZSTD) library is provided as open source software under a BSD license. ZSTD is the default compression for Iceberg tables. When writing ZSTD-compressed data, Athena uses ZSTD compression level 3 by default.
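+
+As a sketch of how one of these formats might be selected when writing data, the CTAS statement below sets the `write_compression` and `compression_level` table properties. The table names `test_logs` and `test_logs_zstd` are hypothetical, and `compression_level` is assumed to be available in the Athena engine version in use:
+
+```sql
+-- Hypothetical table names; writes Parquet files compressed with ZSTD.
+CREATE TABLE test_logs_zstd
+WITH (
+  format = 'PARQUET',
+  write_compression = 'ZSTD',
+  compression_level = 3  -- same value as the default level noted above
+) AS
+SELECT *
+FROM test_logs;
+```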