Quantitative Evaluation Metrics for Generation

Reference: A Comprehensive Assessment of Dialog Evaluation Metrics, https://arxiv.org/abs/2106.03706


Metric	How it works
ADEM	RNN-based metric that scores generated responses; trained with an MSE loss https://github.com/Yoctol/ADEM
RUBER	Combines cosine similarity to the reference with an RNN-based prediction of how well the response fits the dialog history, trained with a ranking loss https://github.com/gmftbyGMFTBY/RUBER-and-Bert-RUBER
BERT-RUBER	Replaces the RNN in RUBER with BERT https://github.com/gmftbyGMFTBY/RUBER-and-Bert-RUBER
PONE	Scores responses using negative examples https://github.com/gmftbyGMFTBY/PONE
MAUDE	Trained with Noise Contrastive Estimation (NCE) to score model responses against negative responses
DEB	BERT-based scoring of relevant versus irrelevant responses
GRADE	Builds a RUBER-based graph over the dialog history and scores the response (turn-level) https://github.com/li3cmz/GRADE
DynaEval	Scores model performance at the dialog level using a graph structure https://github.com/e0397123/DynaEval
USR	Trains several models, each scoring one aspect: LM for fluency; retrieval model for response relevance; fact-to-response model for knowledge appropriateness
USL-H	VUP (Valid Utterance Prediction) for grammaticality; NSP for sensibleness; MLM for appropriateness
DialogRPT	Scores with an ensemble of several GPT-2 models https://github.com/golsun/DialogRPT
Deep AM-FM	AM (Adequacy Metric): BERT-based semantic similarity; FM (Fluency Metric): similarity of language-model probability values https://github.com/e0397123/deep-amfm
HolisticEval	Uses GPT-2 to learn and then score context coherence, language fluency, response diversity, and logical self-consistency
FED	Uses DialoGPT to measure the likelihood of utterances
FlowScore	Built on DialoFlow (a model trained with the CFM, SIM, and RGM objectives); rates dialog quality by whether the dialog is grounded in its history https://github.com/ictnlp/DialoFlow/tree/main/FlowScore
FBD	Uses RoBERTa without fine-tuning to measure distribution-wise differences
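BERTScore	Computes an F1 score from token embeddings https://github.com/lovit/KoBERTScore
BLEURT	Pre-trains BERT on synthetic data, then fine-tunes it on human ratings with an MSE (regression) loss https://github.com/google-research/bleurt
QuestEval	Evaluates factual consistency using question generation (QG) and question answering (QA) https://github.com/ThomasScialom/QuestEval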
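
The RUBER row above combines two pieces: a cosine similarity between the response and the reference, and a score trained with a ranking loss. Below is a minimal sketch of both, assuming toy sentence vectors and a hypothetical margin of 0.5. The function names and all numbers are illustrative and are not taken from the linked repository.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def margin_ranking_loss(pos_score: float, neg_score: float, margin: float = 0.5) -> float:
    """Hinge-style ranking loss: zero once the true response outscores
    the sampled negative by at least `margin` (the margin value is an assumption)."""
    return max(0.0, margin - pos_score + neg_score)

# Toy vectors standing in for pooled word embeddings (made-up values).
reference = np.array([1.0, 0.0, 1.0])
response = np.array([1.0, 0.0, 0.5])

# Referenced part: how close is the response to the reference?
referenced_score = cosine_similarity(reference, response)

# Unreferenced part: hypothetical scores for a true reply and a random negative reply.
loss_separated = margin_ranking_loss(pos_score=0.9, neg_score=0.1)  # already well ranked
loss_confused = margin_ranking_loss(pos_score=0.4, neg_score=0.3)   # still penalised

print(referenced_score, loss_separated, loss_confused)
```

The loss only pushes the scorer while the true response and the negative are closer than the margin, which is why a well-separated pair contributes nothing.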

BERTScore, BLEURT, and QuestEval are used for general generation tasks such as translation and summarization rather than for dialog.
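
To make "F1 from token embeddings" concrete, here is a rough sketch of BERTScore-style greedy matching. It assumes the token embeddings are already available as NumPy arrays; the 2-d vectors below are made up and are not real BERT outputs, and `greedy_match_f1` is a hypothetical helper name, not the API of any BERTScore package.

```python
from typing import Tuple

import numpy as np

def greedy_match_f1(cand: np.ndarray, ref: np.ndarray) -> Tuple[float, float, float]:
    """BERTScore-style precision, recall, and F1 from token embeddings.

    cand: (num_candidate_tokens, dim), ref: (num_reference_tokens, dim).
    """
    # L2-normalise rows so that dot products become cosine similarities.
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T  # shape: (num_candidate_tokens, num_reference_tokens)

    precision = float(sim.max(axis=1).mean())  # best match for each candidate token
    recall = float(sim.max(axis=0).mean())     # best match for each reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings: the candidate repeats one reference token and misses the other.
ref_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
cand_tokens = np.array([[1.0, 0.0], [1.0, 0.0]])

p, r, f1 = greedy_match_f1(cand_tokens, ref_tokens)
print(p, r, f1)
```

In this toy case every candidate token finds a perfect match (precision 1.0), but only half of the reference tokens are covered (recall 0.5), so F1 lands between the two.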

* Pearson correlation coefficient used for STS (Semantic Textual Similarity)

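To compute STS, the human scores and the model scores for each reference-generation pair are compared with each other.
To quantify how well the two sets of scores agree, MSE or the Pearson correlation coefficient is typically used, checking how consistently the values line up.
However, these measures do not themselves account for semantic similarity, and they are sensitive to outliers, so the resulting value tends to be distorted.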
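
The outlier sensitivity can be seen in a small sketch. The human and model scores below are made-up numbers chosen only for illustration; `pearson` and `mse` are hand-written helpers, not library calls.

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between two score vectors."""
    return float(((x - y) ** 2).mean())

# Hypothetical human ratings and model scores for five generations.
human = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
model = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

# Same model scores, except one generation is scored badly wrong.
model_outlier = model.copy()
model_outlier[-1] = 0.0

r_clean, r_outlier = pearson(human, model), pearson(human, model_outlier)
mse_clean, mse_outlier = mse(human, model), mse(human, model_outlier)

print(f"Pearson: {r_clean:.3f} -> {r_outlier:.3f}")
print(f"MSE:     {mse_clean:.3f} -> {mse_outlier:.3f}")
```

A single bad score out of five is enough to move the correlation from near 1 to near 0 and to inflate the MSE by two orders of magnitude, which is the distortion described above.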

