PhD Themes 2024: Measuring Output Quality of Large Language Models
Supervising team: Jakub Šimko (supervisor, KInIT), Dominik Macko (KInIT)
Keywords: generative AI, large language models, dataset creation, dataset augmentation, machine-generated text detection, metrics and evaluation, machine learning
The advent of large language models (LLMs) raises research questions about how to measure the quality and properties of their outputs. Such measures are needed for benchmarking, model improvement, or prompt engineering. Some evaluation techniques pertain to specific domains and usage scenarios (e.g., how accurate are answers to factual questions in a given domain? how well can generated answers be used to train a model for a specific task?), while others are more general (e.g., how diverse are the paraphrases generated by an LLM? how easily can content be detected as machine-generated?).
Through replication studies, benchmarking experiments, metric design, prompt engineering and other approaches, the candidate will advance the methods and experimental methodologies of LLM output quality measurement. Of particular interest are two general scenarios:
- Dataset generation and/or augmentation, where LLMs are prompted with (comparatively small) sets of seeds to create much larger datasets. Such an approach can be very useful when dealing with a domain or task where original (labelled) training data is scarce (such as disinformation detection).
- Detection of generated content, where stylometry-based, deep-learning-based, statistics-based, or hybrid methods are used to estimate whether a piece of content was generated or modified by a machine. Detection ability is crucial for many real-world scenarios (e.g., detection of disinformation or fraud), but it also feeds back into research methodologies (e.g., detecting the presence of generated content in published datasets or in crowdsourced data).
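The seed-based augmentation scenario above can be sketched in a few lines. This is a minimal illustration only: `llm_paraphrase` is a hypothetical placeholder for a real LLM API call (it is stubbed with fixed templates here so the sketch is self-contained), and the label-propagation strategy shown is one simple assumption, not a prescribed method.

```python
import random

# Hypothetical placeholder for a real LLM call (e.g., a prompted chat model);
# stubbed with fixed templates here so the sketch runs without external services.
def llm_paraphrase(text: str, n: int) -> list[str]:
    templates = ["{}", "In other words, {}", "Put differently, {}"]
    return [t.format(text) for t in random.sample(templates, k=min(n, len(templates)))]

def augment_dataset(seeds: list[tuple[str, str]], per_seed: int = 3) -> list[tuple[str, str]]:
    """Expand a small labelled seed set by prompting an LLM for paraphrases,
    keeping each seed's label for its generated variants (a simplifying assumption)."""
    augmented = []
    for text, label in seeds:
        augmented.append((text, label))
        for paraphrase in llm_paraphrase(text, per_seed):
            augmented.append((paraphrase, label))
    return augmented

seeds = [("The vaccine contains microchips.", "disinformation")]
data = augment_dataset(seeds)  # 1 seed + 3 paraphrases, all labelled "disinformation"
```

In practice, the research questions lie precisely in what this sketch glosses over: how diverse and label-faithful the generated variants are, and how models trained on such data perform compared to models trained on original data.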
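To make the stylometry-based branch of the detection scenario concrete, a minimal sketch of surface-feature extraction follows. The particular features (average word length, type-token ratio, sentence length, punctuation ratio) are common illustrative choices, not a claim about any specific detector; in a real system such features would feed a trained classifier.

```python
import re

def stylometric_features(text: str) -> dict[str, float]:
    """Compute simple surface features of the kind used in stylometry-based
    machine-generated text detection; a downstream classifier is assumed."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # guard against empty input
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,  # lexical diversity
        "avg_sentence_len": n_words / max(len(sentences), 1),
        "punct_ratio": sum(c in ",.;:!?" for c in text) / max(len(text), 1),
    }

feats = stylometric_features("This is a short test. It has two sentences.")
```

Deep-learning-based and statistics-based methods replace such hand-crafted features with learned representations or token-probability statistics, and hybrid methods combine both; the benchmarking of these families across languages is one possible experimental direction.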
The candidate will select one of the two general scenarios (but will not be limited to it), identify and refine specific research questions, and answer them experimentally.
- Cegin, J., Simko, J. and Brusilovsky, P., 2023. ChatGPT to Replace Crowdsourcing of Paraphrases for Intent Classification: Higher Diversity and Comparable Model Robustness. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/pdf/2305.12947.pdf
- Macko, D., Moro, R., Uchendu, A., Lucas, J.S., Yamashita, M., Pikuliak, M., Srba, I., Le, T., Lee, D., Simko, J. and Bielikova, M., 2023. MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/pdf/2310.13606.pdf
The research will be performed at the Kempelen Institute of Intelligent Technologies (KInIT, https://kinit.sk) in Bratislava in cooperation with industrial partners or researchers from highly respected research units. A combined (external) form of study and full employment at KInIT is expected.
Jakub Šimko is an expert researcher at KInIT, where he also leads the Web and User Data Processing team. Jakub focuses on the intersection of human computation, machine learning and user modeling. He has recently been working on social media algorithm auditing and misinformation modeling, and promotes interdisciplinary approaches to computer science research. He graduated from Slovak University of Technology in Bratislava, where, after receiving his PhD, he worked for 7 years as a researcher and teacher. He has co-authored more than 30 internationally recognized publications, which together have received more than 350 citations.
Dominik Macko focuses on energy efficiency and security in Internet of Things environments, from both a communication and a device point of view. He works on reducing unnecessary control overhead in establishing secure channels and transmitting data from strictly power-managed sensor nodes. He also deals with anomaly and intrusion detection in IP networks based on communication statistics. Recently, he has focused on robust detection of multilingual machine-generated text.