PhD Themes 2024: Human-AI Collaboration in Dataset Creation

Supervising team: Jakub Šimko (supervisor, KInIT), Peter Brusilovsky (University of Pittsburgh), Jana Kosecka (George Mason University), Peter Dolog (Aalborg University)
Keywords: generative AI, large language models, machine learning, human in the loop, crowdsourcing, human computation, active learning.

The models created in machine learning can only be as good as the data on which they are trained. Researchers and practitioners thus strive to provide their training processes with the best data possible. It is not uncommon to invest considerable human effort upfront to achieve good overall data quality (e.g. through annotation). Yet sometimes upfront dataset preparation cannot be done properly, sufficiently, or at all.

In such cases, so-called human-in-the-loop solutions employ human effort to improve machine-learned models through actions taken during the training process and/or during the deployment of the models (e.g. user feedback on automated translations). They are particularly useful for surgical improvements of training data through the identification and resolution of border cases.

Human-in-the-loop approaches draw from a wide palette of techniques, including active and interactive learning, human computation, and crowdsourcing (also with motivation schemes such as gamification and serious games). With the recent emergence of large language models (LLMs), the original human-in-the-loop techniques can be further boosted to create extensive synthetic training sets with comparatively little human effort.

Human-in-the-loop approaches are applied predominantly in domains with highly heterogeneous and volatile data. Such domains include online false information detection, online information spreading (including the spreading of narratives or memes), auditing of social media algorithms and their tendencies to spread disinformation, support of manual/automated fact-checking, and more.

Relevant publications:

  • Cegin, J., Simko, J. and Brusilovsky, P., 2023. ChatGPT to Replace Crowdsourcing of Paraphrases for Intent Classification: Higher Diversity and Comparable Model Robustness. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  • Šimko, J. and Bieliková, M. Semantic Acquisition Games: Harnessing Manpower for Creating Semantics. 1st Edition. Springer Int. Publ. Switzerland. 150 p.

The research will be performed at the Kempelen Institute of Intelligent Technologies (KInIT) in Bratislava, in cooperation with industrial partners or researchers from highly respected research units. A combined (external) form of study and full employment at KInIT is expected.

Supervising team

Jakub Šimko, Lead researcher, KInIT

Jakub Šimko is an expert researcher at KInIT, where he also leads the Web and User Data Processing team. Jakub focuses on the intersection of human computation, machine learning and user modeling. He has recently been working on social media algorithm auditing and misinformation modeling, and he promotes interdisciplinary approaches to computer science research. He graduated from the Slovak University of Technology in Bratislava, where, after receiving his PhD, he worked for 7 years as a researcher and teacher. He has co-authored more than 30 internationally recognized publications, which have together received more than 350 citations.

Peter Brusilovsky, Professor, University of Pittsburgh, USA

Peter Brusilovsky is a Professor at the School of Computing and Information, University of Pittsburgh, where he directs the Personalized Adaptive Web Systems (PAWS) lab. His research is focused on user-centered intelligent systems in the areas of adaptive learning, recommender systems, and personalized health. He is a recipient of the Alexander von Humboldt Fellowship, the NSF CAREER Award, and the Fulbright-Nokia Distinguished Chair. Peter served as the Editor-in-Chief of IEEE Trans. on Learning Technologies and as a program chair for several conferences, including RecSys.

Jana Kosecka, Professor, George Mason University, USA

Jana Kosecka is a Professor at George Mason University. She is interested in computational models of vision systems, acquisition of static and dynamic models of environments by means of visual sensing, high-level semantic scene understanding and human-computer interaction. She has held visiting positions at UC Berkeley, Stanford University, Google and Nokia Research, and has served as program chair, area chair or senior member of the editorial board for leading conferences in the field (CVPR, ICCV, ICRA).

Jana is currently a mentor of our PhD student Ivana Beňová.

Peter Dolog, Associate Professor, Aalborg University, Denmark

Peter Dolog is an Associate Professor at the Department of Computer Science, Aalborg University, Denmark. His current research interests include machine learning and data mining in the areas of user behavior analysis and prediction, recommender systems, preference learning, and personalization. Peter is a senior member of ACM and has served as a senior program committee member of AI-related conferences as well as a general chair of the UMAP, HT and Web Engineering conferences.