Thesis Projects
all projects updated 06/2024
This is a list of projects that can serve as a basis for student theses and similar coursework. If you are a student at the IT University of Copenhagen and are interested in working on one of these topics for the degree you’re pursuing, please feel free to get in touch with me.
Speech processing and sensitive data
- Transcription and anonymisation of sensitive conversations for LLM training
This project is in collaboration with the VIRTU research group at Region Hovedstadens Psykiatri. The goal is to create a dataset of therapist-patient interactions for training large language models, which will be used to analyse and support psychotherapy. The available data consists of a large corpus of audio recordings of psychotherapy sessions. The main challenges of this project will be to adapt automatic speech recognition models so that they perform well on this data, and to anonymise the transcriptions by automatically recognising and removing personally identifiable information.
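To give a rough feel for the kind of pipeline involved (not a prescription for how the thesis should solve it), the sketch below transcribes an audio file with an off-the-shelf ASR model and then masks named entities as a crude stand-in for PII removal. The specific models, entity labels and file name are illustrative assumptions; real clinical recordings would need adapted models and a far more careful definition of personally identifiable information.

```python
# Minimal sketch: transcribe audio, then mask named entities as a crude PII filter.
# Model names, entity labels and the file name are illustrative assumptions.
import whisper   # openai-whisper
import spacy

def transcribe_and_anonymise(audio_path: str) -> str:
    # Off-the-shelf ASR; a real system would be adapted to the clinical recordings.
    asr_model = whisper.load_model("base")
    transcript = asr_model.transcribe(audio_path)["text"]

    # NER-based masking; entity types that often carry PII are replaced with placeholders.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(transcript)
    anonymised = transcript
    for ent in reversed(doc.ents):  # edit from the end so character offsets stay valid
        if ent.label_ in {"PERSON", "GPE", "LOC", "ORG", "DATE"}:
            anonymised = anonymised[:ent.start_char] + f"[{ent.label_}]" + anonymised[ent.end_char:]
    return anonymised

if __name__ == "__main__":
    print(transcribe_and_anonymise("session_0001.wav"))  # hypothetical file name
```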
Uncertainty quantification and communication
- Sensitivity of uncertainty measurements to corpus information
The idea of this project is to study how sensitive the uncertainty metrics we can derive from language model output are to variations in the degree of certainty expressed in the finetuning data. This involves creating finetuning datasets with a controlled level of confidence, using a meta-analysis dataset as the source material, and relating the confidence measured in the model's output to the confidence expressed in the finetuning data.
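To make "uncertainty metrics derived from language model output" concrete, here is a minimal sketch of one such metric: the mean entropy of a causal language model's predictive distributions over a given continuation. The model name and example text are placeholders; the project would compute this (and other metrics) for models finetuned on confidence-controlled data.

```python
# Minimal sketch: mean entropy of a causal LM's predictive distributions over a continuation.
# The model name and example text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_entropy(model_name: str, prompt: str, continuation: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    text = prompt + continuation
    inputs = tokenizer(text, return_tensors="pt")
    # Tokenisation at the prompt/continuation boundary is only approximate in this sketch.
    prompt_len = len(tokenizer(prompt)["input_ids"])

    with torch.no_grad():
        logits = model(**inputs).logits[0]          # (seq_len, vocab)

    # Entropy of the predictive distribution at each continuation position.
    probs = torch.softmax(logits[prompt_len - 1 : -1], dim=-1)
    entropies = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return entropies.mean().item()

print(mean_token_entropy("gpt2", "The treatment effect was", " statistically significant."))
```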
- Communicating uncertainty effectively
This project is about aligning the uncertainty a large language model expresses in words with the uncertainty measured numerically with machine learning methods. We want to train a large language model so that the expressed uncertainty, as understood by the user of the language model, matches the uncertainty the model itself “experiences”.
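One very simple way to think about this alignment is sketched below: map verbal hedges to the numeric confidence ranges users associate with them, and check whether the model's measured confidence falls inside the range it verbalised. The mapping here is a toy assumption; part of the project would be to ground it empirically.

```python
# Minimal sketch: check whether a model's verbalised confidence matches a measured one.
# The phrase-to-interval mapping is a toy assumption; a real project would ground it in
# evidence about how users actually interpret these hedges.
VERBAL_TO_INTERVAL = {
    "almost certainly": (0.90, 1.00),
    "probably":         (0.60, 0.90),
    "possibly":         (0.30, 0.60),
    "unlikely":         (0.00, 0.30),
}

def is_aligned(verbal_hedge: str, measured_confidence: float) -> bool:
    """True if the numeric confidence falls in the range users associate with the hedge."""
    low, high = VERBAL_TO_INTERVAL[verbal_hedge]
    return low <= measured_confidence <= high

# Example: the model says "probably" but its measured confidence is 0.97 -> misaligned.
print(is_aligned("probably", 0.97))  # False
```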
- Uncertainty of meaning vs. uncertainty of form
When we measure the uncertainty of a large language model at the token level, this metric conflates the uncertainty of the LLM about the facts it expresses with its uncertainty about how to phrase things. Which of the two is more interesting depends on the use case (say, grammar checking vs. question answering). Metrics have been proposed to separate the two aspects (e.g. semantic entropy, Kuhn et al., ICLR 2023). This project explores how well the two aspects can be separated and how we can obtain reliable estimates of one type of uncertainty over the other.
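The sketch below illustrates the basic idea behind semantic entropy: sample several answers, group them by meaning, and compute entropy over meaning clusters rather than over surface forms. Exact string matching stands in here for the entailment-based clustering used by Kuhn et al.

```python
# Minimal sketch of the idea behind semantic entropy: group sampled answers by meaning
# and compute entropy over the groups rather than over surface forms. Trivial string
# normalisation stands in for entailment-based semantic clustering.
import math
from collections import Counter

def semantic_entropy(sampled_answers: list[str]) -> float:
    # A real implementation would cluster by bidirectional entailment, not string identity.
    clusters = Counter(answer.strip().lower() for answer in sampled_answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())

# Three surface variants of the same answer fall into one cluster; "Lyon." is a second one,
# so the entropy reflects uncertainty about meaning, not about phrasing.
samples = ["Paris.", "paris.", " PARIS.", "Lyon."]
print(semantic_entropy(samples))
```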
Toxicity and bias
- Explainable modelling of specific types of toxic speech
Select a specific type of toxic speech and analyse its characteristic properties, then use these insights to create models that recognise this particular type of speech and automatically identify the precise factors making it toxic. I have previously supervised similar student projects on dehumanising language and on threats; these could potentially be built upon, or you could study an entirely different type of speech.
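As a deliberately transparent starting point, the sketch below trains a linear bag-of-words classifier whose largest weights give a first, crude indication of which surface features drive the "toxic" decision. The tiny inline dataset is purely a placeholder.

```python
# Minimal sketch: a deliberately transparent baseline for one type of toxic speech.
# A linear model over bag-of-words features lets us read off which surface features
# drive the prediction. The tiny inline dataset is purely a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts  = ["they are vermin", "what a lovely day", "those people are parasites", "see you tomorrow"]
labels = [1, 0, 1, 0]  # 1 = dehumanising (the chosen type of toxic speech), 0 = not

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# The highest-weighted features are a first, crude "explanation" of what makes text toxic.
feature_weights = sorted(zip(clf.coef_[0], vectorizer.get_feature_names_out()), reverse=True)
print(feature_weights[:5])
```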
- Target-specific modelling of toxic language
A lot of toxic language targets specific groups of victims, for example particular demographic groups. The idea of this project is to select a group of people that is commonly targeted by toxic behaviour and to develop methods for automatically identifying toxic language aimed at that group, recognising and labelling the ways in which the toxicity is group-specific.
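One sub-task this could involve is deciding whether a message targets the chosen group at all. The sketch below does this with off-the-shelf zero-shot classification; the model name and candidate labels are illustrative assumptions that a thesis would replace with a label scheme designed for the selected group.

```python
# Minimal sketch: zero-shot detection of whether a message targets a chosen group.
# The model name and candidate labels are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

message = "People like them should not be allowed in this country."
result = classifier(
    message,
    candidate_labels=["targets immigrants", "targets women", "targets no particular group"],
)
print(result["labels"][0], result["scores"][0])
```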
Referring expressions
- Modelling ambiguity in referring expressions
Referring expressions are frequently ambiguous. For instance, in the passage “The bomb exploded violently. It created a huge crater.”, the pronoun “It” could refer either to the bomb itself or to the fact that it exploded violently. The idea of this project is to model and evaluate coreference resolution in a way that preserves and respects this ambiguity and doesn’t force an artificial decision on the process.
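A minimal way to preserve that ambiguity is to let the model output a distribution over candidate antecedents rather than a single link, as sketched below. The raw scores are made up for illustration; in practice they would come from a coreference model's mention scorer.

```python
# Minimal sketch: keep coreference ambiguous by outputting a distribution over candidate
# antecedents instead of a single link. The raw scores are made up for illustration.
import math

def antecedent_distribution(scores: dict[str, float]) -> dict[str, float]:
    """Softmax over candidate antecedents, preserving ambiguity rather than forcing a choice."""
    z = sum(math.exp(s) for s in scores.values())
    return {candidate: math.exp(s) / z for candidate, s in scores.items()}

# "It" in "The bomb exploded violently. It created a huge crater."
candidates = {"the bomb": 1.2, "the explosion event": 1.0}
print(antecedent_distribution(candidates))  # roughly 55% vs. 45% -- genuinely ambiguous
```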
- Cross-lingual realisation of referring expressions
Different languages have different preferences for how to select referring expressions in language generation. This becomes evident when you compare translations of the same text across languages. In this project, you develop automatic methods, based on neural machine translation or large language models, to match up translations of referring expressions across languages even when they are not literal.
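One possible ingredient, sketched below, is to match mentions across a translation pair with multilingual embeddings rather than relying on literal correspondence. The model name and the pre-extracted mention lists are assumptions; in the project, mentions would come from mention detection or coreference on parallel text.

```python
# Minimal sketch: match referring expressions across a translation pair using multilingual
# embeddings. The model name and the pre-extracted mention lists are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english_mentions = ["the old man", "his daughter", "the house"]
danish_mentions  = ["huset", "den gamle mand", "hans datter"]

emb_en = model.encode(english_mentions, convert_to_tensor=True)
emb_da = model.encode(danish_mentions, convert_to_tensor=True)
similarity = util.cos_sim(emb_en, emb_da)  # (3, 3) matrix of cosine similarities

for i, mention in enumerate(english_mentions):
    best = similarity[i].argmax().item()
    print(mention, "->", danish_mentions[best])
```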