Description
A narrator tells a story slightly differently each time. Imagine an invisible conductor controlling the stage lighting, projections, music, and sounds in real time, following the narrator to amplify the experience for the audience.
Tasks
Create a pipeline that starts with Whisper for speech recognition of the narrator. Feed the recognized text into a fine-tuned LLM, which issues commands to the per-modality generators that control the appearance of the scene. Each student will handle a single modality of their choice. The main focus is on authoring (i.e. programming) a restricted generative environment controllable by a set of parameters, fine-tuning the LLM for it, and crafting queries for using the LLM in real time. Support tasks involve connecting the components in a unified environment, preferably .NET, and creating a framework for running the whole pipeline in experiments.
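To make the idea of a restricted, parameter-controlled environment more concrete, here is a minimal sketch in Python (chosen for brevity; the project itself targets .NET). Everything in it is an assumption made for illustration: the lighting parameters and their ranges, the JSON command format, and the `fake_llm` stand-in that replaces both Whisper and the fine-tuned LLM. It only shows one possible way to validate LLM output before it reaches a generator.

```python
import json
from dataclasses import dataclass

# Hypothetical parameter set for a lighting generator: name -> (min, max).
LIGHT_PARAMS = {
    "brightness": (0.0, 1.0),
    "hue": (0.0, 360.0),
    "fade_seconds": (0.0, 10.0),
}


@dataclass
class LightState:
    """Current state of the (hypothetical) lighting generator."""
    brightness: float = 0.5
    hue: float = 0.0
    fade_seconds: float = 1.0


def parse_command(llm_output: str) -> dict:
    """Parse the LLM reply as JSON; keep only known parameters, clamped to range."""
    try:
        raw = json.loads(llm_output)
    except json.JSONDecodeError:
        return {}  # ignore malformed replies instead of crashing mid-show
    if not isinstance(raw, dict):
        return {}
    command = {}
    for name, value in raw.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if name not in LIGHT_PARAMS or not is_number:
            continue  # drop unknown parameters and non-numeric values
        low, high = LIGHT_PARAMS[name]
        command[name] = min(max(float(value), low), high)
    return command


def apply_command(state: LightState, command: dict) -> LightState:
    """Update the generator state with an already validated command."""
    for name, value in command.items():
        setattr(state, name, value)
    return state


def fake_llm(transcript: str) -> str:
    """Stand-in for the fine-tuned LLM; a real system would query the model here."""
    if "storm" in transcript.lower():
        # "strobe" is deliberately not in LIGHT_PARAMS and will be dropped.
        return json.dumps({"brightness": 0.1, "hue": 240, "strobe": 5})
    return json.dumps({"brightness": 2.0})  # out of range, will be clamped


def run_pipeline(transcripts, llm=fake_llm) -> LightState:
    """Feed each recognized text segment to the LLM and apply its command."""
    state = LightState()
    for text in transcripts:
        state = apply_command(state, parse_command(llm(text)))
    return state
```

In this sketch, calling `run_pipeline` with a list of transcript strings returns the final `LightState`. Restricting commands to a fixed, clamped parameter set means that an unexpected LLM reply can at worst be ignored, rather than put the generator into an undefined state.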
Requirements
- Knowledge of the English language (source code comments and the final report should be in English)
- Experience with C# and Python
- Basic knowledge of large language models (LLMs)
- Docker and Linux knowledge advantageous
Environment
The project should be implemented as a standalone .NET application wrapped in a Docker container. It may use helper libraries such as:
- https://github.com/sandrohanea/whisper.net
- https://github.com/SciSharp/LLamaSharp
- https://opentk.net
- https://github.com/sinshu/meltysynth