ABSTRACT
Existing violin performance datasets do not fully capture the intricate techniques of musical expression. To address this gap, we built an augmented violin system with a range of integrated sensors, yielding a multimodal dataset intended to support research in music information retrieval. The system simultaneously captures audio, video, bow pressure, bow tilt, bow and fingerboard position, violin and bow orientation, and room impulse response from ten advanced violinists performing studio-grade repertoire.
We further use hardware-captured features to correct the output of a state-of-the-art software pitch transcription algorithm, yielding highly accurate MIDI data enriched with legato bowing, bow speed, and contact point. Our dataset achieves a 62.8% improvement in note error rate over MUSC and a 14.6% relative reduction in perceived audio distance (Zimtohrli metric).
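The sensor-assisted correction step can be illustrated with a minimal sketch. All names here are our own illustration, not the actual pipeline: we assume the touch sensors yield a stopped-length ratio per string, from which the expected pitch follows (12·log₂ of the length ratio above the open-string pitch), and the software estimate is snapped to it when the two disagree.

```python
import math

# Open-string MIDI pitches for a violin in standard G-D-A-E tuning.
OPEN_STRING_MIDI = {"G": 55, "D": 62, "A": 69, "E": 76}

def expected_midi(string: str, stop_ratio: float) -> float:
    """Expected MIDI pitch from a touch-sensor reading.

    stop_ratio = vibrating length / open-string length, in (0, 1];
    1.0 means the string is played open.
    """
    return OPEN_STRING_MIDI[string] + 12 * math.log2(1.0 / stop_ratio)

def snap(software_midi: float, string: str, stop_ratio: float,
         tolerance: float = 0.5) -> int:
    """Override the software pitch estimate when it disagrees with the
    sensor-derived pitch by more than `tolerance` semitones."""
    sensor_midi = expected_midi(string, stop_ratio)
    if abs(software_midi - sensor_midi) > tolerance:
        return round(sensor_midi)   # trust the hardware reading
    return round(software_midi)
```

Halving the vibrating length of the A string, for example, raises the expected pitch by an octave (MIDI 69 → 81), so a software estimate of 70 on an open A string would be snapped back to 69.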
AUGMENTED VIOLIN SYSTEM
The augmented violin integrates several distinct sensor modalities directly onto a violin and bow, with all data routed through a Raspberry Pi Pico 2W MCU mounted on a custom protoboard.
Place photo here: img/violin-top.jpg
Place photo here: img/violin-bottom.jpg
Place photo here: img/violin-bow.jpg
Hardware System Summary
| Sensor | Model | Purpose |
|---|---|---|
| Touch sensors (×4) | TSP-L | Per-string finger position |
| Ultrasonic sensor | SR-04 | Lateral bow position from bridge |
| IMU — violin body | ICM-20948 | Violin orientation & movement |
| IMU — bow | ICM-20948 | Bow orientation & movement |
| Proximity sensors (×4) | VCNL4010 | Bow pressure & contact location |
| MCU | Raspberry Pi Pico 2W | Sensor interfacing & data transmission |
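The streams in the table above reach the Pico on a shared timebase before transmission. A minimal sketch of how one synchronized sample might be framed into a fixed-size packet; the field layout, names, and units are our illustrative assumptions, not the actual firmware:

```python
import struct

# Hypothetical packet layout (little-endian, no padding):
# uint32 ms timestamp, 4x uint16 finger positions, float32 bow
# distance in cm, 4x uint16 proximity (pressure) readings.
PACKET_FMT = "<I4Hf4H"

def pack_sample(t_ms, finger_pos, bow_dist_cm, proximity):
    """Serialize one synchronized sensor sample into a fixed-size packet."""
    return struct.pack(PACKET_FMT, t_ms, *finger_pos, bow_dist_cm, *proximity)

def unpack_sample(packet):
    """Recover the sample fields from a received packet."""
    t_ms, f0, f1, f2, f3, dist, p0, p1, p2, p3 = struct.unpack(PACKET_FMT, packet)
    return {"t_ms": t_ms, "finger_pos": (f0, f1, f2, f3),
            "bow_dist_cm": dist, "proximity": (p0, p1, p2, p3)}
```

A fixed binary layout like this keeps every packet the same length (24 bytes here), which simplifies framing on the receiving side.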
DATASET
The dataset comprises 300 minutes of studio-grade solo classical violin recordings from 10 advanced violinists, with synchronized sensor streams, video, and bow-position labels aligned to the audio.
Comparison with Existing Datasets
| Dataset | Performers | Length | Modalities |
|---|---|---|---|
| Ours | 10 | 300 min | Audio, MIDI Transcriptions, Video, Positions, Fingering, etc. |
| Violin Etudes | 21* | 1668 min | Audio, MIDI Transcriptions |
| URMP | 2 | 80 min | Audio, MIDI Transcriptions, Video |
| Bach10 | 1 | 5 min | Audio, MIDI Transcriptions |
* Violin Etudes' data is heavily skewed towards performers 1 and 2.
RESULTS
Note Error Rate
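Note error rate is the standard edit-distance metric for transcription evaluation: insertions, deletions, and substitutions between the reference and transcribed note sequences, normalized by the reference length. A minimal sketch of the computation (our own illustration, not the evaluation code used for the reported figures):

```python
def note_error_rate(ref, hyp):
    """Levenshtein distance between reference and hypothesis note
    sequences, normalized by the reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all remaining reference notes
    for j in range(n + 1):
        d[0][j] = j                 # insert all remaining hypothesis notes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(m, 1)
```

For example, transcribing one note of a three-note passage a semitone off counts as one substitution, giving a note error rate of 1/3.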
Perceived Audio Distance (Zimtohrli)
ACKNOWLEDGEMENTS
We are grateful to Professor Olga Vechtomova of the University of Waterloo NLP Lab for her support as our faculty mentor. This study has been reviewed and received ethics clearance through a University of Waterloo Research Ethics Board (REB #47874).