🗜️ CLaMP 3 - Multimodal & Multilingual Semantic Music Search
CLaMP 3 is a multimodal and multilingual music information retrieval (MIR) framework that supports sheet music, audio, and performance signals, along with text in 100 languages. Using contrastive learning, it aligns these modalities in a shared representation space for cross-modal retrieval.
🔍 How This Demo Works
- You can retrieve music using any text input (in any language) or an image (`.png`, `.jpg`).
- When using an image, BLIP generates a caption, which is then used for retrieval.
- Since CLaMP 3's training data includes rich visual descriptions of musical scenes, it can match images to semantically relevant music.
- For simplicity, this demo retrieves music based on metadata (text descriptions) rather than directly searching sheet music, MIDI, or audio files.
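The flow described above (text query, or image turned into a caption, then matched against metadata descriptions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the demo's actual code: `CATALOG`, `embed_text`, `search`, and `search_by_image` are hypothetical names, the bag-of-words encoder is only a stand-in for the CLaMP 3 text encoder, and the `captioner` argument stands in for BLIP.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical metadata records (the real demo indexes WikiMT-X descriptions).
CATALOG = [
    {"title": "Piece A", "description": "slow melancholic piano ballad for a rainy evening"},
    {"title": "Piece B", "description": "upbeat swing jazz with a bright brass section"},
    {"title": "Piece C", "description": "gentle acoustic folk song about mountains and rivers"},
]


def embed_text(text: str) -> Counter:
    """Stand-in encoder: bag-of-words counts. NOT the CLaMP 3 text encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank catalog entries by similarity between the query and each description."""
    q = embed_text(query)
    scored = [(item["title"], cosine(q, embed_text(item["description"]))) for item in CATALOG]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def search_by_image(image_path: str, captioner: Callable[[str], str],
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Caption the image first (BLIP in the demo), then reuse the text path."""
    return search(captioner(image_path), top_k)
```

Because the image path reduces to a text query, a single index over metadata descriptions serves both search modes. In a real deployment the stand-in encoder would be replaced by embeddings from the model, so that queries and descriptions can match on meaning rather than on shared words.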
⚠️ Limitations
- This demo retrieves music only from the WikiMT-X benchmark (1,000 pieces).
- These pieces come mainly from the U.S. and, to a lesser extent, Western Europe, and most date from the 20th century.
- As a result, retrieval results skew heavily toward Western 20th-century music, and you are unlikely to find music from other regions or historical periods.
🔧 Need retrieval for a different music collection? Deploy CLaMP 3 on your own dataset.
Generally, a larger and more diverse reference collection improves retrieval quality, because it raises the likelihood that a relevant, well-matched piece exists for a given query.
Note: This project is for research use only.