🗜️ CLaMP 3 - Multimodal & Multilingual Semantic Music Search

CLaMP 3 is a multimodal and multilingual music information retrieval (MIR) framework that supports sheet music, audio, and performance signals, together with text in 100 languages. Using contrastive learning, it aligns these modalities in a shared representation space so that a query in one modality can retrieve items in another.
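
To make the idea of contrastive alignment concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired text and music embeddings. It is not CLaMP 3's training code: the function names, the temperature value, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math

def normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def contrastive_loss(text_embs, music_embs, temperature=0.07):
    # Illustrative symmetric InfoNCE loss; row i of each list is assumed to be a matching pair.
    t = [normalize(v) for v in text_embs]
    m = [normalize(v) for v in music_embs]
    n = len(t)
    # Similarity matrix: entry (i, j) compares text i with music j.
    logits = [
        [sum(a * b for a, b in zip(t[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(matrix):
        # Average cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)
            log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_sum_exp - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    # Average the text-to-music and music-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))

# Toy embeddings (made up): matched pairs point in the same direction.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_music = [[1.0, 0.0], [0.0, 1.0]]
shuffled_music = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(texts, aligned_music)
shuffled_loss = contrastive_loss(texts, shuffled_music)
```

In this toy setup the loss is near zero when each text embedding points the same way as its paired music embedding, and large when the pairs are shuffled, which is the pressure that pulls matching items together in the shared space.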

🔍 How This Demo Works

  • You can retrieve music using any text input (in any language) or an image (.png, .jpg).
  • When using an image, BLIP generates a caption, which is then used for retrieval.
  • Since CLaMP 3's training data includes rich visual descriptions of musical scenes, it can match images to semantically relevant music.
  • For simplicity, this demo retrieves music based on metadata (text descriptions) rather than directly searching sheet music, MIDI, or audio files.
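
The flow described above (text query, or image caption used as the query, matched against text metadata) can be sketched roughly as follows. This is only an illustration under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for a real text encoder, the `METADATA` entries are made up, and the caption is passed in as a plain string rather than produced by BLIP.

```python
import math
import re
from collections import Counter

def embed(text):
    # Hypothetical stand-in for a learned text encoder: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, metadata, top_k=3):
    # Rank pieces by similarity between the query and each text description.
    q = embed(query)
    scored = [(cosine(q, embed(desc)), title) for title, desc in metadata.items()]
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_k]]

def search(text=None, caption=None, metadata=None, top_k=3):
    # An image query is reduced to its caption; a text query is used directly.
    query = caption if caption is not None else text
    return retrieve(query, metadata, top_k)

# Made-up metadata for illustration only.
METADATA = {
    "Piece A": "slow jazz ballad with piano and saxophone",
    "Piece B": "fast rock song with electric guitar and drums",
    "Piece C": "orchestral film score with strings and brass",
}

text_hits = search(text="jazz piano", metadata=METADATA, top_k=1)
image_hits = search(caption="a rock band playing guitar on stage", metadata=METADATA, top_k=1)
```

A real deployment would replace `embed` with the model's text encoder and precompute the metadata embeddings once, but the ranking step stays the same: score every description against the query and return the top matches.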

⚠️ Limitations

  • This demo retrieves music only from the WikiMT-X benchmark (1,000 pieces).
  • These pieces come mainly from the U.S. and Western Europe, with the U.S. most heavily represented, and date mostly from the 20th century.
  • As a result, retrieval results skew heavily toward Western 20th-century music, and you are unlikely to find music from other regions or historical periods.

🔧 Need retrieval for a different music collection? Deploy CLaMP 3 on your own dataset.
Generally, a larger and more diverse reference dataset improves retrieval quality, because it raises the likelihood that a relevant, well-matched piece is available to retrieve.

Note: This project is for research use only.
