Very Large Scale Music Understanding
BRIAN WHITMAN
The Echo Nest Corporation
Scientists and engineers around the world have been attempting to do the
impossible—and yet, no one can question their motives. When spelled out,
“understanding music” by a computational process just feels offensive. How
can music, something so personal, something rooted in context, culture, and
emotion, ever be labeled by an autonomous process? Even an ethnographical
approach—surveys, interviews, manual annotation—undermines the raw effort
of musical artists, who will never understand, or even, perhaps, take advantage of
what might be learned or created through this research. Music by its very nature
resists analysis.
In the past 10 years, I’ve led two lives—one as a “very long-tail” musician
and artist and the other as a scientist turned entrepreneur who currently sells
“music intelligence” data and software to almost every major music-streaming
service, social network, and record label. How we got from one to the other is
less interesting than what it might mean for the future of expression and what I
believe machine perception can actually accomplish.
In 1999, I moved to New York City to begin graduate studies at Columbia
working on a large “digital government” grant parsing decades of military
documents to extract the meaning of acronyms and domain-specific words. At night
I would swap the laptops in my bag and head downtown to perform electronic
music at various bars and clubs.
As hard as I tried to keep my two lives separate, the walls between them
quickly came down when I began to ask my fellow performers and audience
members how they learn about music. They responded, “We read websites,” “I’m
on a discussion board,” “A friend e-mailed me some songs,” and so on. Obviously,
alongside the media frenzy over peer-to-peer networks
(Napster was just ramping up), a real movement in music discovery was underway.
Technology had been helping us acquire and make music, but all of a sudden it
was being used to communicate and learn about it as well.
With the power to communicate with millions and the seemingly limitless
potential of bandwidth and attention, even someone like me could be noticed. So,
suitably armed with a background in information retrieval and an almost criminal
naiveté about machine learning and signal processing, I quit my degree program
and began to concentrate full time on the practice of what is now known as “music
information retrieval.”
MUSIC INFORMATION RETRIEVAL
The fundamentals of music information retrieval derive from text retrieval.
In both cases, you are faced with a corpus of unstructured data. For music, these
include time-domain samples from audio files and score data from the
compositions. Tasks normally involve extracting readable features from the input and then
developing a model from the features. In fact, music data are so unstructured that
most music-retrieval tasks began as blind “roulette wheel” predictions: “Is this
audio file rock or classical?” (Tzanetakis and Cook, 2002) or “Does this song
sound like this one?” (Foote, 1997).
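The two-step recipe described above (extract features from the raw input, then fit a model to them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two features (RMS energy and zero-crossing rate), the nearest-centroid model, the synthetic sine-wave "clips," and the labels "low" and "high" are all hypothetical choices made for this sketch, not the methods of Tzanetakis and Cook (2002) or Foote (1997).

```python
import math


def extract_features(samples):
    """Map raw time-domain samples to a tiny feature vector.

    Returns (rms_energy, zero_crossing_rate). Both features are
    illustrative choices, not ones named in the text.
    """
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (n - 1))


def train_centroids(labeled_clips):
    """Average the feature vectors per label (a nearest-centroid model)."""
    sums = {}
    for label, samples in labeled_clips:
        feats = extract_features(samples)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = (tuple(t + f for t, f in zip(total, feats)), count + 1)
    return {
        label: tuple(t / count for t in total)
        for label, (total, count) in sums.items()
    }


def predict(centroids, samples):
    """Return the label whose centroid is closest to the clip's features."""
    feats = extract_features(samples)
    return min(centroids, key=lambda label: math.dist(centroids[label], feats))


def sine(freq_hz, seconds=0.1, rate=8000):
    """Synthetic stand-in for an audio clip (assumption: mono, 8 kHz)."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(int(seconds * rate))]


# Hypothetical training set: two "low" clips and two "high" clips.
training = [
    ("low", sine(110)),
    ("low", sine(220)),
    ("high", sine(1760)),
    ("high", sine(2200)),
]
model = train_centroids(training)
label = predict(model, sine(1900))  # an unseen clip
```

The sketch also shows the limitation the essay goes on to describe: the model can only return one of the labels it was trained on, and that label says nothing about how a listener will respond to the music.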
The seductive notion that a black box of some complex nature (usually with a
hopeful success story embedded in its name [e.g., neural networks, Bayesian
belief networks, support vector machines]) might untangle a mess of audio stimuli
in a way that approaches our nervous and perceptual systems’ response is
intimidating enough. That problem is so complex and so hard to evaluate that it distracts
researchers from the much more serious elephantine presence of the emotional
connection that underlies the data.
Suppose the science of music retrieval were rocked by a massive advance in signal
processing or machine learning that solved the problem of label prediction, so that
we could predict the genre of a song with 100 percent accuracy. The question is what
that would do for the musician and what it would do for the listener. If I knew a song I hadn’t
heard yet was predicted to be “jazz” by a computer, this might save me the effort
of looking up information about the artist, who probably spent years defining his/her
expression in terms of or despite these categories. But the jazz label doesn’t tell
me anything about the music, about what I’ll feel when I hear it, about how I’ll
respond or how it will resonate with me individually or in the global community.
In short, we would have built a black box that could neatly delineate other black boxes
but was of no benefit to the very human world of music.
The way out of this feedback loop is to somehow automatically understand
reaction and context the same way we do when we actually perceive music. The
ultimate contextual-understanding system would be able to gauge my personal reaction
and mindset to a piece of music. It would not only know my history and my
influences, but would also understand the larger culture around the musical content.
We are all familiar with the earliest approaches to contextual understanding
of music—collaborative filtering, a.k.a. “people who buy this also buy this”
(Shardanand and Maes, 1995)—and we are just as familiar with its pitfalls. Sales-
or activity-based recommenders only know about you in relation to others—their
meaning of your music is not what you like but what you’ve shared with an
anonymous hive. The weakness of these filtering approaches becomes apparent
when you talk to engaged listeners: “I always see the same bands,” “There’s never
any new stuff,” or “This thing doesn’t know me.”
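The “people who buy this also buy this” idea can be sketched as simple co-occurrence counting. The snippet below is a minimal illustration, not the algorithm of Shardanand and Maes (1995); the function names, the purchase baskets, and the band names are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    co = defaultdict(Counter)
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def also_bought(co, item, k=3):
    """Return up to k items most often bought together with `item`."""
    return [other for other, _ in co.get(item, Counter()).most_common(k)]


# Hypothetical purchase histories, one list per anonymous listener.
baskets = [
    ["band_a", "band_b"],
    ["band_a", "band_b", "band_c"],
    ["band_a", "band_b"],
    ["band_b", "band_c"],
]
co = build_cooccurrence(baskets)
popular = also_bought(co, "band_a", k=2)
unknown = also_bought(co, "new_band")  # no sales yet, so nothing to suggest
```

The empty result for `new_band` is the cold-start weakness the listeners are describing: an item with no shared activity is invisible to this kind of recommender, so the same well-sold bands keep coming back.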
My reaction to the senselessness of filtering approaches was to return to school
and begin applying my language-processing background to music—reading about
music and not just trying to listen to it. The idea was that, if we could somehow
approximate even 1 percent of the data that communities generate about music
on the Internet—they review it, they argue about it on forums, they post about
shows on their blogs, they trade songs on peer-to-peer networks—we could begin
to model large-scale cultural reactions (Whitman, 2005). Thus, in a world of
music activity, we would be able to autonomously and anonymously find a new
band, for example, that collaborative filtering would never touch (because there
are not enough sales data yet) and acoustic filtering would never “get” (because
what makes them special is their background or their fan base or something else
impossible to calculate from the signal).
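One simple way to turn community text into a description of an artist is to weight each term by how distinctive it is for that artist, a standard text-retrieval technique (TF-IDF). The sketch below is an assumption-laden illustration, not the method of Whitman (2005); the function names, artist names, and text snippets are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_profiles(docs_by_artist):
    """Build a {term: weight} profile per artist from text written about them.

    Weight = term frequency within the artist's text times
    log(number of artists / number of artists whose text uses the term).
    """
    counts = {a: Counter(tokenize(" ".join(docs))) for a, docs in docs_by_artist.items()}
    n = len(counts)
    df = Counter(term for c in counts.values() for term in c)
    profiles = {}
    for artist, c in counts.items():
        total = sum(c.values())
        profiles[artist] = {t: (k / total) * math.log(n / df[t]) for t, k in c.items()}
    return profiles


def top_terms(profile, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    return [t for t, _ in sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


# Hypothetical snippets standing in for reviews, forum posts, and blogs.
docs = {
    "artist_x": ["dreamy bedroom guitar", "dreamy tape hiss"],
    "artist_y": ["loud guitar", "loud fast drums"],
}
profiles = tfidf_profiles(docs)
x_terms = top_terms(profiles["artist_x"], k=1)
y_terms = top_terms(profiles["artist_y"], k=1)
```

Because the profile is built from what people write rather than from sales or the audio signal, even an artist with a handful of blog mentions gets a usable description; a term shared by every artist (here, "guitar") receives zero weight.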
THE ECHO NEST
With my co-founder, whose expertise is in musical approaches to signal
analysis (Jehan, 2005), I left the academic world to start a private enterprise, “The
Echo Nest.” We now have 30 people, a few hundred computers, one and a half
million artists, and more than ten million songs. Our biggest challenge has been
the very large scale of the data. Each artist has an Internet footprint that averages
thousands of blog posts, reviews, and forum discussions, in many different languages.
Each song comprises thousands of “indexable” events, and the song itself
might be duplicated thousands of times in different encodings. Most of our
engineering work involves dealing with this huge amount of data. Although we are not
an infrastructure company, we have built many unique data storage and indexing
technologies as a byproduct of our work.
We began the company with the stated goal of indexing everything about
music. And over the past five years we have built a series of products and
technologies that take the best and most practical aspects of our music-retrieval
dissertations and package them cleanly for our customers. The data we collect are
necessarily unique. Instead of storing data on relationships between musicians and
listeners, or only on popular music, we compute and aggregate a sort of
Internet-scale cache of all possible points of information about a song, artist, release,
listener, or event. We sell a music-similarity system that compares two songs based
on their acoustic and cultural properties. We provide data (automatically
generated) on tempo, key, and timbre to mobile applications and streaming services. We
track artists’ “buzz” on the Internet and sell reports to labels and managers.
The heart of The Echo Nest remains true to our original idea. We strongly
believe in the power of data to enable new music experiences. Because we crawl
and index everything, we can level the playing field for all types of musicians
by taking advantage of the information provided to us by any community on the
Internet. Work in music retrieval and understanding requires a sort of wide-eyed
passion combined with a large dose of reality. The computer is never going to
fully understand what music is about, but if we sample from the right sources
often enough and on a large enough scale, the only thing standing in our way is
a leap of faith by listeners.
REFERENCES
Foote, J.T. 1997. Content Based Retrieval of Music and Audio. Pp. 138–147 in Multimedia Storage
and Archiving Systems II, edited by C.-C.J. Kuo, S.-F. Chang, and V. Gudivada. Proceedings of
SPIE, Vol. 3229. New York: IEEE.
Jehan, T. 2005. Creating Music by Listening. Dissertation, School of Architecture and Planning,
Program in Media Arts and Sciences, Massachusetts Institute of Technology.
Shardanand, U., and P. Maes. 1995. Social Information Filtering: Algorithms for Automating ‘Word
of Mouth.’ Pp. 210–217 in Proceedings of ACM (CHI)’95 Conference on Human Factors in
Computing Systems. Vol. 1. New York: ACM Press.
Tzanetakis, G., and P. Cook. 2002. Musical genre classification of audio signals. IEEE Transactions
on Speech and Audio Processing 10(5): 293–302.
Whitman, B. 2005. Learning the Meaning of Music. Dissertation, School of Architecture and Planning,
Program in Media Arts and Sciences, Massachusetts Institute of Technology.