
Building an App That Gives Music Advice Through a Conversation

What would a search interface without a search box look like? After a few weeks of exploration, I created an app that gives music advice through a conversation, and called it “Musicologist”. This is the story of how I built it, hoping to give you an idea of what kinds of new search interfaces are coming to the fore and how you could start building them yourself.

Introduction: a dying search box

If you were a search box, you would worry about the recent press. From The Verge reporting that it will disappear in the next decade, to eBay saying it will at best “become redundant,” it’s a rough time for our good old search box. But is the end of the search box the end of search interfaces?

If you go beyond the headline, it’s a somewhat different story. The Verge’s article is referring to research by Susan Dumais, who said that “[search box] will be replaced by search functionality that is more ubiquitous, embedded and contextually sensitive”.

So the search box might disappear, but only to be replaced by a new generation of search interfaces: accessible everywhere, integrated in your activities, and aware of your context.

Conversational search interfaces are a first step in this direction. They are ubiquitous, as you can deploy them on any platform where your users reside. You can design them to be integrated in your user’s journey, providing value while interrupting them as little as possible. Finally, they let you leverage your user’s context by remembering previous conversations to adapt to them.

But what would a search interface without a search box look like? I spent a few weeks exploring what has been done in this field and building an example of the kind of UX such an interface could provide: the “Musicologist”. This application gives you music advice through a conversation, and its code is now open source on GitHub. Here is a demo to show you the Musicologist in action:

In this demonstration, you see a search interface built on three components: a mobile frontend, a conversational agent, and a backend powered by Algolia. Let’s see what each component does and how you can build it.

The agent: understanding your users

We built the conversational agent with Dialogflow (formerly api.ai), a conversational API that helps you build conversational experiences in natural language.

At its core, the agent is a middleware capable of turning an input sentence into an intent that can contain several parameters. It can then forward those to a web service, and turn its answer into a reply sentence. Intents describe what the user can say: each one is a different intention that the user can express in many ways.

For the Musicologist there are two main intents: search for songs and choose a result. Each one is defined by the sentences that a user could use to express it; for example, the search for songs intent:

Two expressions that would show an intent to search for songs

As you can see in the screenshot, some words are highlighted. These are parameters of the intent, corresponding to different entities. Each entity describes a category of objects you could recognize. You can use the many system entities provided (phone numbers, city names, months of the year, etc.), or you can define your own if none of them fits your use case.
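
To make the vocabulary concrete, here is a minimal sketch of what an intent with parameters could look like as a data structure. It is a toy stand-in, not how Dialogflow works internally: the intent names, the `ARTIST_ENTITY` table, and the trigger words are all assumptions made up for illustration, and the matching is deliberately exact.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Intent:
    """What the agent extracts from one user sentence."""
    name: str                                   # hypothetical intent name
    parameters: Dict[str, str] = field(default_factory=dict)

# Hypothetical custom entity: canonical value -> accepted surface forms.
ARTIST_ENTITY = {
    "Led Zeppelin": ["led zeppelin", "led zep"],
    "Daft Punk": ["daft punk"],
}

# Hypothetical words hinting that the user wants to search for songs.
SEARCH_TRIGGERS = ("songs", "music", "something by")

def detect_intent(sentence: str) -> Intent:
    """Toy agent: keyword-based intent, exact entity matching."""
    text = sentence.lower()
    if not any(trigger in text for trigger in SEARCH_TRIGGERS):
        return Intent(name="fallback")
    parameters = {}
    for canonical, forms in ARTIST_ENTITY.items():
        if any(form in text for form in forms):
            parameters["artist"] = canonical
            break
    return Intent(name="search_songs", parameters=parameters)

print(detect_intent("Do you have songs by Led Zeppelin?"))
```

A real agent learns from example sentences rather than from keyword lists, but the output has the same shape: one intent name plus the entity values found in the sentence.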

Typo-tolerance and other limitations

There are, however, some limitations in the way entities are recognized. For example, there is almost no typo-tolerance in the entity recognition system. If you use the system in its default configuration, you can only recognize exact entities. Let’s define a project entity called “DocSearch”, and use it in an Intent. The agent recognizes it:

But if you make a typo, the agent is confused and no longer recognizes the entity:

The agent didn’t recognize the DocSearch project

There is an option called Allow automated expansion, but unless you give your agent a lot of examples, it is likely to invent non-existent entities:

With automated expansion, the agent believes “Docsarch” is a valid project

This demonstrates the limitation of an entity recognition system that you cannot customize further. Beyond handling typos, there are several other techniques you would want to use when detecting entities, such as removing superfluous terms or normalizing them.
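
As a rough illustration of what typo-tolerant entity matching involves, here is a minimal sketch using Python's standard `difflib` module. It is not the algorithm Dialogflow or Algolia uses; the list of known projects and the `cutoff` value are assumptions chosen for the example.

```python
import difflib
from typing import List, Optional

# Hypothetical values of the "project" entity.
KNOWN_PROJECTS = ["DocSearch", "InstantSearch", "Musicologist"]

def match_entity(word: str,
                 known: List[str] = KNOWN_PROJECTS,
                 cutoff: float = 0.8) -> Optional[str]:
    """Return the known entity closest to `word`, or None if nothing is close.

    `cutoff` is a similarity ratio between 0 and 1: raising it makes the
    matching stricter, lowering it tolerates more typos at the risk of
    false positives.
    """
    by_lowercase = {name.lower(): name for name in known}
    hits = difflib.get_close_matches(
        word.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[hits[0]] if hits else None

print(match_entity("Docsarch"))  # close enough to "DocSearch"
print(match_entity("weather"))   # nothing similar: None
```

Unlike automated expansion, this approach can only return entities that actually exist, so a typo is either corrected to a known value or rejected.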

Thankfully, you don’t have to implement these on your own. You can augment your agent by leveraging an API to give it more powers:

  • Handling typos: you might think that with conversational interfaces, users won’t make typing mistakes as they are speaking to their devices. But being ubiquitous, your conversational interface can be deployed on various platforms, through which your users can talk (Amazon Alexa, Google Assistant, …) as well as write (Slack, Messenger, …). As your users will expect the same quality of results in both cases, customizable typo-tolerance is key to a great user experience across channels.
  • Ignoring variants: be it via voice or text, your users might not express themselves exactly as you expect. From plurals to declensions or missing diacritics, it is important to understand your user across all variants.
  • Removing optional words: especially via voice, your users might tell you more than what you expect, and phrase a query that’s too precise. But when you don’t find anything relevant to it, you will provide a better experience to your users by giving them results for a part of the query: being able to configure optional words lets you fine-tune this experience.
  • Defining advanced synonyms: you can define two-way synonyms in your agent, but this only helps if the two terms are strictly equivalent. This is not always the case: for example, you might want to show songs of the genre Rockabilly when queried about Rock, since the former is a subgenre of the latter. In this case, you don’t want a query with “Rockabilly Song” to return any Rock song: you can solve this with one-way synonyms.
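
To illustrate the last two points, here is a minimal in-memory sketch of one-way synonyms and optional words. It does not call any search API: the sample songs, the synonym table, and the optional word list are assumptions made up for the example, and a real engine would also rank the results.

```python
from typing import Dict, List, Set

# Hypothetical sample records.
SONGS = [
    {"title": "Blue Suede Shoes", "genre": "rockabilly"},
    {"title": "Whole Lotta Love", "genre": "rock"},
    {"title": "So What", "genre": "jazz"},
]

# One-way: a query for "rock" also matches "rockabilly", not the reverse.
ONE_WAY_SYNONYMS: Dict[str, List[str]] = {"rock": ["rockabilly"]}

# Words that can be dropped when the full query finds nothing.
OPTIONAL_WORDS = {"some", "please", "song", "songs", "music"}

def expand(term: str) -> Set[str]:
    """Return the term plus the terms it is allowed to match one-way."""
    return {term, *ONE_WAY_SYNONYMS.get(term, [])}

def record_words(song: dict) -> Set[str]:
    return set(song["title"].lower().split()) | {song["genre"]}

def strict_search(terms: List[str]) -> List[str]:
    """Every term (or one of its synonyms) must appear in the record."""
    return [s["title"] for s in SONGS
            if all(expand(t) & record_words(s) for t in terms)]

def search(query: str) -> List[str]:
    terms = query.lower().split()
    results = strict_search(terms)
    if not results:
        # Nothing matched the full query: retry without the optional words.
        required = [t for t in terms if t not in OPTIONAL_WORDS]
        results = strict_search(required) if required else []
    return results

print(search("some rock songs please"))  # rock and rockabilly songs
print(search("rockabilly"))              # rockabilly only
```

The asymmetry lives entirely in `expand`: because only "rock" has an entry in the synonym table, a query for "rockabilly" never widens to the parent genre.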

This is only a glimpse of what you can do: there are several relevance features you can use to improve how your conversational interface understands its users.

The backend: coming up with meaningful answers

The agent now gets what the user means (let’s say searching for songs) and any entities found in the intent (for example Led Zeppelin). But how can it provide a relevant answer to its users?

For the Musicologist, the backend is the agent’s intelligence engine: it will use Algolia to connect the intent with the relevant content taken from an index of songs. This gives some memory to the agent, giving it access to a great amount of data — as well as relevance, as the agent can now recognize an artist even if it is mistyped:

The agent now recognizes the correct entity, and can provide relevant results.

The backend’s code, built with Node.js, is actually quite simple: it takes the request from Dialogflow, extracts the search parameters and queries Algolia with these, then transforms the search results into the format expected by the agent.
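
The three steps can be sketched as follows. This is written in Python for brevity and is not the Musicologist's actual Node.js code. The field names `queryResult.parameters` and `fulfillmentText` follow my understanding of Dialogflow's V2 webhook format and should be checked against the version you use; the `search` argument is a hypothetical stand-in for the call to Algolia.

```python
from typing import Callable, List

def handle_webhook(request: dict,
                   search: Callable[[str], List[dict]]) -> dict:
    """Turn an agent request into a reply sentence.

    `search` is a placeholder for the search engine query: it takes a
    query string and returns a list of hits, each with a "title" key.
    """
    # 1. Extract the search parameters sent by the agent (assumed format).
    parameters = request.get("queryResult", {}).get("parameters", {})
    query = " ".join(str(value) for value in parameters.values() if value)

    # 2. Query the search engine.
    hits = search(query)

    # 3. Transform the hits into the reply expected by the agent.
    if not hits:
        text = f"Sorry, I could not find anything for {query}."
    else:
        titles = ", ".join(hit["title"] for hit in hits[:3])
        text = f"Here is what I found: {titles}."
    return {"fulfillmentText": text}

# Usage with a fake search function standing in for the real index.
def fake_search(query: str) -> List[dict]:
    return [{"title": "Kashmir"}] if "led zeppelin" in query.lower() else []

request = {"queryResult": {"parameters": {"artist": "Led Zeppelin"}}}
print(handle_webhook(request, fake_search))
```

Passing the search function as an argument keeps the handler easy to test without a network connection, and lets you swap in the real client later.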

Once your backend is deployed, you can start talking to your search interface! You can deploy it on various platforms in a few clicks:

Deploying your agent to Slack, Messenger, or Alexa is only a few clicks away

For example, here’s what your agent could look like in Slack:

Of course, you can customize each integration further. For example, you can use Messenger’s Quick Replies pattern to propose predefined answers to your agent’s responses. Still, this is quite powerful: your agent can be deployed on pretty much any platform without more work.

You now have a conversational search interface that can be accessed anywhere, and can already satisfy your users through voice-first interfaces or text-based conversational channels like Slack. This approach works best for low-friction interactions where you want to help your users find what they are looking for without interrupting their current flow.

At other times, however, your users will use your search interface to discover the content you can provide. It’s thus important to adapt your search interface accordingly, by helping them navigate and understand your results. In this case, a voice-only interface is not optimal: being able to ask your questions in natural language is great, but getting many search results as a spoken sentence makes them hard to understand. If this is the only way to delve into your data, your users will likely miss the rich search results UI they have gotten used to with consumer-grade search on Google, Amazon, etc. It also makes it harder to interact further with the results: imagine filtering results on several criteria by talking! It’s a lot more difficult to come up with such a sentence than to tap on a few filters.

It seems that for exploring your content, conversational interfaces and graphical search interfaces each have some advantages over the other. But why couldn’t your users have both?

The frontend to navigate your results

To provide our music aficionados with an interface more suited for discovery, we built an Android frontend to the Musicologist. This is the actual search interface where users will interact with the agent.

There are several advantages to having a mobile frontend for our Musicologist. First of all, it helps us leverage existing speech-to-text and text-to-speech systems. The application will use any speech recognition service installed on the device to turn the user’s voice into a written sentence, forward it to the agent, and then voice the response using the installed text-to-speech engine.

But having a mobile frontend brings more than voice-related components. It lets you adapt the user experience for content discovery by providing new ways to grasp the search results and interact with them.

The first area where this is helpful is understanding your search results. A voice interface puts a much higher cognitive load on the user than a graphical interface, because they need to pay attention for as long as it speaks to avoid losing information. When you display the results, the user can read the information whenever they see fit, rather than being forced to listen.

Moreover, you can apply all the usual techniques that help the user get the most out of your results. You can display minor attributes that could be useful, highlight the results, snippet the relevant text, etc.
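
To give an idea of what highlighting means in practice, here is a minimal sketch that wraps the words of the query in markers inside a result's text. It is a simplified illustration, not how the InstantSearch library or Algolia computes highlights; the `<em>` markers are an assumption for the example, and the matching is a plain case-insensitive substring match with no typo-tolerance.

```python
import re

def highlight(text: str, query: str,
              pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap every occurrence of a query word in `pre`/`post` markers."""
    words = [re.escape(word) for word in query.split() if word]
    if not words:
        return text
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    return pattern.sub(lambda match: pre + match.group(1) + post, text)

print(highlight("Stairway to Heaven", "heaven"))
```

The frontend can then render the marked segments in bold or in color, so the user sees at a glance why each result matched.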

This can sound like a lot of work on top of the agent and backend, but it is very straightforward to build with the InstantSearch library, Algolia’s open source library of search patterns, which provides widgets and helpers for building an instant-search experience on Android. It is built on top of Algolia’s Android API Client to provide you with a high-level solution that abstracts the hard work of building such an interface.

Search results, visually highlighted and snippeted when relevant

As you can see, with a graphical interface it will be easier for your users to get the information they want. But that’s not all: you will also make it easier to interact with this information. A lot of the patterns your users learned to expect on great search interfaces still make sense in a voice-enabled application.

Take infinite scrolling as an example: sure, you can ask the agent to show more songs, but now that you have a list of results, it feels quite natural to simply scroll to see more results.

Likewise, you can tell the agent to play a given song, but sometimes a tap on the search result will be a more convenient way to play that song in your favorite music app. With your hands full, it will be natural to voice a query, then ask the agent to select a result, while in another context you’ll find it quicker to tap on a search result.

The power of hybrid search interfaces

Such voice and graphical interfaces bring the best of both worlds: accessibility and ease of use via voice input/output, but also rich information display and interaction possibilities.

On the one hand, you have the possibility to use the search interface hands-free, and you get the results you are looking for without having to touch your phone. You can even get your search results without looking at it!

On the other hand, you can interact with the results without using your voice again, for example if you already started listening to some music and just want to hear the next song. All the familiar interactions with search results can be leveraged in such a hybrid search interface.

The Musicologist is a proof of concept of such a hybrid interface, but many search interfaces could benefit from this approach. There are many use cases for which a hybrid interface would bring value to your users: cooking, driving, troubleshooting your car’s engine… whenever their hands are already busy but a screen would still help. Those are just a few examples of situations where users would benefit both from hands-free control and a rich graphical user interface.

With the Musicologist, I wanted to give you an idea of what kinds of new search interfaces you could build. We hope this article will help you get started with building great voice search interfaces. If you build something with Algolia, please do let us know!