Want to Build Your Own Chatbot? Here's How We Got Started

7 May 2018 , by Simon van Dyk

Chatbots are popping up all over the place. Like many other developers, we’ve long dreamed about talking to our computers and wanted to build our own Artificial Intelligence. What initially seemed like an incredibly daunting task turned out to be surprisingly simple - even without any special training!

We have now built a few different prototypes, one of which is a voice-controlled banking assistant that uses Dialogflow, Google Home and the Root Banking API. To spare you the fears and doubts we had to work through to get here, we decided to share our five key lessons from building chatbots.


Right up front, let’s do away with some common misconceptions about chatbots:

Chatbots are for the elite - Anybody who knows how to use an API can build a chatbot. We've done it, so you can too!

You have to know machine learning to build a chatbot - Google’s web services to the rescue! We'll talk about this in more detail.

You can plug in a voice interface with no changes to the bot - We learned this the hard way: Bots can (and will) read text differently than intended, and you will have to account for this.

Conversations follow a single path - Boy, were we wrong! When you draw up a conversation, it looks like a decision tree, not a straight line. That means you need to account for different branches of the conversation in your bot logic!

Lesson 1 - Arrie: Chatbots are easy to build ...if you use an API

Artificial Intelligence (AI) is a world of self-driving cars, facial recognition, cancer diagnostics and all things futuristic - something I felt was inaccessible to me. I had a burning desire to know more and to actually build the insanely cool things we often only see in movies and comics, but I had no idea how to get there.

As a self-taught developer, this feeling isn’t all too new to me: Everything I know, I’ve learnt through online courses, videos, tutorials, documentation, and plain old trial and error. A friend eventually talked me into doing a Nanodegree on Deep Learning through Udacity. I really enjoyed this course, but after four months of online learning, I still felt like I couldn’t quite apply myself.

My newly acquired AI-knowledge actually made the prospect of building my own chatbot even more daunting because I finally understood how complex AI can be. Back then, I thought you had to build it all - a scary list of moving parts:

  • The text interface to the bot,
  • The voice interface, including recording the audio from the user, converting the audio into text and synthesizing the voice response,
  • The Natural Language Processing (NLP) models to extract the required information, i.e. neural networks and rules engines and
  • The API to easily access the NLP models.

Building chatbots with DialogFlow

Eventually, however, I discovered what I consider to be pure gold: a Google service called DialogFlow (still called API.ai at that stage). It’s an API that you can easily integrate:

- It handles the NLP: You no longer have to worry about building the neural networks or rules engines to extract the information from a user’s input. Instead, you simply send a request to an API and create a webhook for it to send information to you.
- It also provides a simple text interface: It offers integrations into various platforms such as Slack and Telegram messenger. This way, you can use the plug-in of your choice.
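To make the webhook idea concrete, here is a minimal sketch of a fulfillment handler in JavaScript. The request shape follows DialogFlow’s v2 webhook format; the intent name, prices and routing below are made-up illustrations, not part of the API:

```javascript
// Sketch of a DialogFlow fulfillment webhook handler.
// DialogFlow POSTs a JSON body describing the matched intent and its
// parameters; we answer with the text the bot should say back.
// The intent name and prices are illustrative stand-ins.
function handleWebhook(body) {
  const intent = body.queryResult.intent.displayName;
  const params = body.queryResult.parameters;

  if (intent === 'get_coffee_price') {
    const prices = { 'flat white': 28, americano: 22 }; // stand-in data
    return {
      fulfillmentText: `A ${params.coffee} costs R${prices[params.coffee]}.`,
    };
  }
  return { fulfillmentText: "Sorry, I didn't get that." };
}

// Wire it into any HTTP framework, e.g. Express:
// app.post('/webhook', (req, res) => res.json(handleWebhook(req.body)));
```

The handler is kept as a pure function so the conversation logic can be tested without spinning up a server.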

There are obviously other services that are similar to DialogFlow, such as Luis.ai or Wit.ai. Personally, we’ve found DialogFlow’s interface extremely easy to use and their documentation to be excellent. It also handles the context (subject) of a conversation really well, which is far more important than you might realize. (Some of the other services don’t keep track of context at all!) That's why we'll reference DialogFlow a lot in this post.

Lesson 2 - Simon: Chatbots are just high level functions ...with some magic

One of the biggest conceptual leaps I made was to think of what a chatbot actually does as a high-level function.

I have a formal background in some AI techniques but chatbots still felt like a foreign technology to me when I first encountered them. What they are doing seemed almost magical because of how complex I knew it was to work with natural language.

That's when I decided to start thinking about chatbots differently. I remembered that, mathematically speaking, machine learning models are just high level input-output-functions. With this simplifying metaphor in place, I could make sense of how a chatbot actually worked in practice:

The function represents a mapping between the spoken text by the human and the appropriate response from the bot. This can be illustrated with the following snippet of JavaScript:

const chatbot = (spokenText, currentContext) => {
  // magic: understand what the human wants, pick a response
  return responseText;
};
The high-level function receives text input, "magically" understands what the human wants, and responds appropriately with more text, the output.

Now, obviously, there is quite a lot that goes into this “magic”. Practically speaking, APIs like DialogFlow accomplish this functional mapping with two important mechanisms: a rules engine and a machine learning model.

DialogFlow’s machine learning

The machine learning models that approximate the mapping between text and the user’s intent are incredibly complex and super difficult to get right. (That’s why it’s so handy when an API takes care of this.) The models are trained on example text for each mapping you want the chatbot to understand. After you have provided some training examples - usually various ways of asking or saying the same thing - you sit back and let the neural network models do their thing.

DialogFlow’s rules engine

DialogFlow’s rules engine allows you to implement something like a decision tree to design the flow of your conversation. This is useful for branches in the conversation, like a decision in a flow diagram. The images on this DialogFlow documentation page will give you a good visual picture of what I mean.

The rules engine does this by mapping the user input to the correct intent based on the context of the conversation. This is similar to conditionally calling a function that destructures the input and returns the results. You can think of these rules as something similar to pattern matching in Elixir or destructuring in JavaScript - just that now, it’s for the flow of the conversation.
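In JavaScript terms, that analogy might look like the following toy sketch. All names here are illustrative; this is not DialogFlow’s actual API:

```javascript
// Toy rules engine: the current context selects the handler, and the
// handler destructures the parsed input - the same shape as pattern
// matching in Elixir or destructuring in JavaScript.
const handlers = {
  ordering: ({ item }) => `Added ${item} to your order.`,
  paying: ({ amount }) => `Charging R${amount}. Confirm?`,
};

function route(context, parsedInput) {
  const handler = handlers[context];
  return handler ? handler(parsedInput) : "Sorry, I'm lost.";
}
```

The same user input can land in different handlers depending on the active context, which is exactly what makes follow-up questions work.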

Let’s look at an example:

If I ask Google: “Who is the president of the United States?”, and subsequently ask: “What is his wife’s name?”, the chatbot must first extract that I intend to know who the president is. Asked directly after the first question, the second one is quite specific: I obviously want to know the president’s wife’s name. For that, though, the chatbot needs to be set up to take context into account; otherwise it has no idea who “his” refers to, and the follow-up question fails.


Lesson 3 - Arrie: It’s super important to understand the three main chatbot concepts

Now, I know we just told you that, thanks to all the magic, chatbots are quite easy to build. That however does not mean that we didn't come across our fair share of challenges. When we built our first prototype bot together, Simon and I were working on a Robo-Advisor platform. We wanted to introduce a new way of interacting with the application and have the bot respond to the user’s input with dynamic data. For this, we decided on a simple text-interface.

Then it became more complex: Prototype two was a registration bot that allows users to easily register on the Robo-Advisor platform, and prototype three was a banking assistant bot that allows a user to voice-command their Root bank account.

For these to work we had to understand three main concepts: intent, entities and context. They are not only useful but turned out to be absolutely essential for the conversation flow and architecture of our chatbots.

Note: These three concepts are pretty universal across any API platform you might use, but could have slightly different names.


Intents

For our very first prototype, we basically just wanted to map what the user was saying to a function call in our app. This led to the discovery and understanding of intents.

Basically, an intent is the desire of the user based on the input provided. The chatbot needs to understand what the user actually wants it to do or how it should respond.

Let's use a simple example: a bot that allows you to order a cup of coffee. The user might ask: “How much does a coffee cost?” At this point, the chatbot determines that the user intends to know the price of a coffee, and it replies with the price.
The chatbot matches the intent based on the training examples you've set up earlier. You would, for example, define an intent in DialogFlow as “get_coffee_price” and provide it various examples of how the user would ask for the price of a coffee. The easiest way to think about it is by asking yourself: “What does the user want?” or “What do they want to do?”
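As a rough mental model, intent matching behaves something like the sketch below. DialogFlow’s real matching uses trained ML models that generalise from your training phrases; this keyword version only matches literal substrings, but the input/output shape is the same idea:

```javascript
// Naive keyword-based stand-in for intent matching. Intent names and
// training phrases are illustrative.
const trainingPhrases = {
  get_coffee_price: ['how much', 'what does it cost', 'price of'],
  order_coffee: ['i would like', 'can i get', 'give me'],
};

function matchIntent(userText) {
  const text = userText.toLowerCase();
  for (const [intent, phrases] of Object.entries(trainingPhrases)) {
    if (phrases.some((phrase) => text.includes(phrase))) return intent;
  }
  return 'fallback'; // no intent matched
}
```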


Entities

Our second prototype, the registration bot, went a little bit further: We wanted to gather information from the user to register them on the platform. For this, we needed to not only understand the user’s intent, but also to extract additional information from the sentence. In DialogFlow, entities work like the parameters of a function.

Entities are essentially specific pieces of actionable information in the user input.

Coming back to our coffee bot example, the bot would need to know what type of coffee the person wants in order to give them the correct price. The user might ask: “How much is a flat white?” At this point, the chatbot determines that the user intends to know the price of a coffee, but it also needs to know which coffee (the entity) the user is interested in. It determines that the user is enquiring about the entity “flat white” specifically, and it can reply with that price.
Entities are like markers in example sentences. They indicate to your API what information you would want to extract from a sentence: You’d set up the word “flat white” as the entity type “coffee”. When the chatbot then receives this or a similar sentence, it would give you a little structure back that looks like this:

   {
     entities: {
       coffee: "flat white"
     }
   }

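A toy version of that extraction, using a fixed list of coffee names in place of DialogFlow’s trained entity recognition (which also handles synonyms and fuzzy matches):

```javascript
// Toy entity extraction: scan the input for known values of the
// "coffee" entity type and return the same kind of structure the
// API gives back. The entity values are illustrative.
const coffeeEntities = ['flat white', 'americano', 'cappuccino'];

function extractEntities(userText) {
  const text = userText.toLowerCase();
  const coffee = coffeeEntities.find((name) => text.includes(name));
  return coffee ? { entities: { coffee } } : { entities: {} };
}
```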

Context

When we started building our third prototype, a voice-controlled banking assistant bot that allows you to transact with your Root bank account, we realised that we had to understand the user’s input in the context of their previous inputs.

When making a payment to someone, there are a lot of moving parts. That means you have to allow the user to do many different things. They need to:

  • Be able to specify who they’d like to pay,
  • Say how much they need to pay and
  • Be able to cancel the payment if they change their mind.

We had to keep track of all of this information in order to direct them along the right conversation flow. The bot needed to understand where it was in the conversation, what information it had and what information it still needed to get.

Our coffee bot scenario would then change like this: the user orders an Americano and, once the bot has confirmed the order, adds: “With some cold milk, please.” At this point, the chatbot needs to remember that the coffee drinker is talking about the Americano, and reply with something like a confirmation that the milk has been added to that order. If it didn’t have that context, it could easily think that the “cold milk” is a separate order and end up giving the coffee drinker a glass of cold milk. That’s why context is important.

In DialogFlow, you’d set up the contexts in intents, since contexts are used to help filter to the correct intent: You might have one intent that listens for new orders and another that listens for an add-on order within the context of the original order. In the above example, DialogFlow will hit the “new order” intent when the user first asks for the Americano, and it’ll match on the “add-on order” intent when they ask for cold milk.
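A stateful sketch of that idea: the active context decides whether “cold milk” starts a new order or modifies the previous one. All of the logic here is illustrative:

```javascript
// Minimal conversation state: once an order context is active, a
// milk request is treated as an add-on to the last order, not a
// new order of its own.
function makeSession() {
  let context = null;
  let order = null;
  return {
    say(text) {
      const match = text.match(/americano|flat white/i);
      if (match) {
        order = match[0].toLowerCase();
        context = 'ordering'; // enter the order context
        return `One ${order}. Anything else?`;
      }
      if (context === 'ordering' && /milk/i.test(text)) {
        return `Adding cold milk to your ${order}.`;
      }
      return "Sorry, I didn't get that.";
    },
  };
}
```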

Lesson 4 - Simon: Voice interfaces are not perfect

When Arrie and I initially started with our banking bot’s voice interface, I honestly thought it would just be another “plug-and-play” story. (I had clearly grown in confidence by that time.) In reality, however, a voice interface has nuances beyond just converting the text to audio and back. Google Assistant’s voice interface is state of the art, and it still gets this wrong at times.

Turns out that - despite the easy setup through the DialogFlow API and understanding the main concepts quite well by now - modern technology is still quite far from allowing us to be Knight Rider. We realised this both through experiments with the actual bot and by observing ourselves:

We speak differently than we write

One thing I noticed immediately is this: The way we write a text for someone to read is different to the way we write a text that will be said aloud. This includes the way words are pronounced.

My favourite embarrassing story comes from our live demo at Rubyfuza 2018: Arrie asked the bot to transfer R50 into my bank account and then asked it what the last transaction was. The bot responded: “Your last transaction was 50 rand to Simon with description: eff’d.” Now, obviously, this was supposed to be pronounced character by character, as E.F.T. A human would know that EFT without the periods stands for electronic funds transfer, but the bot read it as if it was a word, pronouncing it as "eff’d" which, luckily, got us quite a few laughs from the audience.
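The usual fix for this is speech markup: Google’s text-to-speech supports SSML, where a `say-as` tag forces character-by-character pronunciation. A minimal sketch of the response we should have sent (the sentence is ours; the `say-as` tag is standard SSML):

```javascript
// SSML tells the speech synthesiser to spell "EFT" out letter by
// letter instead of pronouncing it as a word ("eff'd").
const description = 'EFT';
const ssml =
  '<speak>' +
  'Your last transaction was 50 rand to Simon with description: ' +
  `<say-as interpret-as="characters">${description}</say-as>` +
  '</speak>';
```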

Chatbots don't have all the context

Around the same time, Arrie and I also noticed that the way we talked to the chatbot was really different from the way we talked to each other. You might have observed this yourselves: When you are talking to Google Assistant, Alexa or Siri, the bot often misunderstands what you want to do. We realised that we’ve been conditioned to talk differently, more directly, more clearly - essentially, more like a robot when using chatbots.

Let's take the scenario of me getting ready for our trip to Cape Town. Instead of saying: “Will I be cold if I don’t take a jacket with me?”, I asked Google Assistant:

“What is the weather on Friday in Cape Town?”

A human would easily infer that we wanted to know what to pack for our trip, but bots are obviously not quite that smart. That’s why we preemptively give them more direct information.

To be completely honest with you, it made us a little bit sad when we began to understand that, even with the current state of the art NLP, we still couldn’t achieve a completely natural conversational interface with our bot.

You see, we dream about having completely natural conversations with our bots, but this just isn’t possible if the bot is not aware of the greater context in which a conversation might happen. For example, if the chatbot knew I was standing in front of my wardrobe, it could possibly infer that I was contemplating taking a jacket and wanted to know about the weather.

Note: The reason chatbots are not (yet) perfect is that the way we communicate involves a whole lot more than just our voices: we use non-verbal cues, cultural norms and deeper contexts.

Lesson 5 - Simon: Intents run based on context

Whilst trying to make our banking bot feel more natural, we realised that we had to deepen our understanding of the context concept: We wanted the user to be able to confirm or cancel a transfer before we moved any money. In order to do this, we needed to know that the user had provided all the required information for the transfer before asking them whether they wanted to continue or cancel. This is referred to as a confirmation step, and you reach it once the conversation has collected everything the transfer needs.

In a confirmation step, you might decide to continue with the current action, change a detail, or cancel it completely - it all depends on your current context.

We really struggled to get this right because we didn’t fully understand how the concept of contexts was meant to be used in DialogFlow’s interface. It wasn’t until we realised that the confirmation step looks the same for many intents that we really grokked this:

The chatbot runs intents based on the context it’s in when the confirmation step is reached.

This was an “aha” moment of note. Once we understood it, we added a payment confirmation intent that would only be possible if the user was trying to make a payment and the bot had collected all the required entities: the contact and the amount to pay.

In the background of this, DialogFlow’s rules engine is at work. It filters all possible intents down to only those that can run in the context of us being ready to make a payment. DialogFlow decides where to go in the conversation (which intent should run) using a combination of rules based on your conversation’s current context and what the user responds with. Together, these allow you to design context-aware conversations that also add to that feeling of a natural flow.
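Conceptually, the filtering works something like this sketch. The intent names and session fields are ours, not DialogFlow’s:

```javascript
// The confirmation intents only become eligible once the payment
// context is active and all required entities have been collected.
function eligibleIntents(session) {
  const intents = ['make_payment']; // always available
  const ready =
    session.context === 'payment' && session.contact && session.amount;
  if (ready) intents.push('confirm_payment', 'cancel_payment');
  return intents;
}
```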


When it comes to voice interfaces, don't try to reinvent the wheel. There are times when, as a developer, you should roll your own, but this is not one of them. Rather rely on the same technology that powers the conversion from text to voice in Google Assistant.

Once the moving parts are taken care of, you really get to focus on making the conversation aspects useful for the users - and it’s incredibly fun to watch them interact with your bots in the end! Now we feel like slapping a chatbot onto pretty much anything! Personally, we’d want to be able to control our entire life via a voice interface like in the movie Her (maybe without the weird personality).



Simon is a technical product lead at Platform45, interested in everything software from design & development to machine learning. On his journey, Simon loves sharing his learnings, and has given a couple talks in South Africa and the USA. Catch him on Github, Twitter or at simonvandyk.co.za.

Arrie is a product developer at Platform45, passionate about building great products that matter. He believes that life is too short for bad products and that hard problems can be solved at the intersection of design and innovative technologies, particularly AI. Catch him on Github or Twitter.
