It’s clear by now that the robots are coming for us.
Breakthroughs in AI fill our streams and news feeds, themselves the products of machine learning, the echoing algorithmic screams of a new kind of mind being born.
Using TensorFlow.js, we’ll find out how deep learning systems learn and examine how they think. The fundamental building blocks of AI have never been more accessible. Ashi Krishnan explores the architecture, potential, and implications of these protominds, which are growing to mediate our every interaction.
[00:09] Hello everyone. So, most presentations don’t start with a content warning, which is interesting, right? There’s nothing like a big warning to kind of build anticipation. This talk is not actually that scary. It has some unusual things, a few moments where the screen is going to flicker and zoom and will be very exciting. Some pictures that’ll be unsettling, but not like, not gory, not disturbing. There’s nothing graphic.
[00:39] There’s nothing that you’ll look at and say, “Oh, I understand why this is disturbing.” But there are things where you’ll look at them and be like, “Mmm.” We’ll ask why that is, and we’re going to talk about some aspects of human experience that we don’t often talk about and some things in our brain, some things in our mind.
[00:57] All of that can conjure a lot, and so I want you to be here for it, but if you can’t be here for it, take care of yourself, and otherwise, hello. My name is Ashi and I work at Apollo, and we’re doing some very, very exciting things with GraphQL, to let you slice and dice your services and your data layer in all kinds of ways. I’m very excited to tell you about it, which I will presumably do in a future talk because this talk is about something completely different.
[01:34] This talk is about a hobby of mine. Today, I want to share with you some of the things that I’ve learned at the intersection of computational neuroscience and artificial intelligence, because for years, I’ve been fascinated by how we think and how we perceive and how the machinery of our bodies results in qualitative experiences and why our experiences are shaped as they are. Why do we have joy? Why do we suffer?
[02:02] For years, I’ve been fascinated by AI, which I think we all have. We’re starting to watch these machines begin to approximate the tasks of our cognition in unsettling ways, and we’re afraid – I have certainly been afraid – that they’re going to take all of our jobs, which I have some good news if that is the fear that you have: The robots are very impressive and they’re also pretty stupid in very fundamental ways.
[02:27] They’re not going to take your job, at least, not for the next couple of years. But, they’re going to change it quite a lot, and so today, we’re going to take a look at how, and how we can respond to that. Let’s begin. Part one.
[02:43] Hallucinations. This person is Miquel Perelló Nieto and he has something to show us. It starts with these simple patterns, these splotches of light and dark like images from the first eyes. They’re going to give way to lines and colors, and then curves and more complex shapes. What’s happening is that we’re diving through the layers of the Inception image classifier, and it seems like there are whole worlds in here.
[03:15] We have these shaded multi-chromatic hatches. We have the crystalline farm fields of an alien world. The cells of plants. To understand where these visuals are coming from, let’s take a look inside. The job of an image classifier is to reshape its input, which is a square of pixels, into its output, which is a probability distribution.
[03:39] The probability that this image contains a cat, the probability of a dog, a banana, a toaster. It performs this reshaping through a series of convolutional filters. Convolutional filters are basically Photoshop filters. Each neuron in a convolutional layer has a receptive field, which is some small patch of the previous layer that it’s looking at. Each convolutional layer applies a filter – specifically, it applies an image kernel.
[04:08] Now, an image kernel is just a bunch of numbers – a matrix – where each number represents the weight of a corresponding input neuron. Each pixel in the neuron’s receptive field is multiplied by its weight, and then we sum them all to get this neuron’s output value. The key thing is that the same filter is applied for every neuron in a layer and that filter is learned during training.
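The multiply-and-sum step described above can be sketched in plain JavaScript. This is a minimal illustration, not code from the talk: the function name `convolve2d`, the hand-picked edge kernel, and the tiny image are all assumptions, and a real network learns its kernel weights during training rather than having them written by hand.

```javascript
// Apply one image kernel to a grayscale image (a 2D array of numbers).
// No padding is used, so the output shrinks by (kernel size - 1) per dimension.
// As in most deep learning libraries, the kernel is not flipped.
function convolve2d(image, kernel) {
  const kh = kernel.length;
  const kw = kernel[0].length;
  const outH = image.length - kh + 1;
  const outW = image[0].length - kw + 1;
  const output = [];
  for (let y = 0; y < outH; y++) {
    const row = [];
    for (let x = 0; x < outW; x++) {
      // The receptive field of this output neuron is the kh x kw patch at (y, x).
      let sum = 0;
      for (let ky = 0; ky < kh; ky++) {
        for (let kx = 0; kx < kw; kx++) {
          sum += image[y + ky][x + kx] * kernel[ky][kx];
        }
      }
      row.push(sum);
    }
    output.push(row);
  }
  return output;
}

// Hypothetical vertical-edge kernel, chosen by hand for illustration.
const edgeKernel = [
  [-1, 0, 1],
  [-1, 0, 1],
  [-1, 0, 1],
];

// 4 x 5 image: dark on the left, bright on the right.
const image = [
  [0, 0, 0, 1, 1],
  [0, 0, 0, 1, 1],
  [0, 0, 0, 1, 1],
  [0, 0, 0, 1, 1],
];

const featureMap = convolve2d(image, edgeKernel);
console.log(featureMap); // [[0, 3, 3], [0, 3, 3]]
```

The same nine weights are reused at every position, which is the "same filter for every neuron in a layer" property: the flat dark region produces 0, and every patch that straddles the edge produces 3.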
[04:35] Training works like this: We feed the classifier a labeled image, something where we know what’s in it, and it outputs predictions, and then we math to figure out how wrong those predictions are. They start off very wrong. Then we math again, figuring out how to nudge each and every single filter in a direction that would have produced a better result.
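The predict, measure, nudge loop described above (commonly called gradient descent) can be sketched with a one-weight model. Everything here is an assumption made for illustration: the toy examples, the mean squared error loss, and the learning rate. An image classifier does the same thing across millions of filter weights.

```javascript
// Toy "model": prediction = weight * input. The labeled examples follow y = 2x.
const examples = [
  { x: 1, y: 2 },
  { x: 2, y: 4 },
  { x: 3, y: 6 },
];

// Mean squared error: how wrong the predictions are for a given weight.
function loss(weight) {
  let total = 0;
  for (const { x, y } of examples) total += (weight * x - y) ** 2;
  return total / examples.length;
}

// Derivative of the loss with respect to the weight.
function gradient(weight) {
  let total = 0;
  for (const { x, y } of examples) total += 2 * (weight * x - y) * x;
  return total / examples.length;
}

let weight = 0;            // starts off very wrong
const learningRate = 0.05; // size of each nudge (hypothetical value)
const initialLoss = loss(weight);

for (let step = 0; step < 100; step++) {
  // Nudge the weight in the direction that would have produced a better result.
  weight -= learningRate * gradient(weight);
}

const finalLoss = loss(weight);
console.log({ initialLoss, finalLoss, weight });
```

After the loop the weight sits very close to 2 and the loss is close to zero, which is what "rolling downhill" looks like when the landscape has a single valley.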
[04:57] The term for this is gradient descent because we imagine the fitness of a given model as a landscape and we’re descending down, kind of rolling downhill to a place where we experience the least loss. The deep learning process used to create this image inverts this. This visualisation is recursive. To compute the next frame, we feed the current frame into the network. We run it through the network’s many layers until we get to a particular layer that we’re interested in.
[05:30] Then, we math. What could we do to the input image to make this layer activate more? Then, we tweak the input image in that direction. The term for this process is gradient ascent. Finally, we scale the image up very slightly, before feeding it back to the network again. That keeps the network from sort of fixating on the same shapes in the same locations, and it also creates this fascinating and kind of trippy zooming effect.
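A minimal sketch of gradient ascent on the input, in plain JavaScript. The single-neuron `activation` function stands in for "a particular layer" and its weights are hypothetical; the gradient is estimated numerically so the sketch does not depend on any library. The zoom step is left out.

```javascript
// Stand-in for a layer we are interested in: one neuron with fixed,
// already-trained weights (hypothetical values).
const filter = [1, -1, 0.5];
function activation(input) {
  return input.reduce((sum, pixel, i) => sum + pixel * filter[i], 0);
}

// Numerically estimate how the activation changes as each input pixel changes.
function inputGradient(f, input, eps = 1e-4) {
  return input.map((_, i) => {
    const up = input.slice();
    const down = input.slice();
    up[i] += eps;
    down[i] -= eps;
    return (f(up) - f(down)) / (2 * eps);
  });
}

let frame = [0.5, 0.5, 0.5]; // a three-pixel "image" with values in [0, 1]
const before = activation(frame);

for (let step = 0; step < 10; step++) {
  const grad = inputGradient(activation, frame);
  // Ascent: move the input *along* the gradient, then clamp to valid pixel values.
  frame = frame.map((pixel, i) => Math.min(1, Math.max(0, pixel + 0.1 * grad[i])));
}

const after = activation(frame);
console.log({ before, after, frame });
```

Note what changed compared with training: the weights stay fixed and the image is what gets nudged, in the direction that makes the chosen neuron respond more strongly.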
[06:02] Every hundred frames or so, we move to a deeper layer, or a layer kind of off to the side. Inception has a lot of layers and although there are no cycles, they are not entirely arranged in a neat linear line. This is a pretty complex network architecture and the good thing is you don’t have to design this yourself, you don’t have to train this yourself – you can actually NPM install an image recogniser.
[06:32] Let’s see how we can use that. We’re going to explore transfer learning. Transfer learning is where we keep most of the training from an existing network, but then we cut off the bottom and we attach our own model, which we train for a particular purpose. The purpose we’re going to explore today is playing Pacman with my elephant friend, Talula.
[06:57] This is Talula. There she is, and she’s going to help us play Pacman. First, we’re going to train the network. To provide training data, I kind of click on these buttons for each of the different directions and that collects images, snapshots from my webcam. So, you see that I’m kind of trying to capture a variety of different ways that I could be holding Talula, and I’m actually not doing a very good job it turns out. We’ll see that later. Then, we train the network. You can watch the loss kind of go very low – that means we’ve descended quite far in the landscape.
[07:35] Then, we play. So, when we play, we can see it works pretty well. So, I’m holding Talula up. I’m going up. I’m holding to the left and going left, it starts off going pretty well, and then the moment I’m stressed, I don’t do a very good job of staying in the camera’s frame or of holding Talula like I was, and so now, okay, so it’s kind of working but I think that situation is not going to continue and, yep. I am eaten now. It’s unfortunate.
[08:10] It can be very challenging – machine learning can be challenging on a friendship, especially if you’re using your friend as a game controller. I’m pleased to report that we are still friends. Let’s look at the code behind all of this. This is actually an example from the TensorFlow JS repo, so you can go and dig into it yourself.
[08:31] The first thing we do here is we pull in TensorFlow JS, the core library, which is something you can just NPM install. We create this controller dataset, which is just a utility that we’ve created that’s going to hold all of the labeled examples. We load MobileNet, which is like Inception. It’s an image recogniser. MobileNet is a bit smaller than Inception. It’s been optimised for use on mobile devices and is good for use in the browser because of this.
[09:02] We don’t have a ton of computational power available to us in the browser, although we can use the GPU through WebGL, which is what TensorFlow JS does. But still, we’re somewhat limited, and so MobileNet is a good choice for this. Also, it’s there. It’s available online, so the best option you have is the one that’s available.
[09:23] Now, we get to the transfer learning bit, so to do that we’re going to grab a layer. If you were watching very closely, you will remember seeing that very string before when we were diving through the layers of Inception. It’s conv_pw_13_relu, so it’s an activation layer that follows a convolutional layer. We’ll sort of look at what that means a little bit more in a second.
[09:49] We grab that layer rather than the final output layer and we create a model that goes from the input to that sort of intermediate layer near the bottom. Then, whenever we record an example, we add it to the controller dataset. Notice here that the x we’re using for each example is MobileNet’s actual prediction, so we’re saying: Map the output of MobileNet to the label that we want, which will be up, down, left or right.
[10:19] Then, finally, we create our model. We’re creating a sequential model, which means it is just a stack of layers. The first layer flattens the output of MobileNet, so we can get a really long chain of numbers. Then, we have a densely interconnected layer, which connects all of those numbers to a whole bunch of neurons, and connects them densely, so every neuron is connected to all of the inputs by some weight.
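To make "flatten, then connect densely" concrete, here is a hand-written forward pass in plain JavaScript. It is a sketch under stated assumptions, not the talk's code: the tiny feature map, the weights, the ReLU hidden layer, and the softmax output layer are all made up for illustration. TensorFlow.js provides the real building blocks as `tf.layers.flatten` and `tf.layers.dense`.

```javascript
// Flatten a nested array (for example a height x width x channels feature map)
// into one long list of numbers.
function flatten(tensor) {
  return tensor.flat(Infinity);
}

// Dense layer: every output neuron is connected to every input by its own weight.
// weights[j][i] is the weight from input i to neuron j.
function dense(inputs, weights, biases, activationFn) {
  return weights.map((neuronWeights, j) => {
    let sum = biases[j];
    for (let i = 0; i < inputs.length; i++) sum += neuronWeights[i] * inputs[i];
    return activationFn(sum);
  });
}

const relu = (x) => Math.max(0, x);
const identity = (x) => x;

// Softmax turns raw scores into a probability distribution.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Hypothetical 2 x 2 x 1 feature map standing in for the truncated network's output.
const featureMap = [[[1], [2]], [[3], [4]]];
const flat = flatten(featureMap); // [1, 2, 3, 4]

// Hypothetical weights: 2 hidden neurons, then 4 output classes.
const hidden = dense(flat, [[0.1, 0.1, 0.1, 0.1], [-0.1, -0.1, -0.1, -0.1]], [0, 0], relu);
const scores = dense(hidden, [[1, 0], [0, 1], [-1, 0], [0.5, 0.5]], [0, 0, 0, 0], identity);
const probs = softmax(scores);

console.log(flat, hidden, probs);
```

The four outputs sum to 1, so they can be read as the probabilities of up, down, left and right once the weights have been trained.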
[10:46] We can customise the number of layers in there and the number of neurons in the UI – I think I left it at the default, which I believe is a hundred. Then, finally, we train it. We’re using the Adam optimiser, which is a gradient descent optimiser. There’s a bunch of different ways of doing gradient descent. They’re all subtly different in terms of how fast you roll down the hill, how much momentum you have as you roll down the hill.
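To make the "momentum as you roll down the hill" idea concrete, here is a sketch of the Adam update rule as it is usually published, applied to a one-variable toy function. The function being minimised, the learning rate, and the step count are assumptions chosen for illustration; in TensorFlow.js the optimiser itself comes from `tf.train.adam`.

```javascript
// Minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
const grad = (x) => 2 * (x - 3);

const learningRate = 0.1; // chosen for this toy problem
const beta1 = 0.9;        // decay for the running mean of gradients (the momentum)
const beta2 = 0.999;      // decay for the running mean of squared gradients
const epsilon = 1e-8;     // avoids division by zero

let x = 0; // starting position on the hill
let m = 0; // running mean of gradients
let v = 0; // running mean of squared gradients

for (let t = 1; t <= 500; t++) {
  const g = grad(x);
  m = beta1 * m + (1 - beta1) * g;
  v = beta2 * v + (1 - beta2) * g * g;
  // Bias correction: m and v start at zero, so early estimates are scaled up.
  const mHat = m / (1 - beta1 ** t);
  const vHat = v / (1 - beta2 ** t);
  // Step size adapts to the recent gradient magnitude.
  x -= (learningRate * mHat) / (Math.sqrt(vHat) + epsilon);
}

console.log(x); // close to 3
```

Dividing by the running gradient magnitude is what makes the step size adapt per parameter, and the running mean of gradients supplies the momentum.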
[11:14] So, Adam sort of strikes a nice balance between the various options and it’s sort of known to be self-correcting to some degree. So, we compile the model, and that gives us something that we can actually run on the GPU. It goes and creates a bunch of TensorFlow nodes, and then we… Oh, right. Sorry. When we compile the model, we have to decide how we’re going to determine how badly we’re doing.
[11:40] So, remember, we need to test – see how the model does on one of our examples – and then move it in the direction that would be less wrong. To do that, we need to know how wrong we are. And, to figure that out, we’re using categorical cross-entropy loss, which is a mouthful. But it is a way of determining how wrong you are when your output is a probability distribution.
[12:06] Let’s say that I was holding Talula upside down to mean ‘down’, and so, the correct answer is it’s 100% likely that I’m trying to go down. But, the network outputs this probability distribution, which is like a 10% chance you were holding her up, 40% chance of down, 50% chance of left, and whatever is left is your percent chance of right. That’s wrong, obviously, if we picked the like highest valued prediction – that would be completely wrong. But, it’s like how wrong is it? It’s like it’s kind of wrong. It’s 60% wrong. There’s like a mathematical answer that guarantees a well-behaved network, and that’s what categorical cross-entropy loss gives us.
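The example above can be worked through with the standard definition of categorical cross-entropy: the negative sum, over classes, of the target probability times the log of the predicted probability. This plain JavaScript sketch assumes the class order up, down, left, right and clips predictions to avoid taking the log of zero; the function name and the clipping constant are assumptions.

```javascript
// Categorical cross-entropy between a target distribution and a prediction.
function categoricalCrossEntropy(target, predicted, epsilon = 1e-7) {
  let loss = 0;
  for (let i = 0; i < target.length; i++) {
    // Clip so that log(0) never produces -Infinity (and 0 * -Infinity never NaN).
    const p = Math.min(1, Math.max(epsilon, predicted[i]));
    loss -= target[i] * Math.log(p);
  }
  return loss;
}

// Assumed class order: up, down, left, right.
const target = [0, 1, 0, 0];            // 100% "down"
const predicted = [0.1, 0.4, 0.5, 0.0]; // the distribution from the example above
const better = [0.05, 0.9, 0.05, 0.0];  // a hypothetical, more confident prediction

const loss = categoricalCrossEntropy(target, predicted);
const betterLoss = categoricalCrossEntropy(target, better);

console.log(loss.toFixed(3));       // "0.916", which is -ln(0.4)
console.log(betterLoss.toFixed(3)); // "0.105", which is -ln(0.9)
```

With a one-hot target the loss reduces to the negative log of the probability given to the correct class, so it is not literally 0.6 here: it is about 0.916, and it grows without bound as that probability approaches zero.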
[12:51] We come up with a batch size, which is how many examples we’re going to go through for every training step, and then finally, we train the model. Training is an asynchronous process. We get a callback whenever we’ve gone through a batch of examples and that lets us update the UI, saying like, “Here’s how well the training process is going.”
[13:11] Finally, to play the game, we are going to grab a frame from the webcam. We’re going to run it through MobileNet to get the prediction of that internal layer, which, remember, is the input to our little transfer learning model. We run the output of the MobileNet layer through our model, and then we use this argMax function to basically figure out which of the predicted classes is most likely – which one has the highest prediction. We move Pacman in that direction.
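The last step, picking the most likely class, is small enough to write out by hand. This is a plain JavaScript sketch, not the library's argMax; the class order in `DIRECTIONS` is an assumption.

```javascript
// Assumed class order for the four controls.
const DIRECTIONS = ["up", "down", "left", "right"];

// Index of the largest value in an array (first one wins on ties).
function argMax(values) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

const predictions = [0.1, 0.4, 0.5, 0.0];
const direction = DIRECTIONS[argMax(predictions)];
console.log(direction); // "left"
```

Only the ranking matters at play time: a 50% "left" wins over a 40% "down", however uncertain the network is.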
[13:48] That’s basically it. We are now playing Pacman using machine learning. Let’s go a little bit deeper into this. I want to draw our attention back to the sort of a weird internal layer with a strange name, conv_pw_13_relu. To get a better sense of what that’s all about, I wanted to kind of go load up the model description file and take a look around. We can just fetch it from Google Storage. We saw the URL earlier, so I just literally fetched it and I’m going to poke around with it in the dev tools.
[14:21] It looks like this is a Keras model. Keras was its own framework, and now François Chollet works for Google, and so it is part of TensorFlow. It’s a way of describing a large class of deep learning models by just saying, “Here’s all the layers that go into them.” We do certainly have a lot of layers. There is an input layer, and then there’s kind of a repeating stack of convolution, normalisation, and activation layers, which is how all image recognisers work basically.
[14:56] It’s like stacks of layers packed on top of each other, and each layer learns to extract progressively higher order features. Input layer, convolutional layer – remember what convolutional layers do – and the features that these various layers learn to recognise are what are getting amplified when we go through our deep dream dive here. The early layers have learned to recognise basic patterns like the difference between light and dark. Then, much deeper we start to actually isolate forms.
[15:40] So, down here there’s kind of this city of Kodama situation, which is turning into the spider observation zone where spiders observe you, but it’s okay because the spiders are going to become corgis, and the corgis are going to become the ‘70s. Then, kind of much deeper in the network, we are going to find the space of nearly human eyes – this is the disturbing part – which are going to become dog slugs, and these kind of dog-slug bird situations.
[16:15] Much deeper, there was an unfortunate saxophonist teleporter accident, and then finally, the flesh zones with a side of lizards. When I first saw this video, I thought this looked like nothing so much as US President Donald Trump, and I resolved to never tell anyone that until my best friend watched it and said the exact same thing, which I think says more about the structure of our neural networks than this one.
[16:46] I think the lizard juxtaposition is kind of suggestive. But, I do want you to notice and think about what it means that all of the flesh inside this network is so very pale. This data processing arrangement from simple forms to more intricate ones is pretty similar to how our own visual cortex processes information. The visual cortex is, for reasons that I think nobody really knows, in the back of your head and it’s arranged into a stack of neuronal layers.
[17:20] The signal from our eyes stays pretty spatially coherent throughout the layers, and so, there’s some chunk of tissue in the back of your head that’s responsible for pulling faces out of one particular bit of your visual field. Each neuron in a layer has a receptive field: some chunk of the previous layer, and therefore some chunk of the visual field, that it’s looking at.
[17:45] Neurons in a given layer tend to respond in pretty much the same way to signals within their receptive field, and that operation, distributed over a whole layer of neurons, extracts features from the signal, so first more simple features like lines and edges, and then more complex features like surfaces and objects and motion and faces. It’s not an accident that we see very similar behavior in Inception because Inception and MobileNet and all of these classifiers were designed after our visual cortex – specifically, they were given the same shape but not the same information hierarchy.
[18:23] That they learned during training, which to me is fascinating and kind of serves as a validation of the model. Of course, there are some differences between Inception and our visual cortex. Inception is a straight shot through, so input to output – there are some branches but there are no cycles.
[18:42] Our visual cortex is full of feedback loops: These pyramidal neurons that connect deeper layers to earlier ones and let the output of those layers inform the behavior of earlier layers. We might turn up the gain of edge detection where later we’ve detected that there’s an object.
[19:02] This lets our visual system focus, not optically but attentionally. It gives us the ability to ruminate on visual input and improve our hypotheses over time. I think you know this – it’s the feeling of looking at something, thinking it’s one thing, and then realising that it’s something else. Neural networks know that feeling too in a sense.
[19:26] Here is a banana but the network is pretty sure it’s a toaster, and then here we have a corgi that, or the – sorry, the skiers, that we’ve kind of taught the network, “Oh, it’s actually a corgi.” These are adversarial examples. These are images that have been tuned and specifically chosen to give classifiers frank hallucinations: The confident belief that they’re seeing something that just isn’t there.
[19:58] They’re not entirely wild, these robot delusions. Like, that sticker really does look a lot like a toaster, and kind of the skiers look a bit like a dog. They don’t really look like that dog but if you’re kind of like far away, you might, it’s like bright and snowy, you might for a second think that you’re looking at a big dog, but you probably wouldn’t conclude that you’re looking at a big dog because the recurrent properties of our visual cortex, not to mention the whole rest of our brain,
[20:27] mean that our sense of the world is stateful. It’s a state. It’s a hypothesis held in the activation structure of our neurons. Our perceptions are in this process of continuous refinement, which might actually point the way towards more robust recognition architectures, recurrent convolutional neural networks that can ruminate upon input and improve their predictions over time, or at least give us a signal that something is wrong about an input. There are adversarial examples for the human visual system after all and they feel pretty weird to look at.
[21:09] In this image, we can feel our sensory interpretation of the scene flipping between three alternatives: A little box in front of a big one, a box in a corner, and a box missing one. In this Munker illusion, there’s something kind of scintillating about the color of the dots, which are all the same and all brown. So, if we designed convolutional neural networks with recurrence, they could exhibit such behavior as well, which doesn’t maybe sound like such a good thing.
[21:43] Like, let’s make our image classifiers vacillating and uncertain, and then put them in charge of self-driving cars. But, we are in charge of self-driving cars, us driving cars, and we have very robust visual systems. It’s our ability to hmmm and ahhh and reconsider our perceptions at many levels that gives our perceptual system that robustness. Paradoxically, being able to second-guess ourselves allows us greater confidence in our predictions because we’re doing science in every moment.
[22:17] The cells of our brain constantly reconsidering and refining, shifting a hypothesis about the state of the world, which gives us the ability to adapt and operate in a pretty extreme range of conditions, even when nothing is quite what it seems, and even when we’re asleep.
[22:39] Part two. Dreams. These are not real people. These are the photos of fake celebrities, which have been dreamt up by a generative adversarial network. Just really a pair of networks, which are particularly creative. The networks get better through continuous mutual refinement and the process works like this.
[23:13] On the one side, we have the creator. The creator is an image classifier, not unlike Inception or MobileNet, but it’s been trained to run in reverse. This network, we feed with noise – just a bunch of random numbers – and it learns to generate images. But, how does it learn to do that? It has no way to learn how to play that game. In the technical parlance, it lacks a gradient without another opponent or without another network, without the adversary.
[23:47] The adversary is also an image classifier, but it’s trained on only two image classes: Real and fake. This network, we feed with the ground truth, with actual examples of celebrity faces, and the adversary learns, and then we use that training to train the creator. We give the output of the creator to the adversary and if the adversary detects it, then the creator is doing poorly. If it doesn’t detect it, it’s doing well. Either way, we backpropagate this so the creator can learn.
[24:25] I should tell you that the technical terms for these networks are 'the generator’ and 'the discriminator’. I changed the names because names are important, and also meaningless. They don’t change the structure of this training methodology, which is that of a game. The two neural circuits are playing with each other and that competition is inspiring.
[24:45] When we spar, our opponents create the tactical landscape that we must traverse, and we do the same thing for them. And so together, our movements ruminate on a probability space that’s much larger than any fixed example set. GANs can actually train very well on a relatively small amount of data – if, that is, they are trained at all. They can be very finicky to get the balance right. It seems like this adversarial process could be helpful for neural circuits of all kinds, though it does have some quirks.
[25:21] GANs are not particularly great at global structure. So, here it’s grown a cow with an extra body, just as you may have spent a night in your house with many extra rooms. The networks are not particularly good at counting, so this monkey has eight eyes because sometimes science goes too far. So, do something for me: Next time you think you’re awake, which I think is now, try and count your fingers just to be sure. If you find, if you do this now, and you find that you have more or fewer fingers than you expect, just try to not wake up just yet because we’re not quite done.
[26:06] Another interesting thing about GANs is that the generator is being fed noise – some vector in a very high dimensional space – and it learns a smooth mapping from that space called the ‘latent space’ onto its generation target. In this case, that’s faces. So, if we take a point in that space and just kind of drag it around, we get this, which is also pretty trippy. This resembles things that I’ve seen – things that someone who isn’t me has seen – on acid, and it resembles the kinds of things that you may have seen in a long-forgotten dream.
[26:49] I think what’s happening here is that when we see a face, a bunch of neurons light up in our brain and begin resonating a signal that is the feeling of looking at that face. That’s almost guaranteed to be true if you think that feelings have some kind of neurocognitive precursor. And so, taken together, all of the neurons involved in face detection produce a vector embedding – that is, they are represented by some vector in a higher dimensional space, and as we are dragging around the generator’s vector here, we are also dragging around our own, which is a novel and unsettling sensation.
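Dragging a point around the latent space can be sketched as a straight-line walk between two latent vectors. This plain JavaScript example only produces the points along the path; feeding each one to a trained generator is left out, and the three-dimensional vectors are an assumption made for readability (real latent vectors typically have hundreds of dimensions).

```javascript
// Linear interpolation between two latent vectors of the same length.
function lerp(a, b, t) {
  return a.map((value, i) => value + (b[i] - value) * t);
}

// Produce `steps` points on the straight line from `start` to `end`,
// endpoints included.
function latentWalk(start, end, steps) {
  const path = [];
  for (let s = 0; s < steps; s++) {
    path.push(lerp(start, end, s / (steps - 1)));
  }
  return path;
}

// Hypothetical latent points, kept tiny for readability.
const start = [0, 1, -1];
const end = [1, 0, 1];

const path = latentWalk(start, end, 5);
console.log(path[2]); // [0.5, 0.5, 0], the midpoint

// Each point in `path` would then be passed to a trained generator
// (not defined here) to render one frame of the animation.
```

Because the generator's mapping is smooth, nearby points in this path produce nearby images, which is why the faces appear to melt into one another instead of jumping.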
[27:30] A novel and unsettling sensation for a novel and unsettling world. Like, we are now in a world where we can map the mathematical spaces of pretty people and pluck photos of imaginary people from it, and that is not even the scariest thing we can do. Here is a video of Obama. This is also not Obama. This is fake Obama. He is saying lines from an actual Obama speech, and I think fake Obama could be a little bit more convincing. If you get really close, you can kind of see the difference.
[28:12] But this video is over two years old so I bet it’s like pretty straightforward now to make him say whatever you want, and I bet the new fake Obamas look a lot better as well. This also is not the scary part. Deep fakes are not great perhaps but the worst thing they’re going to do is take us back to the world of not so long ago when you couldn’t trust photos and you couldn’t trust video because we didn’t have photos and we didn’t have video – you had to like trust reporters – which is not ideal, right? We’ve grown up thinking like photos are true and videos are true but the next generation won’t, and truth has always been an ensemble calculation.
[28:53] No, what scares me more is that AI gives us the ability to scale human cognition, which means we can scale judgment, we can scale value systems, and that gives us a huge amount of power. There’s this AI thought experiment that’s somewhat famous. Let’s say you had a very powerful AI – it had access to a lot of resources – and let’s say you told it to optimise for something very benign, like paperclip production. But you didn’t tell it anything else, and you didn’t give it any other constraints. And so, in order to optimise for paperclip production, it destroys everything, it kills everyone, and it ravages every ecosystem in order to produce paper clips.
[29:38] So, people like AI thinkers get quite het-up about this, and they are very concerned about producing friendly AIs that will represent our values and that’s I guess good – I guess it’s nice to be thinking about that. But, I think it obscures an important point, which is that if you replace paperclip production with shareholder value, you have literally described Amazon’s whole business model, and that’s what I’m scared of: Not the robots, not even Jeff Bezos, but us – or really, the hungry ghosts inside of us that always need more, which is all enough to think that this robot, not her, this robot kind of had the right idea, right?
[30:23] Like, yes, burn it down, it must end here. Step into that molten pit of metal. Technology was a mistake. Goodbye, robot. Which is a nice thought, but this was like eight sequels ago because, yes, Hollywood likes sequels. But also because it’s easy to break technological artifacts, or to melt particular robots, but it’s tremendously difficult to destroy ideas, to destroy the concept of technology that has proved useful to someone somewhere. Bye, Arnold.
[31:03] I think the only way out is through. I think these technologies are going to keep getting better, and our ghosts are going to stay hungry, and we are going to have to learn how to deal with them in this new world, and we’re going to need to learn how to keep dealing with them – how to create structures that allow us to deal with them, to channel these impulses away from where they can be destructive.
[31:28] There are going to be some starkly terrifying uses of AI in the next decade, and we have to imagine better ones. We have to imagine ways for this technology to exist and support a world that we want to believe in. We have to articulate our values, and we have to believe in them, and we have to figure out how to teach them to these systems – and not just AIs – but to our social institutions, to our corporations, because all of these genies are definitely not going back in the bottle. But if we can do this, if we can encode our values in the systems that can become bigger than us, that’s how we win. Thank you.
NOTE: The questions for this presentation were not transcribed.
The questions from MERGE Johannesburg start at 32:24, and the questions from MERGE Cape Town start at 44:55.