Principle #1 – The input and output system
All AI models are based on the common principle of inputs and outputs. Every single form of Artificial Intelligence, including Machine Learning models, chatbots, recommender systems, robots, and of course Reinforcement Learning models, takes something as input and returns something else as output.

In Reinforcement Learning, these inputs and outputs have specific names: the input is called the state, or input state, and the output is the action performed by the AI. In the middle, we have nothing other than a function that takes a state as input and returns an action as output. That function is called a policy.

As an example, consider a self-driving electric car, an autonomous vehicle (AV). Try to imagine what the input and output would be in that case. The input would be what the embedded computer vision system sees, and the output would be the next move of the car: accelerate, slow down, turn left, turn right, or brake. Note that the output at any time (t) could very well be several actions performed at the same time. For instance, the autonomous vehicle can accelerate while turning left. In the same way, the input at each time (t) can be composed of several elements: mainly the image observed by the computer vision system, but also some parameters of the AV such as the current speed, the amount of battery remaining, and so on.

That's the very first important principle in Artificial Intelligence: it is an intelligent system (a policy) that takes some elements as input, does its magic in the middle, and returns some actions to perform as output. Remember that the inputs are also called the states.
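To make the idea of a policy concrete, here is a minimal sketch in Python. The State fields, the thresholds, and the hand-written rules are all assumptions made for illustration; a real policy would be learned rather than coded by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    """Hypothetical input state for the autonomous vehicle example."""
    obstacle_ahead: bool   # stand-in for what the vision system sees
    speed_kmh: float       # current speed
    battery_pct: float     # battery remaining


def policy(state: State) -> List[str]:
    """Map a state to one or more actions performed at the same time."""
    if state.obstacle_ahead:
        return ["brake", "turn left"]          # several actions at once
    if state.speed_kmh < 50.0 and state.battery_pct > 10.0:
        return ["accelerate"]
    return ["slow down"]


print(policy(State(obstacle_ahead=True, speed_kmh=40.0, battery_pct=80.0)))
# ['brake', 'turn left']
```

The important part is the signature: a state goes in, and a list of actions comes out.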
Principle #2 – The reward
Every AI has its performance measured by a reward system, which is simply a metric that tells the AI how well it is doing over time. The simplest example is a binary reward: 0 or 1. Imagine an AI that has to guess an outcome. If the guess is right, the reward will be 1, and if the guess is wrong, the reward will be 0.

A reward doesn't have to be binary, however. It can be continuous. Imagine an AI playing a video game that keeps a score. Try to work out what the reward would be in that case. It could simply be the score; more precisely, the score would be the accumulated reward over time in one game, and the rewards could be defined as the derivative of that score, that is, the change in score from one time step to the next. This is one of the many ways we could define a reward system for that game. Different AIs will have different reward structures.

With that in mind, remember this as well: the ultimate goal of the AI will always be to maximize the accumulated reward over time. Those are the first two basic, but fundamental, principles of Artificial Intelligence as it exists today: the input and output system, and the reward.
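As a small illustration of defining the reward as the change in score, the sketch below uses a made-up list of running scores; the numbers are hypothetical.

```python
# Hypothetical running score observed after each time step of one game.
scores = [0, 10, 10, 25, 40]

# Reward at each step = change in score since the previous step.
rewards = [after - before for before, after in zip(scores, scores[1:])]

# The accumulated reward over the game recovers the final score.
accumulated = sum(rewards)

print(rewards)      # [10, 0, 15, 15]
print(accumulated)  # 40
```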
Principle #3 – The AI environment
The third principle is what we call an "AI environment." It is a very simple framework where we define three things at each time (t):
1. The input (the state)
2. The output (the action)
3. The reward (the performance metric)

For each and every AI based on Reinforcement Learning that is built today, we always define an environment composed of the preceding elements. It is, however, important to understand that there can be more than three elements in any given AI environment. For example, if we are building an AI to beat a car racing game, the environment will also contain the map and the gameplay of that game. Or, in the example of an autonomous vehicle, the environment will also contain all the roads along which the AI is driving and the objects that surround those roads. But what we will always find in common when building any AI are the three elements of state, action, and reward. The next principle, the Markov decision process, covers how they work in practice.
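One possible way to picture such an environment in code is sketched below. The class name, the one-line track, and the reward values are all invented for illustration; the point is only that the state, the actions, and the reward are defined explicitly, alongside an extra element (the map).

```python
class TrackEnvironment:
    """Hypothetical racing-style environment: state, actions, reward, plus a map."""

    ACTIONS = ("forward", "jump")   # the outputs the AI can choose from

    def __init__(self):
        self.track = "..#.G"        # extra element: the map ('#' obstacle, 'G' goal)
        self.position = 0           # where the car currently is

    def state(self):
        """The input: current position and the cell just ahead."""
        ahead = self.track[min(self.position + 1, len(self.track) - 1)]
        return self.position, ahead

    def reward(self):
        """The performance metric for the cell the car is on."""
        cell = self.track[self.position]
        return {"G": 10.0, "#": -10.0}.get(cell, -1.0)


env = TrackEnvironment()
print(env.state(), env.ACTIONS, env.reward())
# (0, '.') ('forward', 'jump') -1.0
```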
Principle #4 – The Markov decision process
The Markov decision process, or MDP, is simply a process that models how the AI interacts with the environment over time. The process starts at t = 0, and then, at each next iteration, meaning at t = 1, t = 2, … t = n units of time (where the unit can be anything, for example, 1 second), the AI follows the same format of transition:
1. The AI observes the current state.
2. The AI performs the action.
3. The AI receives the reward.
4. The AI enters the following state.

The goal of the AI is always the same in Reinforcement Learning: it is to maximize the accumulated reward over time, that is, the sum of all the rewards received at each transition.
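The four-step transition can be sketched as a simple loop. The counter-style dynamics in step() and the action names are assumptions chosen to keep the example tiny; only the shape of the loop matters.

```python
def step(state, action):
    """Hypothetical dynamics: the state is a counter; 'up' earns 1, 'down' earns 0."""
    reward = 1.0 if action == "up" else 0.0
    next_state = state + 1 if action == "up" else state - 1
    return reward, next_state


def run(policy, n_steps, state=0):
    """Run n_steps transitions and return the accumulated reward."""
    total = 0.0
    for _t in range(n_steps):
        action = policy(state)                # 1. observe the state, 2. perform the action
        reward, state = step(state, action)   # 3. receive the reward, 4. enter the next state
        total += reward                       # accumulate the reward over time
    return total


print(run(lambda s: "up", n_steps=5))  # 5.0
```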
Principle #5 – Training and inference
The final principle to understand is the difference between training and inference. When building an AI, there is a time for training mode, and a separate time for inference mode.
Training mode
Now we understand, from the first three principles, that the very first stage of building an AI is to build an environment in which the input states, the output actions, and a system of rewards are clearly defined. From the fourth principle, we also understand that inside this environment we will build an AI to interact with it and maximize the total reward accumulated over time.

To put it simply, there will be a preliminary (and long) period of time during which the AI is trained, called training. During that time, the AI tries to accomplish a certain goal over and over again until it succeeds. After each attempt, the parameters of the AI model are modified in order to do better at the next attempt.

For example, let's say we're building an autonomous vehicle and we want it to go from point A to point B. Let's also imagine that there are some obstacles that we want our self-driving car to avoid. Here is how the training process happens:
1. We choose an AI model, which can be Thompson Sampling, Q-learning, deep Q-learning, or even deep convolutional Q-learning.
2. We initialize the parameters of the model.
3. Our AI tries to go from A to B (by observing the states and performing its actions). During this first attempt, the closer it gets to B, the higher the reward we give to the AI. If it fails to reach B or hits an obstacle, we give the AI a very bad reward. If it manages to reach B without hitting any obstacle, we give the AI an extremely good reward. It's just like training a dog to sit: we give the dog a treat or say "good girl" (positive reward) if the dog sits, and we give the dog whatever small punishment we need to if she disobeys (negative reward). That process is training, and it works the same way in Reinforcement Learning.
4. At the end of the attempt (also called an episode), we modify the parameters of the model in order to do better next time. The parameters are modified intelligently, either iteratively through equations (Q-learning), or by using Machine Learning and Deep Learning techniques such as stochastic gradient descent or backpropagation.
5. We repeat stages 3 and 4 again and again until we reach the desired performance; that is, until we have our fully autonomous vehicle!
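The training stages above can be sketched with tabular Q-learning on a toy problem. Everything specific here is an assumption for illustration: the five-cell corridor standing in for the road from A to B, the reward values, the hyperparameters, and the absence of obstacles. Note also that plain Q-learning adjusts its parameters after every move rather than once at the end of each episode.

```python
import random

N_CELLS = 5                   # cell 0 is A, cell 4 is B (hypothetical corridor)
ACTIONS = ("left", "right")


def step(state, action):
    """Move one cell; reaching B pays +10, every other move costs 1."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    done = nxt == N_CELLS - 1
    return nxt, (10.0 if done else -1.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Stage 2: initialize the parameters (here, a table of Q-values).
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _episode in range(episodes):          # Stage 5: repeat the attempts
        state = 0
        for _move in range(100):              # Stage 3: one attempt, capped at 100 moves
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore a random action
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Stage 4: nudge the parameter toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
            if done:
                break
    return q


q = train()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_CELLS - 1)]
print(greedy)
```

After training, the greedy action in every cell before B should be "right", which is the behaviour the rewards were designed to encourage.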
Inference mode
Inference mode simply comes after our AI is fully trained and ready to perform well. It consists of interacting with the environment by performing the actions needed to accomplish the goal the AI was trained to achieve in training mode. In inference mode, no parameters are modified at the end of each episode.

One of my clients asked me to build an AI to optimize the flows in a smart grid. First, we entered an R&D phase during which I trained their AI to optimize flows (training mode). Once the AI reached a good level of performance, it was ready for implementation, where it observes the current states of the grid and performs the actions it has been trained to do (inference mode).

Sometimes, the environment is subject to change, in which case we have to alternate quickly between training and inference modes so that our AI can adapt to the new changes in the environment. An even better solution is to train our AI model every day, and go into inference mode with the most recently trained model.
|||
First introduced by Ioffe and Szegedy in their 2015 paper, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, batch normalization layers (or BN for short) are used to normalize the activations of a given input volume before passing it into the next layer in the network. If we consider x to be our mini-batch of activations, then we can compute the normalized xˆ as xˆ = (x − µβ) / √(σβ² + ε), where µβ and σβ² are the mean and variance computed over the mini-batch. We set ε equal to a small positive value such as 1e-7 to avoid taking the square root of zero. Applying this equation implies that the activations leaving a batch normalization layer will have approximately zero mean and unit variance (i.e., zero-centered). At testing time, we replace the mini-batch µβ and σβ with running averages of µβ and σβ computed during the training process.
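The training-time and testing-time behaviour described above can be sketched for a single activation as follows. This is a simplified illustration, not a reference implementation: the class name is invented, the momentum-based running average is an assumed way of accumulating the statistics, and the learned scale and shift parameters that full batch normalization layers include are left out.

```python
import math


class BatchNorm1D:
    """Sketch of batch normalization for one activation (no learned scale/shift)."""

    def __init__(self, eps=1e-7, momentum=0.9):
        self.eps = eps
        self.momentum = momentum      # assumed smoothing factor for the running averages
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Statistics of the current mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Keep running averages for use at testing time.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # At testing time, use the running averages instead.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = BatchNorm1D()
out = bn([2.0, 4.0, 6.0, 8.0], training=True)
print(out)                               # roughly zero mean and unit variance
print(bn([5.0], training=False))         # normalized with the running averages
```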
Using these running averages ensures that we can pass images through our neural network and still obtain accurate predictions without being biased by the µβ and σβ from the final mini-batch passed through the neural network at training time.

Batch normalization has been shown to be extremely effective at reducing the number of epochs it takes to train a neural network. Batch normalization also has the added benefit of helping "stabilize" training, allowing for a larger variety of learning rates and regularization strengths. Using batch normalization doesn't alleviate the need to tune these parameters, of course, but it does make the learning rate and regularization strength less volatile and more straightforward to tune. We'll also tend to notice lower final loss and a more stable loss curve when using batch normalization in neural networks.

The biggest drawback of batch normalization is that it can actually slow down the wall-clock time it takes to train a neural network by 2-3x (even though we'll need fewer epochs to obtain reasonable accuracy), due to the computation of per-batch statistics and normalization. That said, using batch normalization makes a significant difference in nearly every situation.
|||
Gradient descent is a first-order optimization algorithm that can be used to learn a set of classifier weights for a parameterized neural network. However, the "vanilla" implementation of gradient descent can be prohibitively slow to run on large datasets; in fact, it can even be considered computationally wasteful. Instead, we should apply Stochastic Gradient Descent (SGD), a simple modification to the standard gradient descent algorithm that computes the gradient and updates the weight matrix W on small batches of training data, rather than the entire training set.
While this modification leads to "more noisy" updates, it also allows us to take more steps along the gradient (one step per batch versus one step per epoch), ultimately leading to faster convergence without negatively affecting loss and classification accuracy. SGD is arguably the most important algorithm when it comes to training deep neural networks. Even though the original incarnation of SGD was introduced over 57 years ago, it is still the engine that enables us to train large neural networks to learn patterns from data points.

Reviewing the vanilla gradient descent algorithm, it should be (somewhat) obvious that the method will run very slowly on large datasets. The reason for this slowness is that each iteration of gradient descent requires us to compute a prediction for each training point in our training data before we are allowed to update our weight matrix. For image datasets such as ImageNet, where we have over 1.2 million training images, this computation can take a long time. It also turns out that computing predictions for every training point before taking a step along our weight matrix is computationally wasteful and does little to help model convergence. Instead, what we should do is batch our updates.

We can transform vanilla gradient descent into SGD by adding one extra function call to the pseudocode. The only difference between vanilla gradient descent and SGD is the addition of the next_training_batch function. Instead of computing our gradient over the entire dataset, we sample our data, yielding a batch. We evaluate the gradient on the batch, and update our weight matrix W. From an implementation perspective, we also randomize the training samples before applying SGD, since the algorithm is sensitive to batches.

Looking at the pseudocode for SGD, you'll immediately notice a new parameter: the batch size.
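A minimal Python sketch of this idea is given below. It is not the original pseudocode: the model is a one-variable linear fit in which the scalars w and b stand in for the weight matrix W, the data, learning rate, and epoch count are made up, and the body of next_training_batch is an assumed implementation of the function named in the text.

```python
import random


def next_training_batch(data, batch_size):
    """Yield successive mini-batches from data (a list of (x, y) pairs)."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]


def sgd(data, epochs=200, lr=0.05, batch_size=4, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0                       # scalar stand-ins for the weight matrix W
    data = list(data)
    for _epoch in range(epochs):
        rng.shuffle(data)                 # randomize the samples before batching
        for batch in next_training_batch(data, batch_size):
            # Gradient of the mean squared error, evaluated on this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w              # one update per batch, not per epoch
            b -= lr * grad_b
    return w, b


# Hypothetical noiseless data drawn from y = 3x + 1.
points = [(x / 10, 3 * x / 10 + 1) for x in range(-10, 11)]
w, b = sgd(points)
print(round(w, 2), round(b, 2))
```

Vanilla gradient descent corresponds to setting batch_size to len(data), so that each epoch performs a single update.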
In a "purist" implementation of SGD, the mini-batch size would be 1, implying that we would randomly sample one data point from the training set, compute the gradient, and update the parameters. However, we often use mini-batches larger than 1. Typical batch sizes include 32, 64, 128, and 256.

So, why bother using batch sizes > 1? To start, batch sizes > 1 help reduce variance in the parameter update, leading to a more stable convergence. Secondly, powers of two are often desirable for batch sizes as they allow internal linear algebra optimization libraries to be more efficient.

In general, the mini-batch size is not a hyperparameter to worry too much about. If using a GPU to train a neural network, we determine how many training examples will fit into GPU memory and then use the nearest power of two that still fits as the batch size. For CPU training, we typically use one of the batch sizes listed above to ensure we reap the benefits of linear algebra optimization libraries.
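The GPU rule of thumb can be written as a small helper. The function name is hypothetical, and it assumes the number of examples that fit in memory has already been measured some other way; "nearest power of two that fits" is read here as the largest power of two not exceeding that number.

```python
def largest_power_of_two_batch(max_examples_that_fit: int) -> int:
    """Return the largest power of two not exceeding the number of examples that fit."""
    if max_examples_that_fit < 1:
        raise ValueError("at least one example must fit in memory")
    # bit_length() - 1 is the exponent of the highest set bit.
    return 1 << (max_examples_that_fit.bit_length() - 1)


print(largest_power_of_two_batch(300))  # 256
print(largest_power_of_two_batch(64))   # 64
```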