4. Activation Functions#
So far, we have looked at examples where the output is a weighted sum of the inputs. Generally, you'd use classical regression software in that case rather than torch, since the classical software provides greater speed and interpretability for linear regression. The benefit of deep learning is its ability to estimate nonlinear functions, which are generally beyond the scope of linear regression.
Nonlinearity enters a deep learning model through activation functions, which are simple nonlinear functions that add complexity to the model. In this chapter, we will look at three of the most common activation functions: the rectified linear unit (ReLU), the sigmoid function, and the hyperbolic tangent (tanh). Let's use the ReLU activation function as a specific example to understand how activation functions work in a deep learning model.
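As a quick preview, here is a sketch (assuming torch has already been imported) of what each of these three functions does to a few sample values:

# apply the three activation functions to a few sample values
sample = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.relu(sample))     # negative inputs become 0, positive inputs pass through
print(torch.sigmoid(sample))  # each value is squashed into the range (0, 1)
print(torch.tanh(sample))     # each value is squashed into the range (-1, 1)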
4.1. ReLU Activation Function#
The rectified linear unit (ReLU) is a good illustration of where the name "activation function" comes from. It compares its input to a threshold (zero, for the standard ReLU) and sets the output to zero if the input is below that threshold; otherwise, the output is the same as the input. So, you could say that the output is active above the threshold and inactive below it. Here is what the ReLU function looks like as it maps its input to its output:
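The following sketch traces that mapping over a range of inputs (assuming torch, pandas as pd, and seaborn as sns are already imported, as in the plotting code later in this section):

# plot the ReLU mapping from input to output over a range of values
inputs = torch.linspace(-3, 3, steps=121)
relu_demo = pd.DataFrame({'input': inputs.numpy(), 'output': torch.relu(inputs).numpy()})
relu_plot = sns.lineplot(data=relu_demo, x='input', y='output')

The resulting line is flat at zero for negative inputs and rises along the identity line for positive inputs.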
The power of the ReLU activation function is that it introduces conditional, "if… then…" behavior into the model. This allows the model to behave differently depending on its inputs, rather than always giving the same weight to the same column of inputs. To take advantage of this conditionality, ReLU activation functions are typically placed between linear layers. Here is an example that turns the linear regression model for diabetes progression into a nonlinear deep learning model:
# define the torch model
class NonlinearRegressionModel(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim=1):
        super(NonlinearRegressionModel, self).__init__()
        # two linear layers: inputs -> hidden units -> a single output
        self.linear_1 = torch.nn.Linear(input_dim, hidden_dim)
        self.linear_2 = torch.nn.Linear(hidden_dim, output_dim)
        # ReLU takes no size argument; it is applied elementwise
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        result = self.linear_1(x)
        result = self.relu(result)
        result = self.linear_2(result)
        return result
Note that the model now has multiple linear layers that operate on different dimensions, with a ReLU layer between them. We can now create an instance of the nonlinear model and train it:
# instantiate the model
my_nonlin = NonlinearRegressionModel(input_dim=2, hidden_dim=5)

# define the loss function and the optimizer
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(my_nonlin.parameters(), lr=1)

# convert the data to tensors
x_data = torch.tensor(df[['bmi', 's6']].values, dtype=torch.float32)
y_data = torch.tensor(df['target'].values, dtype=torch.float32).unsqueeze(-1)

# train the model by repeatedly making small steps toward the solution
for epoch in range(40000):
    # Forward pass
    y_pred = my_nonlin(x_data)

    # Compute and print loss
    loss = criterion(y_pred, y_data)
    if epoch % 1000 == 0:
        print(f'Epoch: {epoch} | Loss: {loss.item()}')

    # Zero gradients, perform a backward pass, and update the parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Epoch: 0 | Loss: 29134.857421875
Epoch: 1000 | Loss: 3723.62890625
Epoch: 2000 | Loss: 3723.62890625
Epoch: 3000 | Loss: 3723.62890625
Epoch: 4000 | Loss: 3723.62841796875
Epoch: 5000 | Loss: 3723.62890625
Epoch: 6000 | Loss: 3723.628662109375
Epoch: 7000 | Loss: 3723.62890625
Epoch: 8000 | Loss: 3629.808349609375
Epoch: 9000 | Loss: 3631.943115234375
Epoch: 10000 | Loss: 3630.031982421875
Epoch: 11000 | Loss: 3629.775634765625
Epoch: 12000 | Loss: 3625.42822265625
Epoch: 13000 | Loss: 3625.53466796875
Epoch: 14000 | Loss: 3623.890625
Epoch: 15000 | Loss: 3624.219970703125
Epoch: 16000 | Loss: 3623.57568359375
Epoch: 17000 | Loss: 3625.212158203125
Epoch: 18000 | Loss: 3623.34521484375
Epoch: 19000 | Loss: 3623.49951171875
Epoch: 20000 | Loss: 3623.993896484375
Epoch: 21000 | Loss: 3623.3232421875
Epoch: 22000 | Loss: 3625.195068359375
Epoch: 23000 | Loss: 3622.60888671875
Epoch: 24000 | Loss: 3620.525390625
Epoch: 25000 | Loss: 3620.5341796875
Epoch: 26000 | Loss: 3620.423828125
Epoch: 27000 | Loss: 3620.857666015625
Epoch: 28000 | Loss: 3620.439697265625
Epoch: 29000 | Loss: 3622.558349609375
Epoch: 30000 | Loss: 3620.7119140625
Epoch: 31000 | Loss: 3620.645751953125
Epoch: 32000 | Loss: 3620.683349609375
Epoch: 33000 | Loss: 3620.650146484375
Epoch: 34000 | Loss: 3621.42822265625
Epoch: 35000 | Loss: 3620.89599609375
Epoch: 36000 | Loss: 3620.390380859375
Epoch: 37000 | Loss: 3621.003662109375
Epoch: 38000 | Loss: 3621.7138671875
Epoch: 39000 | Loss: 3620.70703125
The training loss appears stable after about 30,000 iterations, settling at a value of roughly 3,620.
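Because the loss is a mean squared error, taking its square root gives a rough sense of the typical prediction error in the original units of the target; a quick check of the final value:

# the square root of the MSE approximates the typical error in the target's units
print(f'Final MSE: {loss.item():.1f} | Root MSE: {loss.item() ** 0.5:.1f}')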
# print out the model parameters:
for param in my_nonlin.parameters():
    print(param)
Parameter containing:
tensor([[ -0.2555,  -0.5684],
        [ 80.6649,  32.7858],
        [ -5.8280,  -6.3134],
        [ 30.3064, -49.6007],
        [ 55.9338, 107.0210]], requires_grad=True)
Parameter containing:
tensor([-0.4508,  6.2032, -5.5858,  4.6875,  0.1389], requires_grad=True)
Parameter containing:
tensor([[-0.1531,  7.7079,  5.5726,  3.4063,  3.9982]], requires_grad=True)
Parameter containing:
tensor([76.4415], requires_grad=True)
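The first two tensors are the weight matrix (5 by 2) and bias vector (length 5) of linear_1, and the last two are the weight (1 by 5) and bias (length 1) of linear_2. As a sanity check, here is a sketch of how the model's predictions can be reproduced by hand from these tensors:

# reproduce the model's forward pass by hand using the learned parameters
with torch.no_grad():
    w1, b1 = my_nonlin.linear_1.weight, my_nonlin.linear_1.bias  # shapes (5, 2) and (5,)
    w2, b2 = my_nonlin.linear_2.weight, my_nonlin.linear_2.bias  # shapes (1, 5) and (1,)
    hidden = torch.relu(x_data @ w1.T + b1)  # first linear layer followed by ReLU
    manual_pred = hidden @ w2.T + b2         # second linear layer
    print(torch.allclose(manual_pred, my_nonlin(x_data)))  # should print True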
# collect the fitted values and the observed targets in one data frame
fit_nonlin = pd.DataFrame({'fitted': y_pred.squeeze(1).detach().numpy()})
fit_nonlin['target'] = df['target']

# plot model fits
fit_plot = sns.regplot(data=fit_nonlin, x="fitted", y="target")
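Finally, the trained model can generate predictions for new inputs the same way it did during training, just with gradient tracking turned off. As a sketch, here is a prediction for a single observation (reusing the first row of x_data in place of genuinely new data):

# predict progression for a single observation with gradients disabled
with torch.no_grad():
    print(my_nonlin(x_data[:1]))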
