import numpy as np
import torch

# Variable is deprecated; a tensor with requires_grad=True does the same job
var_x = x.clone().detach().requires_grad_(True)
for l in range(n):
    mx = layers[l](var_x)
    gradients = torch.autograd.grad(outputs=mx, inputs=var_x,
                                    grad_outputs=torch.ones(mx.size()).to(device),
                                    create_graph=True, retain_graph=True, only_inputs=True)[0]
    # max over the batch of the per-sample infinity norm of the gradient
    norm_gradient_velocity[l] = torch.max(torch.norm(gradients.view(gradients.shape[0], -1),
                                                     p=np.inf, dim=1))
Because this loop is slow to run during training, I was wondering if the same thing could be done using hooks.
Ok, so not linear at all. So you won't be able to do that.
In that case, I don’t think there is any other way to do this.
Hooks would only allow you to call a function during the backward pass; you would still have to perform all of these autograd.grad calls.
Sorry to bother you. Have you managed to get the gradients of the intermediate layers w.r.t. the input data using hooks in PyTorch?
I have also run into this issue: I want to calculate the Frobenius norm of the Jacobian matrix of every layer w.r.t. the input data in a network.
Thanks for your reply!
              I have found a solution.
My problem is to start the gradient computation from a specific filter in a specific layer with respect to the input image.
First, I assigned a forward hook to all conv layers in order to keep the output of each filter (its activation map). Second, I assigned a backward hook to all layers to keep their gradients during the backward pass.
Third, I fed the model an input image to obtain all activation maps. Then I called backward() on the norm of one of the activation maps (note that each filter's activation map plays the same role as a loss output here). Finally, I have all gradients in a list, where the last element of this list is the gradient w.r.t. the input image, which should have size (1, 3, 224, 224) for instance.
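A minimal sketch of the steps described above, using a toy CNN as a stand-in for the actual model (all names here are placeholders, not the poster's real code):

```python
import torch
import torch.nn as nn

# Toy network standing in for the poster's conv model (hypothetical)
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1),
)

activations = {}   # step 1: filter outputs captured on the forward pass
gradients = []     # step 2: gradients captured on the backward pass

def save_activation(name):
    def hook(module, inp, out):
        activations[name] = out
    return hook

def save_gradient(name):
    def hook(module, grad_input, grad_output):
        gradients.append((name, grad_output[0]))
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        module.register_forward_hook(save_activation(name))
        module.register_full_backward_hook(save_gradient(name))

# Step 3: forward pass to fill the activation maps
x = torch.randn(1, 3, 224, 224, requires_grad=True)
_ = model(x)

# Step 4: backward from the norm of one filter's activation map,
# treating it like a scalar loss (here filter 5 of the last conv layer)
activations["2"][0, 5].norm().backward()

print(x.grad.shape)  # torch.Size([1, 3, 224, 224])
```

The last `print` shows the gradient w.r.t. the input image mentioned above; the `gradients` list holds the per-layer gradients collected by the backward hooks.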
@shubham_vats You can either use .retain_grad() on the input value, but a more consistent way of getting this value is to use hooks. You can use a full backward hook (not a backward hook, as that's deprecated and gives the wrong result). Also, check that there's no use of in-place operations (for example with ReLU) nor of .detach(), which breaks the gradient flow of your computational graph.
@Avani_Gupta For an explanation of what grad_input and grad_output mean there’s an explanation on the forums here. The grad_output term is the gradient of the loss with respect to the output of that nn.Module.
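To make grad_input and grad_output concrete, here is a small sketch with a single linear layer (toy example, not from the thread):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
captured = {}

def hook(module, grad_input, grad_output):
    # grad_output[0]: gradient of the loss w.r.t. the layer's output
    # grad_input[0]:  gradient of the loss w.r.t. the layer's input
    captured["out"] = grad_output[0]
    captured["in"] = grad_input[0]

layer.register_full_backward_hook(hook)

x = torch.randn(3, 4, requires_grad=True)
loss = layer(x).sum()
loss.backward()

print(captured["out"].shape, captured["in"].shape)
# torch.Size([3, 2]) torch.Size([3, 4])
```

The shapes match the layer's output and input respectively, which is the relationship described above.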
@AlphaBetaGamma96 thank you. I'll look into this. I'm fairly new to PyTorch, and hooks didn't come naturally to me. I was hoping to use create_feature_extractor to get those intermediate outputs and do something with them, but apparently the grad attribute is what I need, for which retain_grad should be helpful. I also wanted to avoid hooks because, as per my understanding, I would have to retrain the model to use them? I am using load_state_dict to load a pretrained model for now.
@shubham_vats You won't have to retrain the model; you can just attach the hooks to the model, perform a forward pass, then call loss.backward(). That will trigger the full backward hooks and you'll get your derivatives (although it will be for all samples, so you'll need to take a mean over them to get the correct shape).
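Concretely, something along these lines should work with a model restored via load_state_dict (the model, checkpoint path, and layer selection below are placeholders, not the actual setup):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained model; no retraining is needed to use hooks
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
# model.load_state_dict(torch.load("checkpoint.pt"))  # hypothetical checkpoint

grads = {}

def make_hook(name):
    def hook(module, grad_input, grad_output):
        grads[name] = grad_output[0].detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_full_backward_hook(make_hook(name))

x = torch.randn(32, 10)
loss = model(x).mean()
loss.backward()   # triggers the full backward hooks

# grads["0"] holds dLoss/d(output of the first Linear) for all 32 samples;
# take a mean over the batch dimension if a single per-layer value is needed.
print(grads["0"].shape)  # torch.Size([32, 5])
```

The hooks are attached after loading the weights, so the stored gradients are available immediately after the first backward call.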