This is part of the CNN Architectures series by Dimitris Katsios. Find all CNN Architectures online:
- Notebooks: MLT GitHub
- Video tutorials: YouTube
- Support MLT on Patreon
ShuffleNet
We will use the tensorflow.keras Functional API to build ShuffleNet from the original paper: “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices” by Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun.
ShuffleNet: Video tutorial
In the paper we can read:
[i] “The first building block in each stage is applied with stride = 2. Other hyper-parameters within a stage stay the same, and for the next stage the output channels are doubled”.
[ii] “Similar to [9], we set the number of bottleneck channels to 1/4 of the output channels for each ShuffleNet unit”
[iii] “we add a Batch Normalization layer [15] after each of the convolutions to make end-to-end training easier.”
[iv] “Note that for Stage 2, we do not apply group convolution on the first pointwise layer because the number of input channels is relatively small.”
We will also make use of the following Table [v]:

as well as the following Diagrams, Figure 2 [vi] and Figure 1 [vii]:

Figure 2. ShuffleNet Units. a) bottleneck unit [9] with depthwise convolution (DWConv) [3, 12]; b) ShuffleNet unit with pointwise group convolution (GConv) and channel shuffle; c) ShuffleNet unit with stride = 2.
Figure 1. Channel shuffle with two stacked group convolutions. GConv stands for group convolution. a) two stacked convolution layers with the same number of groups. Each output channel only relates to the input channels within the group. No cross talk; b) input and output channels are fully related when GConv2 takes data from different groups after GConv1; c) an equivalent implementation to b) using channel shuffle.
Network architecture
Based on [v] the model starts with a stem of Convolution-Max Pool and continues with a number of Stages before the final Global Pool-Fully Connected layers.
Each Stage consists of two parts:
- One Shufflenet block with strides 2 [vi.c]
- a number of repeated Shufflenet blocks with strides 1 [vi.b]
Each of the rightmost columns of [v] corresponds to a model architecture with a different number of internal groups (g). In our case we are going to implement the “g = 8” model (a short sketch of its configuration follows the list below); however, the code will be general enough to support any other combination of number of:
- groups
- stages
- repetitions per stage
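For reference, here is a minimal sketch of the “g = 8” configuration we will use later in the notebook (the values below are the ones that appear in the final code):

groups = 8                 # number of groups in every Group Convolution
repetitions = 3, 7, 3      # stride-1 repeats per stage (each stage also starts with one stride-2 block)
initial_channels = 384     # output channels of the first stage; doubled at every following stage ([i])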
Shufflenet block
The Shufflenet block is the building block of this network. Similar to the ResNet block there are two variations of the block based on whether the spatial dimensions of the input tensor change (strides = 2) or not (strides = 1).
In the first case we apply a 3×3 Average Pool with strides 2 at the shortcut connection, as depicted in [vi.c].
The main branch of the block consists of:
- 1×1 Group Convolution with 1/4 filters (GConv) followed by Batch Normalization and ReLU ([ii])
- Channel Shuffle operation
- 3×3 Depthwise Convolution (with or without strides=2) followed by Batch Normalization
- 1×1 Group Convolution followed by Batch Normalization
The tensors of the main branch and the shortcut connection are then added (strides = 1) or concatenated (strides = 2), and a ReLU activation is applied to the output.
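To make the channel arithmetic of the strides = 2 case concrete, here is a small back-of-the-envelope check (the 240-channel input and 480-channel output are purely illustrative numbers):

in_ch, out_ch = 240, 480                        # illustrative input/output channels
bottleneck_ch = out_ch // 4                     # 120 channels in the first 1x1 GConv ([ii])
main_branch_ch = out_ch - in_ch                 # 240 channels out of the last 1x1 GConv
shortcut_ch = in_ch                             # 240 channels from the 3x3 AvgPool shortcut
assert main_branch_ch + shortcut_ch == out_ch   # concatenation yields the 480 target channels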
Group Convolution
The idea of Group Convolution is to split the input tensor into g sub-tensors, each with 1/g of the distinct channels of the initial tensor. We then apply a 1×1 Convolution to each sub-tensor and finally concatenate all the sub-tensors back together ([vii]).
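A minimal NumPy sketch of this grouping logic, shapes only (the sizes below, 8 input channels, 4 groups and 16 output channels, are illustrative; a 1×1 convolution acts as a per-pixel matrix multiplication over the channel axis):

import numpy as np

x = np.random.rand(1, 56, 56, 8)        # NHWC input with 8 channels
groups, out_channels = 4, 16
group_ch = x.shape[-1] // groups        # 2 input channels per group
output_ch = out_channels // groups      # 4 output channels per group

outputs = []
for i in range(groups):
    sub = x[..., i * group_ch:(i + 1) * group_ch]   # (1, 56, 56, 2) sub-tensor
    w = np.random.rand(group_ch, output_ch)         # weights of a 1x1 convolution
    outputs.append(sub @ w)                         # (1, 56, 56, 4)

y = np.concatenate(outputs, axis=-1)                # (1, 56, 56, 16)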
Channel Shuffle
Channel shuffle is an operation that shuffles the channels of the input tensor, as shown in [vii.b,c]. In order to shuffle the channels we:
- reshape the input tensor:
from:
width x height x channels
to:
width x height x (channels/groups) x groups
- permute the last two dimensions
- reshape the tensor back to its original shape
A simple example of this operation applied to a 6-element array (with groups=2), demonstrated with NumPy right after the list:
1. reshape to (n / groups) x groups
2. permute (transpose) the two dimensions
3. reshape back to the original shape
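A minimal NumPy sketch of these three steps (illustrative only; the actual Keras implementation comes later):

import numpy as np

a = np.arange(6)                          # [0 1 2 3 4 5]
groups = 2
a = a.reshape(len(a) // groups, groups)   # step 1: [[0 1] [2 3] [4 5]]
a = a.transpose(1, 0)                     # step 2: [[0 2 4] [1 3 5]]
a = a.reshape(-1)                         # step 3: [0 2 4 1 3 5]
print(a)                                  # each half now mixes channels from both original halves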
Workflow
We will:
- import the necessary layers
- write a helper function for the Stage
- write a helper function for the Shufflenet block
- write a helper function for the Group Convolution
- write a helper function for the Channel Shuffle
- write the stem of the model
- use the helper function to write the main part of the model
- write the last part of the model and build it
1. Imports
Code:
from tensorflow.keras.layers import Input, Conv2D, DepthwiseConv2D, \
    Dense, Concatenate, Add, ReLU, BatchNormalization, AvgPool2D, \
    MaxPool2D, GlobalAvgPool2D, Reshape, Permute, Lambda
2. Stage
The Stage function will:
- take as inputs:
  - a tensor (x)
  - the number of channels (also called filters) (channels)
  - the number of repetitions of the second part of the stage (repetitions)
  - the number of groups for the Group Convolution blocks (groups)
- run:
  - apply a Shufflenet block with strides=2
  - apply repetitions times a Shufflenet block with strides=1
- return the tensor
Code:
def stage(x, channels, repetitions, groups):
    x = shufflenet_block(x, channels=channels, strides=2, groups=groups)
    for i in range(repetitions):
        x = shufflenet_block(x, channels=channels, strides=1, groups=groups)
    return x
3. Shufflenet block
The Shufflenet block will:
- take as inputs:
  - a tensor (tensor)
  - the number of channels (channels)
  - the strides (strides)
  - the number of groups for the Group Convolution blocks (groups)
- run:
  - apply a Group Convolution block with channels/4 channels followed by Batch Normalization and ReLU
  - apply Channel Shuffle to this tensor
  - apply a Depthwise Convolution layer followed by Batch Normalization
  - if strides is 2:
    - subtract from channels the number of channels of tensor, so that after the concatenation the output tensor will have channels channels
  - apply a Group Convolution block with channels channels followed by Batch Normalization
  - if strides is 1:
    - add this tensor to the input tensor
  - else:
    - apply a 3×3 Average Pool with strides 2 ([vi]) to the input tensor and concatenate it with this tensor
  - apply a ReLU activation to the tensor
- return the tensor
Note that, according to [iv], we should not apply Group Convolution to the first input (24 channels) but only a plain Convolution instead, which we could code with a simple if-else statement. However, for the sake of clarity of the code we omit it.
Code:
def shufflenet_block(tensor, channels, strides, groups):
    x = gconv(tensor, channels=channels // 4, groups=groups)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = channel_shuffle(x, groups)
    x = DepthwiseConv2D(kernel_size=3, strides=strides, padding='same')(x)
    x = BatchNormalization()(x)

    if strides == 2:
        channels = channels - tensor.get_shape().as_list()[-1]
    x = gconv(x, channels=channels, groups=groups)
    x = BatchNormalization()(x)

    if strides == 1:
        x = Add()([tensor, x])
    else:
        avg = AvgPool2D(pool_size=3, strides=2, padding='same')(tensor)
        x = Concatenate()([avg, x])

    output = ReLU()(x)
    return output
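As a quick sanity check (this assumes the gconv() and channel_shuffle() helpers defined in the next two sections are already available; the 28×28×240 input and 480 output channels are illustrative values), a strides=2 block should halve the spatial dimensions and output exactly the requested number of channels:

inp = Input([28, 28, 240])
out = shufflenet_block(inp, channels=480, strides=2, groups=8)
print(out.shape)   # (None, 14, 14, 480)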
4. Group Convolution
The Group Convolution function will:
- take as inputs:
  - a tensor (tensor)
  - the number of channels of the output tensor (channels)
  - the number of groups (groups)
- run:
  - get the number of channels (input_ch) of the input tensor using the get_shape() method
  - calculate the number of channels per group (group_ch) by dividing input_ch by groups
  - calculate how many channels each group will have after the Convolution layer (output_ch); it is equal to channels divided by groups
  - for every group:
    - get the group_tensor, which will be a sub-tensor of tensor with the corresponding channels
    - apply a 1×1 Convolution layer with output_ch channels
    - add the tensor to a list (groups_list)
  - Concatenate all the tensors of groups_list into one tensor
- return the tensor
Note that there is a commented line in the code below. One can get a slice of a tensor by using the simple slicing notation a[:, b:c, d:e], but then the code takes too long to run (as is also the case with tensorflow.slice()). By using a Lambda layer and applying it to the tensor we get the same result, but much faster.
Code:
def gconv(tensor, channels, groups):
    input_ch = tensor.get_shape().as_list()[-1]
    group_ch = input_ch // groups
    output_ch = channels // groups
    groups_list = []

    for i in range(groups):
        # group_tensor = tensor[:, :, :, i * group_ch: (i+1) * group_ch]
        group_tensor = Lambda(lambda x: x[:, :, :, i * group_ch: (i+1) * group_ch])(tensor)
        group_tensor = Conv2D(output_ch, 1)(group_tensor)
        groups_list.append(group_tensor)

    output = Concatenate()(groups_list)
    return output
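For example (an illustrative shape check): a group convolution taking 240 channels to 480 with 8 groups splits the input into eight 30-channel slices and maps each one to 60 output channels before concatenating them:

t = Input([28, 28, 240])
y = gconv(t, channels=480, groups=8)   # 8 branches: 30 input channels -> 60 output channels each
print(y.shape)                         # (None, 28, 28, 480)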
5. Channel Shuffle
The Channel Shuffle function will:
- take as inputs:
  - a tensor (x)
  - the number of groups (groups)
- run:
  - get the dimensions (width, height, channels) of the input tensor; note that the first number of x.get_shape().as_list() will be the batch size
  - calculate the number of channels per group (group_ch)
  - reshape x to width x height x group_ch x groups
  - permute the last two dimensions of the tensor (group_ch x groups -> groups x group_ch)
  - reshape x back to its original shape (width x height x channels)
- return the tensor
Code:
def channel_shuffle(x, groups):
    _, width, height, channels = x.get_shape().as_list()
    group_ch = channels // groups

    x = Reshape([width, height, group_ch, groups])(x)
    x = Permute([1, 2, 4, 3])(x)
    x = Reshape([width, height, channels])(x)
    return x
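To check that the layer reproduces the 6-element example from the Channel Shuffle section, one can wrap it in a tiny model and feed it a 1×1 image with 6 channels (a minimal verification, not part of the final network):

import numpy as np
from tensorflow.keras import Model

inp = Input([1, 1, 6])
m = Model(inp, channel_shuffle(inp, groups=2))
print(m.predict(np.arange(6, dtype='float32').reshape(1, 1, 1, 6))[0, 0, 0])
# expected output: [0. 2. 4. 1. 3. 5.]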
6. Stem of the model
Now we can start coding the model. We will start with the model’s stem. According to [v], the first layer of the model is a 3×3 Convolution layer with 24 filters and strides 2, followed by ([iii]) a Batch Normalization layer and a ReLU activation.
The next layer is a 3×3 Max Pool with strides 2.
Code:
input = Input([224, 224, 3])

x = Conv2D(filters=24, kernel_size=3, strides=2, padding='same')(input)  # 112x112x24
x = BatchNormalization()(x)
x = ReLU()(x)
x = MaxPool2D(pool_size=3, strides=2, padding='same')(x)                 # 56x56x24
7. Main part of the model
The main part of the model consists of Stage blocks. We first define the hyperparameters repetitions, initial_channels (according to [v]) and groups. Then for each number of repetitions we calculate the number of channels according to [i] and apply the stage() function to the tensor.
Code:
repetitions = 3, 7, 3
initial_channels = 384
groups = 8

for i, reps in enumerate(repetitions):
    channels = initial_channels * (2**i)
    x = stage(x, channels, reps, groups)
8. Rest of the model
The model closes with a Global Pool layer and a Fully Connected one with 1000 classes ([v]).
Code:
x = GlobalAvgPool2D()(x)
output = Dense(1000, activation='softmax')(x)

from tensorflow.keras import Model
model = Model(input, output)
Final code
Code:
from tensorflow.keras.layers import Input, Conv2D, DepthwiseConv2D, \
    Dense, Concatenate, Add, ReLU, BatchNormalization, AvgPool2D, \
    MaxPool2D, GlobalAvgPool2D, Reshape, Permute, Lambda


def stage(x, channels, repetitions, groups):
    x = shufflenet_block(x, channels=channels, strides=2, groups=groups)
    for i in range(repetitions):
        x = shufflenet_block(x, channels=channels, strides=1, groups=groups)
    return x


def shufflenet_block(tensor, channels, strides, groups):
    x = gconv(tensor, channels=channels // 4, groups=groups)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = channel_shuffle(x, groups)
    x = DepthwiseConv2D(kernel_size=3, strides=strides, padding='same')(x)
    x = BatchNormalization()(x)

    if strides == 2:
        channels = channels - tensor.get_shape().as_list()[-1]
    x = gconv(x, channels=channels, groups=groups)
    x = BatchNormalization()(x)

    if strides == 1:
        x = Add()([tensor, x])
    else:
        avg = AvgPool2D(pool_size=3, strides=2, padding='same')(tensor)
        x = Concatenate()([avg, x])

    output = ReLU()(x)
    return output


def gconv(tensor, channels, groups):
    input_ch = tensor.get_shape().as_list()[-1]
    group_ch = input_ch // groups
    output_ch = channels // groups
    groups_list = []

    for i in range(groups):
        # group_tensor = tensor[:, :, :, i * group_ch: (i+1) * group_ch]
        group_tensor = Lambda(lambda x: x[:, :, :, i * group_ch: (i+1) * group_ch])(tensor)
        group_tensor = Conv2D(output_ch, 1)(group_tensor)
        groups_list.append(group_tensor)

    output = Concatenate()(groups_list)
    return output


def channel_shuffle(x, groups):
    _, width, height, channels = x.get_shape().as_list()
    group_ch = channels // groups

    x = Reshape([width, height, group_ch, groups])(x)
    x = Permute([1, 2, 4, 3])(x)
    x = Reshape([width, height, channels])(x)
    return x


input = Input([224, 224, 3])

x = Conv2D(filters=24, kernel_size=3, strides=2, padding='same')(input)
x = BatchNormalization()(x)
x = ReLU()(x)
x = MaxPool2D(pool_size=3, strides=2, padding='same')(x)

repetitions = 3, 7, 3
initial_channels = 384
groups = 8

for i, reps in enumerate(repetitions):
    channels = initial_channels * (2**i)
    x = stage(x, channels, reps, groups)

x = GlobalAvgPool2D()(x)
output = Dense(1000, activation='softmax')(x)

from tensorflow.keras import Model
model = Model(input, output)
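As a final check, one can print a layer-by-layer overview of the assembled network; the last stage should output 1536 channels (384 doubled twice, per [i]) and the classifier 1000 values:

model.summary()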