This is part of the CNN Architectures series by Dimitris Katsios. Find all CNN Architectures online:

- Notebooks: MLT GitHub
- Video tutorials: YouTube
- Support MLT on Patreon

## ShuffleNet

We will use the tensorflow.keras Functional API to build ShuffleNet from the original paper: “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices” by Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun.

ShuffleNet: Video tutorial

In the paper we can read:

[i]“The first building block in each stage is applied with stride = 2. Other hyper-parameters within a stage stay the same, and for the next stage the output channels are doubled”.

[ii]“Similar to [9], we set the number of bottleneck channels to 1/4 of the output channels for each ShuffleNet unit”

[iii]“we add a Batch Normalization layer [15] after each of the convolutions to make end-to-end training easier.”

[iv]“Note that for Stage 2, we do not apply group convolution on the first pointwise layer because the number of input channels is relatively small.”

We will also make use of the following Table **[v]**:

as well the following Diagrams **[vi]**

_{Figure 2. ShuffleNet Units. a) bottleneck unit [9] with depthwise convolution (DWConv) [3, 12]; b) ShuffleNet unit with pointwise group convolution (GConv) and channel shuffle; c) ShuffleNet unit with stride = 2.}

_{Figure 1. Channel shuffle with two stacked group convolutions. GConv stands for group convolution. a) two stacked convolution layers with the same number of groups. Each output channel only relates to the input channels within the group. No cross talk; b) input and output channels are fully related when GConv2 takes data from different groups after GConv1; c) an equivalent implementation to b) using channel shuffle.}

## Network architecture

Based on **[v]** the model starts with a stem of Convolution-Max Pool and continues with a number of **Stages** before the final Global Pool-Fully Connected layers.

Each **Stage** consists of two parts:

- One
**Shufflenet block**with strides 2**[vi.b]** - a number of repeated
**Shufflenet blocks**with strides 1**[vi.c]**

Each one of the right most columns of **[v]** corresponds to a model architecture with different number of internal groups (g). In our case we are going to implement the “*g = 8*” model, however the code will be general enough to support any other combination of number of:

- groups
- stages
- repetitions per stage

### Shufflenet block

The Shufflenet block is the building block of this network. Similar to the ResNet block there are two variations of the block based on whether the spatial dimensions of the input tensor change (strides = 2) or not (strides = 1).

In the first case we apply a 3×3 Average Pool with strides 2 at the shortcut connection as depicted at **[vi.c]**.

The main branch of the block consists of:

- 1×1
**Group Convolution**with 1/4 filters (GConv) followed by Batch Normalization and ReLU (**[ii]**) **Channel Shuffle**operation- 3×3 DepthWise Convolution (with or w/o strides=2) followed by Batch Normalizaion
- 1×1
**Group Convolution**followed by Batch Normalizaion

The tensors of the main branch and the shortcut connection are then concatenated and a ReLU activation is applied to the output.

### Group Convolution

The idea of *Group Convolution* is to separate the input tensor to g sub-tensors each one with $1/g$ distinct channels of the initial tesnsor. Then we apply a 1×1 Convolution to each sub-tensor and finally we concatenate all the sub-tensors together (**[vii]**).

### Channel Shuffle

Channel shuffle is an operation of shuffling the channels of the input tensor as shown at **[vii.b,c]**. In order to shuffle the channels we

- reshape the input tensor:

from:

`width x height x channels`

to:

`width x height x groups x (channels/groups)`

- prermute the last two dimensions
- reshape the tensor to the original shape

A simple example of the results of this operation can be seen at the following application of the operation on a 6-element array

- reshape to groups x (n / groups) (groups=2)

2. prermute the dimensions

3. reshape to the original shape

## Workflow

We will:

- import the neccesary layers
- write a helper function for the
**Stage** - write a helper function for the
**Shufflenet block** - write a helper function for the
**Group Convolution** - write a helper function for the
**Channel Shuffle** - write the stem of the model
- use the helper function to write the main part of the model
- write the last part of the model and build it

### 1. Imports

**Code:**

from tensorflow.keras.layers import Input, Conv2D, DepthwiseConv2D, \ Dense, Concatenate, Add, ReLU, BatchNormalization, AvgPool2D, \ MaxPool2D, GlobalAvgPool2D, Reshape, Permute, Lambda

### 2. Stage

The Stage function will:

- take as inputs:
- a tensor (
)`x`

- the number of channels (also called filters) (
)`channels`

- the number of repetitions of the second part of the stage (
)`repetitions`

- the number of groups for the Group Convolution blocks (
)`groups`

- a tensor (
- run:
- apply a Shufflenet block with strides=2
- apply
times a Shufflenet block with strides=1`repetitions`

- return the tensor

**Code:**

def stage(x, channels, repetitions, groups): x = shufflenet_block(x, channels=channels, strides=2, groups=groups) for i in range(repetitions): x = shufflenet_block(x, channels=channels, strides=1, groups=groups) return x

### 3. Shufflenet block

The Shufflenet block will:

- take as inputs:
- a tensor (
)`tensor`

- the number of channels (
)`channels`

- the strides (
)`strides`

- the number of groups for the Group Convolution blocks (
)`groups`

- a tensor (
- run:
- apply a Group Convolution block with 1/4
channels followed by`channels`

*Batch Normalizaion-ReLU* - apply
to this tensor`Channel Shuffle`

- apply a
*Depthwise Convolution*layer followed by*Batch Normalizaion* - if
is 2:`strides`

- subtract from
the number of channels of`channels`

so that after the concatenation the output tensor will have`tensor`

channels`channels`

- subtract from
- apply a Group Convolution block with
channels followed by`channels`

*Batch Normalizaion* - if
is 1:`strides`

*add*this tensor with the input`tensor`

- else:
- apply a 3×3
*Average Pool*with strides 2 (**[vi]**) to the inputand`tensor`

*concatenate*it with this tensor

- apply a 3×3
- apply
*ReLU*activation to the tensor

- apply a Group Convolution block with 1/4
- return the tensor

Note that according to **[iv]** we should not apply Group Convolution to the first inupt (24 channels) and apply only the Convolution operation instead which we can code with a simple `if-else`

statement. However, for the sake of clarity of the code we ommit it.

**Code:**

def shufflenet_block(tensor, channels, strides, groups): x = gconv(tensor, channels=channels // 4, groups=groups) x = BatchNormalization()(x) x = ReLU()(x) x = channel_shuffle(x, groups) x = DepthwiseConv2D(kernel_size=3, strides=strides, padding='same')(x) x = BatchNormalization()(x) if strides == 2: channels = channels - tensor.get_shape().as_list()[-1] x = gconv(x, channels=channels, groups=groups) x = BatchNormalization()(x) if strides == 1: x = Add()([tensor, x]) else: avg = AvgPool2D(pool_size=3, strides=2, padding='same')(tensor) x = Concatenate()([avg, x]) output = ReLU()(x) return output

### 4. Group Convolution

The Group Convolution function will:

- take as inputs:
- a tensor (
)`tensor`

- the number of channels of the output tensor (
)`channels`

- the number of groups (
)`groups`

- a tensor (
- run:
- get the number of channels (
) of the input tensor using the get_shape() method`input_ch`

- calculate the number of channels per group (
) by dividing`group_ch`

by`input_ch`

`groups`

- calculate how many channels will have each group after the Convolution layer. It will be equal to
divided by`channels`

`groups`

- for every group:
- get the
which will be a sub-tensor of`group_tensor`

with specific channels`tensor`

- apply a 1×1 Convolution layer with
channels`output_ch`

- add the tensor to a list (
)`groups_list`

- get the
*Concatenate*all the tensors ofto one tensor`groups_list`

- get the number of channels (
- return the tensor

Note that there is a commented line in the code bellow. One can get a slice of a tensor by using the simple slicing notation `a[:, b:c, d:e]`

but the code takes too long to run (as it is in the case of tensorflow.slice()). By using a Lambda layer and applying it on the tensor we have the same result but much faster.

**Code:**

def gconv(tensor, channels, groups): input_ch = tensor.get_shape().as_list()[-1] group_ch = input_ch // groups output_ch = channels // groups groups_list = [] for i in range(groups): group_tensor = tensor[:, :, :, i * group_ch: (i+1) * group_ch] # group_tensor = Lambda(lambda x: x[:, :, :, i * group_ch: (i+1) * group_ch])(tensor) group_tensor = Conv2D(output_ch, 1)(group_tensor) groups_list.append(group_tensor) output = Concatenate()(groups_list) return output

### 5. Channel Shuffle

The Channel Shuffle function will:

- take as inputs:
- a tensor (
)`x`

- the number of groups (
)`groups`

- a tensor (
- run:
- get the dimensions (
) of the input tensor. Note that the first number of`width, height, channels`

`x.get_shape().as_list()`

will be the batch size. - calculate the number of channels per group (
)`group_ch`

- reshape
to`x`

x`width`

x`height`

x`group_ch`

`groups`

- permute the last two dimensions of the tensor (
x`group_ch`

->`groups`

x`groups`

)`group_ch`

- reshape
to its original shape (`x`

x`width`

x`height`

)`channels`

- get the dimensions (
- return the tensor

**Code:**

def channel_shuffle(x, groups): _, width, height, channels = x.get_shape().as_list() group_ch = channels // groups x = Reshape([width, height, group_ch, groups])(x) x = Permute([1, 2, 4, 3])(x) x = Reshape([width, height, channels])(x) return x

### 6. Stem of the model

Now we can start coding the model. We will start with the model’s stem. According to **[v]** the first layer of the model is a 3×3 Convolution layer with 24 filters followed by (**[iii]**) a BatchNormalization and a ReLU activation.

The next layer is a 3×3 Max Pool with strides 2.

**Code:**

input = Input([224, 224, 3]) x = Conv2D(filters=24, kernel_size=3, strides=2, padding='same')(input) x = BatchNormalization()(x) x = ReLU()(x) x = MaxPool2D(pool_size=3, strides=2, padding='same')(x)

### 7. Main part of the model

The main part of the model consists of ** Stage** blocks. We first define the hyperparameters

**,**

`repetitions`

**acoording to**

`initial_channels`

**[v]**and

**. Then for each number of repetitions we calculate the number of channels according to**

`groups`

**[i]**and apply the

`stage()`

function on the tensor.**Code:**

repetitions = 3, 7, 3 initial_channels = 384 groups = 8 for i, reps in enumerate(repetitions): channels = initial_channels * (2**i) x = stage(x, channels, reps, groups)

### 8. Rest of the model

The model closes with a Global Pool layer and a Fully Connected one with 1000 classes (**[v]**).

**Code:**

x = GlobalAvgPool2D()(x) output = Dense(1000, activation='softmax')(x) from tensorflow.keras import Model model = Model(input, output)

## Final code

**Code:**

from tensorflow.keras.layers import Input, Conv2D, DepthwiseConv2D, \ Dense, Concatenate, Add, ReLU, BatchNormalization, AvgPool2D, \ MaxPool2D, GlobalAvgPool2D, Reshape, Permute, Lambda def stage(x, channels, repetitions, groups): x = shufflenet_block(x, channels=channels, strides=2, groups=groups) for i in range(repetitions): x = shufflenet_block(x, channels=channels, strides=1, groups=groups) return x def shufflenet_block(tensor, channels, strides, groups): x = gconv(tensor, channels=channels // 4, groups=groups) x = BatchNormalization()(x) x = ReLU()(x) x = channel_shuffle(x, groups) x = DepthwiseConv2D(kernel_size=3, strides=strides, padding='same')(x) x = BatchNormalization()(x) if strides == 2: channels = channels - tensor.get_shape().as_list()[-1] x = gconv(x, channels=channels, groups=groups) x = BatchNormalization()(x) if strides == 1: x = Add()([tensor, x]) else: avg = AvgPool2D(pool_size=3, strides=2, padding='same')(tensor) x = Concatenate()([avg, x]) output = ReLU()(x) return output def gconv(tensor, channels, groups): input_ch = tensor.get_shape().as_list()[-1] group_ch = input_ch // groups output_ch = channels // groups groups_list = [] for i in range(groups): # group_tensor = tensor[:, :, :, i * group_ch: (i+1) * group_ch] group_tensor = Lambda(lambda x: x[:, :, :, i * group_ch: (i+1) * group_ch])(tensor) group_tensor = Conv2D(output_ch, 1)(group_tensor) groups_list.append(group_tensor) output = Concatenate()(groups_list) return output def channel_shuffle(x, groups): _, width, height, channels = x.get_shape().as_list() group_ch = channels // groups x = Reshape([width, height, group_ch, groups])(x) x = Permute([1, 2, 4, 3])(x) x = Reshape([width, height, channels])(x) return x input = Input([224, 224, 3]) x = Conv2D(filters=24, kernel_size=3, strides=2, padding='same')(input) x = BatchNormalization()(x) x = ReLU()(x) x = MaxPool2D(pool_size=3, strides=2, padding='same')(x) repetitions = 3, 7, 3 initial_channels = 384 groups = 8 for i, reps in enumerate(repetitions): channels = initial_channels * (2**i) x = stage(x, channels, reps, groups) x = GlobalAvgPool2D()(x) output = Dense(1000, activation='softmax')(x) from tensorflow.keras import Model model = Model(input, output)