Running handwritten digit recognition on K1 [1]

This is the first article in the K1 TensorFlow Lite Getting Started Tutorial. It shows how to build a simple network from scratch and train it with TensorFlow on the MNIST dataset to recognize handwritten digits.

MNIST Dataset

Introduction

MNIST is an entry-level computer vision dataset of handwritten digits. Its role in machine learning is comparable to printing "Hello World" when learning Python. The official website is "THE MNIST DATABASE of handwritten digits".

The dataset contains the following four parts:

train-images-idx3-ubyte.gz: training set images (60,000)
train-labels-idx1-ubyte.gz: training set labels (60,000)
t10k-images-idx3-ubyte.gz: test set images (10,000)
t10k-labels-idx1-ubyte.gz: test set labels (10,000)
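
Each of these files is a gzip-compressed IDX file: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions) followed by one big-endian 32-bit size per dimension and then the raw data. As a minimal sketch, the header can be read with the standard library alone. The helper name `read_idx_header` and the tiny synthetic buffer standing in for a real download are assumptions made for illustration.

```python
import gzip
import io
import struct


def read_idx_header(stream):
    """Return (type_code, dims) read from the start of an IDX stream."""
    # Magic number: two zero bytes, one data-type byte, one dimension-count byte
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", stream.read(4))
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file")
    # Each dimension size is a big-endian unsigned 32-bit integer
    dims = struct.unpack(">" + "I" * ndim, stream.read(4 * ndim))
    return type_code, dims


# Synthetic stand-in for an images file: header plus 2 blank 28x28 images
header = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 2, 28, 28)
fake_file = gzip.compress(header + bytes(2 * 28 * 28))

with gzip.open(io.BytesIO(fake_file), "rb") as f:
    type_code, dims = read_idx_header(f)

print(type_code, dims)  # 8 (2, 28, 28)
```

The type code 0x08 means unsigned bytes. For the real training images file the dimensions would be (60000, 28, 28).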

Images and labels

Each image in the MNIST dataset is 28×28 pixels in size, so a 28×28 array can be used to represent an image.

A label can be represented by an array of length 10; this encoding is called one-hot encoding.

One-hot encoding

One-hot encoding uses N bits to represent N states, and only one of them is valid at any time.

Example of using one-hot encoding:

Gender:
[0, 1] represents female, [1, 0] represents male
Digits 0 - 9:
[0,0,0,0,0,0,0,0,0,1] represents 9, [0,1,0,0,0,0,0,0,0,0] represents 1
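
The encoding above can be sketched in a few lines of plain Python. The helper name `one_hot` is an illustrative assumption, not a function from the tutorial's source code.

```python
def one_hot(index, num_classes):
    """Return a list of length num_classes with a single 1 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    vec = [0] * num_classes
    vec[index] = 1
    return vec


print(one_hot(9, 10))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(one_hot(1, 10))  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Ten digit classes need ten positions, so the vector for 9 has its 1 in the last (tenth) slot.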

The advantages of one-hot encoding are:

It can handle non-continuous (categorical) features.
It also expands the features to a certain extent. For example, gender is a single feature, but after encoding it becomes two features: male and female.

In neural networks, one-hot encoding is quite fault-tolerant. For example, if the output of a neural network is [0, 0.1, 0.2, 0.7, 0, 0, 0, 0, 0, 0], converting it to one-hot encoding gives the number 3: the position with the largest value becomes 1 and the rest become 0. The output [0, 0.1, 0.4, 0.5, 0, 0, 0, 0, 0, 0] also represents the number 3.

NumPy provides a function, numpy.argmax(), that returns the index of the maximum value.
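
To make the "largest value becomes 1" rule concrete, here is a minimal pure-Python sketch. The helper name `harden` is hypothetical; its first line picks the index of the largest value, returning the first such index when there is a tie.

```python
def harden(scores):
    """Turn a score vector into a one-hot vector by keeping only its largest entry."""
    # Index of the largest value (the first one if several are equal)
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(scores))]


a = harden([0, 0.1, 0.2, 0.7, 0, 0, 0, 0, 0, 0])
b = harden([0, 0.1, 0.4, 0.5, 0, 0, 0, 0, 0, 0])
print(a)                       # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(a.index(1), b.index(1))  # 3 3
```

Both outputs map to the digit 3 even though the network is much less certain in the second case.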

Training the model

Environment Installation

Install TensorFlow 2.*; the latest version at the time of writing is 2.17.0.

pip install tensorflow

Note: The source code can be downloaded from bit-brick's GitHub.

Model definition (train.py)

The first half of the model definition mainly uses the Conv2D (convolution) and MaxPooling2D (pooling) layers provided by keras.layers.

The input of a CNN is a tensor with dimensions (image_height, image_width, color_channels). MNIST images are grayscale, so there is only one color channel. Color images generally have 3 channels (R, G, B). Readers familiar with web front-end development may know that some images have 4 channels (R, G, B, A), where A represents transparency. For the MNIST dataset, the input tensor dimension is (28, 28, 1), which is passed to the first layer of the network through the input_shape parameter.

import os
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

class CNN:
    def __init__(self):
        self.model = models.Sequential([
            # The first layer is a convolutional layer with 32 3x3 convolution kernels; 28x28 is the size of the input images
            layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
            # The second layer is the maximum pooling layer, using a 2x2 pooling window
            layers.MaxPooling2D((2, 2)),
            # The third layer is another convolutional layer, using 64 3x3 convolution kernels
            layers.Conv2D(64, (3, 3), activation='relu'),
            # The fourth layer is another max pooling layer
            layers.MaxPooling2D((2, 2)),
            # The fifth layer is another convolutional layer, using 64 3x3 convolution kernels
            layers.Conv2D(64, (3, 3), activation='relu'),
            # The sixth layer is the flattening layer, which expands the feature map into a one-dimensional vector
            layers.Flatten(),
            # Two fully connected layers: 64 hidden units, then 10 outputs (one per digit)
            layers.Dense(64, activation='relu'),
            layers.Dense(10, activation='softmax')
        ])
        self.model.summary()

model.summary() prints the structure of the model we defined.

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                36928     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________

We can see that the output of each Conv2D and MaxPooling2D layer is a three-dimensional tensor (height, width, channels), not counting the batch dimension shown as None. The height and width gradually decrease. The number of output channels is controlled by the first parameter (for example, 32 or 64). Because the height and width shrink, the later layers can afford more channels without a proportional increase in computation.
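
The shrinking height and width in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in the model: stride-1 convolutions without padding, and pooling with a stride equal to the window size. The helper names `conv_out` and `pool_out` are illustrative.

```python
def conv_out(size, kernel=3):
    # Unpadded ('valid') convolution with stride 1
    return size - kernel + 1


def pool_out(size, window=2):
    # Pooling with stride equal to the window; a leftover row or column is dropped
    return size // window


size = 28
sizes = [size]
for step in (conv_out, pool_out, conv_out, pool_out, conv_out):
    size = step(size)
    sizes.append(size)

print(sizes)  # [28, 26, 13, 11, 5, 3]
```

These values match the height and width columns of the summary: 26, 13, 11, 5 and 3.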

The second half of the model defines the output tensor. layers.Flatten converts the three-dimensional tensor into a one-dimensional vector. The tensor has dimensions (3, 3, 64) before flattening and becomes a vector of length 576 (3 × 3 × 64) afterwards. Two layers.Dense layers then form the fully connected part, reducing the length of the vector from 576 to 64, and then to 10.
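
As a cross-check on the summary table, the flattened length and the parameter counts can be computed directly. This is a sketch with hypothetical helper names; each filter or unit has one weight per input value plus one bias.

```python
def conv_params(kernel, in_channels, filters):
    # kernel*kernel*in_channels weights plus one bias, for every filter
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    # one weight per input plus one bias, for every unit
    return (inputs + 1) * units


flat = 3 * 3 * 64  # length of the vector produced by Flatten
counts = [
    conv_params(3, 1, 32),   # conv2d
    conv_params(3, 32, 64),  # conv2d_1
    conv_params(3, 64, 64),  # conv2d_2
    dense_params(flat, 64),  # dense
    dense_params(64, 10),    # dense_1
]

print(flat, counts, sum(counts))
# 576 [320, 18496, 36928, 36928, 650] 93322
```

The total of 93,322 agrees with the "Total params" line printed by model.summary().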

The second half is equivalent to an ordinary fully connected neural network with 576 inputs, one hidden layer of 64 units, and 10 outputs. The activation function of the last layer is softmax, and the 10 outputs correspond exactly to the ten digits 0-9.
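
For readers who have not met softmax before: it turns the 10 raw outputs into positive values that sum to 1, so they can be read as probabilities. The following is a minimal pure-Python sketch, not the Keras implementation, and the input values are made up for illustration.

```python
import math


def softmax(logits):
    # Subtracting the maximum avoids overflow and does not change the result
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last layer for one image
probs = softmax([1.0, 2.0, 0.5, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
largest = probs.index(max(probs))
print(largest)  # 3
```

Softmax preserves the ordering of its inputs, so the largest raw output is still the largest probability.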

The index of the maximum value is the predicted digit, which can be easily calculated using NumPy:

import numpy as np

y1 = [0, 0.8, 0.1, 0.1, 0, 0, 0, 0, 0, 0]
y2 = [0, 0.1, 0.1, 0.1, 0.5, 0, 0.2, 0, 0, 0]
np.argmax(y1) # 1
np.argmax(y2) # 4

MNIST dataset preprocessing (train.py)

class DataSource:
    def __init__(self):
        (train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()
        # 60,000 training images, 10,000 test images, pixel values are mapped to between 0 and 1
        train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
        test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

        self.train_images, self.train_labels = train_images, train_labels
        self.test_images, self.test_labels = test_images, test_labels

Train and save training results (train.py)

class Train:
    def __init__(self):
        self.cnn = CNN()
        self.data = DataSource()

    def train(self):
        self.cnn.model.compile(optimizer='adam',
                               loss='sparse_categorical_crossentropy',
                               metrics=['accuracy'])
        self.cnn.model.fit(self.data.train_images, self.data.train_labels,
                           epochs=5)

        test_loss, test_acc = self.cnn.model.evaluate(self.data.test_images, self.data.test_labels)
        print(f"Accuracy: {test_acc:.4f}, a total of {len(self.data.test_labels)} pictures were tested")
        
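        # Save the entire model (architecture and weights) in HDF5 format
        self.cnn.model.save('mnist_model.h5')


if __name__ == "__main__":
    app = Train()
    app.train()
    # Conversion to TFLite is covered in the next article. conver_to_tflite is
    # not defined in this file, so the call is left commented out here.
    # conver_to_tflite('mnist_model.h5', 'mnist_model.tflite')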

After execution, a model file named mnist_model.h5 will be generated in the current path.
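
The training code passes the labels as integers (0-9) rather than as one-hot vectors, which is why the loss is 'sparse_categorical_crossentropy'. For a single sample, that loss is the negative logarithm of the probability the model assigns to the true digit. Below is a simplified pure-Python sketch of the idea; it omits details of the Keras implementation such as clipping and batch averaging.

```python
import math


def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one sample: -log of the probability given to the true class."""
    return -math.log(probabilities[label])


y1 = [0, 0.8, 0.1, 0.1, 0, 0, 0, 0, 0, 0]
good = sparse_categorical_crossentropy(1, y1)  # the true digit is 1
bad = sparse_categorical_crossentropy(2, y1)   # the true digit is 2

print(round(good, 4), round(bad, 4))  # 0.2231 2.3026
```

A confident correct prediction gives a loss near 0, while a low probability for the true digit gives a large loss.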

Image prediction (predict_h5.py)

The following code predicts the digit in an image and is used to verify that the training results are correct.

import os

import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model


class Predict(object):
    def __init__(self):
        print("Current working directory:", os.getcwd())
        # The whole model (architecture and weights) was saved in HDF5 format,
        # so it can be loaded directly from the file path
        model_path = os.path.join('./', 'mnist_model.h5')
        self.model = load_model(model_path)

    def predict(self, image_path):
        # Read the image as 8-bit grayscale; it is expected to be 28x28 pixels
        img = Image.open(image_path).convert('L')
        # Scale the pixel values to between 0 and 1
        img = np.asarray(img, dtype='float32').reshape((28, 28, 1)) / 255.
        # Invert the colors (MNIST digits are light on a dark background)
        # and wrap the image in a batch of size 1
        x = np.array([1 - img])

        # API refer: https://keras.io/models/model/
        y = self.model.predict(x)

        # Because x contains only one picture, just take y[0]
        # np.argmax() returns the index of the maximum value, which is the predicted digit
        print(image_path)
        print(y[0])
        print('        -> Predict digit', np.argmax(y[0]))


if __name__ == "__main__":
    app = Predict()
    app.predict('./test_images/0.png')
    app.predict('./test_images/1.png')
    app.predict('./test_images/4.png')

The execution results are as follows:

../test_images/0.png
[9.9999809e-01 1.5495613e-10 3.3248398e-08 2.2874749e-10 7.2154744e-09
 2.9732897e-10 9.0776956e-07 8.8862202e-11 1.1108587e-07 7.9468083e-07]
        -> Predict digit 0
../test_images/1.png
[3.2026957e-08 9.9998009e-01 4.3477922e-08 3.4642572e-10 1.4215015e-05
 6.9246203e-10 1.2963224e-07 4.5330389e-06 8.9890926e-07 7.5559392e-08]
        -> Predict digit 1
../test_images/4.png
[1.46609270e-11 2.91387710e-06 2.11647162e-07 5.38430411e-09
 9.99984741e-01 2.79038481e-09 1.04211018e-09 1.61079342e-07
 1.04318104e-07 1.17996497e-05]
        -> Predict digit 4

So far, we have trained a model on the MNIST dataset and saved the model file. The next article introduces how to convert the trained model to the TFLite format so that it can be used on K1.
