COMPUTER VISION FOR INVENTORY MANAGEMENT

Published in: Scientific journal «Интернаука» No. 19(242)
Journal section: 3. Information Technology
Article DOI: 10.32743/26870142.2022.19.242.339476
Bibliographic description:
Aminov S.B. COMPUTER VISION FOR INVENTORY MANAGEMENT // Интернаука: electronic scientific journal. 2022. No. 19(242). URL: https://internauka.org/journal/science/internauka/242 (accessed: 14.05.2024). DOI: 10.32743/26870142.2022.19.242.339476

COMPUTER VISION FOR INVENTORY MANAGEMENT

Aminov Sultan Bulatovich

student, Kazakh-British Technical University,

Kazakhstan, Almaty

 


 

ABSTRACT

Computer vision has been transforming inventory management for retailers in recent years. Nonetheless, the technologies currently deployed in retail stores leave considerable room for improvement, since processes such as fruit classification, product detection, and on-shelf availability (OSA) monitoring have not yet been fully automated. Demand for effective and customizable solutions is growing in the industry. In most cases, paying for fruits or vegetables in a shop requires them to be identified manually. This research presents a lightweight Convolutional Neural Network (CNN)-based image classification algorithm aimed at speeding up the checkout process in stores. A novel image dataset covering multiple types of fruits is introduced. Different input features are added to the CNN architecture to improve classification accuracy: a single RGB color, the RGB histogram, and the RGB centroid obtained by K-means clustering. The results show a classification accuracy of 95% for fruits without a plastic bag and 93% for fruits inside a plastic bag.


 

Keywords: fruit classification, convolutional neural networks, MobileNetV2, RGB histogram, K-means.


 

1. Introduction

Customers' purchases are processed by cashiers or self-service checkout devices in retail markets. The checkout time has already been reduced because most products have scannable barcodes. Fruits and vegetables, on the other hand, are frequently treated differently. The cashier or the customer must manually identify the type of product being purchased and search the system for it. With that in mind, my goal is to propose a preliminary approach to fruit classification in order to assess its suitability for such a use case.

Fruit classification is a difficult task because of the numerous variations that can occur. In general, there are two types of classification problems: i) classification of different types of fruits (e.g., distinguishing between oranges and apples) [2], and ii) classification of different varieties of the same fruit (e.g., differentiating among apple varieties such as Red, Pink Lady, Granny Smith, Golden, etc.) [4]. Even for the first type of problem, perfect classification is difficult to achieve owing to variations in shape, color, ripening stage, and other factors. Another issue, directly related to fruit purchases in retail stores, is that fruits may be packaged in plastic bags. This study focuses on the first type of classification (i.e., classification of different types of fruits), in which the fruits may or may not be inside a plastic bag.

Convolutional Neural Networks (CNN) [9] have recently advanced to the point where they are suitable for this application. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [7] was an annual competition that ran from 2010 to 2017, in which the goal was to recognize the various objects present in an image. CNNs have shown remarkable performance in image classification since 2012, when the ILSVRC winner, AlexNet, was a deep CNN trained on raw RGB pixel values. Besides AlexNet, a number of CNN architectures have been established over time, including LeNet, ZFNet, GoogLeNet, VGG, ResNet, YOLO, and MobileNetV2 [9]. ResNet won the ILSVRC in 2015, surpassing human-level accuracy (5% error) [7] for the first time with a 3% testing error.

This study presents a modified CNN architecture for classifying fruits based on MobileNetV2 [8], which adds different input features (besides the input images) to improve accuracy. These extra input features relate to the color of the fruits. Accordingly, experiments with a single RGB color, the RGB histogram, and the RGB centroid produced via K-means clustering are presented. I also built a new dataset covering three types of fruits (apples, oranges, and bananas), including fruits in transparent plastic bags. The results demonstrate that fruits without a plastic bag have a classification accuracy of 95%, whereas fruits inside a plastic bag have a classification accuracy of 93%.

The remainder of the paper is organized as follows. Section 2 reviews relevant fruit classification work. Section 3 describes the proposed classification method. Section 4 presents the experimental results and the classification accuracy. Finally, Section 5 concludes with an analysis of the findings and directions for future research.

2. Related Work

There are multiple fruit recognition and classification research projects with various targets and applications, two of them being agriculture and fruit harvesting. DeepFruits is a detector based on the Faster R-CNN architecture. Its model uses ImageNet for transfer learning and two types of input images: color (RGB) and near-infrared (NIR). The images show seven fruits still attached to their respective tree or plant, so this work is geared toward agricultural robots picking fruits and vegetables. Some of the photographs were taken by the authors, while others were found through Google Image searches. DeepCount is another Deep Neural Network (DNN)-based application for robotic agriculture, in which the authors propose a modified Inception-ResNet architecture; their study is limited to tomato images collected from Google Images [6]. Deep fruit detection for robotic orchard harvesting is another related application. Faster R-CNN is used in that study, and its performance is compared to that of other architectures such as VGG and ZFNet. The authors also investigate the number of training images, data augmentation, and transfer learning, using RGB photographs they captured themselves of three crops: apple, mango, and almond [1]. Finally, MangoYOLO is a CNN model that forecasts mango harvest. It is compared with other CNN designs such as Faster R-CNN with VGG and ZFNet, SSD, and YOLO. Using PASCAL VOC, COCO, and ImageNet, the authors investigate the number of training photos and transfer learning. They captured their own RGB images at night with a custom LED system installed on a farm vehicle in order to achieve constant lighting conditions [5]. Aspects these projects must deal with include working outdoors, varying lighting conditions, and the fact that the fruits and vegetables are still attached to the trees.

Aside from agriculture, the classification of fruits and vegetables can substantially assist retail applications. Two CNN designs, a light model with six CNN layers and a fine-tuned VGG-16 model, were proposed by Hossain et al. [3], who also collected photographs from the Internet to build their own dataset. A double-track technique based on two nine-layer CNNs was proposed in another study [4]: the first network takes photos with backgrounds as input, and the second takes a single fruit cropped from a region of interest. Rather than classifying fruit categories, they classified six apple varieties. Finally, Femling et al. [2] present a hardware system that can classify ten different varieties of fruits in retail shops. They employ a dataset made up of photographs obtained with their system's camera and taken from ImageNet, and, like this work, they use a CNN architecture based on MobileNet. It is worth noting that these previous works do not assume the fruits may be in plastic bags, whereas this paper does.

3. Methodology

The method I propose for tackling the fruit classification problem is presented in this section. The chosen CNN architectures and training methods are explained first, followed by a new dataset.

3.1 Data

Datasets are a crucial component of deep learning, so it is critical to choose the right input data for my objectives. There is an existing dataset called Fruits-360 that contains 28,736 training images and 9,673 testing images for fruit classification. It has 60 different fruit classes, some of which correspond to varieties of the same fruit (e.g., six varieties of apple). One of the dataset's primary flaws is that the images are small (100x100 pixels), making it difficult to distinguish between similar fruits. Furthermore, because the images have no background, models trained on them do not transfer well to real-world applications.

As a result, I chose to construct my own dataset. Since my main purpose was to simulate a retail setting, the fruits were laid on a stainless steel sheet and the images were taken from the top. Apples, oranges, and bananas were the fruits I picked to work with. I added variety to the dataset by photographing the fruits in various positions and rotations (see Fig. 1). I also took images of the fruits inside a bag, because they are often inside a transparent plastic bag during the checkout process. The photographs were captured with an iPhone 6's front camera. The dataset contains 443 images of apples (297 for training and 146 for testing), 363 images of oranges (242 for training and 121 for testing), and 231 images of bananas (156 for training and 75 for testing). In total, 1037 photos were collected, with 695 used for training and 342 for testing (see Table 1).

3.2 Selecting a CNN Architecture

The choice of CNN architecture is determined by the task to be solved; there is no one-size-fits-all solution, so picking the appropriate one becomes a challenge in itself. The Ensemble C model built by the WMW team, which won the final ILSVRC competition in 2017 [7], is a top-performing architecture, but it has the disadvantage of being computationally expensive for both training and inference. MobileNets, on the other hand, are CNN designs that address the problem of computational complexity by providing a more efficient and lightweight architecture that can run on mobile or embedded devices while still delivering high-level performance. I chose the MobileNetV2 [8] architecture because it is lightweight and robust, which is ideal for deployment in retail outlets.

3.3 Transfer Learning

Transfer learning is a deep learning technique in which a model created for one task is used as the foundation for a model for another. When the available dataset is insufficient, this strategy works well, and the model converges quickly. As a result, I used weights from a model trained on the ImageNet dataset to train MobileNetV2 with transfer learning.

Transfer learning can be applied in a variety of ways. I began by loading the pre-trained model and discarding its last layer, a dense layer with 1000 neurons that acts as a classifier over the preceding feature map. After discarding it, I froze the remaining layers (preventing their weights from being updated during training). Then, at the end of the network, I added a new dense layer with as many neurons as there are fruit classes to predict. This preserves all of the ImageNet model's features while adapting the classifier to the fruit classification task. The model was then trained for 20 epochs at a base learning rate of 0.0001. After the first 20 epochs, I unfroze the layers from the 100th to the last (155th) and trained the network for another 20 epochs, reducing the learning rate to one-tenth of the base learning rate in order to fine-tune the model. This shifts the weights from generic feature maps toward features specific to this dataset.
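The following is a minimal sketch of this two-stage procedure using the TensorFlow/Keras API (the library used in this study). The input size, pooling layer, and loss function are my assumptions; the layer indices, epoch counts, and learning rates follow the text.

```python
import tensorflow as tf

NUM_CLASSES = 3      # apple, orange, banana
BASE_LR = 1e-4       # base learning rate from the text

# Load MobileNetV2 pre-trained on ImageNet, dropping the 1000-neuron head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # stage 1: freeze all pre-trained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new classifier
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=BASE_LR),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=test_ds, epochs=20)

# Stage 2: unfreeze layers 100 onward and fine-tune at 0.1x the base rate.
base.trainable = True
for layer in base.layers[:100]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=BASE_LR * 0.1),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=test_ds, epochs=20)
```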

 

Figure 1. Examples of training images from the created dataset

 

Table 1.

Number of images per category

Fruit     Training images     Test images     Total images
Apple     297                 146             443
Orange    242                 121             363
Banana    156                 75              231

 

TensorFlow is used to train the models in this study, with Keras providing the MobileNetV2 implementation. In the standard RMSProp optimizer, both decay and momentum are set to 0.9. After each layer I use batch normalization, with the standard weight decay set to 0.00004, as described in [8]. A base learning rate of 0.0001 is used, with a batch size of 50. The models were trained on a MacBook Pro with a 1.4 GHz Intel Core i5 processor and 8 GB of 2133 MHz LPDDR3 RAM. Figure 2 shows a preliminary comparison, on my dataset, of model performance when trained with transfer learning versus randomly initialized weights. The results clearly show that the model trained from random weights is unable to learn, whereas the model trained with transfer learning reaches an accuracy of around 0.80.
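As a sketch, these hyperparameters map onto the Keras RMSprop optimizer roughly as follows; the `rho` argument corresponds to the legacy optimizer's decay, and expressing the weight decay as L2 kernel regularization is my assumption rather than something the text specifies.

```python
import tensorflow as tf

# RMSProp with decay (rho) and momentum both set to 0.9, base LR 0.0001.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4, rho=0.9, momentum=0.9)

# Weight decay of 0.00004 expressed as L2 regularization on a layer's kernel
# (an assumption; the Keras MobileNetV2 does not add this by default).
regularizer = tf.keras.regularizers.l2(4e-5)
dense = tf.keras.layers.Dense(3, activation="softmax",
                              kernel_regularizer=regularizer)

BATCH_SIZE = 50  # batch size used in this study
```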

 

Figure 2. Test accuracy on my dataset, with transfer learning versus randomly initialized weights

 

3.4 Improving MobileNetV2

One technique for visualizing what a CNN model is learning is to look at the activations of the convolutional layers to see what information they preserve. Figure 3 depicts the first convolutional layer's activations for two different fruits: the top row shows the activations for an orange, and the bottom row those for an apple. As can be seen, the two fruits look similar to the model: the layer mostly preserves the shape and texture of the fruit. Because the two fruits have similar shapes, this information may not be sufficient to distinguish between them. The color of the fruits is one missing piece of information that would be significant for such discrimination. Therefore, if additional input features related to the color of the fruit are fed into the model, its accuracy can be improved. This paper proposes three such input features and the corresponding model modifications.
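As a minimal sketch, such first-layer activations can be extracted with Keras as shown below; the placeholder input and the use of an ImageNet-initialized backbone are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Build a probe model that outputs the first convolutional layer's activations.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(224, 224, 3))
first_conv = next(l for l in base.layers
                  if isinstance(l, tf.keras.layers.Conv2D))
probe = tf.keras.Model(inputs=base.input, outputs=first_conv.output)

image = np.random.rand(1, 224, 224, 3).astype("float32")  # placeholder image
activations = probe.predict(image)  # shape (1, 112, 112, 32) for MobileNetV2
```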

 

Figure 3. Similar activations in the first convolutional layer for an orange (top row) and an apple (bottom row)

 

Single RGB fruit color. Aside from the image, the model can also be given a vector containing the RGB color values of the fruit to be classified. This color should be the one that best represents the fruit in general. Bananas, for example, are represented by yellow, so the model is given a vector with the RGB values [1.0, 1.0, 0.0]; for an orange, the vector is [1.0, 0.64, 0.0]. This three-value RGB vector is fed into the model alongside the image.
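As an illustration, the per-class reference colors could be stored in a simple lookup table; the banana and orange values come from the text, while the apple color is my assumption.

```python
# Normalized RGB reference color per fruit class.
REFERENCE_COLORS = {
    "banana": [1.0, 1.0, 0.0],   # yellow (from the text)
    "orange": [1.0, 0.64, 0.0],  # orange (from the text)
    "apple":  [1.0, 0.0, 0.0],   # red (assumed; not given in the paper)
}
```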

Histogram in the RGB color space. An image histogram is a graph showing how many pixels lie at each intensity level in a given image. For this project, the histogram of each RGB channel was computed, yielding a vector of 765 input values (255 bins per channel) that is fed into the model. Figure 4 depicts the RGB histogram of a picture. One downside at the moment is that the majority of the values correspond to the background colors.
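A small sketch of such a feature extractor follows, assuming 255 bins per channel to match the 765-value total; the normalization step is my addition.

```python
import numpy as np

def rgb_histogram(image: np.ndarray, bins: int = 255) -> np.ndarray:
    """Concatenate the per-channel histograms of an 8-bit RGB image."""
    channels = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
                for c in range(3)]
    hist = np.concatenate(channels).astype("float32")
    return hist / max(hist.sum(), 1.0)  # normalize to sum to 1

# Example: a 224x224 RGB image yields a 765-value feature vector.
feature = rgb_histogram(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
assert feature.shape == (765,)
```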

RGB centroid using the K-Means algorithm. Finally, a hybrid machine learning (ML) technique is used, the objective being to combine different ML methods so that they complement one another. K-Means is a clustering technique that attempts to divide data into K groups. When applied to a picture, it can identify the color groups that best represent the image. The number of groups for this project was set to three, as shown in Fig. 5, and the three RGB colors found (9 values) are fed into the model.
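For illustration, here is an offline sketch of extracting the K dominant colors with scikit-learn. Note that the study itself implements K-Means as a Keras layer inside the model (described later in this section), so this standalone version is only an approximation of that idea.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(image: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the k RGB cluster centroids of an image, flattened (3*k values)."""
    pixels = image.reshape(-1, 3).astype("float32") / 255.0
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    return kmeans.cluster_centers_.flatten()  # 9 values for k=3

colors = dominant_colors(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))
assert colors.shape == (9,)
```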

 

Figure 4. Example of an image RGB histogram

 

Figure 5. Example of the RGB Centroid using K-Means

 

The single RGB color and the RGB histogram are implemented with a multi-input model. As seen in Figs. 6b and 6c, the model takes as input an image and a vector of color data. The color data is sent into a dense layer, while the image is fed into the CNN (i.e., the MobileNetV2 architecture). The outputs of both branches are then concatenated, and a softmax activation produces the final prediction. The K-Means variant is considered a hybrid model because it combines two ML techniques. The K-Means algorithm was implemented as a Keras layer in TensorFlow, which allows the model to compute the K colors internally and concatenate them at the end of the process (as shown in Fig. 6d). I used three centroids (k=3) in my experiments, yielding nine RGB values. The following section compares the performance of the models built with these methods.
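A hedged sketch of the multi-input variant (Figs. 6b and 6c) using the Keras functional API is given below; the width of the dense color branch and the pooling layer are illustrative assumptions, not values from the paper.

```python
import tensorflow as tf

def build_multi_input_model(color_dim: int, num_classes: int = 3) -> tf.keras.Model:
    """Image branch (MobileNetV2) + color-feature branch, merged before softmax."""
    image_in = tf.keras.Input(shape=(224, 224, 3), name="image")
    color_in = tf.keras.Input(shape=(color_dim,), name="color")  # 3, 765, or 9

    backbone = tf.keras.applications.MobileNetV2(
        include_top=False, weights="imagenet", input_tensor=image_in)
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)

    c = tf.keras.layers.Dense(32, activation="relu")(color_in)  # illustrative width

    merged = tf.keras.layers.Concatenate()([x, c])
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(merged)
    return tf.keras.Model(inputs=[image_in, color_in], outputs=out)

# Single-RGB-color variant: the extra input is a 3-value vector.
model = build_multi_input_model(color_dim=3)
```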

 

Figure 6. Different architectures of the proposed methods

 

4. Results

My models are based on the MobileNetV2 architecture and were trained on two versions of the dataset I created: i) photos with only fruits (no bags), and ii) photos with fruits both outside and inside plastic bags, as described in the preceding section. Table 2 compares the baseline model (MobileNetV2), the multi-input models (single RGB color and RGB histogram), and the hybrid model (MobileNetV2 + K-Means). The accuracy is always higher when no plastic bag is used, which is to be expected, as plastic bags distort the appearance of the fruits. The baseline model (MobileNetV2), which has no additional color information, has the highest training set accuracy but the lowest testing set accuracy, while all three models that use additional color information achieve greater overall accuracy. For both versions of the dataset, the model using a single RGB color has the highest accuracy, 0.95 and 0.93 respectively. The hybrid model's lower accuracy could be explained by the fact that only one of the three extracted colors relates to the color of the fruit; the other two are associated with the background and should not be considered. Consequently, one line of future work is to remove as much background information as possible. Figure 7 compares, over training time, the testing accuracy of the different approaches proposed in this research using the entire dataset.

Table 2.

Accuracy of the trained models

Model                         With plastic bag       No plastic bag
                              Train      Test        Train      Test
MobileNetV2                   0.98       0.78        0.99       0.82
MobileNetV2 + Single Color    0.98       0.93        0.99       0.95
MobileNetV2 + Histogram       0.99       0.82        0.99       0.92
MobileNetV2 + K-Means         0.98       0.86        0.99       0.90

 

Figure 7. Accuracy of the models trained with images containing fruits in bags

 

 

5. Conclusion and Future Work

This study proposed an improved CNN architecture based on the lightweight MobileNetV2 design that takes into account extra input features in addition to the input images. By adding information about the color of the fruits, these input features improved the model's accuracy. The input features are the single RGB fruit color, the RGB histogram, and the RGB centroid computed with K-Means. Overall, the single RGB color gave the best classification accuracy: 95% for fruits without a plastic bag and 93% for fruits in a plastic bag. Because no suitable dataset was available, a new dataset of 695 training images and 342 testing images was created. It covers three types of fruits (apples, oranges, and bananas) and also includes fruits in plastic bags. As part of my ongoing work, I am investigating the smallest number of training images needed to obtain the maximum level of accuracy. Furthermore, data augmentation has not yet been explored on the proposed dataset. I would also like to determine the model's sensitivity to lighting. Additionally, I intend to compare the proposed lightweight CNN architecture's accuracy with that of other state-of-the-art CNN networks running on GPU hardware using my dataset.

 

References:

  1. Bargoti, S., Underwood, J.: "Deep fruit detection in orchards." In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3626–3633 (2017)
  2. Femling, F., Olsson, A., Alonso-Fernandez, F.: "Fruit and vegetable identification using machine learning for retail applications." In: 14th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pp. 9–15. IEEE (2018)
  3. Hossain, M.S., Al-Hammadi, M., Muhammad, G.: "Automatic fruit classification using deep learning for industrial applications." IEEE Trans. Ind. Inf. 15(2), 1027–1034 (2019)
  4. Rudnik, K., Michalski, P.: "A vision-based method utilizing deep convolutional neural networks for fruit variety classification in uncertainty conditions of retail sales." Appl. Sci. 9(19), 3971 (2019)
  5. Koirala, A., Walsh, K.B., Wang, Z., McCarthy, C.: "Deep learning for real-time fruit detection and orchard fruit load estimation: benchmarking of 'MangoYOLO'." Precis. Agric. 20(6), 1107–1135 (2019)
  6. Rahnemoonfar, M., Sheppard, C.: "Deep count: fruit counting based on deep simulated learning." Sensors 17(4), 905 (2017)
  7. Russakovsky, O., Deng, J., et al.: "ImageNet large scale visual recognition challenge." Int. J. Comput. Vision 115(3), 211–252 (2015)
  8. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: "MobileNetV2: inverted residuals and linear bottlenecks." In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018)
  9. Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: "Efficient processing of deep neural networks: a tutorial and survey." Proc. IEEE 105(12), 2295–2329 (2017)