Face Recognition based on Convoluted Neural Networks: Technical Review

Human beings recognize and classify objects using biological senses and a brain that processes the input into meaningful information. Humans have also come to recognize each other in multiple ways, one of which is visual recognition of faces. As a biological trait, the human face qualifies as a biometric because it is universal, distinctive, mostly permanent and collectable. A computerized face recognition system can therefore be constructed that relies on the visual information uniquely present in each face. Generally, a face recognition system consists of two main phases: a face detection phase, where the presence of a human face in the visual input is verified, and a face recognition phase, where the detected face is processed for identification. One of the most sought-after methods in image processing for face recognition is the CNN (Convolutional Neural Network). CNNs have proved their effectiveness and accuracy in many face detection and face recognition systems. In this paper the architecture of CNNs is presented, and different CNN-based techniques for face detection and face recognition are then reviewed. In the reviewed papers CNNs repeatedly demonstrated effectiveness and accuracy on multiple face recognition benchmarks.


Introduction
For human beings, recognizing and classifying objects (animate or not) is done by capturing the object through the available biological senses; the information is then passed to the brain, which recognizes (or learns of) the object and classifies it instantly based on traits captured from that object. Furthermore, an object's traits can also be measured with measurement tools, which provide distinctive data that can be translated into information used to describe or uniquely identify that object (Alblushi A., 2021; Hassin & Abbood, 2021). Likewise, certain biological traits can be measured and used to uniquely identify an individual among human beings. Such biological traits are known as biometrics. According to (Jain et al., 2004), for a biological trait to qualify as a biometric it must be universal (common among humans), distinctive (measurably different between humans to a sufficient extent), permanent (largely unchanging over time) and collectable (quantitatively measurable). One of the biological traits that qualifies as a biometric is the human face. Human faces satisfy all the requirements of a biometric: they are universal, highly distinctive at large scale, largely permanent over long periods of time and collectable. It is therefore possible to construct a biometric system based on the human face. A computerized biometric system based on human faces is essentially a face recognition system that relies on the visual information uniquely present in each face. Image enhancement is the process of altering a digital image to make it more appropriate for identifying and classifying the correct objects (Al-Hatmi & Yousif, 2017; Hasson et al., 2011).
According to (Li et al., 2020), face recognition is a visual pattern recognition problem in which visual inputs, represented as matrices in a computer, need to be examined to determine whether the data contains a face and then to identify who the face belongs to. (Oloyede et al., 2020) explain that the structure of a face recognition system is similar in essence to the structure of a biometric system: it involves face detection, face image preprocessing, facial feature extraction and feature classification, the last of which is a common step in biometric systems as stated by (Oloyede & Hancke, 2016). (Oloyede et al., 2020) further explain the stages involved in a face recognition system:
- Face detection is verification of the presence of a human face in the visual input data.
- Face image preprocessing is preparing the image so that it contains only the important facial visual data. Approaches include normalization (face images are transformed to the same scale), face alignment (defined by (Jin & Tan, 2017) as locating fiducial points on the face image) and image enhancement (described by (Karamizadeh et al., 2016) as processing the face image into an enhanced version that has the potential to improve face recognition system performance).
- Facial feature extraction is extraction of the most relevant facial visual data that identify a face uniquely, while minimizing noise and unrelated information, into a sufficient description vector.
- Feature classification is the recognition stage, where facial images are compared for verification or identification against facial images from a database. As mentioned by (Oloyede & Hancke, 2016), this is a common stage in biometric systems and it involves verification and identification. Verification is achieved through a one-to-one search between an input and a target, whereas identification is a one-to-many search between the input and the entire database of targets (Coventry et al., 2003) (Ganorkar & Ghatol, 2007) (P Tripathi, 2011) (Muhtahir et al., 2013) (Ahmad et al., 2012).
Facial recognition systems are deployed in a wide range of applications. Some of these applications include attendance access control (S. Manjula & S. Santhosh Baboo, 2012), security (Lander et al., 2018), finance, education, smartphones, retail, transportation and network information security (Hu et al., 2010).
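
The difference between verification (one-to-one) and identification (one-to-many) described in the feature classification stage can be illustrated with a minimal sketch. It assumes that faces have already been reduced to feature vectors and that similarity is measured with Euclidean distance against a fixed threshold; the function names, the toy gallery and the threshold value are hypothetical and are not taken from any of the reviewed papers.

```python
import numpy as np

def verify(probe, target, threshold):
    """One-to-one check: does the probe match a single claimed identity?"""
    return float(np.linalg.norm(probe - target)) <= threshold

def identify(probe, gallery, threshold):
    """One-to-many search: closest enrolled identity, or None if too far."""
    best_name, best_dist = None, float("inf")
    for name, template in gallery.items():
        dist = float(np.linalg.norm(probe - template))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None

# Toy 2-D "feature vectors" standing in for real face descriptors.
gallery = {"alice": np.array([0.0, 1.0]), "bob": np.array([1.0, 0.0])}
probe = np.array([0.1, 0.9])

print(verify(probe, gallery["alice"], threshold=0.5))   # True
print(identify(probe, gallery, threshold=0.5))          # alice
```

Verification touches one stored template, whereas identification scans the whole gallery, which is why identification cost grows with the number of enrolled subjects.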

Problem Statement:
As mentioned, face recognition systems are deployed in various applications, making face recognition a critically needed computer vision technology that has attracted interest for further development and enhancement. Multiple techniques are used in the main subsystems (face detection and feature classification) that make up the overall structure of a face recognition system. For each of these subsystems there are techniques that use deep learning (DL) convolutional neural networks (CNNs) to fulfill their purpose. The purpose of this study is therefore to introduce the CNN method and present some of the CNN-based techniques for face recognition subsystems.

Convolutional Neural Networks:
Neural networks are powerful mathematical models that aim to mimic the human brain in solving complex problems in multidimensional space and converting them to a lower dimension (Yousif J., 2015; Yousif & Kazem, 2021; Alattar et al., 2019). Convolutional neural networks are a type of artificial neural network (Lecun et al., 1998) that is specifically applied in applications involving the processing of visual information. Some CNN applications include face recognition (Taigman et al., 2014), object detection (Ren et al., 2017), and image segmentation and classification (Farabet et al., 2013). Visual data in images is typically stored in the form of one array or several arrays. CNNs translate visual data into meaningful visual information using sequential layers of convolution filters that detect edges, then portions of objects and finally the shape of the whole object (LeCun et al., 2015). The filters are classified by their function in the CNN into convolution layer filters, pooling layer filters and fully connected layer filters (Bezdan & Bačanin Džakula, 2019).

Layers of CNN:
As mentioned, a CNN mainly consists of three types of layers: the convolution layer, the pooling layer and the fully connected layer (Bezdan & Bačanin Džakula, 2019). Ultimately, visual data is processed through the CNN layers by extracting feature maps from the input 2D image using kernels (filters) (Salomon et al., 2017).

Convolution Layer:
The convolution layer, as its name implies, relies on the convolution operation between image pixels and a set of learnable kernels. Kernels typically have a small size of $k \times k$ and a depth $d$ equal to the number of input image channels: $d = 1$ if the image is grayscale, $d = 3$ if the image is RGB color, and so on. As input visual data is passed to the convolution layer, frame pixels at defined positions are convolved with the kernel, yielding a convolved frame; this process is repeated for each kernel (Bezdan & Bačanin Džakula, 2019). Convolved frames are then processed by an activation function to generate feature maps. Activation functions include the sigmoid (logistic) function, the hyperbolic tangent function, the Gaussian function and the Rectified Linear Unit (ReLU). As with activation functions in neural networks (NN), a bias value can be introduced to shift the activation function input when generating feature maps; therefore, for a feature map $f$, $f = \sigma(W \cdot x + b)$, where $\sigma$ is the activation function, $W$ the kernel weights, $x$ the input pixels under the kernel and $b$ the bias (Salomon et al., 2017).
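
To make the convolution, bias and activation steps concrete, the following is a minimal NumPy sketch for a single channel and a single kernel. It is an illustration under stated assumptions rather than the implementation used in any reviewed paper: the kernel is assumed square, no padding is applied, ReLU is chosen as the activation, and, as is common in CNN libraries, the kernel is slid over the image without flipping (cross-correlation). The function name `conv2d_relu` is hypothetical.

```python
import numpy as np

def conv2d_relu(image, kernel, bias=0.0, stride=1):
    """Slide one square kernel over one 2-D channel, add a bias, apply ReLU."""
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + k, c:c + k]
            # Weighted sum of the pixels under the kernel, shifted by the bias.
            out[i, j] = np.sum(patch * kernel) + bias
    # ReLU activation turns the convolved frame into a feature map.
    return np.maximum(out, 0.0)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
feature_map = conv2d_relu(image, kernel)
print(feature_map.shape)   # (3, 3)
print(feature_map[0, 0])   # 10.0  (0 + 1 + 4 + 5)
```

A real convolution layer repeats this for every kernel and sums over all input channels; the explicit loops here are kept only for readability.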
According to (Bezdan & Bačanin Džakula, 2019), the size of the generated feature maps depends on three convolution-related parameters: stride, depth and padding. Stride is the position shift parameter that defines the next position of the frame pixels to be convolved with the kernel, i.e., for a pixel at position $i$ the next pixel to be convolved is at position $i + s$, where $s$ is the stride value. Depth refers to the number of unique kernels applied to the input frame. Padding is adding zeros to the borders of the input image so that the required pixels are convolved and information is preserved. The output feature map size can then be calculated as $(n + 2p - k)/s + 1$, where $n$ is the input size, $p$ is the number of padding layers, $k$ is the kernel size and $s$ is the stride.
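
The output size relation above can be checked numerically. The sketch below is a small helper written for this purpose; the symbols follow the standard form of the formula (input size `n`, kernel size `k`, padding `p`, stride `s`), and the choice to raise an error when the kernel does not tile the input evenly is an assumption of this sketch (many libraries round down instead).

```python
def conv_output_size(n, k, p=0, s=1):
    """Spatial size of a feature map: (n + 2p - k) / s + 1."""
    span = n + 2 * p - k
    if span < 0 or span % s != 0:
        raise ValueError("kernel, padding and stride do not tile the input evenly")
    return span // s + 1

print(conv_output_size(64, 7))         # 58: a 7 x 7 kernel on a 64 x 64 image
print(conv_output_size(32, 5, p=2))    # 32: padding of 2 preserves the size
print(conv_output_size(58, 2, s=2))    # 29: a 2 x 2 window moved with stride 2
```

The first and third calls reproduce the 64 to 58 and 58 to 29 reductions reported for one of the networks reviewed later in this paper.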

Pooling Layer:
Feature maps are processed in the pooling layer to reduce their dimensions by down-sampling (Bezdan & Bačanin Džakula, 2019) and to reduce the variance among feature map pixels (Salomon et al., 2017). In the down-sampling process, feature maps are divided into smaller regions of equal dimensions $m \times m$, and in each region either the average or the maximum of the pixel values is taken as the representative of the region (Salomon et al., 2017).
The pooling process also depends on the stride and on the size of the pooling region. Overlap between the regions to be pooled can be controlled using the stride value; to prevent any overlap between regions, the stride can be set to $m$, where $m$ is the pooling region dimension (Salomon et al., 2017).
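
The pooling operation can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the cited works: the function name `pool2d` and its parameters are hypothetical, and the stride defaults to the region size so that regions do not overlap, as described above.

```python
import numpy as np

def pool2d(fmap, size=2, stride=None, mode="max"):
    """Down-sample a 2-D feature map by max or average pooling."""
    stride = size if stride is None else stride  # stride == size: no overlap
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce(fmap[r:r + size, c:c + size])
    return out

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 10., 13., 14.],
                 [11., 12., 15., 16.]])

print(pool2d(fmap, mode="max"))    # [[4, 8], [12, 16]]
print(pool2d(fmap, mode="mean"))   # [[2.5, 6.5], [10.5, 14.5]]
print(pool2d(fmap, stride=1).shape)  # (3, 3): stride < size gives overlap
```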

Fully Connected Layer:
The fully connected layer is the last layer of the CNN. Here the processed feature maps are converted into vectors that are fed as input to artificial neural network neurons (Bezdan & Bačanin Džakula, 2019) for classification. Deep learning methods can discover many complex relations between training data and outputs owing to the non-linearity of their intermediate hidden layers. However, when training data is limited, a DL network may learn relationships that are valid only in the context of the training data and not on real test data. This is known as overfitting (Srivastava et al., 2014). One technique that can be applied to prevent overfitting in CNNs is the dropout method proposed by (Srivastava et al., 2014). In the dropout method, neural network nodes are randomly and temporarily dropped from the network, along with their incoming and outgoing connections.
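
A minimal sketch of dropout applied to a vector of activations is given below. It uses the "inverted" variant, in which the surviving activations are rescaled during training so that nothing needs to change at test time; this is a common implementation choice and an assumption of this sketch, since the original formulation instead rescales the weights at test time. The function name and parameters are hypothetical.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero units during training (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations  # at test time every unit is kept
    keep_prob = 1.0 - drop_prob
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale survivors so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones(10)
print(dropout(hidden, drop_prob=0.5, rng=rng))                  # mix of 0.0 and 2.0
print(dropout(hidden, drop_prob=0.5, rng=rng, training=False))  # unchanged
```

Because a different random mask is drawn on every call, each training step effectively trains a different thinned network, which is what discourages units from co-adapting to the training data.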

Face Recognition Subsystems Methods:
4.1. Face Detection Methods
(Triantafyllidou & Tefas, 2016) proposed a light model for face detection based on CNNs using only 113,864 parameters. Despite the lower complexity of the model, its results showed that it can be deployed in real-world applications using standard processing power. The model consists of two CNNs that were combined in a single architecture. The first CNN was trained to detect major facial features such as the mouth, eyes, nose and so on.
extraction for detecting faces from various orientations while simplifying its architecture to reduce computational complexity. DDFD consists of five convolutional layers followed by three fully connected ones. The fully connected layers are converted into convolutional layers by reshaping the layer parameters (Felzenszwalb et al., 2010), which allows the CNN to process images of any dimensions effectively and generate a heat map. From the heat maps, the regions with the highest probability of containing a face are detected and then processed with non-maximal suppression (NMS) to localize faces accurately. DDFD was tested on three datasets, PASCAL, AFW and FDDB, and non-maximal suppression was implemented in two variants, maximum and average. First, DDFD with average non-maximal suppression (NMS-avg) was found to have higher average precision than with maximum non-maximal suppression (NMS-max). DDFD with NMS-avg was therefore tested and compared with different detectors on the mentioned datasets.
DDFD had an average precision of 91.79% on the PASCAL dataset, scoring highest among face detectors, and an average precision of 96.26% on the AFW dataset, coming in third. On the FDDB dataset DDFD had a recall rate of 84%.
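
Non-maximal suppression, used above to localize faces from overlapping detections, can be sketched as follows. This is the standard greedy variant that keeps the highest-scoring box and discards boxes that overlap it too strongly, which corresponds in spirit to the "maximum" strategy; it is not the DDFD authors' code, and the box format `[x1, y1, x2, y2]`, the function names and the 0.5 overlap threshold are assumptions of this sketch.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes that survive suppression."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]
```

An averaging variant would merge the coordinates of the overlapping boxes instead of discarding them; that variant is not shown here.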
(H. Li et al., 2015) build a cascade CNN face detector that rejects false detections during the early stages, where the input is processed at low resolution, and verifies detections at the high-resolution stages.
proposed model takes color images of size 448 × 448 as input and consists of seven convolution layers for feature extraction, each followed by a pooling layer that performs max pooling using 2 × 2 down-sampling kernels. These are followed by three fully connected layers and an output layer, where NMS (Non-Maximal Suppression) is used for classifying detections according to the extracted features and bounding box position. FDDB was used for training and testing the model: 70% of the selected samples were used for training and the remainder for testing. In the training phase the gradient descent optimizer was used, the model was run for 25 epochs and different learning rate values were tested. It was found that accuracy remained constant after 20 epochs and that the optimal learning rate was 0.0001. Running the model on the testing set resulted in 92.2% accuracy, which is higher than the accuracies achieved by the other face detection algorithms tested by the authors: 89.6% for R-CNN and 83.8% for Haar Cascade.
(Liu et al., 2021) proposed a lightweight CNN architecture for face recognition. The reasoning behind this architecture is that, although current CNN-based face recognition systems are highly accurate, they are complex and require extensive computational resources, which makes them unsuitable for computationally limited devices (e.g. mobile devices). There have been previous attempts to build lightweight CNN face recognition systems; however, although those systems were efficient, their results were not accurate enough. The authors therefore build a compressed face recognition CNN model for computationally limited devices while maintaining accuracy. Improvements were made in the design of the network structure, the training methodology and the loss function. In terms of network structure, three structures based on the channel attention mechanism are proposed: the depthwise squeeze and excitation model, the depthwise separable squeeze and excitation model and the linear squeeze and excitation model. The squeeze and excitation approach reduces the computational cost of processing feature maps and improves the performance of CNN-based architectures (J. Hu et al., 2018). Those structures were applied to a light CNN with a small set of parameters and tested
In a study conducted by (Khalajzadeh et al., 2013), a hybrid face recognition system consisting of a CNN and an LRC (Logistic Regression Classifier) was presented. The CNN component was trained for detection and recognition of face images. Features extracted by the CNN are then passed to the LRC component for classification of the output. The CNN structure consists of two convolution layers, each followed by a pooling layer, and then a fully connected layer. Images of size 64 × 64 are passed to the first convolution layer, where a 7 × 7 × 6 kernel is applied, resulting in six 58 × 58 feature maps.

Face Recognition Methods:
The following pooling layer applies a 2 × 2 × 6 sub-sampling kernel, resulting in six 29 × 29 feature maps. The second convolution layer applies an 8 × 8 × 16 kernel, generating sixteen 22 × 22 feature maps that are passed to a pooling layer where a 2 × 2 × 16 sub-sampling kernel is applied, down-sampling the sixteen feature maps to 11 × 11. In the fully connected layer the feature maps are reduced to fifteen 1 × 1 maps using 11 × 11 kernels. For CNN training, five hundred epochs were applied because of complexity and computation time constraints. The learning rate for the CNN was set dynamically, decreasing as a function of the number of epochs. To address the issues of illumination and varying pose orientation of face images, input images were normalized using the pixel mean and standard deviation.
Other techniques applied to the CNN are the back-propagation algorithm and dynamic updating of the weights during feature presentation (to keep the number of parameters within data range) rather than after passing the whole training set (batch update).
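
The feature map sizes reported for this network can be reproduced by applying the output size relation layer by layer. The sketch below assumes no padding, a stride of 1 for the convolution layers and a stride of 2 for the 2 × 2 sub-sampling layers; these stride and padding values are not stated in the review and are inferred here only because they reproduce the reported sizes.

```python
def out_size(n, k, s=1):
    """Output width of an unpadded convolution or pooling step."""
    return (n - k) // s + 1

# (description, kernel size, stride) for each step described in the text.
layers = [
    ("conv 7 x 7", 7, 1),
    ("pool 2 x 2", 2, 2),
    ("conv 8 x 8", 8, 1),
    ("pool 2 x 2", 2, 2),
    ("fully connected as 11 x 11 kernels", 11, 1),
]

size = 64  # input images are 64 x 64
sizes = []
for name, k, s in layers:
    size = out_size(size, k, s)
    sizes.append(size)
    print(f"{name}: {size} x {size}")

print(sizes)   # [58, 29, 22, 11, 1]
```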
To evaluate network performance, the Yale dataset was used for training and testing the CNN structure, and several classifiers were applied for final recognition. Of the tested classifiers, the model had the highest accuracy and the shortest time when using the SimpleLogistic classifier on the Yale dataset, with 86.06% accuracy and 1.22 seconds recognition time.
(Ramaiah et al., 2015) presented a CNN-based facial recognition system that tackles the degradation of face recognition performance caused by illumination variations in input face images. The authors take advantage of the feature extraction capabilities of CNNs for correct recognition of face images and further enhance CNN performance by considering the symmetrical face information present in the horizontal reflection of the facial image. The CNN architecture consists of five layers: two convolution layers, each followed by a pooling layer, and finally a fully connected layer. Input face images are rescaled to a size of 28 × 32 and passed as input to the CNN. In the first convolution layer a kernel of size 5 × 5 × 6 is applied to the input face image, generating six 24 × 28 feature maps that are down-sampled in the pooling layer to size 12 × 14 using a 2 × 2 × 6 down-sampling kernel. In the second convolution layer a kernel of size 5 × 5 × 12 is convolved with the feature maps, generating twelve new feature maps of size 8 × 10. The new feature maps are down-sampled to size 4 × 5 in the last pooling layer using a 2 × 2 × 12 down-sampling kernel. The generated feature maps are converted and combined into a 240 × 1 column vector using row-major order. The column vector is the input to the fully connected layer, where the facial image is classified into one of thirty output classes. The CNN classifier was trained using the back-propagation algorithm in batch mode. Experiments on the CNN were conducted using the extended Yale Face Database B.
From the dataset thirty subjects (classes) were selected, and for each subject sixty-two face images with different illuminations were taken. Face images were then organized into five different sets according to lighting degree. The CNN was trained using the back-propagation method with batch size 2 and 500 epochs. Five-fold cross-validation was used for training and testing. Running the five sets on the constructed CNN face recognition system resulted in an average accuracy of 89.05%. To boost CNN performance, the images in the sets were enhanced by adding the horizontal reflection of the face images, which provides the classifier with additional information relating to shadows on the face image. This enhancement increased the CNN's average classification accuracy on the five sets to 94.01%.
(Nakada et al., 2017) constructed an active face recognition system named AcFR. AcFR is a viewpoint-dependent system, meaning that the system outputs a certain behavior depending on the recognition result, similar to human behavior when attempting to recognize the face of another individual. AcFR implements its proposed tasks through two
similar to views at extremes (-90 and 90 degrees) which showed AcFR's ability to distinguish strangers from recognized individuals. For the control system component, AcFR behavior depended on the input image characteristics (illumination, expression and mainly view) and the computed Euclidean distance. To minimize the impact of changes in image characteristics, illumination was kept constant across the images of different subjects. Results showed that with a higher first threshold (t1) AcFR would greet more often, and with a lower second threshold (t2) AcFR would ignore more often. The first and second thresholds were therefore set to 250 and 325 respectively.
(Schroff et al., 2015) present a face recognition system that addresses the scalability and efficiency requirements of such systems. The system, named FaceNet, is based on the principle of mapping face images to a Euclidean space. From distances in this Euclidean space a face similarity measure can be computed. The points in the Euclidean space are feature vectors generated by FaceNet; as such, FaceNet can be combined with other techniques to implement face recognition, verification or clustering systems as well. FaceNet uses a deep CNN trained to directly optimize the extracted features themselves, rather than the classical bottleneck approach used in other CNN-based feature extractors. The CNN was trained using multiple triplets of approximately aligned matching and non-matching face patches. Those triplets were mined using an online triplet mining method. This training approach achieved high performance with much greater efficiency, using only 128 bytes per face. FaceNet was tested on LFW and YouTube Faces DB. It achieved 99.63% accuracy on LFW and 95.12% accuracy on YouTube Faces DB. FaceNet also reduces the error rate by 30% in comparison to the results achieved by (Sun et al., 2015) on the same datasets.
(Sanchez-Moreno et al., 2021) present a face recognition system mainly composed of FaceNet (which implements feature extraction using deep CNNs (William et al., 2019)) and well-known classifiers such as SVM, K Nearest Neighbor and Random Forest. The reasoning behind building the system is to address the need for a low-cost and efficient face recognition system that can operate in unconstrained environments. As face recognition systems involve two main stages, face detection and face recognition, the authors implement the real-time high-speed face detector YOLO-Face (one of the most popular CNN face detectors in recent years (Garg et al., 2018)), based on YOLOv3. In the face recognition stage FaceNet is used along with supervised classifiers, as mentioned previously.
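
The triplet-based training described above for FaceNet can be illustrated with a small sketch of a triplet loss on feature vectors. The loss shown is the usual hinge form, which pushes the squared Euclidean distance between an anchor and a matching face below the distance between the anchor and a non-matching face by a margin. The margin value of 0.2, the toy vectors and the function names are assumptions of this sketch and are not reported in this review.

```python
import numpy as np

def l2_normalize(v):
    """Scale a feature vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on squared Euclidean distances for one triplet."""
    d_pos = np.sum((anchor - positive) ** 2)   # same identity
    d_neg = np.sum((anchor - negative) ** 2)   # different identity
    return max(d_pos - d_neg + margin, 0.0)

anchor = l2_normalize([1.0, 0.0])
same_person = l2_normalize([0.9, 0.1])
other_person = l2_normalize([0.0, 1.0])

# Well-separated triplet: the loss is zero, so nothing is learned from it.
print(triplet_loss(anchor, same_person, other_person))        # 0.0
# Swapping the roles violates the margin and yields a positive loss.
print(triplet_loss(anchor, other_person, same_person) > 0.0)  # True
```

Because well-separated triplets contribute nothing to the loss, selecting informative triplets matters in practice, which is the role of the triplet mining step mentioned above.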
Experiments on the model were carried out for the face detection and face recognition stages. In the face detection stage the YOLO-Face based detector reached 89.6% accuracy on the Honda/UCSD dataset, which is mainly composed of images taken in unconstrained environments. It is worth noting that experiments on YOLO-Face for face detection showed that the model can detect small faces and had better performance when detecting partly occluded or differently posed faces. Tests on the face recognition stage were conducted using the LFW dataset on FaceNet alone and on different combinations of FaceNet and other classifiers. The highest accuracy, 99.7%, was achieved using the combination FaceNet + SVM. Accuracies of 99.5%, 85.1% and 99.6% were achieved using FaceNet + K Nearest Neighbor, FaceNet + Random Forest and FaceNet alone respectively. The overall face recognition system (composed of the two stages) had a 99.1% recognition rate and a runtime of 49 milliseconds.
(Khan et al., 2019) present a face recognition system framework for smart glasses using CNNs. The method adds flexibility and portability to such a system, along with good capturing capabilities for frontal views. The overall face recognition system is presented in two stages: a face detection stage and a face recognition stage. Face detection uses a Haar classifier, which is composed of a series of weak classifiers that form one strong classifier. Here the face detector achieved 98% accuracy using 3099 feature samples. In the face recognition stage the authors used the AlexNet CNN, which includes five convolution layers, three fully connected layers and ReLU as the activation function. The transfer learning ability of AlexNet was used for facial recognition on smart glasses. The system reached 98.5% accuracy after training with 2500 various images per class.

Discussion
In this paper a total of fifteen papers on the implementation of CNNs in face recognition applications were reviewed.
Six of the reviewed papers focused on various implementations of CNNs for face detection, whereas the rest focused on the face classification/recognition aspect. The overall trends in the papers were a focus on improving accuracy on various databases or on compressing the resources a CNN requires so that it can run on computationally limited devices. Figures 3 and 4 show the highest accuracies achieved for face detection and face recognition systems.
Figure 3 shows that significant improvements have been made to CNN detectors over the years. The highest recall rates were achieved in later years, which shows the ongoing improvement of CNN face detectors. The same trend can be observed for CNN face recognition subsystems. Despite the differences in testing datasets, face recognition CNNs have achieved higher accuracies with the passing of years, reaching accuracies near or equal to 100% in the conducted tests.

Conclusion
In conclusion, face recognition systems are of great importance as they are deployed in various applications including attendance control, security, finance, education, smartphones, retail, transportation and network information security. An overall face recognition system consists of two main stages, the face detection and face recognition subsystems. Both of those subsystems have methods that utilize CNNs in carrying out their respective purposes. CNNs are a special type of ANN for processing visual data. CNNs use convolutional layers and pooling layers to extract the features required by the fully connected layer, which is a neural network classifier. In face detection, CNN feature extraction focuses on extracting features that are unique to a human being, and the classifier then decides whether the result is a face or not. In face recognition, CNN feature extraction focuses on extracting features that are unique to a person, and the classifier then decides the identity of the result. Deployment of CNNs for face detection and face recognition showed continuous improvement over the years, reaching higher than 90% and near 100% in some cases respectively.

Figure 1: General Steps Involved in Facial Recognition System

Figure 2: Example of CNN architecture (Ignjatić et al., 2018)
The second CNN was trained for full detection of the face. The face detection CNN contains seven convolution layers and used images of dimensions 32 × 32 × 3 for training. The face parts CNN contains three convolution layers and used 16 × 16 × 3 images for training. The first CNN was evaluated in terms of successful full face detection, whereas the second CNN was evaluated in terms of successful detection of relevant face parts. The CNNs were combined by processing the first three layers of the first CNN and the whole second CNN in parallel; the results were then stacked as inputs for convolution layers four to seven of the first CNN. The performance of the CNN was tested on the FDDB (Face Detection Data Set and Benchmark) dataset. The detector achieved a recall rate of 88.9%, outperforming most recent face detection methods.
(Farfade et al., 2015) presented a CNN-based method named Deep Dense Face Detector (DDFD) for detection of multiple faces in various poses. The DDFD model has lower complexity as it does not require bounding-box regression (for reduction of localization errors (Girshick et al., 2014)), semantic segmentation or support vector machine classifiers. DDFD was constructed on the principle of maximizing CNN capacity for classification and feature
Face detector was designed to feature fast detection of faces, an accelerated CNN cascade, high-quality localization and a multiresolution architecture for detection verification. The cascade is composed of six CNNs, three of which are for face classification while the others calibrate the bounding boxes of faces. The CNNs are based on the AlexNet architecture and use the ReLU activation function. As the input image is passed to the CNN cascade detector, the 12-net CNN scans the image at different scales and rejects more than 90% of the detection windows. The 12-calibration-net CNN processes the remaining windows as 12 × 12 images, adjusting their location and size to approach a potential face. NMS is then applied to eliminate highly overlapped detection windows, and the remaining windows are resized into 24 × 24 images. The generated images become the input for the 24-net CNN and subsequently the 24-calibration-net CNN, and the processes that occurred in the first two CNNs are repeated, outputting 48 × 48 window images. The 48-net CNN receives the new windows and evaluates the detections, and NMS eliminates overlapped windows. Lastly, the 48-calibration-net CNN calibrates the bounding boxes of the detected output faces. The cascade CNN detector was tested on the AFW and FDDB datasets. On AFW the detector achieved an average precision of 96.72%, and it had a recall rate of 85.1% on the FDDB dataset.
(Yang et al., 2018) created a face detector that utilizes the capabilities of supervised CNNs by capturing facial features based on common attributes of the face rather than a standard bounding box. The authors show that this approach is more robust in detecting faces under severe occlusions or pose variations. The face detector was based on three principles. The first principle is the uniqueness of the structure of human face parts, whereby a CNN can be trained to detect and classify different face parts without explicit supervision. The second principle is the evaluation of detected parts based on their spatial arrangement on the face, through a score that gives the likelihood of a detection actually being a face. The third principle is refining the bounding boxes of potential faces using a CNN that recognizes true faces and estimates face locations more precisely. Based on those principles a face detector named Faceness-Net was constructed. It consists of two stages: the first stage is detection of facial parts to generate face proposals that are ranked according to a faceness score, and the second stage is refinement of the face proposals for detecting faces. In the first stage attribute-aware CNNs are used to generate facial part maps from the input images. Those maps show the locations of the hair, eyes, nose, mouth and beard face components; the maps are then combined into a face label map. Generated face proposals are ranked based on their faceness scores, which are determined from the faceness scores of the face part maps, themselves determined from the spatial configuration of the detected face part. NMS is applied to the face parts to reduce the number of detected windows, and the average faceness score of the parts is then taken as the faceness score of the face proposal. Another NMS is applied to reduce the face proposals based on faceness score in order to eliminate false positive detections. In the second stage, CNNs optimized for face classification and bounding box regression are used to refine the generated face proposals. The authors implemented three more variations of Faceness-Net: Faceness-Net-SR, Faceness-Net-TP and Faceness-Net-SR-TP.
on datasets. In terms of training methodology, the authors implement a teacher-student training method that is based on the additive angular margin loss function (a loss function for distinguishing faces (Deng et al., 2019)) and on knowledge distillation for transferring knowledge between CNNs. A deep CNN that is superior in feature extraction and fitting capabilities, called the teacher, is used to guide and train a light CNN called the student. Using knowledge distillation, the superior performance and capabilities of the teacher can be transferred to the student. In this way the lightweight CNN model can be improved while maintaining model compression. Different models were constructed with the mentioned SE (Squeeze and Excitation) structures and the teacher-student training method. The models were trained and tested on several datasets and achieved a highest accuracy of 99.67% on the LFW dataset using a combined model of the depthwise SE and linear SE structures, with 5.36 MB of storage space and 1.35 million parameters.
(Nimbarte & Bhoyar, 2018) present an age-invariant face recognition model based on CNNs. The main goal is for the network to recognize the matching face for an input from a gallery of face images despite the changes in facial features caused by age difference. The AIFR (Age Invariant Face Recognition)-CNN architecture has seven layers and accepts images of size 32 × 32 to reduce computational cost. The architecture of AIFR-CNN consists of three stages: image preprocessing, feature extraction and classification. In the image preprocessing stage the Viola-Jones face detection algorithm is applied to crop the image to a face-focused image, and the image is then converted to grayscale and resized to 32 × 32. In the feature extraction stage the image is passed through the seven AIFR-CNN layers. The layers are arranged as two convolution layers, each followed by a pooling layer, then a convolution layer followed by two fully connected layers. The kernels used in the architecture are of size 5 × 5 and the pooling filters are of size 2 × 2.
In the last stage, classification, the output of the last fully connected layer is passed to an SVM classifier for face identification. AIFR-CNN was trained and tested on the FGNET and MORPH (album-II) datasets. On FGNET 980 images were used, 852 for training and the remaining 128 for testing. Testing on FGNET resulted in a recognition rate of 76.6%. For the MORPH (album-II) dataset a total of 1005 images were used, 750 for training and 255 for testing. Testing on MORPH (album-II) resulted in a recognition rate of 92.5%.

(Tang et al., 2020) proposes a face recognition system architecture based on local binary pattern (LBP) and parallel ensemble learning of CNNs. The reasoning behind this architecture is to address issues that degrade the success rate of face recognition systems, such as facial expression, pose orientation, illumination and occlusion. Those issues arise mainly from the low generalization ability of a single CNN. In the architecture, face features are first extracted from the input image using LBP. Following that, ten CNNs based on five different structures extract further features for training and improvement of parameter (weight and bias) values. Those CNNs also classify the input after the fully connected layer using the Softmax function. To obtain the final face recognition result, parallel ensemble learning is used with majority voting. The method was tested on the ORL and Yale-B face datasets and achieved recognition rates of 100% and 97.51% respectively. Experiments on the model showed its tolerance to the mentioned face recognition issues, in addition to improved face recognition accuracy and generalization performance. Furthermore, a hybrid detection model consisting of the proposed face recognition model and a pedestrian detection model was tested for improvement of detection rate. It achieved an 11.2% increase in detection rate.
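
The two generic building blocks named above, LBP and majority voting, can be sketched as follows. This is a minimal illustration and not the code of (Tang et al., 2020): the basic 3 × 3 LBP variant, the clockwise neighbour ordering, and the names `lbp_code` and `majority_vote` are assumptions.

```python
from collections import Counter
from typing import List, Sequence

# Neighbour offsets, clockwise from the top-left pixel. The ordering is a
# convention chosen for this sketch; other orderings give different codes.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Sequence[Sequence[int]], r: int, c: int) -> int:
    """Basic 3x3 LBP code of the interior pixel at row r, column c.

    Each neighbour contributes one bit: 1 if it is >= the centre pixel.
    """
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def majority_vote(predictions: List[str]) -> str:
    """Return the label predicted by the most ensemble members.

    Ties are broken in favour of the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


# Toy 3x3 patch: neighbours 9, 7 and 8 are >= the centre value 6.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
code = lbp_code(patch, 1, 1)

# Hypothetical outputs of ten CNNs for one input image.
votes = ["A", "B", "A", "C", "A", "B", "A", "A", "B", "A"]
label = majority_vote(votes)
```

For the toy patch the bits for the top, right and bottom-right neighbours are set, giving the code 2 + 8 + 16 = 26, and the ensemble label is "A" with six of ten votes.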
components. The first component is a face recognition model consisting of the VGG-Face CNN coupled with a nearest-neighbor identity recognition criterion. The first component evaluates recognition (identifies the subject) and provides the information required by the second component, a control model that takes decisions. Decisions made by the control model determine the output behavior of AcFR, which belongs to the set {greet(x), changeview(x), ignore(x)}, where x is the information extracted for the individual. The face recognition component of AcFR follows the conventional steps of a face recognition system. In the first step, preprocessing (detection and alignment), the authors follow the face detection algorithm of (Mathias et al., 2014). In the feature extraction step the VGG-Face CNN is used, which has sixteen layers and was trained with two million images. In the classification stage the authors experiment with different classifiers: SVM, linear regression and Nearest Neighbor. The first two had low accuracy, below 20%, whereas the Nearest Neighbor classifier achieved 90% accuracy. The Nearest Neighbor classifier uses the features extracted from feature maps, the stored feature maps and Euclidean distance to compute the classification. Euclidean distance is also used by the control model to select the output behavior. As mentioned, the control model makes decisions according to the information provided by the first model. The control model is given two initial threshold distances (t1 and t2) that are compared with the Euclidean distance (d) to select a behavior. If the distance is less than or equal to the first threshold the output is greet, if it is greater than or equal to the second threshold the output is ignore, and if it falls in between the view is changed. When the view is changed, features are extracted for the same subject but the input image is taken from a different orientation. Experiments on AcFR were conducted on the PIE dataset, and for each individual nine pictures from nine different view angles in the range of -90 to 90 degrees were used. In the face recognition component, views closer to
frontal views (0 degrees) had the highest accuracy (reaching up to 100%) and the smallest Euclidean distance, which showed the robustness of the VGG-Face CNN and that AcFR is view dependent. The hypothesis that AcFR is view dependent was tested further by changing the feature vectors in the gallery from the frontal view to -45 degrees. Similarly, the highest recognition accuracy and the smallest Euclidean distance were achieved for views near -45 degrees, and near 45 degrees as well due to the symmetrical nature of the human face. To test AcFR behavior when the subject is a stranger, the authors removed ten subjects' features from the gallery and reevaluated the system response. AcFR computed Euclidean distance in range of 286 to 350
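
The nearest-neighbor matching and the two-threshold decision rule described for AcFR can be sketched as below. This is an illustrative sketch only: the feature vectors are toy two-dimensional values rather than VGG-Face features, the threshold values are hypothetical, and the names `nearest_identity` and `control_decision` are assumptions.

```python
import math
from typing import Dict, List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_identity(probe: List[float],
                     gallery: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the gallery identity closest to the probe and its distance d."""
    best = min(gallery, key=lambda name: euclidean(probe, gallery[name]))
    return best, euclidean(probe, gallery[best])


def control_decision(d: float, t1: float, t2: float) -> str:
    """Two-threshold rule: greet if d <= t1, ignore if d >= t2,
    otherwise request a different view of the same subject."""
    if d <= t1:
        return "greet"
    if d >= t2:
        return "ignore"
    return "change_view"


# Toy gallery and probe (hypothetical values, not real face features).
gallery = {"alice": [0.0, 0.0], "bob": [3.0, 4.0]}
identity, d = nearest_identity([0.0, 1.0], gallery)
behavior = control_decision(d, t1=2.0, t2=5.0)
```

With these toy values the probe matches "alice" at distance 1.0, which is below t1, so the behavior is "greet". A distance between the two thresholds would instead trigger a view change.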

Figure 3: Comparison between CNN face detector recall rates on the FDDB dataset

Figure 4: Comparison between CNN face recognition highest accuracies on various datasets

SR means that the variant uses a single attribute-aware CNN instead of five, and TP means that the variant uses a template technique to generate candidate windows rather than an external generic object proposal method. Faceness-Net and its variants were tested on the AFW, PASCAL and FDDB datasets. On AFW, Faceness-Net-SR-TP, Faceness-Net-SR, Faceness-Net-TP and Faceness-Net had average precisions of 98.05%, 97.38%, 97.25% and 97.2% respectively.
On PASCAL, average precisions were 92.11% for Faceness-Net, 91.79% for Faceness-Net-SR-TP, 91.65% for Faceness-Net-SR and 91.23% for Faceness-Net-TP. As for FDDB, recall rates were 92.84% for Faceness-Net-SR-TP, 91.72% for Faceness-Net-TP, 91.31% for Faceness-Net-SR and 90.98% for Faceness-Net.

(Qin et al., 2016) made modifications to the cascaded CNN approach to obtain better performance from the network by jointly training the CNNs. The authors showed that the back propagation algorithm can be used to train cascaded CNNs and that the joint training approach can be applied to more complex cascaded CNN architectures. In the joint training architecture, named FaceCraft, an image of size 48 × 48 is the input to three branch networks, x12, x24 and x48, and the image is resized according to the branch name. The ReLU activation function is used in non-linear layers and dropout is applied before the regression or classification layer. The output of the network is one joint loss of the three branches, and it is optimized using back propagation. The joint network also uses control threshold layers to determine how loss is contributed by proposals passing from upper branches to lower branches. FaceCraft was tested on the AFW and FDDB datasets. FaceCraft scored an average precision of 98% on the AFW dataset and had a recall rate of 88.

YOLO as being comparatively faster in real-time face detection, maintaining accurate detection performance regardless of input image size and being capable of extracting features from arbitrary image sizes. Architecture of

Table 1: Review of CNN-based face detectors

Table 2: Review of CNN-based face recognition systems