Automated Face Analysis

Authors

  • Jeffrey F. Cohn
  • Adena J. Zlochower
  • James Lien
Abstract

The face is a rich source of information about human behavior. Available methods for coding facial displays, however, are human-observer dependent, labor intensive, and difficult to standardize. To enable rigorous and efficient quantitative measurement of facial displays, we have developed an automated method of facial display analysis. In this report we compare its results with those of manual FACS (Facial Action Coding System, Ekman & Friesen, 1978a) coding. One hundred university students were videotaped while performing a series of facial displays. The image sequences were coded from videotape by certified FACS coders. Fifteen action units and action unit combinations that occurred a minimum of 25 times were selected for automated analysis. Facial features were automatically tracked in digitized image sequences using a hierarchical algorithm for estimating optical flow. The measurements were normalized for variation in position, orientation, and scale. The image sequences were randomly divided into a training set and a cross-validation set, and discriminant function analyses were conducted on the feature point measurements. In the training set, average agreement with manual FACS coding was 92% or higher for action units in the brow, eye, and mouth regions. In the cross-validation set, average agreement was 91%, 88%, and 81% for action units in the brow, eye, and mouth regions, respectively. Automated Face Analysis by feature point tracking demonstrated high concurrent validity with manual FACS coding.

Automated Face Analysis by Feature Point Tracking Has High Concurrent Validity with Manual FACS Coding

The face is a rich source of information about human behavior. Facial displays indicate emotion (Ekman, 1993; Russell, 1994) and pain (Craig, Hyde, & Patrick, 1991), regulate social behavior (Cohn & Elmore, 1988; DePaulo, 1992; Fridlund, 1994), reveal brain function (Ekman, Davidson, & Friesen, 1990; Fox & Davidson, 1988) and pathology (Katsikitis & Pilowsky, 1988; Rinn, 1984), and signal developmental transitions in infants (Campos, Bertenthal, & Kermoian, 1992; Emde, Gaensbauer, & Harmon, 1976). To make use of the information afforded by facial displays, reliable, valid, and efficient methods of measurement are critical. Current methods, which require human observers, vary in their specificity, comprehensiveness, degree of objectivity, and ease of use.

The anatomically based Facial Action Coding System (FACS: Ekman & Friesen, 1978a; Baby FACS: Oster & Rosenstein, 1993) is the most comprehensive method of coding facial displays. Using FACS and viewing videotaped facial behavior in slow motion, coders can manually code all possible facial displays, which are referred to as "action units" (AUs). More than 7,000 combinations have been observed (Ekman, 1982). Ekman and Friesen (1978b) proposed that specific combinations of FACS action units represent prototypic expressions of emotion (i.e., joy, sadness, anger, disgust, fear, and surprise). Emotion expressions, however, are not part of FACS; they are coded in a separate system known as EMFACS (Friesen & Ekman, 1984) or the more recent FACS Interpretive Dictionary (Friesen & Ekman, undated, cited in Oster, Hegley, & Nagel, 1992). FACS itself is purely descriptive and uses no emotion or other inferential labels. Another anatomically based objective system, which also requires slow motion viewing of videotape, is the Maximally Discriminative Facial Movement Coding System (MAX: Izard, 1983).
Compared with FACS, MAX is less comprehensive and was intended to include only facial displays (referred to as "movements" in MAX) related to emotion. MAX does not discriminate among some anatomically distinct displays (e.g., inner- and outer-brow raises) and considers as autonomous some displays that are not anatomically distinct (Oster et al., 1992). Malatesta (Malatesta, Culver, Tesman, & Shephard, 1989) added some displays in an effort to make MAX more comprehensive. Unlike FACS, MAX makes explicit claims that specific combinations of displays are expressions of emotion, and the goal of MAX coding is to identify these MAX-specified emotion expressions.

Whereas FACS and MAX use objective, physically measurable criteria, other videotape viewing systems are based on subjective criteria for facial expressions of emotions (AFFEX: Izard, Dougherty, & Hembree, 1983) and other expressive modalities (e.g., Monadic Phases: Cohn & Tronick, 1988; Tronick, Als, & Brazelton, 1980). The expression codes in these systems are given emotion labels (e.g., "joy") based on the problematic assumption that facial expression and emotion have an exact correspondence (Camras, 1992; Fridlund, 1994; Russell, 1994). Like FACS and MAX, these systems also require slow motion viewing of videotaped facial behavior.

As used below, "emotion expression" refers to facial displays that have been given emotion labels. Note that emotion expressions with the same label do not necessarily refer to the same facial displays. Systems such as MAX, AFFEX, and EMFACS can and do use the same terms when referring to different phenomena. For instance, Oster et al. (1992) found that MAX and the FACS Interpretive Dictionary gave different emotion labels to the same displays. The lack of standard meaning for specific "emotion expressions," as well as the implication that emotion expressions represent the subjective experience of emotion, are concerns about the use of emotion labels in referring to facial displays. The descriptive power of FACS, by contrast, has made it well suited to a broad range of substantive applications, including nonverbal behavior, pain research, neuropsychology, and computer graphics, in addition to emotion science (Ekman & Rosenberg, 1997; Parke & Waters, 1996; Rinn, 1984, 1991).

In everyday life, expressions of emotion, whether defined by objective criteria (e.g., combinations of FACS action units or MAX movement codes) or by subjective criteria, occur infrequently. More often, emotion is communicated by small changes in facial features, such as furrowing of the brows to convey negative affect. Consequently, a system that describes only emotion expressions is of limited use. Only FACS, and to a lesser extent MAX, can produce the detailed descriptions of facial displays that are required to reveal components of emotion expressions (e.g., Carroll & Russell, 1997; Gosselin, Kirouac, & Dore, 1995). FACS action units are the smallest visibly discriminable changes in facial display, and combinations of FACS action units can be used to describe emotion expressions (Ekman & Friesen, 1978b; Ekman, 1993) and global distinctions between positive and negative expression (e.g., Moore, Cohn, & Campbell, 1997). With extensive training, human observers can achieve acceptable levels of inter-observer reliability in coding facial displays.
Human-observer-based (i.e., manual) methods, however, are labor intensive, semi-quantitative, and, with the possible exception of FACS, difficult to standardize across laboratories or over time. Training is time consuming (approximately 100 hours with the most objective methods), and coding criteria may drift with time (Bakeman & Gottman, 1986; Martin & Bateson, 1986). Implementing comprehensive systems is reported to take up to 10 hours of coding time per minute of behavior, depending upon the comprehensiveness of the system and the density of behavior changes (Ekman, 1982). Such extensive effort discourages standardized measurement and may encourage the use of less specific coding systems with unknown convergent validity (Matias, Cohn, & Ross, 1989). These problems tend to promote the use of smaller sample sizes (of subjects and behavior samples), prolong study completion times, and thus limit the generalizability of study findings.

To enable rigorous, efficient, and quantitative measurement of facial displays, we have used computer vision to develop an automated method of facial display analysis. Computer vision has been an active area of research for some 30 years (Duda & Hart, 1973); early work included attempts at automated face recognition (Kanade, 1973, 1977). More recently, there has been significant interest in automated facial display analysis by computer vision. One approach, initially developed for face recognition, uses a combination of principal components analysis (PCA) of digitized face images and artificial neural networks. High dimensional face images (e.g., 640 by 480 gray-scale pixel arrays) are reduced to a lower dimensional set of eigenvectors, or "eigenfaces" (Turk & Pentland, 1991). The eigenfaces are then used as input to an artificial neural network or other classifier. A classifier developed by Padgett, Cottrell, and Adolphs (1996) discriminated 86% of six prototypic emotion expressions as defined by Ekman (i.e., joy, sadness, anger, disgust, fear, and surprise). Another classifier, developed by Bartlett and colleagues (Bartlett, Viola, Sejnowski, Golomb, Larsen, Hager, & Ekman, 1996), discriminated 89% of six upper face FACS action units.

Although promising, these systems have some limitations. First, because Padgett et al. (1996) and Bartlett et al. (1996) perform PCA on gray-scale values, information about individual identity is encoded along with information about expression, which may impair discrimination. Additional lower-level image processing may be required to produce more robust discrimination of facial displays. Second, eigenfaces are reported to be highly sensitive to minor variation in image alignment in the task of face recognition (Phillips, 1996); similar or even greater precision in image alignment is likely to be required when eigenfaces are used to discriminate facial displays. The image alignment used by Padgett et al. (1996) and Bartlett et al. (1996) was limited to translation and scaling, which is insufficient to align face images across subjects when the face is rotated. Third, these methods have been tested only on rather limited image data sets. Padgett et al. (1996) analyzed photographs from Ekman and Friesen's Pictures of Facial Affect, which are considered prototypic expressions of emotion. Prototypic expressions differ from each other in many ways, which facilitates automated discrimination.
Bartlett et al. (1996) analyzed images of subjects, many of whom were experts in recognizing and performing FACS action units, and target action units occurred individually rather than being embedded within other facial displays. Fourth, Bartlett et al. performed manual time warping to produce a standard set of six pre-selected frames for each subject. Manual time warping is of variable reliability and is time consuming. Moreover, in many applications behavior samples are variable in duration, and therefore standardizing duration may omit critical information.

More recent research has taken optical-flow-based approaches to discriminating facial displays. Such approaches are based on the assumption that muscle contraction causes deformation of the overlying skin. In a digitized image sequence, algorithms for optical flow extract motion from the subtle texture changes in the skin, and the pattern of such movement may be used to discriminate facial displays. Specifically, the velocity and direction of pixel movement across the entire face, or within windows selected to cover certain facial regions, are computed between successive frames. Using measures of optical flow, Essa, Pentland, and Mase (Essa & Pentland, 1994; Mase, 1991; Mase & Pentland, 1990) and Yacoob and Davis (1994) discriminated among emotion-specified displays (e.g., joy, surprise, fear). This level of analysis is comparable to the objective of manual methods that are based on prototypic emotion expressions (e.g., AFFEX: Izard et al., 1983). The work of Mase (1991), Mase and Pentland (1991), and Essa and Pentland (1994) suggested that more subtle changes in facial displays, as represented by FACS action units, could be detected from differential patterns of optical flow. Essa and Pentland (1994), for instance, found increased flow associated with action units in the brow and mouth regions. The specificity of optical flow for action unit discrimination, however, was not tested; discrimination of facial displays remained at the level of emotion expressions rather than the finer and more objective level of FACS action units. Bartlett et al. (1996) discriminated between action units in the brow and eye regions in a small number of subjects.

A question about optical-flow-based methods is whether they have sufficient sensitivity to subtle differences in facial displays, as represented in FACS action units. Work to date has used aggregate measures of optical flow within relatively large facial regions (e.g., forehead or cheeks), including modal flow (Black & Yacoob, 1995; Rosenblum, Yacoob, & Davis, 1994; Yacoob & Davis, 1994) and mean flow within the region (Mase, 1991; Mase & Pentland, 1991). Black and Yacoob (1995) and Black, Yacoob, Jepson, and Fleet (1997) also disregard subtle changes in flow that fall below an assigned threshold. Information about small deviations is lost when the flow pattern is aggregated or thresholds are imposed. As a result, the accuracy of discriminating FACS action units may be reduced.
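To make the contrast concrete, the sketch below computes an aggregate flow measure of the kind summarized above: dense optical flow over a pair of face images, averaged and thresholded within one large facial window. It is an illustration only; OpenCV's Farneback dense-flow routine stands in for the specific algorithms used by the authors cited above, and the file names, window coordinates, and threshold value are arbitrary assumptions.

```python
import cv2
import numpy as np

# Dense optical flow between two consecutive gray-scale frames. Farneback's
# algorithm (OpenCV) is used purely as a stand-in for the region-based flow
# methods cited in the text.
prev_frame = cv2.imread("frame_010.png", cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread("frame_011.png", cv2.IMREAD_GRAYSCALE)
flow = cv2.calcOpticalFlowFarneback(
    prev_frame, next_frame, None,
    0.5, 3, 15, 3, 5, 1.2, 0,  # pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
)  # flow has shape (H, W, 2): per-pixel (dx, dy)

# Aggregate the flow within one large facial region (hypothetical forehead window).
forehead = flow[40:120, 200:440]
mean_flow = forehead.reshape(-1, 2).mean(axis=0)  # mean (dx, dy) over the whole region

# Thresholding of the kind described above: flow vectors below an assigned
# magnitude are discarded, so small deviations no longer contribute.
magnitude = np.linalg.norm(forehead, axis=2)
retained = forehead[magnitude > 0.5]
```

Averaging and thresholding in this way collapse the displacement field to a few numbers per region, which is the loss of detail that the feature point approach described next is designed to avoid.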
The objective of the present study was to implement the first version of our automated method of face analysis and to assess its concurrent validity with manual FACS coding. Unlike previous automated systems that use aggregate flow within large feature windows, our system tracks the movement of closely spaced feature points within very small feature windows (currently 13 by 13 pixels) and imposes no arbitrary thresholds. The feature points to be tracked are selected based on two criteria: they lie in regions of high texture, and they represent the underlying muscle activation of closely related action units. Discriminant function analyses are performed on the feature point measurements for action units in the brow, eye, and mouth regions. The descriptive power of feature point tracking is evaluated by comparing the results of a discriminant classifier based on feature point tracking with those of manual FACS coding.

Method

Image acquisition

Subjects were 100 university students enrolled in introductory psychology classes. They ranged in age from 18 to 30 years. Sixty-five percent were female, 15 percent were African-American, and 3 percent were Asian or Latino. The observation room was equipped with a chair for the subject and two Panasonic WV3230 cameras, each connected to a Panasonic S-VHS AG-7500 video recorder with a Horita synchronized time-code generator. One of the cameras was located directly in front of the subject, and the other was positioned 30 degrees to the right of the subject. Only image data from the frontal camera are included in this report.

Subjects were instructed by an experimenter to perform a series of 23 facial displays that included single action units (e.g., AU 12, or lip corners pulled obliquely) and combinations of action units (e.g., AU 1+2, or inner and outer brows raised). Subjects began and ended each display from a neutral face. Before performing each display, an experimenter described and modeled the desired display. Six of the displays were based on descriptions of prototypic emotions (i.e., joy, surprise, anger, fear, disgust, and sadness). These six tasks and mouth opening in the absence of other action units were coded by one of the authors (AZ), who is certified in the use of FACS. Seventeen percent of the data were comparison coded by a second certified FACS coder. Inter-observer agreement was quantified with coefficient kappa, which is the proportion of agreement above what would be expected to occur by chance (Cohen, 1960; Fleiss, 1981). The mean kappa for inter-observer agreement was 0.86.

Action units that occurred a minimum of 25 times in the image database were selected for analysis. This criterion ensured sufficient data for training and testing of Automated Face Analysis. When an action unit occurred in combination with other action units that may modify its appearance, the combination rather than the single action unit was the unit of analysis. Figure 1 shows the action units and action unit combinations thus selected. The action units we analyzed in three facial regions (brows, eyes, and mouth) are key components of emotion and other paralinguistic displays and are common variables in emotions research. For instance, AU 4 is characteristic of negative emotion and mental effort, and AU 1+2 is a component of surprise. AU 6 differentiates felt, or Duchenne, smiles (AU 6+12) from non-Duchenne smiles (AU 12) (Ekman et al., 1990). In all three facial regions, the action units chosen are relatively difficult to discriminate because they involve subtle differences in appearance (e.g., brow narrowing due to AU 1+4 versus AU 4, eye narrowing due to AU 6 versus AU 7, three separate action unit combinations involving AU 17, and mouth widening due to AU 12 versus AU 20). Unless otherwise noted, "action units" as used below refers to both single action units and action unit combinations.

Figure 1 About Here
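Coefficient kappa is used here and again below to quantify agreement beyond chance. As a point of reference, the following is a minimal sketch of Cohen's (1960) kappa computed from two coders' labels; the function and the toy labels are illustrative and are not drawn from the study's data.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two coders beyond chance.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each coder's marginal
    label frequencies (Cohen, 1960).
    """
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    categories = np.union1d(labels_a, labels_b)
    index = {c: i for i, c in enumerate(categories)}

    # Confusion matrix of coder A (rows) versus coder B (columns).
    confusion = np.zeros((len(categories), len(categories)))
    for a, b in zip(labels_a, labels_b):
        confusion[index[a], index[b]] += 1

    n = confusion.sum()
    p_observed = np.trace(confusion) / n
    p_expected = np.sum(confusion.sum(axis=1) * confusion.sum(axis=0)) / n**2
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy example with hypothetical action unit labels from two coders.
coder_1 = ["AU 1+2", "AU 4", "AU 12", "AU 4", "AU 1+2", "AU 12"]
coder_2 = ["AU 1+2", "AU 4", "AU 12", "AU 1+4", "AU 1+2", "AU 12"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # agreement beyond chance
```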
Image processing and analysis

Image sequences from neutral to target display (mean duration approximately 20 frames at 30 frames per second) were digitized automatically into 640 by 490 pixel arrays with 8-bit precision for gray-scale values. Target displays represented a range of action unit intensities, including low, medium, and high intensity.

Figure 2 About Here

Image alignment. To remove the effects of spatial variation in face position, slight rotation, and facial proportions, images must be aligned and normalized prior to analysis. Three facial feature points were manually marked in the initial image: the medial canthus of both eyes and the uppermost point of the philtrum. Using an affine transformation, the images were then automatically mapped to a standard face model based on these feature points (Figure 2). By automatically controlling for face position, orientation, and magnification in this initial processing step, the optical flows in each frame had exact geometric correspondence.

Figure 3 About Here

Feature point tracking. In the first frame, 37 feature points were manually marked using a computer mouse (leftmost image in Figure 3): 6 points around the contours of the brows, 8 around the eyes, 13 around the nose, and 10 around the mouth. The inter-observer reliability of feature point marking was assessed by independently marking 33 of the initial frames. Mean inter-observer error was 2.29 and 2.01 pixels in the horizontal and vertical dimensions, respectively. Mean inter-observer reliability, quantified with Pearson correlation coefficients, was 0.97 and 0.93 in the horizontal and vertical dimensions, respectively.

The movement of the feature points was automatically tracked in the image sequence using an optical flow algorithm (Lucas & Kanade, 1981). Given an n by n feature region R and a gray-scale image I, the algorithm solves for the displacement vector d = (dx, dy) of the original n by n feature region by minimizing the residual E(d), defined as

E(d) = Σ_{x ∈ R} [ I_{t+1}(x + d) − I_t(x) ]²

where x = (x, y) is a vector of image coordinates. The Lucas-Kanade algorithm performs the minimization efficiently by using spatio-temporal gradients, and the displacements dx and dy are solved with sub-pixel accuracy. The region size used in the algorithm was 13 by 13 pixels. The algorithm was implemented using an iterative, hierarchical 5-level image pyramid (Poelman, 1995), with which rapid and large displacements of up to 100 pixels (e.g., as found in sudden mouth opening) can be robustly tracked while maintaining sensitivity to subtle (sub-pixel) facial motion. On a dual-processor 300 MHz Pentium II computer with 128 megabytes of random access memory, processing time is approximately 1 second per frame.

The two images on the right in Figure 3 show an example of feature point tracking results. The subject's face changes from neutral (AU 0) to brow raise (AU 1+2), eye widening (AU 5), and jaw drop (AU 26), which is characteristic of surprise. The feature points are precisely tracked across the image sequence. Lines trailing from the feature points represent changes in their location during the image sequence. As the action units become more extreme, the feature point trajectories become longer.
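The alignment and tracking steps just described can be sketched as follows. The sketch uses OpenCV's affine warp and pyramidal Lucas-Kanade tracker as stand-ins for the implementation cited in the text (Poelman, 1995); the window size (13 by 13 pixels), pyramid depth (5 levels), image size (640 by 490), and sequence length follow the text, while the file names, the standard-model coordinates, and the stopping criteria are assumptions.

```python
import cv2
import numpy as np

# --- Image alignment (standard-model coordinates are hypothetical) ---
# Three manually marked points in the first frame: left medial canthus,
# right medial canthus, and the uppermost point of the philtrum (x, y).
marked_points = np.float32([[268.0, 205.0], [372.0, 205.0], [320.0, 280.0]])
model_points = np.float32([[270.0, 200.0], [370.0, 200.0], [320.0, 275.0]])

frame0 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
warp = cv2.getAffineTransform(marked_points, model_points)
aligned0 = cv2.warpAffine(frame0, warp, (640, 490))

# --- Feature point tracking with pyramidal Lucas-Kanade ---
# The 37 manually marked feature points in the aligned first frame (N x 1 x 2).
points0 = np.load("feature_points.npy").astype(np.float32).reshape(-1, 1, 2)

lk_params = dict(
    winSize=(13, 13),  # small feature window, as in the text
    maxLevel=4,        # 5-level image pyramid (levels 0-4)
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
)

prev_img, prev_pts = aligned0, points0
trajectories = [prev_pts.reshape(-1, 2)]
for i in range(1, 20):  # roughly 20-frame neutral-to-target sequence
    frame = cv2.imread(f"frame_{i:03d}.png", cv2.IMREAD_GRAYSCALE)
    aligned = cv2.warpAffine(frame, warp, (640, 490))
    next_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev_img, aligned, prev_pts, None, **lk_params
    )
    trajectories.append(next_pts.reshape(-1, 2))
    prev_img, prev_pts = aligned, next_pts

# Displacement of each feature point from the neutral frame to the target
# frame; these per-point (dx, dy) values are the measurements analyzed next.
displacements = trajectories[-1] - trajectories[0]
```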
Data analysis and action unit recognition

To evaluate the descriptive power of the feature point tracking measurements, discriminant function analysis was used. Separate discriminant function analyses (DFA) were conducted on the measurement data for action units within each facial region. In the analyses of the brow region, the measurements consisted of the horizontal and vertical displacements of the 6 feature points around the brows. In the analyses of the eye region and of Duchenne versus non-Duchenne smiles, the measurements consisted of the horizontal and vertical displacements of the 8 feature points around the eyes. In the analyses of the mouth region, the measurements consisted of the horizontal and vertical displacements of the 10 feature points around the mouth and of four points on either side of the nostrils, because of the latter's relevance to the action of AU 9. Each measurement was therefore represented by a 2p-dimensional vector formed by concatenating the p feature point displacements; that is,

D = (d_1, d_2, ..., d_p) = (d_1x, d_1y, d_2x, d_2y, ..., d_px, d_py).

Discrimination between action units was performed by computing and comparing the a posteriori probabilities of the action units; that is, D is assigned to AU_k if

p(AU_k | D) > p(AU_j | D) for all j ≠ k,

where, by Bayes' rule,

p(AU_i | D) = p(D | AU_i) p(AU_i) / Σ_j p(D | AU_j) p(AU_j).

The discriminant function between AU_i and AU_j is therefore the log-likelihood ratio

f_ij(D) = log [ p(AU_i | D) / p(AU_j | D) ] = log [ p(D | AU_i) p(AU_i) / ( p(D | AU_j) p(AU_j) ) ].

Each class-conditional density p(D | AU_i) was assumed to be a multivariate Gaussian distribution N(μ_i, Σ_i), where the mean μ_i and the covariance matrix Σ_i were estimated by the sample mean and sample covariance matrix of the training data. In general this discriminant function is a quadratic discriminant function; if the covariance matrices Σ_i and Σ_j are equal, it reduces to a linear discriminant function. Because we were interested in the descriptive power of the feature point displacement vector itself, rather than relying on other information (e.g., relative frequencies of action units in our specific samples), the a priori probabilities p(AU_i) were assumed to be equal.

The analyses used 872 samples of 15 action units or action unit combinations that occurred 25 or more times in 504 image sequences of 100 subjects. The samples were randomly divided into a training set and a cross-validation set. However, if an action unit occurred in more than one image sequence from the same subject, all of the samples of that action unit by that subject were assigned to the training set. Thus, for each action unit, samples from the same subject belonged exclusively to either the training set or the cross-validation set but not both. This strict criterion ensured that the training and cross-validation sets contained no overlapping subjects for any action unit, and thus that what was recognized was the action unit rather than the subject.

The agreement of action unit discrimination between manual FACS coding and Automated Face Analysis by feature point tracking was quantified with coefficient kappa (κ), the proportion of agreement above what would be expected to occur by chance (Cohen, 1960; Fleiss, 1981). In preliminary analyses, subjects' race and gender were unrelated to classification accuracy and therefore were not included as factors in the discriminant function analyses and classification results reported below.
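The classification rule above amounts to a quadratic discriminant with equal priors: fit a Gaussian N(μ_i, Σ_i) per action unit and assign a displacement vector D to the class with the largest log-likelihood. The sketch below is a minimal reconstruction from those formulas, not the authors' implementation; the array names and the small ridge term added for numerical stability are assumptions.

```python
import numpy as np

def fit_gaussian_classes(X_train, y_train, ridge=1e-6):
    """Estimate a mean vector and covariance matrix per action unit class."""
    params = {}
    for label in np.unique(y_train):
        samples = X_train[y_train == label]  # rows are 2p-dim displacement vectors D
        mean = samples.mean(axis=0)
        cov = np.cov(samples, rowvar=False)
        cov += ridge * np.eye(cov.shape[0])  # small ridge keeps cov invertible (assumption)
        params[label] = (mean, cov)
    return params

def log_gaussian(D, mean, cov):
    """log N(D; mean, cov) for a single displacement vector D."""
    diff = D - mean
    sign, logdet = np.linalg.slogdet(cov)
    mahal = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(D) * np.log(2.0 * np.pi) + logdet + mahal)

def classify(D, params):
    """Equal priors, so the posterior is maximized by the class log-likelihood."""
    scores = {label: log_gaussian(D, m, c) for label, (m, c) in params.items()}
    return max(scores, key=scores.get)

# Usage sketch (hypothetical arrays): X_train holds concatenated (dx, dy)
# displacements for a facial region, y_train the corresponding manual FACS codes.
# params = fit_gaussian_classes(X_train, y_train)
# predicted = [classify(D, params) for D in X_test]
```

Agreement between these predictions and the manual FACS codes can then be summarized with the kappa function sketched earlier.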
