Data Driven Knowledge Extraction of Materials Properties
ABSTRACT
In this paper the problem of modelling a large commercial materials dataset using advanced adaptive numeric methods is described. The various approaches are briefly outlined, with an emphasis on their characteristics with respect to generalisation, performance and transparency. A novel Support Vector Machine (SVM) approach, which incorporates a high degree of transparency via a full ANalysis Of VAriance (ANOVA) expansion, is also used. Using the example of predicting 0.2% proof stress from a set of materials features, we show how the different modelling techniques compare when benchmarked against independent test data.

INTRODUCTION

The development of empirical models is fundamental to the understanding of complex materials properties within the field of materials science [1, 2]. Models may then be used to understand the physical relationships that exist and to enable optimisation of materials production. Empirical modelling is the extraction of system relationships from observational data to produce a model of the system, from which it is possible to predict responses of that system. Ultimately, the quantity and quality of the observations govern the performance of the empirical model. Often only partial knowledge is available about the physical processes involved, although significant amounts of 'raw' data may be available from production and product release records, which may then be used to construct a data driven model.

The empirical study of materials phenomena through statistical models has a number of limiting characteristics. Consider a dataset D_N = {x_i, y_i}, i = 1, ..., N, drawn from an unknown probability distribution F, where x_i represents a set of inputs (e.g. alloy composition and thermomechanical processing information), y_i represents a set of outputs (e.g. mechanical properties) and N is the number of data points. The empirical modelling problem is to find an underlying mapping x → y that is consistent with the dataset D_N. Because the data are observational, they are finite. Typically the sampling is non-uniform and, owing to the high-dimensional nature of the problems of interest (i.e. large numbers of inputs), the data form only a sparse distribution in the input space. Consequently the problem is nearly always ill posed in the sense of Hadamard [3]. To address this, the problem must be converted to one that is well posed: a unique solution must exist that varies continuously with the data. We consider various modelling approaches that are intended to effect this transformation. A further limitation of any empirical modelling technique is its ability to resolve the problem of highly correlated inputs: if two inputs are highly correlated, it is difficult to identify their individual effects on the output.

The work presented in this paper compares and contrasts common empirical models, and state-of-the-art approaches, on the basis of their generalisation ability and transparency. This paper advocates a transparent approach to the modelling problem, which enables understanding of the underlying relationships between inputs and outputs. This knowledge can then be used to enhance model validation through comparison with prior physical knowledge. Generalisation performance is the assessment of model predictions on new, unseen data. Traditional empirical modelling approaches may suffer in terms of generalisation, producing models that overfit the data. Typically, this is a consequence of the model selection procedure, which controls the complexity of the model. For a given learning task, with a finite amount of training data, the best generalisation performance will be achieved if the "capacity" of the model is matched to the complexity of the underlying process.
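To make the capacity argument concrete, the following minimal Python sketch (illustrative only, not from the original study; it assumes NumPy and scikit-learn and uses synthetic data in place of the materials dataset) fits polynomial models of increasing capacity and compares the training error with the error on independent test data, using the same MSE statistic employed later to benchmark the models:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in for an unknown process: a smooth trend plus noise.
x = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * x).ravel() + 0.2 * rng.standard_normal(200)

# Hold out independent test data to assess generalisation.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    mse_train = mean_squared_error(y_tr, model.predict(x_tr))
    mse_test = mean_squared_error(y_te, model.predict(x_te))
    print(f"degree={degree:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```

Too little capacity (degree 1) underfits, so both errors are high; too much capacity (degree 15) overfits, with the training MSE falling while the test MSE rises; the intermediate model generalises best.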
THE MATERIALS DATASET

In this paper we consider an extensive commercial dataset for aluminium alloy 2024 in the T351 temper, with the objective of predicting 0.2% proof stress. The "raw" dataset consists of 35 input variables and 2870 data pairs covering various compositional and thermomechanical processing parameters, as well as "shop floor" information such as plate numbers and date of alloy manufacture. For a physically amenable model to be constructed, the original dataset was reduced to a smaller subset based on a single tensile direction (LT), thickness position (C) and width position (0.5). All of the major alloying elements and the major impurities were retained as inputs to the model; the minor compositional information was removed. The "shop floor" information was also removed, since it was not expected to contribute directly to proof stress, although it does provide a valuable check for changes in processing methods, equipment, etc. Assessment of the slab dimensional information revealed the majority of the slab widths and slab gauges to be fixed; as a consequence, the dataset used for modelling contained information for a single slab width/gauge combination. On inspection the initial scalped slab gauges were found to be equal, so the total reduction of each plate is entirely defined by the final gauge. The hot-rolled width and length were used to define the reduction in the longitudinal and transverse directions; hence a "reduction ratio" was computed as the ratio of engineering strain in the long and transverse directions between the slab and the final plate. This stage of data pre-processing left a reduced dataset with ten input variables: final gauge (FG), Cu, Fe, Mg, Mn, Si (in weight percent), slab length (SL), solution treatment time (STT), percentage stretch (%st.) and reduction ratio (RR). After removing entries with missing and repeated values, 290 data points remained. Before any of the modelling techniques were applied, the dataset was normalised to have zero mean and unit variance.
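As an illustration of this pre-processing stage, the sketch below (assuming pandas, with entirely hypothetical file and column names, since the commercial dataset's schema is not public) filters the raw records, computes the reduction ratio from the engineering strains, removes missing and repeated entries, and normalises the remaining variables to zero mean and unit variance:

```python
import pandas as pd

# Hypothetical file and column names; the commercial dataset is not public.
df = pd.read_csv("aa2024_t351.csv")

# Restrict to a single tensile direction (LT), thickness position (C)
# and width position (0.5).
df = df[(df["direction"] == "LT")
        & (df["thickness_pos"] == "C")
        & (df["width_pos"] == 0.5)]

# Reduction ratio: ratio of the engineering strains in the long and
# transverse directions between the slab and the final plate (one
# plausible reading of the paper's definition).
strain_long = (df["hot_rolled_length"] - df["slab_length"]) / df["slab_length"]
strain_trans = (df["hot_rolled_width"] - df["slab_width"]) / df["slab_width"]
df["RR"] = strain_long / strain_trans

inputs = ["FG", "Cu", "Fe", "Mg", "Mn", "Si", "SL", "STT", "pct_stretch", "RR"]
df = df.dropna(subset=inputs + ["proof_stress"]).drop_duplicates()

# Normalise every retained variable to zero mean and unit variance.
cols = inputs + ["proof_stress"]
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()
```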
MODELLING TECHNIQUES

This section considers the adaptive numeric methods used to predict proof stress from the dataset described in the previous section. Three techniques were considered: (i) a multivariate linear model, (ii) a Bayesian multi-layer perceptron and (iii) a Support Vector Machine. The structure of the data was also examined using a graphical Gaussian model. Each of these models (except the graphical Gaussian model) is assessed against the others quantitatively, using the MSE test statistic, and qualitatively, in terms of transparency.

Graphical Gaussian Models

As the dimensionality of the problem domain increases, graphical models and graphical representations are playing an increasingly important role in statistics, and in empirical modelling in particular. Relationships between the variables in a model can be represented by the edges of a graph whose nodes represent the data variables. Such graphs provide qualitative representations of the conditional independence structure of the model, as well as simplifying inference in highly structured stochastic systems.

Let X be a k-dimensional vector of random variables. A conditional independence graph [4], G = (V, E), describes the association structure of X by means of a graph specified by the vertex set V and the edge set E. Conditional independence is an attractive way to generalise the notion of association between two variables. A graphical model is then a family of probability distributions P_G that are Markov with respect to G. A graphical Gaussian model is obtained when only continuous random variables are considered. If we can assume that the data have been drawn from a Gaussian distribution, then there is no loss of information in condensing the data into the sample mean vector and the sample variance-covariance matrix. A symmetric correlation coefficient matrix can then be obtained from this matrix. To construct the graphical model it is necessary to test for the presence or otherwise of dependencies between the variables. Using a scaled inverse correlation matrix, a second matrix, the deviance matrix, can be computed using equation (1), where X_a and X_b are the variables whose conditional independence is being tested given the other variables in the dataset, X_C. The test statistic has an asymptotic chi-squared distribution with one degree of freedom.

dev(X_a ⊥ X_b | X_C) = -N ln(1 - corr(X_a, X_b | X_C)^2)    (1)

The deviance matrix measures the overall goodness of fit of a graphical model: each entry is subjected to a hypothesis test against the 95% point of the chi-squared distribution. Figure 1 illustrates the graphical model obtained for the materials dataset.
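A minimal NumPy/SciPy sketch of this edge-exclusion test (an illustration consistent with equation (1), not code from the paper) computes the partial correlations from the scaled inverse correlation matrix, forms the deviance matrix, and retains an edge wherever the deviance exceeds the 95% point of the chi-squared distribution with one degree of freedom:

```python
import numpy as np
from scipy.stats import chi2

def deviance_matrix(X):
    """Edge-exclusion deviances, equation (1), for an (N x k) data array X."""
    N, k = X.shape
    # The scaled inverse of the sample correlation matrix yields the
    # partial correlations: corr(X_a, X_b | rest) = -P_ab / sqrt(P_aa * P_bb).
    P = np.linalg.inv(np.corrcoef(X, rowvar=False))
    d = np.sqrt(np.diag(P))
    partial = -P / np.outer(d, d)
    np.fill_diagonal(partial, 0.0)  # the diagonal is not a partial correlation
    return -N * np.log(1.0 - partial ** 2)

# Keep an edge between two variables when the deviance exceeds the 95%
# point of chi-squared with one degree of freedom (about 3.84).
rng = np.random.default_rng(1)
X = rng.standard_normal((290, 11))  # placeholder for the normalised dataset
edges = deviance_matrix(X) > chi2.ppf(0.95, df=1)
```

Entries of `edges` that are True correspond to edges drawn in the conditional independence graph, as in Figure 1.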