A Straightforward Author Profiling Approach in MapReduce
Abstract
Most natural language processing tasks deal with large amounts of data, which take a long time to process. A larger dataset and a good set of features generally improve results, but larger volumes of text and high-dimensional feature sets also mean slower performance. Natural language processing and distributed computing are therefore a good match. In the PAN 2013 competition, the test runtimes for author profiling ranged from several minutes to several days. Most author profiling systems available now are inaccurate, slow, or both. Our system, written entirely in MapReduce, employs nearly 3 million features and still finishes the task in a fraction of the time taken by state-of-the-art systems, and with better accuracy. Our system demonstrates that when we deal with a huge amount of data and/or a large number of features, using distributed systems makes perfect sense.
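As a rough illustration of why feature extraction fits the MapReduce model, the sketch below simulates the map, shuffle, and reduce phases in plain Python. The abstract does not say which features the system uses, so the word unigrams and bigrams here are an assumption, and the names map_features, shuffle, reduce_counts, and run_job are hypothetical. This is not the authors' implementation.

```python
from collections import defaultdict
from itertools import chain


def map_features(doc_id, text):
    """Map step: emit ((doc_id, feature), 1) for each word unigram and bigram.

    Assumption: unigram/bigram features are illustrative only; the abstract
    does not specify the feature set.
    """
    tokens = text.lower().split()
    for token in tokens:
        yield (doc_id, token), 1
    for first, second in zip(tokens, tokens[1:]):
        yield (doc_id, first + " " + second), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts for one (doc_id, feature) key."""
    return key, sum(values)


def run_job(docs):
    """Run the three phases in-process over a dict of doc_id -> text."""
    mapped = chain.from_iterable(
        map_features(doc_id, text) for doc_id, text in docs.items()
    )
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Hypothetical toy input: two short documents keyed by author id.
docs = {"a1": "the cat sat on the mat", "a2": "the dog sat"}
counts = run_job(docs)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what lets the feature count grow without a proportional increase in wall-clock time.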
Similar resources
A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure
Author profiling is a text classification technique used to predict the profiles of the authors of unknown texts by analyzing their writing styles. Author profiles are characteristics of the authors such as gender, age, native language, country and educational background. The existing approaches for Author Profiling suffered from problems like high dimensionality of features and fail to capture th...
A Simple Approach to Author Profiling in MapReduce
Author profiling, being an important problem in forensics, security, marketing, and literary research, needs to be accurate. With massive amounts of online text readily available on which we might need to perform author profiling, building a fast system is as important as building an accurate system, but this can be challenging. However, the use of distributed computing techniques like MapRedu...
On Modeling CPU Utilization of MapReduce Applications
In this paper, we present an approach to predict the total CPU utilization, in terms of CPU clock ticks, of applications running on the MapReduce framework. Our approach has two key phases: profiling and modeling. In the profiling phase, an application is run several times with different sets of MapReduce configuration parameters to profile the total CPU clock ticks of the application on a given platf...
Scalable Distributed Reasoning Using MapReduce
We address the problem of scalable distributed reasoning, proposing a technique for materialising the closure of an RDF graph based on MapReduce. We have implemented our approach on top of Hadoop and deployed it on a compute cluster of up to 64 commodity machines. We show that a naive implementation on top of MapReduce is straightforward but performs badly and we present several non-trivial opt...
On Modeling Dependency between MapReduce Configuration Parameters and Total Execution Time
In this paper, we propose an analytical method to model the dependency between configuration parameters and total execution time of Map-Reduce applications. Our approach has three key phases: profiling, modeling, and prediction. In profiling, an application is run several times with different sets of MapReduce configuration parameters to profile the execution time of the application on a given ...