SOG: A Synthetic Occupancy Generator to Support Entity Resolution Instruction and Research
Authors
Abstract
This paper reports on a project to develop SOG (Synthetic Occupancy Generator), a system that creates realistic but synthetic residential occupancy (name and address) histories as input for Entity Resolution (ER) processes. ER processes are intended to link records that reference the same, or related, real-world entities. Most organizations use some type of ER process to recognize their customers or clients across different channels of contact, such as name and address, telephone number, or email address. However, growing concerns over customer privacy and identity theft have made organizations reluctant to publicly release personally identifiable customer information. As a result, it can be difficult to obtain actual occupancy information for student exercises or for experiments with entity resolution methods and techniques. SOG was created to address this problem by providing a tool capable of automatically generating a large number of realistic but synthetic occupancy histories. SOG control parameters allow the user to customize certain features of the simulated occupancy histories. The project reported here is the first phase of a larger project. The second phase is to develop tools that systematically disrupt the SOG output to create ER test files with varying degrees of data quality and different file formats.
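To make the idea of a parameterized occupancy generator concrete, the following is a minimal sketch in Python. It is not SOG's actual design: the name and street pools, the `Occupancy` record, the `generate_histories` function, and its parameters (`n_people`, `max_moves`, `first_year`, `last_year`, `seed`) are all hypothetical and chosen only for illustration. Each synthetic person receives a name and a chain of back-to-back residence intervals covering the simulated period.

```python
import random
from dataclasses import dataclass

# Tiny illustrative pools; hypothetical, not SOG's actual data sources.
FIRST_NAMES = ["Ann", "Luis", "Mei", "Omar"]
LAST_NAMES = ["Baker", "Chen", "Diaz", "Khan"]
STREETS = ["Oak St", "Main St", "Pine Ave", "Lake Rd"]


@dataclass(frozen=True)
class Occupancy:
    """One residence interval for one synthetic person (hypothetical schema)."""
    person_id: int
    name: str
    address: str
    start_year: int
    end_year: int


def generate_histories(n_people, max_moves, first_year, last_year, seed=0):
    """Return Occupancy records, one per residence per synthetic person."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    records = []
    for pid in range(n_people):
        name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        # Moves happen strictly inside the period, so no interval is empty.
        candidates = range(first_year + 1, last_year)
        n_moves = min(rng.randint(0, max_moves), len(candidates))
        move_years = sorted(rng.sample(candidates, n_moves))
        # Consecutive boundaries become back-to-back occupancy intervals.
        bounds = [first_year] + move_years + [last_year]
        for start, end in zip(bounds, bounds[1:]):
            address = f"{rng.randint(1, 999)} {rng.choice(STREETS)}"
            records.append(Occupancy(pid, name, address, start, end))
    return records


if __name__ == "__main__":
    for rec in generate_histories(n_people=3, max_moves=2,
                                  first_year=2000, last_year=2010, seed=42):
        print(rec)
```

In this sketch, `max_moves` and the year range play the role of control parameters, and the fixed seed makes a generated file reproducible for classroom exercises. A later step, such as the disruption tools planned for the second phase, could then degrade these clean records to produce ER test files.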
Similar resources
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or non-linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Reconfiguration of the respiratory network at the onset of locust flight.
J. Neurophysiol. 80: 3137-3147, 1998. The respiratory interneurons 377, 378, 379 and 576 were identified within the suboesophageal ganglion (SOG) of the locust. Intracellular stimulation of these neurons excited the auxiliary muscle 59 (M59), a muscle that is involved in the control of thoracic pumping in the locust. Like...
A Generator of Synthetic Access Logs that Contain Realistic User Behavior Patterns
Generating high quality synthetic data for testing algorithms and system implementations is challenging. This research designed and developed a tool called QUAlity Synthetic Information Log Generator (Quasi-Log) to facilitate the development and testing of a new series of Information Discovery Systems (IDS) that focus on detecting User Behavior Patterns to improve the quality, security and pote...
Fusion of LST products of ASTER and MODIS Sensors Using STDFA Model
Land Surface Temperature (LST) is one of the most important physical and climatological parameters, crucial yet variable, in studies of environmental phenomena such as soil moisture conditions, urban heat islands, vegetation health, fire risk in forest areas, and the effects of heat on human health. These studies need land surface temperature data with high spatial and temporal resolution. Remote sensing ...