Generalization Tree

Learning Semantically Robust Rules from Data

by Yiheng Li and Latanya Sweeney

Abstract

We introduce the problem of mining robust rules, which are expressive multi-dimensional generalized association rules. Consider a large relational table in which each attribute has an associated hierarchy: the base values are those originally represented in the data, and values at higher levels of the hierarchy represent increasingly general concepts of the base values. Attribute hierarchies provide meaningful levels of concept aggregation, such as the encoding of postal codes (ZIP) or dates, or the taxonomy of products. We find the least general rules formed by combining mixed levels of generalization across attributes, so that each rule conveys the maximum expression of information supported by the attribute hierarchies, parameter settings and data tuples. We term these "robust rules" and introduce the GenTree algorithm as a means to learn robust rules from a table. An example of a robust rule from a table having base values {5-digit ZIP, gender, registration date (year/month/day), party} might be "women living in Cambridge (021**) and registered in the 1970s (197*/**/**) tend to be Democrats." Previous studies on mining generalized association rules have been limited dimensionally (e.g., transactional data), by data type (e.g., quantitative data), and/or to rules expressed from either fixed-level or non-semantic abstractions. Such approaches limit the kinds of rules that can be learned. Experiments using GenTree with two real-world datasets, containing 10,000 six-attribute tuples and over 4,000 eight-attribute tuples respectively, show that learned rules convey more comprehensive information than is possible with traditional association rule mining algorithms, because traditional approaches limit the expressivity of the rules they generate.
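To make the idea of mixed-level generalization concrete, the sketch below shows how the example rule from the abstract could be checked against a table. It is not the GenTree algorithm, which is described in the report itself. It only illustrates masking-style hierarchies for ZIP codes and dates, plus the standard support and confidence measures for a single rule. The function names, the level numbering and the five toy tuples are all assumptions made for illustration; none of them come from the paper or its datasets.

```python
# Illustrative sketch only: NOT the GenTree algorithm.
# Hierarchies, level numbering, and the toy rows are hypothetical.

def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a 5-digit ZIP with '*'."""
    return zip_code[:5 - level] + "*" * level

def generalize_date(date, level):
    """Generalize 'YYYY/MM/DD': 1 hides the day, 2 the month, 3 the year's last digit."""
    year, month, day = date.split("/")
    if level >= 1:
        day = "**"
    if level >= 2:
        month = "**"
    if level >= 3:
        year = year[:3] + "*"
    return "/".join((year, month, day))

def generalize_flat(value, level):
    """Two-level hierarchy for categorical attributes: base value or '*'."""
    return value if level == 0 else "*"

GENERALIZERS = {
    "zip": generalize_zip,
    "date": generalize_date,
    "gender": generalize_flat,
    "party": generalize_flat,
}

def matches(row, pattern):
    """True if the row, generalized to each attribute's level, equals the pattern."""
    return all(
        GENERALIZERS[attr](row[attr], level) == value
        for attr, (level, value) in pattern.items()
    )

def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    covered = [r for r in rows if matches(r, antecedent)]
    hits = [r for r in covered if matches(r, consequent)]
    support = len(hits) / len(rows)
    confidence = len(hits) / len(covered) if covered else 0.0
    return support, confidence

# Hypothetical toy tuples (not taken from the voter datasets).
ROWS = [
    {"zip": "02138", "gender": "F", "date": "1975/03/12", "party": "D"},
    {"zip": "02139", "gender": "F", "date": "1978/11/02", "party": "D"},
    {"zip": "02141", "gender": "F", "date": "1971/06/30", "party": "R"},
    {"zip": "02138", "gender": "M", "date": "1974/01/15", "party": "R"},
    {"zip": "15213", "gender": "F", "date": "1979/09/09", "party": "D"},
]

# "Women in 021** registered in 197*/**/** tend to be Democrats."
ANTECEDENT = {"zip": (2, "021**"), "gender": (0, "F"), "date": (3, "197*/**/**")}
CONSEQUENT = {"party": (0, "D")}

SUPPORT, CONFIDENCE = rule_stats(ROWS, ANTECEDENT, CONSEQUENT)
print(f"support={SUPPORT:.2f} confidence={CONFIDENCE:.2f}")
```

Each attribute in the pattern carries its own generalization level, which is what distinguishes a mixed-level rule from one in which every attribute is generalized to the same fixed level.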

Experiments based on Voter Data: Pittsburgh, Pennsylvania and Cambridge, Massachusetts.

Keywords: association rules, classification problems, data mining, knowledge acquisition, rule learning, hierarchical learning

Poster

Citation:
Y. Li and L. Sweeney. Learning Robust Rules from Data, Carnegie Mellon University, School of Computer Science, Tech Report, CMU ISRI 04-107, CMU-CALD-04-100. Pittsburgh: February 2004. Paper: 21 pages, PDF.

Spring 2004 [LIDAP@dataprivacylab.org]