Concept hierarchy generation for categorical data

Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Examples include geographic location, job category, and item type. There are several methods for the generation of concept hierarchies for categorical data.

1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts.

Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or an expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level. For example, a relational database or a dimension location of a data warehouse may contain the following group of attributes: street, city, province_or_state, and country. A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < province_or_state < country.
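To make this schema-level specification concrete, here is a minimal Python sketch. The list representation, the function names, and the small demo are illustrative assumptions, not part of the text; the hierarchy is encoded as an ordered list of levels, from most specific to most general:

```python
# A minimal sketch of method 1: a user or expert specifies the total
# ordering street < city < province_or_state < country at the schema
# level. Here the hierarchy is simply an ordered list of attribute
# names, from the most specific level (bottom) to the most general (top).

location_hierarchy = ["street", "city", "province_or_state", "country"]

def level_of(attribute, hierarchy):
    """Return the conceptual level of an attribute (0 = most specific)."""
    return hierarchy.index(attribute)

def more_general(a, b, hierarchy):
    """True if attribute a denotes a higher (more general) concept than b."""
    return level_of(a, hierarchy) > level_of(b, hierarchy)

print(" < ".join(location_hierarchy))   # street < city < province_or_state < country
print(more_general("country", "city", location_hierarchy))  # True
```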

country (15 distinct values)
  |
province_or_state (65 distinct values)
  |
city (3,567 distinct values)
  |
street (674,339 distinct values)

Figure 3.15: Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.

2. Specification of a portion of a hierarchy by explicit data grouping.

This is essentially the manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. However, it is realistic to specify explicit groupings for a small portion of intermediate-level data. For example, after specifying that province and country form a hierarchy at the schema level, one may want to add some intermediate levels manually, such as explicitly defining "{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada" and "{British Columbia, prairies Canada} ⊂ Western Canada".
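The sketch below illustrates such explicit data grouping. The dictionary representation and the generalize function are illustrative assumptions; the value names follow the example above:

```python
# A minimal sketch of method 2: manually enumerated groupings for a
# small portion of the intermediate levels. Each group name maps to the
# set of lower-level values it explicitly contains.

intermediate_groups = {
    "prairies Canada": {"Alberta", "Saskatchewan", "Manitoba"},
    "Western Canada": {"British Columbia", "prairies Canada"},
}

def generalize(value, groups):
    """Climb one level: map a value to the explicit group containing it."""
    for parent, members in groups.items():
        if value in members:
            return parent
    return value  # no explicit grouping was defined for this value

print(generalize("Manitoba", intermediate_groups))         # prairies Canada
print(generalize("prairies Canada", intermediate_groups))  # Western Canada
```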

3. Specification of a set of attributes, but not of their partial ordering.

A user may simply group a set of attributes as a preferred dimension or hierarchy, but may omit stating their partial order explicitly. This may require the system to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy. Without knowledge of data semantics, it is difficult to provide an ideal hierarchical ordering for an arbitrary set of attributes. However, an important observation is that since higher-level concepts generally cover several subordinate lower-level concepts, an attribute defining a high concept level will usually contain a smaller number of distinct values than an attribute defining a lower concept level. Based on this observation, a concept hierarchy can be automatically generated from the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy; the fewer distinct values an attribute has, the higher its level in the generated concept hierarchy. This heuristic rule works well in many cases. Some local-level swapping or adjustments may be performed by users or experts, when necessary, after examining the generated hierarchy.

Let's examine an example of this method.

Example 3.6

Suppose a user selects a set of attributes, street, country, province_or_state, and city, for a dimension location from the database AllElectronics, but does not specify the hierarchical ordering among the attributes.

The concept hierarchy for location can be generated automatically as follows. First, sort the attributes in ascending order based on the number of distinct values in each attribute. This results in the following (where the number of distinct values per attribute is shown in parentheses): country (15), province_or_state (65), city (3,567), and street (674,339). Second, generate the hierarchy from the top down according to the sorted order, with the first attribute at the top level and the last attribute at the bottom level. The resulting hierarchy is shown in Figure 3.15. Finally, the user examines the generated hierarchy and, when necessary, modifies it to reflect the desired semantic relationships among the attributes. In this example, it is obvious that there is no need to modify the generated hierarchy.

Note that this heuristic rule cannot be pushed to the extreme, since there are obvious cases that do not follow it. For example, a time dimension in a database may contain 20 distinct years, 12 distinct months, and 7 distinct days of the week. However, this does not suggest that the time hierarchy should be "year < month < days_of_the_week", with days_of_the_week at the top of the hierarchy.
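A minimal sketch of this distinct-value heuristic follows, using the counts from Figure 3.15; the function name and dictionary interface are illustrative assumptions. It also reproduces the time-dimension counterexample just discussed:

```python
# A minimal sketch of method 3: order the attributes by their number of
# distinct values, placing the attribute with the fewest distinct values
# at the top of the hierarchy.

def generate_hierarchy(distinct_counts):
    """Return attribute names ordered from top (most general) to bottom.

    distinct_counts: mapping of attribute name -> number of distinct values.
    """
    return sorted(distinct_counts, key=distinct_counts.get)

location_counts = {
    "street": 674_339,
    "city": 3_567,
    "province_or_state": 65,
    "country": 15,
}

print(generate_hierarchy(location_counts))
# ['country', 'province_or_state', 'city', 'street']  -- matches Figure 3.15

# Caveat from the text: the heuristic is not universal. Applied blindly to
# a time dimension, it would wrongly put days of the week at the top.
print(generate_hierarchy({"year": 20, "month": 12, "day_of_week": 7}))
# ['day_of_week', 'month', 'year']  -- needs manual adjustment
```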

4. Specification of only a partial set of attributes.

Sometimes a user can be sloppy when defining a hierarchy, or may have only a vague idea about what should be included in a hierarchy. Consequently, the user may include only a small subset of the relevant attributes in the hierarchy specification. For example, instead of including all of the hierarchically relevant attributes for location, one may specify only street and city. To handle such partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together. In this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to be "dragged in" to form a complete hierarchy. Users, however, should have the option to override this feature, as necessary.

Example 3.7

Suppose that a database system has pinned together the five attributes number, street, city, province_or_state, and country, because they are closely linked semantically with respect to the notion of location. If a user were to specify only the attribute city for a hierarchy defining location, the system may automatically drag in all five of these semantically related attributes to form a hierarchy. The user may choose to drop any of these attributes, such as number and street, from the hierarchy, keeping city as the lowest conceptual level in the hierarchy.
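The "dragging in" behavior can be sketched as a simple set expansion over pinned attribute groups. The group contents follow Example 3.7; the function name and representation are illustrative assumptions:

```python
# A minimal sketch of method 4: attributes that are semantically pinned
# together at the schema level are dragged in when the user names only
# one of them.

pinned_groups = [
    {"number", "street", "city", "province_or_state", "country"},  # location
]

def complete_specification(user_attrs, groups):
    """Expand a partial attribute set with its semantically linked group."""
    result = set(user_attrs)
    for group in groups:
        if result & group:   # the user named at least one pinned attribute
            result |= group  # drag in the rest of the group
    return result

dragged = complete_specification({"city"}, pinned_groups)
# The user may then drop attributes such as number and street,
# keeping city as the lowest conceptual level.
print(sorted(dragged - {"number", "street"}))
# ['city', 'country', 'province_or_state']
```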

3.6 Summary

Data preparation is an important issue for both data warehousing and data mining, as real-world data tends to be incomplete, noisy, and inconsistent. Data preparation includes data cleaning, data integration, data transformation, and data reduction.

Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies.

Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and the resolution of semantic heterogeneity all contribute toward smooth data integration.

Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0.0 to 1.0.

Data reduction techniques such as data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization can be used to obtain a reduced representation of the data while minimizing the loss of information content.

Concept hierarchies organize the values of attributes or dimensions into gradual levels of abstraction. They are a form of discretization that is particularly useful in multilevel mining.

Automatic generation of concept hierarchies for categorical data may be based on the number of distinct values of the attributes defining the hierarchy. For numeric data, techniques such as data segmentation by partition rules, histogram analysis, and clustering analysis can be used.

Although several methods of data preparation have been developed, data preparation remains an active area of research.

Exercises

1. Data quality can be assessed in terms of accuracy, completeness, and consistency. Propose two other dimensions of data quality.

2. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.

3. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.

(b) How might you determine outliers in the data?

(c) What other methods are there for data smoothing?

4. Discuss issues to consider during data integration.

5. Using the data for age given in Question 3, answer the following:

(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].

(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.

(c) Use normalization by decimal scaling to transform the value 35 for age.

(d) Comment on which method you would prefer to use for the given data, and explain why.

6. Use a flowchart to illustrate the following procedures for attribute subset selection:

(a) step-wise forward selection.

(b) step-wise backward elimination.

(c) a combination of forward selection and backward elimination.

7. Using the data for age given in Question 3:

(a) Plot an equi-width histogram of width 10.

(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sampling, stratified sampling.

8. Propose a concept hierarchy for the attribute age using the 3-4-5 partition rule.

9. Propose an algorithm, in pseudo-code or in your favorite programming language, for

(a) the automatic generation of a concept hierarchy for categorical data based on the number of distinct values of attributes in the given schema,

(b) the automatic generation of a concept hierarchy for numeric data based on the equi-width partitioning rule, and

(c) the automatic generation of a concept hierarchy for numeric data based on the equi-depth partitioning rule.

Bibliographic Notes

Data preprocessing is discussed in a number of textbooks, including Pyle [28], Kennedy et al. [21], and Weiss and Indurkhya [37]. More specific references to individual preprocessing techniques are given below.

For discussion regarding data quality, see Ballou and Tayi [3], Redman [31], Wand and Wang [35], and Wang, Storey, and Firth [36]. The handling of missing attribute values is discussed in Quinlan [29], Breiman et al. [5], and Friedman [11]. A method for the detection of outlier or "garbage" patterns in a handwritten character database is given in Guyon, Matic, and Vapnik [14]. Binning and data normalization are treated in several texts, including [28, 21, 37].

A good survey of data reduction techniques can be found in Barbara et al. [4]. For algorithms on data cubes and their precomputation, see [33, 16, 1, 38, 32]. Greedy methods for attribute subset selection (or feature subset selection) are described in several texts, such as Neter et al. [24] and John [18]. A method combining forward selection and backward elimination was proposed in Siedlecki and Sklansky [34]. For a description of wavelets for data compression, see Press et al. [27]. Daubechies transforms are described in Daubechies [6]. The book by Press et al. [27] also contains an introduction to singular value decomposition for principal components analysis.

An introduction to regression and log-linear models can be found in several textbooks, such as [17, 9, 20, 8, 24]. For log-linear models (known as multiplicative models in the computer science literature), see Pearl [25]. For a general introduction to histograms, see [7, 4]. For extensions of single-attribute histograms to multiple attributes, see Muralikrishna and DeWitt [23], and Poosala and Ioannidis [26]. Several references to clustering algorithms are given in Chapter 7 of this book, which is devoted to the topic. A survey of multidimensional indexing structures appears in Gaede and Gunther [12]. The use of multidimensional index trees for data aggregation is discussed in Aoki [2]. Index trees include R-trees (Guttman [13]), quad-trees (Finkel and Bentley [10]), and their variations.

Entropy and information gain are described in Quinlan [30]. Concept hierarchies and their automatic generation from categorical data are described in Han and Fu [15].

[1] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 506-521, Bombay, India, Sept. 1996.

[2] P. M. Aoki. Generalizing "search" in generalized search trees. In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), April 1998.

[3] D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.

[4] D. Barbara et al. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20:3-45, December 1997.

[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[6] I. Daubechies. Ten Lectures on Wavelets. Capital City Press, Montpelier, Vermont, 1992.

[7] J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. New York: Duxbury Press, 1997.

[8] J. L. Devore. Probability and Statistics for Engineering and the Sciences, 4th ed. Duxbury Press, 1995.

[9] A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.

[10] R. A. Finkel and J. L. Bentley. Quad-trees: A data structure for retrieval on composite keys. Acta Informatica, 4:1-9, 1974.

[11] J. H. Friedman. A recursive partitioning decision rule for nonparametric classifiers. IEEE Trans. on Comp., 26:404-408, 1977.

[12] V. Gaede and O. Gunther. Multidimensional access methods. ACM Comput. Surv., 30:170-231, 1998.

[13] A. Guttman. R-tree: A dynamic index structure for spatial searching. In Proc. 1984 ACM-SIGMOD Int. Conf. Management of Data, June 1984.

[14] I. Guyon, N. Matic, and V. Vapnik. Discovering informative patterns and data cleaning. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 181-203. AAAI/MIT Press, 1996.

[15] J. Han and Y. Fu. Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases. In Proc. AAAI'94 Workshop Knowledge Discovery in Databases (KDD'94), pages 157-168, Seattle, WA, July 1994.

[16] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 205-216, Montreal, Canada, June 1996.

[17] M. James. Classification Algorithms. John Wiley, 1985.

[18] G. H. John. Enhancements to the Data Mining Process. Ph.D. Thesis, Computer Science Dept., Stanford University, 1997.


[19] G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), pages 367-370, Portland, OR, Aug. 1996.

[20] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis, 3rd ed. Prentice Hall, 1992.

[21] R. L. Kennedy, Y. Lee, B. Van Roy, C. D. Reed, and R. P. Lippmann. Solving Data Mining Problems Through Pattern Recognition. Upper Saddle River, NJ: Prentice Hall, 1998.

[22] J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In Proc. 13th ACM Symp. Principles of Database Systems, pages 77-85, Minneapolis, MN, May 1994.

[23] M. Muralikrishna and D. J. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proc. 1988 ACM-SIGMOD Int. Conf. Management of Data, pages 28-36, Chicago, IL, June 1988.

[24] J. Neter, M. H. Kutner, C. J. Nachtsheim, and L. Wasserman. Applied Linear Statistical Models, 4th ed. Irwin: Chicago, 1996.

[25] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Palo Alto, CA: Morgan Kaufmann, 1988.

[26] V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. 23rd Int. Conf. on Very Large Data Bases, pages 486-495, Athens, Greece, Aug. 1997.

[27] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, MA, 1996.

[28] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.

[29] J. R. Quinlan. Unknown attribute values in induction. In Proc. 6th Int. Workshop on Machine Learning, pages 164-168, Ithaca, NY, June 1989.

[30] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[31] T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.

[32] K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very Large Data Bases, pages 116-125, Athens, Greece, Aug. 1997.

[33] S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. In Proc. 1994 Int. Conf. Data Engineering, pages 328-336, Feb. 1994.

[34] W. Siedlecki and J. Sklansky. On automatic feature selection. Int. J. of Pattern Recognition and Artificial Intelligence, 2:197-220, 1988.

[35] Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.

[36] R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

[37] S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998.

[38] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 159-170, Tucson, Arizona, May 1997.

Contents

4 Primitives for Data Mining

4.1 Data mining primitives: what defines a data mining task?
    4.1.1 Task-relevant data
    4.1.2 The kind of knowledge to be mined
    4.1.3 Background knowledge: concept hierarchies
    4.1.4 Interestingness measures
    4.1.5 Presentation and visualization of discovered patterns
4.2 A data mining query language
    4.2.1 Syntax for task-relevant data specification
    4.2.2 Syntax for specifying the kind of knowledge to be mined
    4.2.3 Syntax for concept hierarchy specification
    4.2.4 Syntax for interestingness measure specification
    4.2.5 Syntax for pattern presentation and visualization specification
    4.2.6 Putting it all together: an example of a DMQL query
4.3 Designing graphical user interfaces based on a data mining query language
4.4 Summary

Chapter 4

Primitives for Data Mining

A popular misconception about data mining is to expect that data mining systems can autonomously dig out all of the valuable knowledge that is embedded in a given large database, without human intervention or guidance.

Although it may at first sound appealing to have an autonomous data mining system, in practice, such systems will uncover an overwhelmingly large set of patterns. The entire set of generated patterns may easily surpass the size of the given database! To let a data mining system "run loose" in its discovery of patterns, without providing it with any indication regarding the portions of the database that the user wants to probe or the kinds of patterns the user would find interesting, is to let loose a data mining "monster". Most of the patterns discovered would be irrelevant to the analysis task of the user. Furthermore, many of the patterns found, though related to the analysis task, may be difficult to understand, or may lack validity, novelty, or utility, making them uninteresting. Thus, it is neither realistic nor desirable to generate, store, or present all of the patterns that could be discovered from a given database.

A more realistic scenario is to expect that users can communicate with the data mining system using a set of data mining primitives designed to facilitate efficient and fruitful knowledge discovery. Such primitives include the specification of the portions of the database or the set of data in which the user is interested (including the database attributes or data warehouse dimensions of interest), the kinds of knowledge to be mined, background knowledge useful in guiding the discovery process, interestingness measures for pattern evaluation, and how the discovered knowledge should be visualized. These primitives allow the user to interactively communicate with the data mining system during discovery in order to examine the findings from different angles or depths, and to direct the mining process.

A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems. Having a data mining query language also provides a foundation on which friendly graphical user interfaces can be built. In this chapter, you will learn about the data mining primitives in detail, as well as study the design of a data mining query language based on these principles.
