Which of the following occurs when the same attribute in related data files has different values

Requirements Analysis and Conceptual Data Modeling

Toby Teorey, ... H.V. Jagadish, in Database Modeling and Design (Fifth Edition), 2011

Multivalued Attributes

A multivalued attribute of an entity is an attribute that can have more than one value associated with the key of the entity. For example, a large company could have many divisions, some of them possibly in different cities. In this case, division or division-name would be classified as a multivalued attribute of the Company entity (and its key, company-name). The headquarters-address attribute of the company, on the other hand, would normally be a single-valued attribute.

Classify multivalued attributes as entities. In this example, the multivalued attribute division-name should be reclassified as an entity Division with division-name as its identifier (key) and division-address as a descriptor attribute. If attributes are restricted to be single valued only, the later design and implementation decisions will be simplified.
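As a rough relational sketch of this guideline (the table and column names here are illustrative, not taken from the chapter), the Company/Division example could map to:

-- Illustrative mapping: the former multivalued attribute becomes its own table,
-- with each division row carrying the key of its owning company.
CREATE TABLE Company (
  company_name         VARCHAR(100) PRIMARY KEY,
  headquarters_address VARCHAR(200)              -- single-valued, so it stays with Company
);

CREATE TABLE Division (
  division_name    VARCHAR(100) PRIMARY KEY,     -- identifier of the new Division entity
  division_address VARCHAR(200),                 -- descriptor attribute
  company_name     VARCHAR(100) NOT NULL REFERENCES Company(company_name)
);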


URL: https://www.sciencedirect.com/science/article/pii/B9780123820204000045

Data Modeling in UML

Terry Halpin, Tony Morgan, in Information Modeling and Relational Databases (Second Edition), 2008

9.3 Attributes

Like other ER notations, UML allows relationships to be modeled as attributes. For instance, in Figure 9.6(a) the Employee class has eight attributes. The corresponding ORM diagram is shown in Figure 9.6(b).


Figure 9.6. UML attributes (a) depicted as ORM relationship types (b).

In UML, attributes are mandatory and single valued by default. So the employee number, name, title, gender, and smoking status attributes are all mandatory. In the ORM model, the unary predicate “smokes” is optional (not everybody has to smoke). UML does not support unary relationships, so it models this instead as the Boolean attribute “isSmoker”, with possible values True or False. In UML the domain (i.e., type) of any attribute may optionally be displayed after it (preceded by a colon). In this example, the domain is displayed only for the isSmoker attribute. By default, ORM tools usually take a closed world approach to unaries, which agrees with the isSmoker attribute being mandatory.

The ORM model also indicates that Gender and Country are identified by codes (rather than names, say). We could convey some of this detail in the UML diagram by appending domain names. For example, “Gendercode” and “Countrycode” could be appended to “gender:” and “birthcountry:” to provide syntactic domains.

In the ORM model it is optional whether we record birth country, social security number, or passport number. This is captured in UML by appending [0..1] to the attribute name (each employee has 0 or 1 birth country, and 0 or 1 social security number). This is an example of an attribute multiplicity constraint. The main multiplicity cases are shown in Table 9.2. If the multiplicity is not declared explicitly, it is assumed to be 1 (exactly one). If desired, we may indicate the default multiplicity explicitly by appending [1..1] or [1] to the attribute.

Table 9.2. Multiplicities.

Multiplicity   Abbreviation   Meaning                      Note
0..1                          0 or 1 (at most one)
0..*           *              0 to many (zero or more)
1..1           1              exactly 1                    Assumed by default
1..*                          1 or more (at least 1)
n..*                          n or more (at least n)       n ≥ 0
n..m                          at least n and at most m     m > n ≥ 0
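One way to read Table 9.2 in relational terms (an illustrative mapping with assumed column names and types, not something prescribed by UML): a mandatory [1..1] attribute becomes a NOT NULL column, a [0..1] attribute becomes a nullable column, and a [0..*] or [1..*] attribute needs a table of its own.

-- Hypothetical relational rendering of the Employee attributes of Figure 9.6(a).
CREATE TABLE Employee (
  empNr        INTEGER      PRIMARY KEY,  -- {P} preferred identifier
  empName      VARCHAR(100) NOT NULL,     -- default multiplicity [1..1]
  title        VARCHAR(20)  NOT NULL,
  gender       CHAR(1)      NOT NULL,     -- GenderCode domain
  isSmoker     BOOLEAN      NOT NULL,
  birthCountry CHAR(2),                   -- [0..1], so nullable
  socialSecNr  VARCHAR(11)  UNIQUE,       -- [0..1] with attribute uniqueness
  passportNr   VARCHAR(20)  UNIQUE        -- [0..1] with attribute uniqueness
);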

In the ORM model, the uniqueness constraints on the right-hand roles (including the Employee Nr reference scheme shown explicitly earlier) indicate that each employee number, social security number, and passport number refer to at most one employee. As mentioned earlier, UML has no standard graphic notation for such “attribute uniqueness constraints”, so we've added our own {P} and {Un} notations for preferred identifiers and uniqueness. UML 2 added the option of specifying {unique} or {nonunique} as part of a multiplicity declaration, but this is only to declare whether instances of collections for multivalued attributes or multivalued association roles may include duplicates, so it can't be used to specify that instances of single valued attributes or combinations of such attributes are unique for the class.

UML has no graphic notation for an inclusive-or constraint, so the ORM constraint that each employee has a social security number or passport number needs to be expressed textually in an attached note, as in Figure 9.6(a). Such textual constraints may be expressed informally, or in some formal language interpretable by a tool. In the latter case, the constraint is placed in braces.

In our example, we've chosen to code the inclusive-or constraint in SQL syntax. Although UML provides OCL for this purpose, it does not mandate its use, allowing users to pick their own language (even programming code). This of course weakens the portability of the model. Moreover, the readability of the constraint is typically poor compared with the ORM verbalization.
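The chapter does not reproduce the SQL text of that note, but over the illustrative Employee table sketched earlier the inclusive-or constraint could be written as a table-level check:

-- Sketch only: each employee must have a social security number or a passport number (or both).
ALTER TABLE Employee
  ADD CONSTRAINT emp_has_ssn_or_passport
  CHECK (socialSecNr IS NOT NULL OR passportNr IS NOT NULL);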

The ORM fact type Employee was born in Country is modeled as a birthcountry attribute in the UML class diagram of Figure 9.6(a). If we later decide to record the population of a country, then we need to introduce Country as a class, and to clarify the connection between birthcountry and Country we would probably reformulate the birthcountry attribute as an association between Employee and Country. This is a significant change to our model. Moreover, any object-based queries or code that referenced the birthcountry attribute would also need to be reformulated. ORM avoids such semantic instability by always using relationships instead of attributes.

Another reason for introducing a Country class is to enable a listing of countries to be stored, identified by their country codes, without requiring all of these countries to participate in a fact. To do this in ORM, we simply declare the Country type to be independent. The object type Country may be populated by a reference table that contains those country codes of interest (e.g., ‘AU’ denotes Australia).

A typical argument in support of attributes runs like this: “Good UML modelers would declare country as a class in the first place, anticipating the need to later record something about it, or to maintain a reference list; on the other hand, features such as the title and gender of a person clearly are things that will never have other properties, and hence are best modeled as attributes”. This argument is flawed. In general, you can't be sure about what kinds of information you might want to record later, or about how important some model feature will become.

Even in the title and gender case, a complete model should include a relationship type to indicate which titles are restricted to which gender (e.g., “Mrs”, “Miss”, “Ms”, and “Lady” apply only to the female sex). In ORM this kind of constraint can be captured graphically as a join-subset constraint or textually as a constraint in a formal ORM language (e.g., If Person1 has a Title that is restricted to Gender1 then Person1 is of Gender1). In contrast, attribute usage hinders expression of the relevant restriction association (try expressing and populating this rule in UML).

ORM includes algorithms for dynamically generating ER and UML diagrams as attribute views. These algorithms assign different levels of importance to object types depending on their current roles and constraints, redisplaying minor fact types as attributes of the major object types. Modeling and maintenance are iterative processes. The importance of a feature can change with time as we discover more of the global model, and the domain being modeled itself changes.

To promote semantic stability, ORM makes no commitment to relative importance in its base models, instead supporting this dynamically through views. Elementary facts are the fundamental units of information; they are uniformly represented as relationships, and how they are grouped into structures is not a conceptual issue. You can have your cake and eat it too by using ORM for analysis, and if you want to work with UML class diagrams, you can use your ORM models to derive them.

Now suppose we also wish to record which sports employees play. One way of modeling this in UML is shown in Figure 9.7(a). Here the information about who plays what sport is modeled as the multivalued attribute “sports”. The “[0..*]” multiplicity constraint on this attribute indicates how many sports may be entered here for each employee. The “0” indicates that it is possible that no sports might be entered for some employee. UML uses a null value for this case, just like the relational model. The presence of nulls exposes users to implementation rather than conceptual issues and adds complexity to the semantics of queries. The “*” in “[0..*]” indicates there is no upper bound on the number of sports of a single employee. In other words, an employee may play many sports, and we don't care how many. If “*” is used without a lower bound, this is taken as an abbreviation for “0..*”.


Figure 9.7. (a) Multivalued UML sports attribute depicted as (b) ORM m:n fact type.

An equivalent ORM schema is shown in Figure 9.7(b). Here an optional, many:many fact type is used instead of the multivalued sports attribute. As discussed in the next section, this approach may also be used in UML using an m:n association.
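Under a straightforward relational mapping (names continue the illustrative Employee sketch above), the m:n fact type becomes a table of its own rather than a multivalued column:

-- Each (employee, sport) pair appears at most once; an employee with no rows here plays no sport.
CREATE TABLE Plays (
  empNr INTEGER     NOT NULL REFERENCES Employee(empNr),
  sport VARCHAR(50) NOT NULL,
  PRIMARY KEY (empNr, sport)
);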

To discuss class instance populations, UML uses object diagrams. These are essentially class diagrams in which each object is shown as a separate instance of a class, with data values supplied for its attributes. As a simple example, Figure 9.8(a) includes object diagrams to model three employee instances along with their attribute values. The ORM model in Figure 9.8(b) displays the same sample population, using fact tables to list the fact instances.


Figure 9.8. Populated models in (a) UML and (b) ORM.

For simple cases like this, object diagrams are useful. However, they rapidly become unwieldy if we wish to display multiple instances for more complex cases. In contrast, fact tables scale easily to handle large and complex cases.

ORM constraints are easily clarified using sample populations. For example, in Figure 9.8(b) the absence of employee 101 in the Plays fact table clearly shows that playing sport is optional, and the uniqueness constraints mark out which column or column-combination values can occur on at most one row. In the EmployeeName fact table, the first column values are unique, but the second column includes duplicates. In the Plays table, each column contains duplicates: only the whole rows are unique. Such populations are very useful for checking constraints with the subject matter experts. This validation-via-example feature of ORM holds for all its constraints, not just mandatory roles and uniqueness, since all its constraints are role-based or type-based, and each role corresponds to a fact table column.

As a final example of multivalued attributes, suppose that we wish to record the nicknames and colors of country flags. Let us agree to record at most two nicknames for any given flag and that nicknames apply to only one flag. For example, “Old Glory” and perhaps “The Star-spangled Banner” might be used as nicknames for the United States flag. Flags have at least one color.

Figure 9.9(a) shows one way to model this in UML. The “[0..2]” indicates that each flag has at most two (from zero to two) nicknames. The “[1..*]” declares that a flag has one or more colors. An additional constraint is needed to ensure that each nickname refers to at most one flag. A simple attribute uniqueness constraint (e.g., {U1}) is not enough, since the nicknames attribute is set valued. Not only must each nicknames set be unique for each flag, but each element in each set must be unique (the latter condition implies the former). This more complex constraint is specified informally in an attached note.


Figure 9.9. A flag model in (a) UML and (b) ORM.

Here the attribute domains are hidden. Nickname elements would typically have a data type domain (e.g., String). If we don't store other information about countries or colors, we might choose String as the domain for country and color as well (although this is subconceptual, because real countries and colors are not character strings). However, since we might want to add information about these later, it's better to use classes for their domains (e.g., Country and Color). If we do this, we need to define the classes as well.

Figure 9.9(b) shows one way to model this in ORM. For verbalization we identify each flag by its country. Since country is an entity type, the reference scheme is shown explicitly (reference modes may abbreviate reference schemes only when the referencing type is a value type). The “≤ 2” frequency constraint indicates that each flag has at most two nicknames, and the uniqueness constraint on the role of NickName indicates that each nickname refers to at most one flag.
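A relational sketch of the flag example (table and column names assumed; the “at most two nicknames” frequency constraint and the “at least one color” mandatory role have no simple declarative equivalent here and are left as comments):

CREATE TABLE Flag (
  country VARCHAR(3) PRIMARY KEY             -- each flag is identified by its country
);

CREATE TABLE FlagNickName (
  nickname VARCHAR(50) PRIMARY KEY,          -- each nickname refers to at most one flag
  country  VARCHAR(3) NOT NULL REFERENCES Flag(country)
  -- "each flag has at most two nicknames" would need a trigger or assertion
);

CREATE TABLE FlagColor (
  country VARCHAR(3)  NOT NULL REFERENCES Flag(country),
  color   VARCHAR(20) NOT NULL,
  PRIMARY KEY (country, color)
  -- "each flag has at least one color" likewise needs procedural enforcement
);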

UML gives us the choice of modeling a feature as an attribute or an association. For conceptual analysis and querying, explicit associations usually have many advantages over attributes, especially multivalued attributes. Choosing associations helps us verbalize, visualize, and populate them. It also enables us to express various constraints involving the “role played by the attribute” in standard notation, rather than resorting to some nonstandard extension. This applies not only to simple uniqueness constraints (as discussed earlier) but also to other kinds of constraints (frequency, subset, exclusion, etc.) over one or more roles that include the role played by the attribute's domain (in the implicit association corresponding to the attribute).

For example, if the association Flag is of Country is depicted explicitly in UML, the constraint that each country has at most one flag can be captured by adding a multiplicity constraint of “0..1” on the left role of this association. Although country and color are naturally conceived as classes, nickname would normally be construed as a data type (e.g., a subtype of String). Although associations in UML may include data types (not just classes), this is somewhat awkward; so in UML, nicknames might best be left as a multivalued attribute. Of course, we could model it cleanly in ORM first.

Another reason for favoring associations over attributes is stability. If we ever want to talk about a relationship, it is possible in both ORM and UML to make an object out of it and simply attach the new details to it. If instead we modeled the feature as an attribute, we would need to first replace the attribute by an association. For example, consider the association Employee plays Sport in Figure 9.8(b). If we need to record a skill level for this play, we can simply objectify this association as Play, and attach the fact type: Play has SkillLevel. A similar move can be made in UML if the play feature has been modeled as an association. In Figure 9.8(a) however, this feature is modeled as the sports attribute, which needs to be replaced by the equivalent association before we can add the new details about skill level. The notion of objectified relationship types or association classes is covered in a later section.
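At the relational level the stability argument is easy to see: if the play facts already live in a table of their own (as in the illustrative Plays sketch above), recording a skill level is just an added, optional column, whereas a multivalued sports attribute would first have to be split out into such a table.

-- Attaching the new fact type "Play has SkillLevel" to the existing sketch.
ALTER TABLE Plays ADD COLUMN skillLevel INTEGER;  -- optional, hence nullable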

Another problem with multivalued attributes is that queries on them need some way to extract the components, which complicates the query process for users. As a trivial example, compare queries Q1, Q2 expressed in ConQuer (an ORM query language) with their counterparts in OQL (the Object Query Language proposed by the ODMG). Although this example is trivial, the use of multivalued attributes in more complex structures can make it harder for users to express their requirements.

(Q1)

List each Color that is of Flag ‘USA’.

(Q2)

List each Flag that has Color ‘red’.

(Q1a)

select x.colors from x in Flag where x.country = “USA”

(Q2a)

select x.country from x in Flag where “red” in x.colors
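For comparison, against the illustrative FlagColor table sketched earlier (an assumed relational mapping, not part of the chapter), the same two requests involve no collection handling at all:

-- (Q1b) Colors of the USA flag.
SELECT color FROM FlagColor WHERE country = 'USA';

-- (Q2b) Flags that include red.
SELECT country FROM FlagColor WHERE color = 'red';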

For such reasons, multivalued attributes should normally be avoided in analysis models, especially if the attributes are based on classes rather than data types. If we avoid multivalued attributes in our conceptual model, we can still use them in the actual implementation. Some UML and ORM tools allow schemas to be annotated with instructions to override the default actions of whatever mapper is used to transform the schema to an implementation. For example, the ORM schema in Figure 9.9 might be prepared for mapping by annotating the roles played by NickName and Color to map as sets inside the mapped Flag structure. Such annotations are not a conceptual issue, and can be postponed until mapping.


URL: https://www.sciencedirect.com/science/article/pii/B9780123735683500138

Database Development Process

Ming Wang, Russell K. Chan, in Encyclopedia of Information Systems, 2003

I.C.1.d. Rule for Each Multivalued Attribute in a Relation

Create a new relation and use the same name as the multivalued attribute. The primary key in the new relation is the combination of the multivalued attribute and the primary key in the parent entity type. For example, department location is a multivalued attribute associated with the Department entity type since one department has more than one location. Since multivalued attributes are not allowed in a relation, we have to split the department location into another table. The primary key is the combination of deptCode and deptLocation. The new relation dept-Location is

dept-Location (deptCode, deptLocation)
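A minimal DDL sketch of this rule, assuming a parent Department relation keyed on deptCode (column types are illustrative):

-- The new relation holds one row per (department, location) pair;
-- its primary key combines the parent key with the multivalued attribute.
CREATE TABLE deptLocation (
  deptCode     CHAR(4)      NOT NULL REFERENCES Department(deptCode),
  deptLocation VARCHAR(100) NOT NULL,
  PRIMARY KEY (deptCode, deptLocation)
);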


URL: https://www.sciencedirect.com/science/article/pii/B0122272404000265

The Relational Data Model

Jan L. Harrington, in Relational Database Design (Third Edition), 2009

Rows and Row Characteristics

A row in a relation has the following properties.

Only one value at the intersection of a column and row: A relation does not allow multivalued attributes.

Uniqueness: There are no duplicate rows in a relation.

A primary key: A primary key is a column or combination of columns with a value that uniquely identifies each row. As long as you have unique primary keys, you also have unique rows. We will look at the issue of what makes a good primary key in great depth in the next major section of this chapter.

There are no positional concepts: The rows can be viewed in any order without affecting the meaning of the data.

Note: For the most part, DBMSs do not enforce the unique row constraint automatically. However, as you will see shortly, there is another way to obtain the same effect.
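The “another way” the note alludes to is, in practice, the primary key itself: once one is declared, duplicate rows become impossible. A minimal sketch with assumed names:

CREATE TABLE Customer (
  customer_number INTEGER PRIMARY KEY,  -- uniquely identifies each row
  customer_name   VARCHAR(100),
  phone           VARCHAR(20)
);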


URL: https://www.sciencedirect.com/science/article/pii/B978012374730300005X

The Relational Data Model

Jan L. Harrington, in Relational Database Design and Implementation (Fourth Edition), 2016

Rows and Row Characteristics

In relational design theory, a row in a relation has the following properties:

Only one value at the intersection of a column and row: A relation does not allow multivalued attributes.

Uniqueness: There are no duplicate rows in a relation.

Note: for the most part, DBMSs do not enforce the unique row constraint automatically. However, as you will see in the next bullet, there is another way to obtain the same effect.

A primary key: A primary key is a column or combination of columns with a value that uniquely identifies each row. As long as you have unique primary keys, you will ensure that you also have unique rows. We will look at the issue of what makes a good primary key in great depth in the next major section of this chapter.

There are no positional concepts. The rows can be viewed in any order without affecting the meaning of the data.

Note: You can’t necessarily move both columns and rows around at the same time and maintain the integrity of a relation. When you change the order of the columns, the rows must remain in the same order; when you change the order of the rows, you must move each entire row as a unit.


URL: https://www.sciencedirect.com/science/article/pii/B9780128043998000053

Public Folder Interoperability and Migration

Kieran McCorry, in Microsoft® Exchange Server 2003 Deployment and Migration, 2004

5.11 Reintroducing Public Folder Affinity

With Exchange 5.5, there was no such lowest-cost transitive routing mechanism to determine where a client should be directed for specific Public Folder content. Instead, you explicitly defined a server for a particular Public Folder to which referrals would be directed. This Public Folder affinity capability was not present in Exchange 2000 but was re-introduced with Exchange 2003 to give administrators more flexibility for dealing with Public Folder referrals rather than relying on routing costs.

You can set Public Folder affinity costs on a server-by-server basis. For example, assume that I host specific Public Folder content on server OSBEX02 but not on my home mailbox server of OSBEX01. I can set the Public Folder Referrals property of the OSBEX01 server so that all Public Folder referrals are directed to OSBEX02. This is shown in Figure 5-6.


Figure 5-6. Setting Public Folder Affinity Characteristics with Exchange 2003

Little granularity can be implemented using this affinity mechanism. For instance, you cannot select specific affinity servers for specific Public Folders. Nor can you implement a fallback to using Public Folder referrals based on routing costs: it’s a one-or-the-other approach. However, you can define multiple affinity servers and associate a cost with each one, so that the lowest-cost affinity server is used for client referrals if it is available. If that server is not reachable, the server with the next lowest cost is selected.

Entering server information into the Public Folder Referrals property tab results in the msExchFolderAffinityCustom attribute being set to 1, and the values you enter for the affinity servers are held in the msExchFolderAffinityList multivalued attribute. You can review these settings using ADSI Edit or LDP; both are to be found as properties of the following object in the AD:

CN=Configuration Container/CN=Services/CN=Microsoft Exchange
/CN=<OrgName>/CN=Administrative Groups
/CN=<SiteName>/CN=Servers/CN=<ServerName>

where

<OrgName> is the name of your Exchange Organization,

<SiteName> is the name of your Exchange Site, and

<ServerName> is the name of your Exchange server.

From a deployment perspective, it’s obviously a small next step to populate these values programmatically using a technique such as CDOEXM.


URL: https://www.sciencedirect.com/science/article/pii/B9781555583163500075

Structured Search Solutions

Mikhail Gilula, in Structured Search for Big Data, 2016

7.3 Native KeySQL Systems

In this section, we consider some native KeySQL applications. The list is by no means comprehensive but is intended to illustrate the typical benefits that can be brought by the use of structured search technology in the form of native key-object data stores.

7.3.1 Healthcare Information Systems

We consider the healthcare applications not just because they are positioned to benefit from the use of the structured search technology and KeySQL, but also as a representative of a class of such applications, which have common issues with respect to their relational database implementations.

As background, let us mention that more than 45 years after the beginning of the relational era, there are still prerelational medical systems in use. This illustrates not just the conservative nature of the healthcare subject area, but also the likelihood that converting those systems to the relational platform did not look overwhelmingly advantageous.

For the sake of brevity, let us point to just two principal characteristics of the healthcare information systems as follows:

1.

The healthcare data objects tend to be relatively complex and variable in their structure and contain multiple groups of multivalued attributes. For example, a patient can have multiple diagnoses, each of which can require multiple medications, etc.

2.

There is an underlying design requirement of supporting the electronic exchange of the health records between the different systems.

Both characteristics support the idea that the key-object data model and KeySQL can be more appropriate than the relational model and SQL for use in healthcare applications.

In particular, the key-object model drastically reduces the number of related data records needed to represent a clinical case compared to the relational model. This simplifies and speeds up ad hoc querying of the related data and combining it into comprehensive information objects, particularly for data exchange purposes. The reverse process of inserting information from incoming electronic exchange messages into the receiving systems also becomes more straightforward and faster.

The natural compatibility of the key-object instance syntax with JSON-based data transport formats can bring additional advantages.

Data warehousing of healthcare information and subsequent analytical processing and reporting can also benefit from the use of the key-object data model and KeySQL. The supporting arguments are in line with those presented in Section 7.3.2, dedicated to data warehousing.

7.3.2 Big Data Warehousing

Data warehousing is a field of database applications that gained recognition and wide acceptance some 20 years after relational databases were invented. Since that time, data warehouses have become an important and valuable part of almost any IT organization.

Unlike the operational systems, which typically use a relatively small set of predefined data access paths, the data warehousing applications require the full-scale use of structured query languages, particularly SQL, which currently has little competition in this area.

An intrinsic part of data warehousing technology is the set of processes collectively known as extract, transform, and load (ETL), which are used to extract data from the operational systems and load it into the data warehouses for subsequent analytical processing.

The ETL procedures typically involve moving around large amounts of data, and are performance-hungry. This is especially true when the Big Data must be analyzed as fast as possible in order to extract information critical for tactical and strategic business insights.

NoSQL systems are successfully competing with SQL databases for use in operational systems. However, data warehousing still remains mostly an SQL domain, because SQL, and particularly ad hoc querying, is so far basically irreplaceable for business users.

That is why at least part of the data produced by NoSQL systems is eventually loaded into SQL data warehouses for analytical processing. At the same time, it is already clear that the performance of ETL procedures and SQL databases is becoming more and more inadequate for digesting Big Data.

The critical path of the Big Data warehousing is determined by the following main issues.

1.

The data from NoSQL operational systems need significant transformations in order to be loaded into multiple relational tables. This makes it difficult to fit the ETL processes into the batch windows, and leads to a fundamental inability to load all the data that may potentially be of value for business intelligence. In reality, the percentage of Big Data that can be loaded into SQL data warehouses in a timely and reliable manner is diminishing as the Big Data grows along the dimensions of the three V’s.

2.

The performance of even very big and expensive SQL databases puts limits on the ability to process ever-growing data volumes. The most problematic part of this processing is joining big tables. In Chapter 6, we already mentioned that joins are generally difficult to parallelize. Yet relational technology relies heavily on joins because of its inability to handle multiple data values and because of data normalization, which in turn is driven by the need to avoid update anomalies and excessive storage volumes.

The structured search technology based on the key-object data model and implemented in native KeySQL data stores is, on the one hand, compatible with the rich data objects of NoSQL operational systems and, on the other hand, provides a functional equivalent of SQL's querying capabilities. This makes it a better choice for Big Data warehousing than relational database technology.

The use of KeySQL stores would speed up the ETL processes because lossless data transformations from NoSQL models into the key-object model are generally much more straightforward. At the same time, the ad hoc querying capabilities of KeySQL are comparable with those of SQL, as essentially the entire SQL functionality has analogs in KeySQL. Performance-wise, KeySQL has the advantage of reducing the relative share of the joins that hamper the overall performance of SQL data warehousing solutions.

7.3.3 KeySQL on MapReduce Clusters

The key-object data model is more capacious and general than the relational one, and it is also more scalable. As mentioned in Chapter 6, though KeySQL supports analogs of the relational join operations, it eliminates the intrinsic necessity of joins caused by the flat table structure and the need for handling multiple values via joins. As a result, the share of join operations in KeySQL query processing is reduced relative to the relational model. At the same time, the share of restriction operations is increased. This is because, unlike the relational model, complex data objects with multiple values are native to KeySQL, so the restriction predicates are evaluated directly on the base key-object instances instead of first collecting their parts from multiple tables via joins. Minimizing the share of joins and maximizing the share of restrictions allow KeySQL systems to take better advantage of MPP shared-nothing architectures, since restrictions always scale linearly while joins generally do not.

Unlike the relational restriction, its key-object analog is a total operation. Its definition allows any key-object instance based on a given catalog as the argument, while the relational restriction is bound by the table schema. This facilitates associative access to key-object data and promotes scalability.

A general property of the key-object data model that makes it inherently more scalable than the relational one is called “additivity” and relates to the function of data accumulation. Suppose something is called “data.” Then, there must be an operation of adding or combining the data. The question is what is the result of adding data to data. The intuition says that the result must be data as well. In other words, if A is data, and B is data, then A + B (and B + A) must be data, where the plus sign “+” denotes the operation of data accumulation. Let us call the data model additive if the “+” operation has the following properties:

1.

Idempotence: A + A = A

2.

Associativity: A + (B + C) = (A + B) + C

3.

Commutativity: A + B = B + A

Note that the mentioned properties should be valid for any “data.” So, the “+” operation is total with respect to whatever we call data.

The data accumulation operation of the key-object model is the union operation on the data stores. Namely, the union of any two data stores (based on the same catalog) is a data store. Of course all other set operations on the data stores are total as well, and generally all operations on the data stores we have considered are total.

This is not the case for the relational model, where the union of two relations, like all set operations on relations, is partial. These operations are defined only for union-compatible relations, that is, relations having an equal number of attributes with compatible types. So, the relational model is only partially additive.
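The union-compatibility restriction is visible directly in SQL, where a UNION is legal only when its operands have the same number of columns with compatible types. A small illustration with assumed tables Customer and Supplier:

-- Legal: both operands yield one compatible column.
SELECT customer_name FROM Customer
UNION
SELECT supplier_name FROM Supplier;

-- Illegal (different numbers of columns, so the operands are not union-compatible):
-- SELECT customer_name, phone FROM Customer
-- UNION
-- SELECT supplier_name FROM Supplier;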

The properties of the key-object data model enable highly scalable implementations of the native KeySQL databases using predominantly or exclusively associative access to data. Those implementations can use computer clusters having, by orders of magnitude, more nodes than any contemporary SQL MPP systems.

Particularly, the MapReduce framework over the distributed file systems provides a natural foundation for the cluster KeySQL implementations. Figure 7.1 illustrates the architecture of such “stackable” structured search clusters integrated by the common namespaces of key-object catalogs, where each node can be a cluster of its own, receiving the queries and returning the responses.


Fig. 7.1. Structured search cluster.


URL: https://www.sciencedirect.com/science/article/pii/B9780128046319000078

Classification

Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

Other Attribute Selection Measures

This section on attribute selection measures was not intended to be exhaustive. We have shown three measures that are commonly used for building decision trees. These measures are not without their biases. Information gain, as we saw, is biased toward multivalued attributes. Although the gain ratio adjusts for this bias, it tends to prefer unbalanced splits in which one partition is much smaller than the others. The Gini index is biased toward multivalued attributes and has difficulty when the number of classes is large. It also tends to favor tests that result in equal-size partitions and purity in both partitions. Although biased, these measures give reasonably good results in practice.
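For reference, these are the standard definitions of the three measures, in the usual notation where p_i is the proportion of tuples in D belonging to class C_i, and attribute A splits D into partitions D_1, ..., D_v:

\[
\begin{aligned}
\mathit{Info}(D) &= -\sum_{i=1}^{m} p_i \log_2 p_i \\
\mathit{Gain}(A) &= \mathit{Info}(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathit{Info}(D_j) \\
\mathit{GainRatio}(A) &= \mathit{Gain}(A) \Big/ \left(-\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}\right) \\
\mathit{Gini}(D) &= 1 - \sum_{i=1}^{m} p_i^{2}
\end{aligned}
\]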

Many other attribute selection measures have been proposed. CHAID, a decision tree algorithm that is popular in marketing, uses an attribute selection measure that is based on the statistical χ2 test for independence. Other measures include C-SEP (which performs better than information gain and the Gini index in certain cases) and the G-statistic (an information-theoretic measure that is a close approximation to the χ2 distribution).

Attribute selection measures based on the Minimum Description Length (MDL) principle have the least bias toward multivalued attributes. MDL-based measures use encoding techniques to define the “best” decision tree as the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree (i.e., cases that are not correctly classified by the tree). Its main idea is that the simplest of solutions is preferred.

Other attribute selection measures consider multivariate splits (i.e., where the partitioning of tuples is based on a combination of attributes, rather than on a single attribute). The CART system, for example, can find multivariate splits based on a linear combination of attributes. Multivariate splits are a form of attribute (or feature) construction, where new attributes are created based on the existing ones. (Attribute construction was also discussed in Chapter 3, as a form of data transformation.) These other measures mentioned here are beyond the scope of this book. Additional references are given in the bibliographic notes at the end of this chapter (Section 8.9).

“Which attribute selection measure is the best?” All measures have some bias. It has been shown that the time complexity of decision tree induction generally increases exponentially with tree height. Hence, measures that tend to produce shallower trees (e.g., with multiway rather than binary splits, and that favor more balanced splits) may be preferred. However, some studies have found that shallow trees tend to have a large number of leaves and higher error rates. Despite several comparative studies, no one attribute selection measure has been found to be significantly superior to others. Most measures give quite good results.


URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000083

Entities and Relationships

Jan L. Harrington, in Relational Database Design and Implementation (Fourth Edition), 2016

Single-Valued Versus Multivalued Attributes

Because we are eventually going to create a relational database, the attributes in our data model must be single-valued. This means that for a given instance of an entity, each attribute can have only one value. For example, the customer entity shown in Figure 4.1 allows only one telephone number for each customer. If a customer has more than one phone number, and wants them all included in the database, then the customer entity cannot handle them.

Note: While it is true that the conceptual data model of a database is independent of the formal data model used to express the structure of the data to a DBMS, we often make decisions on how to model the data based on the requirements of the formal data model we will be using. Removing multivalued attributes is one such case. You will also see an example of this when we deal with many-to-many relationships between entities, later in this chapter.

The existence of more than one phone number turns the phone number attribute into a multivalued attribute. Because an entity in a relational database cannot have multivalued attributes, you must handle those attributes by creating an entity to hold them.

In the case of the multiple phone numbers, we could create a phone number entity. Each instance of the entity would include the customer number of the person to whom the phone number belonged, along with the telephone number. If a customer had three phone numbers, then there would be three instances of the phone number entity for the customer. The entity’s identifier would be the concatenation of the customer number and the telephone number.

Note: There is no way to avoid using the telephone number as part of the entity identifier in the telephone number entity. As you will come to understand as you read this book, in this particular case, there is no harm in using it in this way.

Note: Some people view a telephone number as made of three distinct pieces of data: an area code, an exchange, and a unique number. However, in common use, we generally consider a telephone number to be a single value.

What is the problem with multivalued attributes? Multivalued attributes can cause problems with the meaning of data in the database, significantly slow down searching, and place unnecessary restrictions on the amount of data that can be stored.

Assume, for example, that you have an Employee entity, with attributes for the name and birthdates of dependents. Each attribute is allowed to store multiple values, as in Figure 4.2, where each gray blob represents a single instance of the Employee entity. How will you associate the correct birthdate with the name of the dependent to which it applies? Will it be by the position of a value stored in the attribute (in other words, the first name is related to the first birthdate, and so on)? If so, how will you ensure that there is a birthdate for each name, and a name for each birthdate? How will you ensure that the order of the values is never mixed up?


Figure 4.2. Entity instances containing multivalued attributes.

When searching a multivalued attribute, a DBMS must search each value in the attribute, most likely scanning the contents of the attribute sequentially. A sequential search is the slowest type of search available.

In addition, how many values should a multivalued attribute be able to store? If you specify a maximum number, what will happen when you need to store more than the maximum number of values? For example, what if you allow room for 10 dependents in the Employee entity just discussed, and you encounter an employee with 11 dependents? Do you create another instance of the Employee entity for that person? Consider all the problems that doing so would create, particularly in terms of the unnecessary duplicated data.

Note: Although it is theoretically possible to write a DBMS that will store an unlimited number of values in an attribute, the implementation would be difficult, and searching much slower than if the maximum number of values were specified in the database design.

As a general rule, if you run across a multivalued attribute, this is a major hint that you need another entity. The only way to handle multiple values of the same attribute is to create an entity of which you can store multiple instances, one for each value of the attribute (for example, Figure 4.3). In the case of the Employee entity, we would need a Dependent entity that could be related to the Employee entity. There would be one instance of the Dependent entity related to an instance of the Employee entity, for each of an employee’s dependents. In this way, there is no limit to the number of an employee’s dependents. In addition, each instance of the Dependent entity would contain the name and birthdate of only one dependent, eliminating any confusion about which name was associated with which birthdate. Searching would also be faster, because the DBMS could use fast searching techniques on the individual Dependent entity instances, without resorting to the slow sequential search.


Figure 4.3. Using multiple instances of an entity to handle a multivalued attribute.
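A relational sketch of the Dependent entity described above, assuming an Employee table keyed on employee_number (names and types are illustrative):

-- One row per dependent; an employee with eleven dependents simply has eleven rows,
-- and each name is stored next to its own birthdate.
CREATE TABLE Dependent (
  employee_number     INTEGER      NOT NULL REFERENCES Employee(employee_number),
  dependent_name      VARCHAR(100) NOT NULL,
  dependent_birthdate DATE,
  PRIMARY KEY (employee_number, dependent_name)
);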


URL: https://www.sciencedirect.com/science/article/pii/B9780128043998000041

Data Modeling: Entity-Relationship Data Model

Salvatore T. March, in Encyclopedia of Information Systems, 2003

II.C. Attribute

Attributes name and specify the characteristics or descriptors of entities and relationships that must be maintained within an information system. Each instance of an entity or relationship has a value for each attribute ascribed to that entity or relationship. Chen defined an attribute as a function that maps from an entity or relationship instance into a set of values. The implication is that an attribute is single valued: each instance has exactly one value for each attribute. Some data modeling formalisms allow multivalued attributes; however, these are often difficult to conceptualize and implement. They will not be considered in this article.

Returning to the definition of an entity, the “common set of characteristics or descriptors” shared by all instances of an entity is the combination of its attributes and relationships. Hence an entity may be viewed as that collection of instances having the same set of attributes and participating in the same set of relationships. Of course, the context determines the set of attributes and relationships that are “of interest.” For example, within one context a Customer entity may be defined as the collection of instances having the attributes customer number, name, street address, city, state, zip code, and credit card number, independent of whether that instance is an individual person, a company, a local government, a federal agency, a charity, or a country. In a different context, where the type of organization determines how the customer is billed or even if it is legal to sell a specific product to that instance, these same instances may be organized into different entities and additional attributes may be defined for each.


URL: https://www.sciencedirect.com/science/article/pii/B0122272404000344

Answer: Data inconsistency occurs when the same attribute in related data files has different values. Data redundancy, by contrast, occurs when different divisions, functional areas, and groups in an organization independently collect the same piece of information.

Which of the following enables users to view the same data in different ways?

Multidimensional analysis enables users to view the same data in different ways using multiple dimensions.

Which of the following is a specialized language that programmers use to add and change data in the database?

A data manipulation language (DML), such as SQL, is the specialized language that programmers use to add and change data in a database (see Chapter 6).

When the same data is duplicated in numerous files of a database, what is this known as?

Data redundancy occurs when the same piece of data is stored in two or more separate places and is a common occurrence in many businesses.