A_Normalization

Database Normalization
Peter Komisar © Conestoga College version 1.1

reference: A Simple Guide to Five Normal Forms in Relational
Database Theory, William Kent Core MySQL, Leon Atkinson,
Prentice-Hall Publishing, Database Normalization, Wikipedia,
http://www.bkent.net/Doc/simple5.htm, Rules of Data Normalization
http://www.datamodel.org/NormalizationRules.html,
A Simple Database Normalization Primer,
http://www.databasedev.co.uk/normalization_primer.html,
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
'Normal Forms', http://www.anaesthetist.com/mnm/sql/normal.htm
Data Modeling (Week 2!), http://www.cs.luc.edu/~pld/courses/353/spr07/notes/2

Normalization

Normalization is a method or design process for organizing
relational databases that reduces redundancy and protects
data integrity. In the normalization article written by Mike Hillyer,
he describes the scenario where a database schema has
been inherited that suffers from 'Spreadsheet Syndrome'.
This is he states, a tendency for developers to lump together
every possible piece of information into as few a number of
tables as possible.

Historical Note

IN 1972, Edgar .F.Codd, better known as Ted Codd, the author
of the relational database model and SQL, published a follow
up to his original paper on relational algebra with a sequel,
entitled "Further Normalization of the Data Base Relational
Model". It it he described the first three commonly known
normalization forms.

Description

Normalization is a process used in designing databases
that removes duplicated data which can lead to anomalies
or inconsistent data results. An example of where problems
can occur is in the circumstance where the same data is
stored in several different places. A change to this data
may inadvertently not be applied to one of these records.
This leaves the data in an inconsistent state. This is often
referred to as a loss in 'data integrity'.

Normalization involves breaking down an unnormalized table
into two or more tables that may be recombined in a join to
convey the same information that was held in the original table.

Upside

The cost of maintaining duplicate information synchronized
can be prohibitive for all but the simplest databases.
Normalization reduces redundancy and thereby can reduce
database size.

Normalized tables may also make better use of indexing.

Downsides

Normalization, tends to increase the number of tables
that are used and subsequently the number of rows that
are referenced in the joins that are used. This may lead
to a reduction in performance.

To counter the effect of the performance losses due
to joins, a designer may actually violate normalization
rules in a process called 'denormalization'.

Where Used

Tables used by banks, using automated tellers and many
isolated transactions are typical examples of highly
normalized databases.

Types of Anomalies

Update Anomaly - An update anomaly is the condition
where multiple instances of a data item are not all
updated consistently. Perhaps four of five items are
updated, leaving the database in an inconsistent state.

// partial or incomplete updates

Insertion Anomaly - An insert anomaly describes the
case where a fact cannot be recorded. For instance
a contractor cannot have contact information entered
as that contractor has not been assigned a contract
yet.

Deletion Anomaly - deleting a set of rows removes
all records of data aspect. For instance, removing
all members that attended a function from one college
removes all mention of the fact that anyone from that
college attended the event.

Normalization Terminology

Types of Dependencies

Functional dependency - Attribute B is functionally
dependent on attribute A if, for each value of attribute
A, there is exactly one value of attribute B.

Example

For every product serial number there is a product.
Without a product there is no serial number!

Functional Dependence Notation

A → B // attribute B is functionally dependent on A

Partial Dependency - Where a data item only depends
on part of a key.

Example

A composite key is formed from (product, location)
but a field 'address' is only partially dependent on such
a key as it relates only to location.

Trivial functional dependency - A trivial functional dependency
describes the relationship between same attributes found in
both subsets and supersets.

Example

For every product there is a product!

Example

{store_ID, store_Address} → {store_Address}

Full functional dependency - A fully functional dependence
is one where an attribute or set of attributes are functionally
dependent on a set X but not any proper subset of X.

Example

{store_Address} has a functional dependency on
{store_ID, city} but not a full functional dependency,
as it is also dependent on {store_ID}.

// full functional dependence is where there is no other
// subset that the attribute depends on.

Another full functional dependence might be a store service
review where the store at city and address describes the key
on which the review is targeted. No subset of this key would
suffice to target the review correctly.

Transitive dependence - an indirect dependency where B
depends on A and C depends on B therefore C depends
on A. Transitive dependence, also referred to as hidden
dependences, are usually found between two non-key
values.

Example

A → B , B → C therefore A → C

Example

A transitive dependence example might be a chapter
title is dependent on the chapter number which is
dependent on a book. ChapterTitle is transitively
dependent on BookTitle.

BookTitle -> ChapterNo -> ChapterTitle

Join dependency - A table T is "a subject of a join dependency" if
T can always be recreated by joining multiple tables each having
a subset of the attributes of T.

// a table that is dependent on the results of a join on certain tables

Types of Keys

Superkey - A superkey is a combination of attributes that
form unique identifiers for every row of a table. Two distinct
rows will always have distinct superkeys.

Candidate key - A candidate key is a minimal version of
a superkey. It is a lowest common denominator which has
no proper subsets.

Primary key - A candidate key which is used by the DBMS
to identify rows of a table.

Alternate keys - candidate keys which might have served
as primary keys.

Every table has conceptually a set of superkeys. A subset
of the super keys are the candidate keys. One of the
candidate key may act as the primary key.

Example

Consider a table with columns Name, Phone and SSN.
Superkeys might be (Name, Phone), (Phone, SSN),
(Name,SSN) or all three, (Name, Phone, SSN). In this
group only SSN is likely going to be unique enough to
serve as a candidate key. If SSN is used as the row
identifier it becomes the primary key.

Non-prime attribute - a non-prime attribute is one that
is not part of a candidate key.

Normal Forms

Goal of Normalization

There is an mnemonic that is often quoted in normalization
articles which states the general goal of normalization, and
perhaps more specifically of "third normal form". The last
part of it is a humorous touch.

"Data should depend on the key, the whole key and nothing
but the key so help me Codd!"             - author?

You will see, that the more normalized a table of data gets,
the simpler and more directly data elements will rely on
single primary keys.

Database tables are transformed into higher and higher
states of called normal forms where they are less and
less vulnerable to anomalous behavior. The inventor
of SQL, Edgar Codd, described the first three levels
of normalization which are considered sufficient to
optimize a database.

First Normal Form
// well defined keys, fields have atomic values

First Normal is also referred to as 1NF. William Kent
states that first normal form as referring to the 'shape' of
a table and that "Under first normal form, all occurrences
of a record type must contain the same number of fields."

One implication of first normal is that we can't have a table
with variably sized rows.

Example // can't have variably sized rows

1 cat      furry       meow
2 dog      furry       barks
3 rooster feathered crows    AM/PM
4 fish    scales      bubbles

A typical example of a list that is not in first normal is as
follows. It points out a requirement of first normal that
attribute must be atomic or having a single value. In the
following example Bill has three rather than one atomic
value listed under function.

Example

Member Name     Function
101     Bob        president, member
102   Betty      member
103   Bill       vice-president,treasurer, member

First Normal is a form that makes minimal requirements
on a table. The first requirement of first normal is that the
table describes a single object. Each of the columns is
related some way to the object that the table describes.

In a general sense a second requirement of first normal form
states that a table should not contain any repeating groups.
That is, the table has no duplicate rows.)

A primary key must exist. This can be an auto-incrementing
key or a composite key made up of a number of columns.

A table with a unique key by definition, prevents duplicate
rows.

The Controversy Regarding Nullability in 1NF // just for reference

One reference states that all the columns should be not
NULLABLE. to be in 1NF. The following quote from the
Wikipedia article on Normalization summarizes the
problems associated with making 1NF nullable or not.

"Note that the restriction on nullable columns as a 1NF
requirement, as espoused by Chris Date, et. al., is
controversial. This particular requirement for 1NF is
a direct contradiction to Dr. Codd's vision of the
relational database, in which he stated that "null values"
must be supported in a fully relational DBMS in order
to represent "missing information and inapplicable
information in a systematic way, independent of data
type."

By redefining 1NF to exclude nullable columns in 1NF,
no level of normalization can ever be achieved unless
all nullable columns are completely eliminated from the
entire database. This is in line with Date's and Darwen's
vision of the perfect relational database, but can introduce
additional complexities in SQL databases to the point
of impracticality."                           - Database Wikipedia

A Case Scenario For Using Auto_incrementing Primary Keys

Using auto-incrementing primary keys are hard to beat.
Consider a database described in Core MySQL and
copied below.

Example // from Core MySQL

CREATE TABLE recording(
Artist VARCHAR(32),
Released DATE,
Title VARCHAR(32),
Format VARCHAR(32),
Label VARCHAR(32),
Genre VARCHAR(32),
Length INT(11),
Purchased DATE,
Cost DECIMAL(11,2)
);

Atkinson makes the case that a key made of of 'Artist'
and 'Release' date doesn't protect against a release
of more than one record by an artist in a day. Adding
'Title' might help however this would not protect against
the same recording released in different formats. As
the list of columns used in the primary key increases
it becomes more and more impractical, so a auto-
incrementing primary key makes a good addition.

Example With Primary Key Added // from Core MySQL

CREATE TABLE recording(
ID INT(11) NOT NULL AUTO_INCREMENT,
Artist VARCHAR(32),
Released DATE,
Title VARCHAR(32),
Format VARCHAR(32),
Label VARCHAR(32),
Genre VARCHAR(32),
Length INT(11),
Purchased DATE,
Cost DECIMAL(11,2)
);

// the MS practice would be to call the new column, recording_ID

This table is now a sample of first normal, but problems
remain.

Fields As Facts

William Kent, in his 1982 paper 'A Simple Guide
to Five Normal Forms in Relational Database Theory",
a paper that was reviewed by Ted Codd, describes
non-key attributes as providing 'facts' about a key
attribute.

Second Normal Form

In Kent's description, second normal form is violated
"when a non-key field is a fact about a subset of a key.
He describes a table with the following columns where
PART and WAREHOUSE serve as a composite
primary key.

In the following example QUANTITY is a 'fact' about
the combination of PART and WAREHOUSE. However
WAREHOUSE-ADDRESS is a fact about the
WAREHOUSE alone.

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

--------------------------------------------------
| PART |WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
==================================================

Some of the weakness in this design are as follows:

Design Weaknesses

the warehouse address will be highly repeated
If the address is changed, every record will need updating
data may be inconsistent with the address shown in different forms
without a part the warehouse address becomes absent

The satisfy second normal, the record is 'decomposed'
and replaced by two records.

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

------------------------------ --------------------------------
| PART |WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
============================== =================================

The Wikipedia reference describes 2NF more formally
in the following way.

First, "the table must be in 1NF. . . None of the non-prime
attributes of the table are functionally dependent on a part
(proper subset) of a candidate key; in other words, all
functional dependencies of non-prime attributes on candidate
keys are full functional dependencies."

The example shown in Core MySQL is interesting, first
in that it provides an example of two columns being
declared as a composite primary key which has not
been exampled yet.

Example

CREATE TABLE work(
Project INT(11) NOT NULL,
ProjectName VARCHAR(64),
ContractorSSN VARCHAR(16),
ContractorName VARCHAR(64),
ContractorWork VARCHAR(16),
ContractorRate DECIMAL(6,2),
Quantity DECIMAL(6,2),
PRIMARY KEY(Project, ContractorSSN)
);

One problem in the above table that arises is that
two contractors can be working on the same project.
The ProjectName would appear twice, an example
of redundancy which might happen a lot.

The solution is to put the table into second normal
form, which will remove duplication. The table is
decomposed into three sub tables. The primary
table is called work and includes the three
columns, Project, ContractorSSN and Quantity.
The primary key remains the same. ContractorSSN
and Project are now holders of foreign keys.

Broken Out 2nd Normal Architecture

Project --|--------O<-Work->O---------|--Contractor

Following are the tables representing the above
depiction.

Examples // from Core MySQL

CREATE TABLE work(
Project INT(11) NOT NULL,
Quantity DECIMAL(6,2),
PRIMARY KEY(Project, ContractorSSN),
FOREIGN KEY(Project) REFERENCES project(ID),
FOREIGN KEY(ContractorSSN) REFERENCES contractor(SSN)
);

CREATE TABLE project(
ID INT(11) NOT NULL AUTO_INCREMENT,
Name VARCHAR(64),
PRIMARY KEY(ID)
);

CREATE TABLE contractor(
SSN VARCHAR NOT NULL AUTO_INCREMENT,
Name VARCHAR(64),
WorkType VARCHAR(16),
Rate DECIMAL(6,2),
PRIMARY KEY(SSN)
);

Third Normal Form

Back to the William Kent paper, third normal form
takes the attack a little further, taking issue with any
occurrence of a non-key field being a fact about any
other non-key field. This is an effort to remove
transitive dependency.

In the following example from the William Kent paper,
EMPLOYEE is the key. LOCATION is a fact about
DEPARTMENT, assuming departments are located
in single places. It is indirectly then a fact about the
employee that is in that department.

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
====================================

Following are a summary of criticisms of this
relationship.

Design Weaknesses

Location is repeated in every employee record
a change in location results in every record being updated
data may become inconsistent, with differing locations
with no employees, the department's location is loss

The solution is to decompose the table into third normal
form by creating two new tables. Now every field "is either
part of the key or provides a (single-valued) fact about
exactly the whole key and nothing else".

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

------------------------ ------------------------
| EMPLOYEE |DEPARTMENT | |DEPARTMENT | LOCATION |
======================== ========================

While second normal gets rid of columns that depend
on the primary key, third normal eliminates columns
that depend on columns not part of the primary key.
This removes 'transitive dependence' which may lead
to unnecessary data duplication.

The way the Wikipedia reference state it, " The table
must be in 2NF. Every non-prime attribute of the table
must be non-transitively dependent on each candidate
key."

An Example From Core MySQL

In the contractor table shown earlier, the same work
might be sold for different rates, depending on the
status of the client.

Two more tables are split off from the contractor table
as is shown in the following example.

Example // from Core MySQL

CREATE TABLE contractor(
SSN VARCHAR(16) NOT NULL,
Name VARCHAR(64),
WorkType VARCHAR(16) NOT NULL,
PRIMARY KEY(SSN),
FOREIGN KEY(Worktype) REFERENCES worktype(ID)
);

CREATE TABLE worktype(
ID VARCHAR(16),
Rate DECIMAL(6,2),
PRIMARY KEY(ID)
);

Fourth Normal Form

Often, the Boyce-Codd Form is mentioned
next but in the early written, 1982 Kent paper,
fourth and fifth normal form are next mentioned
so we continue with these forms.

Fourth and Fifth Normal forms deal with multi-valued
facts which may correspond to many-to-one or
many-to-many relationships. Children of a employee
is an example of a many-to-one relationship. A
many-to-many relationship might be an employee
with many skill or a skill that is shared by many
employees.

In Kent's words, fourth Normal states that a record
type should not contain two or more independent,
multi-valued facts about an entity. In addition, the
record must satisfy third normal form.

The example given, is a record type that lists
an employee, his or her skills, and his or her
language. Here there are two independent,
multi-valued 'facts' about an entity. The fourth
normalization is to decompose this table into
two tables where each multi-valued fact is
treated separately.

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================

Fourth Normal Form of the Above Table

-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================

Problems Associated With Tables Where
Fourth Normal Form is Violated

The main problem associated with tables whose
rows have more than one multi-valued fields comes
with maintaining the table.

Maintenance Policies For Table Violating Fourth
Normal Form

Disjoint Format

Random Mix With Three Variations

Cross Product Form

updates to repeated field values may lead to inconsistencies
inserting an new skill may lead to:

looking for a record with a blank skill
inserting a new skill with a possible blank for language
or alternatively coupling the same skill with multiple languages

deleting a skill may involve

blanking out a skill field in more than one record
check that this doesn't leave two records

with same language and a blank skill

Fourth Normal Form helps to reduce these update problems.

Fifth Normal Form

Fifth Normal supplies a form that reduces redundancy in cases
where information can be reconstructed from smaller parts.

The Kent paper examples the following scenario. Consider a
row type composed of agents, companies and products.

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

-------------------------------
| AGENT | COMPANY | PRODUCT |
-------------------------------
| Smith | Ford    | car       |
| Smith | GM      | truck     |
-------------------------------

The above row type records which agents sell which
products for which company. A combination of all
three fields is needed to understand which combinations
are valid.

Now a rule is introduced. "If an agent sells a certain
product, and he or she represents a company making
that product, then he sells that product for that company".
This is represented in the following example.

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

-------------------------------
| AGENT | COMPANY | PRODUCT |
-------------------------------
| Smith | Ford    | car       |
| Smith | Ford    | truck     |
| Smith | GM    | car       |
| Smith | GM      | truck     |
| Jones | Ford    | car       |
-------------------------------

In this case, all true facts can be reconstituted in a
normalization involving three record types.

Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",

------------------- ---------------------- --------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
------------------- ---------------------- --------------------
| Smith | Ford    | | Ford     | car     | | Smith | car     |
| Smith | GM    | | Ford | truck   | | Smith | truck   |
| Jones | Ford    | | GM     | car     | | Jones | car     |
------------------- | GM    | truck   | --------------------
   ----------------------

These records are now in fifth normal. The addition
of a rule was important to distinguish fifth from fourth
normal. "In the absence of such a constraint, a record
in fourth normal form is always in fifth normal form."

Other Normal Forms

There is the Boyce-Codd Form, and a sixth and seventh
normal described but for our purposes, this is probably
normal enough!

Assignment

1 ) Working theoretically, create a table which
is not in first normal form, breaking first normal
requirements. Populate your table with five or
six rows of data.

2) Take the table from 1) or create a new table
and put it into first normal but not in second normal
form.

Your first normalization might look like the following.

Example

Member Name     Function
101     Bob        president
101     Bob        member
102   Betty      member
103   Bill       vice-president
103   Bill       treasurer
103   Bill       member

// Note technically all three columns serve
// as an implied composite key

3) Take the table from 2) or create a new table
that is in second normal form but not in third normal
form.

4) Take the table from 3) and create a set of tables
from it that are in third normal form.