Database
Normalization
Peter
Komisar
©
Conestoga College
version 1.1
reference: A Simple
Guide to Five Normal Forms in Relational
Database Theory, William Kent Core
MySQL, Leon Atkinson,
Prentice-Hall Publishing, Database
Normalization, Wikipedia,
http://www.bkent.net/Doc/simple5.htm, Rules of Data Normalization
http://www.datamodel.org/NormalizationRules.html,
A Simple Database Normalization Primer,
http://www.databasedev.co.uk/normalization_primer.html,
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
'Normal Forms', http://www.anaesthetist.com/mnm/sql/normal.htm
Data Modeling (Week 2!),
http://www.cs.luc.edu/~pld/courses/353/spr07/notes/2
Normalization
Normalization is a method or design process for organizing
relational databases that reduces redundancy and protects
data integrity. In the normalization article written by Mike
Hillyer,
he describes the scenario where a database schema has
been inherited that suffers from 'Spreadsheet Syndrome'.
This is he states, a tendency for developers to lump together
every possible piece of information into as few a number of
tables as possible.
Historical
Note
IN 1972, Edgar .F.Codd, better known as Ted Codd, the author
of the relational database model and SQL, published a follow
up to his original paper on relational algebra with a sequel,
entitled "Further Normalization of the Data Base Relational
Model". It it he described the first three commonly known
normalization forms.
Description
Normalization is a process used in designing databases
that removes duplicated data which can lead to anomalies
or inconsistent data results. An example of where problems
can occur is in the circumstance where the same data is
stored in several different places. A change to this data
may inadvertently not be applied to one of these records.
This leaves the data in an inconsistent state. This is often
referred to as a loss in 'data integrity'.
Normalization involves breaking down an unnormalized table
into two or more tables that may be recombined in a join to
convey the same information that was held in the original table.
Upside
The cost of maintaining duplicate information synchronized
can be prohibitive for all but the simplest databases.
Normalization reduces redundancy and thereby can reduce
database size.
Normalized tables may also make better use of indexing.
Downsides
Normalization, tends to increase the number of tables
that are used and subsequently the number of rows that
are referenced in the joins that are used. This may lead
to a reduction in performance.
To counter the effect of the performance losses due
to joins, a designer may actually violate normalization
rules in a process called 'denormalization'.
Where Used
Tables used by banks, using automated tellers and many
isolated transactions are typical examples of highly
normalized databases.
Types
of Anomalies
Update Anomaly
- An update anomaly is the condition
where multiple instances of a data item are not all
updated consistently. Perhaps four of five items are
updated, leaving the database in an inconsistent state.
//
partial or incomplete updates
Insertion Anomaly - An insert anomaly
describes the
case where a fact cannot be recorded. For instance
a contractor cannot have contact information entered
as that contractor has not been assigned a contract
yet.
Deletion Anomaly
- deleting a set of rows removes
all records of data aspect. For instance, removing
all members that attended a function from one college
removes all mention of the fact that anyone from that
college attended the event.
Normalization
Terminology
Types
of Dependencies
Functional
dependency - Attribute B is functionally
dependent on
attribute A if, for each value of attribute
A, there is
exactly one value of attribute B.
Example
For every product serial number there is a product.
Without a product there is no serial number!
Functional Dependence Notation
A → B // attribute B is functionally dependent
on A
Partial
Dependency - Where a data item only depends
on part of a key.
Example
A composite key is formed from (product, location)
but a field 'address' is only partially dependent on such
a key as it relates only to location.
Trivial
functional dependency - A trivial functional dependency
describes the relationship between same attributes found in
both subsets and supersets.
Example
For every product there is a product!
Example
{store_ID, store_Address} → {store_Address}
Full functional
dependency - A fully functional dependence
is one where an attribute or set of attributes are functionally
dependent on a set X but not any proper subset of X.
Example
{store_Address} has a functional dependency on
{store_ID, city} but not a full functional dependency,
as it is also dependent on
{store_ID}.
// full
functional dependence is where there is no other
// subset that the attribute depends on.
Another full functional dependence might be a store service
review where the store at city and address describes the key
on which the review is targeted. No subset of this key would
suffice to target the review correctly.
Transitive
dependence - an indirect dependency where B
depends on A and C depends on B therefore C depends
on A. Transitive dependence, also referred to as hidden
dependences, are usually found between two non-key
values.
Example
A → B , B → C therefore A → C
Example
A transitive dependence example might be a chapter
title is dependent on the chapter number which is
dependent on a book. ChapterTitle is transitively
dependent on BookTitle.
BookTitle -> ChapterNo -> ChapterTitle
Join
dependency - A table T is "a subject of a join dependency" if
T can
always be recreated by joining multiple tables each having
a subset of
the attributes of T.
//
a table that is dependent on the results of a join on certain tables
Types of Keys
Superkey - A
superkey is a combination of attributes that
form unique identifiers for every row of a table. Two distinct
rows will always have distinct superkeys.
Candidate key
- A candidate key is a minimal version of
a superkey. It is a lowest common denominator which has
no proper subsets.
Primary key
- A candidate key which is used by the DBMS
to identify rows of a table.
Alternate keys
- candidate keys which might have served
as primary keys.
Every table has conceptually a set of superkeys. A subset
of the super keys are the candidate keys. One of the
candidate key may act as the primary key.
Example
Consider a table with columns Name, Phone
and SSN.
Superkeys might be (Name, Phone), (Phone, SSN),
(Name,SSN) or all three, (Name, Phone, SSN). In this
group only SSN is likely going to be unique enough to
serve as a candidate key. If SSN is used as the row
identifier it becomes the primary key.
Non-prime attribute
- a non-prime attribute is one that
is not part of a candidate key.
Normal Forms
Goal
of Normalization
There is an mnemonic that is often quoted in normalization
articles which states the general goal of normalization, and
perhaps more specifically of "third
normal form". The last
part of it is a humorous touch.
"Data should depend on the key, the whole key and nothing
but the key so help me
Codd!"
- author?
You will see, that the more normalized a table of data gets,
the simpler and more directly data elements will rely on
single primary keys.
Database tables are transformed into higher and higher
states of called normal forms where they are less and
less vulnerable to anomalous behavior. The inventor
of SQL, Edgar Codd, described the first three levels
of normalization which are considered sufficient to
optimize a database.
First
Normal Form
// well defined keys, fields have atomic values
First Normal is also referred to as
1NF. William Kent
states that first normal form as referring to the 'shape' of
a table and that "Under first normal form, all occurrences
of a record type must contain the same number of fields."
One implication of first normal is that we can't have a table
with variably sized rows.
Example //
can't have variably sized rows
1
cat
furry meow
2
dog
furry barks
3 rooster feathered
crows AM/PM
4 fish scales bubbles
A typical
example of a list that is not in first normal is
as
follows. It points out a requirement of first normal that
attribute must be atomic or having a single value. In the
following example Bill has three rather than one atomic
value listed under function.
Example
Member
Name Function
101
Bob president, member
102 Betty member
103 Bill
vice-president,treasurer, member
First Normal is a form that makes minimal requirements
on a table. The first requirement of first normal is that the
table describes a single object. Each of the columns is
related some way to the object that the table describes.
In a general sense a second requirement of first normal form
states that a table should not contain any repeating groups.
That is, the table has no duplicate rows.)
A primary key must exist. This can be an auto-incrementing
key or a composite key made up of a number of columns.
A table with a unique key by definition, prevents
duplicate
rows.
The
Controversy Regarding Nullability in 1NF // just for
reference
One reference states that all the columns should be not
NULLABLE. to be in 1NF. The following quote from the
Wikipedia article on Normalization summarizes the
problems associated with making 1NF nullable or not.
"Note that the restriction on nullable columns as a 1NF
requirement, as
espoused by Chris Date, et. al., is
controversial. This particular
requirement for 1NF is
a direct contradiction to Dr. Codd's vision of
the
relational database, in which he stated that "null values"
must be
supported in a fully relational DBMS in order
to represent "missing
information and inapplicable
information in a systematic way,
independent of data
type."
By redefining 1NF to exclude nullable
columns in 1NF,
no level of normalization can ever be achieved unless
all nullable columns are completely eliminated from the
entire
database. This is in line with Date's and Darwen's
vision of the
perfect relational database, but can introduce
additional complexities
in SQL databases to the point
of
impracticality."
- Database Wikipedia
A Case Scenario For Using
Auto_incrementing Primary Keys
Using auto-incrementing primary keys are hard to beat.
Consider a database described in Core MySQL and
copied below.
Example // from Core MySQL
CREATE TABLE recording(
Artist VARCHAR(32),
Released DATE,
Title VARCHAR(32),
Format VARCHAR(32),
Label VARCHAR(32),
Genre VARCHAR(32),
Length INT(11),
Purchased DATE,
Cost DECIMAL(11,2)
);
Atkinson makes the case that a key made of of 'Artist'
and 'Release' date doesn't protect against a release
of more than one record by an artist in a day. Adding
'Title' might help however this would not protect against
the same recording released in different formats. As
the list of columns used in the primary key increases
it becomes more and more impractical, so a auto-
incrementing primary key makes a good addition.
Example With Primary Key Added // from Core MySQL
CREATE TABLE recording(
ID INT(11) NOT NULL AUTO_INCREMENT,
Artist VARCHAR(32),
Released DATE,
Title VARCHAR(32),
Format VARCHAR(32),
Label VARCHAR(32),
Genre VARCHAR(32),
Length INT(11),
Purchased DATE,
Cost DECIMAL(11,2)
);
// the MS
practice would be to call the new column, recording_ID
This table is now a sample of first normal, but problems
remain.
Fields
As Facts
William Kent, in his 1982
paper 'A Simple Guide
to Five Normal Forms in Relational Database Theory",
a paper that was reviewed by Ted Codd, describes
non-key attributes as providing 'facts' about a key
attribute.
Second
Normal Form
In Kent's description, second normal form is violated
"when a non-key field is a fact about a subset of a key.
He describes a table with the following columns where
PART and WAREHOUSE serve as a composite
primary key.
In the following example QUANTITY is a 'fact' about
the combination of PART and WAREHOUSE. However
WAREHOUSE-ADDRESS is a fact about the
WAREHOUSE alone.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
--------------------------------------------------
| PART |WAREHOUSE
| QUANTITY | WAREHOUSE-ADDRESS |
==================================================
Some of the weakness in this design are as follows:
Design Weaknesses
- the warehouse address will be highly repeated
- If the address is changed, every record will need updating
- data may be inconsistent with the address shown in different forms
- without a part the warehouse address becomes absent
The satisfy second normal, the record is 'decomposed'
and replaced by two records.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
------------------------------
--------------------------------
| PART |WAREHOUSE
| QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
============================== =================================
The Wikipedia reference describes 2NF more formally
in the following way.
First, "the table must be in 1NF. . . None of the non-prime
attributes of the table are functionally
dependent on a part
(proper subset) of a candidate key; in other words,
all
functional dependencies of non-prime attributes on candidate
keys
are full functional dependencies."
The example shown in Core MySQL is interesting, first
in that it provides an example of two columns being
declared as a composite primary key which has not
been exampled yet.
Example
CREATE TABLE work(
Project INT(11) NOT NULL,
ProjectName VARCHAR(64),
ContractorSSN VARCHAR(16),
ContractorName VARCHAR(64),
ContractorWork VARCHAR(16),
ContractorRate DECIMAL(6,2),
Quantity DECIMAL(6,2),
PRIMARY KEY(Project, ContractorSSN)
);
One problem in the above table that arises is that
two contractors can be working on the same project.
The ProjectName would appear twice, an example
of redundancy which might happen a lot.
The solution is to put the table into second normal
form, which will remove duplication. The table is
decomposed into three sub tables. The primary
table is called work and includes the three
columns, Project, ContractorSSN and Quantity.
The primary key remains the same. ContractorSSN
and Project are now holders of foreign keys.
Broken Out 2nd Normal
Architecture
Project --|--------O<-Work->O---------|--Contractor
Following are the tables representing the above
depiction.
Examples //
from Core MySQL
CREATE TABLE work(
Project INT(11) NOT NULL,
Quantity DECIMAL(6,2),
PRIMARY KEY(Project, ContractorSSN),
FOREIGN KEY(Project) REFERENCES project(ID),
FOREIGN KEY(ContractorSSN) REFERENCES contractor(SSN)
);
CREATE TABLE project(
ID INT(11) NOT NULL AUTO_INCREMENT,
Name VARCHAR(64),
PRIMARY KEY(ID)
);
CREATE TABLE contractor(
SSN VARCHAR NOT NULL AUTO_INCREMENT,
Name VARCHAR(64),
WorkType VARCHAR(16),
Rate DECIMAL(6,2),
PRIMARY KEY(SSN)
);
Third
Normal Form
Back to the William Kent
paper, third normal form
takes the attack a little further, taking issue with any
occurrence of a non-key field being a fact about any
other non-key field. This is an effort to remove
transitive dependency.
In the following example from the William Kent paper,
EMPLOYEE is the key. LOCATION is a fact about
DEPARTMENT, assuming departments are located
in single places. It is indirectly then a fact about the
employee that is in that department.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
------------------------------------
| EMPLOYEE |
DEPARTMENT | LOCATION |
====================================
Following
are a summary of criticisms of this
relationship.
Design
Weaknesses
- Location is repeated in every employee record
- a change in location results in every record being updated
- data may become inconsistent, with differing locations
- with no employees, the department's location is loss
The solution is to decompose the table into third normal
form by creating two new tables. Now every field "is either
part of the key or provides a (single-valued) fact about
exactly the whole key and nothing else".
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
------------------------
------------------------
| EMPLOYEE |DEPARTMENT
| |DEPARTMENT |
LOCATION |
======================== ========================
While second normal gets rid of columns that depend
on the primary key, third normal eliminates columns
that depend on columns not part of the primary key.
This removes 'transitive dependence' which may lead
to unnecessary data duplication.
The way the Wikipedia reference state it, " The table
must be in 2NF. Every non-prime attribute of the table
must be
non-transitively dependent on each candidate
key."
An Example From Core MySQL
In the contractor table shown earlier, the same work
might be sold for different rates, depending on the
status of the client.
Two more tables are split off from the contractor table
as is shown in the following example.
Example //
from Core MySQL
CREATE TABLE contractor(
SSN VARCHAR(16) NOT NULL,
Name VARCHAR(64),
WorkType VARCHAR(16) NOT NULL,
PRIMARY KEY(SSN),
FOREIGN KEY(Worktype) REFERENCES worktype(ID)
);
CREATE TABLE worktype(
ID VARCHAR(16),
Rate DECIMAL(6,2),
PRIMARY KEY(ID)
);
Fourth
Normal Form
Often, the Boyce-Codd Form is mentioned
next but in the early written, 1982 Kent paper,
fourth and fifth normal form are next mentioned
so we continue with these forms.
Fourth and Fifth Normal forms deal with multi-valued
facts which may correspond to many-to-one or
many-to-many relationships. Children of a employee
is an example of a many-to-one relationship. A
many-to-many relationship might be an employee
with many skill or a skill that is shared by many
employees.
In Kent's words, fourth Normal states that a record
type should not contain two or more independent,
multi-valued facts about an entity. In addition, the
record must satisfy third normal form.
The example given, is a record type that lists
an employee, his or her skills, and his or her
language. Here there are two independent,
multi-valued 'facts' about an entity. The fourth
normalization is to decompose this table into
two tables where each multi-valued fact is
treated separately.
Example
// from 'A
Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| EMPLOYEE |
SKILL | LANGUAGE |
===============================
Fourth Normal Form of the Above Table
--------------------
-----------------------
| EMPLOYEE |
SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
Problems Associated With Tables Where
Fourth Normal Form is Violated
The main problem associated with tables whose
rows have more than one multi-valued fields comes
with maintaining the table.
Maintenance Policies For Table
Violating Fourth
Normal Form
Records contain Skill or Language but not both.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| EMPLOYEE |
SKILL | LANGUAGE |
-------------------------------
| Smith | cook
| |
| Smith | type
| |
| Smith | | French |
| Smith | | German |
| Smith | | Greek |
-------------------------------
- Random Mix
With Three Variations
a ) Minimal Number of Records With Repetitions
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| EMPLOYEE |
SKILL | LANGUAGE |
-------------------------------
| Smith | cook | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
b ) Minimal Number of Records With Null Values
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| EMPLOYEE |
SKILL | LANGUAGE |
-------------------------------
| Smith | cook | French |
| Smith | type | German |
| Smith | | Greek |
-------------------------------
c ) Unrestricted
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| EMPLOYEE |
SKILL | LANGUAGE |
-------------------------------
| Smith | cook | French |
| Smith | type
| |
| Smith | | German |
| Smith | type | Greek |
-------------------------------
For each employee there must be a record
for every pairing for one of his or her skills with
one of his or her languages. You will recognize
this is similar to the CROSS JOIN.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| EMPLOYEE |
SKILL | LANGUAGE |
-------------------------------
| Smith | cook | French |
| Smith | cook | German |
| Smith | cook | Greek |
|
Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Other Problems with Violations of the
Fourth Normal Form
- updates to repeated field values may lead to inconsistencies
- inserting an new skill may lead to:
- looking for a record with a blank skill
- inserting a new skill with a possible blank for language
- or alternatively coupling the same skill with multiple languages
- deleting a skill may involve
- blanking out a skill field in more than one record
- check that this doesn't leave two records
- with same language and a blank skill
Fourth Normal Form helps to reduce these update problems.
Fifth Normal Form
Fifth Normal supplies a form
that reduces redundancy in cases
where information can be reconstructed from smaller parts.
The Kent paper examples the following scenario. Consider a
row type composed of agents, companies and products.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| AGENT | COMPANY
| PRODUCT |
-------------------------------
| Smith | Ford |
car |
| Smith | GM |
truck |
-------------------------------
The above row type records which agents sell which
products for which company. A combination of all
three fields is needed to understand which combinations
are valid.
Now a rule is introduced. "If an agent sells a certain
product, and he or she represents a company making
that product, then he sells that product for that company".
This is represented in the following example.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------------------
| AGENT | COMPANY
| PRODUCT |
-------------------------------
| Smith | Ford |
car |
| Smith | Ford | truck |
| Smith | GM
| car |
| Smith | GM |
truck |
| Jones |
Ford | car |
-------------------------------
In
this case, all true facts can be reconstituted in a
normalization involving three record types.
Example
// from 'A Simple Guide to Five Normal
// Forms in Relational Database Theory",
-------------------
---------------------- --------------------
| AGENT | COMPANY
| | COMPANY | PRODUCT | | AGENT | PRODUCT |
------------------- ----------------------
--------------------
| Smith | Ford | | Ford
| car | | Smith |
car |
| Smith | GM | | Ford |
truck | | Smith | truck |
| Jones | Ford
| | GM |
car | | Jones |
car |
------------------- | GM |
truck | --------------------
----------------------
These records are now in fifth normal. The addition
of a rule was important to distinguish fifth from fourth
normal. "In the absence of such a constraint, a record
in fourth normal form is always in fifth normal form."
Other Normal Forms
There is the Boyce-Codd Form,
and a sixth and seventh
normal described but for our purposes, this is probably
normal enough!
Assignment
1 ) Working theoretically, create a table which
is not in first normal form, breaking first normal
requirements. Populate your table with five or
six rows of data.
2) Take the table from 1) or create a new table
and put it into first normal but not in second normal
form.
Your first normalization might look like the following.
Example
Member
Name Function
101
Bob president
101
Bob member
102
Betty member
103 Bill
vice-president
103
Bill treasurer
103
Bill member
// Note technically all three columns serve
//
as an implied composite key
3) Take the table from 2) or create a new table
that is in second normal form but not in third normal
form.
4) Take the table from 3) and create a set of tables
from it that are in third normal form.