Relational Database Design: Comprehensive Guide
Relational database design is the cornerstone of effective database systems, focusing on organizing data efficiently while reducing redundancy and preserving data integrity. This article provides a thorough exploration of decomposition, normalization, functional dependencies, and keys, ensuring you have a complete understanding of relational database design principles.
Decomposition in Relational Database Design
Decomposition is the process of breaking a large relation (table) into smaller, meaningful relations to eliminate redundancy, improve consistency, and optimize performance. It is a critical aspect of normalization.
Types of Decomposition
-
Lossy Decomposition:
- A decomposition is lossy if the original table cannot be perfectly reconstructed by joining the decomposed relations.
- This happens when some data or relationships are lost during decomposition.
-
Example:
Consider the table:
EmployeeID | ProjectID | ProjectManager --------------------------------------- E1 | P1 | M1 E2 | P1 | M1
If this is decomposed into:
- Table 1: EmployeeID | ProjectID
- Table 2: ProjectID | ProjectManager
Rejoining these tables can lead to duplicate or inconsistent data, resulting in a lossy decomposition.
-
Lossless Decomposition:
- A decomposition is lossless if the original table can be perfectly reconstructed by joining the decomposed relations without losing any data or introducing inconsistencies.
- This is achieved when the decomposition preserves all functional dependencies or when key attributes are included in each decomposed relation.
Functional Dependency
A functional dependency (FD) describes a relationship between two attributes in a relation where the value of one attribute (or set of attributes) determines the value of another attribute (or set of attributes). It is a fundamental concept in relational database design and normalization.
Definition:
Let X and Y be sets of attributes in a relation R. A functional dependency X → Y means that for any two tuples (rows) in R, if the tuples agree on the values of X, they must also agree on the values of Y.
- X: Determinant (the attribute(s) on the left side).
- Y: Dependent (the attribute(s) on the right side).
Example:
Consider a table storing student information:
StudentID | Name | Major
----------------------------
S1 | Alice | CS
S2 | Bob | EE
S3 | Alice | CS
Here, StudentID → Name, Major
because the StudentID
uniquely determines both Name
and Major
.
Properties of Functional Dependencies:
- Reflexivity: If Y is a subset of X, then X → Y.
- Augmentation: If X → Y, then XZ → YZ (adding attributes to both sides preserves the dependency).
- Transitivity: If X → Y and Y → Z, then X → Z.
Keys in Relational Databases
Keys are essential for identifying records uniquely in a table and enforcing data integrity.
Types of Keys:
-
Superkey:
- A set of one or more attributes that can uniquely identify a tuple in a relation.
- Example: In a table with attributes
EmployeeID
andName
,{EmployeeID}
,{EmployeeID, Name}
are superkeys.
-
Candidate Key:
- A minimal superkey, meaning no proper subset of it is also a superkey.
- Example: If
{EmployeeID}
can uniquely identify a tuple, it is a candidate key.
-
Primary Key:
- A candidate key chosen by the database designer to uniquely identify tuples.
- Example:
EmployeeID
in anEmployee
table.
-
Foreign Key:
- An attribute (or set of attributes) in one table that references the primary key in another table, establishing a relationship between the tables.
- Example:
DepartmentID
in anEmployee
table referencingDepartmentID
in aDepartment
table.
-
Composite Key:
- A primary key composed of two or more attributes.
- Example:
(StudentID, CourseID)
in a table of student enrollments.
-
Unique Key:
- A key constraint ensuring that all values in a column (or combination of columns) are unique.
Normalization and Normal Forms
Normalization is the process of organizing attributes and relations to reduce redundancy and dependency, ensuring data integrity. This is achieved by progressively meeting the criteria of successive normal forms.
Normal Forms (Comprehensive Overview)
First Normal Form (1NF)
Definition:
A relation is said to be in First Normal Form (1NF) if it satisfies the following criteria:
- Atomicity: All attributes (columns) must contain atomic values. This means that the values in each column are indivisible and cannot be further broken down.
- Single-Valued Entries: Each column in a table should contain values of a single data type, and no column should have sets, lists, or arrays.
- Uniqueness of Rows: Each row must be unique, meaning the table should have a primary key to distinguish between rows.
- No Repeating Groups: The table should not have multiple columns for the same attribute (like Item1, Item2, etc.), nor should it have multiple values stored in a single cell.
Explanation:
- Atomic Values: Data in each cell must be in its simplest form. For example, instead of storing multiple items in one cell, each item should occupy its own row.
- Repeating Groups: This is where multiple columns or rows represent the same type of data, making the table non-compliant with 1NF.
- Primary Key: A primary key ensures that each row is uniquely identifiable, which is a fundamental requirement for relational databases.
Example:
Non-Compliant Table (Not in 1NF):
OrderID | Items
-------------------
1 | Pen, Notebook
2 | Pencil
- The
Items
column violates atomicity because it contains multiple values (e.g., “Pen, Notebook”). - There are repeating groups because items are stored in a single cell rather than separate rows.
Compliant Table (In 1NF):
OrderID | Item
---------------
1 | Pen
1 | Notebook
2 | Pencil
- Here, the
Items
column is broken down into atomic values, with each item in a separate row. - No cell contains multiple values, ensuring atomicity.
- The table has no repeating groups or arrays, making it compliant with 1NF.
Second Normal Form (2NF)
Definition:
A relation is in Second Normal Form (2NF) if:
- It is already in First Normal Form (1NF) (i.e., no multi-valued or repeating groups).
- Every non-prime attribute is fully functionally dependent on the entire primary key.
- Non-prime attribute: An attribute that is not part of any candidate key.
- Fully functionally dependent: A non-prime attribute must depend on the entire composite primary key and not just a part of it.
Explanation:
- A partial dependency occurs when a non-prime attribute depends on only a part of a composite primary key, rather than the whole key.
- 2NF eliminates partial dependencies by decomposing the relation into smaller relations, ensuring that non-prime attributes are dependent only on the entire primary key or another candidate key.
This step reduces redundancy caused by partial dependencies and organizes the data better.
Example:
Non-Compliant Table (Not in 2NF):
Consider a table storing student-course information:
StudentID | CourseID | Instructor | Department
----------------------------------------------
S1 | C1 | Dr. Smith | CS
S2 | C2 | Dr. Jones | EE
S1 | C2 | Dr. Jones | EE
-
Composite Primary Key:
(StudentID, CourseID)
. -
Partial Dependency:
-
Instructor
andDepartment
depend only onCourseID
and not on the entire primary key(StudentID, CourseID)
.
-
This violates 2NF because non-prime attributes (Instructor
and Department
) are partially dependent on the composite key.
Compliant Tables (In 2NF):
To remove the partial dependency, decompose the table into two relations:
- Student-Course Table:
StudentID | CourseID
--------------------
S1 | C1
S2 | C2
S1 | C2
- Course-Details Table:
CourseID | Instructor | Department
-----------------------------------
C1 | Dr. Smith | CS
C2 | Dr. Jones | EE
Third Normal Form (3NF)
Definition:
A relation is in Third Normal Form (3NF) if:
- It is in Second Normal Form (2NF) (i.e., no partial dependencies).
-
No transitive dependency exists, which means:
- No non-prime attribute is dependent on another non-prime attribute.
- A non-prime attribute should depend only on a candidate key, not through another non-prime attribute.
- Non-prime attribute: An attribute that is not part of any candidate key.
- Transitive dependency: A dependency where a non-prime attribute depends indirectly on a candidate key through another non-prime attribute.
Explanation:
In 3NF, we eliminate transitive dependencies to reduce redundancy and improve data consistency.
-
Transitive Dependency Example: If
A → B
andB → C
, thenA → C
is a transitive dependency. This meansC
indirectly depends onA
throughB
. - Such dependencies introduce redundancy, as changes to
B
could lead to anomalies when updatingC
.
Example:
Non-Compliant Table (Not in 3NF):
StudentID | Department | HOD
--------------------------------
S1 | CS | Dr. Lee
S2 | EE | Dr. Brown
S3 | CS | Dr. Lee
Candidate Key: StudentID
uniquely identifies each row.
-
Issue: The
HOD
attribute depends onDepartment
, not directly onStudentID
.-
StudentID → Department
(Direct dependency). -
Department → HOD
(Transitive dependency). - So,
StudentID → HOD
is a transitive dependency.
-
This structure leads to redundancy: if the HOD
for the CS
department changes, multiple rows need updating.
Compliant Tables (In 3NF):
To resolve the transitive dependency, decompose the table into two relations:
- Student-Department Table:
StudentID | Department
-----------------------
S1 | CS
S2 | EE
S3 | CS
- Department-HOD Table:
Department | HOD
-----------------
CS | Dr. Lee
EE | Dr. Brown
Boyce-Codd Normal Form (BCNF)
Definition:
A relation is in Boyce-Codd Normal Form (BCNF) if:
- It is in Third Normal Form (3NF) (i.e., no partial or transitive dependencies exist).
- Every determinant is a candidate key.
- Determinant: An attribute (or a set of attributes) on which another attribute is functionally dependent.
- Candidate Key: A minimal set of attributes that can uniquely identify each tuple in a relation.
Key Difference Between 3NF and BCNF:
- While 3NF allows some dependencies where a non-prime attribute is functionally dependent on a candidate key, BCNF eliminates any such anomalies by ensuring that every determinant is a candidate key.
Explanation:
BCNF is stricter than 3NF and addresses situations where a relation may satisfy 3NF but still have redundancy caused by dependencies that violate BCNF.
When BCNF is Needed:
- BCNF is necessary when a non-candidate key attribute determines part of a candidate key, leading to redundancy and anomalies.
Example:
Non-Compliant Table (Not in BCNF):
CourseID | Instructor | Room
----------------------------
C1 | Dr. Smith | R101
C1 | Dr. Smith | R102
C2 | Dr. Jones | R101
Functional Dependencies:
-
CourseID → Instructor
-
Instructor → Room
Candidate Key: CourseID
Issue:
- The determinant
Instructor
is not a candidate key but determines theRoom
. - This violates BCNF, as not all determinants are candidate keys.
Compliant Tables (In BCNF):
To achieve BCNF, decompose the table into two relations:
- Course-Instructor Table:
CourseID | Instructor
----------------------
C1 | Dr. Smith
C2 | Dr. Jones
- Instructor-Room Table:
Instructor | Room
------------------
Dr. Smith | R101
Dr. Smith | R102
Dr. Jones | R101
Fourth Normal Form (4NF)
Definition:
A relation is in Fourth Normal Form (4NF) if:
- It is in Boyce-Codd Normal Form (BCNF) (i.e., no partial, transitive, or other anomalies).
- It does not have any multi-valued dependencies.
- Multi-valued Dependency (MVD): A multi-valued dependency exists when one attribute in a table determines multiple independent sets of attributes. In other words, if a relation contains two or more independent multi-valued attributes that are not related to each other, it violates 4NF.
Explanation:
In 4NF, the primary goal is to eliminate multi-valued dependencies, which occur when a record contains two or more independent attributes that are not directly related but appear together due to their dependence on the same key.
- These types of dependencies lead to redundancy because multiple copies of the same information are repeated in rows.
- By decomposing the relation to remove MVDs, we eliminate redundancy and improve consistency in the database.
Key Concept:
- In 4NF, a relation should not have two or more multi-valued attributes that depend on a candidate key. Each multi-valued dependency must be eliminated by decomposing the table appropriately.
Example:
Non-Compliant Table (Not in 4NF):
Consider a table that stores information about students, the courses they take, and the clubs they are involved in:
StudentID | Course | Club
-------------------------------
S1 | Math | Chess
S1 | History | Drama
S2 | Math | Chess
S2 | Biology | Music
Candidate Key: StudentID
Multi-Valued Dependencies:
- A
StudentID
can determine both a set ofCourses
and a set ofClubs
, but these sets are independent of each other.-
StudentID → {Courses}
(Multi-valued dependency betweenStudentID
andCourses
) -
StudentID → {Clubs}
(Multi-valued dependency betweenStudentID
andClubs
)
-
The table violates 4NF because StudentID
determines both the courses and the clubs independently. This causes redundancy, as the same StudentID
is repeated multiple times with different combinations of courses and clubs.
Compliant Tables (In 4NF):
To make the table comply with 4NF, we must eliminate the multi-valued dependencies by decomposing it into two tables:
- Student-Course Table:
StudentID | Course
--------------------
S1 | Math
S1 | History
S2 | Math
S2 | Biology
- Student-Club Table:
StudentID | Club
-------------------
S1 | Chess
S1 | Drama
S2 | Chess
S2 | Music
Now, the two multi-valued dependencies are handled separately:
- The
Student-Course Table
stores the relationship between students and the courses they take. - The
Student-Club Table
stores the relationship between students and the clubs they are involved in.
Fifth Normal Form (5NF)
Definition:
A relation is in Fifth Normal Form (5NF), also known as Projection-Join Normal Form (PJNF), if:
- It is in Fourth Normal Form (4NF) (i.e., no multi-valued dependencies exist).
- It cannot be further decomposed without losing information, meaning that the relation does not contain any join dependency or lossless join decomposition.
- Join Dependency (JD): A join dependency occurs when a relation can be decomposed into two or more relations, but when they are joined back together, no information is lost. In other words, a join dependency exists when a relation can be divided into sub-relations, but the original relation can be reconstructed without losing any data.
Explanation:
5NF deals with join dependencies, and it ensures that the data is decomposed in such a way that all information can be reconstructed from its decomposed parts without any loss of data. A relation in 5NF is designed in such a way that all of its non-trivial join dependencies are implied by its candidate keys.
- Lossless Join Decomposition: When a relation is decomposed into smaller relations and then rejoined, the original relation can be fully reconstructed without any data loss. A relation is in 5NF if it cannot be further decomposed without causing a loss of information.
- Non-trivial Join Dependency: A join dependency is non-trivial if the join dependency is not trivially satisfied (i.e., not all attributes from the relation are present in the join dependency).
In simpler terms, 5NF is concerned with ensuring that there is no redundancy caused by improper decompositions. It guarantees that when a relation is decomposed and later joined back, all of the original data is still available without any loss or ambiguity.
Example:
Non-Compliant Table (Not in 5NF):
Consider a table that stores information about which suppliers supply which parts for different projects:
Supplier | Part | Project
------------------------------
S1 | P1 | ProjA
S1 | P2 | ProjB
S2 | P1 | ProjA
S2 | P2 | ProjC
Candidate Key: (Supplier, Part, Project)
Join Dependency:
The relation above has a join dependency because it can be decomposed into smaller relations without losing information. For example, the table can be decomposed into three sub-relations:
- Supplier-Part Table:
Supplier | Part
----------------
S1 | P1
S1 | P2
S2 | P1
S2 | P2
- Supplier-Project Table:
Supplier | Project
-------------------
S1 | ProjA
S1 | ProjB
S2 | ProjA
S2 | ProjC
- Part-Project Table:
Part | Project
-----------------
P1 | ProjA
P2 | ProjB
P1 | ProjA
P2 | ProjC
By decomposing the table into these smaller relations, we can still recreate the original table by performing a natural join on these three smaller relations.
However, because this decomposition is possible, it violates 5NF. The reason it violates 5NF is because the information about which supplier supplies which part for a given project is redundantly stored across multiple rows. We are storing the same facts multiple times, which is unnecessary and could lead to inconsistencies.
Compliant Table (In 5NF):
To achieve 5NF, we decompose the table so that the relation cannot be decomposed further without losing information:
- Supplier-Part-Project Table:
Supplier | Part | Project
------------------------------
S1 | P1 | ProjA
S1 | P2 | ProjB
S2 | P1 | ProjA
S2 | P2 | ProjC
In this form, the relation is now in 5NF because it cannot be decomposed further without losing data. This table represents the same information as the original but in a more normalized form where each attribute is fully dependent on the candidate key, and no redundancy exists due to improper decomposition.
Key Concepts in Relational Design
- Multi-Valued Dependency: When one attribute determines multiple independent values.
- Join Dependency: Ensures no spurious tuples are created during joins.
- Dependency Preservation: Ensures all functional dependencies are preserved after decomposition.
This comprehensive guide equips you to master relational database design, ensuring efficient, consistent, and anomaly-free database systems.
Source link
lol