Introduction
Database - Name itself indicates what it is. Database is a place/container where all the data is stored. But what is data? In a database, even a smallest piece of information becomes data. For example, Student is a data, course is a data, and Color is a data, height, weight, food everything is data. In short, all the living and non-living objects in this world is a data.
Why we need data? We need data so that we can perform various actions on them. Say, we do not have any database and we want to record what is the height and weight of a baby in a year. What we do is note it in a piece of paper every month. At the end of the certain period or year, we would check up if he/she is growing correctly. If some entry is wrong or irrelevant, we correct it or strike it off. Same is done using database. We would be storing all these information in the database. If we want to check the growth, we would be pulling the information from the database, if we need to change any information, we can update/delete them. But all the data will be at one place - Database.
What type of data is stored in database? In a database, we would be grouping only related data together and storing them under one group name called table. This helps in identifying which data is stored where and under what name. It reduces the time to search for a particular data in a whole database.
And for whom these datas are stored? We store only related data - related to one particular requirement / application. For example, Student database - it will have all the information of students ranging from his ID, Name, Date of birth, class, to grade, prizes who are studying in a particular College.
How do we determine which data is relevant to be put in a particular database? It all depends on what database we are developing, and what is the exact requirement/purpose of it. Say, we need to create College database. What could college database contain? First thing is we need to store college information like its name, address. Next comes courses offered in that college, Staffs and their details, students and their details. But do we store all these information under one table - College? Will database be quick in getting the data or updating? Certainly Not! It would become a chaos if everything is stored in a single table. Hence they introduce certain rules to manage the database - relational database management system (RDBMS). RDBMS is a program that guides us how to create and maintain a database. It tells us how to divide related information into different tables and inter-relate them so that we can select/insert/update/delete all the related data easily and efficiently.
A database is a collection of different set of related data i.e.; mainly it is collection of tables. In the college example above, let us start identifying groups of related data - College details, Courses, Students, Staffs - lists goes on. Below diagram illustrates how do we divide the college data and put it in different tables.
If we observe COURSE table, all the information that related to course are listed under COURSE table. Similarly, STUDENT table has all the student information like his ID, Name -first, middle and last, DOB, age, his course, semester etc. Each table has related information called attributes. In STUDENT table STUDENT_ID is an attribute, STUDENT_FIRST_NAME is an attribute and so on. All these attribute together gives meaningful information about the student, called as Student record. i.e.; in a table each attribute acts as a column and their values become a row. Each row gives meaningful information about that table and it's called as a record.
Now we have to decide what the related and relevant data for each table are. While designing a database, it is very important to design it correctly. Database design is like a foundation for a good database. A smallest mistake done at the design stage spoils the whole design, code and efficiency of the database. A poor database design creates unwanted data in a table called data redundancy. Because of redundancy, unnecessarily database size increases and it also increases the chance of incorrect data while insertion, deletion and updation - leading to anamolies. To understand it in a better way, consider the example of student table design.
A student has, say; his ID, Name, and Address, Subjects that he has opted for. The table structure for the same is designed as below.
By observing the STUDENT table design above, we identify that,
- Joseph has opted for two subjects - Mathematics and Physics. At the same time, his address is repeated for each of his subject, which is unnecessary data. It unnecessarily increases the table size and introduced a redundancy in the above table.
- Say, we have to update address of Joseph. In this case, we need to update all the entries of Joseph in the above table. Else his data will become inconsistent, leading to update anomaly.
- Imagine, student's address is present in more than one table. IF we want to update any one of the student's address, just updating Student table is not enough. We need to know which all table has address field which are related to Student. Not updating any one of the table will lead to incorrect data leading to update anamoly.
- Similarly, if any of the students drops out any of the subject and we need to delete his entry, entire entry of his would be deleted from above table. But he is still part of that college and whole of information is lost. But this should not happen. We should have his information stored in database. This is called leading to delete anomaly.
- Also, if we need to enter any new student detail, which has not yet opted for any subject, then we need to enter all of his information with subject as NULL. Also, if a particular field, say address of a student is present in more than one table, we need to insert the data in all the tables. But for some of the tables, these data would be irrelevant at that point of time. All these situations lead to insertion anomaly.
Hence to avoid all these redundancies and anomalies, set of guidelines are introduced which is called Normalization.
Normalization
Normalization is a set of rules/guidelines/technique that is used while designing a database. These rules help to remove all the anomalies and distribute the data among different related tables and query them efficiently and effectively. It removes all the duplication issues and incorrect data issues, helping to have a well designed database. Normalization is divided into following normal forms:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce-Codd Normal Form (3.5NF)
- Forth Normal Form (4NF)
- Fifth Normal Form (5NF)
Note: In real lifescenarios, we really do not apply all the normal forms to get the database design done. We usually apply till 3NF normal form. 4th and 5th normal forms are ignored most of the cases. It's all depends on the necessity!!!