Database Normalization

Introduction

When developing the schema of a relational database, one of the most important aspects of to be taken into accountis to ensure that duplication is minimized. There are two reasons:

reduce the amount of required storage for data
avoid unecessary data conflict that can be casued by the data duplication

Normalization in Database

database normalization is one of the technique that can ensure the above two points. The core idea of database normalization is to divide the table into smaller subtables and store pointers to the data rather replicating it. In order to illustrate this concept, we are going to use Spotify dataset

Take a look at the original data

df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
print(df.shape)
df.head()

	track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
0	6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.102	0	0.0653	0.518	122.036	194754
1	0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.00421	0.357	0.693	99.972	162600
2	1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.11	0.613	124.008	176616
3	75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.718	0.93	7	-3.778	1	0.102	0.0287	9.43e-06	0.204	0.277	121.956	169093
4	1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.65	0.833	1	-4.672	1	0.0359	0.0803	0	0.0833	0.725	123.976	189052

each row represents a track. However, Not all columns are dedicate to the attribute of each tracks. For example, playlist_id, playlist genre dedicates to playlist attributes. Track_album_name, track_albumn_release_date dedicate to album attributes. It shows that our dataset is not in the Thrid Normal form.

Types of database normalization

These are various database “normal” forms. Each normal forms has an importance which helps in optimizing the database to save storage and reduce redundancies

First Normal Form(1NF)

the first normal form simply says that each cell of a table should contain exactly one value. According to our original spotify dataset table, it appears to satisfy the first normal form.

Second Normal Form(2NF)

For a table to be in second normal form, the following 2 conditions are to be met:

table should be in first normal form
the primary key of the table should compose of exactly 1 column

a primary key is a set fo columns that uniquely identifies a row. Basically no 2 rows have the same primary keys. Let’s take a look at the dataset.

FIrst, we would like to separate the original dataframe into the following dataframe:

track_df: contain all attributes of specific track

#select track attributes 
track_df = df.iloc[:,0:4].drop_duplicates()
attribute_df = df.iloc[:,11:].drop_duplicates()
#concatenate all columns together
track_df = pd.concat([track_df,attribute_df],axis=1).drop_duplicates()
track_df.head()

	track_id	track_name	track_artist	track_popularity	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
0	6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	0.748	0.916	6	-2.634	1	0.0583	0.102	0	0.0653	0.518	122.036	194754
1	0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.00421	0.357	0.693	99.972	162600
2	1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.11	0.613	124.008	176616
3	75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	0.718	0.93	7	-3.778	1	0.102	0.0287	9.43e-06	0.204	0.277	121.956	169093
4	1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	0.65	0.833	1	-4.672	1	0.0359	0.0803	0	0.0833	0.725	123.976	189052

album_df: contain all attributes of specific album

#select album attributes
album_df = df.iloc[:,4:7].drop_duplicates()
album_df.head()

	track_album_id	track_album_name	track_album_release_date
0	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14
1	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13
2	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05
3	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19
4	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05

Playlist_df: contain all attributes of specific playlist

#select playlist attributes
playlist_df = df.iloc[:,7:11].drop_duplicates()
playlist_df.head()

	playlist_name	playlist_id	playlist_genre	playlist_subgenre
0	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop
70	Dance Pop	37i9dQZF1DWZQaaqNMbbXa	pop	dance pop
167	Dance Room	37i9dQZF1DX2ENAPP1Tyed	pop	dance pop
223	Cardio	37i9dQZF1DWSJHnPb1f0X3	pop	dance pop
272	Dance Pop Hits	37i9dQZF1DX6pH08wMhkaI	pop	dance pop

df_track_album: connect track_df and album_df

#select track_id and track_albumn_id to link the track_df and album_df
df_track_album = df.iloc[:,[0,4]].drop_duplicates()
df_track_album.head()

	track_id	track_album_id
0	6f807x0ima9a1j3VPbc7VN	2oCs0DGTsRO98Gh5ZSl2Cx
1	0r7CVbZTWZgbTCYdfa2P31	63rPSO264uRjW1X5E6cWv6
2	1z1Hg7Vb0AhHDiEmnDE79l	1HoSmj2eLcsrR0vE9gThr4
3	75FpbthrwQmzHlBJLuGdC7	1nqYsOef1yKKuGOVchbsk6
4	1e8PAfcKUYoKkxPhrHqw4x	7m7vv9wlQ4i0LFuJiE2zsQ

df_track_playlist: connect track_df and playlist_df

#select track_id and playlist_id to link the track_df and playlist_df
df_track_playlist = df.iloc[:,[0,8]].drop_duplicates()
print(df_track_playlist.shape)
df_track_playlist.head()

	track_id	playlist_id
0	6f807x0ima9a1j3VPbc7VN	37i9dQZF1DXcZDD7cfEKhW
1	0r7CVbZTWZgbTCYdfa2P31	37i9dQZF1DXcZDD7cfEKhW
2	1z1Hg7Vb0AhHDiEmnDE79l	37i9dQZF1DXcZDD7cfEKhW
3	75FpbthrwQmzHlBJLuGdC7	37i9dQZF1DXcZDD7cfEKhW
4	1e8PAfcKUYoKkxPhrHqw4x	37i9dQZF1DXcZDD7cfEKhW

Third Normal Form(3NF)

Before we try to understand the details of third normal form, we have to understand the concept of a functional dependency on a table

column A is said to be functionally dependent on column B if changing the value of A may require a change in the value of column B. The third normal form avoids this by breaking this table into separate tables.

In playlist_df, column playlist_subgenere is dependent on playlist_genre. Thus, we have to separate two dataframe. In addition, each playlist can only belong to a single subgenre. Hence, it is the best that we separate all of the columns into different tables.

df_playlist_name = playlist_df.iloc[:,0:2].drop_duplicates()
print(df_playlist_name.shape) 
df_playlist_name.head()

	playlist_name	playlist_id
0	Pop Remix	37i9dQZF1DXcZDD7cfEKhW
70	Dance Pop	37i9dQZF1DWZQaaqNMbbXa
167	Dance Room	37i9dQZF1DX2ENAPP1Tyed
223	Cardio	37i9dQZF1DWSJHnPb1f0X3
272	Dance Pop Hits	37i9dQZF1DX6pH08wMhkaI

#one track can only belong to one genre
df_playlist_genre = playlist_df.iloc[:,[1,2]].drop_duplicates(subset='playlist_id',keep='first')
print(df_playlist_genre.shape)
df_playlist_genre.head()

	playlist_id	playlist_genre
0	37i9dQZF1DXcZDD7cfEKhW	pop
70	37i9dQZF1DWZQaaqNMbbXa	pop
167	37i9dQZF1DX2ENAPP1Tyed	pop
223	37i9dQZF1DWSJHnPb1f0X3	pop
272	37i9dQZF1DX6pH08wMhkaI	pop

#separate into genre  vs subgenre
df_genre_sub = playlist_df.iloc[:,2:].drop_duplicates(subset='playlist_subgenre',keep='first')
print(df_genre_sub.shape)
df_genre_sub.head()

	playlist_genre	playlist_subgenre
0	pop	dance pop
1298	pop	post-teen pop
2427	pop	electropop
3835	pop	indie poptimism
5507	rap	hip hop

Denormalization

Now that we have all the tables separrated and deduplicated, it is appropriate that we combine all the subtables into a combined tables to make sure that we are able to convert to the original table format

final_df=(
    df_track_attribute.
    merge(df_popularity).
    merge(df_track_playlist).
    merge(df_track_info).
    merge(df_playlist_name,on='playlist_id').
    merge(df_playlist_genre,on='playlist_id').
    merge(df_track_album,on='track_id').
    merge(album_df,on='track_album_id')
    
)

	track_id	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms	track_popularity	playlist_id	track_name	track_artist	playlist_name	playlist_genre	track_album_id	track_album_name	track_album_release_date
0	6f807x0ima9a1j3VPbc7VN	0.748	0.916	6	-2.634	1	0.0583	0.102	0	0.0653	0.518	122.036	194754	66	37i9dQZF1DXcZDD7cfEKhW	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	Pop Remix	pop	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14
1	6f807x0ima9a1j3VPbc7VN	0.748	0.916	6	-2.634	1	0.0583	0.102	0	0.0653	0.518	122.036	194754	66	4aUEH3uhbofktrFkXOOaKj	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	Pop EDM Remixes	edm	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14
2	0r7CVbZTWZgbTCYdfa2P31	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.00421	0.357	0.693	99.972	162600	67	37i9dQZF1DXcZDD7cfEKhW	Memories - Dillon Francis Remix	Maroon 5	Pop Remix	pop	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13
3	0r7CVbZTWZgbTCYdfa2P31	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.00421	0.357	0.693	99.972	162600	67	5CTzufLc0f6MufjKYrIaoO	Memories - Dillon Francis Remix	Maroon 5	EDM House & Dance	edm	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13
4	1z1Hg7Vb0AhHDiEmnDE79l	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.11	0.613	124.008	176616	70	37i9dQZF1DXcZDD7cfEKhW	All the Time - Don Diablo Remix	Zara Larsson	Pop Remix	pop	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05

Wow, we have successfully convert the data table into our original form. It shows that our process works!

Summary

The various form of database normalization are useful while designing the schema of a database in such a way that there is no data replication which may possibly lead to inconsistancies. While designing the schema for application, we should always think about how cna we make use of these forms.