Database Normalization

Introduction

When developing the schema of a relational database, one of the most important aspects of to be taken into accountis to ensure that duplication is minimized. There are two reasons:

  • reduce the amount of required storage for data
  • avoid unecessary data conflict that can be casued by the data duplication

Normalization in Database

database normalization is one of the technique that can ensure the above two points. The core idea of database normalization is to divide the table into smaller subtables and store pointers to the data rather replicating it. In order to illustrate this concept, we are going to use Spotify dataset

Take a look at the original data

df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
print(df.shape)
df.head()
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre playlist_subgenre danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
0 6f807x0ima9a1j3VPbc7VN I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.748 0.916 6 -2.634 1 0.0583 0.102 0 0.0653 0.518 122.036 194754
1 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.726 0.815 11 -4.969 1 0.0373 0.0724 0.00421 0.357 0.693 99.972 162600
2 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.675 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05 0.11 0.613 124.008 176616
3 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.718 0.93 7 -3.778 1 0.102 0.0287 9.43e-06 0.204 0.277 121.956 169093
4 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.65 0.833 1 -4.672 1 0.0359 0.0803 0 0.0833 0.725 123.976 189052

each row represents a track. However, Not all columns are dedicate to the attribute of each tracks. For example, playlist_id, playlist genre dedicates to playlist attributes. Track_album_name, track_albumn_release_date dedicate to album attributes. It shows that our dataset is not in the Thrid Normal form.

Types of database normalization

These are various database “normal” forms. Each normal forms has an importance which helps in optimizing the database to save storage and reduce redundancies

First Normal Form(1NF)

the first normal form simply says that each cell of a table should contain exactly one value. According to our original spotify dataset table, it appears to satisfy the first normal form.

Second Normal Form(2NF)

For a table to be in second normal form, the following 2 conditions are to be met:

  1. table should be in first normal form
  2. the primary key of the table should compose of exactly 1 column

a primary key is a set fo columns that uniquely identifies a row. Basically no 2 rows have the same primary keys. Let’s take a look at the dataset.

FIrst, we would like to separate the original dataframe into the following dataframe:

  • track_df: contain all attributes of specific track

    #select track attributes 
    track_df = df.iloc[:,0:4].drop_duplicates()
    attribute_df = df.iloc[:,11:].drop_duplicates()
    #concatenate all columns together
    track_df = pd.concat([track_df,attribute_df],axis=1).drop_duplicates()
    track_df.head()
    
    track_id track_name track_artist track_popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
    0 6f807x0ima9a1j3VPbc7VN I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran 66 0.748 0.916 6 -2.634 1 0.0583 0.102 0 0.0653 0.518 122.036 194754
    1 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 0.726 0.815 11 -4.969 1 0.0373 0.0724 0.00421 0.357 0.693 99.972 162600
    2 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 0.675 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05 0.11 0.613 124.008 176616
    3 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 0.718 0.93 7 -3.778 1 0.102 0.0287 9.43e-06 0.204 0.277 121.956 169093
    4 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 0.65 0.833 1 -4.672 1 0.0359 0.0803 0 0.0833 0.725 123.976 189052
  • album_df: contain all attributes of specific album

    #select album attributes
    album_df = df.iloc[:,4:7].drop_duplicates()
    album_df.head()
    
    track_album_id track_album_name track_album_release_date
    0 2oCs0DGTsRO98Gh5ZSl2Cx I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14
    1 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13
    2 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05
    3 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19
    4 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05
  • Playlist_df: contain all attributes of specific playlist

    #select playlist attributes
    playlist_df = df.iloc[:,7:11].drop_duplicates()
    playlist_df.head()
    
    playlist_name playlist_id playlist_genre playlist_subgenre
    0 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop
    70 Dance Pop 37i9dQZF1DWZQaaqNMbbXa pop dance pop
    167 Dance Room 37i9dQZF1DX2ENAPP1Tyed pop dance pop
    223 Cardio 37i9dQZF1DWSJHnPb1f0X3 pop dance pop
    272 Dance Pop Hits 37i9dQZF1DX6pH08wMhkaI pop dance pop
  • df_track_album: connect track_df and album_df

    #select track_id and track_albumn_id to link the track_df and album_df
    df_track_album = df.iloc[:,[0,4]].drop_duplicates()
    df_track_album.head()
    
    track_id track_album_id
    0 6f807x0ima9a1j3VPbc7VN 2oCs0DGTsRO98Gh5ZSl2Cx
    1 0r7CVbZTWZgbTCYdfa2P31 63rPSO264uRjW1X5E6cWv6
    2 1z1Hg7Vb0AhHDiEmnDE79l 1HoSmj2eLcsrR0vE9gThr4
    3 75FpbthrwQmzHlBJLuGdC7 1nqYsOef1yKKuGOVchbsk6
    4 1e8PAfcKUYoKkxPhrHqw4x 7m7vv9wlQ4i0LFuJiE2zsQ
  • df_track_playlist: connect track_df and playlist_df

    #select track_id and playlist_id to link the track_df and playlist_df
    df_track_playlist = df.iloc[:,[0,8]].drop_duplicates()
    print(df_track_playlist.shape)
    df_track_playlist.head()
    
    track_id playlist_id
    0 6f807x0ima9a1j3VPbc7VN 37i9dQZF1DXcZDD7cfEKhW
    1 0r7CVbZTWZgbTCYdfa2P31 37i9dQZF1DXcZDD7cfEKhW
    2 1z1Hg7Vb0AhHDiEmnDE79l 37i9dQZF1DXcZDD7cfEKhW
    3 75FpbthrwQmzHlBJLuGdC7 37i9dQZF1DXcZDD7cfEKhW
    4 1e8PAfcKUYoKkxPhrHqw4x 37i9dQZF1DXcZDD7cfEKhW

Third Normal Form(3NF)

Before we try to understand the details of third normal form, we have to understand the concept of a functional dependency on a table

column A is said to be functionally dependent on column B if changing the value of A may require a change in the value of column B. The third normal form avoids this by breaking this table into separate tables.

In playlist_df, column playlist_subgenere is dependent on playlist_genre. Thus, we have to separate two dataframe. In addition, each playlist can only belong to a single subgenre. Hence, it is the best that we separate all of the columns into different tables.

df_playlist_name = playlist_df.iloc[:,0:2].drop_duplicates()
print(df_playlist_name.shape) 
df_playlist_name.head()
playlist_name playlist_id
0 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
70 Dance Pop 37i9dQZF1DWZQaaqNMbbXa
167 Dance Room 37i9dQZF1DX2ENAPP1Tyed
223 Cardio 37i9dQZF1DWSJHnPb1f0X3
272 Dance Pop Hits 37i9dQZF1DX6pH08wMhkaI
#one track can only belong to one genre
df_playlist_genre = playlist_df.iloc[:,[1,2]].drop_duplicates(subset='playlist_id',keep='first')
print(df_playlist_genre.shape)
df_playlist_genre.head()
playlist_id playlist_genre
0 37i9dQZF1DXcZDD7cfEKhW pop
70 37i9dQZF1DWZQaaqNMbbXa pop
167 37i9dQZF1DX2ENAPP1Tyed pop
223 37i9dQZF1DWSJHnPb1f0X3 pop
272 37i9dQZF1DX6pH08wMhkaI pop
#separate into genre  vs subgenre
df_genre_sub = playlist_df.iloc[:,2:].drop_duplicates(subset='playlist_subgenre',keep='first')
print(df_genre_sub.shape)
df_genre_sub.head()
playlist_genre playlist_subgenre
0 pop dance pop
1298 pop post-teen pop
2427 pop electropop
3835 pop indie poptimism
5507 rap hip hop

Denormalization

Now that we have all the tables separrated and deduplicated, it is appropriate that we combine all the subtables into a combined tables to make sure that we are able to convert to the original table format

final_df=(
    df_track_attribute.
    merge(df_popularity).
    merge(df_track_playlist).
    merge(df_track_info).
    merge(df_playlist_name,on='playlist_id').
    merge(df_playlist_genre,on='playlist_id').
    merge(df_track_album,on='track_id').
    merge(album_df,on='track_album_id')
    
) 
track_id danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms track_popularity playlist_id track_name track_artist playlist_name playlist_genre track_album_id track_album_name track_album_release_date
0 6f807x0ima9a1j3VPbc7VN 0.748 0.916 6 -2.634 1 0.0583 0.102 0 0.0653 0.518 122.036 194754 66 37i9dQZF1DXcZDD7cfEKhW I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran Pop Remix pop 2oCs0DGTsRO98Gh5ZSl2Cx I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14
1 6f807x0ima9a1j3VPbc7VN 0.748 0.916 6 -2.634 1 0.0583 0.102 0 0.0653 0.518 122.036 194754 66 4aUEH3uhbofktrFkXOOaKj I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran Pop EDM Remixes edm 2oCs0DGTsRO98Gh5ZSl2Cx I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14
2 0r7CVbZTWZgbTCYdfa2P31 0.726 0.815 11 -4.969 1 0.0373 0.0724 0.00421 0.357 0.693 99.972 162600 67 37i9dQZF1DXcZDD7cfEKhW Memories - Dillon Francis Remix Maroon 5 Pop Remix pop 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13
3 0r7CVbZTWZgbTCYdfa2P31 0.726 0.815 11 -4.969 1 0.0373 0.0724 0.00421 0.357 0.693 99.972 162600 67 5CTzufLc0f6MufjKYrIaoO Memories - Dillon Francis Remix Maroon 5 EDM House & Dance edm 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13
4 1z1Hg7Vb0AhHDiEmnDE79l 0.675 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05 0.11 0.613 124.008 176616 70 37i9dQZF1DXcZDD7cfEKhW All the Time - Don Diablo Remix Zara Larsson Pop Remix pop 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05

Wow, we have successfully convert the data table into our original form. It shows that our process works!

Summary

The various form of database normalization are useful while designing the schema of a database in such a way that there is no data replication which may possibly lead to inconsistancies. While designing the schema for application, we should always think about how cna we make use of these forms.