Phd_data_dashboard

The objective of this blog is to explore the Phd datase. We would like to build a data dashboard via streamlit package.

Dataset

Data presented in Doctorate Recipients from U.S. Universities: 2017 were collected by the Survey of Earned Doctorates (SED). The Survey collects data on the number and characteristics of individuals receiving research doctoral degrees from U.S. There are total 72 tables within the dataset. Let’s talk about streamlit package.

Streamlit

Streamlit is an open source app framework specifically designed for ML engineers working with Python. It allows you to create a stunning looking application with only a few lines of code.A few of the advantages of using Streamlit tools like Dash and Flask:

  • It embraces Python scripting; No HTML knowledge is needed!
  • Less code is needed to create a beautiful application
  • No callbacks are needed since widgets are treated as variables
  • Data caching simplifies and speeds up computation pipelines.

Installation

In order to install the streamlit package. We can use the following command in Mac terminal:

pip install streamlit

we can open the terminal with following command

streamlit hello

Dashboard

we first import all the neccesary package to perform data analysis

import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from plotly import tools
import plotly.offline as py

then, let me introduce some basic features that you are likely to be using in your own apps. We will first read in the “sed17-sr-tab005” excel file and display a colormap with following code

df = pd.read_excel('data_tables/sed17-sr-tab005.xlsx')
df = df[2:]
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df_ = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv')
df_code = df_[['code','state']]
result = pd.merge(df, df_code, how='inner', left_on='State or location', right_on='state')

if st.checkbox('Show Dataframe--Number of Doctorate recipients by States'):
    st.write(result)

fig = go.Figure(data=go.Choropleth(
    locations=result['code'], # Spatial coordinates
    z = result['Doctorate recipients'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'blues',
    colorbar_title = "Number of Doctorate recipients",
))

fig.update_layout(
    title_text = 'State, ranked by number of doctorate recipients: 2017',
    geo_scope='usa', # limite map scope to USA
)

st.plotly_chart(fig)

image alter text

we would then display a multiselect box with “sed17-sr-tab006” excel file.

df_6 = pd.read_excel('data_tables/sed17-sr-tab006.xlsx')
df_6 = df_6.iloc[2:5,1:]
subjects = df_6.iloc[0].dropna().to_list()
df_male = df_6.iloc[2].to_list()[::2]
df_female = df_6.iloc[2].to_list()[1::2]
df_6 = pd.DataFrame({'Male': df_male,
                  'Female': df_female},
                  index=subjects)

sex = ['Male','Female']



if st.checkbox('Show Dataframe--Doctorates awarded by sex'):
    st.write(df_6)

selected_metrics_2 = st.selectbox(
    label="Choose...", options=subjects
)

fig = go.Figure()
for i in subjects:
    if selected_metrics_2 == i:
        fig = go.Figure(data=[go.Pie(labels=sex, values=df_6.loc[i],title= i+' Doctorates awarded by sex')])


st.plotly_chart(fig, use_container_width=True)



subset_data = df_2
country_name_input = st.multiselect('Choose one or more', df_2.columns[1:])
# by country name
if len(country_name_input) > 0:
    subset_data = df_2[country_name_input]

fig = go.Figure()
for i in subset_data.columns:
        fig.add_trace(go.Scatter(x=df_2.Year, y=df_2[i],
                        mode='lines+markers',
                        name=i))

st.plotly_chart(fig, use_container_width=True)

image alter text

Lastly, we would like to display a piechart for “sed17-sr-tab002” file to display the portion of Doctorate-granting by sex

df_2 = pd.read_excel('data_tables/sed17-sr-tab002.xlsx')
df_2 = df_2.iloc[4:,:]
df_2.columns = ['Year','Doctorate-granting institutions','Total','Mean (per institution)','Median (per institution)']
#df_2 = pd.read_csv('')
if st.checkbox('Show Dataframe--Number of Doctorate recipients by Institutions'):
    st.write(df_2)

selected_metrics = st.selectbox(
    label="Choose...", options=['Doctorate-granting institutions','Total','Mean (per institution)','Median (per institution)']
)


fig = go.Figure()
if selected_metrics == 'Doctorate-granting institutions':
    fig.add_trace(go.Scatter(x=df_2.Year, y=df_2['Doctorate-granting institutions'],
                    mode='lines+markers',
                    name='Doctorate-granting institutions'))
if selected_metrics == 'Total':
    fig.add_trace(go.Scatter(x=df_2.Year, y=df_2.Total,
                        mode='lines+markers', name='Total'))
if selected_metrics == 'Mean (per institution)':
    fig.add_trace(go.Scatter(x=df_2.Year, y=df_2['Mean (per institution)'],
                        mode='lines+markers',
                        name='Mean (per institution)'))

if selected_metrics == 'Median (per institution)':
    fig.add_trace(go.Scatter(x=df_2.Year, y=df_2['Median (per institution)'],
                        mode='lines+markers',
                        name='Median (per institution)'))
st.plotly_chart(fig, use_container_width=True)

image alter text

Run the dashboard

in order to run the above application, we will go to the directory where your .py file is stored and run the following command

streamlit run {filename}

The above precedure is required to successfully build a streamlit dashboard.