Creating a Dashboard with the Matplotlib Library 📈
The purpose of this tutorial is to build graphics that support the data science process. We may employ visualizations during exploratory analysis, before or after processing data, or even deliver a chart or a dashboard as the final product. Knowing how to create a visualization, regardless of the tool, is therefore of fundamental importance.
See the accompanying Jupyter Notebook for the concepts that will be covered about Data Visualization with Matplotlib. Note: important functions, outputs and terms are in bold to facilitate understanding - at least mine.
Matplotlib
In this first step we will create graphics in matplotlib, manipulate formatting, and make the adjustments needed to get the data into a shape suitable for plotting.
• Import packages
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Image
%matplotlib inline

mpl.__version__
'3.3.3'
• Matplotlib Styles
Once matplotlib is loaded, the library brings some styles—templates that can be used to create graphics, without the need to construct everything from scratch.
print(plt.style.available)

['seaborn-dark', 'seaborn-darkgrid', 'seaborn-ticks', 'fivethirtyeight', 'seaborn-whitegrid', 'classic', '_classic_test', 'fast', 'seaborn-talk', 'seaborn-dark-palette', 'seaborn-bright', 'seaborn-pastel', 'grayscale', 'seaborn-notebook', 'ggplot', 'seaborn-colorblind', 'seaborn-muted', 'seaborn', 'Solarize_Light2', 'seaborn-paper', 'bmh', 'tableau-colorblind10', 'seaborn-white', 'dark_background', 'seaborn-poster', 'seaborn-deep']
Each of the styles above carries its own configuration of color, size, positioning of elements, etc.
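As a quick illustration of switching templates, the sketch below (a minimal example, assuming only matplotlib and numpy are installed) applies the built-in ggplot style before plotting:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

plt.style.use("ggplot")  # every plot created from here on uses the ggplot template

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(np.arange(10), np.arange(10) ** 2)
ax.set_title("Styled with ggplot")
fig.savefig("styled.png")

print("ggplot" in plt.style.available)  # True
```

Any name from `plt.style.available` can be passed to `plt.style.use` in the same way.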
• Create function
Let's define a function to create a plot. Once a piece of code starts to repeat over and over, it is convenient to turn that repetition into a function.
First, let's define some random values with the randn function of the random module of the numpy package.
Then create subplots - several small plots within the plot area. As output we have the figure objects and axes.
Next we define all the parameters of the graph: the x values, the number of bins, the density and the histogram type are applied to the axes, and then we set the title, axis labels and legend on axes. Finally, we use the show() function to display the graph in the notebook:
def plot_1():
    x = np.random.randn(5000, 6)
    (figure, axes) = plt.subplots(figsize = (16, 10))
    (n, bins, patches) = axes.hist(x, 12, density = 1, histtype = 'bar',
                                   label = ['Color 1', 'Color 2', 'Color 3',
                                            'Color 4', 'Color 5', 'Color 6'])
    axes.set_title("Histogram\nFor\nNormal Distribution", fontsize = 25)
    axes.set_xlabel("Data", fontsize = 16)
    axes.set_ylabel("Frequency", fontsize = 16)
    axes.legend()
    plt.show()
• Call function
Let's call the created function:
plot_1()
- The data come from the 5000 random randn values;
- The colors were defined with the label list;
- The chart type was passed through the histtype parameter;
- The legend was enabled on axes with legend();
- The x and y axis labels were defined with set_xlabel() and set_ylabel();
- The main title was defined with set_title().
Customize charts
We can create our own styles, that is, completely customize the graphics. We will run a command directly on the operating system:
• Windows users
When running the code below, we list the contents of the directory:
!dir styles
• Mac and Linux users
When running the code below, we list the contents of the directory:
!ls -l styles
Query styles in the directory
We have two .mplstyle files in the directory - these are matplotlib styles. Let's look at one of these files:
• Windows users
When running the code below, we display the contents of personalstyle-1 from the styles directory:
!type styles\personalstyle-1.mplstyle
• Mac and Linux users
When running the code below, we display the contents of personalstyle-1 from the styles directory:
!cat styles/personalstyle-1.mplstyle
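We cannot reproduce the actual contents of personalstyle-1.mplstyle here, but an .mplstyle file is just a plain-text list of rcParams key-value pairs. A hypothetical example (illustrative only, not the real file):

```
# hypothetical .mplstyle file - each line overrides one matplotlib rcParam
axes.facecolor     : f0f0f0
axes.grid          : True
grid.color         : white
lines.linewidth    : 2
font.size          : 12
figure.figsize     : 16, 10
```

Any rcParam that matplotlib accepts can be set this way, which is what makes these files full style templates.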
• How to use custom style
We call the function plt.style.use and point to the directory where the style text file is stored:
plt.style.use("styles/personalstyle-1.mplstyle")
• Call plot function
We call again the plot function defined at the beginning:
plot_1()
See that the chart was initially in the standard matplotlib format and we changed the style to a more professional look.
To assist in our work, we will use the car dataset from the UCI Machine Learning Repository: UCI Automobile Data Set. We will load the csv file and, from this dataset, build our graphics.
Python modularization
Let's open the directory and find the three files with the .py extension - 3 very important modules: generatedata.py, generateplot.py, radar.py. We can open these files in a text editor and explore them a bit more.
As we build our analysis process, we will use code that repeats itself across multiple projects. For example, the vast majority of Machine Learning algorithms require normalized data before predictive modeling is applied. That is, we will have to normalize the data in several different projects we work on.
So that we do not have to repeat this code, we can create a Python module and reuse it in each project that requires it. Automating the work helps us be more productive and, consequently, get better results. With a custom module, it is enough to load the module and call the function it defines - to modularize is to professionalize the work of a Data Scientist.
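The idea can be sketched in a self-contained way. The snippet below (the module name, function and directory are hypothetical, not from the tutorial's lib folder) writes a tiny reusable module to disk and imports it, mirroring the sys.path approach used with lib later on:

```python
import os
import sys
import tempfile

# a hypothetical reusable module: one normalization helper we would
# otherwise rewrite in every project
module_code = '''
def minmax_normalize(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
'''

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "mymodule.py"), "w") as f:
    f.write(module_code)

sys.path.append(tmpdir)  # make the directory importable, like sys.path.append("lib")
import mymodule

print(mymodule.minmax_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

In a real project the module would simply live in a directory next to the notebook instead of being generated on the fly.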
Using Pandas to Load Data
First, let's import the sys package - the module for system-specific parameters and functions - and call the append function on sys.path, passing the lib directory, i.e. we make the lib directory visible to the Jupyter Notebook we are working in.
Next we import the modules that are allocated in the lib: generatedata.py, generateplot.py, and radar.py.
import sys
sys.path.append("lib")

import generatedata, generateplot, radar
• Call to module function
We will call the get_raw_data function - when checking the function in the generatedata.py module, we can interpret that this function loads the csv file and calls read_csv from Pandas, that is, every time we want to load a csv file, just change the file name in data_file of the function in the module:
data = generatedata.get_raw_data()
"""
Load full set
def get_raw_data():
    data_file = "cars.csv"
    return pd.read_csv(data_file)
"""
data.head()
• Another function of the module
Just as we called get_raw_data, we can call get_limited_data - it loads only a few variables from the dataset:
data_subset = generatedata.get_limited_data()
"""
Check limited data
def get_limited_data(cols = None, lower_bound = None):
    if not cols:
        cols = limited_columns
    data = get_raw_data()[cols]
    if lower_bound:
        (makes, _) = get_make_counts(data, lower_bound)
        data = data[data["make"].isin(makes)]
    return data
"""
data_subset.head()
The function results in a subset - only a few variables from our dataset have been returned. In this way, we can load the complete data set or just a few variables, according to our ultimate goal.
generatedata.get_all_auto_makes()
"""
Search only car manufacturers
def get_all_auto_makes():
    return pd.Series(get_raw_data()["make"]).unique()
"""

array(['audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'jaguar', 'mazda', 'mercedes-benz', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)
With the get_all_auto_makes we only look for the car manufacturers of the set, in the Series name make.
The functions are ready; we just call them. get_make_counts counts the manufacturers in the dataset, returning a tuple (manufacturers, counts) as output.
(automakers, total) = generatedata.get_make_counts(data_subset)
"""
# get count
def get_make_counts(pddata, lower_bound = 0):
    counts = []
    filtered_makes = []
    for make in get_all_auto_makes():
        data = get_make_data(make, pddata)
        count = len(data.index)
        if count >= lower_bound:
            filtered_makes.append(make)
            counts.append(count)
    return (filtered_makes, list(zip(filtered_makes, counts)))
"""
total

[('audi', 4), ('bmw', 4), ('chevrolet', 3), ('dodge', 8), ('honda', 13), ('jaguar', 1), ('mazda', 11), ('mercedes-benz', 5), ('mitsubishi', 10), ('nissan', 18), ('peugot', 7), ('plymouth', 6), ('porsche', 1), ('saab', 6), ('subaru', 12), ('toyota', 31), ('volkswagen', 8), ('volvo', 11)]
• Number of indexes
Get the number of rows in the set or table:
len(data.index)
141
Normalizing Data
When we work with a dataset whose variables have very different scales, it may be necessary to normalize the data, that is, to put the data on the same scale - a statistical task.
First, let's create a copy of our dataset with the copy function and assign it to norm_data - a mere copy of the set, and a good practice when we make transformations in the data.
norm_data = data.copy()
Next, let's rename the horsepower column:
norm_data.rename(columns = {"horsepower" : "power"}, inplace = True)
norm_data.head()
To normalize, we will call norm_column - this function lives in generatedata.py.
The norm_column function uses the minimum and maximum values, i.e. (value - min) / (max - min) - an elementary mathematical operation that puts the data on the same scale.
The norm_column function receives as parameters the column name col_name, the DataFrame pddata, and inverted. If inverted is True, the function executes the if inverted block, that is, the last line of code of the function:
def norm_column(col_name, pddata, inverted = False):
    pddata[col_name] -= pddata[col_name].min()
    pddata[col_name] /= pddata[col_name].max()
    if inverted:
        pddata[col_name] = 1 - pddata[col_name]
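A quick sanity check of this logic on a toy DataFrame (self-contained, redefining the function as shown above; the column names here are just for illustration):

```python
import pandas as pd

def norm_column(col_name, pddata, inverted=False):
    # min-max normalization: (value - min) / (max - min), in place
    pddata[col_name] -= pddata[col_name].min()
    pddata[col_name] /= pddata[col_name].max()
    if inverted:
        pddata[col_name] = 1 - pddata[col_name]

df = pd.DataFrame({"power": [50.0, 100.0, 150.0]})
norm_column("power", df)
print(df["power"].tolist())  # [0.0, 0.5, 1.0]

# inverted = True flips the scale, so the smallest value maps to 1
df2 = pd.DataFrame({"price": [50.0, 100.0, 150.0]})
norm_column("price", df2, inverted=True)
print(df2["price"].tolist())  # [1.0, 0.5, 0.0]
```

The inverted branch is what the tutorial later uses for "lower is better" variables such as price and losses.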
• Normalize columns
Let's normalize some columns. Since it is not necessary to normalize string or categorical variables, we normalize only the numeric columns.
# higher values
generatedata.norm_columns(["city mpg", "highway mpg", "power"], norm_data)
norm_data.head()
When comparing the two tables, we have the original data on the left and the normalized data on the right, that is, on the same numerical scale. We don't change the information contained in the data at all, we just change the scale - the data still represent the same thing. This is useful for chart construction and predictive modeling.
However, some variables may require a different normalization. The previous normalization favored higher values; now we will normalize lower values - we apply the inverted normalization with invert_norm_columns, a function of the generatedata.py module, which calls norm_column and this time passes the inverted parameter as True.
For this reversed normalization, the lower value variables will be passed:
# normalize lower values
generatedata.invert_norm_columns(["price", "weight", "riskiness", "losses"], norm_data)
norm_data.head()
Having all variables now normalized, we are ready to start the plot series.
Plots
First, we call plt.figure, which creates a figure - a plot area with figsize dimensions - and create a GridSpec, a kind of drawing grid. Then we call the make_autos_price_plot function from the generateplot.py module, which sets the title, the chart type (scatter plot), the default label style and a few parameters that place the chart exactly in the desired position - a single simple call to make_autos_price_plot:
figure = plt.figure(figsize = (15, 5))
prices_gs = mpl.gridspec.GridSpec(1, 1)
prices_axes = generateplot.make_autos_price_plot(figure, prices_gs, data)
plt.show()
• Vertical Dispersion Plot
figure = plt.figure(figsize = (15, 5))
mpg_gs = mpl.gridspec.GridSpec(1, 1)
mpg_axes = generateplot.make_autos_mpg_plot(figure, mpg_gs, data)
plt.show()
• Stacked Bar Plot
figure = plt.figure(figsize = (15, 5))
risk_gs = mpl.gridspec.GridSpec(1, 1)
risk_axes = generateplot.make_autos_riskiness_plot(figure, risk_gs, norm_data)
plt.show()
• Inverted Stacked Bar Plot
figure = plt.figure(figsize = (15, 5))
loss_gs = mpl.gridspec.GridSpec(1, 1)
loss_axes = generateplot.make_autos_losses_plot(figure, loss_gs, norm_data)
plt.show()
• Standard bar chart
figure = plt.figure(figsize = (15, 5))
risk_loss_gs = mpl.gridspec.GridSpec(1, 1)
risk_loss_axes = generateplot.make_autos_loss_and_risk_plot(figure, risk_loss_gs, norm_data)
plt.show()
With all this, we were able to create several different graphs through the calls of functions belonging to the plotting module generateplot.py, in an organized and effective way.
• Radar Graph
Finally, we will create the radar graph - a very complex chart type. We use a dedicated module, radar.py, because of this complexity; it may take a long time to render this graph.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Image
import warnings
#warnings.filterwarnings('ignore')
import matplotlib
%matplotlib inline

import sys
sys.path.append("lib")
import generatedata, generateplot, radar

#plt.style.use("styles/personalstyle-1.mplstyle")

data = generatedata.get_raw_data()
data.head()

data_subset = generatedata.get_limited_data()
data_subset.head()

data = generatedata.get_limited_data(lower_bound = 6)
data.head()

norm_data = data.copy()
norm_data.rename(columns = {"horsepower": "power"}, inplace = True)

figure = plt.figure(figsize = (15, 5))
radar_gs = mpl.gridspec.GridSpec(3, 7, height_ratios = [1, 10, 10],
                                 wspace = 0.50, hspace = 0.60,
                                 top = 0.95, bottom = 0.25)
radar_axes = generateplot.make_autos_radar_plot(figure, gs = radar_gs, pddata = norm_data)
plt.show()
Below we have the radar chart with the title: Radar plot with 7 dimensions for 12 manufacturers - each named at the top of its radar. The dimensions refer to the variables around each radar, showing each manufacturer's profile through an individual plot.
• Combined Plots - Dashboard
Let's now build a dashboard with matplotlib. The diagram below is known as a wireframe - a very common term in design, giving a general idea of what will be built.
This wireframe shows a dashboard template - a dashboard is a set of charts, that is, we define our plot area and within that area we organize our graphics that were created individually through a combined plot - dashboard.
--------------------------------------------
|               overall title              |
--------------------------------------------
|               price ranges               |
--------------------------------------------
| combined loss/risk   |                   |
|                      |      radar        |
|----------------------|      plots        |
|   risk    |   loss   |                   |
--------------------------------------------
|                  mpg                     |
--------------------------------------------
Below we have the construction of the wireframe in a fragmented way, layer by layer: draw the figure with matplotlib's pyplot, then draw the grid - the area where the graphics will be placed. Once the figure is in place, we add the title layers, starting with the overall title - the first layer where a subplot is defined and added.
# building layers (without data)
figure = plt.figure(figsize = (10, 8))
gs_master = mpl.gridspec.GridSpec(4, 2, height_ratios = [1, 2, 8, 2])

# layer 1 - Title
gs_1 = mpl.gridspec.GridSpecFromSubplotSpec(1, 1, subplot_spec = gs_master[0, :])
title_axes = figure.add_subplot(gs_1[0])

# layer 2 - Price
gs_2 = mpl.gridspec.GridSpecFromSubplotSpec(1, 1, subplot_spec = gs_master[1, :])
price_axes = figure.add_subplot(gs_2[0])

# layer 3 - Risks and Radar
gs_31 = mpl.gridspec.GridSpecFromSubplotSpec(2, 2, height_ratios = [2, 1], subplot_spec = gs_master[2, :1])
risk_and_loss_axes = figure.add_subplot(gs_31[0, :])
risk_axes = figure.add_subplot(gs_31[1, :1])
loss_axes = figure.add_subplot(gs_31[1:, 1])

gs_32 = mpl.gridspec.GridSpecFromSubplotSpec(1, 1, subplot_spec = gs_master[2, 1])
radar_axes = figure.add_subplot(gs_32[0])

# layer 4 - MPG
gs_4 = mpl.gridspec.GridSpecFromSubplotSpec(1, 1, subplot_spec = gs_master[3, :])
mpg_axes = figure.add_subplot(gs_4[0])

# join layers, still without data
gs_master.tight_layout(figure)
plt.show()
The second layer holds the price chart, the third layer is subdivided into the risk and radar charts, the fourth layer holds the MPG chart, and finally we call tight_layout() to join all these layers in the same area.
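The layering idea does not depend on the tutorial's custom modules. A minimal self-contained sketch (hypothetical sample data, only matplotlib and numpy) builds the same title-plus-panels structure:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np

figure = plt.figure(figsize=(8, 6))
# 2 rows x 2 columns; the short top row is reserved for the title
gs_master = gridspec.GridSpec(2, 2, height_ratios=[1, 4])

# layer 1 - title spanning both columns, with its axes hidden
title_axes = figure.add_subplot(gs_master[0, :])
title_axes.set_title("Mini Dashboard")
title_axes.axis("off")

# layer 2 - two side-by-side charts
left_axes = figure.add_subplot(gs_master[1, 0])
left_axes.bar(["a", "b", "c"], [3, 5, 2])

right_axes = figure.add_subplot(gs_master[1, 1])
t = np.linspace(0, 2 * np.pi, 100)
right_axes.plot(t, np.sin(t))

gs_master.tight_layout(figure)  # fit all layers into the same area
figure.savefig("mini_dashboard.png")
print(len(figure.axes))  # 3
```

Replacing the bar and line panels with calls into a plotting module is exactly what the full dashboard code below does.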
Once the wireframe is created, we can plot the charts in the reserved areas within that grid. From here, we call the charts for each area:
# building layers with data
figure = plt.figure(figsize = (15, 15))
gs_master = mpl.gridspec.GridSpec(4, 2, height_ratios = [1, 24, 128, 32], hspace = 0, wspace = 0)

# layer 1 - title
gs_1 = mpl.gridspec.GridSpecFromSubplotSpec(1, 1, subplot_spec = gs_master[0, :])
title_axes = figure.add_subplot(gs_1[0])
title_axes.set_title("Plots", fontsize = 30, color = "#cdced1")
generateplot.hide_axes(title_axes)

# layer 2 - price
gs_2 = mpl.gridspec.GridSpecFromSubplotSpec(1, 1, subplot_spec = gs_master[1, :])
price_axes = figure.add_subplot(gs_2[0])
generateplot.make_autos_price_plot(figure, pddata = data, axes = price_axes)

# layer 3 - risks
gs_31 = mpl.gridspec.GridSpecFromSubplotSpec(2, 2, height_ratios = [2, 1], hspace = 0.4, subplot_spec = gs_master[2, :1])
risk_and_loss_axes = figure.add_subplot(gs_31[0, :])
generateplot.make_autos_loss_and_risk_plot(figure, pddata = norm_data, axes = risk_and_loss_axes, x_label = False, rotate_ticks = True)
risk_axes = figure.add_subplot(gs_31[1, :1])
generateplot.make_autos_riskiness_plot(figure, pddata = norm_data, axes = risk_axes, legend = False, labels = False)
loss_axes = figure.add_subplot(gs_31[1:, 1])
generateplot.make_autos_losses_plot(figure, pddata = norm_data, axes = loss_axes, legend = False, labels = False)

# layer 3 - radar
gs_32 = mpl.gridspec.GridSpecFromSubplotSpec(5, 3, height_ratios = [1, 20, 20, 20, 20], hspace = 0.6, wspace = 0, subplot_spec = gs_master[2, 1])
(rows, cols) = geometry = gs_32.get_geometry()
title_axes = figure.add_subplot(gs_32[0, :])
inner_axes = []
projection = radar.RadarAxes(spoke_count = len(norm_data.groupby("make").mean().columns))
[inner_axes.append(figure.add_subplot(m, projection = projection)) for m in [n for n in gs_32][cols:]]
generateplot.make_autos_radar_plot(figure, pddata = norm_data, title_axes = title_axes, inner_axes = inner_axes, legend_axes = False, geometry = geometry)

# layer 4 - MPG
gs_4 = mpl.gridspec.GridSpecFromSubplotSpec(1, 1, subplot_spec = gs_master[3, :])
mpg_axes = figure.add_subplot(gs_4[0])
generateplot.make_autos_mpg_plot(figure, pddata = data, axes = mpg_axes)

# joining layers
gs_master.tight_layout(figure)
plt.show()
When we join all these layers, we have as output a complete dashboard:
We can understand, then, that a dashboard is a set of charts in the same plot area. We can build our own dashboard from scratch, without having to pay for proprietary tools. On the other hand, programming it ourselves adds complexity.
This dashboard could be used as the final result of our work. It could be a dashboard for real-time data monitoring, sales forecasting or historical data analysis - it always depends on our goal.
Thank you.