
Data visualization is a critical technique in interpreting and communicating complex data, transforming raw numbers into intuitive visual formats like graphs, charts, and maps. This approach leverages the human brain’s capability to process visual information more efficiently than text, making it invaluable in our data-rich world. It simplifies complex information, allowing for the easy identification of trends and patterns, essential in areas where rapid data comprehension and decision-making are crucial.
The strength of data visualization lies in its ability to make data accessible and understandable to a wide audience, including those without technical expertise. It not only aids in revealing hidden insights within datasets but also enhances decision-making efficiency by providing clear and concise visual representations. In summary, data visualization is more than an aesthetic enhancement of data; it’s an essential tool for effective data analysis and communication in various professional and academic fields, turning complex data into actionable knowledge.
This tutorial is broken into three parts. Part-1 covers the Python setup for this and many upcoming tutorials, along with an introduction to the matplotlib and seaborn libraries.
In Part-2, we will go a little deeper into seaborn and touch on plotly, which provides mouse-over details and 3D data visualization.
In Part-3, we will do a simple project based on food delivery service data visualization.
Setting up Python Environment
It is important to structure your data science project based on a certain standard so that you can easily maintain and modify it, and share it with your collaborators or development team.
I have put together a template repository that incorporates best practices for creating a maintainable and reproducible data science project. Assuming you have already installed Python on your machine, we will install the cookiecutter and poetry packages using pip:
pip install cookiecutter
pip install poetry
Create a project based on the template and answer the questions to set up your data science project folder:
cookiecutter https://github.com/wstanislaus/data_science_template
directory_name [project-name]: data_visualization
author_name [Your Name]: William Stanislaus <william@tagus.io>
compatible_python_versions [>=3.8]:
cd data_visualization
make
Executing the “make” command creates the Python virtual environment and installs all the required dependency packages based on the “pyproject.toml” file.
#Run poetry shell to activate virtual environment
poetry shell
To add one or more additional dependency packages, use the poetry add command, for example:
poetry add pandas numpy seaborn plotly
I assume you will be using Visual Studio Code as your source code editor and know how to set it up by downloading the required extensions, such as Python and Jupyter notebook support. Since we create our Python virtual environment in the project folder itself, Visual Studio Code can auto-detect the Python interpreter from the virtual environment and use it.
Brief Introduction to the Tools Used
Poetry
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you. Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.
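As a sketch, a minimal "pyproject.toml" for this kind of project might look like the following (the package names and version constraints are illustrative, not the template's exact contents):

```toml
[tool.poetry]
name = "data_visualization"
version = "0.1.0"
description = "Data visualization tutorial project"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = ">=3.8"
pandas = "^1.5"
seaborn = "^0.12"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

Running "make" (or "poetry install") resolves these constraints and records the exact versions in a lockfile for repeatable installs.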
Hydra
Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs – much like a Hydra with multiple heads.
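For example, the "config/main.yaml" file loaded later in this tutorial could look roughly like this (apart from the "data.raw" key, which is referenced below, the keys and paths are assumptions for illustration):

```yaml
# config/main.yaml (illustrative)
data:
  raw: data/raw/automobile.csv
  processed: data/processed/automobile.csv
```

Hydra composes such files into a single config object, so code can read "cfg.data.raw" instead of hard-coding file paths.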
Pre-Commit hooks
Git hook scripts help identify simple coding issues before review, streamlining the process. Sharing these scripts across projects was initially challenging due to differences in project structures. Pre-commit, a multi-language package manager for hooks, addresses this by managing hook installation and execution in any language, enhancing efficiency and consistency without needing root access.
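A typical ".pre-commit-config.yaml" for a project like this might look as follows (the chosen hooks and pinned revisions are illustrative; these particular repos and hook ids do exist upstream):

```yaml
# .pre-commit-config.yaml (illustrative)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
```

After "pre-commit install", these checks run automatically on every "git commit".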
Data Version Control (DVC)
DVC is a tool for data science that takes advantage of existing software engineering toolset. It helps machine learning teams manage large datasets, make projects reproducible, and collaborate better.
pdoc
The Python package pdoc provides types, functions, and a command-line interface for accessing public documentation of Python modules, and for presenting it in a user-friendly, industry-standard open format. It is best suited for small- to medium-sized projects with tidy, hierarchical APIs.
Git Repository
For your convenience, I have already created the Jupyter notebook project; check it out and execute make:
git clone https://github.com/wstanislaus/data_visualization.git
cd data_visualization
make
poetry shell
Python Pandas
Python Pandas is a powerful data manipulation and analysis tool, built on top of the Python programming language. It offers data structures and operations for manipulating numerical tables and time series, making it indispensable for data analysis and cleaning. With its intuitive syntax and rich functionality, Pandas simplifies complex data operations.
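As a quick, self-contained illustration of the kind of operations Pandas makes easy (the data here is made up, not the automobile dataset used below):

```python
import pandas as pd

# Build a small DataFrame from a dictionary of columns
cars = pd.DataFrame({
    "make": ["toyota", "bmw", "toyota", "audi"],
    "price": [9995, 36880, 7898, 17450],
})

# Filter rows with a boolean condition
cheap = cars[cars["price"] < 20000]

# Compute a group-wise aggregate
avg_price = cars.groupby("make")["price"].mean()

print(len(cheap))           # number of cars under 20000
print(avg_price["toyota"])  # average toyota price
```

The same filtering and grouping patterns apply directly to the automobile data we load next.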
# Libraries to help with reading and manipulating data
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Command to tell Python to actually display the graphs
%matplotlib inline
cfg = None
# hydra for configuration
from hydra import initialize, compose
with initialize(version_base=None, config_path="../config/"):
    cfg = compose(config_name='main.yaml')
Loading the test data file (automobile.csv) from configuration
#Loading the data
df = pd.read_csv(f'../{cfg.data.raw}')
#printing top 5 rows from the data frame
df.head()
Since the output contains 26 columns, we may not be able to capture all of them on the screen, so I will copy and paste it as an image.

[in] df.shape
[out] (201, 26)
# The data has 201 rows and 26 columns
[in] df.info()
[out]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 201 non-null int64
1 normalized_losses 201 non-null int64
2 make 201 non-null object
3 fuel_type 201 non-null object
4 aspiration 201 non-null object
5 number_of_doors 201 non-null object
6 body_style 201 non-null object
7 drive_wheels 201 non-null object
8 engine_location 201 non-null object
9 wheel_base 201 non-null float64
10 length 201 non-null float64
11 width 201 non-null float64
12 height 201 non-null float64
13 curb_weight 201 non-null int64
14 engine_type 201 non-null object
15 number_of_cylinders 201 non-null object
16 engine_size 201 non-null int64
17 fuel_system 201 non-null object
18 bore 201 non-null float64
19 stroke 201 non-null float64
20 compression_ratio 201 non-null float64
21 horsepower 201 non-null int64
22 peak_rpm 201 non-null int64
23 city_mpg 201 non-null int64
24 highway_mpg 201 non-null int64
25 price 201 non-null int64
dtypes: float64(7), int64(9), object(10)
memory usage: 41.0+ KB
#There are attributes of different types (*int*, *float*, *object*) in the data.
Let's look at basic statistics of the numeric columns in the data, such as mean, standard deviation, min, and max:
[in] df.describe().T
# The dot "T" at the end denotes transpose; with many columns, it helps by displaying the columns as rows.

- The car price ranges from 5118 to 45400 units.
- The car weight ranges from 1488 to 4066 units.
- The average mileage for the cars is 25 mpg.
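To see what `describe()` computes, here is a tiny self-contained example (toy numbers, not the automobile data):

```python
import pandas as pd

toy = pd.DataFrame({"price": [5000, 10000, 15000, 20000, 45000]})

# describe() reports count, mean, std, min, quartiles, and max per numeric column;
# .T turns the columns into rows for easier reading
stats = toy.describe().T

print(stats.loc["price", "mean"])
print(stats.loc["price", "min"])
print(stats.loc["price", "max"])
```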
Histogram
- A histogram is a univariate plot which helps us understand the distribution of a continuous numerical variable.
- It breaks the range of the continuous variable into intervals of equal length and then counts the number of observations in each interval.
- We will use the *histplot()* function of seaborn to create histograms.
sns.histplot(data=df, x='price')

Seaborn histogram supports various customization and visualization techniques. In addition to the bars, we can also add a density estimate by setting the kde parameter to True.
- Kernel Density Estimation, or KDE, visualizes the distribution of data over a continuous interval.
- When overlaid on a count histogram, the density curve is scaled to match the bars, roughly: total number of observations × bin width × probability density.
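To make the KDE idea concrete, here is a minimal sketch of a Gaussian kernel density estimate computed by hand with NumPy (the toy data and the bandwidth choice are arbitrary; seaborn selects its bandwidth automatically):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 5.0])
bandwidth = 0.5

def kde(x, observations, h):
    # Average of Gaussian kernels centred on each observation
    u = (x - observations[:, None]) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=0) / h

grid = np.linspace(0, 6, 121)
density = kde(grid, data, bandwidth)

# A density integrates to (approximately) 1 over its support
print(np.trapz(density, grid))
```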
sns.histplot(data=df, x='price', kde=True);

Histograms are intuitive, but they are hardly a good choice when we want to compare the distributions of several groups.
sns.histplot(data=df, x='price', hue='body_style', kde=True);

# Sometimes it might be better to use subplots using seaborn FacetGrid feature
g = sns.FacetGrid(df, col="body_style")
g.map(sns.histplot, "price");

Boxplot
- A boxplot, or a box-and-whisker plot, shows the distribution of numerical data and its skewness by displaying the data quartiles.
- It is also called a five-number summary plot, where the five-number summary includes the minimum value, first quartile, median, third quartile, and the maximum value.
- The *boxplot()* function of seaborn can be used to create a boxplot.

- In a boxplot, when the median is closer to the left of the box and the whisker is shorter on the left end of the box, we say that the distribution is positively skewed (skewed right).
- Similarly, when the median is closer to the right of the box and the whisker is shorter on the right end of the box, we say that the distribution is negatively skewed (skewed left).
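The five-number summary behind a boxplot can be computed directly; here is a minimal sketch with NumPy (toy horsepower values; the 1.5 × IQR whisker rule shown is the common matplotlib/seaborn default):

```python
import numpy as np

horsepower = np.array([48, 62, 70, 88, 95, 102, 116, 145, 184, 262])

# Box edges: first quartile, median, third quartile
q1, median, q3 = np.percentile(horsepower, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme points within 1.5 * IQR of the box;
# anything beyond that is drawn as an outlier point
lower_whisker = horsepower[horsepower >= q1 - 1.5 * iqr].min()
upper_whisker = horsepower[horsepower <= q3 + 1.5 * iqr].max()

print(q1, median, q3)
print(lower_whisker, upper_whisker)
```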

sns.set_style('whitegrid')
plt.title('Boxplot: Horsepower')
plt.xlim(30, 300)
plt.xlabel('Horsepower')
sns.boxplot(data=df, x='horsepower', color='green');

Let’s see how we can compare groups with boxplots.
sns.boxplot(data=df, x='body_style', y='price', palette = "Set1")

Bar Graph
- A bar graph is generally used to show the counts of observations in each bin (or level, or group) of a categorical variable using bars.
- We can use the *countplot()* function of seaborn to plot a bar graph.
sns.countplot(data=df, x='body_style', palette = "Set1");

#We can also make the plot more granular by specifying the *hue* parameter to display counts for subgroups
sns.countplot(data=df, x='body_style', hue='fuel_type');
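Under the hood, a count plot is just category counts; the same numbers can be obtained with pandas (toy data here for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "body_style": ["sedan", "sedan", "hatchback", "wagon", "sedan"],
    "fuel_type": ["gas", "diesel", "gas", "gas", "gas"],
})

# Counts per category -- what countplot draws as bar heights
counts = toy["body_style"].value_counts()
print(counts)

# Counts per subgroup -- what the hue parameter splits the bars by
ct = pd.crosstab(toy["body_style"], toy["fuel_type"])
print(ct)
```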

Lineplot
Suppose your dataset has multiple y values for each x value. A line plot is a great way to visualize this. This type of data often shows up when we have data that evolves over time, for example, monthly data over several years. If we want to compare the individual months, then a line plot is a great option. This is sometimes called seasonality analysis.

- A line plot uses straight lines to connect individual data points to display a trend or pattern in the data.
- For example, seasonal effects and large changes over time.
- The *lineplot()* function of seaborn, by default, aggregates over multiple y values at each value of x and uses an estimate of the central tendency for the plot.
- *lineplot()* assumes that you are most often trying to draw y as a function of x. So, by default, it sorts the data by the x values before plotting.
- Note that, unlike barplots and histograms, line plots may not include a zero baseline.
- We create a line chart to emphasize changes in value rather than the magnitude of the values themselves, so a zero line is not meaningful.
# loading one of the example datasets available in seaborn
flights = sns.load_dataset("flights")
# creating a line plot
sns.lineplot(data=flights, x='month', y='passengers', ci=95);
# the ci parameter sets the confidence interval level (newer seaborn versions use errorbar instead)

- The light blue shaded area is the **confidence interval** of the y-value estimates for each x-axis value.
- The **confidence interval** is a range of values around that estimate that are believed to contain the true value of that estimate with a certain probability.
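Seaborn estimates this interval by bootstrapping; a minimal sketch of the idea with NumPy follows (the toy y values are made up, and the 1000 resamples and 95% level match seaborn's traditional defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

# Several y observations for a single x value
y = np.array([112, 118, 132, 129, 121, 135])

# Bootstrap: resample with replacement and record each resample's mean
boot_means = np.array([
    rng.choice(y, size=len(y), replace=True).mean()
    for _ in range(1000)
])

# The 95% confidence interval spans the 2.5th to 97.5th percentile of those means
low, high = np.percentile(boot_means, [2.5, 97.5])
print(round(low, 1), round(high, 1))
```

Seaborn repeats this at every x value and shades between the resulting lows and highs.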
#We can change the style of the lines by adding 'style' parameter to the function
# loading one of the example datasets available in seaborn
fmri = sns.load_dataset("fmri")
# creating the line plot
sns.lineplot(data=fmri, x="timepoint", y="signal", hue="region", style="region", ci=None);

We can also add markers at each observation to identify the groups more easily:
sns.lineplot(data=fmri, x="timepoint", y="signal", hue="region", style="region", ci=None, markers=True);

We will conclude Part-1 with the line plot and continue with a few more seaborn visualization techniques in Part-2. Stay tuned!