
Data visualization is a critical technique in interpreting and communicating complex data, transforming raw numbers into intuitive visual formats like graphs, charts, and maps. This approach leverages the human brain’s capability to process visual information more efficiently than text, making it invaluable in our data-rich world. It simplifies complex information, allowing for the easy identification of trends and patterns, essential in areas where rapid data comprehension and decision-making are crucial.
The strength of data visualization lies in its ability to make data accessible and understandable to a wide audience, including those without technical expertise. It not only aids in revealing hidden insights within datasets but also enhances decision-making efficiency by providing clear and concise visual representations. In summary, data visualization is more than an aesthetic enhancement of data; it’s an essential tool for effective data analysis and communication in various professional and academic fields, turning complex data into actionable knowledge.
This tutorial is broken into three parts. Part-1 covers the Python setup for this and many upcoming tutorials, along with an introduction to the matplotlib and seaborn libraries.
In Part-2, we will go a little deeper into seaborn and touch on plotly, which provides mouse-over details and 3D data visualization.
In Part-3, we will do a simple project based on food delivery service data visualization.
Setting up Python Environment
It is important to structure your data science project based on a certain standard so that you can easily maintain and modify it, and share it with your collaborators or development team.
I have put together a template repository that incorporates best practices for creating a maintainable and reproducible data science project. Assuming you have already installed Python on your machine, we will install the cookiecutter and poetry packages using pip:
pip install cookiecutter
pip install poetry
Create a project based on the template and answer the questions to set up your data science project folder:
cookiecutter https://github.com/wstanislaus/data_science_template
directory_name [project-name]: data_visualization
author_name [Your Name]: William Stanislaus <william@tagus.io>
compatible_python_versions [>=3.8]:
cd data_visualization
make
Executing the “make” command creates the Python virtual environment and installs all the required dependency packages based on the “pyproject.toml” file.
#Run poetry shell to activate virtual environment
poetry shell
To add one or more additional dependency packages, use the poetry add command, for example:
poetry add pandas numpy seaborn plotly
I assume you will be using Visual Studio Code as your source code editor and know how to set it up by downloading the required extensions, such as Python and Jupyter notebook support. Since we create our Python virtual environment in the project folder itself, Visual Studio Code can auto-detect the Python interpreter from the virtual environment and use it.
Brief Introduction to the Tools Used
Poetry
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you. Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.
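As a sketch, a minimal "pyproject.toml" for this kind of project might look like the following (the package names and version constraints are illustrative, not the template's exact contents):

```toml
[tool.poetry]
name = "data_visualization"
version = "0.1.0"
description = "Data visualization tutorial project"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = ">=3.8"
pandas = "^1.5"
seaborn = "^0.12"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

Running "make" (or "poetry install") resolves these constraints and records the exact versions in a lockfile for repeatable installs.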
Hydra
Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs – much like a Hydra with multiple heads.
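For example, the "config/main.yaml" file loaded later in this tutorial could look roughly like this (apart from the "data.raw" key, which is referenced below, the keys and paths are assumptions for illustration):

```yaml
# config/main.yaml (illustrative)
data:
  raw: data/raw/automobile.csv
  processed: data/processed/automobile.csv
```

Hydra composes such files into a single config object, so code can read "cfg.data.raw" instead of hard-coding file paths.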
Pre-Commit hooks
Git hook scripts help identify simple coding issues before review, streamlining the process. Sharing these scripts across projects was initially challenging due to differences in project structures. Pre-commit, a multi-language package manager for hooks, addresses this by managing hook installation and execution in any language, enhancing efficiency and consistency without needing root access.
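A typical ".pre-commit-config.yaml" for a project like this might look as follows (the chosen hooks and pinned revisions are illustrative; these particular repos and hook ids do exist upstream):

```yaml
# .pre-commit-config.yaml (illustrative)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
```

After "pre-commit install", these checks run automatically on every "git commit".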
Data Version Control (DVC)
DVC is a tool for data science that takes advantage of existing software engineering toolset. It helps machine learning teams manage large datasets, make projects reproducible, and collaborate better.
pdoc
The Python package pdoc provides types, functions, and a command-line interface for accessing public documentation of Python modules, and for presenting it in a user-friendly, industry-standard open format. It is best suited for small- to medium-sized projects with tidy, hierarchical APIs.
Git Repository
For your convenience, I have already created the Jupyter notebook project; check it out and execute make:
git clone https://github.com/wstanislaus/data_visualization.git
cd data_visualization
make
poetry shell
Python Pandas
Python Pandas is a powerful data manipulation and analysis tool, built on top of the Python programming language. It offers data structures and operations for manipulating numerical tables and time series, making it indispensable for data analysis and cleaning. With its intuitive syntax and rich functionality, Pandas simplifies complex data operations.
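As a quick, self-contained illustration of the kind of operations Pandas makes easy (the data here is made up, not the automobile dataset used below):

```python
import pandas as pd

# Build a small DataFrame from a dictionary of columns
cars = pd.DataFrame({
    "make": ["toyota", "bmw", "toyota", "audi"],
    "price": [9995, 36880, 7898, 17450],
})

# Filter rows with a boolean condition
cheap = cars[cars["price"] < 20000]

# Compute a group-wise aggregate
avg_price = cars.groupby("make")["price"].mean()

print(len(cheap))           # number of cars under 20000
print(avg_price["toyota"])  # average toyota price
```

The same filtering and grouping patterns apply directly to the automobile data we load next.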
# Libraries to help with reading and manipulating data
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Command to tell Python to actually display the graphs
%matplotlib inline
cfg = None
# hydra for configuration
from hydra import initialize, compose
with initialize(version_base=None, config_path="../config/"):
    cfg = compose(config_name='main.yaml')
Loading the test data file (automobile.csv) from configuration
#Loading the data
df = pd.read_csv(f'../{cfg.data.raw}')
#printing top 5 rows from the data frame
df.head()
Since the output contains 26 columns, we may not be able to capture all of them on the screen, so I will copy and paste it as an image.

[in] df.shape
[out] (201, 26)
# The data has 201 rows and 26 columns
[in] df.info()
[out]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 201 non-null int64
1 normalized_losses 201 non-null int64
2 make 201 non-null object
3 fuel_type 201 non-null object
4 aspiration 201 non-null object
5 number_of_doors 201 non-null object
6 body_style 201 non-null object
7 drive_wheels 201 non-null object
8 engine_location 201 non-null object
9 wheel_base 201 non-null float64
10 length 201 non-null float64
11 width 201 non-null float64
12 height 201 non-null float64
13 curb_weight 201 non-null int64
14 engine_type 201 non-null object
15 number_of_cylinders 201 non-null object
16 engine_size 201 non-null int64
17 fuel_system 201 non-null object
18 bore 201 non-null float64
19 stroke 201 non-null float64
20 compression_ratio 201 non-null float64
21 horsepower 201 non-null int64
22 peak_rpm 201 non-null int64
23 city_mpg 201 non-null int64
24 highway_mpg 201 non-null int64
25 price 201 non-null int64
dtypes: float64(7), int64(9), object(10)
memory usage: 41.0+ KB
#There are attributes of different types (*int*, *float*, *object*) in the data.
Let's look at basic statistics of the numeric columns in the data, such as mean, standard deviation, min, and max:
[in] df.describe().T
# The dot "T" at the end denotes transpose; with many columns, it helps by displaying the columns as rows.

- The car price ranges from 5118 to 45400 units.
- The car weight ranges from 1488 to 4066 units.
- The average mileage for the cars is 25 mpg.
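To see what `describe()` computes, here is a tiny self-contained example (toy numbers, not the automobile data):

```python
import pandas as pd

toy = pd.DataFrame({"price": [5000, 10000, 15000, 20000, 45000]})

# describe() reports count, mean, std, min, quartiles, and max per numeric column;
# .T turns the columns into rows for easier reading
stats = toy.describe().T

print(stats.loc["price", "mean"])
print(stats.loc["price", "min"])
print(stats.loc["price", "max"])
```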
Histogram
- A histogram is a univariate plot which helps us understand the distribution of a continuous numerical variable.
- It breaks the range of the continuous variable into intervals of equal length and then counts the number of observations in each interval.
- We will use the *histplot()* function of seaborn to create histograms.
sns.histplot(data=df, x='price')

Seaborn histogram supports various customization and visualization techniques. In addition to the bars, we can also add a density estimate by setting the kde parameter to True.
- Kernel Density Estimation, or KDE, visualizes the distribution of data over a continuous interval.
- When overlaid on a count histogram, the density curve is scaled to match the bars, roughly: total number of observations × bin width × probability density.
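To make the KDE idea concrete, here is a minimal sketch of a Gaussian kernel density estimate computed by hand with NumPy (the toy data and the bandwidth choice are arbitrary; seaborn selects its bandwidth automatically):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 5.0])
bandwidth = 0.5

def kde(x, observations, h):
    # Average of Gaussian kernels centred on each observation
    u = (x - observations[:, None]) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=0) / h

grid = np.linspace(0, 6, 121)
density = kde(grid, data, bandwidth)

# A density integrates to (approximately) 1 over its support
print(np.trapz(density, grid))
```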
sns.histplot(data=df, x='price', kde=True);

Histograms are intuitive, but they are hardly a good choice when we want to compare the distributions of several groups.
sns.histplot(data=df, x='price', hue='body_style', kde=True);

# Sometimes it might be better to use subplots using seaborn FacetGrid feature
g = sns.FacetGrid(df, col="body_style")
g.map(sns.histplot, "price");

Boxplot
- A boxplot, or a box-and-whisker plot, shows the distribution of numerical data and its skewness by displaying the data quartiles.
- It is also called a five-number summary plot, where the five-number summary includes the minimum value, first quartile, median, third quartile, and the maximum value.
- The *boxplot()* function of seaborn can be used to create a boxplot.

- In a boxplot, when the median is closer to the left of the box and the whisker is shorter on the left end of the box, we say that the distribution is positively skewed (skewed right).
- Similarly, when the median is closer to the right of the box and the whisker is shorter on the right end of the box, we say that the distribution is negatively skewed (skewed left).
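The five-number summary behind a boxplot can be computed directly; here is a minimal sketch with NumPy (toy horsepower values; the 1.5 × IQR whisker rule shown is the common matplotlib/seaborn default):

```python
import numpy as np

horsepower = np.array([48, 62, 70, 88, 95, 102, 116, 145, 184, 262])

# Box edges: first quartile, median, third quartile
q1, median, q3 = np.percentile(horsepower, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme points within 1.5 * IQR of the box;
# anything beyond that is drawn as an outlier point
lower_whisker = horsepower[horsepower >= q1 - 1.5 * iqr].min()
upper_whisker = horsepower[horsepower <= q3 + 1.5 * iqr].max()

print(q1, median, q3)
print(lower_whisker, upper_whisker)
```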

sns.set_style('whitegrid')
plt.title('Boxplot: Horsepower')
plt.xlim(30, 300)
plt.xlabel('Horsepower')
sns.boxplot(data=df, x='horsepower', color='green');

Let’s see how we can compare groups with boxplots.
sns.boxplot(data=df, x='body_style', y='price', palette = "Set1")

Bar Graph
- A bar graph is generally used to show the counts of observations in each bin (or level, or group) of a categorical variable using bars.
- We can use the *countplot()* function of seaborn to plot a bar graph.
sns.countplot(data=df, x='body_style', palette = "Set1");

#We can also make the plot more granular by specifying the *hue* parameter to display counts for subgroups
sns.countplot(data=df, x='body_style', hue='fuel_type');
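Under the hood, a count plot is just category counts; the same numbers can be obtained with pandas (toy data here for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "body_style": ["sedan", "sedan", "hatchback", "wagon", "sedan"],
    "fuel_type": ["gas", "diesel", "gas", "gas", "gas"],
})

# Counts per category -- what countplot draws as bar heights
counts = toy["body_style"].value_counts()
print(counts)

# Counts per subgroup -- what the hue parameter splits the bars by
ct = pd.crosstab(toy["body_style"], toy["fuel_type"])
print(ct)
```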

Lineplot
Suppose your dataset has multiple y values for each x value. A line plot is a great way to visualize this. This type of data often shows up when we have data that evolves over time, for example, monthly data over several years. If we want to compare the individual months, then a line plot is a great option. This is sometimes called seasonality analysis.

- A line plot uses straight lines to connect individual data points to display a trend or pattern in the data.
- For example, seasonal effects and large changes over time.
- The *lineplot()* function of seaborn, by default, aggregates over multiple y values at each value of x and uses an estimate of the central tendency for the plot.
- *lineplot()* assumes that you are most often trying to draw y as a function of x. So, by default, it sorts the data by the x values before plotting.
- Note that, unlike barplots and histograms, line plots may not include a zero baseline.
- We create a line chart to emphasize changes in value rather than the magnitude of the values themselves, so a zero line is not meaningful.
# loading one of the example datasets available in seaborn
flights = sns.load_dataset("flights")
# creating a line plot
sns.lineplot(data=flights, x='month', y='passengers', ci=95);
# the ci parameter sets the confidence interval level (newer seaborn versions use errorbar instead)

- The light blue shaded area is the **confidence interval** of the y-value estimates for each x-axis value.
- The **confidence interval** is a range of values around that estimate that are believed to contain the true value of that estimate with a certain probability.
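Seaborn estimates this interval by bootstrapping; a minimal sketch of the idea with NumPy follows (the toy y values are made up, and the 1000 resamples and 95% level match seaborn's traditional defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

# Several y observations for a single x value
y = np.array([112, 118, 132, 129, 121, 135])

# Bootstrap: resample with replacement and record each resample's mean
boot_means = np.array([
    rng.choice(y, size=len(y), replace=True).mean()
    for _ in range(1000)
])

# The 95% confidence interval spans the 2.5th to 97.5th percentile of those means
low, high = np.percentile(boot_means, [2.5, 97.5])
print(round(low, 1), round(high, 1))
```

Seaborn repeats this at every x value and shades between the resulting lows and highs.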
#We can change the style of the lines by adding 'style' parameter to the function
# loading one of the example datasets available in seaborn
fmri = sns.load_dataset("fmri")
# creating the line plot
sns.lineplot(data=fmri, x="timepoint", y="signal", hue="region", style="region", ci=None);

We can also add markers at each observation to identify the groups more easily:
sns.lineplot(data=fmri, x="timepoint", y="signal", hue="region", style="region", ci=None, markers=True);

We will conclude Part-1 with the line plot and continue with a few more seaborn visualization techniques in Part-2. Stay tuned!