EDVART

Edvart is an open-source Python library designed to simplify and streamline your exploratory data analysis (EDA) process. Edvart supports different levels of customization: from a default report generated in one line of code to a fully customized report, down to the level of the code generating the visualizations.

Key Features

One-line Reports: Generate a comprehensive set of pandas DataFrame visualizations using a single Python statement. Edvart supports:
- Data overview,
- Univariate analysis,
- Bivariate analysis,
- Multivariate analysis,
- Grouped analysis,
- Time series analysis.
Customizable Reports: Produce, iterate, and style detailed reports in Jupyter notebook and HTML formats.
Flexible API: From high-level simplicity in a single line of code to detailed control, choose the API level that fits your needs.
Interactive Visualizations: Many of the visualizations are interactive and can be used to explore the data in detail.

Installation

edvart is distributed as a Python package via PyPI. It can be installed using pip:

$ pip install edvart

We recommend using Poetry for dependency management. To add edvart into a Poetry environment, add the following snippet to the pyproject.toml environment definition file:

[tool.poetry.dependencies]
python = ">=3.8, <3.12"
edvart = "4.0.0"

Extras

Edvart has an optional dependency umap, which adds a plot called UMAP to Multivariate Analysis.

To install Edvart with the optional umap dependency via pip, run the following command:

$ pip install "edvart[umap]"

To install Edvart with the optional extra using Poetry, replace the snippet of the pyproject.toml environment file above with the following snippet:

[tool.poetry.dependencies]
python = ">=3.8, <3.12"
edvart = { version = "4.0.0", extras = ["umap"] }
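
Whether the optional dependency can be imported in the current environment can be checked with the standard library. The following is a minimal sketch: the helper name is_module_available is hypothetical, and it assumes that the extra is importable under the top-level module name umap. It is not necessarily how Edvart itself detects the dependency.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the optional extra is importable as the top-level module "umap".
if is_module_available("umap"):
    print("umap is installed")
else:
    print('umap is not installed; install it with: pip install "edvart[umap]"')
```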

Rendering Plotly Interactive Plots

Edvart uses Plotly to render interactive plots.

JupyterLab

To display interactive plots which use Plotly in JupyterLab, you need to install some JupyterLab extensions.

The jupyter-dash extension needs to be installed for Plotly plots to render correctly in JupyterLab. It can be installed as a Python package, e.g. via pip:

$ pip install jupyter-dash

To install jupyter-dash into a Poetry environment, add the following line under tool.poetry.dependencies in the pyproject.toml environment definition file:

jupyter-dash = "^0.4.2"

See https://plot.ly/python/getting-started/ for more information.

Visual Studio Code

The following extensions need to be installed to display Plotly interactive plots in Visual Studio Code notebooks:

- Jupyter is required to run Jupyter notebooks in Visual Studio Code.
- Jupyter Notebook Renderers is required to render Plotly plots in Visual Studio Code notebooks.

Usage

Quick Start

Show a Default Report in a Jupyter Notebook

import edvart


df = edvart.example_datasets.dataset_titanic()
edvart.DefaultReport(df).show()

Export the Report Code to a Jupyter Notebook

import edvart


df = edvart.example_datasets.dataset_titanic()
report = edvart.DefaultReport(df)
report.export_notebook(
    "titanic_report.ipynb",
    dataset_name="Titanic",
    dataset_description="Dataset of 891 of the real Titanic passengers.",
)

The exported notebook contains the code which generates the report. It can be modified to fine-tune the report. The code can be exported with different levels of detail (see Verbosity).
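
Because a Jupyter notebook is stored as a JSON file, the generated code can also be inspected programmatically. The following is a minimal sketch using only the standard library. It assumes the standard notebook layout with a top-level "cells" list, and the helper name code_cell_sources is hypothetical and not part of edvart.

```python
import json
from pathlib import Path
from typing import List


def code_cell_sources(notebook_path: str) -> List[str]:
    """Return the source of every code cell in a Jupyter notebook file."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows the source to be a string or a list of lines.
        sources.append(source if isinstance(source, str) else "".join(source))
    return sources
```

For example, code_cell_sources("titanic_report.ipynb") would return the code cells of the notebook exported above, one string per cell.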

Export a Report to HTML

import edvart


df = edvart.example_datasets.dataset_titanic()
report = edvart.DefaultReport(df)
report.export_html(
    html_filepath="titanic_report.html",
    dataset_name="Titanic",
    dataset_description="Dataset of 891 of the real Titanic passengers.",
)

A Report can be directly exported to HTML via the export_html() method.

Jupyter notebooks can be exported to other formats, including HTML, using a tool called jupyter nbconvert (https://nbconvert.readthedocs.io/en/latest/). This can be useful to create an HTML report from a notebook which was exported using the export_notebook() method.
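
The conversion can also be scripted from Python. The following is a minimal sketch, assuming that jupyter nbconvert is installed and available on the PATH. The helper names are hypothetical and not part of edvart. The --execute flag asks nbconvert to run the notebook before converting it, which is useful if the notebook has not been run yet.

```python
import subprocess
from typing import List


def nbconvert_html_command(notebook_path: str, execute: bool = False) -> List[str]:
    """Build the command line which converts a notebook to HTML."""
    command = ["jupyter", "nbconvert", "--to", "html"]
    if execute:
        # Run all cells first so that their outputs appear in the HTML file.
        command.append("--execute")
    command.append(notebook_path)
    return command


def convert_notebook_to_html(notebook_path: str, execute: bool = False) -> None:
    """Run nbconvert; raises CalledProcessError if the conversion fails."""
    subprocess.run(nbconvert_html_command(notebook_path, execute), check=True)
```

Calling convert_notebook_to_html("titanic_report.ipynb", execute=True) would, by default, write titanic_report.html next to the notebook.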

Customizing the Report

This section describes several concepts behind edvart and how a report can be customized.

Report Class

The Report class is central to the edvart API. A Report consists of sections, which can be added via methods of the Report class. The class DefaultReport is a subclass of Report, which includes a default set of sections.

With an instance of Report you can:

- Show the report directly in a Jupyter notebook using the show() method.
- Export the code which generates the report to a new Jupyter notebook using the export_notebook() method. The code can be exported with different levels of verbosity. The notebook containing the exported code can be modified to fine-tune the report.
- Export the output to an HTML file using the export_html() method. You can specify an nbconvert template to style the report.

Selection of Sections

You can add sections using the section-specific methods add_* (e.g. edvart.report.ReportBase.add_overview()) or the general method edvart.report.ReportBase.add_section() of the Report class.

# Include univariate and bivariate analysis
import edvart


df = edvart.example_datasets.dataset_titanic()
report = (
    edvart.Report(df)
    .add_univariate_analysis()
    .add_bivariate_analysis()
)

Configuration of Sections

Each section can also be configured. For example, you can define which columns should be used or omitted.

import edvart
from edvart.report_sections.dataset_overview import Overview
from edvart.report_sections.univariate_analysis import UnivariateAnalysis


df = edvart.example_datasets.dataset_titanic()
report = (
    edvart.Report(df)
    .add_section(Overview(columns=["PassengerId"]))
    .add_section(UnivariateAnalysis(columns=["Name", "Sex", "Age"]))
)

Subsections

Some sections are made of subsections. For those, you can configure which subsections are included.

import edvart
from edvart.report_sections.dataset_overview import Overview


df = edvart.example_datasets.dataset_titanic()
report = edvart.Report(df)

report.add_overview(
    subsections=[
        Overview.OverviewSubsection.QuickInfo,
        Overview.OverviewSubsection.DataPreview,
    ]
)

Verbosity

A Report can be exported to a Jupyter notebook containing the code which generates the report. The code can be exported with different levels of detail, referred to as verbosity.

It can be set on the level of the whole report or on the level of each section or subsection separately (see Configuration of Sections).

Specific verbosity overrides general verbosity, i.e. the verbosity set on a subsection overrides the verbosity set on a section, which overrides the verbosity set on the report.

EDVART supports three levels of verbosity:

- LOW: High-level functions for whole sections are exported, i.e. the output of each section is generated by a single function call. Suitable for small modifications such as changing parameters of the functions, adding commentary to the report, adding visualizations which are not in EDVART, etc.
- MEDIUM: For report sections which consist of subsections, each subsection is exported to a separate function call. Same as LOW for report sections which do not consist of subsections.
- HIGH: The definitions of (almost) all functions are exported. The functions can be modified or used as a starting point for custom analysis.
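
The override rule described above (subsection over section over report) can be sketched in plain Python. This is an illustration of the rule only, not EDVART's implementation: the Level enum and the effective_verbosity function are hypothetical stand-ins.

```python
from enum import Enum
from typing import Optional


class Level(Enum):
    """Hypothetical stand-in for the three verbosity levels."""

    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def effective_verbosity(
    report: Level,
    section: Optional[Level] = None,
    subsection: Optional[Level] = None,
) -> Level:
    """Return the most specific verbosity which was explicitly set."""
    if subsection is not None:
        return subsection
    if section is not None:
        return section
    return report


# Report-wide MEDIUM, overridden by HIGH on one section.
print(effective_verbosity(Level.MEDIUM, section=Level.HIGH))
```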

Examples

# Set default verbosity for all sections to Verbosity.MEDIUM
import edvart
from edvart import Verbosity


df = edvart.example_datasets.dataset_titanic()
edvart.DefaultReport(df, verbosity=Verbosity.MEDIUM).export_notebook("test-export.ipynb")

import edvart
from edvart import Verbosity


# Set report verbosity to Verbosity.MEDIUM but use verbosity Verbosity.HIGH for univariate analysis
df = edvart.example_datasets.dataset_titanic()
edvart.DefaultReport(
    df,
    verbosity=Verbosity.MEDIUM,
    verbosity_univariate_analysis=Verbosity.HIGH,
).export_notebook("exported-report.ipynb")

Reports for Time Series Datasets

The class TimeseriesReport is a version of the Report class which is specific for creating reports on time series datasets. There is also a DefaultTimeseriesReport, which contains a default set of sections, similar to DefaultReport.

The main differences compared to the report for tabular data are:

- a different set of default sections for DefaultTimeseriesReport,
- the TimeseriesAnalysis section, which contains visualizations for analyzing time series data,
- the assumption that the input data is time-indexed and sorted by time.

Helper functions edvart.utils.reindex_to_period() or edvart.utils.reindex_to_datetime() can be used to index a DataFrame by a pd.PeriodIndex or a pd.DatetimeIndex respectively.

Each column in the input data is treated as a separate time series.

import edvart
import pandas as pd


df = pd.DataFrame(
    data=[
        ["2018Q1", 120000, 11000],
        ["2018Q2", 150000, 13000],
        ["2018Q3", 100000, 12000],
        ["2018Q4", 110000, 11000],
        ["2019Q1", 120000, 13000],
        ["2019Q2", 110000, 12000],
        ["2019Q3", 120000, 14000],
        ["2019Q4", 160000, 12000],
        ["2020Q1", 130000, 12000],
    ],
    columns=["Quarter", "Revenue", "Profit"],
)

# Reindex using helper function to have 'Quarter' as index
df = edvart.utils.reindex_to_datetime(df, datetime_column="Quarter")
report_ts = edvart.DefaultTimeseriesReport(df)
report_ts.show()
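
Since the time series reports assume that the input is time-indexed and sorted by time, it can help to verify this before building a report. The following is a minimal sketch in plain pandas. The helper name is_time_indexed_and_sorted is hypothetical and not part of edvart, and the values are illustrative only.

```python
import pandas as pd


def is_time_indexed_and_sorted(df: pd.DataFrame) -> bool:
    """Return True if the index is datetime-like and in non-decreasing order."""
    has_time_index = isinstance(df.index, (pd.DatetimeIndex, pd.PeriodIndex))
    return bool(has_time_index and df.index.is_monotonic_increasing)


# Illustrative values only.
index = pd.to_datetime(["2018-01-01", "2018-04-01", "2018-07-01"])
df = pd.DataFrame({"Revenue": [120000, 150000, 100000]}, index=index)

print(is_time_indexed_and_sorted(df))             # sorted datetime index
print(is_time_indexed_and_sorted(df.iloc[::-1]))  # same rows in reverse order
```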

Report Sections

Dataset Overview

Provides essential information about the whole dataset, such as inferred data types, number of rows and columns, number of missing values, duplicates, etc.

See edvart.report.ReportBase.add_overview()

Univariate Analysis

Provides analysis of individual columns. The analysis differs based on the data type of the column.

See edvart.report.ReportBase.add_univariate_analysis()

Bivariate Analysis

Provides analysis of pairs of columns, such as correlations, scatter plots, contingency tables, etc.

See edvart.report.ReportBase.add_bivariate_analysis()

Multivariate Analysis

Provides analysis of all columns together.

It currently features PCA, parallel coordinates and parallel categories subsections. Additionally, a UMAP subsection is included if the extra dependency umap is installed.

See edvart.report.ReportBase.add_multivariate_analysis()

Group Analysis

Provides analysis of each column when grouped by a column or a set of columns. Includes basic information similar to dataset overview and univariate analysis, but on a per-group basis.

See edvart.report.ReportBase.add_group_analysis()
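
To illustrate what "on a per-group basis" means, the following minimal sketch computes summary statistics of one column separately for each group using plain pandas. It is not the code used by the Group Analysis section, and the values are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)

# Count and mean of "Age", computed separately for each value of "Sex".
age_by_sex = df.groupby("Sex")["Age"].agg(["count", "mean"])
print(age_by_sex)
```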

Timeseries Analysis

Provides analysis specific to time series.

Used with edvart.report.TimeseriesReport

See edvart.report.TimeseriesReport.add_timeseries_analysis()