edvart.report_sections package

Subpackages

Submodules

edvart.report_sections.bivariate_analysis module

class edvart.report_sections.bivariate_analysis.BivariateAnalysis(subsections: List[BivariateAnalysisSubsection] | None = None, verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, columns_pairs: List[Tuple[str, str]] | None = None, verbosity_correlations: Verbosity | None = None, verbosity_pairplot: Verbosity | None = None, verbosity_contingency_table: Verbosity | None = None, color_col: str | None = None)[source]

Bases: ReportSection

Generates the Bivariate analysis section of the report.

Contains an enum BivariateAnalysisSubsection of possible subsections.

Parameters:
  • subsections (List[BivariateAnalysisSubsection], optional) – List of subsections to include. All subsections in BivariateAnalysisSubsection are included by default.

  • verbosity (Verbosity (default = Verbosity.LOW)) – Generated code verbosity global to the Bivariate analysis sections. If subsection verbosities are None, then they will be overridden by this parameter.

  • columns (List[str], optional) – Columns on which to do bivariate analysis. If none of columns_x, columns_y and columns_pairs is specified, bivariate analysis is performed on all pairs of columns. Ignored if columns_x and columns_y are specified. Ignored in contingency tables if columns_x and columns_y, or columns_pairs, are specified. All columns are used by default.

  • columns_x (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified.

  • columns_y (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified.

  • columns_pairs (List[Tuple[str, str]], optional) – List of column pairs on which to perform bivariate analysis. Used primarily in contingency tables. If specified, columns, columns_x and columns_y are ignored in contingency tables. Ignored in pairplots and correlations unless none of columns, columns_x and columns_y is specified. In that case, the first elements of each pair are treated as columns_x and the second elements as columns_y in pairplots and correlations.

  • verbosity_correlations (Verbosity, optional) – Correlation plots subsection code verbosity.

  • verbosity_pairplot (Verbosity, optional) – Pairplot subsection code verbosity.

  • verbosity_contingency_table (Verbosity, optional) – Contingency table subsection code verbosity.

  • color_col (str, optional) – Name of the column to use for coloring the bivariate analysis subsections. Coloring is currently supported only in the pairplot.

Raises:

ValueError – If exactly one of columns_x, columns_y is specified.
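The interaction of columns, columns_x, columns_y and columns_pairs for correlations and pairplots can be sketched as follows. This is an illustration of the rules documented above, not edvart's implementation: resolve_xy is a hypothetical helper, and treating a plain columns list as both the x and the y columns is an assumption.

```python
from typing import List, Optional, Tuple


def resolve_xy(
    all_columns: List[str],
    columns: Optional[List[str]] = None,
    columns_x: Optional[List[str]] = None,
    columns_y: Optional[List[str]] = None,
    columns_pairs: Optional[List[Tuple[str, str]]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: choose x and y columns for correlations and pairplots."""
    # Exactly one of columns_x / columns_y is an error (see "Raises").
    if (columns_x is None) != (columns_y is None):
        raise ValueError("Specify both columns_x and columns_y, or neither.")
    # columns_x and columns_y take precedence over columns.
    if columns_x is not None and columns_y is not None:
        return columns_x, columns_y
    if columns is not None:
        return columns, columns
    # columns_pairs is consulted only when nothing else was given.
    if columns_pairs is not None:
        return [x for x, _ in columns_pairs], [y for _, y in columns_pairs]
    # Default: all columns against all columns.
    return all_columns, all_columns
```

The analysis would then run over the cartesian product of the two returned lists.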

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells.

Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

show(df: DataFrame) None[source]

Generates cell output of this section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output.

class edvart.report_sections.bivariate_analysis.BivariateAnalysisSubsection(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Enum of all implemented bivariate analysis subsections.

ContingencyTable = 2

CorrelationPlot = 0

PairPlot = 1

class edvart.report_sections.bivariate_analysis.ContingencyTable(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, columns_pairs: List[Tuple[str, str]] | None = None)[source]

Bases: Section

Generates the pairwise contingency tables subsection.

Parameters:
  • verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – Columns on which to show contingency tables. If columns_x, columns_y and columns_pairs are all unspecified, then analysis is performed on all pairs of columns. Otherwise ignored. Columns which contain only null values are always excluded. All columns are used by default.

  • columns_x (List[str], optional) – If specified, contingency tables are plotted for each pair in the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified. Columns which contain only null values are always excluded. Ignored if columns_pairs is specified.

  • columns_y (List[str], optional) – If specified, contingency tables are plotted for each pair in the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified. Columns which contain only null values are always excluded. Ignored if columns_pairs is specified.

  • columns_pairs (List[Tuple[str, str]], optional) – If specified, contingency tables are plotted for exactly the specified pairs. Columns which contain only null values are always excluded, i.e. if at least one of the columns in a pair is excluded, then the whole pair is excluded.
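The precedence between these arguments (columns_pairs first, then columns_x with columns_y, then columns) and the exclusion of all-null columns can be sketched in plain Python. The helper contingency_pairs is hypothetical, the data is modelled as a dict of lists rather than a DataFrame, and reading "all pairs of columns" as unordered pairs is an assumption.

```python
from itertools import combinations, product


def contingency_pairs(data, columns=None, columns_x=None, columns_y=None, columns_pairs=None):
    """Hypothetical helper: list the column pairs that would get a contingency table."""
    # A column is usable only if it holds at least one non-null value (None = null here).
    usable = {name for name, values in data.items() if any(v is not None for v in values)}
    if columns_pairs is not None:
        # Explicit pairs override every other column argument.
        pairs = list(columns_pairs)
    elif columns_x is not None and columns_y is not None:
        pairs = list(product(columns_x, columns_y))
    else:
        # Assumption: unordered pairs of distinct columns.
        pairs = list(combinations(columns if columns is not None else list(data), 2))
    # A pair is dropped as soon as one of its columns is excluded.
    return [(x, y) for x, y in pairs if x in usable and y in usable]
```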

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells. Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates contingency table in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

class edvart.report_sections.bivariate_analysis.CorrelationPlot(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None)[source]

Bases: Section

Generates the Correlation plot subsection.

Parameters:
  • verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – Columns for which to plot pairwise correlations. If columns_x and columns_y are unspecified, then analysis is performed on all pairs of columns. Otherwise ignored. All columns are used by default.

  • columns_x (List[str], optional) – If specified, correlation is plotted on the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified.

  • columns_y (List[str], optional) – If specified, correlation is plotted on the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified.

Raises:

ValueError – If exactly one of columns_x, columns_y is specified.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells. Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates correlation plots in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

class edvart.report_sections.bivariate_analysis.PairPlot(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, color_col: str | None = None)[source]

Bases: Section

Generates the Pairplot subsection.

Parameters:
  • verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – Columns on which to plot the pairplot. If columns_x and columns_y are unspecified, then analysis is performed on all pairs of columns. Otherwise ignored. All columns are used by default.

  • columns_x (List[str], optional) – If specified, the pairplot is plotted on the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified.

  • columns_y (List[str], optional) – If specified, the pairplot is plotted on the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified.

  • color_col (str, optional) – Name of the column to use for coloring the points and histograms in the pairplot.

Raises:

ValueError – If exactly one of columns_x, columns_y is specified.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells. Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates pairplot in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

edvart.report_sections.bivariate_analysis.contingency_table(df: DataFrame, columns1: str | List[str], columns2: str | List[str], include_total: bool = True, hide_zeros: bool = True, scaling_func: Callable[[ndarray], ndarray] = <ufunc 'cbrt'>, colormap: Any = 'Blues', size_factor: float | Literal['auto'] = 'auto', fontsize: float = 15) None[source]

Show a colored contingency table for the two specified columns or lists of columns.

Parameters:
  • df (pd.DataFrame) – Data to analyze.

  • columns1 (Union[str, List[str]]) – Name of column or list of column names to use in the vertical axis of the table.

  • columns2 (Union[str, List[str]]) – Name of column or list of column names to use in the horizontal axis of the table.

  • include_total (bool (default = True)) – Whether to include marginal counts.

  • hide_zeros (bool (default = True)) – Whether to hide zero values in the table for better readability.

  • scaling_func (Callable[["array-like"], "array-like"]) – Function to scale the values for the purpose of coloring for smaller spread. Cube root is used by default.

  • colormap (Any (default = "Blues")) – Colormap compatible with matplotlib/seaborn.

  • size_factor (float or "auto") – Size of each cell in the table. If “auto”, the cell size is automatically adjusted so that the numbers fit in the cells.

  • fontsize (float (default = 15)) – Size of the font for axis labels.
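To illustrate what include_total and scaling_func refer to, the sketch below builds a small contingency table with marginal counts and applies a cube root to the counts. It uses only the standard library; count_table and cube_root are hypothetical stand-ins, not the edvart function, which takes a DataFrame and renders a colored table.

```python
from collections import Counter


def count_table(rows, include_total=True):
    """Hypothetical helper: count (vertical value, horizontal value) co-occurrences."""
    counts = Counter(rows)
    row_labels = sorted({r for r, _ in counts})
    col_labels = sorted({c for _, c in counts})
    # Missing combinations count as 0 (Counter returns 0 for unseen keys).
    table = {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}
    if include_total:
        # Marginal counts: one extra column and one extra row named "Total".
        for r in row_labels:
            table[r]["Total"] = sum(counts[(r, c)] for c in col_labels)
        table["Total"] = {c: sum(counts[(r, c)] for r in row_labels) for c in col_labels}
        table["Total"]["Total"] = sum(counts.values())
    return table


def cube_root(value):
    """Compress large counts so that cell colors are less dominated by outliers."""
    return value ** (1 / 3)
```

With a cube root, counts of 1 and 1000 map to color values of about 1 and 10, so small cells stay distinguishable next to large ones.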

edvart.report_sections.bivariate_analysis.contingency_tables(df: DataFrame, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, columns_pairs: List[Tuple[str, str]] | None = None, table_threshold: int = 30) None[source]

Displays a contingency table for each pair of columns.

Parameters:
  • df (pd.DataFrame) – Data based on which to create a contingency table.

  • columns (List[str], optional) – Which columns to generate pair-wise contingency tables for. Columns with more than table_threshold unique values are excluded. Columns which contain only null values are excluded. To override the excluded columns, specify columns_pairs. Ignored if columns_x and columns_y or columns_pairs is specified.

  • columns_x (List[str], optional) – If specified, contingency tables are plotted for each pair in the cartesian product of columns_x and columns_y. Columns with more than table_threshold unique values are excluded. Columns which contain only null values are excluded. If columns_x is specified, then columns_y must also be specified. Ignored if columns_pairs is specified.

  • columns_y (List[str], optional) – If specified, contingency tables are plotted for each pair in the cartesian product of columns_x and columns_y. Columns with more than table_threshold unique values are excluded. Columns which contain only null values are excluded. If columns_y is specified, then columns_x must also be specified. Ignored if columns_pairs is specified.

  • columns_pairs (List[Tuple[str, str]], optional) – If specified, contingency tables are plotted for exactly the specified pairs.

  • table_threshold (int (default = 30)) – Maximum number of unique values for a column to be used in contingency table. If non-positive, no columns are excluded according to this criterion.
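The table_threshold rule can be sketched as a simple filter over unique-value counts. The helper within_threshold is hypothetical and works on a dict of lists; whether edvart counts null values as a unique value is not stated here and is left open.

```python
def within_threshold(data, table_threshold=30):
    """Hypothetical helper: keep columns with at most table_threshold unique values."""
    # A non-positive threshold disables the filter entirely.
    if table_threshold <= 0:
        return list(data)
    return [name for name, values in data.items() if len(set(values)) <= table_threshold]
```

An identifier-like column with one distinct value per row is dropped by default, while a low-cardinality category column is kept.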

edvart.report_sections.bivariate_analysis.default_correlations() Dict[str, Callable[[DataFrame], DataFrame]][source]

Default dictionary of bivariate statistics that will be calculated for numerical columns.

Returns:

A dictionary assigning correlation functions to correlation names. Dictionary signature: ‘CorrelationName’: corr_func

Return type:

dict

edvart.report_sections.bivariate_analysis.plot_correlation(df: DataFrame, corr_name: str, corr_func: Callable[[DataFrame], DataFrame], columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, size_factor: float = 0.7, font_size: float = 15, color_map: Any | None = None) None[source]

Plots a correlation heatmap.

Parameters:
  • df (pd.DataFrame) – Data based on which to plot correlation.

  • corr_name (str) – Name of correlation being plotted.

  • corr_func (Callable[[pd.DataFrame], pd.DataFrame]) – Correlation function to be used.

  • columns (Optional[List[str]]) – List of columns of df to analyze. All numeric columns of df are used by default.

  • size_factor (float (default = 0.7)) – Size of each cell in the table.

  • font_size (float (default = 15)) – Size of axis labels of the correlation plot.

  • color_map (Any, optional) – Color map compatible with matplotlib/seaborn to use for the correlation plot. A divergent red-blue color map is used by default.

Raises:

ValueError – If exactly one of columns_x, columns_y is specified.

edvart.report_sections.bivariate_analysis.plot_correlations(df: DataFrame, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, pearson: bool = True, spearman: bool = True, kendall: bool = True) None[source]

Plots multiple correlations.

Parameters:
  • df (pd.DataFrame) – Data based on which to plot correlations.

  • columns (Optional[List[str]]) – List of columns of df to analyze. All numeric columns of df are used by default.

  • pearson (bool (default = True)) – If True, Pearson correlation will be plotted.

  • spearman (bool (default = True)) – If True, Spearman correlation will be plotted.

  • kendall (bool (default = True)) – If True, Kendall correlation will be plotted.

edvart.report_sections.bivariate_analysis.plot_pairplot(df: DataFrame, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, allow_categorical: bool = False, color_col: str | None = None) None[source]

Plot a pairplot for each pair of columns.

Parameters:
  • df (pd.DataFrame) – Data frame for which to plot pairplot.

  • columns (List[str], optional) – Which columns to plot pairplot for. All columns that are not categorical and are not boolean are used by default.

  • columns_x (List[str], optional) – If specified, the pairplot is plotted on the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified.

  • columns_y (List[str], optional) – If specified, the pairplot is plotted on the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified.

  • allow_categorical (bool (default = False)) – Whether to allow plotting of categorical columns. If False (default), then even explicitly specified columns will be excluded. If True, categorical columns are still excluded by default, unless explicitly specified via columns/columns_x/columns_y.

  • color_col (str, optional) – Name of the column to use for coloring the points and histograms in the pairplot.

Raises:

ValueError – If exactly one of columns_x, columns_y is specified.

edvart.report_sections.bivariate_analysis.show_bivariate_analysis(df: DataFrame, subsections: List[BivariateAnalysisSubsection] | None = None, columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, columns_pairs: List[Tuple[str, str]] | None = None, color_col: str | None = None) None[source]

Generates bivariate analysis for df.

Parameters:
  • df (pd.DataFrame) – Data to be analyzed

  • subsections (List[BivariateAnalysisSubsection], optional) – Subsections to include in the analysis. All subsections are included by default.

  • columns (List[str], optional) – Columns on which to do bivariate analysis. If none of columns_x, columns_y and columns_pairs is specified, bivariate analysis is performed on all pairs of columns. Ignored if columns_x and columns_y are specified. Ignored in contingency tables if columns_x and columns_y, or columns_pairs, are specified. All columns are used by default.

  • columns_x (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified.

  • columns_y (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified.

  • columns_pairs (List[Tuple[str, str]], optional) – List of column pairs on which to perform bivariate analysis. Used primarily in contingency tables. If specified, columns, columns_x and columns_y are ignored in contingency tables. Ignored in pairplots and correlations unless none of columns, columns_x and columns_y is specified. In that case, the first elements of each pair are treated as columns_x and the second elements as columns_y in pairplots and correlations.

  • color_col (str, optional) – Name of column to use for coloring of the bivariate analysis subsections. Coloring is currently supported in pairplot.

edvart.report_sections.code_string_formatting module

edvart.report_sections.code_string_formatting.code_dedent(input_string: str) str[source]

Removes from each line the leading whitespace that is common to all lines.

Parameters:

input_string (str) – Input string with lines.

Returns:

input_string with common leading whitespace removed from each line.

Return type:

str
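Python's standard library offers the same kind of operation in textwrap.dedent. The sketch below uses it to show the effect of removing common leading whitespace; whether code_dedent is built on textwrap.dedent is not stated here.

```python
import textwrap

# Every non-blank line shares a four-space margin.
source = """
    def f(x):
        return x + 1
"""

# Only the shared margin is removed; the relative indentation of the body survives.
dedented = textwrap.dedent(source)
```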

edvart.report_sections.code_string_formatting.dedecorate(input_string: str) str[source]

Removes all decorators from the beginning of a function source.

Parameters:

input_string (str) – Input function source.

Returns:

input_string with beginning lines starting with ‘@’ removed.

Return type:

str
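A minimal sketch of the described behavior follows. strip_decorators is a hypothetical stand-in for dedecorate, and it assumes every decorator fits on a single line.

```python
def strip_decorators(source: str) -> str:
    """Hypothetical stand-in: drop leading lines that start with '@'."""
    lines = source.splitlines(keepends=True)
    start = 0
    # Skip decorator lines only at the very beginning of the source.
    while start < len(lines) and lines[start].lstrip().startswith("@"):
        start += 1
    return "".join(lines[start:])
```

Lines containing '@' further down, such as a matrix multiplication, are left untouched because scanning stops at the first non-decorator line.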

edvart.report_sections.code_string_formatting.get_code(code_object: Any) str[source]

Gets the source code of code object and formats it.

Parameters:

code_object (Any) – Object from which to extract code (function, method)

Returns:

Formatted code

Return type:

str

edvart.report_sections.code_string_formatting.total_dedent(input_string: str) str[source]

Removes all white space from the beginning of each line.

Parameters:

input_string (str) – Input string with lines.

Returns:

input_string with no whitespace at the beginning of each line.

Return type:

str
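In contrast to code_dedent, this removes all indentation, not just the shared margin. A minimal sketch, with strip_leading_whitespace as a hypothetical stand-in for total_dedent:

```python
def strip_leading_whitespace(text: str) -> str:
    """Hypothetical stand-in: remove leading spaces and tabs from every line."""
    return "\n".join(line.lstrip() for line in text.split("\n"))
```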

edvart.report_sections.dataset_overview module

class edvart.report_sections.dataset_overview.ConstantOccurrence(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Generates table with occurrence of a constant in each column.

Parameters:
  • verbosity (Verbosity) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns to count constant occurrence in. If None, all columns are used.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds code cells which calculate constant occurrence table to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

show(df: DataFrame) None[source]

Generates constant occurrence table in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

class edvart.report_sections.dataset_overview.DataPreview(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Generates data preview (head, tail, sample) subsection.

Parameters:
  • verbosity (Verbosity) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns to preview. If None, all columns are used.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds dataframe preview cells to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Renders data preview tables in the calling notebook.

Parameters:

df (pd.DataFrame) – Data to preview.

class edvart.report_sections.dataset_overview.DataTypes(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Generates data types inference subsection.

Parameters:
  • verbosity (Verbosity) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns for which to infer data type. If None, all columns are used.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds data type inference cells to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Renders a table with inferred data types in the calling notebook.

Parameters:

df (pd.DataFrame) – Data for which to infer data types.

class edvart.report_sections.dataset_overview.DuplicateRows(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Counts number of duplicated rows.

Parameters:
  • verbosity (Verbosity) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns to consider when counting. If None, all columns are used.
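The duplicate count and percentage can be sketched without pandas. count_duplicates is a hypothetical helper operating on a list of tuples, and it assumes the first occurrence of a row is not counted as a duplicate (the default semantics of pandas DataFrame.duplicated); the source does not say which convention edvart uses.

```python
from collections import Counter


def count_duplicates(rows, column_indices=None):
    """Hypothetical helper: count repeated rows and their share in percent."""
    if column_indices is not None:
        # Restrict the comparison to the selected columns.
        rows = [tuple(row[i] for i in column_indices) for row in rows]
    counts = Counter(rows)
    # Every occurrence beyond the first one counts as a duplicate.
    duplicated = sum(n - 1 for n in counts.values())
    percentage = 100 * duplicated / len(rows) if rows else 0.0
    return duplicated, percentage
```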

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds code cells which count the number of duplicated rows to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

show(df: DataFrame) None[source]

Displays a table with duplicated row count and percentage in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

class edvart.report_sections.dataset_overview.MissingValues(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Generates missing values percentages table for each column of the dataframe.

Parameters:
  • verbosity (Verbosity) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns for which to count missing values. If None, all columns are used.
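As a rough illustration of the table's content, the per-column percentage of missing values can be computed as below. missing_percentages is a hypothetical helper over a dict of lists, with None standing in for a missing value.

```python
def missing_percentages(data):
    """Hypothetical helper: percentage of None values per non-empty column."""
    return {
        name: 100 * sum(v is None for v in values) / len(values)
        for name, values in data.items()
        if values
    }
```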

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds code cells which calculate missing values percentage table to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

show(df: DataFrame) None[source]

Generates missing values percentages table in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

class edvart.report_sections.dataset_overview.Overview(subsections: List[OverviewSubsection] | None = None, verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, verbosity_quick_info: Verbosity | None = None, verbosity_data_types: Verbosity | None = None, verbosity_data_preview: Verbosity | None = None, verbosity_missing_values: Verbosity | None = None, verbosity_rows_with_missing_value: Verbosity | None = None, verbosity_constant_occurrence: Verbosity | None = None, verbosity_duplicate_rows: Verbosity | None = None)[source]

Bases: ReportSection

Generates the Overview section of the report.

Contains an enum OverviewSubsection of possible subsections.

Parameters:
  • subsections (List[OverviewSubsection], optional) – List of subsections to include in the Overview section. All subsections in OverviewSubsection are used by default.

  • verbosity (Verbosity) – Generated code verbosity global to the Overview sections. If subsection verbosities are None, then they will be overridden by this parameter.

  • columns (List[str], optional) – Columns on which to do overview analysis. All columns are used by default.

  • verbosity_quick_info (Verbosity, optional) – Quick info subsection code verbosity.

  • verbosity_data_types (Verbosity, optional) – Data types subsection code verbosity.

  • verbosity_data_preview (Verbosity, optional) – Data preview subsection code verbosity.

  • verbosity_missing_values (Verbosity, optional) – Missing values subsection code verbosity.

  • verbosity_rows_with_missing_value (Verbosity, optional) – Rows with missing value subsection code verbosity.

  • verbosity_constant_occurrence (Verbosity, optional) – Constant values subsection code verbosity.

  • verbosity_duplicate_rows (Verbosity, optional) – Duplicate rows subsection code verbosity.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells.

Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

show(df: DataFrame) None[source]

Generates cell output of this section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output.

class edvart.report_sections.dataset_overview.OverviewSubsection(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Enum of possible subsections of the Overview section.

ConstantOccurrence = 6

DataPreview = 3

DataTypes = 2

DuplicateRows = 7

MissingValues = 4

QuickInfo = 1

RowsWithMissingValue = 5

class edvart.report_sections.dataset_overview.QuickInfo(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Generates the Quick info subsection.

Parameters:
  • verbosity (Verbosity) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns to consider in quick info. If None, all columns are used.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Renders the quick info table in the calling notebook.

Parameters:

df (pd.DataFrame) – Data to use for the table.

class edvart.report_sections.dataset_overview.RowsWithMissingValue(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Counts number of rows with at least one value missing.

Parameters:
  • verbosity (Verbosity) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns to consider when counting. If None, all columns are used.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds code cells which count the number of rows with missing value to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

show(df: DataFrame) None[source]

Generates a table with missing value row count and percentage in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

edvart.report_sections.dataset_overview.constant_occurrence(df: DataFrame, columns: List[str] | None = None, constant: Any = 0) None[source]

Displays a table with occurrence of a constant in each column.

By default, occurrences of 0 are counted.

Parameters:
  • df (pd.DataFrame) – Dataframe for which to calculate constant values occurrence.

  • columns (Optional[List[str]], optional) – Subset of columns for which to calculate constant values occurrence. If None, all columns of df are used.

  • constant (Any) – Constant for which to check occurrence in df, by default 0.
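As a rough illustration of what such a table contains, the sketch below counts how often a constant appears in each column and expresses it as a percentage. It uses a plain dictionary of lists as a stand-in for a DataFrame, and the name constant_occurrence_sketch is hypothetical; it is not edvart's implementation.

```python
from typing import Any, Dict, List, Optional, Tuple


def constant_occurrence_sketch(
    data: Dict[str, List[Any]],
    columns: Optional[List[str]] = None,
    constant: Any = 0,
) -> Dict[str, Tuple[int, float]]:
    """Return {column: (count, percentage)} of cells equal to `constant`."""
    # Fall back to every column when no subset is given.
    selected = columns if columns is not None else list(data)
    result = {}
    for name in selected:
        values = data[name]
        count = sum(1 for value in values if value == constant)
        percentage = 100 * count / len(values) if values else 0.0
        result[name] = (count, percentage)
    return result


data = {"a": [0, 1, 0, 3], "b": [5, 0, 7, 8]}
occurrence = constant_occurrence_sketch(data)
```

Passing a different constant, e.g. constant=7, counts that value instead of 0.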

edvart.report_sections.dataset_overview.data_preview(df: DataFrame, columns: List[str] | None = None, n_head: int = 5, n_tail: int = 5, n_sample: int = 5) None[source]

Renders data preview tables in the calling notebook.

Parameters:
  • df (pd.DataFrame) – Data which to preview.

  • columns (List[str], optional) – Columns of df to preview. All columns of df are used by default.

  • n_head (int) – Number of rows from the start of df to render. If None, this preview is not rendered.

  • n_tail (int) – Number of rows from the end of df to render. If None, this preview is not rendered.

  • n_sample (int) – Size of the random sample of df to render. If None, this preview is not rendered.

edvart.report_sections.dataset_overview.data_types(df: DataFrame, columns: List[str] | None = None) None[source]

Renders a table with inferred data types in the calling notebook.

Parameters:
  • df (pd.DataFrame) – Data for which to infer data types.

  • columns (List[str], optional) – List of columns for which to infer data type. All columns of df are used by default.

edvart.report_sections.dataset_overview.duplicate_row_count(df: DataFrame, columns: List[str] | None = None) None[source]

Displays a table with duplicated row count and percentage.

Parameters:
  • df (pd.DataFrame) – Dataframe for which to count duplicated rows.

  • columns (Optional[List[str]], optional) – List of columns to consider when counting. If None, all columns are used.
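The following sketch shows one way a duplicated row count and percentage can be computed. It assumes that a row counts as a duplicate when it repeats an earlier row on the considered columns (the convention of pandas DataFrame.duplicated with its default settings); whether edvart counts duplicates exactly this way is an assumption. Rows are plain dictionaries and the function name is hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


def duplicate_row_count_sketch(
    rows: List[Dict[str, Any]],
    columns: Optional[List[str]] = None,
) -> Tuple[int, float]:
    """Return (count, percentage) of rows that repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        # Compare rows only on the selected columns (all columns by default).
        names = columns if columns is not None else sorted(row)
        key = tuple(row[name] for name in names)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    percentage = 100 * duplicates / len(rows) if rows else 0.0
    return duplicates, percentage


rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "x"},
    {"a": 1, "b": "y"},
    {"a": 2, "b": "y"},
]
all_columns = duplicate_row_count_sketch(rows)
only_a = duplicate_row_count_sketch(rows, columns=["a"])
```

Restricting the comparison to fewer columns can only increase the number of duplicates, as the two calls illustrate.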

edvart.report_sections.dataset_overview.missing_value_row_count(df: DataFrame, columns: List[str] | None = None) None[source]

Displays a table with missing value row count and percentage.

Parameters:
  • df (pd.DataFrame) – Dataframe for which to count missing value rows.

  • columns (Optional[List[str]], optional) – List of columns to consider when counting. If None, all columns are used.
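A minimal sketch of the quantity reported here, assuming that a value is missing when it is None or NaN. Rows are plain dictionaries standing in for DataFrame rows, and both helper names are hypothetical.

```python
import math
from typing import Any, Dict, List, Optional, Tuple


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_value_row_count_sketch(
    rows: List[Dict[str, Any]],
    columns: Optional[List[str]] = None,
) -> Tuple[int, float]:
    """Return (count, percentage) of rows with at least one missing value."""
    count = 0
    for row in rows:
        names = columns if columns is not None else list(row)
        if any(is_missing(row[name]) for name in names):
            count += 1
    percentage = 100 * count / len(rows) if rows else 0.0
    return count, percentage


rows = [
    {"a": 1.0, "b": "x"},
    {"a": float("nan"), "b": "y"},
    {"a": 3.0, "b": None},
    {"a": None, "b": None},
]
all_columns = missing_value_row_count_sketch(rows)
only_a = missing_value_row_count_sketch(rows, columns=["a"])
```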

edvart.report_sections.dataset_overview.missing_values(df: DataFrame, columns: List[str] | None = None, bar_plot: bool = True, bar_plot_figsize: Tuple[int, int] = (15, 6), bar_plot_title: str = 'Missing Values Percentage of Each Column', bar_plot_ylim: float = 0, bar_plot_color: str = '#FFA07A', **bar_plot_args: Any) None[source]

Displays a table of missing values percentages for each column of df and a bar plot of the percentages.

Parameters:
  • df (pd.DataFrame) – Dataframe for which to calculate missing values.

  • columns (Optional[List[str]], optional) – Subset of columns for which to calculate missing values percentage. If None, all columns of df are used.

  • bar_plot (bool (default = True)) – Whether to also display a bar plot visualizing missing values percentages for each column.

  • bar_plot_figsize (Tuple[int, int]) – Width and height of the bar plot.

  • bar_plot_title (str) – Title of the bar plot.

  • bar_plot_ylim (float) – Bar plot y axis bottom limit.

  • bar_plot_color (str) – Color of bars in the bar plot in hex format.

  • bar_plot_args (Any) – Additional kwargs passed to pandas.Series.plot.bar.

edvart.report_sections.dataset_overview.quick_info(df: DataFrame, columns: List[str] | None = None, additional_rows: Dict[str, Any] | None = None) None[source]

Renders a quick info table about df in the calling notebook.

Parameters:
  • df (pd.DataFrame) – Data which to analyze.

  • columns (List[str], optional) – List of columns of df to analyze. All columns of df are used by default.

  • additional_rows (Dict[str, Any], optional) – Additional custom rows to add to the table.

edvart.report_sections.dataset_overview.show_overview(df: DataFrame, subsections: List[OverviewSubsection] | None = None, columns: List[str] | None = None) None[source]

Generates overview analysis for df.

Parameters:
  • df (pd.DataFrame) – Data to be analyzed

  • subsections (List[OverviewSubsection], optional) – Subsections to include in the overview.

  • columns (List[str], optional) – Subset of columns of df to consider in overview, by default all columns are used.

edvart.report_sections.group_analysis module

class edvart.report_sections.group_analysis.GroupAnalysis(groupby: str | List[str], verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, show_within_group_statistics: bool = True, show_group_missing_values: bool = True, show_group_distribution_plots: bool = True)[source]

Bases: Section

Generate the group analysis section of the report.

Parameters:
  • groupby (Union[str, List[str]]) – Name of column or list of column names to group by.

  • verbosity (Verbosity (default = Verbosity.LOW)) – Generated code verbosity global to the Group analysis sections. If subsection verbosities are None, then they will be overridden by this parameter.

  • columns (List[str], optional) – Columns on which to do group analysis. All columns are used by default.

  • show_within_group_statistics (bool (default = True)) – Whether to show per-group statistics.

  • show_group_missing_values (bool (default = True)) – Whether to show per-group missing values.

  • show_group_distribution_plots (bool (default = True)) – Whether to show per-group distribution plots.

Raises:

ValueError – If groupby columns are not a subset of the columns of the input DataFrame df.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Add cells to the list of cells.

Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [“import pandas as pd”, “import numpy as np”]

Return type:

List[str]

show(df: DataFrame) None[source]

Generates cell output of this section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output.

edvart.report_sections.group_analysis.default_group_descriptive_stats() Dict[str, Callable[[Series], float]][source]

Descriptive statistic functions.

Returns:

A dictionary of statistic function names and functions.

Return type:

Dict[str, Callable[[pd.Series], float]]

edvart.report_sections.group_analysis.default_group_quantile_stats() Dict[str, Callable[[Series], float]][source]

Quantile statistic functions.

Returns:

A dictionary of statistic function names and functions.

Return type:

Dict[str, Callable[[pd.Series], float]]

edvart.report_sections.group_analysis.group_barplot(df: DataFrame, groupby: List[str], column: str, group_count_threshold: int = 20, conditional_probability: bool = True, xaxis_tickangle: float = 0, alpha: float = 0.5)[source]

Display a per-group barplot for a column.

Parameters:
  • df (pd.DataFrame) – Data to analyze.

  • groupby (List[str]) – List of column names to group by.

  • column (str) – Which column to analyze.

  • group_count_threshold (int (default = 20)) – Maximum number of unique values in column to plot. If the number of unique values is higher, a warning will be issued and plot will not be shown.

  • conditional_probability (bool (default = True)) – If True, conditional probability conditioned on group will be displayed, otherwise conditional frequency will be displayed.

  • xaxis_tickangle (float (default = 0)) – Rotation angle of ticks on the x axis.

  • alpha (float (default = 0.5)) – Opacity of bars in the plot.
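To make the conditional_probability option concrete, the sketch below computes, for each group, the share of rows taking each value of the analyzed column, i.e. an estimate of P(value | group). With conditional_probability set to False the raw per-group counts would be shown instead. Rows are plain dictionaries and the function name is hypothetical; this is not the plotting code itself.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List, Tuple


def conditional_probabilities(
    rows: List[Dict[str, Any]],
    groupby: List[str],
    column: str,
) -> Dict[Tuple[Any, ...], Dict[Any, float]]:
    """Return {group: {value: P(value | group)}}."""
    counts: Dict[Tuple[Any, ...], Counter] = defaultdict(Counter)
    for row in rows:
        group = tuple(row[name] for name in groupby)
        counts[group][row[column]] += 1
    # Normalize each group's counts by the size of that group.
    return {
        group: {value: n / sum(counter.values()) for value, n in counter.items()}
        for group, counter in counts.items()
    }


rows = [
    {"sex": "f", "smoker": "yes"},
    {"sex": "f", "smoker": "no"},
    {"sex": "f", "smoker": "no"},
    {"sex": "f", "smoker": "no"},
    {"sex": "m", "smoker": "yes"},
    {"sex": "m", "smoker": "no"},
]
probabilities = conditional_probabilities(rows, ["sex"], "smoker")
```

Normalizing within each group makes groups of different sizes comparable in the same plot.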

edvart.report_sections.group_analysis.group_missing_values(df: DataFrame, groupby: str | List[str], columns: List[str] | None = None, round_decimals: int = 2, heatmap: bool = True, foreground_colormap: str = 'bone', background_colormap: str = 'OrRd', sort: bool = True, sort_by: List[str] | None = None, ascending: bool = False) None[source]

Display per-group number and percentage of missing values in each column.

Parameters:
  • df (pd.DataFrame) – Data to display missing values for.

  • groupby (str or List[str]) – Name of column or list of column names to group by.

  • columns (List[str], optional) – Subset of columns to analyze. All columns except those for grouping are used by default.

  • round_decimals (int (default = 2)) – Number of decimals to round displayed results to.

  • heatmap (bool (default = True)) – Whether to color missing value percentage cells according to the corresponding value.

  • foreground_colormap (str (default = "bone")) – Color map of the foreground.

  • background_colormap (str (default = "OrRd")) – Color map of the background.

  • sort (bool (default = True)) – Whether to sort the results.

  • sort_by (List[str], optional) – List of column names to sort the results by. Sort by all columns by default.

  • ascending (bool (default = False)) – If True, sort in ascending order, otherwise sort in descending order.

Raises:

ValueError – If groupby columns are not a subset of the columns of the input DataFrame df.

edvart.report_sections.group_analysis.overlaid_histograms(df: DataFrame, groupby: List[str], column: str, bins: int | None = None, density: bool = True, alpha: float = 0.5)[source]

Show per-group distribution histograms in a single plot overlaid over each other.

Parameters:
  • df (pd.DataFrame) – Data to analyze.

  • groupby (List[str]) – List of column names to group by.

  • column (str) – Name of column to analyze.

  • bins (int, optional) – Number of bins in the histogram. If None, the number of bins will be inferred using Freedman-Diaconis bin number inference.

  • density (bool (default = True)) – If True, histograms will be normalized to display density.

  • alpha (float) – Opacity of individual histograms.
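For reference, the Freedman-Diaconis rule sets the bin width to 2 · IQR / n^(1/3) and derives the number of bins from the data range. The sketch below is a standard-library illustration of that rule; the quartile interpolation method and the fallback to a single bin when the IQR is zero are assumptions, so the result may differ slightly from what edvart infers.

```python
import math
import statistics
from typing import List


def freedman_diaconis_bins(values: List[float]) -> int:
    """Number of histogram bins according to the Freedman-Diaconis rule."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate case: no spread in the middle half of the data.
        return 1
    bin_width = 2 * iqr / n ** (1 / 3)
    return math.ceil((max(values) - min(values)) / bin_width)


n_bins = freedman_diaconis_bins(list(range(1, 11)))
```

Because the rule relies on the interquartile range rather than the standard deviation, it is less sensitive to outliers than rules based on variance.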

edvart.report_sections.group_analysis.show_group_analysis(df: DataFrame, groupby: str | List[str], columns: List[str] | None = None, show_within_group_statistics: bool = True, show_group_missing_values: bool = True, show_distribution_plots: bool = True) None[source]

Generate group analysis for df.

Parameters:
  • df (pd.DataFrame) – Data to be analyzed.

  • groupby (Union[str, List[str]]) – Name of column or list of column names to group by.

  • columns (List[str], optional) – Subset of columns to analyze. All columns except those used for grouping are used by default.

  • show_within_group_statistics (bool (default = True)) – Whether to show per-group statistics.

  • show_group_missing_values (bool (default = True)) – Whether to show per-group missing values.

  • show_distribution_plots (bool (default = True)) – Whether to show per-group distribution plots.

Raises:

ValueError – If groupby columns are not a subset of the columns of the input DataFrame df.

edvart.report_sections.group_analysis.within_group_descriptive_stats(df: DataFrame, groupby: List[str], column: str, round_decimals: int = 2)[source]

Display within-group descriptive statistics for a column.

Parameters:
  • df (pd.DataFrame) – Data to display statistics for.

  • groupby (List[str]) – List of column names to group data by.

  • column (str) – Which column to display statistics for.

  • round_decimals (int (default = 2)) – Number of decimals to round displayed results to.

edvart.report_sections.group_analysis.within_group_quantile_stats(df: DataFrame, groupby: List[str], column: str, round_decimals: int = 2) None[source]

Display within-group quantile statistics for a column.

Parameters:
  • df (pd.DataFrame) – Data to display statistics for.

  • groupby (List[str]) – List of column names to group data by.

  • column (str) – Which column to display statistics for.

  • round_decimals (int (default = 2)) – Number of decimals to round displayed results to.

edvart.report_sections.group_analysis.within_group_stats(df: DataFrame, groupby: List[str], column: str, stats: Dict[str, Callable[[Series], float]] | None = None, round_decimals: int = 2) None[source]

Display within-group statistics for a column of df grouped by one or more other columns.

Parameters:
  • df (pd.DataFrame) – Data to display statistics for.

  • groupby (List[str]) – List of column names to group by.

  • column (str) – Name of column to display statistics for.

  • stats (Dict[str, Callable[[pd.Series], float]], optional) – A dictionary of statistic function names and functions. If None, default_group_quantile_stats() and default_group_descriptive_stats() will be used.

  • round_decimals (int (default = 2)) – Number of decimals to round displayed results to.
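The stats argument maps a display name to a function that reduces a column to a single number. The sketch below illustrates the idea with plain lists in place of pd.Series: rows are grouped, and every function in the dictionary is applied to each group's values. The statistic names and the function within_group_stats_sketch are hypothetical.

```python
import statistics
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple


def within_group_stats_sketch(
    rows: List[Dict[str, Any]],
    groupby: List[str],
    column: str,
    stats: Dict[str, Callable[[List[float]], float]],
    round_decimals: int = 2,
) -> Dict[Tuple[Any, ...], Dict[str, float]]:
    """Return {group: {statistic name: rounded value}} for one column."""
    groups: Dict[Tuple[Any, ...], List[float]] = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in groupby)
        groups[key].append(row[column])
    return {
        key: {name: round(func(values), round_decimals) for name, func in stats.items()}
        for key, values in groups.items()
    }


rows = [
    {"species": "a", "length": 1.0},
    {"species": "a", "length": 2.0},
    {"species": "b", "length": 10.0},
    {"species": "b", "length": 20.0},
    {"species": "b", "length": 40.0},
]
stats = {"Mean": statistics.fmean, "Median": statistics.median}
table = within_group_stats_sketch(rows, ["species"], "length", stats)
```

Any callable with the same shape can be added to the dictionary, which is what makes the statistics customizable.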

edvart.report_sections.multivariate_analysis module

class edvart.report_sections.multivariate_analysis.MultivariateAnalysis(subsections: List[MultivariateAnalysisSubsection] | None = None, verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, verbosity_pca: Verbosity | None = None, verbosity_umap: Verbosity | None = None, verbosity_parallel_coordinates: Verbosity | None = None, verbosity_parallel_categories: Verbosity | None = None, color_col: str | None = None)[source]

Bases: ReportSection

Generates the Multivariate analysis section of the report.

Contains an enum MultivariateAnalysisSubsection of possible subsections.

Parameters:
  • subsections (List[MultivariateAnalysisSubsection], optional) – List of subsections to include. All subsections in MultivariateAnalysisSubsection are included by default.

  • verbosity (Verbosity) – Generated code verbosity global to the Multivariate sections. If subsection verbosities are None, then they will be overridden by this parameter.

  • columns (List[str], optional) – Columns on which to do multivariate analysis. All columns of df will be used by default.

  • verbosity_pca (Verbosity, optional) – Principal component analysis subsection code verbosity.

  • verbosity_umap (Verbosity, optional) – UMAP subsection code verbosity.

  • verbosity_parallel_coordinates (Verbosity, optional) – Parallel coordinates subsection code verbosity.

  • verbosity_parallel_categories (Verbosity, optional) – Parallel categories subsection code verbosity.

  • color_col (str, optional) – Name of the column according to which to color points in the sections. Both numerical and categorical columns are supported.
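The interplay between the section-wide verbosity and the per-subsection verbosities can be pictured as a simple fallback: a subsection verbosity left at None takes the section-wide value. The sketch below uses a stand-in enum with the documented member values; resolve_verbosity is a hypothetical helper, not part of edvart.

```python
from enum import IntEnum
from typing import Optional


class Verbosity(IntEnum):
    """Stand-in with the same members and values as the documented enum."""

    LOW = 1
    MEDIUM = 2
    HIGH = 3


def resolve_verbosity(
    global_verbosity: Verbosity,
    subsection_verbosity: Optional[Verbosity] = None,
) -> Verbosity:
    """Use the subsection verbosity if given, else the section-wide one."""
    if subsection_verbosity is None:
        return global_verbosity
    return subsection_verbosity


pca_verbosity = resolve_verbosity(Verbosity.LOW, Verbosity.HIGH)
umap_verbosity = resolve_verbosity(Verbosity.LOW)
```

Here the PCA subsection would be exported with HIGH verbosity, while UMAP inherits LOW from the section.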

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells.

Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

show(df: DataFrame) None[source]

Generates cell output of this section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output.

class edvart.report_sections.multivariate_analysis.MultivariateAnalysisSubsection(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Enum of all implemented multivariate analysis subsections.

PCA = 0
ParallelCategories = 3
ParallelCoordinates = 2
UMAP = 1
class edvart.report_sections.multivariate_analysis.PCA(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, color_col: str | None = None, standardize: bool = True, interactive: bool = True)[source]

Bases: Section

Generates the Principal component analysis subsection.

Parameters:
  • verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – Columns on which to perform PCA. Only numeric columns can be used. All numeric columns of df are used by default.

  • color_col (str, optional) – Name of column according to values of which to color points in the first vs second component plot. Can be both numeric and categorical. By default, all points have the same color.

  • standardize (bool (default = True)) – Whether to standardize the data to zero mean and unit variance before applying PCA.

  • interactive (bool (default = True)) – Whether to plot the first vs second principal component as an interactive plot. The interactive plot also shows labels for each point on hover.
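Standardizing to zero mean and unit variance means subtracting the column mean and dividing by the column standard deviation. The sketch below does this for a single column with the standard library. It uses the population standard deviation, which is an assumption about the exact convention, and returns zeros for a constant column; the function name is hypothetical.

```python
import statistics
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no variance; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


column = [2.0, 4.0, 6.0, 8.0]
z_scores = standardize(column)
```

Without this step, columns measured on larger scales would dominate the principal components.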

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells. Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates the PCA section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

class edvart.report_sections.multivariate_analysis.ParallelCategories(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, color_col: str | None = None)[source]

Bases: Section

Generates the Parallel categories subsection.

Parameters:
  • verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – Columns for which to generate parallel categories. All categorical columns with at most nunique_max unique values are used by default.

  • color_col (str, optional) – Name of column determining colors within categories. Both numeric and categorical columns are supported.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells. Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates the Parallel categories section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

class edvart.report_sections.multivariate_analysis.ParallelCoordinates(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, color_col: str | None = None)[source]

Bases: Section

Generates the Parallel coordinates subsection.

Parameters:
  • verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – Columns for which to generate parallel coordinates. All columns which are either numeric or categorical with at most nunique_max unique values are used by default.

  • color_col (str, optional) – Name of column determining color of the coordinate lines. Both numeric and categorical columns are supported.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells. Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates the Parallel coordinates section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

edvart.report_sections.multivariate_analysis.parallel_categories(df: DataFrame, columns: List[str] | None = None, hide_columns: List[str] | None = None, drop_na: bool = False, color_col: str | None = None) None[source]

Generate the parallel categories interactive plot.

Parameters:
  • df (pd.DataFrame) – Data for which to generate the parallel categories plot.

  • columns (List[str], optional) – Columns for which to generate the plot.

  • hide_columns (List[str], optional) – Columns to exclude from plotting.

  • drop_na (bool (default = False)) – Whether to drop NaNs in data.

  • color_col (str, optional) – Which column to use for coloring of lines. Can be both numeric and categorical.

edvart.report_sections.multivariate_analysis.parallel_coordinates(df: DataFrame, columns: List[str] | None = None, hide_columns: List[str] | None = None, drop_na: bool = False, color_col: str | None = None, show_colorscale: bool = True) None[source]

Generate the parallel coordinates interactive plot.

Parameters:
  • df (pd.DataFrame) – Data for which to generate the parallel coordinates plot.

  • columns (List[str], optional) – Columns for which to generate the plot. All columns are used by default.

  • hide_columns (List[str], optional) – Columns to exclude from plotting.

  • drop_na (bool (default = False)) – Whether to drop NaNs in data.

  • color_col (str, optional) – Which column to use for coloring of lines. Can be both numeric and categorical.

  • show_colorscale (bool (default = True)) – Whether to show a color scale on the right side of the plot.

edvart.report_sections.multivariate_analysis.pca_explained_variance(df: DataFrame, columns: List[str] | None = None, standardize: bool = True, show_grid: bool = True, figsize: Tuple[float, float] = (10, 7)) None[source]

Plot the variance explained by each principal component.

Parameters:
  • df (pd.DataFrame) – Data on which to perform PCA.

  • columns (List[str], optional) – Which columns to perform PCA on. All columns will be used by default.

  • standardize (bool (default = True)) – Whether to standardize the data to zero mean and unit variance before applying PCA.

  • show_grid (bool (default = True)) – Whether to show a grid in the plot.

  • figsize (Tuple[float, float] (default = (10, 7))) – Size of the plot.
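The variance explained by a principal component is the corresponding eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The sketch below works this out for the two-column case, where the eigenvalues of the 2×2 covariance matrix have a closed form. It is a standard-library illustration with a hypothetical function name, not the code behind the plot.

```python
import math
import statistics
from typing import List


def explained_variance_ratio_2d(x: List[float], y: List[float]) -> List[float]:
    """Explained variance ratios of the two principal components of (x, y)."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    trace = sxx + syy
    if trace == 0:
        raise ValueError("Both columns are constant; there is no variance to explain.")
    determinant = sxx * syy - sxy ** 2
    # Eigenvalues of a symmetric 2x2 matrix: trace / 2 +- sqrt(trace^2 / 4 - det).
    spread = math.sqrt(max(trace ** 2 / 4 - determinant, 0.0))
    eigenvalues = [trace / 2 + spread, trace / 2 - spread]
    return [value / trace for value in eigenvalues]


correlated = explained_variance_ratio_2d([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
uncorrelated = explained_variance_ratio_2d([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0])
```

For perfectly correlated columns the first component explains essentially all of the variance, while for uncorrelated columns of equal variance each component explains half.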

edvart.report_sections.multivariate_analysis.pca_first_vs_second(df: DataFrame, columns: List[str] | None = None, color_col: str | None = None, interactive: bool = True, standardize: bool = True, figsize: Tuple[float, float] = (12, 12), opacity: float = 0.8) None[source]

Plot a 2D scatter of first vs second PCA components.

Parameters:
  • df (pd.DataFrame) – Data to perform PCA on.

  • columns (List[str], optional) – Which columns to perform PCA on. All columns will be used by default.

  • color_col (str, optional) – Name of column according to values of which to color points in the plot. Can be both numeric and categorical. By default, all points have the same color.

  • interactive (bool (default = True)) – Whether to show an interactive plot.

  • standardize (bool (default = True)) – Whether to standardize the data to zero mean and unit variance before applying PCA.

  • figsize (Tuple[float, float] (default = (12, 12))) – Size of the plot.

  • opacity (float (default = 0.8)) – Opacity of the points in the plot. Higher means more opaque (less transparent).

edvart.report_sections.multivariate_analysis.show_multivariate_analysis(df: DataFrame, subsections: List[MultivariateAnalysisSubsection] | None = None, columns: List[str] | None = None, color_col: str | None = None) None[source]

Generates multivariate analysis for df.

Parameters:
  • df (pd.DataFrame) – Data to be analyzed

  • subsections (List[MultivariateAnalysisSubsection], optional) – Subsections to include in the analysis. All subsections are included by default.

  • columns (List[str], optional) – Subset of columns of df to consider in multivariate analysis. All numeric columns are used by default.

  • color_col (str, optional) – Name of the column according to which to color points in the sections. Both numeric and categorical columns are supported.

edvart.report_sections.section_base module

class edvart.report_sections.section_base.ReportSection(subsections: List[Section], verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Base class for top level report sections.

Contains subsections which are also of subtype Section and implement the report generation.

Parameters:
  • subsections (List[Section]) – List of subsections that should be contained in this top level section

  • verbosity (Verbosity) – The verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns that are considered in the analysis of the section, all columns are used by default

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells.

Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates cell output of this section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output.

class edvart.report_sections.section_base.Section(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: ABC

Base class for report sections and subsections.

Parameters:
  • verbosity (Verbosity) – The verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns that are considered in the analysis of the section. All columns are used by default.

Notes

To create a new section, subclass this class and implement the abstract methods.

  • __init__ initializes your object and accepts verbosity and columns (in addition to any other section specific parameters).

    • verbosity is an enum representing the detail level of the exported code.

    • columns is a list of names of columns which will be used in the analysis.

  • required_imports returns a list of lines of code that import the packages required by the analysis which will get added to a cell at the top of the exported notebook. Keep in mind that different verbosity levels usually require a different set of imports.

  • add_cells(cells, df) appends cells to the list cells.
    This method is used to build the code for the exported notebook.
    • To create a markdown cell, pass a string to nbformat.v4.new_markdown_cell()

    • To create a code cell pass a string to nbformat.v4.new_code_cell()

    • Finally append the objects returned by the functions mentioned above to cells

    • Keep in mind that the code created should conform to verbosity

  • show renders the analysis in place in the calling notebook.
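The notes above can be sketched as follows. To keep the example free of dependencies, SectionSketch is a stand-in that mirrors the documented interface rather than edvart's Section, the cells are hand-built dictionaries with the nbformat v4 cell keys instead of the results of nbformat.v4.new_markdown_cell() and nbformat.v4.new_code_cell(), and RowCount is a hypothetical section.

```python
from abc import ABC, abstractmethod
from enum import IntEnum
from typing import Any, Dict, List, Optional


class Verbosity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class SectionSketch(ABC):
    """Stand-in mirroring the documented interface; not edvart's Section."""

    def __init__(
        self,
        verbosity: Verbosity = Verbosity.LOW,
        columns: Optional[List[str]] = None,
    ) -> None:
        self.verbosity = verbosity
        self.columns = columns

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def required_imports(self) -> List[str]: ...

    @abstractmethod
    def add_cells(self, cells: List[Dict[str, Any]], df: Any) -> None: ...

    @abstractmethod
    def show(self, df: Any) -> None: ...


class RowCount(SectionSketch):
    """Hypothetical section that reports the number of rows."""

    @property
    def name(self) -> str:
        return "Row count"

    def required_imports(self) -> List[str]:
        # Import lines the exported code relies on.
        return ["import pandas as pd"]

    def add_cells(self, cells: List[Dict[str, Any]], df: Any) -> None:
        # One markdown cell for the title, one code cell for the analysis.
        cells.append({"cell_type": "markdown", "metadata": {}, "source": f"## {self.name}"})
        cells.append(
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                "source": "len(df)",
            }
        )

    def show(self, df: Any) -> None:
        print(f"{self.name}: {len(df)}")


cells: List[Dict[str, Any]] = []
RowCount().add_cells(cells, df=[1, 2, 3])
```

A real section would additionally vary the generated code with self.verbosity, for example by emitting one call per column at higher levels.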

abstract add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells.

Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries. The dictionaries can be generated with nbformat.v4.new_code_cell() and/or nbformat.v4.new_markdown_cell().

  • df (pd.DataFrame) – Data for which to add the cells.

get_title(section_level: int) str[source]

Gets the title of the section in markdown format.

Includes a hyperlink id tag that is used by the table of contents.

Parameters:

section_level (int) – The level of the section. Adds # according to it. Highest level sections should have it set to 1.

Returns:

Title of the section in markdown format.

Return type:

str
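A minimal sketch of how such a title could be assembled: the heading depth comes from repeating # section_level times, and an anchor carrying the section's uid gives the table of contents something to link to. The exact anchor markup used by edvart is not documented here, so the HTML below is an assumption and the function name is hypothetical.

```python
def get_title_sketch(name: str, uid: str, section_level: int) -> str:
    """Build a markdown heading with an anchor that a table of contents can target."""
    heading = "#" * section_level
    # The anchor format is an assumption made for illustration.
    return f'{heading} {name}<a id="{uid}"></a>'


title = get_title_sketch("Overview", "overview", 2)
```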

abstract property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

abstract required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. [‘import pandas as pd’, ‘import numpy as np’]

Return type:

List[str]

abstract show(df: DataFrame) None[source]

Generates cell output in the calling notebook using IPython.display.display().

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

property uid: str

Identifier of the section used for generating table of contents.

Should be unique across sections.

Returns:

Unique identifier of the section

Return type:

str

class edvart.report_sections.section_base.Verbosity(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Verbosity of the exported code.
LOW

A single function call generates the entire section.

MEDIUM

Function calls to parameterizable functions are generated for each column separately in separate cells.

HIGH

Similar to MEDIUM, but in addition, function definitions are generated, column data type inference and default statistics become customizable.

HIGH = 3
LOW = 1
MEDIUM = 2

edvart.report_sections.table_of_contents module

class edvart.report_sections.table_of_contents.TableOfContents(include_subsections: bool)[source]

Bases: object

Generates the Table of Contents section of the report.

Parameters:

include_subsections (bool) – A boolean controlling whether the subsections should be included in the table of contents. However, they won’t be included in an exported notebook created by report’s export_notebook function.

add_cells(sections: List[Section], cells: List[Dict[str, Any]]) None[source]

Adds table of contents cells to the list of cells. The subsections won’t be included.

Parameters:
  • sections (List[Section]) – List of sections that should be included in the table of contents.

  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries.
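The cell dictionaries are not specified further in this reference. A rough sketch, assuming nbformat-style markdown cells and sections that expose the documented name and uid properties; `FakeSection` and `add_toc_cell` are hypothetical and do not reproduce the library's implementation:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FakeSection:
    """Stand-in exposing only the documented name and uid attributes."""
    name: str
    uid: str


def add_toc_cell(sections: List[FakeSection], cells: List[Dict[str, Any]]) -> None:
    """Hypothetical: append one markdown cell linking to each section by uid."""
    lines = ["# Table of Contents"]
    lines += [f"* [{s.name}](#{s.uid})" for s in sections]
    # Assumed cell layout: the nbformat v4 markdown cell structure.
    cells.append({"cell_type": "markdown", "metadata": {}, "source": "\n".join(lines)})


cells: List[Dict[str, Any]] = []
add_toc_cell([FakeSection("Overview", "overview"), FakeSection("UMAP", "umap")], cells)
print(cells[0]["source"])
```

As in the documented signature, the list of cells is extended in place and nothing is returned.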

show(sections: List[Section]) None[source]

Generates table of contents’ cell output in the calling notebook.

Parameters:

sections (List[Section]) – List of sections that should be included in the table of contents.

edvart.report_sections.umap module

class edvart.report_sections.umap.UMAP(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None, color_col: str | None = None, interactive: bool = True, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean')[source]

Bases: Section

Plot a 2-dimensional UMAP embedding scatter plot.

Parameters:
  • verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – Columns to use in computing the UMAP embedding. Only numeric columns can be used. All numeric columns are used by default.

  • color_col (str, optional) – Name of column to color points on the plot by. Can be both numeric and categorical. By default, all points have the same color.

  • interactive (bool (default = True)) – Whether to plot an interactive plot. The interactive plot also shows labels for each point on hover.

  • n_neighbors (int (default = 15)) – UMAP embedding parameter controlling the balance between focusing on the local structure vs global structure. A low value means more focus on local structure.

  • min_dist (float (default = 0.1)) – UMAP embedding parameter controlling how tightly points in the embedding are clumped together. A low value results in tighter clumping, which can show clusters or other similar structures, while a higher value encourages preservation of topological structure present in the input data.

  • metric (str (default = "euclidean")) – UMAP embedding parameter controlling how distance is computed in the ambient space of the input data. Many different metrics are available, see UMAP documentation for a complete list.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds cells to the list of cells. Cells can be either code cells or markdown cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook, e.g. ['import pandas as pd', 'import numpy as np'].

Return type:

List[str]

show(df: DataFrame) None[source]

Generates the UMAP plot section in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

edvart.report_sections.umap.plot_umap(df: DataFrame, columns: List[str] | None = None, color_col: str | None = None, interactive: bool = True, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', random_state: int = 42, figsize: Tuple[float, float] = (12, 12), opacity: float = 0.8, show_message: bool = True) None[source]

Plot a 2-dimensional UMAP embedding scatter plot.

Parameters:
  • df (pd.DataFrame) – Data to analyze.

  • columns (List[str], optional) – Columns to use in computing the UMAP embedding. Only numeric columns can be used. All numeric columns are used by default.

  • color_col (str, optional) – Name of column to color points on the plot by. Can be both numeric and categorical. By default, all points have the same color.

  • interactive (bool (default = True)) – Whether to plot an interactive plot. The interactive plot also shows labels for each point on hover.

  • n_neighbors (int (default = 15)) – UMAP embedding parameter controlling the balance between focusing on the local structure vs global structure. A low value means more focus on local structure.

  • min_dist (float (default = 0.1)) – UMAP embedding parameter controlling how tightly points in the embedding are clumped together. A low value results in tighter clumping, which can show clusters or other similar structures, while a higher value encourages preservation of topological structure present in the input data.

  • metric (str (default = "euclidean")) – UMAP embedding parameter controlling how distance is computed in the ambient space of the input data. Many different metrics are available, see UMAP documentation for a complete list.

  • random_state (int (default = 42)) – Random state for reproducibility of results, since UMAP is stochastic. If None, a random seed is used.

  • figsize (Tuple[float, float] (default = (12, 12))) – Size of the resulting plot in inches.

  • opacity (float (default = 0.8)) – Opacity of the points drawn in the scatter plot.

  • show_message (bool (default = True)) – Whether to show a message informing the user to tune the embedding parameters.

edvart.report_sections.univariate_analysis module

class edvart.report_sections.univariate_analysis.UnivariateAnalysis(verbosity: Verbosity = Verbosity.LOW, columns: List[str] | None = None)[source]

Bases: Section

Generates univariate analysis section of the report.

Parameters:
  • verbosity (Verbosity) – The verbosity of the code generated in the exported notebook.

  • columns (List[str], optional) – List of columns for which to do univariate analysis. All columns are used by default.

add_cells(cells: List[Dict[str, Any]], df: DataFrame) None[source]

Adds univariate analysis cells to the list of cells.

Parameters:
  • cells (List[Dict[str, Any]]) – List of generated notebook cells which are represented as dictionaries

  • df (pd.DataFrame) – Data for which to add the cells.

property name: str

Name of the section.

Returns:

Name of the section.

Return type:

str

required_imports() List[str][source]

Returns a list of imports to be put at the top of a generated notebook.

Returns:

List of import strings to be added at the top of the generated notebook.

Return type:

List[str]

show(df: DataFrame) None[source]

Generates univariate analysis cell output in the calling notebook.

Parameters:

df (pd.DataFrame) – Data based on which to generate the cell output

edvart.report_sections.univariate_analysis.bar_plot(series: Series, relative_count: bool = False, figsize: Tuple[float, float] = (20, 7), plotting_threshold: int = 50, **bar_plot_args: Any) None[source]

Plots a bar plot visualizing frequencies of series elements.

Parameters:
  • series (pd.Series) – Categorical series.

  • relative_count (bool) – If True, the frequencies will be normalized by the series length.

  • figsize (Tuple[float, float]) – Size of the bar plot.

  • plotting_threshold (int) – If the number of unique values in the series is greater than this, no plot is created and a warning is issued instead.

  • bar_plot_args (Any) – Additional kwargs passed to pandas.Series.plot.bar.
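The interplay of relative_count and plotting_threshold can be sketched without pandas. The following is an illustrative approximation under the parameter descriptions above, not the library's code; `bar_heights` is a hypothetical helper that returns the bar heights instead of drawing them:

```python
import warnings
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def bar_heights(
    values: Sequence[Hashable],
    relative_count: bool = False,
    plotting_threshold: int = 50,
) -> Optional[Dict[Hashable, float]]:
    """Hypothetical helper: frequencies to plot, or None if too many categories."""
    counts = Counter(values)
    if len(counts) > plotting_threshold:
        # Mirrors the documented behavior: warn instead of plotting.
        warnings.warn("Too many unique values; bar plot skipped.")
        return None
    if relative_count:
        total = len(values)
        return {value: count / total for value, count in counts.items()}
    return {value: float(count) for value, count in counts.items()}


print(bar_heights(["a", "a", "b", "c"], relative_count=True))
```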

edvart.report_sections.univariate_analysis.default_descriptive_statistics()[source]

Return default descriptive statistics.

Returns:

Dictionary mapping statistic names to the functions that compute them.

Return type:

dict

edvart.report_sections.univariate_analysis.default_quantile_statistics()[source]

Return default quantile statistics.

Returns:

Dictionary mapping statistic names to the functions that compute them.

Return type:

dict

edvart.report_sections.univariate_analysis.histogram(series: Series, bins: int | str | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | float | complex | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None, density: bool = False, box_plot: bool = True, figsize: Tuple[float, float] = (20, 7), distplot_kwargs: Dict[str, Any] | None = None, boxplot_kwargs: Dict[str, Any] | None = None) None[source]

Visualizes distribution of series with a histogram.

Parameters:
  • series (pd.Series) – Numerical series.

  • bins (int or str or array_like, optional) –

    If bins is an int, it defines the number of equal-width bins in the range of the series. If bins is a string, it defines the method used to calculate the optimal bin width. If bins is an array, it defines the bin edges.

    Can be any valid input for parameter bins of numpy.histogram_bin_edges. https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges

    By default, the number of bins is inferred based on the input data.

  • density (bool (default = False)) – If True, the area of the histogram bars will sum up to 1.

  • box_plot (bool (default = True)) – If True, a horizontal box plot will be added above the histogram to visualize quartiles.

  • figsize (Tuple[float, float] (default = (20, 7))) – Size of the figure of the visualization.

  • distplot_kwargs (Dict[str, Any], optional) – Additional keyword arguments passed to seaborn.distplot

  • boxplot_kwargs (Dict[str, Any], optional) – Additional keyword arguments passed to seaborn.boxplot
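For the integer case of bins only, the equal-width edges can be computed by hand. This is a simplified sketch; `equal_width_edges` is a hypothetical helper, and numpy.histogram_bin_edges additionally handles the string and array cases as well as degenerate ranges:

```python
from typing import List, Sequence


def equal_width_edges(values: Sequence[float], bins: int) -> List[float]:
    """Hypothetical helper: edges of `bins` equal-width bins spanning the data range."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    # bins intervals need bins + 1 edges.
    return [low + i * width for i in range(bins + 1)]


print(equal_width_edges([0.0, 3.5, 10.0], bins=5))
```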

edvart.report_sections.univariate_analysis.numeric_statistics(series: Series, descriptive_stats: Dict[str, Callable] | None = None, quantile_stats: Dict[str, Callable] | None = None, thousand_separator: str = ' ') None[source]

Generates tables with statistics for numeric series.

Parameters:
  • series (pd.Series) – Numeric series.

  • descriptive_stats (Dict[str, Callable], optional) – Descriptive statistics computed for the series. Dictionary format: {'statistic name': stat_func(pd.Series) -> float}. If None, the dictionary returned by default_descriptive_statistics() is used.

  • quantile_stats (Dict[str, Callable], optional) – Quantile statistics computed for the series. Dictionary format: {'statistic name': stat_func(pd.Series) -> float}. If None, the dictionary returned by default_quantile_statistics() is used.

  • thousand_separator (str) – Thousands separator for numbers in the tables. Defaults to a space.
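The dictionary format for custom statistics can be illustrated with plain Python. In this sketch a list stands in for the pd.Series, and `compute_stats` is a hypothetical helper that returns formatted strings rather than rendering tables; the two-decimal formatting is an assumption:

```python
import statistics
from typing import Callable, Dict, Sequence


def compute_stats(
    values: Sequence[float],
    stats: Dict[str, Callable[[Sequence[float]], float]],
    thousand_separator: str = " ",
) -> Dict[str, str]:
    """Hypothetical helper: apply each statistic and format with a separator."""
    return {
        name: f"{func(values):,.2f}".replace(",", thousand_separator)
        for name, func in stats.items()
    }


# Same shape as the documented format: {'statistic name': stat_func}.
custom_stats = {
    "Mean": statistics.mean,
    "Median": statistics.median,
    "Range": lambda v: max(v) - min(v),
}
print(compute_stats([1200.0, 3400.0, 560000.0], custom_stats))
```

A dictionary shaped like `custom_stats`, with functions that accept a pd.Series, is what descriptive_stats and quantile_stats expect.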

edvart.report_sections.univariate_analysis.show_univariate_analysis(df: DataFrame, columns: List[str] | None = None) None[source]

Generates univariate analysis for df.

Parameters:
  • df (pd.DataFrame) – Dataframe to be analyzed

  • columns (List[str], optional) – Subset of df columns to analyze. By default, all columns of df are used.

edvart.report_sections.univariate_analysis.top_most_frequent(series: Series, n_top: int = 5) None[source]

Generates a table with top n most frequent values in series.

Parameters:
  • series (pd.Series) – Categorical series

  • n_top (int) – The number of most frequent values to include in the table.
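The selection of the n_top most frequent values can be sketched with the standard library. `top_values` is a hypothetical helper; the relative-frequency column is an assumption about what such a table might contain, not a description of the table edvart renders:

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def top_values(
    values: Sequence[Hashable], n_top: int = 5
) -> List[Tuple[Hashable, int, float]]:
    """Hypothetical helper: (value, count, relative frequency) for the n_top values."""
    total = len(values)
    return [
        (value, count, count / total)
        for value, count in Counter(values).most_common(n_top)
    ]


for row in top_values(["x", "y", "x", "z", "x", "y"], n_top=2):
    print(row)
```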

Module contents

Package consisting of report sections.