edvart package

Subpackages

edvart.report_sections package

Submodules

edvart.data_types module

class edvart.data_types.DataType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Class describing possible data types.

BOOLEAN = 3

CATEGORICAL = 2

DATE = 4

MISSING = 6

NUMERIC = 1

UNIQUE = 7

UNKNOWN = 5

edvart.data_types.infer_data_type(series: Series) → DataType[source]

Infers the data type of the series passed in.

Parameters:: series (pd.Series) – Series from which to infer data type.
Returns:: Inferred custom edvart data type.
Return type:: DataType

edvart.data_types.is_boolean(series: Series) → bool[source]

Heuristic which tells if a series contains only boolean values.

Parameters:: series (pd.Series) – Series from which to infer data type.
Returns:: Boolean indicating if series is boolean.
Return type:: bool

edvart.data_types.is_categorical(series: Series, unique_value_count_threshold: int = 10) → bool[source]

Heuristic to tell if a series is categorical.

Parameters:

series (pd.Series) – Series from which to infer data type.
unique_value_count_threshold (int) – Maximum number of unique values the series may contain in order to satisfy one of the requirements for being a categorical series.

Returns:

Boolean indicating if series is categorical.

Return type:

bool
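The unique-value threshold can be pictured with a minimal sketch. It covers only that single requirement; the other requirements checked by edvart are not described here, and `passes_unique_count_threshold` is a hypothetical helper name that is not part of the edvart API. Plain Python lists stand in for a pandas Series.

```python
from typing import Hashable, Iterable


def passes_unique_count_threshold(
    values: Iterable[Hashable], unique_value_count_threshold: int = 10
) -> bool:
    """Hypothetical helper: True if the number of distinct values is at most the threshold."""
    return len(set(values)) <= unique_value_count_threshold


# Three distinct labels: within the default threshold of 10.
few_labels = ["red", "green", "blue", "red", "green"]
# Twenty distinct values: above the default threshold.
many_values = list(range(20))

print(passes_unique_count_threshold(few_labels))
print(passes_unique_count_threshold(many_values))
```

Raising the threshold to 20 would make the second list pass this particular check as well.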

edvart.data_types.is_date(series: Series) → bool[source]

Heuristic which tells if a series is of type date.

Parameters:: series (pd.Series) – Series from which to infer data type.
Returns:: Boolean indicating if series is of type datetime.
Return type:: bool

edvart.data_types.is_missing(series: Series) → bool[source]

Function to tell if the series contains only missing values.

Parameters:: series (pd.Series) – Series from which to infer data type.
Returns:: True if all values in the series are missing, False otherwise.
Return type:: bool

edvart.data_types.is_numeric(series: Series) → bool[source]

Heuristic to tell if a series contains numbers only.

Parameters:: series (pd.Series) – Series from which to infer data type.
Returns:: Boolean indicating whether series contains only numbers.
Return type:: bool

edvart.data_types.is_unique(series: Series) → bool[source]

Heuristic to tell if a series is categorical with only unique values.

Parameters:: series (pd.Series) – Series from which to infer data type.
Returns:: Boolean indicating whether series contains only unique values.
Return type:: bool

edvart.decorators module

edvart.decorators.check_index_time_ascending(func)[source]

Check whether the index of the DataFrame is sorted in ascending order.

The DataFrame which is checked needs to be the first argument of the decorated function.

Raises:: ValueError – If the index is not a datetime or is not ascending.
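The decorator pattern described above might look roughly like the following sketch. It is not the edvart implementation: `FrameStandIn` is a hypothetical stand-in for a DataFrame (so the example needs only the standard library), and the error messages are assumptions.

```python
import functools
from datetime import datetime


class FrameStandIn:
    """Hypothetical stand-in for a DataFrame; it only carries an index."""

    def __init__(self, index):
        self.index = list(index)


def check_index_time_ascending_sketch(func):
    """Validate the index of the first positional argument, then call func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        index = args[0].index  # the checked object must be the first argument
        if not all(isinstance(item, datetime) for item in index):
            raise ValueError("Index is not a datetime index.")
        if any(later < earlier for earlier, later in zip(index, index[1:])):
            raise ValueError("Index is not sorted in ascending order.")
        return func(*args, **kwargs)

    return wrapper


@check_index_time_ascending_sketch
def count_rows(frame):
    return len(frame.index)


ascending = FrameStandIn([datetime(2020, 1, 1), datetime(2020, 1, 2)])
descending = FrameStandIn([datetime(2020, 1, 2), datetime(2020, 1, 1)])

print(count_rows(ascending))
try:
    count_rows(descending)
except ValueError as error:
    print(f"Rejected: {error}")
```

Performing the check in a wrapper keeps the validation out of the decorated function's body, so the function can assume a well-formed time index.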

edvart.example_datasets module

edvart.example_datasets.dataset_auto() → DataFrame[source]

Returns a dataset containing information on the technical specifications of cars. The dataset contains 398 rows and 9 columns. There are no missing values.

Source: https://www.kaggle.com/datasets/uciml/autompg-dataset Unmodified.

Return type:: pd.DataFrame

edvart.example_datasets.dataset_global_temp() → DataFrame[source]

Returns a time-series dataset containing monthly deviations from mean global average temperature from 1880 until 2016 according to two methodologies: GCAG and GISTEMP.

Source: https://datahub.io/core/global-temp. Modified: Moved each methodology into its own column.

edvart.example_datasets.dataset_meteorite_landings() → DataFrame[source]

Returns a dataset that includes the location, mass, composition, and fall year for over 45,000 meteorites that have struck our planet.

Source of data: NASA Open Data Portal (https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)

Return type:: pd.DataFrame

edvart.example_datasets.dataset_pollution() → DataFrame[source]

Returns a time-series dataset containing hourly weather and pollution data from 2010 until 2014 from Beijing, China. There are 43800 rows and 8 columns.

Source: https://www.kaggle.com/datasets/djhavera/beijing-pm25-data-data-set

Modified:
- Merged columns “year”, “month”, “day”, “hour” into a single “date” column.
- Removed the first day, for which pollution data is missing.
- Renamed columns.

edvart.example_datasets.dataset_titanic() → DataFrame[source]

Returns a dataset that contains data for 891 of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived, their age, their passenger-class, their sex and the fare they paid.

The dataset contains 891 rows and 12 columns.

Source: https://www.kaggle.com/datasets/hesh97/titanicdataset-traincsv Unmodified.

Return type:: pd.DataFrame

edvart.export_utils module

edvart.export_utils.embed_image_base64(image_path: str, mime: str = 'image/png') → str[source]

Loads content of an image and embeds it as base64 data URL into the template. Intended to be used as a Jinja filter.

Example Jinja filter usage (CSS):

```
#notebook-container {
    background-image: url('{{ 'background.png' | embed_image_base64('image/png') }}');
}
```

Parameters:

image_path (str) – Relative path from the current template to the image.
mime (str) – Mime type of the image.

Returns:

Data URL with embedded image and mime type specified.

Return type:

str
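A data URL of this kind can be built with the standard library alone. The sketch below is an illustration rather than the edvart implementation: it reads the file directly from the given path, whereas the documented function resolves the path relative to the current template, and `embed_image_base64_sketch` is a hypothetical name.

```python
import base64
import os
import tempfile


def embed_image_base64_sketch(image_path: str, mime: str = "image/png") -> str:
    """Read a file and return it as a base64 data URL with the given MIME type."""
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "background.png")
    with open(demo_path, "wb") as demo_file:
        # Only the 8-byte PNG signature, not a complete image.
        demo_file.write(b"\x89PNG\r\n\x1a\n")
    data_url = embed_image_base64_sketch(demo_path)

print(data_url)
```

Because the image bytes travel inside the URL itself, the exported HTML does not depend on a separate image file.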

edvart.pandas_formatting module

edvart.pandas_formatting.add_html_heading(html: str, heading: str, heading_level: int = 2) → str[source]

Adds a heading to an HTML string with the specified text and heading level.

Parameters:

html (str) – HTML string to which to add heading
heading (str) – Text of the heading
heading_level (int) – Level of the heading

Returns:

HTML string with heading added

Return type:

str
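As a rough illustration, a heading can be added by wrapping the text in an `<hN>` tag. The sketch assumes the heading is placed before the existing HTML and separated by a newline; the actual placement and markup used by edvart are not stated here.

```python
def add_html_heading_sketch(html: str, heading: str, heading_level: int = 2) -> str:
    """Prepend an <hN> heading to an HTML string (placement is an assumption)."""
    return f"<h{heading_level}>{heading}</h{heading_level}>\n{html}"


result = add_html_heading_sketch("<p>Body</p>", "Summary", heading_level=3)
print(result)
```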

edvart.pandas_formatting.dict_to_html(dictionary: Dict[str, Any]) → str[source]

Converts a dictionary to a dataframe in HTML string form.

Parameters:: dictionary (Dict[str, Any]) – Dictionary to be converted

edvart.pandas_formatting.format_number(number: int | float, decimal_places: int = 2, thousand_separator: str = '') → str[source]

Formats a number by truncating decimal places (if it is a float) and optionally adds thousand separators.

Parameters:

number (Union[int, float]) – Number to be converted to string
decimal_places (int) – Number of decimal places in case of float
thousand_separator (str) – Character or string with which thousands should be separated

Returns:

Formatted number in a string representation

Return type:

str
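The formatting described above could be approximated with Python's format specification mini-language. This is a sketch under assumptions, not the library's code: the format spec rounds to the requested number of decimal places, which may differ from strict truncation, and `format_number_sketch` is a hypothetical name.

```python
from typing import Union


def format_number_sketch(
    number: Union[int, float], decimal_places: int = 2, thousand_separator: str = ""
) -> str:
    """Format a number with fixed decimals (floats only) and a custom thousands separator."""
    if isinstance(number, float):
        formatted = f"{number:,.{decimal_places}f}"
    else:
        formatted = f"{number:,}"
    # The format spec emits "," between thousands; swap in the requested separator.
    return formatted.replace(",", thousand_separator)


print(format_number_sketch(1234567))
print(format_number_sketch(1234567, thousand_separator=" "))
print(format_number_sketch(3.14159))
```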

edvart.pandas_formatting.hide_index(df: DataFrame) → Styler[source]

Hides the index of a DataFrame.

Parameters:: df (pd.DataFrame) – DataFrame where the index should be hidden.
Returns:: Styler object with the index hidden.
Return type:: Styler

edvart.pandas_formatting.render_dictionary(dictionary: Dict[str, Any]) → None[source]

Converts a dictionary to a dataframe and renders that dataframe in the report notebook.

Parameters:: dictionary (Dict[str, Any]) – Dictionary to be rendered

edvart.pandas_formatting.series_to_frame(series: Series, index_name: str, column_name: str) → DataFrame[source]

Converts a pandas.Series to a pandas.DataFrame by putting the series index into a separate column.

Parameters:

series (pd.Series) – Input series
index_name (str) – Name of the new column into which the series index will be put
column_name (str) – Name of the series values column

Returns:

Dataframe with two columns index_name and column_name with values of series.index and series.values respectively

Return type:

pd.DataFrame

edvart.pandas_formatting.subcells_html(elements: List[List[str]]) → str[source]

Returns HTML table in string format according to the elements matrix.

Parameters:: elements (List[List[str]]) – Elements which should be rendered in table cells, outer list represents rows, inner list represents columns, elements themselves should be HTML strings
Returns:: Table in HTML string ready to be rendered for example by IPython.display.display_html
Return type:: str
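The row and column layout of the elements matrix can be sketched as follows. The exact markup edvart emits (attributes, styling) is not documented here, so the bare `<table>`, `<tr>` and `<td>` tags are an assumption, as is the name `subcells_html_sketch`.

```python
from typing import List


def subcells_html_sketch(elements: List[List[str]]) -> str:
    """Render a matrix of HTML strings as a table: outer list = rows, inner list = columns."""
    rows = []
    for row in elements:
        cells = "".join(f"<td>{cell}</td>" for cell in row)
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


table_html = subcells_html_sketch([["<b>a</b>", "b"], ["c", "d"]])
print(table_html)
```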

edvart.plots module

edvart.plots.scatter_plot_2d(df: DataFrame, x: str | Series | ndarray, y: str | Series | ndarray, color_col: str | None = None, interactive: bool = True, figsize: Tuple[float, float] = (12, 12), opacity: float = 0.8, xlabel: str | None = None, ylabel: str | None = None, show_xticks: bool = False, show_yticks: bool = False, show_zerolines: bool = False, equal_scale_axes: bool = False) → None[source]

Display a 2D scatter plot of x and y, with optional coloring of points by values in a column.

Parameters:

df (pd.DataFrame) – Data to plot.
x (Union[str, pd.Series, np.ndarray]) – Name of column in df or flat array or series of x coordinates of plotted points.
y (Union[str, pd.Series, np.ndarray]) – Name of column in df or flat array or series of y coordinates of plotted points.
color_col (str, optional) – Name of column in df to color points in the plot by. Can be either numeric or categorical. By default, all points have the same color.
interactive (bool (default = True)) – Whether to plot an interactive plot. The interactive plot also shows labels for each point on hover.
figsize (Tuple[float, float] (default = (12, 12))) – Size of the resulting plot in inches.
opacity (float (default = 0.8)) – Opacity of the points drawn in the scatter plot.
xlabel (str, optional) – Label for the x axis. No label is displayed by default.
ylabel (str, optional) – Label for the y axis. No label is displayed by default.
show_xticks (bool (default = False)) – Whether to display ticks on the x axis.
show_yticks (bool (default = False)) – Whether to display ticks on the y axis.
show_zerolines (bool (default = False)) – Whether to display zero lines.
equal_scale_axes (bool (default = False)) – Whether to make the x and y axes have the same scale.

edvart.report module

class edvart.report.DefaultReport(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW, verbosity_overview: Verbosity | None = None, verbosity_univariate_analysis: Verbosity | None = None, verbosity_bivariate_analysis: Verbosity | None = None, verbosity_multivariate_analysis: Verbosity | None = None, verbosity_group_analysis: Verbosity | None = None, columns_overview: List[str] | None = None, columns_univariate_analysis: List[str] | None = None, columns_bivariate_analysis: List[str] | None = None, columns_multivariate_analysis: List[str] | None = None, columns_group_analysis: List[str] | None = None, groupby: str | List[str] | None = None)[source]

Bases: Report

A report for tabular data containing default sections.

The report contains the following sections:
- dataset overview
- univariate analysis
- bivariate analysis
- multivariate analysis
- group analysis (if groupby is specified)

Parameters:

dataframe (pd.DataFrame) – Data from which to generate the report.
verbosity (Verbosity (default = Verbosity.LOW)) – The default verbosity for the exported code of the entire report.
verbosity_overview (Verbosity, optional) – Verbosity of the overview section
verbosity_univariate_analysis (Verbosity, optional) – Verbosity of the univariate analysis section
verbosity_bivariate_analysis (Verbosity, optional) – Verbosity of the bivariate analysis section.
verbosity_multivariate_analysis (Verbosity, optional) – Verbosity of the multivariate analysis section
columns_overview (List[str], optional) – Subset of columns to use in overview section
columns_univariate_analysis (List[str], optional) – Subset of columns to use in univariate analysis section
columns_bivariate_analysis (List[str], optional) – Subset of columns to use in bivariate analysis section
columns_multivariate_analysis (List[str], optional) – Subset of columns to use in multivariate analysis section
columns_group_analysis (List[str], optional) – Subset of columns to use in group analysis section
groupby (Union[str, List[str]], optional) – Column or list of columns to group by in group analysis. If None, group analysis will not be included by default. It can still be added later using add_group_analysis. If a single column is specified, it will be used to color points in multivariate analysis. Default: None.

class edvart.report.DefaultTimeseriesReport(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW, verbosity_overview: Verbosity | None = None, verbosity_univariate_analysis: Verbosity | None = None, verbosity_timeseries_analysis: Verbosity | None = None, columns_overview: List[str] | None = None, columns_univariate_analysis: List[str] | None = None, columns_timeseries_analysis: List[str] | None = None, sampling_rate: Verbosity | None = None, stft_window_size: Verbosity | None = None)[source]

Bases: TimeseriesReport

A default report for time series data.

The report contains the following sections:
- dataset overview
- univariate analysis
- timeseries analysis

Parameters:

dataframe (pd.DataFrame) – Data from which to generate the report. Data needs to be indexed by time: pd.DatetimeIndex or pd.PeriodIndex. The data is assumed to be sorted according to the time index in ascending order.
verbosity (Verbosity (default = Verbosity.LOW)) – The default verbosity for the exported code of the entire report.
verbosity_overview (Verbosity, optional) – Verbosity of the overview section
verbosity_univariate_analysis (Verbosity, optional) – Verbosity of the univariate analysis section
verbosity_timeseries_analysis (Verbosity, optional) – Verbosity of the timeseries analysis section
columns_overview (List[str], optional) – Subset of columns to use in overview section
columns_univariate_analysis (List[str], optional) – Subset of columns to use in univariate analysis section
columns_timeseries_analysis (List[str], optional) – Subset of columns to use in timeseries analysis section
sampling_rate (int, optional) – Sampling rate for Fourier transform and Short-time Fourier transform subsections. Determines frequency unit for analysis of frequencies, for example with monthly data and sampling rate 12, yearly frequency spectrum is produced. If not set, these two sections will not be included.
stft_window_size (int, optional) – Window size for short-time Fourier transform subsection. If not set, STFT will be excluded.
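The effect of sampling_rate on the frequency unit can be illustrated with the standard formula for discrete Fourier transform bins, where bin k corresponds to k * sampling_rate / n_samples. The sketch below is not taken from edvart; `frequency_bins` is a hypothetical helper, and the 24-sample series is an assumed example of two years of monthly data.

```python
from typing import List


def frequency_bins(n_samples: int, sampling_rate: float) -> List[float]:
    """Frequencies of the non-negative DFT bins, in cycles per sampling_rate samples."""
    return [k * sampling_rate / n_samples for k in range(n_samples // 2 + 1)]


# Two years of monthly data with sampling rate 12: frequencies are in cycles per year.
bins = frequency_bins(n_samples=24, sampling_rate=12)
print(bins)
```

With these inputs the bin at index 2 is 1.0, which corresponds to one cycle per year, and the last bin (6.0 cycles per year) is the highest frequency resolvable from monthly samples.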

exception edvart.report.EmptyReportWarning[source]

Bases: UserWarning

Warning raised when a report contains no sections.

class edvart.report.ExportDataMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: StrEnum

Data export mode for the report.

EMBED = 'embed'

FILE = 'file'

NONE = 'none'

class edvart.report.Report(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW)[source]

Bases: ReportBase

A report for tabular datasets. Contains no sections by default.

See DefaultReport for a report with default sections. See methods add_* for adding sections to the report.

Parameters:

dataframe (pd.DataFrame) – Data from which to generate the report.
verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the exported code of the entire report.

class edvart.report.ReportBase(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW)[source]

Bases: ABC

Abstract base class for reports.

Parameters:

dataframe (pd.DataFrame) – Data from which to generate the report.
verbosity (Verbosity (default = Verbosity.LOW)) – The default verbosity for the exported code of the entire report.

add_bivariate_analysis(columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, columns_pairs: List[Tuple[str, str]] | None = None, subsections: List[BivariateAnalysisSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_correlations: Verbosity | None = None, verbosity_pairplot: Verbosity | None = None, verbosity_contingency_table: Verbosity | None = None, color_col: str | None = None) → ReportBase[source]

Adds bivariate analysis section to the report.

Parameters:

columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
columns_x (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified.
columns_y (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified.
columns_pairs (List[Tuple[str, str]], optional) – List of column pairs on which to perform bivariate analysis. Used primarily in contingency tables. If specified, columns, columns_x and columns_y are ignored in contingency tables. Ignored in pairplots and correlations unless columns_pairs is specified and none of columns, columns_x, columns_y is specified. In that case, the first elements of each pair are treated as columns_x and the second elements as columns_y in pairplots and correlations.
subsections (List[BivariateAnalysisSubsection], optional) – List of sub-sections to include into the BivariateAnalysis section. If None, all subsections are added.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
verbosity_correlations (Verbosity, optional) – Correlation plots subsection code verbosity.
verbosity_pairplot (Verbosity, optional) – Pairplot subsection code verbosity.
verbosity_contingency_table (Verbosity, optional) – Contingency table code verbosity.
color_col (str, optional) – Name of column to use for coloring of the bivariate analysis subsections. Coloring is currently supported in pairplot.

add_group_analysis(groupby: str | List[str], columns: List[str] | None = None, verbosity: Verbosity | None = None, show_within_group_statistics: bool = True, show_group_missing_values: bool = True, show_group_distribution_plots: bool = True) → ReportBase[source]

Add group analysis section to the report.

Parameters:

groupby (Union[str, List[str]]) – Column or list of columns to group by in group analysis.
columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
show_within_group_statistics (bool (default = True)) – Whether to show per-group statistics.
show_group_missing_values (bool (default = True)) – Whether to show per-group missing values.
show_group_distribution_plots (bool (default = True)) – Whether to show per-group distribution plots.

add_multivariate_analysis(columns: List[str] | None = None, subsections: List[MultivariateAnalysisSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_pca: Verbosity | None = None, verbosity_umap: Verbosity | None = None, verbosity_parallel_coordinates: Verbosity | None = None, verbosity_parallel_categories: Verbosity | None = None, color_col: str | None = None) → ReportBase[source]

Add multivariate analysis section to the report.

Parameters:

columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
subsections (List[MultivariateAnalysisSubsection], optional) – List of sub-sections to include into the MultivariateAnalysis section. If None, all subsections are added.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
verbosity_pca (Verbosity, optional) – Principal component analysis subsection code verbosity.
verbosity_umap (Verbosity, optional) – UMAP subsection code verbosity.
verbosity_parallel_coordinates (Verbosity, optional) – Parallel coordinates subsection code verbosity.
verbosity_parallel_categories (Verbosity, optional) – Parallel categories subsection code verbosity.
color_col (str, optional) – Name of column to use for coloring of the multivariate analysis subsections. The exact method of coloring depends on each particular subsection.

add_overview(columns: List[str] | None = None, subsections: List[OverviewSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_quick_info: Verbosity | None = None, verbosity_data_types: Verbosity | None = None, verbosity_data_preview: Verbosity | None = None, verbosity_missing_values: Verbosity | None = None, verbosity_rows_with_missing_value: Verbosity | None = None, verbosity_constant_occurrence: Verbosity | None = None, verbosity_duplicate_rows: Verbosity | None = None) → ReportBase[source]

Adds a dataset overview section to the report.

Parameters:

columns (List[str], optional) – Columns which to include in the overview section. If None, all columns are used.
subsections (List[Overview.OverviewSubsection], optional) – List of sub-sections to include into the Overview section. If None, all subsections are added.
verbosity (Verbosity, optional) – Generated code verbosity global to the Overview sections. If subsection verbosities are None, then they will be overridden by this parameter.
verbosity_quick_info (Verbosity, optional) – Quick info sub-section code verbosity.
verbosity_data_types (Verbosity, optional) – Data types sub-section code verbosity.
verbosity_data_preview (Verbosity, optional) – Data preview sub-section code verbosity.
verbosity_missing_values (Verbosity, optional) – Missing values sub-section code verbosity.
verbosity_rows_with_missing_value (Verbosity, optional) – Rows with missing value sub-section code verbosity.
verbosity_constant_occurrence (Verbosity, optional) – Constant values occurrence sub-section code verbosity.
verbosity_duplicate_rows (Verbosity, optional) – Duplicate rows sub-section code verbosity.

add_section(section: Section) → ReportBase[source]

Add a section to the report. See edvart.report_sections for available sections.

Parameters:: section (Section) – Section to add to the report.
Returns:: Returns self.
Return type:: ReportBase

add_table_of_contents(include_subsections: bool = True) → ReportBase[source]

Adds table of contents section to the report.

Parameters:: include_subsections (bool) – A boolean controlling whether the subsections should be included in the table of contents. However, they won’t be included in an exported notebook created by report’s export_notebook function.

add_univariate_analysis(columns: List[str] | None = None, verbosity: Verbosity | None = None) → ReportBase[source]

Adds univariate analysis section to the report.

Parameters:

columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.

export_html(html_filepath: str, template_name: str | None = None, template_filepath: str | None = None, dataset_name: str = '[INSERT DATASET NAME]', dataset_description: str = '[INSERT DATASET DESCRIPTION]', timeout: int = 120) → None[source]

Generate HTML report for an already-loaded DataFrame.

Parameters:

html_filepath (str) – File path to save exported HTML report to.
template_name (str, optional) – Path to template file to use for exporting the notebook to HTML. The template must be found in a Jupyter path (see https://nbconvert.readthedocs.io/en/latest/customizing.html#where-are-nbconvert-templates-installed). The default location is $HOME/.local/share/jupyter/nbconvert/templates
template_filepath (str, optional) – Template to use when exporting the HTML report.
dataset_name (str (default = "[INSERT DATASET NAME]")) – Name of dataset to be used in the title of the report.
dataset_description (str (default = "[INSERT DATASET DESCRIPTION]")) – Description of dataset to be used below the title of the report.
timeout (int (default = 120)) – Maximum number of seconds to wait for a cell to finish execution.

export_notebook(notebook_filepath: str | PathLike, dataset_name: str = '[INSERT DATASET NAME]', dataset_description: str = '[INSERT DATASET DESCRIPTION]', export_data_mode: ExportDataMode = ExportDataMode.NONE) → None[source]

Exports the report as an .ipynb file.

Parameters:

notebook_filepath (str or PathLike) – Filepath of the exported notebook.
dataset_name (str (default = "[INSERT DATASET NAME]")) – Name of dataset to be used in the title of the report.
dataset_description (str (default = "[INSERT DATASET DESCRIPTION]")) – Description of dataset to be used below the title of the report.
export_data_mode (ExportDataMode (default = ExportDataMode.NONE)) – Mode for exporting the data to the notebook. If ExportDataMode.NONE, the data is not exported to the notebook. If ExportDataMode.FILE, the data is exported to a parquet file and loaded from there. If ExportDataMode.EMBED, the data is embedded into the notebook as a base64 string.

show() → None[source]

Renders the report in the calling notebook.

class edvart.report.TimeseriesReport(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW)[source]

Bases: ReportBase

A report for time-series data. Contains no sections by default.

See DefaultTimeseriesReport for a time-series report with default sections. See methods add_* for adding sections to the report.

Raises:: ValueError – If the input dataframe is not indexed by time.

add_timeseries_analysis(columns: List[str] | None = None, subsections: List[TimeseriesAnalysisSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_time_series_line_plot: Verbosity | None = None, verbosity_rolling_statistics: Verbosity | None = None, verbosity_boxplots_over_time: Verbosity | None = None, verbosity_seasonal_decomposition: Verbosity | None = None, verbosity_autocorrelation: Verbosity | None = None, verbosity_stationarity_tests: Verbosity | None = None, verbosity_fourier_transform: Verbosity | None = None, verbosity_short_time_ft: Verbosity | None = None, sampling_rate: Verbosity | None = None, stft_window_size: Verbosity | None = None) → TimeseriesReport[source]

Add timeseries analysis section to the report.

Parameters:

columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
subsections (List[TimeseriesAnalysis.TimeseriesAnalysisSubsection], optional) – List of sub-sections to include into the TimeseriesAnalysis section. If None, all subsections are added.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
verbosity_time_series_line_plot (Verbosity, optional) – Time series line plot subsection code verbosity.
verbosity_rolling_statistics (Verbosity, optional) – Rolling statistics interactive plot subsection code verbosity.
verbosity_boxplots_over_time (Verbosity, optional) – Boxplots grouped over time intervals plot subsection code verbosity.
verbosity_seasonal_decomposition (Verbosity, optional) – Decomposition into trend, seasonal and residual components code verbosity.
verbosity_autocorrelation (Verbosity, optional) – Autocorrelation and partial autocorrelation vs. lag code verbosity.
verbosity_stationarity_tests (Verbosity, optional) – Stationarity tests code verbosity.
verbosity_fourier_transform (Verbosity, optional) – Fourier transform and short-time Fourier transform code verbosity.
verbosity_short_time_ft (Verbosity, optional) – Short-time Fourier transform spectrogram code verbosity.
sampling_rate (int, optional) – Sampling rate for Fourier transform and Short-time Fourier transform subsections. Needs to be set in order for these two subsections to be included.
stft_window_size (int, optional) – Window size for Short-time Fourier transform. Needs to be set in order for the STFT subsection to be included.

edvart.utils module

edvart.utils.coefficient_of_variation(series: Series) → float[source]

Return coefficient of variation.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float
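The coefficient of variation is the ratio of the standard deviation to the mean. The sketch below uses the sample standard deviation (n - 1 in the denominator), which is also the default of pandas `Series.std`; whether edvart uses the sample or population form is an assumption here, and a plain list stands in for the series.

```python
import statistics
from typing import Sequence


def coefficient_of_variation_sketch(values: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Mean is 20 and the sample standard deviation is 10.
cv = coefficient_of_variation_sketch([10, 20, 30])
print(cv)
```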

edvart.utils.contingency_table(df: DataFrame) → DataFrame[source]

Return contingency table.

Parameters:: df (pd.DataFrame) –
Return type:: pd.DataFrame

edvart.utils.env_var(name: str, value: str) → Iterator[None][source]

Set an environment variable for the duration of the context.

Parameters:

name (str) – Name of the environment variable.
value (str) – Value of the environment variable.
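A context manager with this behavior can be sketched with `contextlib`. Restoring the previous value on exit (or removing the variable if it was unset) is an assumption about what "for the duration of the context" implies; `EDVART_SKETCH_DEMO` is a made-up variable name used only for the demonstration.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def env_var_sketch(name: str, value: str) -> Iterator[None]:
    """Temporarily set an environment variable, restoring the prior state on exit."""
    original = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if original is None:
            # The variable did not exist before: remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = original


os.environ.pop("EDVART_SKETCH_DEMO", None)
with env_var_sketch("EDVART_SKETCH_DEMO", "1"):
    inside = os.environ["EDVART_SKETCH_DEMO"]
after = os.environ.get("EDVART_SKETCH_DEMO")
print(inside, after)
```

The `try`/`finally` ensures the environment is restored even if the body of the `with` block raises.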

edvart.utils.get_default_discrete_colorscale(n_colors: int) → List[Tuple[float, str]][source]

Get a default Plotly-compatible colorscale of n discrete colors.

Parameters:: n_colors (int) – Number of colors.
Returns:: A list of 2n tuples, where each tuple contains a value between 0 and 1 and a plotly-compatible color string.
Return type:: list[tuple[float, str]]

edvart.utils.hsl_wheel_colorscale(n: int, saturation=0.5, lightness=0.5) → Iterable[str][source]

Generate a colorscale of n discrete colors.

Colors are equally spaced around the complete HSL wheel with constant saturation and lightness.

Returns:: An iterable of n plotly-compatible HSL strings.
Return type:: Iterable[str]
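Equal spacing around the HSL wheel means that color i of n receives the hue 360 * i / n degrees. The following sketch shows the idea; the exact string format produced by edvart is an assumption, and `hsl_wheel_colorscale_sketch` is a hypothetical name.

```python
from typing import Iterator


def hsl_wheel_colorscale_sketch(
    n: int, saturation: float = 0.5, lightness: float = 0.5
) -> Iterator[str]:
    """Yield n HSL color strings with hues equally spaced over 360 degrees."""
    for i in range(n):
        hue = 360 * i / n
        yield f"hsl({hue:.0f}, {saturation:.0%}, {lightness:.0%})"


colors = list(hsl_wheel_colorscale_sketch(3))
print(colors)
```

For n = 3 the hues are 0, 120 and 240 degrees, all with the same saturation and lightness.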

edvart.utils.iqr(series: Series) → float[source]

Return interquartile range.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float
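The interquartile range is the third quartile minus the first quartile. The sketch below computes it with the standard library on a plain list; the "inclusive" method uses linear interpolation between data points, and whether this matches the quartile definition used by edvart is an assumption.

```python
import statistics
from typing import Sequence


def iqr_sketch(values: Sequence[float]) -> float:
    """Third quartile minus first quartile, using linear interpolation."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Quartiles of 1..5 are 2 and 4.
print(iqr_sketch([1, 2, 3, 4, 5]))
```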

edvart.utils.kendall(df: DataFrame) → DataFrame[source]

Return Kendall correlation coefficient.

Parameters:: df (pd.DataFrame) – DataFrame on which the stat should be calculated.
Return type:: pd.DataFrame

edvart.utils.kurtosis(series: Series) → Any[source]

Return kurtosis.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.mad(series: Series) → Any[source]

Return mean absolute deviation.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.make_discrete_colorscale(colorscale: List[str], n_colors: int) → Iterable[Tuple[float, str]][source]

Generate a colorscale of n discrete colors for use in plotly.graph_objects.

Note that when using plotly.express, the parameter color_discrete_sequence can be used instead.

Parameters:

colorscale (List[str]) – A list of plotly-compatible colors.
n_colors (int) – Number of colors in the generated colorscale.

Returns:

An iterable of 2n tuples, where each tuple contains a value between 0 and 1 (the values are equally spaced in the interval and each value appears twice), and one of the colors from the colorscale.

Return type:

Iterable[Tuple[float, str]]

Examples

>>> list(make_discrete_colorscale(["red", "green", "blue"], 4))
[
    (0, "red"), (0.25, "red"),
    (0.25, "green"), (0.5, "green"),
    (0.5, "blue"), (0.75, "blue"),
    (0.75, "red"), (1, "red")
]
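One way to produce the output shown in the example is to emit each color twice, once at the start and once at the end of its interval, cycling through the input colors when n_colors exceeds their number. This is a sketch consistent with the documented example, not necessarily how edvart implements it; `make_discrete_colorscale_sketch` is a hypothetical name.

```python
from typing import Iterator, List, Tuple


def make_discrete_colorscale_sketch(
    colorscale: List[str], n_colors: int
) -> Iterator[Tuple[float, str]]:
    """Yield 2 * n_colors (position, color) pairs describing a step-wise colorscale."""
    for i in range(n_colors):
        # Cycle through the input colors if there are fewer than n_colors.
        color = colorscale[i % len(colorscale)]
        yield (i / n_colors, color)        # start of the i-th interval
        yield ((i + 1) / n_colors, color)  # end of the i-th interval


scale = list(make_discrete_colorscale_sketch(["red", "green", "blue"], 4))
print(scale)
```

Repeating each boundary value with two different colors is what makes the scale change abruptly instead of blending between neighbors.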

edvart.utils.maximum(series: Series) → float[source]

Return maximum.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.mean(series: Series) → float[source]

Return mean.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.median(series: Series) → float[source]

Return median.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.median_absolute_deviation(series: Series) → float[source]

Return median absolute deviation.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float
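The median absolute deviation is easy to confuse with the mean absolute deviation returned by mad. The sketch below computes both on a plain list using their textbook definitions; any scaling or missing-value handling that edvart applies is not covered, and both function names are hypothetical.

```python
import statistics
from typing import Sequence


def mean_absolute_deviation_sketch(values: Sequence[float]) -> float:
    """Mean of the absolute deviations from the mean."""
    center = sum(values) / len(values)
    return sum(abs(value - center) for value in values) / len(values)


def median_absolute_deviation_sketch(values: Sequence[float]) -> float:
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(value - center) for value in values])


data = [1, 2, 3, 4, 100]  # one large outlier
print(mean_absolute_deviation_sketch(data))
print(median_absolute_deviation_sketch(data))
```

On this data the median-based statistic stays at 1 while the mean-based one is pulled far upward by the outlier, which is the usual reason for reporting both.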

edvart.utils.minimum(series: Series) → float[source]

Return minimum.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.mode(series: Series) → float[source]

Return mode.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Returns:: The most frequent value. float('nan') if the series contains only null values.
Return type:: float

edvart.utils.num_unique_values(series: Series) → int[source]

Return number of unique values.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: int

edvart.utils.pearson(df: DataFrame) → DataFrame[source]

Return Pearson correlation coefficient.

Parameters:: df (pd.DataFrame) – DataFrame on which the stat should be calculated.
Return type:: pd.DataFrame

edvart.utils.quartile1(series: Series) → float[source]

Return first quartile.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.quartile3(series: Series) → float[source]

Return third quartile.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float
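
For orientation, first and third quartiles can be obtained with pandas' Series.quantile. This sketch assumes pandas' default linear interpolation, which may or may not match what quartile1 and quartile3 use internally.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range derived from the two
```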

edvart.utils.reindex_to_datetime(df: DataFrame, datetime_column: str, keep_index: str | None = None, unit: str = 'ns', origin: str | Timestamp = 'unix', sort: bool = True) → DataFrame[source]

Reindex a given DataFrame to be indexed by a pd.DatetimeIndex.

Parameters:

df (pd.DataFrame) – DataFrame to reindex.
datetime_column (str) – Name of the column containing the datetimes to index by.
keep_index (str, optional) – Name of column to store the original index. The original index will be discarded by default.
unit (str (default = "ns")) – Unit in which numeric values are interpreted, as a number of units elapsed since origin.
origin (Union[str, pd.Timestamp] (default = "unix")) – Reference date. Numeric values are parsed as the number of units (defined by unit) since this reference date. Defaults to the Unix epoch, 1970-01-01.
sort (bool (default = True)) – Whether to sort according to the index.

Returns:

Reindexed df.

Return type:

pd.DataFrame
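
The parameters above map naturally onto pd.to_datetime. The following is a minimal sketch of that mapping under stated assumptions, not edvart's implementation: reindex_to_datetime_sketch is a hypothetical name, and dropping the datetime column from the data once it becomes the index is an assumption.

```python
from typing import Optional, Union

import pandas as pd


def reindex_to_datetime_sketch(
    df: pd.DataFrame,
    datetime_column: str,
    keep_index: Optional[str] = None,
    unit: str = "ns",
    origin: Union[str, pd.Timestamp] = "unix",
    sort: bool = True,
) -> pd.DataFrame:
    """Hypothetical helper, for illustration only."""
    result = df.copy()
    if keep_index is not None:
        # Preserve the original index as a regular column.
        result[keep_index] = result.index
    result[datetime_column] = pd.to_datetime(
        result[datetime_column], unit=unit, origin=origin
    )
    # Assumption: the datetime column is moved into the index.
    result = result.set_index(datetime_column)
    return result.sort_index() if sort else result


# Integer timestamps expressed in seconds since the Unix epoch.
raw = pd.DataFrame({"ts": [86400, 0, 172800], "value": [2, 1, 3]})
by_time = reindex_to_datetime_sketch(raw, "ts", keep_index="row", unit="s")
```

With unit="s", the integers 0, 86400 and 172800 become 1970-01-01, 1970-01-02 and 1970-01-03, and the "row" column records where each row originally sat.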

edvart.utils.reindex_to_period(df: DataFrame, period_column: str, freq: str | None, keep_index: str | None = None, sort: bool = True) → DataFrame[source]

Reindex a given DataFrame to be indexed by a pd.PeriodIndex.

Parameters:

df (pd.DataFrame) – DataFrame to reindex.
period_column (str) – Name of the column containing the periods to index by.
freq (Union[str, pd.Offset]) – One of pandas’ offset strings or an Offset object. If None, the frequency will be inferred.
keep_index (str, optional) – Name of column to store the original index. The original index will be discarded by default.
sort (bool (default = True)) – Whether to sort according to the index.

Returns:

Reindexed df.

Return type:

pd.DataFrame
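
A comparable sketch for period indexing, built on pd.PeriodIndex. As before, this only illustrates the documented parameters: reindex_to_period_sketch is a hypothetical name, and removing the period column from the data is an assumption.

```python
from typing import Optional

import pandas as pd


def reindex_to_period_sketch(
    df: pd.DataFrame,
    period_column: str,
    freq: Optional[str],
    keep_index: Optional[str] = None,
    sort: bool = True,
) -> pd.DataFrame:
    """Hypothetical helper, for illustration only."""
    result = df.copy()
    if keep_index is not None:
        # Preserve the original index as a regular column.
        result[keep_index] = result.index
    periods = pd.PeriodIndex(
        result[period_column].tolist(), freq=freq, name=period_column
    )
    # Assumption: the period column is moved into the index.
    result = result.drop(columns=[period_column])
    result.index = periods
    return result.sort_index() if sort else result


monthly = pd.DataFrame(
    {"month": ["2021-03", "2021-01", "2021-02"], "sales": [30, 10, 20]}
)
by_period = reindex_to_period_sketch(monthly, "month", freq="M")
```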

edvart.utils.select_numeric_columns(df: DataFrame, columns: List[str] | None) → List[str][source]

Select all numeric columns from a DataFrame if columns is None, or check if all specified columns are numeric if columns is a list of column names.

Parameters:

df (pd.DataFrame) – DataFrame to select or check columns from.
columns (List[str], optional) – Specified columns.

Returns:

List of all numeric columns if columns is None, otherwise the specified columns.

Return type:

List[str]

Raises:

ValueError – If a non-numeric column is specified in columns.
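
The select-or-validate behaviour described above can be sketched as follows. The sketch uses pandas.api.types.is_numeric_dtype as the numeric test, which is an assumption; edvart may rely on its own heuristic instead. The name select_numeric_columns_sketch is hypothetical.

```python
from typing import List, Optional

import pandas as pd
from pandas.api.types import is_numeric_dtype


def select_numeric_columns_sketch(
    df: pd.DataFrame, columns: Optional[List[str]] = None
) -> List[str]:
    """Hypothetical helper, for illustration only."""
    if columns is None:
        # No selection given: pick every numeric column.
        return [col for col in df.columns if is_numeric_dtype(df[col])]
    non_numeric = [col for col in columns if not is_numeric_dtype(df[col])]
    if non_numeric:
        raise ValueError(f"Non-numeric columns specified: {non_numeric}")
    return list(columns)


df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [0.5, 1.5]})
all_numeric = select_numeric_columns_sketch(df)        # selection mode
checked = select_numeric_columns_sketch(df, ["a"])     # validation mode

try:
    select_numeric_columns_sketch(df, ["b"])
    raised = False
except ValueError:
    raised = True
```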

edvart.utils.skewness(series: Series) → Any[source]

Return skewness.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.spearman(df: DataFrame) → DataFrame[source]

Return Spearman correlation coefficients between the columns of the DataFrame.

Parameters:: df (pd.DataFrame) – DataFrame on which the stat should be calculated.
Return type:: pd.DataFrame
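
To show how the Spearman and Pearson coefficients differ in practice, the sketch below uses pandas' DataFrame.corr directly. It assumes the spearman and pearson helpers produce comparable correlation matrices, which this page does not confirm.

```python
import pandas as pd

# y is a monotonic but non-linear function of x.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson_xy = df.corr(method="pearson").loc["x", "y"]
spearman_xy = df.corr(method="spearman").loc["x", "y"]
```

Because Spearman correlation works on ranks, a perfectly monotonic relationship scores 1 even when it is far from linear, while the Pearson coefficient stays below 1.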

edvart.utils.std(series: Series) → float[source]

Return standard deviation.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.sum_(series: Series) → float[source]

Return sum.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

edvart.utils.top_frequent_values(series: Series, n_top: int = 10) → Mapping[str, Any][source]

Count the n_top most frequent values in a series, together with the combined count of all other values and the count of NULL values.

Parameters:

series (pd.Series) – Input series of data for which frequencies will be calculated.
n_top (int) – Number of values for which actual frequencies will be counted. Other values will be grouped into the 'Other' category.

Returns:

result_dict – Dictionary with the mapping {'value': 'frequency (relative frequency)'}.

Return type:

Dict
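
The grouping into top values, 'Other' and nulls might be sketched like this. Everything about the output format here is an assumption made for illustration: the 'Null' key, the percentage formatting, and computing relative frequencies against the full series length including nulls. The name top_frequent_values_sketch is hypothetical.

```python
from typing import Dict

import pandas as pd


def top_frequent_values_sketch(series: pd.Series, n_top: int = 10) -> Dict[str, str]:
    """Hypothetical helper, for illustration only."""
    total = len(series)  # assumption: relative frequencies include nulls
    counts = series.value_counts()  # nulls excluded, sorted by frequency

    def fmt(count: int) -> str:
        return f"{count} ({count / total:.2%})"

    result = {str(value): fmt(int(count)) for value, count in counts.head(n_top).items()}
    other = int(counts.iloc[n_top:].sum())
    nulls = int(series.isna().sum())
    if other:
        result["Other"] = fmt(other)
    if nulls:
        result["Null"] = fmt(nulls)  # key name is an assumption
    return result


data = pd.Series(["a", "a", "a", "b", "b", "c", None, "d"])
summary = top_frequent_values_sketch(data, n_top=2)
```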

edvart.utils.value_range(series: Series) → float[source]

Return value range.

Parameters:: series (pd.Series) – Series on which the stat should be calculated.
Return type:: float

Module contents

EDVART package.