edvart package
Subpackages
- edvart.report_sections package
- Subpackages
- edvart.report_sections.timeseries_analysis package
- Submodules
- edvart.report_sections.timeseries_analysis.autocorrelation module
- edvart.report_sections.timeseries_analysis.boxplots_over_time module
- edvart.report_sections.timeseries_analysis.fourier_transform module
- edvart.report_sections.timeseries_analysis.rolling_statistics module
- edvart.report_sections.timeseries_analysis.seasonal_decomposition module
- edvart.report_sections.timeseries_analysis.short_time_ft module
- edvart.report_sections.timeseries_analysis.stationarity_tests module
- edvart.report_sections.timeseries_analysis.time_series_line_plot module
- edvart.report_sections.timeseries_analysis.timeseries_analysis module
- Module contents
- Submodules
- edvart.report_sections.bivariate_analysis module
- edvart.report_sections.code_string_formatting module
- edvart.report_sections.dataset_overview module
- edvart.report_sections.group_analysis module
- edvart.report_sections.multivariate_analysis module
- edvart.report_sections.section_base module
- edvart.report_sections.table_of_contents module
- edvart.report_sections.umap module
- edvart.report_sections.univariate_analysis module
- Module contents
Submodules
edvart.data_types module
- class edvart.data_types.DataType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
IntEnum
Class describing possible data types.
- BOOLEAN = 3
- CATEGORICAL = 2
- DATE = 4
- MISSING = 6
- NUMERIC = 1
- UNIQUE = 7
- UNKNOWN = 5
- edvart.data_types.infer_data_type(series: Series) DataType [source]
Infers the data type of the series passed in.
- Parameters:
series (pd.Series) – Series from which to infer data type.
- Returns:
Inferred custom edvart data type.
- Return type:
DataType
- edvart.data_types.is_boolean(series: Series) bool [source]
Heuristic which tells if a series contains only boolean values.
- Parameters:
series (pd.Series) – Series from which to infer data type.
- Returns:
Boolean indicating if series is boolean.
- Return type:
bool
- edvart.data_types.is_categorical(series: Series, unique_value_count_threshold: int = 10) bool [source]
Heuristic to tell if a series is categorical.
- Parameters:
series (pd.Series) – Series from which to infer data type.
unique_value_count_threshold (int) – The number of unique values of the series has to be less than or equal to this number for the series to satisfy one of the requirements to be a categorical series.
- Returns:
Boolean indicating if series is categorical.
- Return type:
bool
- edvart.data_types.is_date(series: Series) bool [source]
Heuristic which tells if a series is of type date.
- Parameters:
series (pd.Series) – Series from which to infer data type.
- Returns:
Boolean indicating if series is of type datetime.
- Return type:
bool
- edvart.data_types.is_missing(series: Series) bool [source]
Function to tell if the series contains only missing values.
- Parameters:
series (pd.Series) – Series from which to infer data type.
- Returns:
True if all values in the series are missing, False otherwise.
- Return type:
bool
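The heuristics above can be pictured with a minimal sketch over plain Python lists instead of pd.Series. The order of the checks, the treatment of None as a missing value, and the omission of the DATE and UNIQUE checks are assumptions made for illustration; the library's actual rules may differ.

```python
from enum import IntEnum
from typing import Any, Sequence


class DataType(IntEnum):
    """Local mirror of the member values listed above."""

    NUMERIC = 1
    CATEGORICAL = 2
    BOOLEAN = 3
    DATE = 4
    UNKNOWN = 5
    MISSING = 6
    UNIQUE = 7


def infer_data_type_sketch(
    values: Sequence[Any], unique_value_count_threshold: int = 10
) -> DataType:
    """Simplified, hypothetical inference; not the edvart implementation."""
    present = [v for v in values if v is not None]
    if not present:
        # Only missing values.
        return DataType.MISSING
    if all(isinstance(v, bool) for v in present):
        # bool is checked first because bool is a subclass of int.
        return DataType.BOOLEAN
    if len(set(present)) <= unique_value_count_threshold:
        # Few distinct values: treat as categorical (assumed to win over numeric).
        return DataType.CATEGORICAL
    if all(isinstance(v, (int, float)) for v in present):
        return DataType.NUMERIC
    return DataType.UNKNOWN
```

With the default threshold of 10, a list of 100 distinct integers is classified as NUMERIC, while a list with three distinct labels is classified as CATEGORICAL.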
edvart.decorators module
- edvart.decorators.check_index_time_ascending(func)[source]
Check whether the index of the DataFrame is sorted in ascending order.
The DataFrame which is checked needs to be the first argument of the decorated function.
- Raises:
ValueError – If the index is not a datetime or is not ascending.
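A minimal sketch of the same idea, assuming the first argument is a plain list of datetime objects rather than a DataFrame index. The names check_time_ascending_sketch and span_in_days are hypothetical and are not part of edvart.

```python
import functools
from datetime import datetime
from typing import Callable, Sequence


def check_time_ascending_sketch(func: Callable) -> Callable:
    """Validate the first argument before calling the decorated function."""

    @functools.wraps(func)
    def wrapper(timestamps: Sequence[datetime], *args, **kwargs):
        if not all(isinstance(t, datetime) for t in timestamps):
            raise ValueError("Index is not a datetime.")
        # Compare each element with its successor.
        pairs = zip(timestamps, timestamps[1:])
        if any(later < earlier for earlier, later in pairs):
            raise ValueError("Index is not ascending.")
        return func(timestamps, *args, **kwargs)

    return wrapper


@check_time_ascending_sketch
def span_in_days(timestamps: Sequence[datetime]) -> int:
    """Example function that relies on ascending order."""
    return (timestamps[-1] - timestamps[0]).days
```

Calling span_in_days with timestamps in descending order raises ValueError before the function body runs.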
edvart.example_datasets module
- edvart.example_datasets.dataset_auto() DataFrame [source]
Returns a dataset containing information on the technical specifications of cars. The dataset contains 398 rows and 9 columns. There are no missing values.
Source: https://www.kaggle.com/datasets/uciml/autompg-dataset Unmodified.
- Return type:
pd.DataFrame
- edvart.example_datasets.dataset_global_temp() DataFrame [source]
Returns a time-series dataset containing monthly deviations from mean global average temperature from 1880 until 2016 according to two methodologies: GCAG and GISTEMP.
Source: https://datahub.io/core/global-temp. Modified: Moved each methodology into its own column.
- edvart.example_datasets.dataset_meteorite_landings() DataFrame [source]
Returns a dataset that includes the location, mass, composition, and fall year for over 45,000 meteorites that have struck our planet.
Source of data: NASA Open Data Portal (https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)
- Return type:
pd.DataFrame
- edvart.example_datasets.dataset_pollution() DataFrame [source]
Returns a time-series dataset containing hourly weather and pollution data from 2010 until 2014 from Beijing, China. There are 43800 rows and 8 columns.
Source: https://www.kaggle.com/datasets/djhavera/beijing-pm25-data-data-set
Modified:
- Merged columns “year”, “month”, “day”, “hour” into a single “date” column.
- Removed the first day, for which pollution data is missing.
- Renamed columns.
- edvart.example_datasets.dataset_titanic() DataFrame [source]
Returns a dataset that contains data for 891 of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived, their age, their passenger-class, their sex and the fare they paid.
The dataset contains 891 rows and 12 columns.
Source: https://www.kaggle.com/datasets/hesh97/titanicdataset-traincsv Unmodified.
- Return type:
pd.DataFrame
edvart.export_utils module
- edvart.export_utils.embed_image_base64(image_path: str, mime: str = 'image/png') str [source]
Loads content of an image and embeds it as base64 data URL into the template. Intended to be used as a Jinja filter.
Example Jinja filter usage (CSS):
```
#notebook-container {
    background-image: url('{{ 'background.png' | embed_image_base64('image/png') }}');
}
```
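How a base64 data URL is built can be sketched with the standard library alone. The function below is a hypothetical stand-in that follows the documented signature; the library's actual implementation may differ.

```python
import base64
from pathlib import Path


def embed_image_base64_sketch(image_path: str, mime: str = "image/png") -> str:
    """Read a file and return its content as a base64 data URL."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be placed directly inside a CSS url(...) expression, which is what the Jinja filter usage relies on.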
edvart.pandas_formatting module
- edvart.pandas_formatting.add_html_heading(html: str, heading: str, heading_level: int = 2) str [source]
Adds a heading to an HTML string with the specified text and heading level.
- edvart.pandas_formatting.dict_to_html(dictionary: Dict[str, Any]) str [source]
Converts a dictionary to a dataframe in HTML string form.
- Parameters:
dictionary (Dict[str, Any]) – Dictionary to be converted.
- edvart.pandas_formatting.format_number(number: int | float, decimal_places: int = 2, thousand_separator: str = '') str [source]
Formats a number by truncating decimal places (if it is a float) and optionally adds thousand separators.
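A minimal sketch of this behaviour, assuming "truncating" means cutting off digits without rounding. The helper name format_number_sketch is hypothetical, and the library may round or format differently.

```python
from decimal import ROUND_DOWN, Decimal
from typing import Union


def format_number_sketch(
    number: Union[int, float], decimal_places: int = 2, thousand_separator: str = ""
) -> str:
    """Truncate floats to decimal_places and insert thousand separators."""
    if isinstance(number, float):
        # Decimal(str(...)) avoids binary floating point artifacts such as
        # 0.29 * 100 == 28.999999999999996.
        step = Decimal(1).scaleb(-decimal_places)
        truncated = Decimal(str(number)).quantize(step, rounding=ROUND_DOWN)
        text = f"{truncated:,f}"
    else:
        text = f"{number:,}"
    # Format with "," first, then swap in the requested separator.
    return text.replace(",", thousand_separator)
```

For example, format_number_sketch(1234.5678, 2, ",") keeps two decimals without rounding up and groups the integer part by thousands.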
- edvart.pandas_formatting.hide_index(df: DataFrame) Styler [source]
Hides the index of a DataFrame.
- Parameters:
df (pd.DataFrame) – DataFrame where the index should be hidden.
- Returns:
Styler object with the index hidden.
- Return type:
Styler
- edvart.pandas_formatting.render_dictionary(dictionary: Dict[str, Any]) None [source]
Converts a dictionary to a dataframe and renders that dataframe in the report notebook.
- Parameters:
dictionary (Dict[str, Any]) – Dictionary to be rendered.
- edvart.pandas_formatting.series_to_frame(series: Series, index_name: str, column_name: str) DataFrame [source]
Converts a pandas.Series to a pandas.DataFrame by putting the series index into a separate column.
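- Parameters:
series (pd.Series) – Series to convert.
index_name (str) – Name of the column that holds the values of series.index.
column_name (str) – Name of the column that holds the values of series.values.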
- Returns:
Dataframe with two columns index_name and column_name with values of series.index and series.values respectively
- Return type:
pd.DataFrame
- edvart.pandas_formatting.subcells_html(elements: List[List[str]]) str [source]
Returns HTML table in string format according to the elements matrix.
- Parameters:
elements (List[List[str]]) – Elements which should be rendered in table cells, outer list represents rows, inner list represents columns, elements themselves should be HTML strings
- Returns:
Table in HTML string ready to be rendered for example by IPython.display.display_html
- Return type:
str
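The mapping from the elements matrix to a table can be sketched as follows. This is an illustrative stand-in, assuming a bare table without styling; the markup produced by the library may contain additional attributes.

```python
from typing import List


def subcells_html_sketch(elements: List[List[str]]) -> str:
    """Render a matrix of HTML strings as an HTML table string."""
    rows = []
    for row in elements:  # outer list: table rows
        cells = "".join(f"<td>{cell}</td>" for cell in row)  # inner list: columns
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"
```

Because the elements are inserted verbatim, each one can itself be a rendered HTML fragment, such as a DataFrame converted to HTML.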
edvart.plots module
- edvart.plots.scatter_plot_2d(df: DataFrame, x: str | Series | ndarray, y: str | Series | ndarray, color_col: str | None = None, interactive: bool = True, figsize: Tuple[float, float] = (12, 12), opacity: float = 0.8, xlabel: str | None = None, ylabel: str | None = None, show_xticks: bool = False, show_yticks: bool = False, show_zerolines: bool = False, equal_scale_axes: bool = False) None [source]
Display a 2D scatter plot of x and y, with optional coloring of points by values in a column.
- Parameters:
df (pd.DataFrame) – Data to plot.
x (Union[str, pd.Series, np.ndarray]) – Name of column in df or flat array or series of x coordinates of plotted points.
y (Union[str, pd.Series, np.ndarray]) – Name of column in df or flat array or series of y coordinates of plotted points.
color_col (str, optional) – Name of column in df to color points in the plot by. Can be both numeric or categorical. By default, all points have the same color.
interactive (bool (default = True)) – Whether to plot an interactive plot. The interactive plot also shows labels for each point on hover.
figsize (Tuple[float, float] (default = (12, 12))) – Size of the resulting plot in inches.
opacity (float (default = 0.8)) – Opacity of the points drawn in the scatter plot.
xlabel (str, optional) – Label for the x axis. No label is displayed by default.
ylabel (str, optional) – Label for the y axis. No label is displayed by default.
show_xticks (bool (default = False)) – Whether to display ticks on the x axis.
show_yticks (bool (default = False)) – Whether to display ticks on the y axis.
show_zerolines (bool (default = False)) – Whether to display zero lines.
equal_scale_axes (bool (default = False)) – Whether to make the x and y axes have the same scale.
edvart.report module
- class edvart.report.DefaultReport(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW, verbosity_overview: Verbosity | None = None, verbosity_univariate_analysis: Verbosity | None = None, verbosity_bivariate_analysis: Verbosity | None = None, verbosity_multivariate_analysis: Verbosity | None = None, verbosity_group_analysis: Verbosity | None = None, columns_overview: List[str] | None = None, columns_univariate_analysis: List[str] | None = None, columns_bivariate_analysis: List[str] | None = None, columns_multivariate_analysis: List[str] | None = None, columns_group_analysis: List[str] | None = None, groupby: str | List[str] | None = None)[source]
Bases:
Report
A report for tabular data containing default sections.
The report contains the following sections:
- dataset overview
- univariate analysis
- bivariate analysis
- multivariate analysis
- group analysis (if groupby is specified)
- Parameters:
dataframe (pd.DataFrame) – Data from which to generate the report.
verbosity (Verbosity (default = Verbosity.LOW)) – The default verbosity for the exported code of the entire report.
verbosity_overview (Verbosity, optional) – Verbosity of the overview section
verbosity_univariate_analysis (Verbosity, optional) – Verbosity of the univariate analysis section
verbosity_bivariate_analysis (Verbosity, optional) – Verbosity of the bivariate analysis section.
verbosity_multivariate_analysis (Verbosity, optional) – Verbosity of the multivariate analysis section
verbosity_group_analysis (Verbosity, optional) – Verbosity of the group analysis section
columns_overview (List[str], optional) – Subset of columns to use in overview section
columns_univariate_analysis (List[str], optional) – Subset of columns to use in univariate analysis section
columns_bivariate_analysis (List[str], optional) – Subset of columns to use in bivariate analysis section
columns_multivariate_analysis (List[str], optional) – Subset of columns to use in multivariate analysis section
columns_group_analysis (List[str], optional) – Subset of columns to use in group analysis section
groupby (Union[str, List[str]], optional) – Column or list of columns to group by in group analysis. If None, group analysis will not be included by default. It can still be added later using add_group_analysis. If a single column is specified, it will be used to color points in multivariate analysis. Default: None.
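A usage sketch that combines the documented pieces: an example dataset, DefaultReport and export_notebook. The wrapper function and the output file name are hypothetical. The edvart imports are placed inside the function so that merely defining it does not require edvart to be installed.

```python
from typing import List, Optional, Union


def export_titanic_report(
    notebook_filepath: str = "titanic_report.ipynb",
    groupby: Optional[Union[str, List[str]]] = None,
) -> None:
    """Build a default report for the Titanic example dataset and export it."""
    # Lazy imports: edvart is only needed when the function is called.
    from edvart.example_datasets import dataset_titanic
    from edvart.report import DefaultReport

    df = dataset_titanic()
    # With groupby=None the group analysis section is left out by default.
    report = DefaultReport(df, groupby=groupby)
    report.export_notebook(notebook_filepath, dataset_name="Titanic")
```

Passing a column name as groupby adds the group analysis section and, for a single column, colors the points in multivariate analysis.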
- class edvart.report.DefaultTimeseriesReport(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW, verbosity_overview: Verbosity | None = None, verbosity_univariate_analysis: Verbosity | None = None, verbosity_timeseries_analysis: Verbosity | None = None, columns_overview: List[str] | None = None, columns_univariate_analysis: List[str] | None = None, columns_timeseries_analysis: List[str] | None = None, sampling_rate: Verbosity | None = None, stft_window_size: Verbosity | None = None)[source]
Bases:
TimeseriesReport
A default report for time series data.
The report contains the following sections:
- dataset overview
- univariate analysis
- timeseries analysis
- Parameters:
dataframe (pd.DataFrame) – Data from which to generate the report. Data needs to be indexed by time: pd.DatetimeIndex or pd.PeriodIndex. The data is assumed to be sorted according to the time index in ascending order.
verbosity (Verbosity (default = Verbosity.LOW)) – The default verbosity for the exported code of the entire report.
verbosity_overview (Verbosity, optional) – Verbosity of the overview section
verbosity_univariate_analysis (Verbosity, optional) – Verbosity of the univariate analysis section
verbosity_timeseries_analysis (Verbosity, optional) – Verbosity of the timeseries analysis section
columns_overview (List[str], optional) – Subset of columns to use in overview section
columns_univariate_analysis (List[str], optional) – Subset of columns to use in univariate analysis section
columns_timeseries_analysis (List[str], optional) – Subset of columns to use in timeseries analysis section
sampling_rate (int, optional) – Sampling rate for Fourier transform and Short-time Fourier transform subsections. Determines frequency unit for analysis of frequencies, for example with monthly data and sampling rate 12, yearly frequency spectrum is produced. If not set, these two sections will not be included.
stft_window_size (int, optional) – Window size for short-time Fourier transform subsection. If not set, STFT will be excluded.
- exception edvart.report.EmptyReportWarning[source]
Bases:
UserWarning
Warning raised when a report contains no sections.
- class edvart.report.ExportDataMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
StrEnum
Data export mode for the report.
- EMBED = 'embed'
- FILE = 'file'
- NONE = 'none'
- class edvart.report.Report(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW)[source]
Bases:
ReportBase
A report for tabular datasets. Contains no sections by default.
See DefaultReport for a report with default sections. See methods add_* for adding sections to the report.
- Parameters:
dataframe (pd.DataFrame) – Data from which to generate the report.
verbosity (Verbosity (default = Verbosity.LOW)) – Verbosity of the exported code of the entire report.
- class edvart.report.ReportBase(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW)[source]
Bases:
ABC
Abstract base class for reports.
- Parameters:
dataframe (pd.DataFrame) – Data from which to generate the report.
verbosity (Verbosity (default = Verbosity.LOW)) – The default verbosity for the exported code of the entire report.
- add_bivariate_analysis(columns: List[str] | None = None, columns_x: List[str] | None = None, columns_y: List[str] | None = None, columns_pairs: List[Tuple[str, str]] | None = None, subsections: List[BivariateAnalysisSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_correlations: Verbosity | None = None, verbosity_pairplot: Verbosity | None = None, verbosity_contingency_table: Verbosity | None = None, color_col: str | None = None) ReportBase [source]
Adds bivariate analysis section to the report.
- Parameters:
columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
columns_x (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_x is specified, then columns_y must also be specified.
columns_y (List[str], optional) – If specified, correlations and pairplots are performed on the cartesian product of columns_x and columns_y. If columns_y is specified, then columns_x must also be specified.
columns_pairs (List[Tuple[str, str]], optional) – List of column pairs on which to perform bivariate analysis. Used primarily in contingency tables. If specified, columns, columns_x and columns_y are ignored in contingency tables. In pairplots and correlations, columns_pairs is used only if none of columns, columns_x and columns_y is specified. In that case, the first elements of each pair are treated as columns_x and the second elements as columns_y.
subsections (List[BivariateAnalysisSubsection], optional) – List of sub-sections to include into the BivariateAnalysis section. If None, all subsections are added.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
verbosity_correlations (Verbosity, optional) – Correlation plots subsection code verbosity.
verbosity_pairplot (Verbosity, optional) – Pairplot subsection code verbosity.
verbosity_contingency_table (Verbosity, optional) – Contingency table code verbosity.
color_col (str, optional) – Name of column to use for coloring of the bivariate analysis subsections. Coloring is currently supported in pairplot.
- add_group_analysis(groupby: str | List[str], columns: List[str] | None = None, verbosity: Verbosity | None = None, show_within_group_statistics: bool = True, show_group_missing_values: bool = True, show_group_distribution_plots: bool = True) ReportBase [source]
Add group analysis section to the report.
- Parameters:
groupby (Union[str, List[str]]) – Column or list of columns to group by in group analysis.
columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
show_within_group_statistics (bool (default = True)) – Whether to show per-group statistics.
show_group_missing_values (bool (default = True)) – Whether to show per-group missing values.
show_group_distribution_plots (bool (default = True)) – Whether to show per-group distribution plots.
- add_multivariate_analysis(columns: List[str] | None = None, subsections: List[MultivariateAnalysisSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_pca: Verbosity | None = None, verbosity_umap: Verbosity | None = None, verbosity_parallel_coordinates: Verbosity | None = None, verbosity_parallel_categories: Verbosity | None = None, color_col: str | None = None) ReportBase [source]
Add multivariate analysis section to the report.
- Parameters:
columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
subsections (List[MultivariateAnalysisSubsection], optional) – List of sub-sections to include into the MultivariateAnalysis section. If None, all subsections are added.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
verbosity_pca (Verbosity, optional) – Principal component analysis subsection code verbosity.
verbosity_umap (Verbosity, optional) – UMAP subsection code verbosity.
verbosity_parallel_coordinates (Verbosity, optional) – Parallel coordinates subsection code verbosity.
verbosity_parallel_categories (Verbosity, optional) – Parallel categories subsection code verbosity.
color_col (str, optional) – Name of column to use for coloring of the multivariate analysis subsections. The exact method of coloring depends on each particular subsection.
- add_overview(columns: List[str] | None = None, subsections: List[OverviewSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_quick_info: Verbosity | None = None, verbosity_data_types: Verbosity | None = None, verbosity_data_preview: Verbosity | None = None, verbosity_missing_values: Verbosity | None = None, verbosity_rows_with_missing_value: Verbosity | None = None, verbosity_constant_occurrence: Verbosity | None = None, verbosity_duplicate_rows: Verbosity | None = None) ReportBase [source]
Adds a dataset overview section to the report.
- Parameters:
columns (List[str], optional) – Columns which to include in the overview section. If None, all columns are used.
subsections (List[Overview.OverviewSubsection], optional) – List of sub-sections to include into the Overview section. If None, all subsections are added.
verbosity (Verbosity, optional) – Generated code verbosity global to the Overview sections. If subsection verbosities are None, then they will be overridden by this parameter.
verbosity_quick_info (Verbosity, optional) – Quick info sub-section code verbosity.
verbosity_data_types (Verbosity, optional) – Data types sub-section code verbosity.
verbosity_data_preview (Verbosity, optional) – Data preview sub-section code verbosity.
verbosity_missing_values (Verbosity, optional) – Missing values sub-section code verbosity.
verbosity_rows_with_missing_value (Verbosity, optional) – Rows with missing value sub-section code verbosity.
verbosity_constant_occurrence (Verbosity, optional) – Constant values occurrence sub-section code verbosity.
verbosity_duplicate_rows (Verbosity, optional) – Duplicate rows sub-section code verbosity.
- add_section(section: Section) ReportBase [source]
Add a section to the report. See edvart.report_sections for available sections.
- Parameters:
section (Section) – Section to add to the report.
- Returns:
Returns self.
- Return type:
ReportBase
- add_table_of_contents(include_subsections: bool = True) ReportBase [source]
Adds table of contents section to the report.
- Parameters:
include_subsections (bool) – A boolean controlling whether the subsections should be included in the table of contents. However, they won’t be included in an exported notebook created by report’s export_notebook function.
- add_univariate_analysis(columns: List[str] | None = None, verbosity: Verbosity | None = None) ReportBase [source]
Adds univariate analysis section to the report.
- export_html(html_filepath: str, template_name: str | None = None, template_filepath: str | None = None, dataset_name: str = '[INSERT DATASET NAME]', dataset_description: str = '[INSERT DATASET DESCRIPTION]', timeout: int = 120) None [source]
Generate HTML report for an already-loaded DataFrame.
- Parameters:
html_filepath (str) – File path to save exported HTML report to.
template_name (str, optional) –
Path to template file to use for exporting the notebook to HTML.
The template must be found in a Jupyter path (see https://nbconvert.readthedocs.io/en/latest/customizing.html#where-are-nbconvert-templates-installed ). The default location is $HOME/.local/share/jupyter/nbconvert/templates
template_filepath (str, optional) – Template to use when exporting the HTML report.
dataset_name (str (default = "[INSERT DATASET NAME]")) – Name of dataset to be used in the title of the report.
dataset_description (str (default = "[INSERT DATASET DESCRIPTION]")) – Description of dataset to be used below the title of the report.
timeout (int (default = 120)) – Maximum number of seconds to wait for a cell to finish execution.
- export_notebook(notebook_filepath: str | PathLike, dataset_name: str = '[INSERT DATASET NAME]', dataset_description: str = '[INSERT DATASET DESCRIPTION]', export_data_mode: ExportDataMode = ExportDataMode.NONE) None [source]
Exports the report as an .ipynb file.
- Parameters:
notebook_filepath (str or PathLike) – Filepath of the exported notebook.
dataset_name (str (default = "[INSERT DATASET NAME]")) – Name of dataset to be used in the title of the report.
dataset_description (str (default = "[INSERT DATASET DESCRIPTION]")) – Description of dataset to be used below the title of the report.
export_data_mode (ExportDataMode (default = ExportDataMode.NONE)) – Mode for exporting the data to the notebook. If ExportDataMode.NONE, the data is not exported to the notebook. If ExportDataMode.FILE, the data is exported to a parquet file and loaded from there. If ExportDataMode.EMBED, the data is embedded into the notebook as a base64 string.
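Since the add_* methods are documented as returning the report, sections can be chained before exporting. The sketch below assumes df is an already-loaded DataFrame; the wrapper function, the file name and the dataset name are hypothetical. The edvart import is placed inside the function so that defining it does not require edvart to be installed.

```python
def export_custom_report(df, notebook_filepath: str = "report.ipynb") -> None:
    """Build a report from selected sections and export it with embedded data."""
    # Lazy import: edvart is only needed when the function is called.
    from edvart.report import ExportDataMode, Report

    report = (
        Report(df)  # contains no sections by default
        .add_table_of_contents()
        .add_overview()
        .add_univariate_analysis()
    )
    report.export_notebook(
        notebook_filepath,
        dataset_name="My dataset",
        # EMBED stores the data in the notebook as a base64 string.
        export_data_mode=ExportDataMode.EMBED,
    )
```

ExportDataMode.FILE would instead write the data to a parquet file that the notebook loads, and ExportDataMode.NONE leaves data loading to the reader of the notebook.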
- class edvart.report.TimeseriesReport(dataframe: DataFrame, verbosity: Verbosity = Verbosity.LOW)[source]
Bases:
ReportBase
A report for time-series data. Contains no sections by default.
See DefaultTimeseriesReport for a time-series report with default sections. See methods add_* for adding sections to the report.
- Raises:
ValueError – If the input dataframe is not indexed by time.
- add_timeseries_analysis(columns: List[str] | None = None, subsections: List[TimeseriesAnalysisSubsection] | None = None, verbosity: Verbosity | None = None, verbosity_time_series_line_plot: Verbosity | None = None, verbosity_rolling_statistics: Verbosity | None = None, verbosity_boxplots_over_time: Verbosity | None = None, verbosity_seasonal_decomposition: Verbosity | None = None, verbosity_autocorrelation: Verbosity | None = None, verbosity_stationarity_tests: Verbosity | None = None, verbosity_fourier_transform: Verbosity | None = None, verbosity_short_time_ft: Verbosity | None = None, sampling_rate: Verbosity | None = None, stft_window_size: Verbosity | None = None) TimeseriesReport [source]
Add timeseries analysis section to the report.
- Parameters:
columns (List[str], optional) – Columns which to analyze. If None, all columns are used.
subsections (List[TimeseriesAnalysis.TimeseriesAnalysisSubsection], optional) – List of sub-sections to include into the TimeseriesAnalysis section. If None, all subsections are added.
verbosity (Verbosity, optional) – The verbosity of the code generated in the exported notebook.
verbosity_time_series_line_plot (Verbosity, optional) – Time series line plot subsection code verbosity.
verbosity_rolling_statistics (Verbosity, optional) – Rolling statistics interactive plot subsection code verbosity.
verbosity_boxplots_over_time (Verbosity, optional) – Boxplots grouped over time intervals plot subsection code verbosity.
verbosity_seasonal_decomposition (Verbosity, optional) – Decomposition into trend, seasonal and residual components code verbosity.
verbosity_autocorrelation (Verbosity, optional) – Autocorrelation and partial autocorrelation vs. lag code verbosity.
verbosity_stationarity_tests (Verbosity, optional) – Stationarity tests code verbosity.
verbosity_fourier_transform (Verbosity, optional) – Fourier transform and short-time Fourier transform code verbosity.
verbosity_short_time_ft (Verbosity, optional) – Short-time Fourier transform spectrogram code verbosity.
sampling_rate (int, optional) – Sampling rate for Fourier transform and Short-time Fourier transform subsections. Needs to be set in order for these two subsections to be included.
stft_window_size (int, optional) – Window size for Short-time Fourier transform. Needs to be set in order for the STFT subsection to be included.
edvart.utils module
- edvart.utils.coefficient_of_variation(series: Series) float [source]
Return coefficient of variation.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.contingency_table(df: DataFrame) DataFrame [source]
Return contingency table.
- Parameters:
df (pd.DataFrame) –
- Return type:
pd.DataFrame
- edvart.utils.env_var(name: str, value: str) Iterator[None] [source]
Set an environment variable for the duration of the context.
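A context manager of this kind can be sketched with contextlib. The version below is an assumption-based illustration, including the choice to restore a previous value on exit; it is not the library's code.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def env_var_sketch(name: str, value: str) -> Iterator[None]:
    """Temporarily set an environment variable, then restore the old state."""
    missing = object()  # sentinel distinguishing "unset" from an empty string
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Inside the with block the variable is visible to the current process and to any child process started from it; afterwards the original state is back in place.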
- edvart.utils.get_default_discrete_colorscale(n_colors: int) List[Tuple[float, str]] [source]
Get a default Plotly-compatible colorscale of n discrete colors.
- edvart.utils.hsl_wheel_colorscale(n: int, saturation=0.5, lightness=0.5) Iterable[str] [source]
Generate a colorscale of n discrete colors.
Colors are equally spaced around the complete HSL wheel with constant saturation and lightness.
- Returns:
An iterable of n plotly-compatible HSL strings.
- Return type:
Iterable[str]
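Equal spacing around the HSL wheel can be sketched as follows. The exact string format (rounded hue in degrees, percentages without decimals) and the starting hue of 0 are assumptions, so the library's output may differ.

```python
from typing import Iterable


def hsl_wheel_colorscale_sketch(
    n: int, saturation: float = 0.5, lightness: float = 0.5
) -> Iterable[str]:
    """Yield n HSL color strings with hues equally spaced over 360 degrees."""
    for i in range(n):
        hue = 360 * i / n
        yield f"hsl({hue:.0f},{saturation:.0%},{lightness:.0%})"
```

For n = 3 the hues are 0, 120 and 240 degrees, all with the same saturation and lightness.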
- edvart.utils.iqr(series: Series) float [source]
Return interquartile range.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
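The interquartile range is the difference between the third and the first quartile. A minimal standard-library sketch, assuming linear interpolation between data points (the "inclusive" method of statistics.quantiles); the interpolation rule used by the library is not stated here.

```python
import statistics
from typing import Sequence


def iqr_sketch(values: Sequence[float]) -> float:
    """Return Q3 - Q1 using linear interpolation between data points."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1
```

For the values 1 through 8 this gives Q1 = 2.75 and Q3 = 6.25, so the interquartile range is 3.5.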
- edvart.utils.kendall(df: DataFrame) DataFrame [source]
Return Kendall correlation coefficient.
- Parameters:
df (pd.DataFrame) – DataFrame on which the stat should be calculated.
- Return type:
pd.DataFrame
- edvart.utils.kurtosis(series: Series) Any [source]
Return kurtosis.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
Any
- edvart.utils.mad(series: Series) Any [source]
Return mean absolute deviation.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
Any
- edvart.utils.make_discrete_colorscale(colorscale: List[str], n_colors: int) Iterable[Tuple[float, str]] [source]
Generate a colorscale of n discrete colors for use in plotly.graph_objects.
Note that when using plotly.express, the parameter color_discrete_sequence can be used instead.
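- Parameters:
colorscale (List[str]) – Colors from which to build the discrete colorscale.
n_colors (int) – Number of discrete colors in the resulting colorscale.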
- Returns:
An iterable of 2n tuples, where each tuple contains a value between 0 and 1 (the values are equally spaced in the interval and each value appears twice), and one of the colors from the colorscale.
- Return type:
Iterable[Tuple[float, str]]
Examples
>>> list(make_discrete_colorscale(["red", "green", "blue"], 4))
[
    (0, "red"), (0.25, "red"),
    (0.25, "green"), (0.5, "green"),
    (0.5, "blue"), (0.75, "blue"),
    (0.75, "red"), (1, "red")
]
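The documented example can be reproduced with a short sketch. It assumes that colors are reused cyclically when n_colors exceeds the length of the colorscale, which is what the example output suggests; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def make_discrete_colorscale_sketch(
    colorscale: List[str], n_colors: int
) -> Iterable[Tuple[float, str]]:
    """Yield 2 * n_colors (position, color) tuples forming constant color bands."""
    for i in range(n_colors):
        color = colorscale[i % len(colorscale)]  # wrap around if too few colors
        yield (i / n_colors, color)  # start of the band
        yield ((i + 1) / n_colors, color)  # end of the band
```

Each color appears at both ends of its interval, so the scale is piecewise constant instead of a gradient.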
- edvart.utils.maximum(series: Series) float [source]
Return maximum.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.mean(series: Series) float [source]
Return mean.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.median(series: Series) float [source]
Return median.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.median_absolute_deviation(series: Series) float [source]
Return median absolute deviation.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
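The difference between mad (mean absolute deviation) and median_absolute_deviation can be illustrated with the standard library. Both helpers are hypothetical sketches of the textbook definitions, without any scaling constant; whether the library applies one is not stated here.

```python
import statistics
from typing import Sequence


def mean_absolute_deviation_sketch(values: Sequence[float]) -> float:
    """Mean of the absolute deviations from the mean."""
    center = statistics.fmean(values)
    return statistics.fmean(abs(v - center) for v in values)


def median_absolute_deviation_sketch(values: Sequence[float]) -> float:
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)
```

For the values 1, 2, 3, 4, 100 the mean absolute deviation is 31.2, while the median absolute deviation is 1, which shows how much less the median-based statistic reacts to a single outlier.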
- edvart.utils.minimum(series: Series) float [source]
Return minimum.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.mode(series: Series) float [source]
Return mode.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Returns:
The most frequent value. float('nan') if the series contains only null values.
- Return type:
float
- edvart.utils.num_unique_values(series: Series) int [source]
Return number of unique values.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
int
- edvart.utils.pearson(df: DataFrame) DataFrame [source]
Return Pearson correlation coefficient.
- Parameters:
df (pd.DataFrame) – DataFrame on which the stat should be calculated.
- Return type:
pd.DataFrame
- edvart.utils.quartile1(series: Series) float [source]
Return first quartile.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.quartile3(series: Series) float [source]
Return third quartile.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.reindex_to_datetime(df: DataFrame, datetime_column: str, keep_index: str | None = None, unit: str = 'ns', origin: str | Timestamp = 'unix', sort: bool = True) DataFrame [source]
Reindex a given DataFrame to be indexed by a pd.DatetimeIndex.
- Parameters:
df (pd.DataFrame) – DataFrame to reindex.
datetime_column (str) – Which column containing datetimes to index by.
keep_index (str, optional) – Name of column to store the original index. The original index will be discarded by default.
unit (str (default = "ns")) – Numeric values would be parsed as number of units from origin.
origin (Union[str, pd.Timestamp] (default = "unix")) – Define the reference date. Numeric values would be parsed as number of units (defined by unit) since this reference date. By default unix epoch 1970-01-01.
sort (bool (default = True)) – Whether to sort according to the index.
- Returns:
Reindexed df.
- Return type:
pd.DataFrame
- edvart.utils.reindex_to_period(df: DataFrame, period_column: str, freq: str | None, keep_index: str | None = None, sort: bool = True) DataFrame [source]
Reindex a given DataFrame to be indexed by a pd.PeriodIndex.
- Parameters:
df (pd.DataFrame) – DataFrame to reindex.
period_column (str) – Which column containing periods to index by.
freq (Union[str, pd.Offset]) – One of pandas’ offset strings or an Offset object. Will be inferred by default.
keep_index (str, optional) – Name of column to store the original index. The original index will be discarded by default.
sort (bool (default = True)) – Whether to sort according to the index.
- Returns:
Reindexed df.
- Return type:
pd.DataFrame
- edvart.utils.select_numeric_columns(df: DataFrame, columns: List[str] | None) List[str] [source]
Select all numeric columns from a DataFrame if columns is None, or check if all specified columns are numeric if columns is a list of column names.
- Parameters:
df (pd.DataFrame) – DataFrame to select or check columns from.
columns (List[str], optional) – Specified columns.
- Returns:
List of numeric or specified columns
- Return type:
List[str]
- Raises:
ValueError – If a non-numeric column is specified in columns.
- edvart.utils.skewness(series: Series) Any [source]
Return skewness.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
Any
- edvart.utils.spearman(df: DataFrame) DataFrame [source]
Return Spearman correlation coefficient.
- Parameters:
df (pd.DataFrame) – DataFrame on which the stat should be calculated.
- Return type:
pd.DataFrame
- edvart.utils.std(series: Series) float [source]
Return standard deviation.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.sum_(series: Series) float [source]
Return sum.
- Parameters:
series (pd.Series) – Series on which the stat should be calculated.
- Return type:
float
- edvart.utils.top_frequent_values(series: Series, n_top: int = 10) Mapping[str, Any] [source]
Counts top n most frequent values in series along with other value counts and NULL value counts.
- Parameters:
series (pd.Series) – Input series of data for which frequencies will be calculated
n_top (int) – Number of values for which actual frequencies will be counted, other values will be grouped into ‘Other’ category
- Returns:
result_dict – Dictionary with the mapping {'value': 'frequency (relative frequency)'}
- Return type:
Dict
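The counting scheme can be sketched over a plain list, with None standing in for NULL values. The key names "Other" and "Null" and the exact layout of the frequency strings are assumptions; only the general mapping shape comes from the description above.

```python
from collections import Counter
from typing import Any, Dict, Optional, Sequence


def top_frequent_values_sketch(
    values: Sequence[Optional[Any]], n_top: int = 10
) -> Dict[str, str]:
    """Map the n_top most frequent values to 'count (relative frequency)'."""
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    top = counts.most_common(n_top)
    # Everything that is neither in the top list nor null goes into "Other".
    other_count = total - null_count - sum(count for _, count in top)

    def fmt(count: int) -> str:
        return f"{count} ({count / total:.2%})"

    result = {str(value): fmt(count) for value, count in top}
    if other_count:
        result["Other"] = fmt(other_count)
    if null_count:
        result["Null"] = fmt(null_count)
    return result
```

With n_top = 2, the input "a", "a", "a", "b", "b", "c", None, None reports "a" and "b" individually, groups "c" under "Other", and counts the two missing entries separately.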
Module contents
EDVART package.