benford package¶
benford.benford module¶
-
class
benford.benford.
Base
(data, decimals, sign='all', sec_order=False)[source]¶ Bases:
pandas.core.frame.DataFrame
Internalizes and prepares the data for Analysis.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
- Raises
TypeError – if not receiving int or float as input.
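The preparation step described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the function name prepare is hypothetical, rounding after scaling is an assumption, and the -infer- option is not covered.

```python
def prepare(values, decimals=2, sign="all"):
    """Filter a sequence by sign, drop zeros, and scale to integers.

    Illustrative only: the rounding rule is an assumption and the
    '-infer-' option is not handled.
    """
    if sign not in ("all", "pos", "neg"):
        raise ValueError("sign must be 'all', 'pos' or 'neg'")
    prepared = []
    for value in values:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("values must be integers or floats")
        if value == 0:
            continue  # zeros are never analysed
        if sign == "pos" and value < 0:
            continue
        if sign == "neg" and value > 0:
            continue
        # Shift the chosen decimal places into the integer part
        scaled = int(round(abs(value) * 10 ** decimals))
        if scaled > 0:
            prepared.append(scaled)
    return prepared
```

With decimals=2, a value such as 12.5 becomes 1250, so the digit tests can work on integers only.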
-
class
benford.benford.
Test
(base, digs, confidence, limit_N=None, sec_order=False)[source]¶ Bases:
pandas.core.frame.DataFrame
Transforms the original number sequence into a DataFrame reduced by the occurrences of the chosen digits, creating other computed columns.
- Parameters
base – The Base object with the data prepared for Analysis
digs – Tells which test to perform: 1: first digit; 2: first two digits; 3: first three digits; 22: second digit; -2: last two digits.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
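For reference, the expected proportions behind each digs code can be sketched as follows. The function name expected_proportions is hypothetical, the formulas are the standard Benford ones, and treating the last two digits as uniform is the usual approximation rather than something this page states.

```python
import math


def expected_proportions(digs):
    """Expected Benford proportions for a given test code.

    1, 2, 3: first one, two or three digits; 22: second digit;
    -2: last two digits (assumed uniform).
    """
    if digs in (1, 2, 3):
        low, high = 10 ** (digs - 1), 10 ** digs
        return {d: math.log10(1 + 1 / d) for d in range(low, high)}
    if digs == 22:
        # Sum the first-two-digits probabilities over every first digit
        return {
            d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
            for d2 in range(10)
        }
    if digs == -2:
        return {d: 1 / 100 for d in range(100)}
    raise ValueError("digs must be 1, 2, 3, 22 or -2")
```

The first digit 1 is expected in about 30.1% of the records, while the second digit 0 is expected in about 12.0%.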
-
N
¶ Number of records in the sample to consider in computations
-
ddf
¶ Degrees of Freedom to look up for the critical chi-square value
-
chi_square
¶ Chi-square statistic for the given test
-
KS
¶ Kolmogorov-Smirnov statistic for the given test
-
MAD
¶ Mean Absolute Deviation for the given test
-
confidence
¶ Confidence level to consider when setting some critical values
-
digs
¶ numerical representation of the test at hand. 1: F1D; 2: F2D; 3: F3D; 22: SD; -2: L2D.
- Type
int
-
sec_order
¶ True if the test is a Second Order one
- Type
bool
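The statistics listed above can be illustrated for the first digit test with textbook formulas. This is a sketch under assumptions: the function name first_digit_stats and the returned keys are hypothetical, and the library may compute KS or MAD with different conventions.

```python
import math


def first_digit_stats(counts):
    """Compute N, ddf, chi-square, KS and MAD from first-digit counts.

    counts maps each digit 1..9 to the number of records starting with it.
    """
    digits = range(1, 10)
    n = sum(counts.get(d, 0) for d in digits)
    expected = {d: math.log10(1 + 1 / d) for d in digits}
    found = {d: counts.get(d, 0) / n for d in digits}

    # Pearson chi-square on counts
    chi_square = sum(
        (counts.get(d, 0) - n * expected[d]) ** 2 / (n * expected[d])
        for d in digits
    )
    # Mean of the absolute deviations between proportions
    mad = sum(abs(found[d] - expected[d]) for d in digits) / len(digits)
    # Largest gap between the cumulative distributions
    ks = cum_found = cum_expected = 0.0
    for d in digits:
        cum_found += found[d]
        cum_expected += expected[d]
        ks = max(ks, abs(cum_found - cum_expected))

    return {"N": n, "ddf": len(digits) - 1, "chi_square": chi_square,
            "KS": ks, "MAD": mad}
```

For the first digit there are nine categories, hence eight degrees of freedom.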
-
update_confidence
(new_conf, check=True)[source]¶ Sets a new confidence level for the Benford object, so as to be used to produce critical values for the tests
- Parameters
new_conf – new confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics.
check – checks the value provided for the confidence. Defaults to True
-
property
critical_values
¶ a dictionary with the critical values for the test at hand, according to the current confidence level.
- Type
dict
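As an illustration of how a confidence level maps to a critical value, the sketch below derives a two-tailed critical Z from the standard normal distribution. The function name z_critical is hypothetical, and whether the library uses two-tailed values, or a lookup table instead, is an assumption.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-tailed critical Z for a confidence level given in percent."""
    if not 0 < confidence < 100:
        raise ValueError("confidence must be between 0 and 100")
    alpha = 1 - confidence / 100
    # Split alpha evenly between the two tails
    return NormalDist().inv_cdf(1 - alpha / 2)
```

A confidence of 95 gives a critical Z close to 1.96, and 99 gives roughly 2.576.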
-
show_plot
(save_plot=None, save_plot_kwargs=None)[source]¶ Draws the test plot.
- Parameters
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
-
report
(high_Z='pos', show_plot=True, save_plot=None, save_plot_kwargs=None)[source]¶ Handles the report specific to the test, considering its statistics and according to the current confidence level.
- Parameters
high_Z (str or int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the critical value or not.
show_plot – calls the show_plot method, to draw the test plot
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
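The Z scores that the report ranks can be sketched with the formula commonly cited in the Benford literature, including a continuity correction. The function name z_score is hypothetical, and both the correction and the way limit_N caps the sample size are assumptions about the library's behaviour.

```python
import math


def z_score(found_prop, expected_prop, n, limit_n=None):
    """Z score of the gap between a found and an expected proportion."""
    if limit_n is not None:
        n = min(n, limit_n)  # assumed meaning of limit_N: cap the sample size
    diff = abs(found_prop - expected_prop)
    correction = 1 / (2 * n)
    if diff > correction:
        diff -= correction  # continuity correction, applied only when smaller than the gap
    return diff / math.sqrt(expected_prop * (1 - expected_prop) / n)
```

Because N sits inside the square root, very large samples make even tiny deviations look significant, which is what capping N mitigates.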
-
class
benford.benford.
Summ
(base, test)[source]¶ Bases:
pandas.core.frame.DataFrame
Gets the base object and outputs a Summation test object
- Parameters
base – The Base object with the data prepared for Analysis
test – The test for which to compute the summation
-
MAD
¶ Mean Absolute Deviation for the test
-
confidence
¶ Confidence level to consider when setting some critical values
-
show_plot
(save_plot=None, save_plot_kwargs=None)[source]¶ Draws the Summation test plot
- Parameters
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
-
report
(high_diff=None, show_plot=True, save_plot=None, save_plot_kwargs=None)[source]¶ Gives the report on the Summation test.
- Parameters
high_diff – Number of records to show after ordering by the absolute differences between the found and the expected proportions
show_plot – calls the show_plot method, to draw the Summation test plot
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
-
class
benford.benford.
Mantissas
(data, confidence=95, limit_N=None)[source]¶ Bases:
object
Computes and holds the mantissas of the logarithms of the records
- Parameters
data – sequence to compute mantissas from. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column.
confidence – confidence level for computing the critical values to compare with some statistics
-
data
¶ pandas DataFrame with the mantissas
- Type
(DataFrame)
-
property
stats
¶
-
update_confidence
(new_conf, check=True)[source]¶ Sets a new confidence level for the Benford object, so as to be used to produce critical values for the tests
- Parameters
new_conf – new confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics.
check – checks the value provided for the confidence. Defaults to True
-
report
(show_plot=True, save_plot=None, save_plot_kwargs=None)[source]¶ Displays the Mantissas test stats.
- Parameters
show_plot – shows the Ordered Mantissas plot and the Arc Test plot. Defaults to True.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
-
show_plot
(figsize=(12, 6), save_plot=None, save_plot_kwargs=None)[source]¶ Plots the ordered mantissas and a line with the expected inclination.
- Parameters
figsize (tuple) – figure size dimensions
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
-
arc_test
(grid=True, figsize=12, save_plot=None, save_plot_kwargs=None)[source]¶ Adds two columns to the Mantissas DataFrame with their “X” and “Y” coordinates, plots them on a scatter plot and calculates the gravity center of the circle.
- Parameters
grid – show grid of the plot. Defaults to True.
figsize (int) – size of the figure to be displayed. Since it is a square, there is no need to provide a tuple, as is usually the case with matplotlib.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
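The mantissas and the arc test coordinates can be sketched without any plotting. The function names below are hypothetical, and taking the fractional part of log10 of the absolute value is an assumption about how the mantissas are obtained.

```python
import math


def mantissas(values):
    """Fractional part of log10(|x|) for every non-zero entry."""
    return [math.log10(abs(v)) % 1 for v in values if v != 0]


def arc_test(mants):
    """Map each mantissa onto the unit circle and return the gravity center."""
    xs = [math.cos(2 * math.pi * m) for m in mants]
    ys = [math.sin(2 * math.pi * m) for m in mants]
    return sum(xs) / len(xs), sum(ys) / len(ys)
```

For a Benford-compliant sequence the mantissas are spread evenly over [0, 1), so the points cover the circle uniformly and the gravity center falls close to (0, 0). The powers of two are a classic example.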
-
class
benford.benford.
Benford
(data, decimals=2, sign='all', confidence=95, mantissas=True, sec_order=False, summation=False, limit_N=None, verbose=True)[source]¶ Bases:
object
Initializes a Benford Analysis object and computes the proportions for the digits. The test DataFrames are attributes, i.e., obj.F1D is the First Digit DataFrame, obj.F2D the First Two Digits one, and so on: F3D for First Three Digits, SD for Second Digit and L2D for Last Two Digits.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a tuple with a pandas DataFrame and the name (str) of the chosen column. Values must be integers or floats.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to 95.
mantissas (bool) – opts for also running the Mantissas test. Defaults to True.
sec_order – runs the Second Order tests, which are the Benford’s tests performed on the differences between the ordered sample (a value minus the one before it, and so on). If the original series is Benford-compliant, this new sequence should also follow Benford. The Second Order can also be called separately, through the method sec_order().
summation – creates the Summation DataFrames for the First, First Two, and First Three Digits. The summation tests can also be called separately, through the method summation().
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
verbose – gives some information about the data and the registries used and discarded for each test.
-
data
¶ the raw data provided for the analysis
-
chosen
¶ the column of the DataFrame to be analysed or the data itself
-
sign
¶ which number sign(s) to include in the analysis
- Type
str
-
confidence
¶ current confidence level
-
limit_N
¶ sample size to use in computations
- Type
int
-
verbose
¶ verbose or not
- Type
bool
-
base
¶ the Base, pre-processed object
-
tests
¶ keeps track of the tests the instance has
- Type
list of str
-
update_confidence
(new_conf, tests=None)[source]¶ Sets (a) new confidence level(s) for the Benford object, so as to be used to produce critical values for the tests.
- Parameters
new_conf – new confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics.
tests (list of str) – list of test names (strings) to have their confidence updated. If only one, provide a one-element list, like [‘F1D’]. Defaults to None, in which case it will use the instance’s tests list attribute.
- Raises
ValueError – if the tests argument is not a list or None.
-
property
all_confidences
¶ a dictionary with a confidence level for each computed test, when applicable.
- Type
dict
-
mantissas
()[source]¶ Adds a Mantissas object to the tests, with all its statistics and plotting capabilities.
-
sec_order
()[source]¶ Runs the Second Order tests, which are the Benford’s tests performed on the differences between the ordered sample (a value minus the one before it, and so on). If the original series is Benford-compliant, this new sequence should also follow Benford.
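The transformation behind the Second Order tests is small enough to sketch directly. The function name second_order_differences is hypothetical, and discarding zero differences (which have no leading digit) is an assumption.

```python
def second_order_differences(values):
    """Differences between consecutive entries of the sorted sample."""
    ordered = sorted(values)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    # Repeated values produce zero differences, which cannot be digit-tested
    return [d for d in diffs if d != 0]
```

The resulting sequence is then submitted to the same digit tests as the original data.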
-
class
benford.benford.
Source
(data, decimals=2, sign='all', sec_order=False, verbose=True, inform=None)[source]¶ Bases:
pandas.core.frame.DataFrame
Prepares the data for Analysis. pandas DataFrame subclass.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
sec_order – choice for the Second Order Test, which computes the differences between the ordered entries before running the Tests.
verbose (bool) – tells the number of registries that are being subjected to the analysis; defaults to True.
- Raises
ValueError – if the sign arg is not in [‘all’, ‘pos’, ‘neg’]
TypeError – if not receiving int or float as input.
-
verbose
¶ verbose or not
- Type
(bool)
-
mantissas
(report=True, show_plot=True, figsize=(15, 8), save_plot=None, save_plot_kwargs=None)[source]¶ Calculates the mantissas, their mean and variance, and compares them with the mean and variance of a Benford’s sequence.
- Parameters
report – prints the mantissas mean, variance, skewness and kurtosis for the sequence studied, along with reference values.
show_plot – plots the ordered mantissas and a line with the expected inclination. Defaults to True.
figsize – tuple that sets the figure dimensions.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
-
first_digits
(digs, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, simple=False, bhat_coeff=False, bhat_dist=False, kl_diverg=False, ret_df=False)[source]¶ Performs the Benford First Digits test with the series of numbers provided, and populates the mapping dict for future selection of the original series.
- Parameters
digs (int) – number of first digits to consider. Must be 1 (first digit), 2 (first two digits) or 3 (first three digits).
verbose (bool) – tells the number of registries that are being subjected to the analysis; defaults to True
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to None.
high_Z (str or int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
bhat_coeff (bool) – computes the Bhattacharyya Coefficient between the found and the expected (Benford) digits distribution; defaults to False.
bhat_dist (bool) – calculates the Bhattacharyya Distance between the found and the expected (Benford) digits distribution; defaults to False.
kl_diverg (bool) – calculates the Kullback-Leibler Divergence between the found and the expected (Benford) digits distribution; defaults to False.
show_plot (bool) – draws the test plot. Defaults to True.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
ret_df – returns the test DataFrame. Defaults to False. True if run by the test function.
- Returns
- DataFrame with the Expected and Found proportions, and the Z scores of the differences
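The three distribution measures enabled by bhat_coeff, bhat_dist and kl_diverg can be sketched from their textbook definitions. The function names are hypothetical, and the direction of the Kullback-Leibler divergence (found relative to expected) is an assumption.

```python
import math


def bhattacharyya_coefficient(found, expected):
    """Overlap between two discrete distributions; 1 means identical."""
    return sum(math.sqrt(f * e) for f, e in zip(found, expected))


def bhattacharyya_distance(found, expected):
    """Negative log of the coefficient; 0 means identical."""
    return -math.log(bhattacharyya_coefficient(found, expected))


def kl_divergence(found, expected):
    """Kullback-Leibler divergence of found relative to expected."""
    # Terms with a zero found proportion contribute nothing
    return sum(f * math.log(f / e) for f, e in zip(found, expected) if f > 0)
```

Both arguments are sequences of proportions in the same digit order, each summing to 1.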
-
second_digit
(confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, bhat_coeff=False, bhat_dist=False, kl_diverg=False, show_plot=True, save_plot=None, save_plot_kwargs=None, simple=False, ret_df=False)[source]¶ Performs the Benford Second Digit test with the series of numbers provided.
- Parameters
verbose (bool) – tells the number of registries that are being subjected to the analysis; defaults to True
MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to None.
high_Z (str or int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
bhat_coeff (bool) – computes the Bhattacharyya Coefficient between the found and the expected (Benford) digits distribution; defaults to False.
bhat_dist (bool) – calculates the Bhattacharyya Distance between the found and the expected (Benford) digits distribution; defaults to False.
kl_diverg (bool) – calculates the Kullback-Leibler Divergence between the found and the expected (Benford) digits distribution; defaults to False.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
ret_df – returns the test DataFrame. Defaults to False. True if run by the test function.
- Returns
- DataFrame with the Expected and Found proportions, and the Z scores of the differences
-
last_two_digits
(confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, bhat_coeff=False, bhat_dist=False, kl_diverg=False, show_plot=True, save_plot=None, save_plot_kwargs=None, simple=False, ret_df=False)[source]¶ Performs the Benford Last Two Digits test with the series of numbers provided.
- Parameters
verbose (bool) – tells the number of registries that are being subjected to the analysis; defaults to True
MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to None.
high_Z (str or int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
bhat_coeff (bool) – computes the Bhattacharyya Coefficient between the found and the expected (Benford) digits distribution; defaults to False.
bhat_dist (bool) – calculates the Bhattacharyya Distance between the found and the expected (Benford) digits distribution; defaults to False.
kl_diverg (bool) – calculates the Kullback-Leibler Divergence between the found and the expected (Benford) digits distribution; defaults to False.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
- Returns
- DataFrame with the Expected and Found proportions, and the Z scores of the differences
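A rough sketch of the Last Two Digits proportions follows. The function name is hypothetical; scaling by the number of decimals, rounding, and skipping entries with fewer than two digits are all assumptions about the preprocessing. The expected proportion of each ending 00 to 99 is approximately 1/100.

```python
from collections import Counter


def last_two_digits_proportions(values, decimals=2):
    """Found proportion of each two-digit ending, 0 to 99."""
    endings = []
    for value in values:
        scaled = int(round(abs(value) * 10 ** decimals))
        if scaled >= 10:  # at least two digits are needed
            endings.append(scaled % 100)
    if not endings:
        raise ValueError("no entries with at least two digits")
    counts = Counter(endings)
    return {d: counts.get(d, 0) / len(endings) for d in range(100)}
```

With decimals=2, the last two digits of an amount such as 123.45 are its cents, 45.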
-
summation
(digs=2, top=20, show_plot=True, save_plot=None, save_plot_kwargs=None, ret_df=False)[source]¶ Performs the Summation test. In a Benford series, the sums of the entries beginning with the same digits tend to be the same.
- Parameters
digs – tells the first digits to use. 1- first; 2- first two; 3- first three. Defaults to 2.
top – chooses how many top values to show. Defaults to 20.
show_plot – plots the results. Defaults to True.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
- Returns
- DataFrame with the Expected and Found proportions, and their absolute differences
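The idea of the Summation test can be sketched as follows: group the entries by their leading digits, sum each group, and compare each group's share of the total with an equal share. The function names are hypothetical, and treating an entry with fewer digits than digs as if padded with zeros is an assumption.

```python
def leading_digits(value, digs):
    """First `digs` significant digits of a non-zero number."""
    # Scientific notation exposes the significant digits, e.g. '1.23...e-02'
    mantissa = f"{abs(value):.15e}".split("e")[0]
    return int(mantissa.replace(".", "")[:digs])


def summation_test(values, digs=2):
    """Share of the total sum held by each leading-digit group."""
    low, high = 10 ** (digs - 1), 10 ** digs
    sums = {d: 0.0 for d in range(low, high)}
    for value in values:
        if value != 0:
            sums[leading_digits(value, digs)] += abs(value)
    total = sum(sums.values())
    expected = 1 / (high - low)  # equal share: 1/9, 1/90 or 1/900
    # Each entry: (found proportion, absolute difference from expected)
    return {d: (s / total, abs(s / total - expected)) for d, s in sums.items()}
```

Sorting the result by the absolute difference gives the groups that deviate most from the equal share.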
-
duplicates
(top_Rep=20, inform=None)[source]¶ Performs a duplicates test and maps the duplicates count in descending order.
- Parameters
verbose (bool) – tells how many duplicated entries were found and prints the top numbers according to the top_Rep argument. Defaults to True.
top_Rep – int or None. Chooses how many duplicated entries will be shown with the top repetitions. Defaults to 20. If None, returns all the ordered repetitions.
- Returns
- DataFrame with the duplicated records and their occurrence counts, in descending order (if verbose is False; if True, prints to terminal).
- Raises
ValueError – if the top_Rep arg is not int or None.
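The counting behind the duplicates test can be sketched with the standard library. The function name top_duplicates is hypothetical and the result is a plain list of (value, count) pairs rather than the DataFrame the method returns.

```python
from collections import Counter


def top_duplicates(values, top_rep=20):
    """Values occurring more than once, most repeated first."""
    if top_rep is not None and not isinstance(top_rep, int):
        raise ValueError("top_rep must be an int or None")
    counts = Counter(values)
    # most_common() already sorts by count in descending order
    repeated = [(v, c) for v, c in counts.most_common() if c > 1]
    return repeated if top_rep is None else repeated[:top_rep]
```

Passing None returns every repeated value, mirroring the behaviour described for top_Rep.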
-
class
benford.benford.
Roll_mad
(data, test, window, decimals=2, sign='all')[source]¶ Bases:
object
Applies the MAD to sequential subsets of the Series, returning another Series.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – tells which test to use. 1: First Digits; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
window – size of the subset to be used.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
-
test
¶ the test (F1D, SD, F2D…) used for the MAD calculation and critical values
-
show_plot
(figsize=(15, 8), save_plot=None, save_plot_kwargs=None)[source]¶ Shows the rolling MAD plot
- Parameters
figsize – the figure dimensions.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
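A rolling MAD over the first digit can be sketched as below. The function names are hypothetical, and sliding the window one record at a time is an assumption about how the sequential subsets are formed.

```python
import math

# Expected first-digit proportions under Benford's law
EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """First significant digit of a non-zero number."""
    return int(f"{abs(value):.15e}"[0])


def first_digit_mad(window_values):
    """Mean absolute deviation between found and expected proportions."""
    n = len(window_values)
    counts = {d: 0 for d in range(1, 10)}
    for value in window_values:
        counts[first_digit(value)] += 1
    return sum(abs(counts[d] / n - EXPECTED[d]) for d in range(1, 10)) / 9


def rolling_mad(values, window):
    """MAD of every run of `window` consecutive records."""
    return [
        first_digit_mad(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]
```

A series of length n yields n - window + 1 MAD values, which can then be plotted against the position of each window.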
-
class
benford.benford.
Roll_mse
(data, test, window, decimals=2, sign='all')[source]¶ Bases:
object
Applies the MSE to sequential subsets of the Series, returning another Series.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – tells which test to use. 1: First Digits; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
window – size of the subset to be used.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
-
show_plot
(figsize=(15, 8), save_plot=None, save_plot_kwargs=None)[source]¶ Shows the rolling MSE plot
- Parameters
figsize – the figure dimensions.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
-
benford.benford.
first_digits
(data, digs, decimals=2, sign='all', verbose=True, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]¶ Performs the Benford First Digits test on the series of numbers provided.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
digs (int) – number of first digits to consider. Must be 1 (first digit), 2 (first two digits) or 3 (first three digits).
verbose (bool) – tells the number of registries that are being subjected to the analysis and returns the analysis DataFrame sorted by the highest Z score down. Defaults to True.
MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
high_Z (str or int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
chi_square – calculates the chi_square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’s degrees of freedom. Defaults to False. Requires confidence != None.
KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with the Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when show_plot=True and save_plot is a string with the figure file path/name.
- Returns
- DataFrame with the Expected and Found proportions, and the Z scores of the differences if the confidence is not None.
-
benford.benford.
second_digit
(data, decimals=2, sign='all', verbose=True, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]¶ Performs the Benford Second Digit test on the series of numbers provided.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
verbose (bool) – tells the number of registries that are being subjected to the analysis and returns the analysis DataFrame sorted by the highest Z score down. Defaults to True.
MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
high_Z (str or int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
chi_square – calculates the chi_square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’s degrees of freedom. Defaults to False. Requires confidence != None.
KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with the Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
- Returns
- DataFrame with the Expected and Found proportions, and the Z scores of
the differences if the confidence is not None.
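As a rough illustration of what the Second Digits test compares the sample against, the sketch below derives the expected second-digit proportions directly from Benford's law, using only the standard library. The function name expected_second_digit is hypothetical and is not part of the benford package.

```python
import math


def expected_second_digit():
    """Benford probability of each second digit d (0 to 9)."""
    # The second digit is d whenever the first two digits are 10*k + d,
    # so sum the two-digit probabilities over every possible first digit k.
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    }


probs = expected_second_digit()
for digit, p in probs.items():
    print(digit, round(p, 4))
```

The proportions decrease slowly from digit 0 to digit 9, which is why the second-digit test is less sensitive than the first-digit tests.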
-
benford.benford.
last_two_digits
(data, decimals=2, sign='all', verbose=True, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]¶ Performs the Last Two Digits test on the series of numbers provided.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
verbose (bool) – tells the number of records that are being subjected to the analysis and returns the analysis DataFrame sorted by the highest Z score down. Defaults to True.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
high_Z (str, int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
chi_square – calculates the chi_square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’s degrees of freedom. Defaults to False. Requires confidence != None.
KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with the Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
- Returns
- DataFrame with the Expected and Found proportions, and the Z scores of
the differences if the confidence is not None.
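The role of the decimals argument in the Last Two Digits test can be pictured with a small standard-library sketch. It assumes the values are shifted by the given number of decimal places, turned into integers, and reduced modulo 100; the helper name last_two_digits_counts is hypothetical, and the package's own preprocessing may differ in detail.

```python
from collections import Counter


def last_two_digits_counts(values, decimals=2):
    """Count the last two digits of each value after shifting `decimals` places."""
    counts = Counter()
    for v in values:
        if v == 0:
            continue  # zeros are left out, as in the sign='all' description
        as_int = round(abs(v) * 10 ** decimals)  # e.g. 12.34 -> 1234 when decimals=2
        counts[as_int % 100] += 1
    return counts


counts = last_two_digits_counts([12.34, 7.34, 100.05, 0, -3.99])
print(counts)
```

Under the usual assumption for this test, each of the 100 possible endings (00 to 99) is expected with proportion 1/100.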
-
benford.benford.
mantissas
(data, report=True, show_plot=True, arc_test=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]¶ Extracts the mantissas of the records’ logarithms.
- Parameters
data – sequence to compute mantissas from: numpy 1D array, pandas Series or pandas DataFrame column.
report – prints the mantissas’ mean, variance, skewness and kurtosis for the sequence studied, along with reference values.
show_plot – plots the ordered mantissas and a line with the expected inclination. Defaults to True.
arc_test – draws the Arc Test plot. Defaults to True.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
- Returns
Series with the data mantissas.
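For readers unfamiliar with the term, the mantissa here is the fractional part of the base-10 logarithm of each record. The sketch below is a standard-library illustration and not the package's implementation; the helper name mantissa is hypothetical. It uses powers of 2, a sequence known to follow Benford's law closely, and compares the mean and variance with the values of a uniform distribution on [0, 1), namely 0.5 and 1/12.

```python
import math


def mantissa(value):
    """Fractional part of log10(|value|), always in [0, 1)."""
    log = math.log10(abs(value))
    return log - math.floor(log)


data = [2 ** k for k in range(1, 201)]
mantissas = [mantissa(x) for x in data]

mean = sum(mantissas) / len(mantissas)
variance = sum((m - mean) ** 2 for m in mantissas) / len(mantissas)
print("mean:", mean, "reference: 0.5")
print("variance:", variance, "reference:", 1 / 12)
```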
-
benford.benford.
summation
(data, digs=2, decimals=2, sign='all', top=20, verbose=True, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]¶ Performs the Summation test. In a Benford series, the sums of the entries beginning with the same digits tend to be the same. Works only with the First Digits (1, 2 or 3) test.
- Parameters
digs – tells the first digits to use: 1- first; 2- first two; 3- first three. Defaults to 2.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
top – chooses how many top values to show. Defaults to 20.
show_plot – plots the results. Defaults to True.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
- Returns
- DataFrame with the Summation test, whether sorted in descending order
(if verbose == True) or not.
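The idea behind the Summation test can be sketched in a few lines: group the records by their leading digits, add up the values in each group, and look at each group's share of the grand total. The helpers leading_digits and summation_shares are hypothetical names for illustration only, and padding short numbers with zeros is an assumption of this sketch.

```python
from collections import defaultdict
from decimal import Decimal


def leading_digits(value, digs):
    """First `digs` significant digits of a non-zero number, as an int."""
    # Decimal does not store leading zeros, so 0.047 yields the digits (4, 7).
    digits = Decimal(str(abs(value))).as_tuple().digits
    padded = list(digits) + [0] * digs  # pad short numbers, e.g. 5 -> 50
    return int("".join(str(d) for d in padded[:digs]))


def summation_shares(values, digs=2):
    """Share of the total absolute sum held by each leading-digit group."""
    sums = defaultdict(float)
    for v in values:
        if v != 0:
            sums[leading_digits(v, digs)] += abs(v)
    total = sum(sums.values())
    return {d: s / total for d, s in sorted(sums.items())}


shares = summation_shares([120, 125, 1300, 47.5, 0.047], digs=2)
print(shares)
```

With digs=2 there are 90 possible groups (10 to 99), so equal sums would correspond to a share of 1/90 for each group.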
-
benford.benford.
mad
(data, test, decimals=2, sign='all', verbose=False)[source]¶ Calculates the Mean Absolute Deviation of the Series
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – informs which base test to use for the mad.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
- Returns
the Mean Absolute Deviation of the Series
- Return type
float
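A minimal sketch of the Mean Absolute Deviation for the First Digit test follows. It uses only the standard library, and the names first_digit and first_digit_mad are hypothetical; the package computes the same kind of quantity on its own prepared DataFrame.

```python
import math
from collections import Counter
from decimal import Decimal


def first_digit(value):
    # Decimal does not store leading zeros, so index 0 is the leading digit.
    return Decimal(str(abs(value))).as_tuple().digits[0]


def first_digit_mad(values):
    """Mean absolute gap between found and Benford first-digit proportions."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(
        abs(counts[d] / n - math.log10(1 + 1 / d)) for d in range(1, 10)
    ) / 9


benford_like = [2 ** k for k in range(1, 201)]  # powers of 2 follow Benford closely
flat = list(range(1, 1000))                     # every first digit equally frequent

print(first_digit_mad(benford_like))
print(first_digit_mad(flat))
```

The second value is noticeably larger, since a flat first-digit distribution departs strongly from the Benford proportions.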
-
benford.benford.
mse
(data, test, decimals=2, sign='all', verbose=False)[source]¶ Calculates the Mean Squared Error of the Series
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – informs which base test to use for the mse.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
- Returns
the Mean Squared Error of the Series
- Return type
float
-
benford.benford.
bhattacharyya_distance
(data, test, decimals, sign='all', verbose=False)[source]¶ Computes the Bhattacharyya Distance between the Found and the Expected (Benford) digits distributions, according to the test chosen (First, Second, First Two…)
- Parameters
data (ndarray, Series) – sequence to be evaluated, with values being integers or floats.
test (int, str) – informs which base test to be used.
decimals (int) – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign (str, optional) – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to “all”.
- Returns
the Bhattacharyya Distance between the distributions
- Return type
float
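The Bhattacharyya distance between two discrete distributions is the negative natural logarithm of the sum of the square roots of the products of matching proportions. The sketch below applies that textbook definition to already computed proportions; the function name bhattacharyya_from_proportions is hypothetical, and it takes proportions rather than raw data, unlike the package function.

```python
import math


def bhattacharyya_from_proportions(found, expected):
    """-ln of the Bhattacharyya coefficient of two discrete distributions."""
    coefficient = sum(math.sqrt(f * e) for f, e in zip(found, expected))
    return -math.log(coefficient)


expected = [math.log10(1 + 1 / d) for d in range(1, 10)]  # Benford first digits
uniform = [1 / 9] * 9                                      # flat first digits

print(bhattacharyya_from_proportions(expected, expected))  # close to 0
print(bhattacharyya_from_proportions(uniform, expected))   # strictly positive
```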
-
benford.benford.
kullback_leibler_divergence
(data, test, decimals, sign='all', verbose=False)[source]¶ Computes the Kullback-Leibler Divergence between the Found and the Expected (Benford) digits distributions, according to the test chosen (First, Second, First Two…).
- Parameters
data (ndarray, Series) – sequence to be evaluated, with values being integers or floats.
test (int, str) – informs which base test to be used.
decimals (int) – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign (str, optional) – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to “all”.
- Returns
the Kullback-Leibler Divergence between the distributions
- Return type
float
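A sketch of the Kullback-Leibler divergence on precomputed proportions is given below. Measuring the found distribution relative to the expected one (rather than the reverse) is an assumption of this sketch, as is the hypothetical name kl_from_proportions; the package function works from raw data instead.

```python
import math


def kl_from_proportions(found, expected):
    """Sum of f * ln(f / e) over the digit groups."""
    # Groups with a zero found proportion contribute nothing (0 * ln 0 is taken as 0).
    return sum(f * math.log(f / e) for f, e in zip(found, expected) if f > 0)


expected = [math.log10(1 + 1 / d) for d in range(1, 10)]  # Benford first digits
uniform = [1 / 9] * 9                                      # flat first digits

print(kl_from_proportions(expected, expected))
print(kl_from_proportions(uniform, expected))
```

The divergence is zero only when the two distributions coincide and grows as the found proportions drift away from the Benford ones.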
-
benford.benford.
mad_summ
(data, test, decimals=2, sign='all', verbose=False)[source]¶ Calculate the Mean Absolute Deviation of the Summation Test
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – informs which base test to use for the summation mad.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
- Returns
the Mean Absolute Deviation of the Summation Test
- Return type
float
-
benford.benford.
rolling_mad
(data, test, window, decimals=2, sign='all', show_plot=False, save_plot=None, save_plot_kwargs=None)[source]¶ Applies the MAD to sequential subsets of the records.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – tells which test to use. 1: First Digits; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
window – size of the subset to be used.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
- Returns
Series with sequentially computed MADs.
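The rolling computation can be pictured as a window sliding over the records, with the MAD recomputed for each position. The sketch below handles only the First Digit test and assumes the window advances one record at a time, which may not match the package; all function names are hypothetical.

```python
import math
from collections import Counter
from decimal import Decimal

EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    # Decimal does not store leading zeros, so index 0 is the leading digit.
    return Decimal(str(abs(value))).as_tuple().digits[0]


def first_digit_mad(values):
    """MAD of the first digits; assumes `values` holds no zeros."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return sum(abs(counts[d] / n - EXPECTED[d]) for d in range(1, 10)) / 9


def rolling_first_digit_mad(values, window):
    """MAD of each run of `window` consecutive records."""
    return [
        first_digit_mad(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


# 100 Benford-like records followed by 50 identical, suspicious ones.
data = [2 ** k for k in range(1, 101)] + [5] * 50
series = rolling_first_digit_mad(data, window=50)
print(series[0], series[-1])
```

The MAD rises as the window moves into the block of repeated values, which is the kind of local anomaly a rolling statistic is meant to expose.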
-
benford.benford.
rolling_mse
(data, test, window, decimals=2, sign='all', show_plot=False, save_plot=None, save_plot_kwargs=None)[source]¶ Applies the MSE to sequential subsets of the records.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – tells which test to use. 1: First Digits; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
window – size of the subset to be used.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
- Returns
Series with sequentially computed MSEs.
-
benford.benford.
duplicates
(data, top_Rep=20, verbose=True, inform=None)[source]¶ Performs a duplicates test and maps the duplicates count in descending order.
- Parameters
data – sequence to take the duplicates from. pandas Series or numpy ndarray.
verbose (bool) – tells how many duplicated entries were found and prints the top numbers according to the top_Rep argument. Defaults to True.
top_Rep – chooses how many duplicated entries will be shown with the top repetitions. int or None. Defaults to 20. If None, returns all the ordered repetitions.
- Returns
DataFrame with the duplicated records and their respective counts
- Raises
ValueError – if the top_Rep arg is not int or None.
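The behaviour described above (count repeated values, order them by count, optionally keep only the top entries) can be sketched with collections.Counter. The function name count_duplicates and its list-of-tuples return value are choices made for this illustration; the package function returns a DataFrame.

```python
from collections import Counter


def count_duplicates(values, top_rep=20):
    """Values that occur more than once, most frequent first."""
    if top_rep is not None and not isinstance(top_rep, int):
        raise ValueError("top_rep must be an int or None")
    counts = Counter(values)
    repeated = [(value, n) for value, n in counts.most_common() if n > 1]
    return repeated if top_rep is None else repeated[:top_rep]


sample = [1.5, 2, 1.5, 3, 2, 1.5]
print(count_duplicates(sample))
print(count_duplicates(sample, top_rep=1))
```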
-
benford.benford.
second_order
(data, test, decimals=2, sign='all', verbose=True, MAD=False, confidence=None, high_Z='pos', limit_N=None, MSE=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]¶ Performs the chosen test after subtracting the ordered sequence by itself. Hence Second Order.
- Parameters
data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
test – the test to be performed - 1 or ‘F1D’: First Digit; 2 or ‘F2D’: First Two Digits; 3 or ‘F3D’: First three Digits; 22 or ‘SD’: Second Digits; -2 or ‘L2D’: Last Two Digits.
decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, but will lose performance.
sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
verbose (bool) – tells the number of records that are being subjected to the analysis and returns the analysis DataFrame sorted by the highest Z score down. Defaults to True.
MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
high_Z (str, int) – chooses which Z scores to be used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which will highlight only values higher than the expected frequencies; ‘all’ will highlight both extremes (positive and negative); and an integer, which will use the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
chi_square – calculates the chi_square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’s degrees of freedom. Defaults to False. Requires confidence != None.
KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with the Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
show_plot (bool) – draws the test plot.
save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
- Returns
- DataFrame of the test chosen, but applied on Second Order pre-
processed data.
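One reading of "subtracting the ordered sequence by itself" is the usual Second Order preprocessing: sort the records, take the differences between consecutive values, and drop the zero differences before running the chosen digit test. The sketch below follows that interpretation, which is an assumption, and the name second_order_differences is hypothetical.

```python
def second_order_differences(values):
    """Non-zero gaps between consecutive values of the sorted sequence."""
    ordered = sorted(values)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return [gap for gap in gaps if gap != 0]  # repeated values give zero gaps


diffs = second_order_differences([10, 3, 7, 7, 1])
print(diffs)
```

The resulting differences would then be passed to the digit test selected through the test argument.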
benford.expected module¶
-
class
benford.expected.
First
(digs, plot=True, save_plot=None, save_plot_kwargs=None)[source]¶ Bases:
pandas.core.frame.DataFrame
Holds the expected probabilities of the First, First Two, or First Three digits according to Benford’s distribution.
- Parameters
digs – 1, 2 or 3 - tells which of the first digits to consider: 1 for the First Digit, 2 for the First Two Digits and 3 for the First Three Digits.
plot – option to plot a bar chart of the Expected proportions. Defaults to True.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
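The expected proportions held by this class follow from Benford's law, under which a leading digit group n has probability log10(1 + 1/n). The sketch below builds them as a plain dict rather than a DataFrame; the function name expected_first is hypothetical.

```python
import math


def expected_first(digs):
    """Benford probability of every possible group of `digs` leading digits."""
    start, stop = 10 ** (digs - 1), 10 ** digs  # e.g. 10..99 when digs=2
    return {n: math.log10(1 + 1 / n) for n in range(start, stop)}


first = expected_first(1)
first_two = expected_first(2)
print(round(first[1], 4), round(first[9], 4))
print(len(first_two))
```

Each table sums to 1, with 9, 90 and 900 entries for digs equal to 1, 2 and 3.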
-
class
benford.expected.
Second
(plot=True, save_plot=None, save_plot_kwargs=None)[source]¶ Bases:
pandas.core.frame.DataFrame
Holds the expected probabilities of the Second Digits according to Benford’s distribution.
- Parameters
plot – option to plot a bar chart of the Expected proportions. Defaults to True.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
-
class
benford.expected.
LastTwo
(num=False, plot=True, save_plot=None, save_plot_kwargs=None)[source]¶ Bases:
pandas.core.frame.DataFrame
Holds the expected probabilities of the Last Two Digits according to Benford’s distribution.
- Parameters
plot – option to plot a bar chart of the Expected proportions. Defaults to True.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension. Only available when plot=True.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
benford.stats module¶
-
benford.stats.
Z_score
(frame, N)[source]¶ Computes the Z statistics for the proportions studied
- Parameters
frame – DataFrame with the expected proportions and the already calculated Absolute Differences between the found and expected proportions
N – sample size
- Returns
Series of computed Z scores
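A commonly used form of the Z statistic for a single digit proportion applies a continuity correction of 1/(2N) to the absolute difference when that correction is smaller than the difference. Whether benford.stats.Z_score uses exactly this form is an assumption; the scalar function z_score below is a hypothetical illustration, whereas the package works on whole DataFrame columns.

```python
import math


def z_score(found, expected, n):
    """Z statistic of one found proportion against its expected proportion."""
    abs_diff = abs(found - expected)
    correction = 1 / (2 * n)
    if correction < abs_diff:  # apply the continuity correction only when it fits
        abs_diff -= correction
    return abs_diff / math.sqrt(expected * (1 - expected) / n)


# Digit 1 found in 35% of 1000 records, against the Benford 30.103%.
z = z_score(0.35, 0.30103, 1000)
print(z)
```

A value above 1.96 would be flagged at a 95% confidence level.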
-
benford.stats.
chi_sq
(frame, ddf, confidence, verbose=True)[source]¶ Computes the chi-square statistic of the found distributions and compares it with the critical chi-square of such a sample, according to the confidence level chosen and the degrees of freedom - len(sample) - 1.
- Parameters
frame – DataFrame with Found, Expected and their difference columns.
ddf – Degrees of freedom to consider.
confidence – Confidence level to look up critical value.
verbose – prints the chi-square result and compares to the critical chi-square for the sample. Defaults to True.
- Returns
- The computed Chi square statistic and the critical chi square
(according to the degrees of freedom and confidence level), for comparison. None if confidence is None.
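When working with proportions, the chi-square statistic is the sample size times the sum of squared differences divided by the expected proportions. The sketch below is a standard-library illustration with a hypothetical function name; the critical value 15.507 is the standard tabulated chi-square value for 8 degrees of freedom at 95% confidence, hard-coded here instead of being looked up as the package does.

```python
import math

CRITICAL_95_DF8 = 15.507  # tabulated chi-square critical value, 8 d.o.f., 95%


def chi_square_from_proportions(found, expected, n):
    """Chi-square statistic computed from proportions and the sample size n."""
    return n * sum((f - e) ** 2 / e for f, e in zip(found, expected))


expected = [math.log10(1 + 1 / d) for d in range(1, 10)]  # Benford first digits
uniform = [1 / 9] * 9                                      # flat first digits

stat = chi_square_from_proportions(uniform, expected, n=1000)
print(stat, "critical:", CRITICAL_95_DF8)
```

The First Digit test has nine groups, hence the 8 degrees of freedom used in this example.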
-
benford.stats.
chi_sq_2
(frame)[source]¶ Computes the chi-square statistic of the found distributions
- Parameters
frame – DataFrame with Found, Expected and their difference columns.
- Returns
The computed Chi square statistic
-
benford.stats.
kolmogorov_smirnov
(frame, confidence, N, verbose=True)[source]¶ Computes the Kolmogorov-Smirnov test of the found distributions and compares it with the critical Kolmogorov-Smirnov value of such a sample, according to the confidence level chosen.
- Parameters
frame – DataFrame with Found and Expected distributions.
confidence – Confidence level to look up critical value.
N – Sample size
verbose – prints the KS result and the critical value for the sample. Defaults to True.
- Returns
- The Supremum, which is the greatest absolute difference between the
Found and the expected proportions, and the Kolmogorov-Smirnov critical value according to the confidence level, for comparison
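A sketch of the Kolmogorov-Smirnov statistic on digit proportions follows, taking the largest absolute gap between the two cumulative distributions. The critical value uses the common large-sample approximation 1.36 / sqrt(N) for 95% confidence; that the package uses the same constant is an assumption, and both function names are hypothetical.

```python
import math
from itertools import accumulate


def ks_statistic(found, expected):
    """Largest absolute gap between the cumulative found and expected proportions."""
    return max(
        abs(f - e) for f, e in zip(accumulate(found), accumulate(expected))
    )


def ks_critical_95(n):
    """Large-sample approximation of the 95% critical value (assumed constant 1.36)."""
    return 1.36 / math.sqrt(n)


expected = [math.log10(1 + 1 / d) for d in range(1, 10)]  # Benford first digits
uniform = [1 / 9] * 9                                      # flat first digits

ks = ks_statistic(uniform, expected)
print(ks, "critical:", ks_critical_95(1000))
```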
-
benford.stats.
kolmogorov_smirnov_2
(frame)[source]¶ Computes the Kolmogorov-Smirnov test of the found distributions
- Parameters
frame – DataFrame with Found and Expected distributions.
- Returns
- The Supremum, which is the greatest absolute difference between the
Found and the expected proportions
-
benford.stats.
mad
(frame, test, verbose=True)[source]¶ Computes the Mean Absolute Deviation (MAD) between the found and the expected proportions.
- Parameters
frame – DataFrame with the Absolute Deviations already calculated.
test – Test to compute the MAD from (F1D, SD, F2D…)
verbose – prints the MAD result and compares to limit values of conformity. Defaults to True.
- Returns
- The Mean of the Absolute Deviations between the found and expected
proportions.
-
benford.stats.
mse
(frame, verbose=True)[source]¶ Computes the test’s Mean Square Error
- Parameters
frame – DataFrame with the already computed Absolute Deviations between the found and expected proportions
verbose – Prints the MSE. Defaults to True.
- Returns
Mean of the squared differences between the found and the expected proportions.
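The returned quantity can be illustrated on plain lists of proportions. The function name mean_square_error is hypothetical, and any scaling the package may apply to the result is not reproduced here.

```python
def mean_square_error(found, expected):
    """Mean of the squared differences between found and expected proportions."""
    return sum((f - e) ** 2 for f, e in zip(found, expected)) / len(expected)


# Two groups only, to keep the arithmetic easy to follow:
# ((0.5 - 0.25)**2 + (0.5 - 0.75)**2) / 2 = 0.0625
mse_value = mean_square_error([0.5, 0.5], [0.25, 0.75])
print(mse_value)
```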
benford.viz module¶
-
benford.viz.
plot_expected
(df, digs, save_plot=None, save_plot_kwargs=None)[source]¶ Plots the Expected Benford Distributions
- Parameters
df – DataFrame with the Expected Proportions
digs – Test’s digit
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html
-
benford.viz.
plot_digs
(df, x, y_Exp, y_Found, N, figsize, conf_Z, text_x=False, save_plot=None, save_plot_kwargs=None)[source]¶ Plots the digits tests results
- Parameters
df – DataFrame with the data to be plotted
x – sequence to be used in the x axis
y_Exp – sequence of the expected proportions to be used in the y axis (line)
y_Found – sequence of the found proportions to be used in the y axis (bars)
N – length of sequence, to be used when plotting the confidence levels
figsize – tuple to state the size of the plot figure
conf_Z – Confidence level
save_pic – file path to save figure
text_x – Forces to show all x ticks labels. Defaults to False.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html
-
benford.viz.
plot_sum
(df, figsize, li, text_x=False, save_plot=None, save_plot_kwargs=None)[source]¶ Plots the summation test results
- Parameters
df – DataFrame with the data to be plotted
figsize – sets the dimensions of the plot figure
li – value with which to draw the horizontal line
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html
-
benford.viz.
plot_ordered_mantissas
(col, figsize=(12, 12), save_plot=None, save_plot_kwargs=None)[source]¶ Plots the ordered mantissas and compares them to the expected, straight line that should be formed in a Benford-compliant set.
- Parameters
col (Series) – column of mantissas to plot.
figsize (tuple) – sets the dimensions of the plot figure.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html
-
benford.viz.
plot_mantissa_arc_test
(df, gravity_center, grid=True, figsize=12, save_plot=None, save_plot_kwargs=None)[source]¶ Draws the Mantissa Arc Test after computing X and Y circular coordinates for every mantissa and the center of gravity for the set
- Parameters
df (DataFrame) – pandas DataFrame with the mantissas and the X and Y coordinates.
gravity_center (tuple) – coordinates for plotting the gravity center
grid (bool) – show grid. Defaults to True.
figsize (int) – figure dimensions. No need to be a tuple, since the figure is a square.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html
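The circular coordinates and the gravity center mentioned above can be sketched as follows: each mantissa m is mapped to the point (cos 2πm, sin 2πm) on the unit circle, and the gravity center is the mean of those points. This is a standard-library illustration with a hypothetical function name; the package computes these values before calling the plotting function.

```python
import math


def arc_coordinates(mantissas):
    """Unit-circle coordinates of each mantissa and their gravity center."""
    xs = [math.cos(2 * math.pi * m) for m in mantissas]
    ys = [math.sin(2 * math.pi * m) for m in mantissas]
    center = (sum(xs) / len(xs), sum(ys) / len(ys))
    return xs, ys, center


# Evenly spread mantissas: the gravity center sits at the origin.
_, _, even_center = arc_coordinates([i / 100 for i in range(100)])
# All mantissas equal to 0.25: every point, and the center, sits at (0, 1).
_, _, skewed_center = arc_coordinates([0.25] * 10)
print(even_center, skewed_center)
```

The farther the gravity center lies from the origin, the less uniformly the mantissas are spread around the circle.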
-
benford.viz.
plot_roll_mse
(roll_series, figsize, save_plot=None, save_plot_kwargs=None)[source]¶ Shows the rolling MSE plot
- Parameters
roll_series – pd.Series resulting from rolling mse.
figsize – the figure dimensions.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html
-
benford.viz.
plot_roll_mad
(roll_mad, figsize, save_plot=None, save_plot_kwargs=None)[source]¶ Shows the rolling MAD plot
- Parameters
roll_mad – pd.Series resulting from rolling mad.
figsize – the figure dimensions.
save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred by the file name extension.
save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html