utils package

Support module

This module gathers some of the functions used by the main algorithms of this project, to keep the main code as clean as possible for analysis. The API of each of these functions is presented below.

utils.support.balanceDataSet(phi, y)

This function receives a regression model whose output is unbalanced, in the format

\[y(k) = f(\phi(k), \Theta )\]

and returns a balanced dataset with randomized samples, using the True label of the output as the reference. To use the False label as the reference instead, pass the inverted version of the output.

A balanced dataset contains 50% true targets and 50% false targets. This is useful for removing the bias that an unbalanced output introduces when the models learn the upper/lower cut.

Parameters:
  • phi (numpy.ndarray) – The regressor matrix.
  • y (numpy.ndarray) – The targets vector.
Returns:

The new regression model => (phi, target)

Return type:

tuple
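
The balancing described above can be sketched as random undersampling. This is only an illustration, not the project's implementation: the name balance_dataset_sketch and the seed argument are hypothetical, and it assumes that True is the minority class, so that every True sample is kept and an equal number of False samples is drawn at random.

```python
import numpy as np

def balance_dataset_sketch(phi, y, seed=None):
    """Return a 50/50 dataset, using the True label as the reference."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).astype(bool).ravel()
    true_idx = np.flatnonzero(y)
    false_idx = np.flatnonzero(~y)
    # Assumption: True is the minority class. Draw as many False samples
    # as there are True samples (raises ValueError if there are too few).
    picked_false = rng.choice(false_idx, size=len(true_idx), replace=False)
    # Shuffle so that the two classes are interleaved at random.
    keep = rng.permutation(np.concatenate([true_idx, picked_false]))
    return np.asarray(phi)[keep], y[keep]

phi = np.arange(20).reshape(10, 2)           # 10 samples, 2 regressors
y = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])  # 3 True, 7 False
phi_b, y_b = balance_dataset_sketch(phi, y, seed=0)
```

To balance against the False label under the same assumption, one would pass the negated targets, for example `~y.astype(bool)`.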

utils.support.dropNonInformative(dataset=None)

This function searches for inconsistency and stationarity inside the dataset and removes that information from it. It returns a dataset without the features that do not provide any useful information… It also removes some

Note

Notice that the dataset description has only 25 fields, but the table has 29 features… Some of those probably do not carry any information at all.

Parameters: dataset (pandas.DataFrame) – The dataset table as a DataFrame
Returns: The dataset with all non-informative data dropped
Return type: pandas.DataFrame

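
One possible reading of "stationarity" is a column that holds a single value in every row. The sketch below drops such columns. It is an assumption about the intent, not the project's code: the name drop_constant_columns and the sample column names are hypothetical, and the inconsistency check mentioned above is not covered.

```python
import pandas as pd

def drop_constant_columns(dataset):
    """Drop every column that holds one single distinct value."""
    # nunique(dropna=False) counts NaN as a value of its own, so a column
    # that is entirely NaN is also treated as constant.
    keep = [c for c in dataset.columns if dataset[c].nunique(dropna=False) > 1]
    return dataset[keep].copy()

table = pd.DataFrame({
    "Income": [100, 250, 175],   # varies: kept
    "CostContact": [3, 3, 3],    # constant: dropped
})
cleaned = drop_constant_columns(table)
```
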
utils.support.encodeDataSet(dataset=None)

This function encodes the dataset into numeric-only fields for machine learning purposes. It returns the numerical dataset together with the encoders responsible for the transformation.

Parameters: dataset (pandas.DataFrame) – The dataset table as a DataFrame
Returns: The dataset with all categorical fields numerically encoded, together with a dictionary holding the encoder of each field
Return type: pandas.DataFrame, dict

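
A minimal sketch of this kind of encoding, assuming one sklearn LabelEncoder per non-numeric column. The source does not say which encoder the project uses, so the encoder choice, the name encode_dataset_sketch, and the sample columns are all assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder

def encode_dataset_sketch(dataset):
    """Encode non-numeric columns as integers; return (frame, encoders)."""
    encoded = dataset.copy()
    encoders = {}
    for col in encoded.columns:
        if is_numeric_dtype(encoded[col]):
            continue  # numeric fields are left untouched
        enc = LabelEncoder()
        # LabelEncoder assigns integers following the sorted class order.
        encoded[col] = enc.fit_transform(encoded[col].astype(str))
        encoders[col] = enc  # kept so the encoding can be inverted later
    return encoded, encoders

table = pd.DataFrame({"Education": ["PhD", "Master", "PhD"],
                      "Income": [100, 250, 175]})
numeric, encoders = encode_dataset_sketch(table)
```

Keeping the encoders in a dictionary makes it possible to recover the original labels with `encoders["Education"].inverse_transform(...)`.
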
utils.support.replaceFields(dataset=None)

This function replaces some data features with ones that are more suitable for the purpose of the analysis, such as:

  • Birth Year => Age in years (int)
  • Dt Customer => Persistence in months (int)
Parameters: dataset (pandas.DataFrame) – The dataset table as a DataFrame
Returns: The dataset with some fields preprocessed
Return type: pandas.DataFrame

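
The two replacements listed above could look like the sketch below. The column names "Birth Year" and "Dt Customer" are taken literally from the list, while the output names "Age" and "Persistence", the function name, and the reference date argument are hypothetical; the source does not state which reference date the project uses.

```python
import pandas as pd

def replace_fields_sketch(dataset, reference="2020-01-01"):
    """Replace birth year by age and enrollment date by months enrolled."""
    ref = pd.Timestamp(reference)  # assumption: a fixed reference date
    out = dataset.copy()
    out["Age"] = (ref.year - out["Birth Year"]).astype(int)
    enrolled = pd.to_datetime(out["Dt Customer"])
    # Whole calendar months between enrollment and the reference date.
    months = (ref.year - enrolled.dt.year) * 12 + (ref.month - enrolled.dt.month)
    out["Persistence"] = months.astype(int)
    return out.drop(columns=["Birth Year", "Dt Customer"])

table = pd.DataFrame({"Birth Year": [1980, 1995],
                      "Dt Customer": ["2019-01-15", "2018-06-30"]})
prepared = replace_fields_sketch(table, reference="2020-01-01")
```
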
utils.support.svmCostFunction(p, yt, xt, yv, xv)

The cost function builds a model with the provided set of parameters, fits it on the training data, and tests the result on the testing dataset. It then returns a performance indicator that the optimization algorithm uses as the reference to minimize.

Parameters:
  • p (list) – The set of candidate hyperparameters.
  • yt (numpy.ndarray) – The train targets.
  • xt (numpy.ndarray) – The train features.
  • yv (numpy.ndarray) – The test targets.
  • xv (numpy.ndarray) – The test features.
Returns:

The sum of the false positive indicators from the confusion matrix.

Return type:

float
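
A cost function with this signature might be sketched as follows. Several points are assumptions: the source does not say which hyperparameters p holds (here p = [C, gamma] for an RBF kernel), the name svm_cost_sketch is hypothetical, and the returned indicator is read as the false positive count of the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

def svm_cost_sketch(p, yt, xt, yv, xv):
    """Fit an SVC with the candidate p and return the false positives."""
    C, gamma = p  # assumption: p = [C, gamma]
    model = SVC(C=C, gamma=gamma)
    model.fit(xt, yt)
    predicted = model.predict(xv)
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in this order.
    tn, fp, fn, tp = confusion_matrix(yv, predicted, labels=[0, 1]).ravel()
    return float(fp)

# Two well separated clusters as toy training and testing data.
xt = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
yt = np.array([0, 0, 0, 1, 1, 1])
xv = np.array([[0.5, 0.5], [5.5, 5.5]])
yv = np.array([0, 1])
cost = svm_cost_sketch([1.0, 0.1], yt, xt, yv, xv)
```

xgbCostFunction shares the same signature, so the same structure would apply with an XGBoost classifier in place of the SVC.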

utils.support.svmHyperGridSearch(bounds, data, iters=1000)

This function runs the annealing stochastic optimization algorithm on svmCostFunction to find the best set of hyperparameters for the SVC classifier from sklearn, that is, the set that minimizes svmCostFunction.

Parameters:
  • bounds (list) – The upper and lower bounds of each parameter.
  • data (list) – The list of datasets that will be used in the svmCostFunction.
  • iters (int) – The maximum number of iterations of the annealing search.
Returns:

The resulting parameters and the optimization summary, respectively.

Return type:

tuple
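
The source does not name the annealing implementation. As a sketch only, the search could be driven by scipy.optimize.dual_annealing, which is an assumption here. The names hyper_search_sketch and toy_cost are hypothetical, the order of the data list (yt, xt, yv, xv) is assumed from the cost function signature, and a toy quadratic cost stands in for svmCostFunction so that the example runs quickly.

```python
import numpy as np
from scipy.optimize import dual_annealing

def toy_cost(p, yt, xt, yv, xv):
    """Stand-in cost with the same signature; minimum at p = [1.5, -0.5]."""
    target = np.array([1.5, -0.5])
    return float(np.sum((np.asarray(p) - target) ** 2))

def hyper_search_sketch(cost, bounds, data, iters=1000):
    """Minimize cost(p, *data) within bounds; return (parameters, summary)."""
    # bounds is a list of (lower, upper) pairs, one per hyperparameter.
    result = dual_annealing(cost, bounds=bounds, args=tuple(data), maxiter=iters)
    return result.x, result

# Placeholder datasets in the assumed order: yt, xt, yv, xv.
data = [np.zeros(4), np.zeros((4, 2)), np.zeros(2), np.zeros((2, 2))]
best, summary = hyper_search_sketch(toy_cost, [(-5.0, 5.0), (-5.0, 5.0)],
                                    data, iters=50)
```

Returning both the parameter vector and the full result object mirrors the documented return value, a tuple with the resulting parameters and the optimization summary.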

utils.support.xgbCostFunction(p, yt, xt, yv, xv)

The cost function builds a model with the provided set of parameters, fits it on the training data, and tests the result on the testing dataset. It then returns a performance indicator that the optimization algorithm uses as the reference to minimize.

Parameters:
  • p (list) – The set of candidate hyperparameters.
  • yt (numpy.ndarray) – The train targets.
  • xt (numpy.ndarray) – The train features.
  • yv (numpy.ndarray) – The test targets.
  • xv (numpy.ndarray) – The test features.
Returns:

The sum of the false positive indicators from the confusion matrix.

Return type:

float

utils.support.xgbHyperGridSearch(bounds, data, iters=1000)

This function runs the annealing stochastic optimization algorithm on xgbCostFunction to find the best set of hyperparameters for the XGBoost classifier, that is, the set that minimizes xgbCostFunction.

Parameters:
  • bounds (list) – The upper and lower bounds of each parameter.
  • data (list) – The list of datasets that will be used in the xgbCostFunction.
  • iters (int) – The maximum number of iterations of the annealing search.
Returns:

The resulting parameters and the optimization summary, respectively.

Return type:

tuple

Module contents

copyright: 2020 Marcelo Lima
license: BSD-3-Clause