causallift package

CausalLift

Subpackages

Submodules

causallift.causal_lift module

class causallift.causal_lift.CausalLift(train_df=None, test_df=None, cols_features=None, col_treatment='Treatment', col_outcome='Outcome', col_propensity='Propensity', col_proba_if_treated='Proba_if_Treated', col_proba_if_untreated='Proba_if_Untreated', col_cate='CATE', col_recommendation='Recommendation', col_weight='Weight', min_propensity=0.01, max_propensity=0.99, verbose=2, uplift_model_params={'cv': 3, 'estimator': 'xgboost.XGBClassifier', 'n_jobs': -1, 'param_grid': {'base_score': [0.5], 'booster': ['gbtree'], 'colsample_bylevel': [1], 'colsample_bytree': [1], 'gamma': [0], 'learning_rate': [0.1], 'max_delta_step': [0], 'max_depth': [3], 'min_child_weight': [1], 'missing': [None], 'n_estimators': [100], 'n_jobs': [-1], 'nthread': [None], 'objective': ['binary:logistic'], 'random_state': [0], 'reg_alpha': [0], 'reg_lambda': [1], 'scale_pos_weight': [1], 'subsample': [1], 'verbose': [0]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, enable_ipw=True, enable_weighting=False, propensity_model_params={'cv': 3, 'estimator': 'sklearn.linear_model.LogisticRegression', 'n_jobs': -1, 'param_grid': {'C': [0.1, 1, 10], 'class_weight': [None], 'dual': [False], 'fit_intercept': [True], 'intercept_scaling': [1], 'max_iter': [100], 'multi_class': ['ovr'], 'n_jobs': [1], 'penalty': ['l1', 'l2'], 'random_state': [0], 'solver': ['liblinear'], 'tol': [0.0001], 'warm_start': [False]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, index_name='index', partition_name='partition', runner='SequentialRunner', conditionally_skip=False, df_print=<function display>, dataset_catalog={'df_03': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'estimated_effect_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'propensity_model': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>, 'treated__sim_eval_df': 
<kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'untreated__sim_eval_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'uplift_models_dict': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>}, logging_config={'disable_existing_loggers': False, 'formatters': {'json_formatter': {'class': 'pythonjsonlogger.jsonlogger.JsonFormatter', 'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s'}, 'simple': {'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'level': 'INFO', 'stream': 'ext://sys.stdout'}, 'error_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './errors.log', 'formatter': 'simple', 'level': 'ERROR', 'maxBytes': 10485760}, 'info_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './info.log', 'formatter': 'simple', 'level': 'INFO', 'maxBytes': 10485760}}, 'loggers': {'anyconfig': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'causallift': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.io': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'kedro.pipeline': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.runner': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}}, 'root': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO'}, 'version': 1})[source]

Bases: object

Set up datasets for uplift modeling. Optionally, propensity scores are estimated based on logistic regression.
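
As a rough illustration of how the min_propensity, max_propensity, and enable_ipw parameters described below relate to each other, the following sketch clips a propensity score into a bounded range and derives an inverse probability weight from it. The helper names clip_propensity and ipw_weight are hypothetical and not part of CausalLift; the weighting formula is the textbook IPW definition, assumed here rather than taken from the library source.

```python
def clip_propensity(p, min_propensity=0.01, max_propensity=0.99):
    """Bound a propensity score so that weights derived from it stay finite."""
    return max(min_propensity, min(max_propensity, p))


def ipw_weight(treated, propensity):
    """Textbook inverse probability weight: 1/p if treated, 1/(1-p) otherwise."""
    p = clip_propensity(propensity)
    return 1.0 / p if treated else 1.0 / (1.0 - p)


# A treated sample with an extreme score is weighted as if p were 0.01.
w_treated = ipw_weight(True, 0.001)
# An untreated sample with p = 0.2 gets weight 1 / 0.8.
w_untreated = ipw_weight(False, 0.2)
```

Clipping matters because a propensity close to 0 or 1 would otherwise produce a very large weight for a single sample.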

Parameters:

train_df (Optional[DataFrame]) – pandas DataFrame containing samples used for training.
test_df (Optional[DataFrame]) – pandas DataFrame containing samples used for testing.
cols_features (Optional[List[str]]) – List of column names used as features. If None (default), all columns except the outcome, propensity, CATE, and recommendation columns are used.
col_treatment (str) – Name of treatment column. ‘Treatment’ by default.
col_outcome (str) – Name of outcome column. ‘Outcome’ by default.
col_propensity (str) – Name of propensity column. ‘Propensity’ by default.
col_cate (str) – Name of CATE (Conditional Average Treatment Effect) column. ‘CATE’ by default.
col_recommendation (str) – Name of recommendation column. ‘Recommendation’ by default.
col_weight (str) – Name of weight column. ‘Weight’ by default.
min_propensity (float) – Minimum propensity score. 0.01 by default.
max_propensity (float) – Maximum propensity score. 0.99 by default.
verbose (int) –
How much info to show. Valid values are:
- 0 to show nothing
- 1 to show only warnings
- 2 (default) to show useful info
- 3 to show more info

uplift_model_params (Union[Dict[str, List[Any]], Type[BaseEstimator]]) –

Parameters used to fit 2 XGBoost classifier models.

- Optionally use the search_cv key to specify the Search CV class name, e.g. sklearn.model_selection.GridSearchCV. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- Use the estimator key to specify the estimator class name, e.g. xgboost.XGBClassifier. Refer to https://xgboost.readthedocs.io/en/latest/parameter.html
- Optionally use the const_params key to specify the constant parameters used to construct the estimator.

If None (default):

dict(
    search_cv="sklearn.model_selection.GridSearchCV",
    estimator="xgboost.XGBClassifier",
    scoring=None,
    cv=3,
    return_train_score=False,
    n_jobs=-1,
    param_grid=dict(
        max_depth=[3],
        learning_rate=[0.1],
        n_estimators=[100],
        verbose=[0],
        objective=["binary:logistic"],
        booster=["gbtree"],
        n_jobs=[-1],
        nthread=[None],
        gamma=[0],
        min_child_weight=[1],
        max_delta_step=[0],
        subsample=[1],
        colsample_bytree=[1],
        colsample_bylevel=[1],
        reg_alpha=[0],
        reg_lambda=[1],
        scale_pos_weight=[1],
        base_score=[0.5],
        missing=[None],
    ),
)

Alternatively, an estimator object can be passed directly. The object must provide the following methods, compatible with the scikit-learn estimator interface:

- fit()
- predict()
- predict_proba()
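
A minimal sketch of an object that satisfies this interface, using only the standard library. The class name ConstantRateClassifier is hypothetical, it returns plain lists where a real estimator would return NumPy arrays, and whether CausalLift requires anything beyond these three methods (for example sample_weight support in fit()) is not stated here, so treat it as an illustration of the method shapes only.

```python
class ConstantRateClassifier:
    """Toy binary classifier that predicts the positive rate seen in training."""

    def fit(self, X, y, sample_weight=None):
        # Weighted mean of the binary labels; uniform weights if none are given.
        weights = [1.0] * len(y) if sample_weight is None else list(sample_weight)
        self.rate_ = sum(w * label for w, label in zip(weights, y)) / sum(weights)
        return self  # scikit-learn estimators return self from fit()

    def predict_proba(self, X):
        # One [P(class 0), P(class 1)] row per input sample.
        return [[1.0 - self.rate_, self.rate_] for _ in X]

    def predict(self, X):
        # Predict class 1 when its probability reaches 0.5.
        return [int(row[1] >= 0.5) for row in self.predict_proba(X)]


model = ConstantRateClassifier().fit([[0], [1], [2], [3]], [1, 1, 1, 0])
probabilities = model.predict_proba([[5], [6]])
labels = model.predict([[5], [6]])
```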

enable_ipw (bool) – Enable Inverse Probability Weighting based on the estimated propensity score. True by default.
enable_weighting (bool) – Enable Weighting. False by default.

propensity_model_params (Dict[str, List[Any]]) –

Parameters used to fit a logistic regression model to estimate the propensity score.

- Optionally use the search_cv key to specify the Search CV class name, e.g. sklearn.model_selection.GridSearchCV. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- Use the estimator key to specify the estimator class name, e.g. sklearn.linear_model.LogisticRegression. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- Optionally use the const_params key to specify the constant parameters used to construct the estimator.

If None (default):

dict(
    search_cv="sklearn.model_selection.GridSearchCV",
    estimator="sklearn.linear_model.LogisticRegression",
    scoring=None,
    cv=3,
    return_train_score=False,
    n_jobs=-1,
    param_grid=dict(
        C=[0.1, 1, 10],
        class_weight=[None],
        dual=[False],
        fit_intercept=[True],
        intercept_scaling=[1],
        max_iter=[100],
        multi_class=["ovr"],
        n_jobs=[1],
        penalty=["l1", "l2"],
        solver=["liblinear"],
        tol=[0.0001],
        warm_start=[False],
    ),
)
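
To make the size of this default search concrete, the sketch below enumerates the candidate settings of a grid as a Cartesian product, which is how an exhaustive grid search interprets param_grid. It uses only the standard library; the helper name expand_grid is hypothetical, and only the two multi-valued keys plus one single-valued key are shown for brevity.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of the listed values, one dict per candidate."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, values))
        for values in product(*(param_grid[k] for k in keys))
    ]


param_grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"], "solver": ["liblinear"]}
candidates = expand_grid(param_grid)

n_candidates = len(candidates)  # 3 values of C times 2 penalties
# With cv=3 each candidate is fitted once per fold (the final refit is not counted).
n_fits = n_candidates * 3
```

Keys with a single listed value, which is most of the default grid, do not increase the number of candidates.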

index_name (str) –
Index name of the pandas data frame after resetting the index. ‘index’ by default.

If None, the index will not be reset.
partition_name (str) – Additional index name to indicate the partition, train or test. ‘partition’ by default.
runner (str) –
If set to ‘SequentialRunner’ (default) or ‘ParallelRunner’, the pipeline is run by Kedro sequentially or in parallel, respectively.

If set to None, the pipeline is run by native Python.

Refer to https://kedro.readthedocs.io/en/latest/04_user_guide/05_nodes_and_pipelines.html#runners
conditionally_skip (bool) –
[Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’]

Skip running the pipeline if the output files already exist. False by default.
df_print – Callable used to display output data frames. IPython.display.display by default.

dataset_catalog (Dict[str, AbstractDataSet]) –

[Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’]

Specify dataset files to save in Dict[str, kedro.io.AbstractDataSet] format.

To find available file formats, refer to https://kedro.readthedocs.io/en/latest/kedro.io.html#data-sets

By default:

dict(
    # args_raw = CSVLocalDataSet(filepath='../data/01_raw/args_raw.csv', version=None),
    # train_df = CSVLocalDataSet(filepath='../data/01_raw/train_df.csv', version=None),
    # test_df = CSVLocalDataSet(filepath='../data/01_raw/test_df.csv', version=None),
    propensity_model  = PickleLocalDataSet(
        filepath='../data/06_models/propensity_model.pickle',
        version=None
    ),
    uplift_models_dict = PickleLocalDataSet(
        filepath='../data/06_models/uplift_models_dict.pickle',
        version=None
    ),
    df_03 = CSVLocalDataSet(
        filepath='../data/07_model_output/df.csv',
        load_args=dict(index_col=['partition', 'index'], float_precision='high'),
        save_args=dict(index=True, float_format='%.16e'),
        version=None,
    ),
    treated__sim_eval_df = CSVLocalDataSet(
        filepath='../data/08_reporting/treated__sim_eval_df.csv',
        version=None,
    ),
    untreated__sim_eval_df = CSVLocalDataSet(
        filepath='../data/08_reporting/untreated__sim_eval_df.csv',
        version=None,
    ),
    estimated_effect_df = CSVLocalDataSet(
        filepath='../data/08_reporting/estimated_effect_df.csv',
        version=None,
    ),
)

logging_config (Optional[Dict[str, Any]]) –

Specify logging configuration.

Refer to https://docs.python.org/3.6/library/logging.config.html#logging-config-dictschema

By default:

{'disable_existing_loggers': False,
 'formatters': {
     'json_formatter': {
         'class': 'pythonjsonlogger.jsonlogger.JsonFormatter',
         'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s',
     },
     'simple': {
         'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s',
     },
 },
 'handlers': {
     'console': {
         'class': 'logging.StreamHandler',
         'formatter': 'simple',
         'level': 'INFO',
         'stream': 'ext://sys.stdout',
     },
     'info_file_handler': {
         'class': 'logging.handlers.RotatingFileHandler',
         'level': 'INFO',
         'formatter': 'simple',
         'filename': './info.log',
         'maxBytes': 10485760,  # 10MB
         'backupCount': 20,
         'encoding': 'utf8',
         'delay': True,
     },
     'error_file_handler': {
         'class': 'logging.handlers.RotatingFileHandler',
         'level': 'ERROR',
         'formatter': 'simple',
         'filename': './errors.log',
         'maxBytes': 10485760,  # 10MB
         'backupCount': 20,
         'encoding': 'utf8',
         'delay': True,
     },
 },
 'loggers': {
     'anyconfig': {
         'handlers': ['console', 'info_file_handler', 'error_file_handler'],
         'level': 'WARNING',
         'propagate': False,
     },
     'kedro.io': {
         'handlers': ['console', 'info_file_handler', 'error_file_handler'],
         'level': 'WARNING',
         'propagate': False,
     },
     'kedro.pipeline': {
         'handlers': ['console', 'info_file_handler', 'error_file_handler'],
         'level': 'INFO',
         'propagate': False,
     },
     'kedro.runner': {
         'handlers': ['console', 'info_file_handler', 'error_file_handler'],
         'level': 'INFO',
         'propagate': False,
     },
     'causallift': {
         'handlers': ['console', 'info_file_handler', 'error_file_handler'],
         'level': 'INFO',
         'propagate': False,
     },
 },
 'root': {
     'handlers': ['console', 'info_file_handler', 'error_file_handler'],
     'level': 'INFO',
 },
 'version': 1}
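
The default above follows the dictionary schema accepted by logging.config.dictConfig in the standard library. As a sketch of what a smaller custom configuration could look like, the following console-only dictionary is applied directly with dictConfig. Passing such a dictionary as logging_config is assumed to behave the same way, and the variable name minimal_logging_config is illustrative.

```python
import logging
import logging.config

minimal_logging_config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "[%(asctime)s|%(name)s|%(levelname)s] %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "level": "INFO",
            "stream": "ext://sys.stdout",
        },
    },
    "loggers": {
        # Only warnings and above from the causallift logger reach the console.
        "causallift": {"handlers": ["console"], "level": "WARNING", "propagate": False},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

logging.config.dictConfig(minimal_logging_config)
logger = logging.getLogger("causallift")
logger.warning("shown on the console")
logger.info("suppressed because the logger level is WARNING")
```

Dropping the two rotating file handlers from the default avoids writing info.log and errors.log to the working directory.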

__init__(train_df=None, test_df=None, cols_features=None, col_treatment='Treatment', col_outcome='Outcome', col_propensity='Propensity', col_proba_if_treated='Proba_if_Treated', col_proba_if_untreated='Proba_if_Untreated', col_cate='CATE', col_recommendation='Recommendation', col_weight='Weight', min_propensity=0.01, max_propensity=0.99, verbose=2, uplift_model_params={'cv': 3, 'estimator': 'xgboost.XGBClassifier', 'n_jobs': -1, 'param_grid': {'base_score': [0.5], 'booster': ['gbtree'], 'colsample_bylevel': [1], 'colsample_bytree': [1], 'gamma': [0], 'learning_rate': [0.1], 'max_delta_step': [0], 'max_depth': [3], 'min_child_weight': [1], 'missing': [None], 'n_estimators': [100], 'n_jobs': [-1], 'nthread': [None], 'objective': ['binary:logistic'], 'random_state': [0], 'reg_alpha': [0], 'reg_lambda': [1], 'scale_pos_weight': [1], 'subsample': [1], 'verbose': [0]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, enable_ipw=True, enable_weighting=False, propensity_model_params={'cv': 3, 'estimator': 'sklearn.linear_model.LogisticRegression', 'n_jobs': -1, 'param_grid': {'C': [0.1, 1, 10], 'class_weight': [None], 'dual': [False], 'fit_intercept': [True], 'intercept_scaling': [1], 'max_iter': [100], 'multi_class': ['ovr'], 'n_jobs': [1], 'penalty': ['l1', 'l2'], 'random_state': [0], 'solver': ['liblinear'], 'tol': [0.0001], 'warm_start': [False]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, index_name='index', partition_name='partition', runner='SequentialRunner', conditionally_skip=False, df_print=<function display>, dataset_catalog={'df_03': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'estimated_effect_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'propensity_model': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>, 'treated__sim_eval_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 
'untreated__sim_eval_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'uplift_models_dict': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>}, logging_config={'disable_existing_loggers': False, 'formatters': {'json_formatter': {'class': 'pythonjsonlogger.jsonlogger.JsonFormatter', 'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s'}, 'simple': {'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'level': 'INFO', 'stream': 'ext://sys.stdout'}, 'error_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './errors.log', 'formatter': 'simple', 'level': 'ERROR', 'maxBytes': 10485760}, 'info_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './info.log', 'formatter': 'simple', 'level': 'INFO', 'maxBytes': 10485760}}, 'loggers': {'anyconfig': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'causallift': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.io': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'kedro.pipeline': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.runner': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}}, 'root': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO'}, 'version': 1})[source]

estimate_cate_by_2_models()[source]

Estimate CATE (Conditional Average Treatment Effect) using 2 XGBoost classifier models.

Return type: Tuple[DataFrame, DataFrame]
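
In the two-model approach, one classifier is trained on treated samples and another on untreated samples, and the CATE for a sample is the difference between the two predicted outcome probabilities. The sketch below shows only that final subtraction on already-predicted probabilities, in plain Python. The function name cate_from_two_models and the rule of recommending treatment when the estimated CATE is positive are assumptions for illustration, not CausalLift's actual logic.

```python
def cate_from_two_models(proba_if_treated, proba_if_untreated):
    """Per-sample CATE: P(outcome | treated) minus P(outcome | untreated)."""
    return [pt - pu for pt, pu in zip(proba_if_treated, proba_if_untreated)]


# Hypothetical predicted probabilities for three samples.
proba_if_treated = [0.75, 0.5, 0.25]
proba_if_untreated = [0.25, 0.5, 0.5]

cate = cate_from_two_models(proba_if_treated, proba_if_untreated)
# Illustrative rule: recommend treatment only where the estimated effect is positive.
recommendation = [int(effect > 0) for effect in cate]
```

The third sample has a negative estimated effect, meaning treatment is predicted to lower its outcome probability.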

estimate_recommendation_impact(cate_estimated=None, treatment_fraction_train=None, treatment_fraction_test=None, verbose=None)[source]

Estimate the impact of recommendation based on uplift modeling.

Parameters:

cate_estimated (Optional[Type[Series]]) – pandas Series containing the CATE. If None (default), use the values calculated by the estimate_cate_by_2_models method.
treatment_fraction_train (Optional[float]) – The fraction of treatment in the train dataset. If None (default), use the value calculated by the estimate_cate_by_2_models method.
treatment_fraction_test (Optional[float]) – The fraction of treatment in the test dataset. If None (default), use the value calculated by the estimate_cate_by_2_models method.
verbose (Optional[int]) – How much info to show. If None (default), use the value set in the constructor.

Return type:

Type[DataFrame]

causallift.generate_data module

This code is an enhanced (backward-compatible) version that can simulate an observational dataset including “sleeping dogs.”

“Sleeping dogs” (people who will “buy” if not treated but will not “buy” if treated) can be simulated by negative values in the tau parameter. Observational data that includes confounding can be simulated by non-zero values in the propensity_coef parameter. An A/B test (RCT) with a 50:50 split can be simulated by all-zero values in the propensity_coef parameter (default). The first element in each list parameter specifies the intercept.
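
The relationship between propensity_coef and the treatment probability can be sketched as a logistic transform of a linear score. The code below is a standard-library illustration under that assumption, with the first coefficient acting as the intercept as described above. The function name propensity is hypothetical, and this is not the module's implementation.

```python
import math


def propensity(features, propensity_coef):
    """Treatment probability from log-odds = intercept + sum(coef * feature)."""
    intercept, *coefs = propensity_coef
    log_odds = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


x = [0.2, 0.9, 0.4]

# All-zero coefficients: log-odds of 0 for every sample, i.e. a 50:50 split.
p_rct = propensity(x, [0, 0, 0, 0])

# Non-zero coefficients: the treatment probability now depends on the features,
# which is what introduces confounding into the simulated data.
p_observational = propensity(x, [0, 0, 2.0, 0])
```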

causallift.generate_data.generate_data(N=1000, n_features=3, beta=[1, -2, 3, -0.8], error_std=0.5, tau=3, discrete_outcome=False)[source]

Generates random data with a ground truth data generating process. Draws random values for features from [0, 1), errors from a 0-centered distribution with std error_std, and creates an outcome y.

Parameters:

N (Optional[int]) – Number of observations.
n_features (Optional[int]) – Number of features.
beta (Optional[List[float]]) – Array of beta coefficients to multiply by X to get y.
error_std (Optional[float]) – Standard deviation (scale) of the distribution from which errors are drawn.
tau (Union[List[float], float]) – Array of coefficients to multiply by X to get y if treated. More or larger negative values will simulate more “sleeping dogs.” If a float scalar is given, the effect of features is not considered.
tau_std (Optional[float]) – When not None, draws tau from a normal distribution centered around tau with standard deviation tau_std rather than just using a constant value of tau.
discrete_outcome (Optional[bool]) – If True, outcomes are 0 or 1; otherwise continuous.
seed (Optional[int]) – Random seed fed to np.random.seed to allow for deterministic behavior.
feature_effect (Optional[float]) – Effect of beta on outcome if treated.
propensity_coef (Optional[List[float]]) – Array of coefficients to multiply by X to get the propensity log-odds of being treated.
index_name (Optional[str]) – Index name in the output DataFrame. If None (default), the index name will not be set.

Returns:

A DataFrame containing the generated data.

Return type:

pd.DataFrame
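
The data generating process described above can be approximated with a short standard-library sketch. This is a simplified stand-in and not the function's implementation: it assumes normally distributed errors, a scalar tau added for treated rows, and 50:50 random treatment assignment, and the name simulate_rows is hypothetical.

```python
import random


def simulate_rows(n, beta, tau, error_std, seed=0):
    """Simplified generator: linear outcome plus a constant treatment effect."""
    rng = random.Random(seed)
    intercept, *coefs = beta  # the first element of beta is the intercept
    rows = []
    for _ in range(n):
        features = [rng.random() for _ in coefs]  # uniform on [0, 1)
        treatment = int(rng.random() < 0.5)       # assumed 50:50 assignment
        error = rng.gauss(0.0, error_std)         # assumed normal, zero-centered
        outcome = (
            intercept
            + sum(c * x for c, x in zip(coefs, features))
            + tau * treatment
            + error
        )
        rows.append({"features": features, "treatment": treatment, "outcome": outcome})
    return rows


rows = simulate_rows(n=1000, beta=[1, -2, 3, -0.8], tau=3, error_std=0.5)
```

With a positive scalar tau, the average outcome of treated rows exceeds that of untreated rows by roughly tau.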

causallift.pipeline module

Pipeline construction.

causallift.pipeline.create_pipeline(**kwargs)[source]

Create the project’s pipeline.

Parameters: kwargs – Ignore any additional arguments added in the future.
Returns: The resulting pipeline.
Return type: Pipeline
