causallift package

CausalLift

Subpackages

Submodules

causallift.causal_lift module

class causallift.causal_lift.CausalLift(train_df=None, test_df=None, cols_features=None, col_treatment='Treatment', col_outcome='Outcome', col_propensity='Propensity', col_proba_if_treated='Proba_if_Treated', col_proba_if_untreated='Proba_if_Untreated', col_cate='CATE', col_recommendation='Recommendation', col_weight='Weight', min_propensity=0.01, max_propensity=0.99, verbose=2, uplift_model_params={'cv': 3, 'estimator': 'xgboost.XGBClassifier', 'n_jobs': -1, 'param_grid': {'base_score': [0.5], 'booster': ['gbtree'], 'colsample_bylevel': [1], 'colsample_bytree': [1], 'gamma': [0], 'learning_rate': [0.1], 'max_delta_step': [0], 'max_depth': [3], 'min_child_weight': [1], 'missing': [None], 'n_estimators': [100], 'n_jobs': [-1], 'nthread': [None], 'objective': ['binary:logistic'], 'random_state': [0], 'reg_alpha': [0], 'reg_lambda': [1], 'scale_pos_weight': [1], 'subsample': [1], 'verbose': [0]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, enable_ipw=True, enable_weighting=False, propensity_model_params={'cv': 3, 'estimator': 'sklearn.linear_model.LogisticRegression', 'n_jobs': -1, 'param_grid': {'C': [0.1, 1, 10], 'class_weight': [None], 'dual': [False], 'fit_intercept': [True], 'intercept_scaling': [1], 'max_iter': [100], 'multi_class': ['ovr'], 'n_jobs': [1], 'penalty': ['l1', 'l2'], 'random_state': [0], 'solver': ['liblinear'], 'tol': [0.0001], 'warm_start': [False]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, index_name='index', partition_name='partition', runner='SequentialRunner', conditionally_skip=False, df_print=<function display>, dataset_catalog={'df_03': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'estimated_effect_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'propensity_model': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>, 'treated__sim_eval_df': 
<kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'untreated__sim_eval_df': <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object>, 'uplift_models_dict': <kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet object>}, logging_config={'disable_existing_loggers': False, 'formatters': {'json_formatter': {'class': 'pythonjsonlogger.jsonlogger.JsonFormatter', 'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s'}, 'simple': {'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'level': 'INFO', 'stream': 'ext://sys.stdout'}, 'error_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './errors.log', 'formatter': 'simple', 'level': 'ERROR', 'maxBytes': 10485760}, 'info_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './info.log', 'formatter': 'simple', 'level': 'INFO', 'maxBytes': 10485760}}, 'loggers': {'anyconfig': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'causallift': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.io': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'kedro.pipeline': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.runner': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}}, 'root': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO'}, 'version': 1})[source]

Bases: object

Set up datasets for uplift modeling. Optionally, propensity scores are estimated based on logistic regression.
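
As a rough illustration of how an estimated propensity score can feed Inverse Probability Weighting, the sketch below clips a score into the [min_propensity, max_propensity] range and converts it to a weight. The helper names and the weighting formula (1/p for treated samples, 1/(1-p) for untreated samples) are assumptions based on the standard IPW definition, not code taken from CausalLift.

```python
def clip_propensity(p, min_propensity=0.01, max_propensity=0.99):
    """Clamp an estimated propensity score into [min_propensity, max_propensity]."""
    return min(max(p, min_propensity), max_propensity)


def ipw_weight(treated, propensity):
    """Standard inverse probability weight: 1/p if treated, 1/(1-p) otherwise."""
    p = clip_propensity(propensity)
    return 1.0 / p if treated else 1.0 / (1.0 - p)


# Hypothetical (treatment flag, estimated propensity) pairs.
samples = [(1, 0.8), (0, 0.8), (1, 0.001), (0, 0.5)]
weights = [ipw_weight(t, p) for t, p in samples]
```

Clipping keeps a near-zero propensity such as 0.001 from producing an extremely large weight: with the default bounds the weight is capped at 100.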

Parameters:
  • train_df (Optional[DataFrame]) – pandas DataFrame containing samples used for training

  • test_df (Optional[DataFrame]) – pandas DataFrame containing samples used for testing

  • cols_features (Optional[List[str]]) – List of column names used as features. If None (default), all the columns except for outcome, propensity, CATE, and recommendation are used.

  • col_treatment (str) – Name of treatment column. ‘Treatment’ by default.

  • col_outcome (str) – Name of outcome column. ‘Outcome’ by default.

  • col_propensity (str) – Name of propensity column. ‘Propensity’ by default.

  • col_cate (str) – Name of CATE (Conditional Average Treatment Effect) column. ‘CATE’ by default.

  • col_recommendation (str) – Name of recommendation column. ‘Recommendation’ by default.

  • col_weight (str) – Name of weight column. ‘Weight’ by default.

  • min_propensity (float) – Minimum propensity score. 0.01 by default.

  • max_propensity (float) – Maximum propensity score. 0.99 by default.

  • verbose (int) –

    How much info to show. Valid values are:

    • 0 to show nothing

    • 1 to show only warnings

    • 2 (default) to show useful info

    • 3 to show more info

  • uplift_model_params (Union[Dict[str, List[Any]], Type[BaseEstimator]]) –

    Parameters used to fit 2 XGBoost classifier models.

    If None (default):

    dict(
        search_cv="sklearn.model_selection.GridSearchCV",
        estimator="xgboost.XGBClassifier",
        scoring=None,
        cv=3,
        return_train_score=False,
        n_jobs=-1,
        param_grid=dict(
            max_depth=[3],
            learning_rate=[0.1],
            n_estimators=[100],
            verbose=[0],
            objective=["binary:logistic"],
            booster=["gbtree"],
            n_jobs=[-1],
            nthread=[None],
            gamma=[0],
            min_child_weight=[1],
            max_delta_step=[0],
            subsample=[1],
            colsample_bytree=[1],
            colsample_bylevel=[1],
            reg_alpha=[0],
            reg_lambda=[1],
            scale_pos_weight=[1],
            base_score=[0.5],
            missing=[None],
        ),
    )
    

    Alternatively, an estimator model object is acceptable. The object must have the following methods, compatible with the scikit-learn estimator interface:

    • fit()

    • predict()

    • predict_proba()

  • enable_ipw (bool) – Enable Inverse Probability Weighting based on the estimated propensity score. True by default.

  • enable_weighting (bool) – Enable Weighting. False by default.

  • propensity_model_params (Dict[str, List[Any]]) –

    Parameters used to fit logistic regression model to estimate propensity score.

    If None (default):

    dict(
        search_cv="sklearn.model_selection.GridSearchCV",
        estimator="sklearn.linear_model.LogisticRegression",
        scoring=None,
        cv=3,
        return_train_score=False,
        n_jobs=-1,
        param_grid=dict(
            C=[0.1, 1, 10],
            class_weight=[None],
            dual=[False],
            fit_intercept=[True],
            intercept_scaling=[1],
            max_iter=[100],
            multi_class=["ovr"],
            n_jobs=[1],
            penalty=["l1", "l2"],
            solver=["liblinear"],
            tol=[0.0001],
            warm_start=[False],
        ),
    )
    

  • index_name (str) –

    Index name of the pandas DataFrame after resetting the index. ‘index’ by default.

    If None, the index will not be reset.

  • partition_name (str) – Additional index name to indicate the partition, train or test. ‘partition’ by default.

  • runner (str) –

    If set to ‘SequentialRunner’ (default) or ‘ParallelRunner’, the pipeline is run by Kedro sequentially or in parallel, respectively.

    If set to None, the pipeline is run by native Python.

    Refer to https://kedro.readthedocs.io/en/latest/04_user_guide/05_nodes_and_pipelines.html#runners

  • conditionally_skip (bool) –

    [Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’]

    Skip running the pipeline if the output files already exist. False by default.

  • df_print – Callable used to show output data frames. IPython.display.display by default.

  • dataset_catalog (Dict[str, AbstractDataSet]) –

    [Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’]

    Specify dataset files to save in Dict[str, kedro.io.AbstractDataSet] format.

    To find available file formats, refer to https://kedro.readthedocs.io/en/latest/kedro.io.html#data-sets

    By default:

    dict(
        # args_raw = CSVLocalDataSet(filepath='../data/01_raw/args_raw.csv', version=None),
        # train_df = CSVLocalDataSet(filepath='../data/01_raw/train_df.csv', version=None),
        # test_df = CSVLocalDataSet(filepath='../data/01_raw/test_df.csv', version=None),
        propensity_model = PickleLocalDataSet(
            filepath='../data/06_models/propensity_model.pickle',
            version=None
        ),
        uplift_models_dict = PickleLocalDataSet(
            filepath='../data/06_models/uplift_models_dict.pickle',
            version=None
        ),
        df_03 = CSVLocalDataSet(
            filepath='../data/07_model_output/df.csv',
            load_args=dict(index_col=['partition', 'index'], float_precision='high'),
            save_args=dict(index=True, float_format='%.16e'),
            version=None,
        ),
        treated__sim_eval_df = CSVLocalDataSet(
            filepath='../data/08_reporting/treated__sim_eval_df.csv',
            version=None,
        ),
        untreated__sim_eval_df = CSVLocalDataSet(
            filepath='../data/08_reporting/untreated__sim_eval_df.csv',
            version=None,
        ),
        estimated_effect_df = CSVLocalDataSet(
            filepath='../data/08_reporting/estimated_effect_df.csv',
            version=None,
        ),
    )
    

  • logging_config (Optional[Dict[str, Any]]) –

    Specify logging configuration.

    Refer to https://docs.python.org/3.6/library/logging.config.html#logging-config-dictschema

    By default:

    {'disable_existing_loggers': False,
     'formatters': {
         'json_formatter': {
             'class': 'pythonjsonlogger.jsonlogger.JsonFormatter',
             'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s',
         },
         'simple': {
             'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s',
         },
     },
     'handlers': {
         'console': {
             'class': 'logging.StreamHandler',
             'formatter': 'simple',
             'level': 'INFO',
             'stream': 'ext://sys.stdout',
         },
         'info_file_handler': {
             'class': 'logging.handlers.RotatingFileHandler',
             'level': 'INFO',
             'formatter': 'simple',
             'filename': './info.log',
             'maxBytes': 10485760,  # 10MB
             'backupCount': 20,
             'encoding': 'utf8',
             'delay': True,
         },
         'error_file_handler': {
             'class': 'logging.handlers.RotatingFileHandler',
             'level': 'ERROR',
             'formatter': 'simple',
             'filename': './errors.log',
             'maxBytes': 10485760,  # 10MB
             'backupCount': 20,
             'encoding': 'utf8',
             'delay': True,
         },
     },
     'loggers': {
         'anyconfig': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'WARNING',
             'propagate': False,
         },
         'kedro.io': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'WARNING',
             'propagate': False,
         },
         'kedro.pipeline': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'INFO',
             'propagate': False,
         },
         'kedro.runner': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'INFO',
             'propagate': False,
         },
         'causallift': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'INFO',
             'propagate': False,
         },
     },
     'root': {
         'handlers': ['console', 'info_file_handler', 'error_file_handler'],
         'level': 'INFO',
     },
     'version': 1}
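
Because the dictionary follows the Python logging dictConfig schema, a customized variant can be checked with the standard library before it is passed as logging_config. The sketch below is a trimmed-down, hypothetical configuration (console output only, with the causallift logger raised to WARNING); that CausalLift applies the dictionary through logging.config.dictConfig is an assumption here.

```python
import logging
import logging.config

# Hypothetical reduced configuration: no file handlers, console output only.
quiet_logging_config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "[%(asctime)s|%(name)s|%(levelname)s] %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "level": "INFO",
            "stream": "ext://sys.stdout",
        },
    },
    "loggers": {
        "causallift": {
            "handlers": ["console"],
            "level": "WARNING",
            "propagate": False,
        },
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

# dictConfig raises ValueError if the dictionary violates the schema.
logging.config.dictConfig(quiet_logging_config)
logger = logging.getLogger("causallift")
```

Dropping the two RotatingFileHandler entries avoids creating ./info.log and ./errors.log in the working directory.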
    

estimate_cate_by_2_models()[source]

Estimate CATE (Conditional Average Treatment Effect) using 2 XGBoost classifier models.

Return type:

Tuple[DataFrame, DataFrame]
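
The two-model idea can be sketched without any dependencies: one model is fitted on treated samples, another on untreated samples, and the CATE is the difference between their predicted outcome probabilities. In the sketch below a simple per-feature frequency table stands in for the XGBoost classifiers, and the data and function name are hypothetical.

```python
from collections import defaultdict


def fit_frequency_model(rows):
    """Stand-in for a classifier: P(outcome=1 | feature) as an observed frequency."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [positives, total]
    for feature, outcome in rows:
        counts[feature][0] += outcome
        counts[feature][1] += 1
    return {f: pos / total for f, (pos, total) in counts.items()}


# Hypothetical rows: (feature, treatment, outcome).
data = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 0), ("b", 1, 0), ("b", 0, 1), ("b", 0, 1),
]

# Model 1 is fitted on treated samples only, model 2 on untreated samples only.
proba_if_treated = fit_frequency_model([(f, y) for f, t, y in data if t == 1])
proba_if_untreated = fit_frequency_model([(f, y) for f, t, y in data if t == 0])

# CATE per feature value: difference between the two predicted probabilities.
cate = {f: proba_if_treated[f] - proba_if_untreated[f] for f in proba_if_treated}
```

Here the group "a" has a positive CATE (0.5), while the group "b" has a negative CATE (-1.0), meaning treatment lowers the outcome probability for that group.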

estimate_recommendation_impact(cate_estimated=None, treatment_fraction_train=None, treatment_fraction_test=None, verbose=None)[source]

Estimate the impact of recommendation based on uplift modeling.

Parameters:
  • cate_estimated (Optional[Type[Series]]) – pandas Series containing the CATE. If None (default), the values calculated by the estimate_cate_by_2_models method are used.

  • treatment_fraction_train (Optional[float]) – The fraction of treatment in the train dataset. If None (default), the value calculated by the estimate_cate_by_2_models method is used.

  • treatment_fraction_test (Optional[float]) – The fraction of treatment in the test dataset. If None (default), the value calculated by the estimate_cate_by_2_models method is used.

  • verbose (Optional[int]) – How much info to show. If None (default), use the value set in the constructor.

Return type:

Type[DataFrame]
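
One plausible way to turn CATE estimates into a recommendation is to keep the treatment fraction fixed and assign treatment to the samples with the highest CATE. The sketch below illustrates that rule only; the function name is hypothetical, and the exact rule used by estimate_recommendation_impact is not stated on this page.

```python
def recommend_by_cate(cate_values, treatment_fraction):
    """Return 1 for the top `treatment_fraction` of samples by CATE, else 0."""
    n_treat = round(len(cate_values) * treatment_fraction)
    # Indices sorted from the largest CATE to the smallest.
    ranked = sorted(
        range(len(cate_values)), key=lambda i: cate_values[i], reverse=True
    )
    chosen = set(ranked[:n_treat])
    return [1 if i in chosen else 0 for i in range(len(cate_values))]


# Hypothetical CATE estimates for four samples.
cate_estimated = [0.30, -0.10, 0.05, 0.20]
recommendation = recommend_by_cate(cate_estimated, treatment_fraction=0.5)
```

With a treatment fraction of 0.5, the two samples with the largest CATE (0.30 and 0.20) are recommended for treatment.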

causallift.generate_data module

The original code is at https://github.com/wayfair/pylift/blob/master/pylift/generate_data.py, licensed under the BSD 2-Clause “Simplified” License. Copyright 2018, Wayfair, Inc.

This code is an enhanced (backward-compatible) version that can simulate an observational dataset including “sleeping dogs.”

“Sleeping dogs” (people who will “buy” if not treated but will not “buy” if treated) can be simulated by negative values in tau parameter. Observational data which includes confounding can be simulated by non-zero values in propensity_coef parameter. A/B Test (RCT) with a 50:50 split can be simulated by all-zeros values in propensity_coef parameter (default). The first element in each list parameter specifies the intercept.
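
The role of propensity_coef can be illustrated with a small sketch. It assumes the treatment probability is the logistic function of the log-odds formed from an intercept plus the coefficient-weighted features; the function name and feature values are hypothetical, and this is not the module's actual implementation.

```python
import math


def treatment_probability(features, propensity_coef):
    """Logistic of intercept + sum(coef_i * x_i); the first coefficient is the intercept."""
    log_odds = propensity_coef[0] + sum(
        c * x for c, x in zip(propensity_coef[1:], features)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


x = [0.2, 0.9, 0.5]  # hypothetical feature values in [0, 1)

# All-zero coefficients: log-odds of 0, so every sample is treated with probability 0.5.
p_rct = treatment_probability(x, [0, 0, 0, 0])

# A non-zero coefficient makes the treatment probability depend on the first feature.
p_confounded = treatment_probability(x, [0, 2, 0, 0])
```

When the same feature also affects the outcome, this dependence between features and treatment assignment is what introduces confounding.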

causallift.generate_data.generate_data(N=1000, n_features=3, beta=[1, -2, 3, -0.8], error_std=0.5, tau=3, discrete_outcome=False)[source]

Generates random data with a ground truth data generating process. Draws random values for features from [0, 1), errors from a 0-centered distribution with std error_std, and creates an outcome y.

Parameters:
  • N – (Optional[int]) - Number of observations.

  • n_features – (Optional[int]) - Number of features.

  • beta – (Optional[List[float]]) - Array of beta coefficients to multiply by X to get y.

  • error_std – (Optional[float]) - Standard deviation (scale) of distribution from which errors are drawn.

  • tau – (Union[List[float], float]) - Array of coefficients to multiply by X to get y if treated. More or larger negative values will simulate more “sleeping dogs.” If a float scalar is given, the effect of features is not considered.

  • tau_std – (Optional[float]) - When not None, draws tau from a normal distribution centered around tau with standard deviation tau_std rather than just using a constant value of tau.

  • discrete_outcome – (Optional[bool]) - If True, outcomes are 0 or 1; otherwise continuous.

  • seed – (Optional[int]) - Random seed fed to np.random.seed to allow for deterministic behavior.

  • feature_effect – (Optional[float]) - Effect of beta on outcome if treated.

  • propensity_coef – (Optional[List[float]]) - Array of coefficients to multiply by X to get propensity log-odds to be treated.

  • index_name – (Optional[str]) - Index name in the output DataFrame. If None (default), index name will not be set.

Returns:

A DataFrame containing the generated data.

Return type:

pd.DataFrame

causallift.pipeline module

Pipeline construction.

causallift.pipeline.create_pipeline(**kwargs)[source]

Create the project’s pipeline.

Parameters:

kwargs – Ignore any additional arguments added in the future.

Returns:

The resulting pipeline.

Return type:

Pipeline

causallift.run module