Measuring Statistical Bias with Amazon SageMaker Clarify
Last updated at Wed Oct 27 2021
Skills
data-wrangling
statistical-analysis
Tools
sagemaker
aws

Learning Objectives

  • Statistical Bias Metrics
  • Measuring Statistical Bias with SageMaker Clarify

Scenario

Well, you've had quite a day! You've reached home, ordered a burger, and want to watch Netflix. Netflix immediately starts throwing recommendations at you, because its recommendation engine weighs several factors and puts movies on the screen that you might like. What we observe here is called selection bias. Selection bias is one cause of statistical bias in any data generated from views or reviews of the recommended movies.


Metrics for Statistical Bias

There are several such causes of statistical bias (activity bias, societal bias, ...). To measure statistical bias we use certain metrics. I will not explain all of them here. Please read about them here
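To make the idea of a bias metric concrete, here is a minimal sketch of two of the pre-training metrics that Clarify reports: Class Imbalance (CI) and Difference in Proportions of Labels (DPL). The function names and the ten-row toy data are assumptions made up for illustration; they are not part of the SageMaker SDK or of the data set used later.

```python
def class_imbalance(facet_values, disadvantaged):
    """CI = (n_a - n_d) / (n_a + n_d), where d is the facet value under scrutiny."""
    n_d = sum(1 for value in facet_values if value == disadvantaged)
    n_a = len(facet_values) - n_d
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(facet_values, labels, disadvantaged, positive=1):
    """DPL = q_a - q_d, the gap in positive-label rates between the two groups."""
    pos_a = pos_d = n_a = n_d = 0
    for value, label in zip(facet_values, labels):
        if value == disadvantaged:
            n_d += 1
            pos_d += label == positive
        else:
            n_a += 1
            pos_a += label == positive
    # Assumes both groups are non-empty; otherwise this divides by zero.
    return pos_a / n_a - pos_d / n_d


# Hypothetical toy data: 2 "PhD" rows and 8 "Graduation" rows.
education = ["PhD"] * 2 + ["Graduation"] * 8
response = [0, 0] + [1, 1, 1, 1, 0, 0, 0, 0]

ci = class_imbalance(education, disadvantaged="PhD")
dpl = difference_in_proportions_of_labels(education, response, disadvantaged="PhD")
print(f"CI = {ci:.2f}, DPL = {dpl:.2f}")  # CI = 0.60, DPL = 0.50
```

A CI close to 0 means the two groups are about equally represented, and a DPL close to 0 means both groups receive the positive label at a similar rate. In this toy data the "PhD" group is both under-represented and never labelled positive.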


Evaluating Statistical Bias

Our objective in this learnbook is to use Amazon SageMaker Clarify to measure statistical bias by evaluating the metrics we saw earlier. We can also use Amazon SageMaker Data Wrangler for this evaluation, so you might be wondering what the difference between them is. The illustration below explains it.

[Illustration: statistical Bias.PNG]


Data

We are going to use this data set for the analysis.
Credits: Thanks to Akash Patel for this data.


SageMaker Clarify

  • Create an S3 bucket and upload the data to it.
  • Create a SageMaker Studio environment. [For reference]. Creating SageMaker Studio might take some time.
  • Once Studio is active, open a Python 3 notebook.
  • We will use the awswrangler and sagemaker.clarify libraries to evaluate pre-training statistical bias.
import awswrangler as wr
from sagemaker import clarify
from sagemaker import get_execution_role
from sagemaker import Session
  • We need to work with data that is available in the S3 bucket. I am using awswrangler to read the data from S3 into a DataFrame. We can also use boto3 to accomplish this task.
df = wr.s3.read_csv('s3://soma-tmc-hack/marketing_campaign.csv')
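For the boto3 route mentioned in this step, a minimal sketch could look like the following. The helper name `read_csv_from_s3` is hypothetical, and the bucket and key in the usage comment are placeholders; the S3 client is passed in so that the function itself does not need AWS credentials until it is called.

```python
import io

import pandas as pd


def read_csv_from_s3(s3_client, bucket, key, **read_csv_kwargs):
    """Download one S3 object and parse it as CSV into a DataFrame."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    body = response["Body"].read()  # the streaming body is read fully into bytes
    return pd.read_csv(io.BytesIO(body), **read_csv_kwargs)


# Usage (needs AWS credentials; bucket and key are placeholders):
# import boto3
# df = read_csv_from_s3(boto3.client("s3"), "bucket_name", "marketing_campaign.csv")
```

Extra keyword arguments are forwarded to `pandas.read_csv`, which helps if the file uses a separator other than a comma.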
  • As shown in the illustration, SageMaker Clarify uses distributed processing where available, which makes it computationally efficient on large datasets. So we need to create a Clarify processor to run the job.
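session = Session()
bucket = session.default_bucket()  # default SageMaker bucket, used for the report output
prefix = "sagemaker/DEMO-sagemaker-clarify"
role = get_execution_role()  # IAM role attached to the Studio notebook

# Processor that runs the Clarify job on its own instance
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.t3.medium",
    sagemaker_session=session,
)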
  • We need to define two configurations before we evaluate pre-training statistical bias.

    1. bias_data_config specifies the input location of the data, the output location for the report, the target variable, and the type of data
    2. bias_config specifies which facets we are checking for bias, their values, and thresholds
  • Once the configurations are defined, we can evaluate pre-training statistical bias using all the metrics or a specific list of them. Here we evaluate all the metrics.

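# The report is written under the session's default bucket
bias_report_output_path = "s3://{}/{}/clarify-bias".format(bucket, prefix)

# Where the data lives, where the report goes, and which column is the label
bias_data_config = clarify.DataConfig(
    s3_data_input_path='s3://bucket_name/marketing_campaign.csv',  # the same CSV that was loaded into df
    s3_output_path=bias_report_output_path,
    label="Response",
    headers=df.columns.to_list(),
    dataset_type="text/csv",
)

# Positive label value, the facet to check, and the column used to form subgroups
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="Education",
    group_name="Marital_Status",
)

# methods accepts "all" or a list of metric names, for example ["CI", "DPL"]
clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    methods="all",
)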

This generates and saves the report to the specified output path.
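Once the report is downloaded, the metrics can be pulled out of it programmatically. The sketch below assumes the output folder contains an `analysis.json` file whose pre-training section is nested as facet name, then facet value, then a list of metrics; that layout, the helper name `flatten_pre_training_metrics`, and the sample values are assumptions for illustration, so check them against your own report before relying on them.

```python
import json


def flatten_pre_training_metrics(analysis):
    """Return (facet, facet_value, metric_name, value) rows from an analysis dict."""
    rows = []
    facets = analysis.get("pre_training_bias_metrics", {}).get("facets", {})
    for facet_name, entries in facets.items():
        for entry in entries:
            for metric in entry.get("metrics", []):
                rows.append(
                    (
                        facet_name,
                        entry.get("value_or_threshold"),
                        metric.get("name"),
                        metric.get("value"),
                    )
                )
    return rows


# Hypothetical sample that mimics the assumed layout; the numbers are made up.
sample = json.loads(
    """
    {
      "pre_training_bias_metrics": {
        "label": "Response",
        "facets": {
          "Education": [
            {
              "value_or_threshold": "PhD",
              "metrics": [
                {"name": "CI", "value": 0.6},
                {"name": "DPL", "value": 0.5}
              ]
            }
          ]
        }
      }
    }
    """
)

for row in flatten_pre_training_metrics(sample):
    print(row)
```

With a real report, replace `sample` with the result of `json.load` on the downloaded file.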

  • Clarify not only evaluates pre-training bias but also covers several other things, such as feature importance, explainability, and model behavior.

As you finish your experiments, make sure you delete all compute instances as well as SageMaker Studio to save cloud costs.


Quiz Time!

  • Using the same procedure, check for statistical bias on the Marital_Status feature
  • Reflect a bit on your reading and take this quiz here to validate your learning

Created with 💙 by
Soma
Cloud Engineer, Mentorskool