Measuring Statistical Bias with Amazon SageMaker Clarify

Last updated at Wed Oct 27 2021

Skills: data-wrangling, statistical-analysis
Tools: sagemaker, aws

### Learning Objectives

• Statistical Bias Metrics
• Measuring Statistical Bias with SageMaker Clarify

### Scenario

Well, you've had quite a day! You've reached home, ordered a burger, and want to watch Netflix. Netflix starts throwing recommendations at you, because its recommendation engine weighs several factors and puts movies on the screen that you might like. What we observe here is called Selection Bias. Selection bias is one of the causes of statistical bias in any data generated from views or reviews of the recommended movies.

### Metrics for Statistical Bias

There are several such causes of statistical bias (Activity Bias, Societal Bias, ...). To measure statistical bias we use certain metrics. I do not want to explain all of them here. Please read about them here.
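To give a feel for what such metrics measure, here is a minimal sketch that computes two of the pre-training metrics, Class Imbalance (CI) and Difference in Proportions of Labels (DPL), by hand. The formulas follow the definitions in the Clarify documentation as I understand them: CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where a and d are the two facet groups, n is the group size and q is the share of positive labels. The toy records, the facet values, and the helper names are made up for illustration and do not come from the data set used later.

```python
# Toy records (hypothetical): a facet column and a binary label column.
records = [
    {"education": "Graduate", "response": 1},
    {"education": "Graduate", "response": 1},
    {"education": "Graduate", "response": 0},
    {"education": "Graduate", "response": 1},
    {"education": "Graduate", "response": 0},
    {"education": "Graduate", "response": 1},
    {"education": "Basic", "response": 0},
    {"education": "Basic", "response": 1},
    {"education": "Basic", "response": 0},
    {"education": "Basic", "response": 0},
]


def split_by_facet(rows, facet, value):
    """Return (group_a, group_d); rows matching `value` form group d."""
    group_d = [r for r in rows if r[facet] == value]
    group_a = [r for r in rows if r[facet] != value]
    return group_a, group_d


def class_imbalance(rows, facet, value):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means equally sized groups."""
    group_a, group_d = split_by_facet(rows, facet, value)
    return (len(group_a) - len(group_d)) / (len(group_a) + len(group_d))


def difference_in_proportions_of_labels(rows, facet, value, label, positive=1):
    """DPL = q_a - q_d, the gap in positive-label rates between the groups."""
    group_a, group_d = split_by_facet(rows, facet, value)
    q_a = sum(r[label] == positive for r in group_a) / len(group_a)
    q_d = sum(r[label] == positive for r in group_d) / len(group_d)
    return q_a - q_d


ci = class_imbalance(records, "education", "Basic")
dpl = difference_in_proportions_of_labels(records, "education", "Basic", "response")
print(f"CI  = {ci:.3f}")
print(f"DPL = {dpl:.3f}")
```

Treating the rows that match the given facet value as group d is a choice made for this sketch. Values near zero suggest balance between the groups, while values far from zero point to a group that is under-represented (CI) or that receives positive labels less often (DPL).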

### Evaluating Statistical Bias

Our objective for this learnbook is to use Amazon SageMaker Clarify to measure statistical bias by evaluating the metrics we saw earlier. We could also use Amazon SageMaker Data Wrangler for this, so you might be wondering what the difference between the two is. Please look at the illustration below for an understanding.

### Data

We are going to use this data set for the analysis.

Credits: Thanks to Akash Patel for this data.

### SageMaker Clarify

• Create an S3 bucket and upload the data to it.
• Create a SageMaker Studio environment. [For reference]. Creating SageMaker Studio might take some time.
• Once Studio is active, open a Python 3 notebook.
• We will use the awswrangler library and the clarify module of the sagemaker SDK to evaluate pre-training statistical bias.
``````
import awswrangler as wr
from sagemaker import clarify
from sagemaker import get_execution_role
from sagemaker import Session
``````
• We need to work with data that is stored in an S3 bucket. I am using awswrangler to read the data from S3 into a DataFrame. We could also use boto3 to accomplish this task.
``````
df = wr.s3.read_csv('s3://soma-tmc-hack/marketing_campaign.csv')
``````
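For the boto3 route mentioned above, a minimal sketch could look like the following. The helper names `split_s3_uri` and `read_csv_with_boto3` are hypothetical, and the sketch assumes the object is a plain comma-separated file and that the notebook role can read the bucket.

```python
import io
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_csv_with_boto3(uri):
    """Download a CSV object from S3 and load it into a pandas DataFrame."""
    # Imported lazily so the URI helper above works without AWS libraries.
    import boto3
    import pandas as pd

    bucket, key = split_s3_uri(uri)
    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    # The response body is a stream; read it fully and hand the bytes to pandas.
    return pd.read_csv(io.BytesIO(response["Body"].read()))


# Usage (needs AWS credentials with read access to the bucket):
# df = read_csv_with_boto3("s3://soma-tmc-hack/marketing_campaign.csv")
```

awswrangler hides these steps behind a single call, which is why it is the shorter option in a notebook.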
• As shown in the illustration, SageMaker Clarify uses distributed processing when it is available, which makes it computationally efficient on large datasets. So we need to create a Clarify processor to run the analysis.
``````
session = Session()
bucket = session.default_bucket()  # default SageMaker bucket for this account and region
prefix = "sagemaker/DEMO-sagemaker-clarify"  # key prefix under which the report is saved
role = get_execution_role()  # IAM role the processing job runs with
``````
``````
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.t3.medium", sagemaker_session=session
)
``````
• We need to define two configurations before we evaluate pre-training statistical bias.

1. bias_data_config, which sets the input location of the data, the output location for the report, the target variable, and the type of data
2. bias_config, which sets the facets we are checking for bias, their values, and thresholds
• Once the configurations are defined, we can evaluate pre-training statistical bias using all the metrics or a specific list of them. Here we are evaluating all the metrics.

``````
bias_report_output_path = "s3://{}/{}/clarify-bias".format(bucket, prefix)
bias_data_config = clarify.DataConfig(
    s3_data_input_path='s3://bucket_name/marketing_campaign.csv',  # replace bucket_name with your bucket
    s3_output_path=bias_report_output_path,
    label="Response",  # target variable
    dataset_type="text/csv",
)
``````
``````
bias_config = clarify.BiasConfig(
    # label value 1 is treated as the positive outcome;
    # Education is the facet checked for bias
    label_values_or_threshold=[1], facet_name="Education", group_name="Marital_Status"
)
``````
``````
clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    methods="all",  # evaluate every pre-training metric
)
``````

This generates and saves the report to the specified output path.
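Besides the human-readable report, Clarify also writes the metric values to an `analysis.json` file in the output path, as far as I know. The sketch below shows one way to flatten that file into rows. The sample dictionary is hand-written in the shape I assume the file has, the facet value and metric values in it are invented for illustration and are not real results, and `flatten_bias_metrics` is a hypothetical helper name.

```python
import json

# Hand-written sample in the assumed shape of analysis.json.
# The facet value and the numbers are invented, not real results.
sample_analysis = json.loads("""
{
  "version": "1.0",
  "pre_training_bias_metrics": {
    "label": "Response",
    "label_value_or_threshold": "1",
    "facets": {
      "Education": [
        {
          "value_or_threshold": "Basic",
          "metrics": [
            {"name": "CI", "description": "Class Imbalance (CI)", "value": 0.95},
            {"name": "DPL", "description": "Difference in Proportions of Labels (DPL)", "value": 0.11}
          ]
        }
      ]
    }
  }
}
""")


def flatten_bias_metrics(analysis, section="pre_training_bias_metrics"):
    """Return (facet, facet_value, metric_name, metric_value) tuples."""
    rows = []
    facets = analysis.get(section, {}).get("facets", {})
    for facet_name, groups in facets.items():
        for group in groups:
            for metric in group.get("metrics", []):
                rows.append(
                    (facet_name, group.get("value_or_threshold"),
                     metric["name"], metric.get("value"))
                )
    return rows


rows = flatten_bias_metrics(sample_analysis)
for facet, facet_value, name, value in rows:
    print(f"{facet}={facet_value}  {name}: {value}")
```

In the notebook you would first download `analysis.json` from `bias_report_output_path` and pass the parsed content to the helper instead of the sample.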

• Clarify evaluates not only pre-training bias but also several other things, such as feature importance, explainability, and model behavior.

When you finish your experiments, make sure you delete all compute instances as well as SageMaker Studio to save on cloud costs.

#### Quiz Time!

• Using the same procedure, check for statistical bias on the Marital_Status feature