- Statistical Bias Metrics
- Measuring Statistical Bias with SageMaker Data Wrangler
Well, you've had quite a day! You've reached home, ordered a burger, and want to watch Netflix. Netflix starts throwing recommendations at you because its recommendation engine weighs several factors and puts movies on the screen that you might like. What we observe here is called Selection Bias: you mostly watch and review what was recommended to you. Selection bias is one cause of statistical bias in any data generated from views or reviews of the recommended movies.
Metrics for Statistical Bias
There are several such causes of statistical bias (Activity Bias, Societal Bias, ...). To measure statistical bias we use certain metrics. I won't explain all of them in this learnbook; please read about them here
Evaluating Statistical Bias
Our objective for this learnbook is to use Amazon SageMaker Data Wrangler to measure statistical bias by evaluating the metrics mentioned above. We could also use Amazon SageMaker Clarify for this, so you might be wondering what the difference between the two is. Please look at the illustration below for an understanding.
We are going to use this data set for the analysis. Credits: thanks to Akash Patel for this data.
SageMaker Data Wrangler
- Create an Amazon S3 bucket and upload the data to it.
- Create a SageMaker Studio environment [For reference]. Creating SageMaker Studio might take some time.
- Once SageMaker Studio is active, open it and create a new data flow.
- Import the data that you uploaded to S3.
- Create a new analysis in Data Wrangler [For reference].
- Create a statistical bias report from the new analysis.
- Once you run it, the report is generated at the specified S3 location.
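If you prefer to script the first step instead of clicking through the console, here is a minimal sketch using boto3. The bucket name, file name, and key shown in `main()` are placeholders I made up for illustration, not values from this learnbook, and the helper assumes the bucket does not exist yet.

```python
def upload_dataset(s3_client, bucket, local_path, key, region="us-east-1"):
    """Create an S3 bucket and upload one local file to it.

    `s3_client` is expected to behave like boto3.client("s3").
    Returns the S3 URI of the uploaded object.
    """
    if region == "us-east-1":
        # us-east-1 does not accept an explicit LocationConstraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def main():
    # Imported here so the helper above can be tried out without AWS access.
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    # Placeholder names: replace them with your own bucket and file.
    uri = upload_dataset(s3, "my-bias-demo-bucket", "data.csv", "input/data.csv")
    print("Uploaded to", uri)
```

Calling `main()` needs valid AWS credentials and a globally unique bucket name. The returned URI is what you would point the Data Wrangler import step at.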
You have now finished generating the report, but what exactly does the report mean?
- The CI (class imbalance) score for YOLO is 1, which means the data is heavily imbalanced with respect to the YOLO class of the marital status attribute: that class has very few samples compared with the rest. With such a small sample, any inference about this class is unreliable; conclusions should be drawn from a larger sample.
- The Lp-norm difference for the "Single" class is 0.13. The closer this value is to zero, the lesser the bias.
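To get a feel for these two numbers, here is a small sketch that computes them by hand on a toy list. It assumes the usual definitions: CI = (n_a - n_d) / (n_a + n_d), where n_d counts the samples of the class of interest and n_a the rest, and the Lp-norm difference (with p = 2) as the distance between the label distributions of the two groups. The data and function names are made up for illustration; the numbers in the Data Wrangler report come from the real data set and may be computed with details this sketch leaves out.

```python
from collections import Counter


def class_imbalance(facet_values, facet_of_interest):
    """CI = (n_a - n_d) / (n_a + n_d); values near 1 mean the class of interest is rare."""
    n_d = sum(1 for v in facet_values if v == facet_of_interest)
    n_a = len(facet_values) - n_d
    return (n_a - n_d) / (n_a + n_d)


def lp_norm_difference(facet_values, labels, facet_of_interest, p=2):
    """Lp distance between the label distribution of one class and that of everyone else."""
    d_labels = [y for v, y in zip(facet_values, labels) if v == facet_of_interest]
    a_labels = [y for v, y in zip(facet_values, labels) if v != facet_of_interest]
    if not d_labels or not a_labels:
        raise ValueError("both groups need at least one sample")

    outcomes = set(labels)

    def distribution(ys):
        counts = Counter(ys)
        return {y: counts[y] / len(ys) for y in outcomes}

    p_a, p_d = distribution(a_labels), distribution(d_labels)
    return sum(abs(p_a[y] - p_d[y]) ** p for y in outcomes) ** (1 / p)


# Toy data (hypothetical): 6 Married, 3 Single, 1 YOLO, with a binary label.
marital = ["Married"] * 6 + ["Single"] * 3 + ["YOLO"]
label = [1, 1, 1, 0, 0, 0] + [1, 1, 0] + [1]

print("CI for YOLO:", class_imbalance(marital, "YOLO"))
print("Lp-norm difference for Single:", lp_norm_difference(marital, label, "Single"))
```

Even in this toy list, a class with a single sample pushes CI towards 1, which is why a high CI mostly tells you to collect more data for that class before reading anything else into it.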
Once you finish your experiments, make sure you delete all compute instances as well as SageMaker Studio to save cloud costs.