
Practice Set 23

Questions 220–229 (10 questions)

220

A company wants to segment a large group of customers into subgroups based on shared characteristics. The company’s data scientist is planning to use the Amazon SageMaker built-in k-means clustering algorithm for this task. The data scientist needs to determine the optimal number of subgroups (k) to use.

Which data visualization approach will MOST accurately determine the optimal value of k?

Most voted answer: D (16 votes)
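The option text isn't reproduced above, but the standard visualization for choosing k is the elbow method: plot the within-cluster sum of squares (WCSS, the k-means objective) against k and pick the value where the curve flattens. A minimal pure-Python sketch of the WCSS-versus-k computation, using a naive 1-D k-means written only for illustration (not SageMaker's implementation):

```python
import random

def kmeans_wcss(points, k, iters=25, seed=0):
    """Naive 1-D k-means; returns the within-cluster sum of squares (WCSS)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: (p - centers[c]) ** 2)].append(p)
        # Recompute each center; keep the old one if its cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def best_wcss(points, k, restarts=25):
    """Best WCSS over several random initializations."""
    return min(kmeans_wcss(points, k, seed=s) for s in range(restarts))

# Three well-separated groups: WCSS drops sharply until k=3, then flattens,
# producing the "elbow" when WCSS is plotted against k.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
wcss = {k: best_wcss(data, k) for k in range(1, 6)}
```

Plotting `wcss` against k for this data shows a large drop up to k=3 and almost no improvement afterwards, which is the elbow at the true number of groups.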

221

A data scientist at a financial services company used Amazon SageMaker to train and deploy a model that predicts loan defaults. The model analyzes new loan applications and predicts the risk of loan default. To train the model, the data scientist manually extracted loan data from a database. The data scientist performed the model training and deployment steps in a Jupyter notebook that is hosted on SageMaker Studio notebooks. The model's prediction accuracy is decreasing over time.

Which combination of steps is the MOST operationally efficient way for the data scientist to maintain the model's accuracy? (Choose two.)

Most voted answer: AB (12 votes)
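Without the option text, the accepted combination most likely pairs an automated retraining pipeline with data- or model-quality monitoring (for example, SageMaker Pipelines plus SageMaker Model Monitor) instead of manual notebook steps. Under the hood, drift monitoring amounts to comparing the distribution of incoming data against a training-time baseline. A hand-rolled illustration of one common drift statistic, the Population Stability Index (PSI); Model Monitor computes its own statistics, so this is conceptual only:

```python
import math

def psi(expected, actual, bins=5):
    """Population Stability Index between a baseline sample and new data.
    PSI below roughly 0.1 is commonly read as 'no significant drift'."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")  # catch out-of-range data

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]       # training-time feature sample
incoming = [x + 0.5 for x in baseline]         # shifted production data
drifted = psi(baseline, incoming) > 0.1        # trigger retraining if True
```

When the check fires, a pipeline can kick off retraining on fresh data automatically, which is the operational efficiency the question is after.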

222

A retail company wants to create a system that can predict sales based on the price of an item. A machine learning (ML) engineer built an initial linear model that resulted in the following residual plot: [residual plot not reproduced here]

Which actions should the ML engineer take to improve the accuracy of the predictions in the next phase of model building? (Choose three.)

Most voted answer: BEF (10 votes). Also voted: CDE (2 votes).
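A curved (for example, U-shaped) residual plot from a linear model typically signals a missed nonlinearity, which is why questions like this one usually resolve to feature transformations or polynomial terms. A sketch showing how the same ordinary-least-squares fit goes from large, patterned residuals to a near-perfect fit once a squared feature is supplied (pure Python; the data is synthetic and purely illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def sse(xs, ys, a, b):
    """Sum of squared residuals of the fitted line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [x / 10 for x in range(-20, 21)]
ys = [x * x for x in xs]               # truly quadratic relationship

a1, b1 = fit_line(xs, ys)              # linear model: residuals form a U shape
lin_err = sse(xs, ys, a1, b1)

zs = [x * x for x in xs]               # engineered squared feature
a2, b2 = fit_line(zs, ys)              # same linear learner, transformed input
quad_err = sse(zs, ys, a2, b2)
```

The residuals of the first fit are large and systematically patterned; after the feature transformation the same learner fits essentially perfectly, which is the improvement the residual plot is hinting at.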

223

A data scientist at a food production company wants to use an Amazon SageMaker built-in model to classify different vegetables. The current dataset has many features. The company wants to save on memory costs when the data scientist trains and deploys the model. The company also wants to be able to find similar data points for each test data point.

Which algorithm will meet these requirements?

Most voted answer: A (19 votes). Also voted: C (11 votes).
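Assuming the most-voted option is SageMaker's built-in k-nearest neighbors (k-NN) algorithm, it fits both requirements: it exposes built-in dimensionality-reduction hyperparameters (`dimension_reduction_type`, `dimension_reduction_target`) to cut memory use, and it naturally returns the nearest neighbors of each query point. The core mechanic looks like this toy classifier (illustrative only, not the SageMaker implementation):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest points, plus those points,
    so 'similar data points' come for free with every prediction."""
    by_dist = sorted(train, key=lambda fl: math.dist(fl[0], query))
    neighbors = by_dist[:k]
    label = Counter(lab for _, lab in neighbors).most_common(1)[0][0]
    return label, neighbors

train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b")]
label, nbrs = knn_predict(train, (0.2, 0.2), k=3)
```

In SageMaker the dimensionality reduction happens inside the algorithm before the index is built, so the stored vectors (and therefore memory cost) shrink while neighbor lookups stay available.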

224

A data scientist is training a large PyTorch model by using Amazon SageMaker. It takes 10 hours on average to train the model on GPU instances. The data scientist suspects that training is not converging and that resource utilization is not optimal.

What should the data scientist do to identify and address training issues with the LEAST development effort?

Most voted answer: C (11 votes)
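The least-effort route for this scenario is Amazon SageMaker Debugger, whose built-in rules (for example, `LossNotDecreasing` and `LowGPUUtilization`) flag exactly the convergence and utilization problems described, with no custom instrumentation. For intuition, here is roughly the kind of check such a rule automates; this is a hand-rolled sketch, not Debugger's actual implementation:

```python
def loss_not_decreasing(losses, window=5, min_decrease=0.01):
    """Flag training if the mean loss over the latest window has not dropped
    by at least min_decrease (relative) versus the previous window."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge yet
    prev = sum(losses[-2 * window:-window]) / window
    curr = sum(losses[-window:]) / window
    return curr > prev * (1 - min_decrease)
```

Debugger evaluates rules like this against tensors emitted during training and can stop the job or alert as soon as one fires, instead of letting a non-converging run burn 10 hours of GPU time.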

225

A bank wants to launch a low-rate credit promotion campaign. The bank must identify which customers to target with the promotion and wants to make sure that each customer's full credit history is considered when an approval or denial decision is made.

The bank's data science team used the XGBoost algorithm to train a classification model based on account transaction features. The data science team deployed the model by using the Amazon SageMaker model hosting service. The accuracy of the model is sufficient, but the data science team wants to be able to explain why the model denies the promotion to some customers.

What should the data science team do to meet this requirement in the MOST operationally efficient manner?

Most voted answer: C (13 votes). Also voted: B (4 votes).
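The operationally efficient answer here is typically SageMaker Clarify, which explains individual predictions with SHAP (Shapley) values. The idea can be shown exactly on a toy additive risk model: each feature's attribution is its average marginal contribution across all feature subsets, with "missing" features replaced by a baseline value. The feature names, weights, and baseline below are made up for illustration:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets.
    Features absent from a subset are replaced with their baseline value."""
    n = len(x)

    def f(subset):
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for s in combinations(others, r):
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                phi[i] += w * (f(set(s) | {i}) - f(set(s)))
    return phi

def risk(v):
    """Toy denial score over (utilization, late_payments, income_norm)."""
    return 0.5 * v[0] + 0.3 * v[1] - 0.2 * v[2]

phi = shapley_values(risk, x=[0.9, 1.0, 0.2], baseline=[0.5, 0.0, 0.5])
```

The attributions sum to the difference between this customer's score and the baseline score (the SHAP "efficiency" property), which is what lets the bank say which features drove a specific denial. Clarify approximates this computation at scale rather than enumerating subsets.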

226

A company has hired a data scientist to create a loan risk model. The dataset contains loan amounts and variables such as loan type, region, and other demographic variables. The data scientist wants to use Amazon SageMaker to test bias regarding the loan amount distribution with respect to some of these categorical variables.

Which pretraining bias metrics should the data scientist use to check the bias distribution? (Choose three.)

Most voted answer: DEF (9 votes). Also voted: ABC (6 votes), BCF (3 votes), BDF (1 vote).
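For a continuous outcome such as loan amount, SageMaker Clarify's pretraining bias metrics compare the outcome's distribution across facet groups with divergence measures, Kullback-Leibler (KL) and Jensen-Shannon (JS) divergence among them (alongside Lp-norm, total variation distance, and Kolmogorov-Smirnov). A pure-Python sketch of KL and JS over pre-binned distributions:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over discrete bins.
    Asymmetric; zero only when the distributions match."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded variant of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Binned loan-amount distributions for two facet groups (illustrative numbers).
group_a = [0.25, 0.25, 0.25, 0.25]
group_b = [0.70, 0.10, 0.10, 0.10]
divergence = js(group_a, group_b)   # larger value = more distributional bias
```

A near-zero divergence means loan amounts are distributed similarly across the facet groups; a large value flags a skew worth investigating before training.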

227

A retail company wants to use Amazon Forecast to predict daily stock levels of inventory. The cost of running out of items in stock is much higher for the company than the cost of having excess inventory. The company has millions of data samples for multiple years for thousands of items. The company’s purchasing department needs to predict demand for 30-day cycles for each item to ensure that restocking occurs.

A machine learning (ML) specialist wants to use item-related features such as "category," "brand," and "safety stock count." The ML specialist also wants to use a binary time series feature that has "promotion applied?" as its name. Future promotion information is available only for the next 5 days.

The ML specialist must choose an algorithm and an evaluation metric for a solution to produce prediction results that will maximize company profit.

Which solution will meet these requirements?

Most voted answer: C (14 votes). Also voted: A (2 votes).
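Because stockouts cost more than excess inventory, the metric of interest is Amazon Forecast's weighted quantile loss evaluated at a high quantile (for example, p90): the pinball loss underlying it charges under-forecasts at rate q and over-forecasts at rate 1 − q, so optimizing a high quantile biases forecasts upward where it matters. A minimal sketch of the per-point loss:

```python
def quantile_loss(actual, forecast, q):
    """Pinball loss for quantile q: under-forecasting (actual > forecast)
    costs q per unit; over-forecasting costs 1 - q per unit."""
    diff = actual - forecast
    return q * diff if diff >= 0 else (q - 1) * diff

# At q = 0.9, missing 10 units of demand hurts 9x more than overstocking 10.
under = quantile_loss(100, 90, 0.9)    # forecast too low: stockout risk
over = quantile_loss(100, 110, 0.9)    # forecast too high: excess inventory
```

Evaluating the model on this asymmetric loss, rather than a symmetric metric like RMSE, is what aligns the forecast with the company's profit structure.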

228

An online retail company wants to develop a natural language processing (NLP) model to improve customer service. A machine learning (ML) specialist is setting up distributed training of a Bidirectional Encoder Representations from Transformers (BERT) model on Amazon SageMaker. SageMaker will use eight compute instances for the distributed training.

The ML specialist wants to ensure the security of the data during the distributed training. The data is stored in an Amazon S3 bucket.

Which combination of steps should the ML specialist take to protect the data during the distributed training? (Choose three.)

Most voted answer: ACD (21 votes). Also voted: ACF (10 votes).

229

An analytics company has an Amazon SageMaker hosted endpoint for an image classification model. The model is a custom-built convolutional neural network (CNN) and uses the PyTorch deep learning framework. The company wants to increase throughput and decrease latency for customers that use the model.

Which solution will meet these requirements MOST cost-effectively?

Most voted answer: A (8 votes)