Practice Set 13

Questions 121–130 (10 questions)

121

A data scientist uses an Amazon SageMaker notebook instance to conduct data exploration and analysis. This requires certain Python packages that are not natively available on Amazon SageMaker to be installed on the notebook instance. How can a machine learning specialist ensure that required packages are automatically available on the notebook instance for the data scientist to use?

Community votes: D (14, most voted); B (1)

122

A data scientist needs to identify fraudulent user accounts for a company's ecommerce platform. The company wants the ability to determine if a newly created account is associated with a previously known fraudulent user. The data scientist is using AWS Glue to cleanse the company's application logs during ingestion. Which strategy will allow the data scientist to identify fraudulent accounts?

Community votes: B (3, most voted)

123

A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations. The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist needs to reduce the number of false negatives. Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)

Community votes: BD (17, most voted); BE (4)
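The class counts in question 123 explain why 99.1% accuracy says little on its own. The sketch below is only illustrative arithmetic: it assumes the validation set has the same class ratio as the training data (the question does not state this), and the function names are hypothetical. It also prints the negative-to-positive ratio, which the XGBoost documentation suggests as a typical starting value for its `scale_pos_weight` parameter.

```python
# Class counts taken from the question text (training data).
N_NEGATIVE = 100_000  # non-fraudulent observations
N_POSITIVE = 1_000    # fraudulent observations


def majority_baseline_accuracy(n_neg: int, n_pos: int) -> float:
    """Accuracy of a model that predicts the majority class for every row."""
    return max(n_neg, n_pos) / (n_neg + n_pos)


def baseline_false_negatives(n_neg: int, n_pos: int) -> int:
    """False negatives when every row is predicted as the negative class."""
    return n_pos


def imbalance_ratio(n_neg: int, n_pos: int) -> float:
    """Negatives per positive; a common starting point for a positive-class weight."""
    return n_neg / n_pos


baseline = majority_baseline_accuracy(N_NEGATIVE, N_POSITIVE)
missed = baseline_false_negatives(N_NEGATIVE, N_POSITIVE)
ratio = imbalance_ratio(N_NEGATIVE, N_POSITIVE)

print(f"always-negative accuracy: {baseline:.4f}")
print(f"fraud cases missed by that baseline: {missed}")
print(f"negatives per positive: {ratio:.0f}")
```

Under that assumption, a model that never flags fraud already scores about 99.0% accuracy while missing every fraudulent observation, so the reported 99.1% is barely above a trivial baseline.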

124

A data scientist has developed a machine learning translation model for English to Japanese by using Amazon SageMaker's built-in seq2seq algorithm with 500,000 aligned sentence pairs. While testing with sample sentences, the data scientist finds that the translation quality is reasonable for an example as short as five words. However, the quality becomes unacceptable if the sentence is 100 words long. Which action will resolve the problem?

Community votes: C (11, most voted)

125

A financial company is trying to detect credit card fraud. The company observed that, on average, 2% of credit card transactions were fraudulent. A data scientist trained a classifier on a year's worth of credit card transactions data. The model needs to identify the fraudulent transactions (positives) from the regular ones (negatives). The company's goal is to accurately capture as many positives as possible. Which metrics should the data scientist use to optimize the model? (Choose two.)

Community votes: DE (10, most voted); BD (2)
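To make the metric vocabulary in question 125 concrete, here is a minimal sketch that computes common classification metrics from confusion counts. The counts are invented for illustration: 10,000 transactions at the stated 2% fraud rate, with a hypothetical model that catches 150 of the 200 fraudulent transactions and raises 100 false alarms. None of these numbers come from the question itself, and the sketch does not indicate which answer options are correct.

```python
def precision(tp: int, fp: int) -> float:
    """Share of flagged transactions that are actually fraudulent."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of fraudulent transactions that the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of regular transactions that are wrongly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all transactions that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion counts: 10,000 transactions, 2% (200) fraudulent.
TP, FN = 150, 50       # fraud caught / fraud missed
FP, TN = 100, 9_700    # false alarms / regular transactions passed

print("precision:", precision(TP, FP))
print("recall:", recall(TP, FN))
print("false positive rate:", false_positive_rate(FP, TN))
print("accuracy:", accuracy(TP, TN, FP, FN))
```

With these made-up counts, accuracy stays high even though a quarter of the fraud is missed, which is why metrics centered on the positive class tend to be more informative when positives are rare.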

126

A machine learning specialist is developing a proof of concept for government users whose primary concern is security. The specialist is using Amazon SageMaker to train a convolutional neural network (CNN) model for a photo classifier application. The specialist wants to protect the data so that it cannot be accessed and transferred to a remote host by malicious code accidentally installed on the training container. Which action will provide the MOST secure protection?

Community votes: D (8, most voted)

127

A medical imaging company wants to train a computer vision model to detect areas of concern on patients' CT scans. The company has a large collection of unlabeled CT scans that are linked to each patient and stored in an Amazon S3 bucket. The scans must be accessible to authorized users only. A machine learning engineer needs to build a labeling pipeline. Which set of steps should the engineer take to build the labeling pipeline with the LEAST effort?

Community votes: C (5, most voted)

128

A company is using Amazon Textract to extract textual data from thousands of scanned text-heavy legal documents daily. The company uses this information to process loan applications automatically. Some of the documents fail business validation and are returned to human reviewers, who investigate the errors. This activity increases the time to process the loan applications. What should the company do to reduce the processing time of loan applications?

Community votes: C (3, most voted)

129

A company ingests machine learning (ML) data from web advertising clicks into an Amazon S3 data lake. Click data is added to an Amazon Kinesis data stream by using the Kinesis Producer Library (KPL). The data is loaded into the S3 data lake from the data stream by using an Amazon Kinesis Data Firehose delivery stream. As the data volume increases, an ML specialist notices that the rate of data ingested into Amazon S3 is relatively constant. There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest. Which next step is MOST likely to improve the data ingestion rate into Amazon S3?

Community votes: C (20, most voted); A (12)
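For background on question 129, a Kinesis data stream's write capacity scales with its shard count: each shard accepts roughly 1 MB per second or 1,000 records per second of writes. The sketch below is a rough capacity estimate under those per-shard limits. The workload figures and the function name are hypothetical, since the question gives no data volumes, and the sketch does not settle whether the stream is actually the bottleneck in this scenario.

```python
import math

# Approximate per-shard write limits for Kinesis Data Streams.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # about 1 MB/s per shard
SHARD_WRITE_RECORDS_PER_SEC = 1_000     # 1,000 records/s per shard


def shards_needed(bytes_per_sec: float, records_per_sec: float) -> int:
    """Smallest shard count that satisfies both per-shard write limits."""
    by_bytes = math.ceil(bytes_per_sec / SHARD_WRITE_BYTES_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_bytes, by_records)


# Hypothetical click workload: 6.5 MB/s across 4,000 records/s.
estimate = shards_needed(6_500_000, 4_000)
print("estimated shards:", estimate)
```

If the producers send more than the current shards can accept, throughput into the stream plateaus and a backlog grows, which matches the flat ingestion rate described in the question.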

130

A data scientist must build a custom recommendation model in Amazon SageMaker for an online retail company. Due to the nature of the company's products, customers buy only 4-5 products every 5-10 years. So, the company relies on a steady stream of new customers. When a new customer signs up, the company collects data on the customer's preferences. Below is a sample of the data available to the data scientist. How should the data scientist split the dataset into a training and test set for this use case?

Community votes: D (18, most voted); B (16)