Practice Set 28
Questions 271–280 (10 questions)
A machine learning (ML) specialist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes in the dataset, but it does not achieve an acceptable recall metric. The ML specialist varies the number and size of the MLP's hidden layers, but the results do not improve significantly. Which solution will improve recall in the LEAST amount of time? Most voted: A (11 votes); other votes: C (1 vote).
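A common low-effort lever for poor recall on an under-represented class is to reweight the training loss by inverse class frequency, so mistakes on the rare class cost more. A minimal sketch of computing such weights (pure Python; the class labels here are made up for illustration, and the weights would then be passed to whatever training framework the MLP uses):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency so the loss
    penalizes mistakes on rare classes more heavily."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # weight_c = n / (k * count_c): common classes get weight < 1,
    # rare classes get weight > 1
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical imbalanced label distribution: 90% "other", 10% "target".
labels = ["other"] * 90 + ["target"] * 10
weights = inverse_frequency_weights(labels)
# The rare "target" class receives a much larger weight than "other".
```

Because this only changes the loss weighting, it needs no architecture search or extra data, which is why it is usually faster to try than resizing hidden layers.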
A machine learning (ML) specialist uploads 5 TB of data to an Amazon SageMaker Studio environment. The ML specialist performs initial data cleansing. Before the ML specialist begins to train a model, the ML specialist needs to create and view an analysis report that details potential bias in the uploaded data. Which combination of actions will meet these requirements with the LEAST operational overhead? (Choose two.) Most voted: AD (5 votes).
A network security vendor needs to ingest telemetry data from thousands of endpoints that run all over the world. The data is transmitted every 30 seconds in the form of records that contain 50 fields. Each record is up to 1 KB in size. The security vendor uses Amazon Kinesis Data Streams to ingest the data. The vendor requires hourly summaries of the records that Kinesis Data Streams ingests. The vendor will use Amazon Athena to query the records and to generate the summaries. The Athena queries will target 7 to 12 of the available data fields. Which solution will meet these requirements with the LEAST amount of customization to transform and store the ingested data? Most voted: C (8 votes); other votes: D (2 votes).
A medical device company is building a machine learning (ML) model to predict the likelihood of device recall based on customer data that the company collects from a plain text survey. One of the survey questions asks which medications the customer is taking. The data for this field contains the names of medications that customers enter manually. Customers misspell some of the medication names. The column that contains the medication name data gives a categorical feature with high cardinality but redundancy. What is the MOST effective way to encode this categorical feature into a numeric feature? Most voted: C (3 votes); other votes: B (2 votes).
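The core difficulty in this scenario is that misspellings inflate the cardinality: "Metformn" and "metformin" should collapse into one category before any numeric encoding. A minimal sketch of that normalization step using the standard library's fuzzy matcher (the vocabulary and cutoff value are illustrative assumptions, not part of the question):

```python
import difflib

def normalize_category(value, vocabulary, cutoff=0.8):
    """Map a possibly misspelled free-text entry onto the closest
    known category; fall back to the raw value if nothing is close."""
    match = difflib.get_close_matches(value.lower(), vocabulary,
                                      n=1, cutoff=cutoff)
    return match[0] if match else value.lower()

# Hypothetical vocabulary of canonical medication names.
vocab = ["metformin", "lisinopril", "atorvastatin"]
canonical = normalize_category("Metformn", vocab)      # typo collapses
canonical2 = normalize_category("lisinoprill", vocab)  # extra letter collapses
unknown = normalize_category("aspirin", vocab)         # no close match: kept as-is
```

Once redundant spellings are collapsed, the now lower-cardinality column can be encoded with an ordinary categorical encoding.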
A machine learning (ML) engineer has created a feature repository in Amazon SageMaker Feature Store for the company. The company has AWS accounts for development, integration, and production. The company hosts a feature store in the development account. The company uses Amazon S3 buckets to store feature values offline. The company wants to share features and to allow the integration account and the production account to reuse the features that are in the feature repository. Which combination of steps will meet these requirements? (Choose two.) Most voted: AB (7 votes); other votes: AC (5 votes).
A company is building a new supervised classification model in an AWS environment. The company's data science team notices that the dataset has a large quantity of variables. All the variables are numeric. The model accuracy for training and validation is low. The model's processing time is affected by high latency. The data science team needs to increase the accuracy of the model and decrease the processing time. What should the data science team do to meet these requirements? Most voted: B (3 votes).
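With many numeric variables, dimensionality reduction such as principal component analysis (PCA) is the usual way to attack both symptoms at once: fewer features means less noise for the model and less computation per record. A minimal PCA sketch via SVD (the data here is synthetic; in practice the reduced matrix would feed the classifier):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components, keeping most of
    the variance while shrinking the feature dimension."""
    Xc = X - X.mean(axis=0)                        # center each feature
    # SVD of the centered data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # 200 rows, 50 numeric features
X_small = pca_reduce(X, 10)        # shape (200, 10): fewer features
```

Training on `X_small` instead of `X` reduces per-sample compute, which addresses the latency concern, and discarding low-variance directions often removes noise that hurts accuracy.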
An exercise analytics company wants to predict running speeds for its customers by using a dataset that contains multiple health-related features for each customer. Some of the features originate from sensors that provide extremely noisy values. The company is training a regression model by using the built-in Amazon SageMaker linear learner algorithm to predict the running speeds. While the company is training the model, a data scientist observes that the training loss decreases to almost zero, but validation loss increases. Which technique should the data scientist use to optimally fit the model? Most voted: tie between A (9 votes, marked most voted) and D (9 votes).
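Training loss near zero with rising validation loss is the classic overfitting signature, and for a linear model with noisy features the textbook remedy is regularization (for example L2 weight decay, which the linear learner exposes as a hyperparameter). A minimal closed-form ridge-regression sketch showing the shrinkage effect (synthetic data; the penalty strength `lam` is an illustrative choice):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y.
    Larger lam shrinks the weights, trading a little bias for much
    lower variance on noisy features."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only two features truly matter; the rest is noise.
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=2.0, size=100)

w_unreg = ridge_fit(X, y, 0.0)    # ordinary least squares
w_reg = ridge_fit(X, y, 100.0)    # heavily regularized: smaller weights
```

The regularized weight vector has a strictly smaller norm, which is exactly the mechanism that keeps noisy sensor features from being memorized.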
A company's machine learning (ML) specialist is building a computer vision model to classify 10 different traffic signs. The company has stored 100 images of each class in Amazon S3, and the company has another 10,000 unlabeled images. All the images come from dash cameras and are 224 × 224 pixels. After several training runs, the model is overfitting on the training data. Which actions should the ML specialist take to address this problem? (Choose two.) Most voted: AC (8 votes).
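With only 100 labeled images per class, one standard countermeasure to overfitting is data augmentation: generating transformed copies of existing images to enlarge the effective training set. A minimal pure-Python sketch using a horizontal flip (images represented as nested lists of pixel values; note that flips only suit symmetric signs, so asymmetric or text-bearing signs would call for other transforms such as brightness shifts or small rotations):

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left-to-right.
    A mirrored dash-cam frame is a cheap extra training example."""
    return [row[::-1] for row in image]

def augment(dataset):
    """Double the dataset by adding a flipped copy of every image."""
    return dataset + [horizontal_flip(img) for img in dataset]

tiny = [[[1, 2, 3], [4, 5, 6]]]   # one 2x3 "image" for illustration
augmented = augment(tiny)          # now holds the original and its mirror
```

In a real pipeline the same idea is applied on the fly during training, so the model never sees the exact same pixels twice and generalizes better to the validation set.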
A data science team is working with a tabular dataset that the team stores in Amazon S3. The team wants to experiment with different feature transformations such as categorical feature encoding. Then the team wants to visualize the resulting distribution of the dataset. After the team finds an appropriate set of feature transformations, the team wants to automate the workflow for feature transformations. Which solution will meet these requirements with the MOST operational efficiency? Most voted: A (7 votes).
A company plans to build a custom natural language processing (NLP) model to classify and prioritize user feedback. The company hosts the data and all machine learning (ML) infrastructure in the AWS Cloud. The ML team works from the company's office, which has an IPsec VPN connection to one VPC in the AWS Cloud. The company has set both the enableDnsHostnames attribute and the enableDnsSupport attribute of the VPC to true. The company's DNS resolvers point to the VPC DNS. The company does not allow the ML team to access Amazon SageMaker notebooks through connections that use the public internet. The connection must stay within a private network and within the AWS internal network. Which solution will meet these requirements with the LEAST development effort? Most voted: A (5 votes).