
Practice Set 19

Questions 180–189 (10 questions)

180

A real-estate company is launching a new product that predicts the prices of new houses. The historical data for the properties and prices is stored in .csv format in an Amazon S3 bucket. The data has a header, some categorical fields, and some missing values. The company's data scientists have used Python with a common open-source library to fill the missing values with zeros. The data scientists have dropped all of the categorical fields and have trained a model by using the open-source linear regression algorithm with the default parameters. The accuracy of the predictions with the current model is below 50%. The company wants to improve the model performance and launch the new product as soon as possible. Which solution will meet these requirements with the LEAST operational overhead?

Most voted answer: D (18 votes).
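The baseline pipeline described in the question can be sketched with pandas and scikit-learn. This is a hypothetical reconstruction: the column names ("sqft", "neighborhood", "price") and the toy data are assumptions, not part of the original dataset.

```python
# Hypothetical sketch of the baseline described above: fill missing values
# with zeros, drop the categorical field, fit LinearRegression with defaults.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the historical .csv data (column names are assumptions).
df = pd.DataFrame({
    "sqft": [1500, 2000, None, 1200],
    "neighborhood": ["a", "b", "a", "c"],   # categorical field
    "price": [300000, 450000, 350000, 250000],
})

df = df.fillna(0)                            # missing values replaced with 0
df = df.drop(columns=["neighborhood"])       # categorical fields dropped
X, y = df.drop(columns=["price"]), df["price"]

model = LinearRegression().fit(X, y)         # default parameters
prediction = model.predict(pd.DataFrame({"sqft": [1800]}))
```

Zero-filling numeric features and discarding categoricals throws away signal, which is consistent with the weak accuracy the question reports; imputation and categorical encoding are the usual first improvements.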

181

A data scientist is reviewing customer comments about a company's products. The data scientist needs to present an initial exploratory analysis by using charts and a word cloud. The data scientist must use feature engineering techniques to prepare this analysis before starting a natural language processing (NLP) model. Which combination of feature engineering techniques should the data scientist use to meet these requirements? (Choose two.)

Most voted answer: CD (30 votes); also voted: DE (2 votes), C (1 vote).
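The kind of preprocessing that typically precedes a word cloud can be sketched as below. This is an illustrative example, not the answer options from the question; the stop-word list is a small assumed subset.

```python
# Minimal sketch of common text feature-engineering steps before a word
# cloud: lowercasing, punctuation removal, stop-word removal, and frequency
# counting. The stop-word set here is an illustrative subset.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "it", "this"}

def word_frequencies(comments):
    counts = Counter()
    for text in comments:
        text = re.sub(r"[^a-z\s]", " ", text.lower())   # normalize case, strip punctuation
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        counts.update(tokens)
    return counts

freqs = word_frequencies(["The product is great!", "Great value, and it works."])
```

The resulting frequency map is exactly what word-cloud libraries consume to size each term.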

182

A data scientist is evaluating a GluonTS DeepAR model on Amazon SageMaker. The evaluation metrics on the test set indicate that the coverage score is 0.489 at the 0.5 quantile and 0.889 at the 0.9 quantile. What can the data scientist reasonably conclude about the distributional forecast related to the test set?

Most voted answer: D (12 votes); also voted: C (5 votes).
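Coverage at quantile q is the fraction of observed values that fall at or below the forecast's q-quantile, so scores near q (as in the question, 0.489 at 0.5 and 0.889 at 0.9) indicate a well-calibrated distributional forecast. A minimal sketch of the computation, using assumed toy data:

```python
# Coverage at quantile q: fraction of actuals at or below the q-quantile
# forecast. A well-calibrated forecast yields coverage close to q itself.
def coverage(actuals, quantile_forecasts):
    hits = sum(a <= f for a, f in zip(actuals, quantile_forecasts))
    return hits / len(actuals)

# Toy data (assumed): constant 0.5-quantile forecast of 11 for each step.
actual = [10, 12, 9, 15, 11, 13, 8, 14]
p50 = [11] * len(actual)
cov = coverage(actual, p50)   # half the actuals fall at or below 11
```

Here coverage at the 0.5 quantile comes out to 0.5, the calibrated ideal.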

183

An energy company has wind turbines, weather stations, and solar panels that generate telemetry data. The company wants to perform predictive maintenance on these devices. The devices are in various locations and have unstable internet connectivity. A team of data scientists is using the telemetry data to perform machine learning (ML) to conduct anomaly detection and predict maintenance before the devices start to deteriorate. The team needs a scalable, secure, high-velocity data ingestion mechanism. The team has decided to use Amazon S3 as the data storage location. Which approach meets these requirements?

Most voted answer: C (17 votes).

184

A retail company collects customer comments about its products from social media, the company website, and customer call logs. A team of data scientists and engineers wants to find common topics and determine which products the customers are referring to in their comments. The team is using natural language processing (NLP) to build a model to help with this classification. Each product can be classified into multiple categories that the company defines. These categories are related but are not mutually exclusive. For example, if there is mention of "Sample Yogurt" in the document of customer comments, then "Sample Yogurt" should be classified as "yogurt," "snack," and "dairy product." The team is using Amazon Comprehend to train the model and must complete the project as soon as possible. Which functionality of Amazon Comprehend should the team use to meet these requirements?

Most voted answer: B (23 votes); also voted: C (1 vote).

185

A data engineer is using AWS Glue to create optimized, secure datasets in Amazon S3. The data science team wants the ability to access the ETL scripts directly from Amazon SageMaker notebooks within a VPC. After this setup is complete, the data science team wants the ability to run the AWS Glue job and invoke the SageMaker training job. Which combination of steps should the data engineer take to meet these requirements? (Choose three.)

Most voted answer: BCF (21 votes); also voted: BDF (6 votes), ADF (3 votes).

186

A data engineer needs to provide a team of data scientists with the appropriate dataset to run machine learning training jobs. The data will be stored in Amazon S3. The data engineer is obtaining the data from an Amazon Redshift database and is using join queries to extract a single tabular dataset. A portion of the schema is as follows:

TransactionTimestamp (Timestamp)
CardName (Varchar)
CardNo (Varchar)

The data engineer must provide the data so that any row with a CardNo value of NULL is removed. Also, the TransactionTimestamp column must be separated into a TransactionDate column and a TransactionTime column. Finally, the CardName column must be renamed to NameOnCard. The data will be extracted on a monthly basis and will be loaded into an S3 bucket. The solution must minimize the effort that is needed to set up infrastructure for the ingestion and transformation. The solution also must be automated and must minimize the load on the Amazon Redshift cluster. Which solution meets these requirements?

Most voted answer: C (22 votes).
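The three required transformations can be sketched in pandas; this is only an illustration of the logic (the toy rows are assumptions), whichever AWS service ultimately runs it.

```python
# Sketch of the three transformations the question requires: drop rows with
# a NULL CardNo, split TransactionTimestamp into date and time columns, and
# rename CardName to NameOnCard. Toy rows are assumed for illustration.
import pandas as pd

df = pd.DataFrame({
    "TransactionTimestamp": pd.to_datetime(
        ["2023-01-05 14:30:00", "2023-01-06 09:15:00"]),
    "CardName": ["A SAMPLE", "B SAMPLE"],
    "CardNo": ["1111", None],
})

df = df[df["CardNo"].notna()]                          # remove NULL CardNo rows
df["TransactionDate"] = df["TransactionTimestamp"].dt.date
df["TransactionTime"] = df["TransactionTimestamp"].dt.time
df = df.drop(columns=["TransactionTimestamp"])
df = df.rename(columns={"CardName": "NameOnCard"})     # rename column
```

The same column-level operations map directly onto visual transforms in low-setup services, which is why minimizing infrastructure effort is the deciding constraint here.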

187

A machine learning (ML) specialist wants to bring a custom training algorithm to Amazon SageMaker. The ML specialist implements the algorithm in a Docker container that is supported by SageMaker. How should the ML specialist package the Docker container so that SageMaker can launch the training correctly?

Most voted answer: B (12 votes); also voted: D (2 votes).

188

An ecommerce company wants to use machine learning (ML) to monitor fraudulent transactions on its website. The company is using Amazon SageMaker to research, train, deploy, and monitor the ML models. The historical transactions data is in a .csv file that is stored in Amazon S3. The data contains features such as the user's IP address, navigation time, average time on each page, and the number of clicks for each session. There is no label in the data to indicate if a transaction is anomalous. Which models should the company use in combination to detect anomalous transactions? (Choose two.)

Most voted answer: AD (21 votes); also voted: AC (3 votes), CD (1 vote).
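An unsupervised anomaly-detection pipeline of the kind the question describes can be sketched as follows. SageMaker's Random Cut Forest algorithm has no scikit-learn equivalent, so IsolationForest stands in here as an open-source analogue; the session features and injected outliers are assumptions for illustration.

```python
# Illustrative unsupervised anomaly detection on unlabeled session data:
# PCA reduces the feature space, then a tree-based detector flags outliers.
# IsolationForest is used as an open-source stand-in for Random Cut Forest.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
sessions = rng.normal(size=(200, 4))   # e.g. nav time, avg page time, clicks...
sessions[:3] += 8                      # three injected anomalous sessions

reduced = PCA(n_components=2).fit_transform(sessions)
labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(reduced)
# fit_predict returns -1 for anomalies and 1 for inliers
```

Because the data has no fraud labels, only unsupervised techniques like these apply; supervised classifiers are ruled out from the start.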

189

A healthcare company is using an Amazon SageMaker notebook instance to develop machine learning (ML) models. The company's data scientists will need to be able to access datasets stored in Amazon S3 to train the models. Due to regulatory requirements, access to the data from instances and services used for training must not be transmitted over the internet. Which combination of steps should an ML specialist take to provide this access? (Choose two.)

Most voted answer: AC (14 votes); also voted: CD (5 votes).