Practice Set 32

Questions 311–320 (10 questions)

310

A data engineer wants to perform exploratory data analysis (EDA) on a petabyte of data. The data engineer does not want to manage compute resources and wants to pay only for queries that are run. The data engineer must write the analysis by using Python from a Jupyter notebook. Which solution will meet these requirements?

Voted answers: A (15 votes, most voted); B (10 votes)

311

A data scientist receives a new dataset in .csv format and stores the dataset in Amazon S3. The data scientist will use the dataset to train a machine learning (ML) model. The data scientist first needs to identify any potential data quality issues in the dataset. The data scientist must identify values that are missing or values that are not valid. The data scientist must also identify the number of outliers in the dataset. Which solution will meet these requirements with the LEAST operational effort?

Voted answers: D (5 votes, most voted)
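To make the three checks in this question concrete (missing values, values that are not valid, and a count of outliers), here is a minimal sketch in plain Python. It is not the answer the question is looking for, which concerns a managed option with the least operational effort; it only illustrates what such a data quality report computes. The column name `price`, the sample data, the function name `profile_column`, and the 1.5 x IQR outlier rule are all assumptions made for illustration.

```python
import csv
import io
import math
import statistics

# Hypothetical sample data; the "price" column is an assumption for illustration.
SAMPLE_CSV = """id,price
1,10
2,12
3,
4,abc
5,11
6,13
7,12
8,11
9,500
10,10
"""

def profile_column(values):
    """Count missing, invalid (non-numeric or non-finite), and outlier entries."""
    missing = 0
    invalid = 0
    numbers = []
    for raw in values:
        text = (raw or "").strip()
        if text == "":
            missing += 1
            continue
        try:
            number = float(text)
        except ValueError:
            invalid += 1
            continue
        if math.isfinite(number):
            numbers.append(number)
        else:
            invalid += 1  # treat "nan" and "inf" as not valid

    outliers = 0
    if len(numbers) >= 4:
        # Assumed outlier rule: outside 1.5 * IQR from the quartiles.
        q1, _, q3 = statistics.quantiles(numbers, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = sum(1 for x in numbers if x < low or x > high)
    return {"missing": missing, "invalid": invalid, "outliers": outliers}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_column(row["price"] for row in rows)
print(report)
```

Other outlier definitions (for example, z-scores) would give different counts; the rule used here is only one common convention.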

312

An ecommerce company has developed an XGBoost model in Amazon SageMaker to predict whether a customer will return a purchased item. The dataset is imbalanced. Only 5% of customers return items. A data scientist must find the hyperparameters to capture as many instances of returned items as possible. The company has a small budget for compute. How should the data scientist meet these requirements MOST cost-effectively?

Voted answers: B (5 votes, most voted); D (1 vote); C (1 vote)
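"Capture as many instances of returned items as possible" describes recall on the positive class. The sketch below, with made-up predictions for 100 customers of whom 5 return an item, shows why accuracy can look good on such an imbalanced dataset while recall is poor. The function names and the two hypothetical models are assumptions for illustration and do not indicate which answer choice is correct.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives (returned items) that the model catches."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos

# 100 customers, 5% of whom return an item (label 1).
y_true = [1] * 5 + [0] * 95

# Hypothetical model A: always predicts "no return".
pred_a = [0] * 100

# Hypothetical model B: catches 4 of the 5 returns, with 10 false alarms.
pred_b = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

print("A: accuracy", accuracy(y_true, pred_a), "recall", recall(y_true, pred_a))
print("B: accuracy", accuracy(y_true, pred_b), "recall", recall(y_true, pred_b))
```

Model A has the higher accuracy but finds no returned items at all, which is why the objective metric matters when tuning on imbalanced data.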

313

A data scientist is trying to improve the accuracy of a neural network classification model. The data scientist wants to run a large hyperparameter tuning job in Amazon SageMaker. However, previous smaller tuning jobs on the same model often ran for several weeks. The data scientist wants to reduce the computation time required to run the tuning job. Which actions will MOST reduce the computation time for the hyperparameter tuning job? (Choose two.)

Voted answers: AC (8 votes, most voted); AE (4 votes)

314

A machine learning (ML) specialist needs to solve a binary classification problem for a marketing dataset. The ML specialist must maximize the Area Under the ROC Curve (AUC) of the algorithm by training an XGBoost algorithm. The ML specialist must find values for the eta, alpha, min_child_weight, and max_depth hyperparameters that will generate the most accurate model. Which approach will meet these requirements with the LEAST operational overhead?

Voted answers: C (5 votes, most voted)
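For readers who want a reminder of what the AUC objective measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the sample labels and scores are assumptions for illustration; this is not a tuning procedure and does not indicate the answer.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of examples, so it suits small illustrations only; library implementations use a sort-based computation.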

315

A machine learning (ML) developer for an online retailer recently uploaded a sales dataset into Amazon SageMaker Studio. The ML developer wants to obtain importance scores for each feature of the dataset. The ML developer will use the importance scores to feature engineer the dataset. Which solution will meet this requirement with the LEAST development effort?

Voted answers: A (3 votes, most voted)

316

A company is setting up a mechanism for data scientists and engineers from different departments to access an Amazon SageMaker Studio domain. Each department has a unique SageMaker Studio domain. The company wants to build a central proxy application that data scientists and engineers can log in to by using their corporate credentials. The proxy application will authenticate users by using the company's existing identity provider (IdP). The application will then route users to the appropriate SageMaker Studio domain. The company plans to maintain a table in Amazon DynamoDB that contains SageMaker domains for each department. How should the company meet these requirements?

Voted answers: A (4 votes, most voted)

317

An insurance company is creating an application to automate car insurance claims. A machine learning (ML) specialist used an Amazon SageMaker Object Detection - TensorFlow built-in algorithm to train a model to detect scratches and dents in images of cars. After the model was trained, the ML specialist noticed that the model performed better on the training dataset than on the testing dataset. Which approach should the ML specialist use to improve the performance of the model on the testing data?

Voted answers: D (3 votes, most voted)

318

A developer at a retail company is creating a daily demand forecasting model. The company stores the historical hourly demand data in an Amazon S3 bucket. However, the historical data does not include demand data for some hours. The developer wants to verify that an autoregressive integrated moving average (ARIMA) approach will be a suitable model for the use case. How should the developer verify the suitability of an ARIMA approach?

Voted answers: C (4 votes, most voted); A (3 votes)
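The scenario mentions that demand data is missing for some hours. As a small illustration of what that means in practice, the sketch below lists the hourly timestamps absent from a series. It does not test whether ARIMA is suitable and does not indicate the answer; the function name `missing_hours` and the sample timestamps are assumptions, and the code assumes every timestamp is aligned to the start of an hour.

```python
from datetime import datetime, timedelta

def missing_hours(timestamps):
    """Return the hourly timestamps between the first and last reading that are absent."""
    present = set(timestamps)
    start, end = min(present), max(present)
    gaps = []
    current = start
    while current <= end:
        if current not in present:
            gaps.append(current)
        current += timedelta(hours=1)
    return gaps

# Hypothetical hourly readings with hours 02:00 and 05:00 missing.
observed = [datetime(2024, 1, 1, h) for h in (0, 1, 3, 4, 6)]
gaps = missing_hours(observed)
for gap in gaps:
    print(gap.isoformat())
```

Knowing where the gaps fall is usually a first step before deciding how to fill or aggregate them for a time series model.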

319

A company decides to use Amazon SageMaker to develop machine learning (ML) models. The company will host SageMaker notebook instances in a VPC. The company stores training data in an Amazon S3 bucket. Company security policy states that SageMaker notebook instances must not have internet connectivity. Which solution will meet the company’s security requirements?

Voted answers: B (4 votes, most voted)