Tutorial - Train a Machine Learning Model - Amazon Web Services


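With Amazon SageMaker, you can deploy a model visually using the console, or programmatically using either SageMaker Studio or SageMaker notebooks. In this tutorial, you deploy the model programmatically using a SageMaker Studio notebook, which requires a SageMaker Studio domain.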

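If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1 and proceed directly to Step 2.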

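If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.

1.1 Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. Stack name should be CFN-SM-IM-Lambda-Catalog and should not be changed. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack. This stack takes about 10 minutes to create all the resources.

This stack assumes that you already have a default public VPC set up in your account. If you don't have a public VPC, see VPC with a single public subnet to learn how to create one.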

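2.8 To install specific versions of the open source XGBoost and Pandas libraries, copy and paste the following code snippet into a cell in the notebook, and press Shift+Enter to run the current cell. Ignore any warnings to restart the kernel and any dependency conflict errors.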

%pip install -q  xgboost==1.3.1 pandas==1.0.5

import pandas as pd
import boto3
import sagemaker
import joblib
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# Setting SageMaker variables
sess = sagemaker.Session()
write_bucket = sess.default_bucket()
write_prefix = "fraud-detect-demo"
region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sagemaker_role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")
read_bucket = "sagemaker-sample-files"
read_prefix = "datasets/tabular/synthetic_automobile_claims"

# Setting S3 location for read and write operations
train_data_key = f"{read_prefix}/train.csv"
test_data_key = f"{read_prefix}/test.csv"
validation_data_key = f"{read_prefix}/validation.csv"
model_key = f"{write_prefix}/model"
output_key = f"{write_prefix}/output"
train_data_uri = f"s3://{read_bucket}/{train_data_key}"
test_data_uri = f"s3://{read_bucket}/{test_data_key}"
validation_data_uri = f"s3://{read_bucket}/{validation_data_key}"
model_uri = f"s3://{write_bucket}/{model_key}"
output_uri = f"s3://{write_bucket}/{output_key}"
estimator_output_uri = f"s3://{write_bucket}/{write_prefix}/training_jobs"
bias_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/bias"
explainability_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/explainability"

# Job name prefixes referenced in later steps. Their definitions are missing
# from this copy of the tutorial, so the two values below are assumptions.
tuning_job_name_prefix = "xgbtune"
training_job_name_prefix = "xgbtrain"

# Model name, endpoint name prefix, and instance settings
xgb_model_name = "fraud-detect-xgb-model"
endpoint_name_prefix = "xgb-fraud-model-dev"
train_instance_count = 1
train_instance_type = "ml.m4.xlarge"
predictor_instance_count = 1
predictor_instance_type = "ml.m4.xlarge"
clarify_instance_count = 1
clarify_instance_type = "ml.m4.xlarge"

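3.1 The first level of script mode is the ability to define your own training process in a self-contained, customized Python script and to use that script as the entry point when defining your SageMaker estimator. Copy and paste the following code block to write a Python script that encapsulates the model training logic.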

%%writefile xgboost_train.py
import argparse
import os
import joblib
import json
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Hyperparameters and algorithm parameters are described here
    parser.add_argument("--num_round", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--subsample", type=float, default=0.9)
    parser.add_argument("--colsample_bytree", type=float, default=0.8)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--eval_metric", type=str, default="auc")
    parser.add_argument("--nfold", type=int, default=3)
    parser.add_argument("--early_stopping_rounds", type=int, default=3)
    # SageMaker specific arguments. Defaults are set in the environment variables
    # Location of input training data
    parser.add_argument("--train_data_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    # Location of input validation data
    parser.add_argument("--validation_data_dir", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    # Location where trained model will be stored. Default set by SageMaker, /opt/ml/model
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    # Location where model artifacts will be stored. Default set by SageMaker, /opt/ml/output/data
    parser.add_argument("--output_data_dir", type=str, default=os.environ.get("SM_OUTPUT_DATA_DIR"))
    args = parser.parse_args()
    data_train = pd.read_csv(f"{args.train_data_dir}/train.csv")
    train = data_train.drop("fraud", axis=1)
    label_train = pd.DataFrame(data_train["fraud"])
    dtrain = xgb.DMatrix(train, label=label_train)
    data_validation = pd.read_csv(f"{args.validation_data_dir}/validation.csv")
    validation = data_validation.drop("fraud", axis=1)
    label_validation = pd.DataFrame(data_validation["fraud"])
    dvalidation = xgb.DMatrix(validation, label=label_validation)
    params = {"max_depth": args.max_depth,
              "eta": args.eta,
              "objective": args.objective,
              "subsample": args.subsample,
              "colsample_bytree": args.colsample_bytree}
    num_boost_round = args.num_round
    nfold = args.nfold
    early_stopping_rounds = args.early_stopping_rounds
    cv_results = xgb.cv(
        params=params,
        dtrain=dtrain,
        num_boost_round=num_boost_round,
        nfold=nfold,
        early_stopping_rounds=early_stopping_rounds,
        metrics=["auc"],
        seed=42,
    )
    model = xgb.train(params=params, dtrain=dtrain, num_boost_round=len(cv_results))
    train_pred = model.predict(dtrain)
    validation_pred = model.predict(dvalidation)
    train_auc = roc_auc_score(label_train, train_pred)
    validation_auc = roc_auc_score(label_validation, validation_pred)
    print(f"[0]#011train-auc:{train_auc:.2f}")
    print(f"[0]#011validation-auc:{validation_auc:.2f}")
    metrics_data = {"hyperparameters": params,
                    "binary_classification_metrics": {"validation:auc": {"value": validation_auc},
                                                      "train:auc": {"value": train_auc}}}
    # Save the evaluation metrics to the location specified by output_data_dir
    metrics_location = args.output_data_dir + "/metrics.json"
    # Save the model to the location specified by model_dir
    model_location = args.model_dir + "/xgboost-model"
    with open(metrics_location, "w") as f:
        json.dump(metrics_data, f)
    with open(model_location, "wb") as f:
        joblib.dump(model, f)
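Notice how the script imports the open source XGBoost library that you installed earlier.
SageMaker runs the entry point script and supplies all input parameters, such as model configuration details and input and output paths, as command line arguments. The script uses the argparse Python library to capture the supplied arguments.
Your training script runs inside a Docker container, and SageMaker automatically downloads the training and validation datasets from Amazon S3 to local paths inside the container. These locations can be accessed through environment variables. For an exhaustive list of SageMaker environment variables, see Environment variables.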
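The following is a minimal sketch of the argument handling described above, runnable outside SageMaker. It simulates what the container would provide: the environment variable is set by hand, and the hyperparameters are passed as a list instead of a real command line. The function name parse_training_args and the path value are assumptions made for illustration only.

```python
import argparse
import os


def parse_training_args(argv):
    """Parse hyperparameters and data paths the way an entry point script would."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--eta", type=float, default=0.2)
    # Falls back to the channel path exported in the environment, if any
    parser.add_argument("--train_data_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN"))
    return parser.parse_args(argv)


# Simulate the environment variable (the path value is an assumption)
os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"

# Simulate hyperparameters arriving as command line arguments
args = parse_training_args(["--max_depth", "5", "--eta", "0.1"])
print(args.max_depth, args.eta, args.train_data_dir)
```

Values passed on the command line override the defaults, while the data directory is picked up from the environment because no explicit argument is given for it.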
 
  
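3.2 Once your training script is ready, you can instantiate a SageMaker estimator. You use the AWS-managed XGBoost estimator because it manages the XGBoost container that can run your custom script. To instantiate the XGBoost estimator, copy and paste the following code.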
 
  
# SageMaker estimator
# Set static hyperparameters that will not be tuned
static_hyperparams = {
    "eval_metric": "auc",
    "objective": "binary:logistic",
    "num_round": "5",
}
xgb_estimator = XGBoost(
    entry_point="xgboost_train.py",
    output_path=estimator_output_uri,
    code_location=estimator_output_uri,
    hyperparameters=static_hyperparams,
    role=sagemaker_role,
    instance_count=train_instance_count,
    instance_type=train_instance_type,
    framework_version="1.3-1",
    base_job_name=training_job_name_prefix,
)
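3.4 In this tutorial, you tune four XGBoost hyperparameters:
 eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative.
 subsample: Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data before growing trees. Using a different subset in each boosting iteration helps prevent overfitting.
 colsample_bytree: Fraction of features used to build each tree of the boosting process. Using a subset of features for each tree introduces more randomness into the modeling process, which improves generalization.
 max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
Copy and paste the following code block to set up the ranges of the above hyperparameters to search over.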
  
# Setting ranges of hyperparameters to be tuned
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "subsample": ContinuousParameter(0.7, 0.95),
    "colsample_bytree": ContinuousParameter(0.7, 0.95),
    "max_depth": IntegerParameter(1, 5),
}
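To make the effect of eta more concrete, here is a minimal sketch. It is not XGBoost: it is a toy boosting loop whose weak learner is simply the mean of the current residuals, an assumption chosen for brevity. Each round adds eta times the weak learner's output, so a smaller eta moves the prediction toward the target more cautiously. The function name boost_constant is hypothetical.

```python
def boost_constant(targets, eta, num_round):
    """Toy boosting: each round fits the mean residual and adds eta times it."""
    prediction = 0.0
    history = []
    for _ in range(num_round):
        residuals = [t - prediction for t in targets]
        weak_learner_output = sum(residuals) / len(residuals)
        # Shrink the update by eta before adding it to the ensemble
        prediction += eta * weak_learner_output
        history.append(prediction)
    return history


targets = [1.0, 2.0, 3.0]  # made-up labels with mean 2.0
fast = boost_constant(targets, eta=1.0, num_round=3)
slow = boost_constant(targets, eta=0.2, num_round=3)
print(fast)
print(slow)
```

With eta set to 1.0 the toy ensemble reaches the mean in a single round, while with eta set to 0.2 the remaining gap shrinks by a factor of 0.8 per round, which is why lower values are usually paired with more boosting rounds.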
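3.5 Copy and paste the following code block to set up the hyperparameter tuner. SageMaker runs Bayesian optimization routines as the default for the search process. In this tutorial, you use the random search approach to reduce runtime. The parameters are tuned based on the AUC performance of the model on the validation dataset.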
  
objective_metric_name = "validation:auc"
# Setting up tuner object
tuner_config_dict = {
    "estimator": xgb_estimator,
    "max_jobs": 5,
    "max_parallel_jobs": 2,
    "objective_metric_name": objective_metric_name,
    "hyperparameter_ranges": hyperparameter_ranges,
    "base_tuning_job_name": tuning_job_name_prefix,
    "strategy": "Random",
}
tuner = HyperparameterTuner(**tuner_config_dict)
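The idea behind a random search strategy can be sketched in a few lines of plain Python. This is an illustration of the general technique under stated assumptions, not SageMaker's implementation: the function random_search, the tuple-based ranges, and toy_objective (a made-up stand-in for validation AUC) are all hypothetical.

```python
import random


def random_search(objective, ranges, max_jobs, seed=0):
    """Sample max_jobs configurations uniformly and keep the best score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_jobs):
        config = {}
        for name, (low, high) in ranges.items():
            if isinstance(low, int) and isinstance(high, int):
                config[name] = rng.randint(low, high)  # integer range, inclusive
            else:
                config[name] = rng.uniform(low, high)  # continuous range
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Ranges mirror the tutorial's search space
ranges = {
    "eta": (0.0, 1.0),
    "subsample": (0.7, 0.95),
    "colsample_bytree": (0.7, 0.95),
    "max_depth": (1, 5),
}


def toy_objective(config):
    # Made-up score that peaks near eta = 0.3 and max_depth = 3
    return 1.0 - abs(config["eta"] - 0.3) - 0.01 * abs(config["max_depth"] - 3)


best_config, best_score = random_search(toy_objective, ranges, max_jobs=5)
print(best_config, best_score)
```

Because every configuration is drawn independently, the trials do not depend on each other's results, which is what allows several of them to run in parallel.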
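3.6 You can call the fit() method on the tuner object to launch hyperparameter tuning jobs. To fit the tuner, you can specify the different input channels. This tutorial provides train and validation channels. Copy and paste the following code block to launch the hyperparameter tuning job. This takes approximately 13 minutes to complete.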
  
# Setting the input channels for tuning job
s3_input_train = TrainingInput(s3_data="s3://{}/{}".format(read_bucket, train_data_key), content_type="csv", s3_data_type="S3Prefix")
s3_input_validation = TrainingInput(s3_data="s3://{}/{}".format(read_bucket, validation_data_key),
                                    content_type="csv", s3_data_type="S3Prefix")
tuner.fit(inputs={"train": s3_input_train, "validation": s3_input_validation}, include_cls_metadata=False)
tuner.wait()
# Summary of tuning results ordered in descending order of performance
df_tuner = sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe()
df_tuner = df_tuner[df_tuner["FinalObjectiveValue"]>-float('inf')].sort_values("FinalObjectiveValue", ascending=False)
df_tuner
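Once you have a trained model, it is important to understand, before deployment, whether there is any inherent bias in the model or the data. Model predictions can be a source of bias (for example, if they produce a negative outcome more frequently for one group than for another). SageMaker Clarify helps explain how a trained model makes predictions using a feature attribution approach. This tutorial focuses on post-training bias metrics and SHAP values for model explainability. Specifically, it covers the following common tasks:
 Data and model bias detection
 Model explainability using feature importance values
 Impact of features and local explanations for single data samples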
 
  
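4.1 Before SageMaker Clarify can perform model bias detection, it requires a SageMaker model that it can deploy to an ephemeral endpoint as part of the analysis. The endpoint is deleted once the SageMaker Clarify analysis is complete. Copy and paste the following code block to create a SageMaker model from the best training job identified by the tuning job.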
  
tuner_job_info = sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)
model_matches = sagemaker_client.list_models(NameContains=xgb_model_name)["Models"]
if not model_matches:
    _ = sess.create_model_from_job(
            name=xgb_model_name,
            training_job_name=tuner_job_info['BestTrainingJob']["TrainingJobName"],
            role=sagemaker_role,
            image_uri=tuner_job_info['TrainingJobDefinition']["AlgorithmSpecification"]["TrainingImage"]
            )
else:
    print(f"Model {xgb_model_name} already exists.")
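4.2 To run bias detection, SageMaker Clarify expects several component configurations to be set up. You can find more details at Amazon SageMaker Clarify. For this tutorial, apart from the standard configurations, SageMaker Clarify is set up to detect whether the data is statistically biased against female customers by checking whether the target is skewed toward a value based on the gender of the customer. Copy and paste the following code to set up the SageMaker Clarify configuration.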
 
  
train_df = pd.read_csv(train_data_uri)
train_df_cols = train_df.columns.to_list()
clarify_processor = sagemaker.clarify.SageMakerClarifyProcessor(
    role=sagemaker_role,
    instance_count=clarify_instance_count,
    instance_type=clarify_instance_type,
    sagemaker_session=sess,
)
# Data config
bias_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_uri,
    s3_output_path=bias_report_output_uri,
    label="fraud",
    headers=train_df_cols,
    dataset_type="text/csv",
)
# Model config
model_config = sagemaker.clarify.ModelConfig(
    model_name=xgb_model_name,
    instance_type=train_instance_type,
    instance_count=1,
    accept_type="text/csv",
)
# Model predictions config to get binary labels from probabilities
predictions_config = sagemaker.clarify.ModelPredictedLabelConfig(probability_threshold=0.5)
# Bias config
bias_config = sagemaker.clarify.BiasConfig(
    label_values_or_threshold=[0],
    facet_name="customer_gender_female",
    facet_values_or_threshold=[1],
)
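4.3 Within SageMaker Clarify, pre-training metrics show pre-existing bias in the data, while post-training metrics show bias in the predictions from the model. Using the SageMaker SDK, you can specify which groups you want to check bias across and which bias metrics to consider. For the purposes of this tutorial, you use Class Imbalance (CI) and Difference in Positive Proportions in Predicted Labels (DPPL) as examples of pre-training and post-training bias statistics, respectively. You can find details of other bias metrics at Measure Pre-training Bias and Post-training Data and Model Bias. Copy and paste the following code block to run SageMaker Clarify and generate bias reports. The chosen bias metrics are passed as arguments to the run_bias method. This code takes approximately 12 minutes to complete.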
  
clarify_processor.run_bias(
    data_config=bias_data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    pre_training_methods=["CI"],
    post_training_methods=["DPPL"]
)
clarify_bias_job_name = clarify_processor.latest_job.name
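As a rough illustration of what the two metrics measure, the sketch below computes them by hand on a tiny made-up dataset. It assumes the usual definitions, CI = (n_a - n_d) / (n_a + n_d) and DPPL = q_a - q_d, where d is the facet of interest, a is everyone else, and q is the proportion of predicted positive outcomes within a group. The function names and all data values are hypothetical, and this is not the code SageMaker Clarify runs.

```python
def class_imbalance(facet_flags):
    """CI = (n_a - n_d) / (n_a + n_d); facet d is where the flag equals 1."""
    n_d = sum(1 for flag in facet_flags if flag == 1)
    n_a = len(facet_flags) - n_d
    return (n_a - n_d) / (n_a + n_d)


def dppl(facet_flags, predicted_labels, positive_label):
    """DPPL = q_a - q_d, the gap in predicted positive proportions."""
    preds_a = [p for f, p in zip(facet_flags, predicted_labels) if f != 1]
    preds_d = [p for f, p in zip(facet_flags, predicted_labels) if f == 1]
    q_a = sum(1 for p in preds_a if p == positive_label) / len(preds_a)
    q_d = sum(1 for p in preds_d if p == positive_label) / len(preds_d)
    return q_a - q_d


# Made-up data: 1 marks the facet of interest (e.g. customer_gender_female == 1)
facet_flags = [1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical thresholded model outputs (1 = fraud, 0 = not fraud)
predicted_labels = [1, 0, 0, 0, 0, 1, 0, 0]

ci = class_imbalance(facet_flags)
gap = dppl(facet_flags, predicted_labels, positive_label=0)
print(ci, gap)
```

In this toy data the facet is underrepresented (CI of 0.5), and it receives the positive outcome less often than the other group (DPPL of about 0.33). Values near zero for both metrics would indicate balance.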
4.4 SageMaker Clarify outputs are saved to your default S3 bucket. Copy and paste the following code to download the SageMaker Clarify report in PDF format from Amazon S3 to your local directory in SageMaker Studio.
  
# Copy bias report and view locally
!aws s3 cp s3://{write_bucket}/{write_prefix}/clarify-output/bias/report.pdf ./clarify_bias_output.pdf
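4.6 In addition to data bias, SageMaker Clarify can also analyze the trained model and create a model explainability report based on feature importance. SageMaker Clarify uses SHAP values to explain the contribution that each input feature makes to the final prediction. Copy and paste the following code block to configure and run a model explainability analysis. This code block takes approximately 14 minutes to complete.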
  
explainability_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_uri,
    s3_output_path=explainability_report_output_uri,
    label="fraud",
    headers=train_df_cols,
    dataset_type="text/csv",
)
# Use mean of train dataset as baseline data point
shap_baseline = [list(train_df.drop(["fraud"], axis=1).mean())]
shap_config = sagemaker.clarify.SHAPConfig(
    baseline=shap_baseline,
    num_samples=500,
    agg_method="mean_abs",
    save_local_shap_values=True,
)
clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config
)
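To give a feel for what a SHAP value is, the sketch below computes exact Shapley values for a tiny hypothetical model by enumerating every feature subset, with missing features replaced by a baseline value. This brute-force approach is a swapped-in technique for illustration: it is only feasible for a handful of features and is not how SageMaker Clarify computes its estimates. The names shapley_values and toy_model and all numbers are assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(instance)

    def value(subset):
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi += weight * (value(set(subset) | {i}) - value(set(subset)))
        phis.append(phi)
    return phis


def toy_model(x):
    # Hypothetical scoring function with one interaction term
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


baseline = [0.0, 0.0, 0.0]  # plays the role of shap_baseline
instances = [[1.0, 2.0, 4.0], [3.0, 0.0, 2.0]]

local = [shapley_values(toy_model, x, baseline) for x in instances]
# Global importance in the spirit of agg_method="mean_abs"
global_importance = [
    sum(abs(row[i]) for row in local) / len(local) for i in range(3)
]
print(local)
print(global_importance)
```

For each instance, the per-feature values add up to the difference between the model output for that instance and the output for the baseline. Averaging their absolute values across instances gives one global importance score per feature.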
4.7 Copy and paste the following code to download the SageMaker Clarify explainability report in PDF format from Amazon S3 to your local directory in SageMaker Studio.
  
# Copy explainability report and view
!aws s3 cp s3://{write_bucket}/{write_prefix}/clarify-output/explainability/report.pdf ./clarify_explainability_output.pdf
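4.12 The explainability report generated by SageMaker Clarify also provides a file called out.csv that contains local SHAP values for individual samples. Copy and paste the following code block to use this file to visualize the explanation (the impact that each feature has on the model's prediction) for any single example.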
  
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
local_explanations_out = pd.read_csv(explainability_report_output_uri + "/explanations_shap/out.csv")
feature_names = [str.replace(c, "_label0", "") for c in
                 local_explanations_out.columns.to_series()]
local_explanations_out.columns = feature_names
selected_example = 100
print("Example number:", selected_example)
local_explanations_out.iloc[selected_example].plot(
    kind="bar", title="Local explanation for the example number " + str(selected_example), rot=60, figsize=(20, 8)
)
best_train_job_name = tuner.best_training_job()
model_path = estimator_output_uri + '/' + best_train_job_name + '/output/model.tar.gz'
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")
create_model_config = {"model_data": model_path,
                       "role": sagemaker_role,
                       "image_uri": training_image,
                       "name": endpoint_name_prefix,
                       "predictor_cls": sagemaker.predictor.Predictor
                       }
# Create a SageMaker model
model = sagemaker.model.Model(**create_model_config)
# Deploy the best model and get access to a SageMaker Predictor
predictor = model.deploy(initial_instance_count=predictor_instance_count,
                         instance_type=predictor_instance_type,
                         serializer=CSVSerializer(),
                         deserializer=CSVDeserializer())
print(f"\nModel deployed at endpoint : {model.endpoint_name}")
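5.5 Now that the model is deployed to an endpoint, you can invoke it by calling the REST API directly (not described in this tutorial), through the AWS SDK, through a graphical interface in SageMaker Studio, or by using the SageMaker Python SDK. In this tutorial, you use the SageMaker Predictor made available through the deployment step to get real-time model predictions on one or more test samples. Copy and paste the following code block to invoke the endpoint and send a single sample of test data.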
  
# Sample test data
test_df = pd.read_csv(test_data_uri)
payload = test_df.drop(["fraud"], axis=1).iloc[0].to_list()
print(f"Model predicted score : {float(predictor.predict(payload)[0][0]):.3f}, True label : {test_df['fraud'].iloc[0]}")
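The indexing with [0][0] followed by float() in the call above makes more sense once you picture the data as CSV text in both directions. The following sketch imitates that round trip with the Python standard library only. It is an approximation of the idea, not the SageMaker SDK's serializer code, and the helper names, feature values, and response body are all made up.

```python
import csv
import io


def to_csv_payload(row):
    """Join one record's values into a single CSV line."""
    return ",".join(str(value) for value in row)


def parse_csv_response(body):
    """Parse a CSV response body into a list of rows of strings."""
    return list(csv.reader(io.StringIO(body)))


payload = to_csv_payload([23, 1, 0.5])  # made-up feature values
response_body = "0.037\n"               # made-up endpoint response
parsed = parse_csv_response(response_body)
score = float(parsed[0][0])
print(payload, parsed, score)
```

The parsed response is a list of rows, each a list of strings, so the score for a single record sits at row 0, column 0 and must be converted to a float before formatting.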
# Delete inference endpoint config
sess.delete_endpoint_config(endpoint_config_name=predictor._get_endpoint_config_name())
# Delete inference endpoint
sess.delete_endpoint(endpoint_name=model.endpoint_name) 
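6.2 To delete the S3 bucket, do the following:
 Open the Amazon S3 console. On the navigation bar, choose Buckets, then sagemaker-<your-Region>-<your-account-id>, and then select the checkbox next to fraud-detect-demo. Then choose Delete.
 In the Delete objects dialog box, verify that you have selected the correct object to delete, and enter permanently delete in the Permanently delete objects confirmation box.
 Once this is complete and the bucket is empty, you can delete the sagemaker-<your-Region>-<your-account-id> bucket by following the same procedure again.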
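6.3 The Data Science kernel used to run the notebook image in this tutorial accumulates charges until you either stop the kernel or perform the following steps to delete the apps. For more information, see Shut Down Resources in the Amazon SageMaker Developer Guide.
To delete the SageMaker Studio apps, do the following: On the SageMaker console, choose Domains, and then choose StudioDomain. From the User profiles list, select studio-user, and then delete all the apps listed under Apps by choosing Delete app. To delete the JupyterServer, choose Action, then choose Delete. Wait until the Status changes to Deleted.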
 
  