Classifying Irises by Species with TensorFlow

AVI

Supervised machine learning: classification and regression

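A botanist is looking for a way to automatically classify every iris found in the field. A sophisticated machine learning program could classify flowers from photographs. Machine learning offers many ways to classify flowers; here we classify irises using only the length and width of their sepals and petals. The genus Iris has about 300 species, but our program distinguishes only three: Iris setosa, Iris virginica, and Iris versicolor.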

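In 1936 the British statistician and biologist Ronald Fisher published a data set of 150 iris flowers with sepal and petal measurements. It has become one of the standard introductions to machine learning classification problems (another is the MNIST database for handwritten digit classification). The TensorFlow example uses 120 of these examples as its training set and holds out the other 30 as a test set. Five examples drawn from the 120-example training set: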

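The first four columns are the features (characteristics of an example); the last column is the label (the value to predict).

Sepal length	Sepal width	Petal length	Petal width	Species (label)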

6.4	2.8	5.6	2.2	2
5.0	2.3	3.3	1.0	1
4.9	2.5	4.5	1.7	2
4.9	3.1	1.5	0.1	0
5.7	3.8	1.7	0.3	0

Labels: 0 is setosa (Iris setosa), 1 is versicolor (Iris versicolor), 2 is virginica (Iris virginica).
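As a small illustration of the feature/label split, the sketch below separates each of the five sample rows into its four features and decodes the integer label into a species name. The names SPECIES, SAMPLES, and split_example are chosen here for illustration and are not taken from the example program.

```python
# Label ids as given in the text: 0 = setosa, 1 = versicolor, 2 = virginica.
SPECIES = ["Setosa", "Versicolor", "Virginica"]

# The five sample rows from the table above:
# (sepal length, sepal width, petal length, petal width, label)
SAMPLES = [
    (6.4, 2.8, 5.6, 2.2, 2),
    (5.0, 2.3, 3.3, 1.0, 1),
    (4.9, 2.5, 4.5, 1.7, 2),
    (4.9, 3.1, 1.5, 0.1, 0),
    (5.7, 3.8, 1.7, 0.3, 0),
]


def split_example(row):
    """Separate one row into its feature list and its species name."""
    *features, label = row
    return features, SPECIES[label]


decoded = [split_example(row) for row in SAMPLES]
for features, species in decoded:
    print(features, "->", species)
```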

Models (the relationship between features and labels) and training (the machine progressively learns and optimizes the model)

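In the iris problem, the model defines the relationship between the sepal and petal measurements and the iris species. How is such a model created? Feed enough representative examples into the right type of machine learning model, and the program works out the relationship between sepals, petals, and species. In supervised machine learning (classification and regression), the model is trained on examples that contain labels. In unsupervised machine learning (clustering and association), the examples have no labels, and the model typically finds patterns among the features instead.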

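With TensorFlow installed, use pip (on Windows, run it with administrator rights) to install Pandas, a Python data analysis package that provides data structures and tools for working with tabular data: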

pip install pandas

Then use Git to fetch the example programs from the TensorFlow models repository on GitHub:

git clone https://github.com/tensorflow/models

Run the example program models/samples/core/get_started/premade_estimator.py with Python:

>cd E:
E:\python-studio\models\samples\core\get_started
E:\python-studio\models\samples\core\get_started>python premade_estimator.py

Downloading data from http://download.tensorflow.org/data/iris_training.csv
8192/2194 [==============================] - 0s 0us/step
Downloading data from http://download.tensorflow.org/data/iris_test.csv
8192/573 [==============================] - 0s 0us/step
INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\AppData\Local\Temp\tmpvjumpbjq
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\AppData\\Local\\Temp\\tmpvjumpbjq', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000022863CF6B70>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-04-26 10:36:47.624219: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\AppData\Local\Temp\tmpvjumpbjq\model.ckpt.
INFO:tensorflow:loss = 116.04464, step = 1
INFO:tensorflow:global_step/sec: 868.393
INFO:tensorflow:loss = 41.216713, step = 101 (0.116 sec)
INFO:tensorflow:global_step/sec: 1386.75
INFO:tensorflow:loss = 26.085459, step = 201 (0.072 sec)
INFO:tensorflow:global_step/sec: 1426.39
INFO:tensorflow:loss = 19.919716, step = 301 (0.070 sec)
INFO:tensorflow:global_step/sec: 1396.44
INFO:tensorflow:loss = 15.965887, step = 401 (0.072 sec)
INFO:tensorflow:global_step/sec: 1416.26
INFO:tensorflow:loss = 12.773253, step = 501 (0.071 sec)
INFO:tensorflow:global_step/sec: 1313.56
INFO:tensorflow:loss = 11.37424, step = 601 (0.077 sec)
INFO:tensorflow:global_step/sec: 1298.82
INFO:tensorflow:loss = 10.965981, step = 701 (0.077 sec)
INFO:tensorflow:global_step/sec: 1322.47
INFO:tensorflow:loss = 8.201114, step = 801 (0.076 sec)
INFO:tensorflow:global_step/sec: 1288.33
INFO:tensorflow:loss = 9.175628, step = 901 (0.078 sec)
INFO:tensorflow:Saving checkpoints for 1000 into C:\Users\AppData\Local\Temp\tmpvjumpbjq\model.ckpt.
INFO:tensorflow:Loss for final step: 8.504807.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-26-02:36:49
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\AppData\Local\Temp\tmpvjumpbjq\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-04-26-02:36:49
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.96666664, average_loss = 0.084656574, global_step = 1000, loss = 2.5396972

Test set accuracy: 0.967

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\AppData\Local\Temp\tmpvjumpbjq\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

Prediction is "Setosa" (98.1%), expected "Setosa"

Prediction is "Versicolor" (95.0%), expected "Versicolor"

Prediction is "Virginica" (96.5%), expected "Virginica"

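TensorFlow is a programming stack made up of several API layers. This article focuses on two high-level APIs: Estimator and Dataset.

Estimator: represents a complete model. The Estimator API provides methods to train the model, judge its accuracy, and generate predictions.
Dataset: builds a data input pipeline. The Dataset API provides methods to load and manipulate data and feed it into your model. The Dataset API works well together with the Estimator API.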

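A walkthrough of premade_estimator.py:

1. Import and parse the data sets.

Download the iris training set (the examples used to train the model) and test set (the examples used to evaluate the trained model) as CSV data.

The TensorFlow documentation says:

The training set and test set started out as a single data set. Someone then split the examples, with the majority going into the training set and the remainder into the test set. Adding examples to the training set usually builds a better model; adding more examples to the test set lets us gauge the model's effectiveness more reliably. However the split is done, the examples in the test set must be kept separate from those in the training set. Otherwise, you cannot accurately determine the model's effectiveness.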
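A minimal sketch of such a split, using only the Python standard library. It assumes 150 labeled examples divided 120/30, matching the 120 training and 30 test examples this tutorial works with; the function name split_examples, the 20% test fraction, and the fixed seed are illustrative assumptions, not how the published CSV files were actually produced.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples, then cut it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-ins for 150 labeled examples; real rows would hold measurements.
examples = list(range(150))
train, test = split_examples(examples)

print(len(train), len(test))
# The two sets must not share any example.
print(set(train).isdisjoint(test))
```

Because every example lands in exactly one of the two lists, the test set stays separate from the training set, which is the property the quoted passage insists on.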

premade_estimator.py relies on the load_data function to read and parse the training and test sets.

import pandas as pd
import tensorflow as tf

TRAIN_URL = "http://download.tensorflow.org/data/iris_training.csv"
TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth',
                    'PetalLength', 'PetalWidth', 'Species']

...

def load_data(label_name='Species'):
    """Parses the csv file in TRAIN_URL and TEST_URL."""
    # Keras is an open-source machine learning library; tf.keras is TensorFlow's
    # implementation of Keras. get_file copies the remote CSV file to the local file system.
    # Create a local copy of the training set.
    train_path = tf.keras.utils.get_file(fname=TRAIN_URL.split('/')[-1], origin=TRAIN_URL)
    # train_path now holds the pathname: ~/.keras/datasets/iris_training.csv

    # Parse the local CSV file.
    train = pd.read_csv(filepath_or_buffer=train_path,
                        names=CSV_COLUMN_NAMES,  # list of column names
                        header=0  # ignore the first row of the CSV file.
                       )
    # train now holds a pandas DataFrame, which is a data structure
    # analogous to a table.

    # 1. Assign the DataFrame's labels (the right-most column) to train_label.
    # 2. Delete (pop) the labels from the DataFrame.
    # 3. Assign the remainder of the DataFrame to train_features
    train_features, train_label = train, train.pop(label_name)

    # Apply the preceding logic to the test set.
    test_path = tf.keras.utils.get_file(TEST_URL.split('/')[-1], TEST_URL)
    test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
    test_features, test_label = test, test.pop(label_name)

    # Return four DataFrames.
    return (train_features, train_label), (test_features, test_label)

load_data returns two (feature, label) pairs, one for the training set and one for the test set:

# Call load_data() to parse the CSV file.
(train_feature, train_label), (test_feature, test_label) = load_data()

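The features returned by load_data are packed into DataFrames. A pandas DataFrame is a table with named column headers and numbered rows. Pandas is an open-source Python library used by several TensorFlow functions. The test_feature DataFrame looks like this: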

    SepalLength  SepalWidth  PetalLength  PetalWidth
0           5.9         3.0          4.2         1.5
1           6.9         3.1          5.4         2.1
2           5.1         3.3          1.7         0.5
...
27          6.7         3.1          4.7         1.5
28          6.7         3.3          5.7         2.5
29          6.4         2.9          4.3         1.3
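To make the separation of features and labels concrete without downloading anything, here is a standard-library sketch of the same idea as load_data. It is not the program's implementation: the function name load_columns and the contents of CSV_TEXT (including its header row) are assumptions made for illustration, and plain lists stand in for DataFrame columns.

```python
import csv
import io

CSV_COLUMN_NAMES = ["SepalLength", "SepalWidth",
                    "PetalLength", "PetalWidth", "Species"]

# Hypothetical file contents: a header row that is skipped, then two of the
# sample rows shown earlier in this article.
CSV_TEXT = """2,4,setosa,versicolor,virginica
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
"""


def load_columns(text, label_name="Species"):
    """Return (features, labels): a dict of feature columns and a label list."""
    rows = list(csv.reader(io.StringIO(text)))[1:]  # drop the header row
    columns = {name: [] for name in CSV_COLUMN_NAMES}
    for row in rows:
        for name, value in zip(CSV_COLUMN_NAMES, row):
            columns[name].append(float(value))
    # Remove the label column from the features, as train.pop(label_name) does.
    labels = [int(v) for v in columns.pop(label_name)]
    return columns, labels


features, labels = load_columns(CSV_TEXT)
print(features["PetalLength"], labels)
```

The resulting dictionary, keyed by column name, has the same shape as what dict(features) produces from the DataFrame before it is handed to the Dataset API.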

Load the data with Pandas, then build an input pipeline from this in-memory data.

def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    return dataset.shuffle(1000).repeat().batch(batch_size)

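2. Create feature columns (data structures) to describe the data.

In the iris problem we want the model to interpret the data in each feature as a floating-point value, so we call tf.feature_column.numeric_column. Each object in the list of feature_column objects describes one input to the model. The code that creates the feature columns is: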

# Create feature columns for all features.
my_feature_columns = []
for key in train_feature.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))

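3. Select the type of model.

There are many model types to choose from. We use a fully connected neural network for the iris problem, meaning that the neurons in one layer take input from every neuron in the previous layer. Neural networks can find complex relationships between features and labels. A neural network is a highly structured graph organized into one or more hidden layers, and each hidden layer contains one or more neurons. (Figure: a fully connected neural network with three hidden layers.)

To specify the model type, instantiate an Estimator class. The Estimator takes care of details such as initialization, logging, saving, and restoring, along with many other features, so that you can concentrate on the model.

TensorFlow provides two kinds of Estimator: pre-made Estimators and custom Estimators. Pre-made Estimators are the easier starting point; writing your own custom Estimator is the recommended way to optimize a model further.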

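TensorFlow provides several pre-made classifier Estimators, including:

tf.estimator.DNNClassifier: for deep models that perform multi-class classification.
tf.estimator.DNNLinearCombinedClassifier: for wide and deep models.
tf.estimator.LinearClassifier: for classifiers based on linear models.

To implement the neural network, premade_estimator.py uses a pre-made Estimator, tf.estimator.DNNClassifier, which builds a neural network that classifies examples. Instantiating DNNClassifier: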

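# hidden_units defines the number of neurons in each hidden layer of the network. Assign it a list:
# the length of the list is the number of hidden layers (2 here), and each value is the number of
# neurons in that layer (10 in each of the two layers). To change the number of hidden layers or
# neurons, assign a different list to hidden_units.
# n_classes specifies the number of possible values the network can predict. The iris problem has
# 3 species, so n_classes is 3.
# The tf.estimator.DNNClassifier constructor also takes an optional argument named optimizer,
# which controls how the model is trained.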
classifier = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        hidden_units=[10, 10],
        n_classes=3)

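(Figure: features, hidden layers, and predictions; not every node in the hidden layers is shown.)

The ideal number of hidden layers and neurons depends on the problem and the data set. As with many aspects of machine learning, choosing the shape of the network takes a mix of knowledge and experimentation. As a rule of thumb, adding hidden layers and neurons typically produces a more powerful model, which in turn needs more data to train effectively.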
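One way to see why a larger network needs more data is to count its trainable parameters. The sketch below does this for a fully connected network, assuming plain dense layers with one weight per connection and one bias per neuron; count_parameters is a helper written for this article, not a TensorFlow function.

```python
def count_parameters(n_features, hidden_units, n_classes):
    """Count weights and biases in a fully connected network."""
    layer_sizes = [n_features] + list(hidden_units) + [n_classes]
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron in the receiving layer
    return total


# The shape used in this article: 4 features, hidden_units=[10, 10], 3 classes.
print(count_parameters(4, [10, 10], 3))
# A wider and deeper alternative for comparison.
print(count_parameters(4, [20, 20, 20], 3))
```

Under these assumptions the [10, 10] network has 193 parameters, while [20, 20, 20] has 1003, all of which must be fitted from the same 120 training examples.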

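4. Train the model.

Instantiating tf.estimator.DNNClassifier creates a framework for learning the model. Call the Estimator's train method to train the neural network. The training data is supplied by the train_input_fn function, which is built with the Dataset API, a high-level TensorFlow API for reading data and transforming it into the form the train method requires. The Dataset API includes the following classes:

Dataset: base class containing methods to create and transform data sets. It also lets you initialize a data set from in-memory data or from a Python generator.
TextLineDataset: reads lines from text files.
TFRecordDataset: reads records from TFRecord files.
FixedLengthRecordDataset: reads fixed-size records from binary files.
Iterator: provides a way to access one data set element at a time.

With the Dataset API it is easy to read records from a large collection of files in parallel and join them into a single stream.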

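# steps tells train to stop after the given number of training iterations. Increasing steps
# lengthens training, but training for longer does not guarantee a better model.
# The default value of args.train_steps is 1000. The number of training steps is a tunable
# hyperparameter; picking a good value usually takes experience and experimentation.
# input_fn identifies the function that supplies the training data; here train calls
# train_input_fn to get it.
# train_feature maps each feature name to an array holding that feature's value for every
# example in the training set (a pandas DataFrame, converted with dict() inside train_input_fn).
# train_label is an array holding the label of every example in the training set.
# args.batch_size is an integer defining the batch size.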
classifier.train(
        input_fn=lambda:train_input_fn(train_feature, train_label, args.batch_size),
        steps=args.train_steps)
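The args object in the call above comes from command-line parsing. A plausible sketch with the standard argparse module follows; the defaults of 100 and 1000 are the ones this article states, while the exact flag names and help strings are assumptions.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", default=100, type=int, help="batch size")
parser.add_argument("--train_steps", default=1000, type=int,
                    help="number of training steps")

# Parse an explicit list here so the sketch runs anywhere; a script would
# normally call parser.parse_args() to read the real command line.
args = parser.parse_args(["--train_steps", "2000"])
print(args.batch_size, args.train_steps)
```

Exposing both values as flags makes it easy to experiment with these hyperparameters without editing the program.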

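# Inside train_input_fn: convert the input features and labels into a tf.data.Dataset object,
# the base class of the Dataset API.
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

# tf.data.Dataset provides many utility methods for preparing training examples.
# shuffle randomizes the examples; training works best when the examples are in random order.
# Setting buffer_size larger than the number of examples (120) ensures the data is well shuffled.
# repeat gives the train method an unlimited supply of (now shuffled) training examples, since
# train usually processes the examples many times.
# batch creates a batch by grouping several examples, because train processes one batch at a time.
# The program's default batch size is 100, so batch groups the examples 100 at a time. The ideal
# batch size depends on the problem. As a rule of thumb, smaller batches usually make training
# faster, sometimes at the expense of accuracy.
dataset = dataset.shuffle(buffer_size=1000).repeat(count=None).batch(batch_size)

# Return a batch of examples to the caller (the train method).
return dataset.make_one_shot_iterator().get_next()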
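The shuffle, repeat, and batch stages can be imitated in plain Python to show what the train method receives. This is only an analogue under simplifying assumptions (the whole data set fits in the shuffle buffer, and integers stand in for examples); the batches generator is written for this article and is not part of the Dataset API.

```python
import itertools
import random


def batches(examples, batch_size, seed=0):
    """Yield shuffled batches forever, reshuffling on every pass."""
    rng = random.Random(seed)

    def stream():
        while True:                 # repeat: never run out of examples
            epoch = list(examples)
            rng.shuffle(epoch)      # shuffle: new order on each pass
            yield from epoch

    it = stream()
    while True:                     # batch: group into fixed-size lists
        yield list(itertools.islice(it, batch_size))


examples = list(range(120))         # stand-ins for the 120 training examples
first_ten = list(itertools.islice(batches(examples, 100), 10))

print([len(b) for b in first_ten])
seen = [x for b in first_ten for x in b]
print(len(seen), len(set(seen)))
# Passes over the training set made by 1000 steps with batches of 100.
print(1000 * 100 / 120)
```

With the defaults described in this article, 1000 steps of 100 examples each draw on a training set of only 120 examples, so every example is seen several hundred times; this is why the pipeline has to repeat.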

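5. Evaluate the model's effectiveness.

Evaluating means determining how well the model makes predictions. We pass some sepal and petal measurements to the model and ask it to predict which iris species they represent, then compare the model's predictions with the true labels. For example, a model that picks the correct species for half of the input examples has an accuracy of 0.5. The following test set results show a model that is 80% accurate, with four of five predictions correct: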

Sepal length	Sepal width	Petal length	Petal width	Label	Prediction

5.9	3.0	4.3	1.5	1	1
6.9	3.1	5.4	2.1	2	2
5.1	3.3	1.7	0.5	0	0
6.0	3.4	4.5	1.6	1	2
5.5	2.5	4.0	1.3	1	1
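The 80% figure can be checked directly from the last two columns of the table. The sketch below is a plain Python calculation, not the metric code TensorFlow runs; the function name accuracy is chosen for illustration.

```python
# Label and prediction columns from the five-row table above.
labels = [1, 2, 0, 1, 1]
predictions = [1, 2, 0, 2, 1]


def accuracy(labels, predictions):
    """Fraction of examples whose predicted class equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


print(accuracy(labels, predictions))
```

Only the fourth row is wrong (label 1, prediction 2), so four of five predictions match.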

To evaluate the model, every Estimator provides an evaluate method. premade_estimator.py calls evaluate as follows:

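# Calling classifier.evaluate is similar to calling classifier.train. The main difference is that
# classifier.evaluate must take its examples from the test set, not the training set.
# To assess the model fairly, the examples used for evaluation must differ from those used for training.
# eval_input_fn supplies a batch of examples from the test set. When called by classifier.evaluate, it:
# 1. Converts the features and labels of the test set into a tf.data.Dataset object.
# 2. Creates a batch of test set examples. (There is no need to shuffle or repeat the test set.)
# 3. Returns that batch of test set examples to classifier.evaluate.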
# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_feature, test_label, args.batch_size))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))

Output (or something similar): Test set accuracy: 0.967

# The eval_input_fn function
def eval_input_fn(features, labels=None, batch_size=None):
    """An input function for evaluation or prediction"""
    # Convert the DataFrame (or dict) into a dict keyed by feature name,
    # matching what train_input_fn does with dict(features).
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert inputs to a tf.data.Dataset object.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the read end of the pipeline.
    return dataset.make_one_shot_iterator().get_next()

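An accuracy of 0.967 means that the trained model correctly classified 29 of the 30 iris examples in the test set.

6. Use the trained model to make predictions.

We have now trained a model and "shown" that it is effective, though not perfect, at classifying iris species. Next we use the trained model to make predictions on unlabeled examples, that is, examples that contain features but no label. In real life, unlabeled examples can come from many sources, including apps, CSV files, and data feeds. Here we supply three unlabeled examples by hand: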

expected = ['Setosa', 'Versicolor', 'Virginica']  # species we expect for the three examples below
predict_x = {
        'SepalLength': [5.1, 5.9, 6.9],
        'SepalWidth': [3.3, 3.0, 3.1],
        'PetalLength': [1.7, 4.2, 5.4],
        'PetalWidth': [0.5, 1.5, 2.1],
    }

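# Every Estimator provides a predict method. Like evaluate, predict gets its examples from eval_input_fn.
# predict returns a Python iterable that yields one dictionary of prediction results per example.
# The dictionary contains several keys:
# a) probabilities holds a list of three floating-point values, each the probability that the
#    input example is a particular iris species. For example:
#      'probabilities': array([  1.19127117e-08,   3.97069454e-02,   9.60292995e-01])
#    This says the iris has a negligible chance of being setosa, a 3.97% chance of being
#    versicolor, and a 96.0% chance of being virginica.
# b) class_ids holds a one-element array identifying the most probable species. For example:
#      'class_ids': array([2])
#    The number 2 corresponds to virginica.
#
# When called by classifier.predict, eval_input_fn:
# 1. Converts the features of the three-example set we just created by hand.
# 2. Creates a batch of 3 examples from that set.
# 3. Returns the batch of examples to classifier.predict.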
predictions = classifier.predict(
    input_fn=lambda:eval_input_fn(predict_x, batch_size=args.batch_size))
# Iterate over the returned predictions and report each one.
# SPECIES is the list of class names indexed by class id: ['Setosa', 'Versicolor', 'Virginica'].
for pred_dict, expec in zip(predictions, expected):
    template = '\nPrediction is "{}" ({:.1f}%), expected "{}"'

    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]
    print(template.format(SPECIES[class_id], 100 * probability, expec))

Output:
...
Prediction is "Setosa" (99.6%), expected "Setosa"

Prediction is "Versicolor" (99.8%), expected "Versicolor"

Prediction is "Virginica" (97.9%), expected "Virginica"
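To show how one prediction dictionary turns into a line of this report, the sketch below uses the example probabilities and class_ids values quoted in the comments above, written as plain Python lists rather than NumPy arrays. It assumes, as the text describes, that class_ids is the index of the largest probability.

```python
SPECIES = ["Setosa", "Versicolor", "Virginica"]

# The example dictionary entries quoted above, written as plain lists.
pred_dict = {
    "probabilities": [1.19127117e-08, 3.97069454e-02, 9.60292995e-01],
    "class_ids": [2],
}

probabilities = pred_dict["probabilities"]
# Find the index of the largest probability.
best = max(range(len(probabilities)), key=probabilities.__getitem__)
line = 'Prediction is "{}" ({:.1f}%)'.format(SPECIES[best], 100 * probabilities[best])
print(line)
```

The three probabilities sum to approximately 1, so the reported percentage can be read as the model's confidence in the chosen species relative to the other two.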

premade_estimator.py relies on high-level APIs. To understand what happens underneath, read up on gradient descent, batching, and neural networks.