In [2]:
# k-nearest neighbors
"Classify the breast cancer data with k-nearest neighbors"
Out[2]:
'Classify the breast cancer data with k-nearest neighbors'
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
In [11]:
# Load the data
data = load_breast_cancer()

# Check the dataset's keys
data.keys()
"""
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
"""
Out[11]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
In [17]:
# Print the dataset description
print(data.DESCR)
Breast Cancer Wisconsin (Diagnostic) Database
=============================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

References
----------
   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.

In [32]:
# Check what the target data means
print(data.target)
# 0 or 1

print(data.target_names)
# ['malignant' 'benign']

print(data.target.shape) # (569,)

"""
0:悪性 malignant, 1:良性 benign
569個のデータがある
"""
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 1]
['malignant' 'benign']
(569,)
Out[32]:
'\n0: malignant, 1: benign\nThere are 569 samples\n'
In [35]:
# Check the feature data
print(data.feature_names)
print(data.feature_names.shape) # (30,)
print(data.data)
print(data.data.shape) # (569, 30)
"""
・特徴量は30個ある
・データ数は各569個ある
"""
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
(30,)
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
(569, 30)
Out[35]:
'\n- There are 30 features\n- There are 569 samples for each feature\n'
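pandas was imported at the top but has not been used; as an optional aside (not part of the original run), the feature matrix can be wrapped in a DataFrame for easier inspection:

# Optional: view the features as a DataFrame (column names taken from feature_names)
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(df.head())
print(df.describe())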
In [67]:
# Split into training and test data
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=data.target)
In [68]:
# Look up the arguments of train_test_split
help(train_test_split)
"""
 stratify : array-like or None (default is None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.
アレイを指定したとき、データは層状に分割されます。

→目的変数 0と1のクラスターから 同じ割合で訓練データを抽出するということ。
こうしないと、ランダムに抽出されたときに、0か1のどちらかに偏ってしまう場合がある。

"""
Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float, int, None, optional
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. By default, the value is set to 0.25.
        The default will change in version 0.21. It will remain 0.25 only
        if ``train_size`` is unspecified, otherwise it will complement
        the specified ``train_size``.
    
    train_size : float, int, or None, default None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.
    
    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
    
    shuffle : boolean, optional (default=True)
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.
    
    stratify : array-like or None (default is None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.
    
    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.
    
        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.
    
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]
    
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42)
    ...
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]
    
    >>> train_test_split(y, shuffle=False)
    [[0, 1, 2], [3, 4]]

Out[68]:
'\n stratify : array-like or None (default is None)\n        If not None, data is split in a stratified fashion, using this as\n        the class labels.\n        \nIf an array is passed, the data is split in a stratified fashion\n'
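A quick sanity check (not in the original run) that stratify keeps the malignant/benign ratio roughly the same in both splits:

# Class counts in the full data, the training split, and the test split.
# With stratify=data.target the 0/1 ratio should be nearly identical in all three.
print(np.bincount(y))        # [212 357], matching the class distribution in DESCR
print(np.bincount(y_train))
print(np.bincount(y_test))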
In [69]:
# Create a KNN instance
model = KNeighborsClassifier()

# Train it
model.fit(X_train, y_train)
Out[69]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
In [74]:
# Check KNN's arguments
help(KNeighborsClassifier)

"""
 |  n_neighbors : int, optional (default = 5)
 |      Number of neighbors to use by default for :meth:`kneighbors` queries.
 
 近傍点の数を指定する。デフォルトは5
 
 """
Help on class KNeighborsClassifier in module sklearn.neighbors.classification:

class KNeighborsClassifier(sklearn.neighbors.base.NeighborsBase, sklearn.neighbors.base.KNeighborsMixin, sklearn.neighbors.base.SupervisedIntegerMixin, sklearn.base.ClassifierMixin)
 |  Classifier implementing the k-nearest neighbors vote.
 |  
 |  Read more in the :ref:`User Guide <classification>`.
 |  
 |  Parameters
 |  ----------
 |  n_neighbors : int, optional (default = 5)
 |      Number of neighbors to use by default for :meth:`kneighbors` queries.
 |  
 |  weights : str or callable, optional (default = 'uniform')
 |      weight function used in prediction.  Possible values:
 |  
 |      - 'uniform' : uniform weights.  All points in each neighborhood
 |        are weighted equally.
 |      - 'distance' : weight points by the inverse of their distance.
 |        in this case, closer neighbors of a query point will have a
 |        greater influence than neighbors which are further away.
 |      - [callable] : a user-defined function which accepts an
 |        array of distances, and returns an array of the same shape
 |        containing the weights.
 |  
 |  algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
 |      Algorithm used to compute the nearest neighbors:
 |  
 |      - 'ball_tree' will use :class:`BallTree`
 |      - 'kd_tree' will use :class:`KDTree`
 |      - 'brute' will use a brute-force search.
 |      - 'auto' will attempt to decide the most appropriate algorithm
 |        based on the values passed to :meth:`fit` method.
 |  
 |      Note: fitting on sparse input will override the setting of
 |      this parameter, using brute force.
 |  
 |  leaf_size : int, optional (default = 30)
 |      Leaf size passed to BallTree or KDTree.  This can affect the
 |      speed of the construction and query, as well as the memory
 |      required to store the tree.  The optimal value depends on the
 |      nature of the problem.
 |  
 |  p : integer, optional (default = 2)
 |      Power parameter for the Minkowski metric. When p = 1, this is
 |      equivalent to using manhattan_distance (l1), and euclidean_distance
 |      (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
 |  
 |  metric : string or callable, default 'minkowski'
 |      the distance metric to use for the tree.  The default metric is
 |      minkowski, and with p=2 is equivalent to the standard Euclidean
 |      metric. See the documentation of the DistanceMetric class for a
 |      list of available metrics.
 |  
 |  metric_params : dict, optional (default = None)
 |      Additional keyword arguments for the metric function.
 |  
 |  n_jobs : int, optional (default = 1)
 |      The number of parallel jobs to run for neighbors search.
 |      If ``-1``, then the number of jobs is set to the number of CPU cores.
 |      Doesn't affect :meth:`fit` method.
 |  
 |  Examples
 |  --------
 |  >>> X = [[0], [1], [2], [3]]
 |  >>> y = [0, 0, 1, 1]
 |  >>> from sklearn.neighbors import KNeighborsClassifier
 |  >>> neigh = KNeighborsClassifier(n_neighbors=3)
 |  >>> neigh.fit(X, y) # doctest: +ELLIPSIS
 |  KNeighborsClassifier(...)
 |  >>> print(neigh.predict([[1.1]]))
 |  [0]
 |  >>> print(neigh.predict_proba([[0.9]]))
 |  [[ 0.66666667  0.33333333]]
 |  
 |  See also
 |  --------
 |  RadiusNeighborsClassifier
 |  KNeighborsRegressor
 |  RadiusNeighborsRegressor
 |  NearestNeighbors
 |  
 |  Notes
 |  -----
 |  See :ref:`Nearest Neighbors <neighbors>` in the online documentation
 |  for a discussion of the choice of ``algorithm`` and ``leaf_size``.
 |  
 |  .. warning::
 |  
 |     Regarding the Nearest Neighbors algorithms, if it is found that two
 |     neighbors, neighbor `k+1` and `k`, have identical distances
 |     but different labels, the results will depend on the ordering of the
 |     training data.
 |  
 |  https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
 |  
 |  Method resolution order:
 |      KNeighborsClassifier
 |      sklearn.neighbors.base.NeighborsBase
 |      abc.NewBase
 |      sklearn.base.BaseEstimator
 |      sklearn.neighbors.base.KNeighborsMixin
 |      sklearn.neighbors.base.SupervisedIntegerMixin
 |      sklearn.base.ClassifierMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  predict(self, X)
 |      Predict the class labels for the provided data
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          Test samples.
 |      
 |      Returns
 |      -------
 |      y : array of shape [n_samples] or [n_samples, n_outputs]
 |          Class labels for each data sample.
 |  
 |  predict_proba(self, X)
 |      Return probability estimates for the test data X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          Test samples.
 |      
 |      Returns
 |      -------
 |      p : array of shape = [n_samples, n_classes], or a list of n_outputs
 |          of such arrays if n_outputs > 1.
 |          The class probabilities of the input samples. Classes are ordered
 |          by lexicographic order.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.neighbors.base.KNeighborsMixin:
 |  
 |  kneighbors(self, X=None, n_neighbors=None, return_distance=True)
 |      Finds the K-neighbors of a point.
 |      
 |      Returns indices of and distances to the neighbors of each point.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          The query point or points.
 |          If not provided, neighbors of each indexed point are returned.
 |          In this case, the query point is not considered its own neighbor.
 |      
 |      n_neighbors : int
 |          Number of neighbors to get (default is the value
 |          passed to the constructor).
 |      
 |      return_distance : boolean, optional. Defaults to True.
 |          If False, distances will not be returned
 |      
 |      Returns
 |      -------
 |      dist : array
 |          Array representing the lengths to points, only present if
 |          return_distance=True
 |      
 |      ind : array
 |          Indices of the nearest points in the population matrix.
 |      
 |      Examples
 |      --------
 |      In the following example, we construct a NeighborsClassifier
 |      class from an array representing our data set and ask who's
 |      the closest point to [1,1,1]
 |      
 |      >>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
 |      >>> from sklearn.neighbors import NearestNeighbors
 |      >>> neigh = NearestNeighbors(n_neighbors=1)
 |      >>> neigh.fit(samples) # doctest: +ELLIPSIS
 |      NearestNeighbors(algorithm='auto', leaf_size=30, ...)
 |      >>> print(neigh.kneighbors([[1., 1., 1.]])) # doctest: +ELLIPSIS
 |      (array([[ 0.5]]), array([[2]]...))
 |      
 |      As you can see, it returns [[0.5]], and [[2]], which means that the
 |      element is at distance 0.5 and is the third element of samples
 |      (indexes start at 0). You can also query for multiple points:
 |      
 |      >>> X = [[0., 1., 0.], [1., 0., 1.]]
 |      >>> neigh.kneighbors(X, return_distance=False) # doctest: +ELLIPSIS
 |      array([[1],
 |             [2]]...)
 |  
 |  kneighbors_graph(self, X=None, n_neighbors=None, mode='connectivity')
 |      Computes the (weighted) graph of k-Neighbors for points in X
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          The query point or points.
 |          If not provided, neighbors of each indexed point are returned.
 |          In this case, the query point is not considered its own neighbor.
 |      
 |      n_neighbors : int
 |          Number of neighbors for each sample.
 |          (default is value passed to the constructor).
 |      
 |      mode : {'connectivity', 'distance'}, optional
 |          Type of returned matrix: 'connectivity' will return the
 |          connectivity matrix with ones and zeros, in 'distance' the
 |          edges are Euclidean distance between points.
 |      
 |      Returns
 |      -------
 |      A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]
 |          n_samples_fit is the number of samples in the fitted data
 |          A[i, j] is assigned the weight of edge that connects i to j.
 |      
 |      Examples
 |      --------
 |      >>> X = [[0], [3], [1]]
 |      >>> from sklearn.neighbors import NearestNeighbors
 |      >>> neigh = NearestNeighbors(n_neighbors=2)
 |      >>> neigh.fit(X) # doctest: +ELLIPSIS
 |      NearestNeighbors(algorithm='auto', leaf_size=30, ...)
 |      >>> A = neigh.kneighbors_graph(X)
 |      >>> A.toarray()
 |      array([[ 1.,  0.,  1.],
 |             [ 0.,  1.,  1.],
 |             [ 1.,  0.,  1.]])
 |      
 |      See also
 |      --------
 |      NearestNeighbors.radius_neighbors_graph
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.neighbors.base.SupervisedIntegerMixin:
 |  
 |  fit(self, X, y)
 |      Fit the model using X as training data and y as target values
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix, BallTree, KDTree}
 |          Training data. If array or matrix, shape [n_samples, n_features],
 |          or [n_samples, n_samples] if metric='precomputed'.
 |      
 |      y : {array-like, sparse matrix}
 |          Target values of shape = [n_samples] or [n_samples, n_outputs]
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.

In [70]:
# Check the score
model.score(X_test, y_test)
Out[70]:
0.916083916083916
In [71]:
# Compare the predictions with the true labels
print('Predictions:\n', model.predict(X_test))
print('True labels:\n', y_test)
Predictions:
 [0 0 0 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 0
 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1
 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 0]
True labels:
 [1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0
 0 0 1 0 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 1 1 1 1
 1 0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 1 0 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 0]
In [72]:
# Check which predictions match
model.predict(X_test) == y_test
Out[72]:
array([False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True])
In [73]:
# Recompute the score by hand
(model.predict(X_test) == y_test).sum() / len(model.predict(X_test) == y_test)
Out[73]:
0.916083916083916
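The same accuracy can also be computed with scikit-learn's metrics module, and a confusion matrix shows where the misclassifications fall (a small sketch, not part of the original run):

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))    # same value as model.score(X_test, y_test)
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class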
In [86]:
# Next, vary the number of neighbors and look for the value that gives the best score
score_train = []
score_test = []

for n_neighbors in range(1,51):
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    
    score_train.append(model.score(X_train, y_train))
    score_test.append(model.score(X_test, y_test))
In [96]:
# Visualize the scores
plt.plot(range(1,51), score_train, label='Training')
plt.plot(range(1,51), score_test, label='Test')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.grid(True)
plt.legend(loc='best')
Out[96]:
<matplotlib.legend.Legend at 0x1ad65c3d588>
In [97]:
# Conclusion: around 15 to 30 neighbors looks good
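Note that this comparison reuses the same test set for every n_neighbors, so the chosen value is tuned to this particular split. A more careful approach would be cross-validation on the training data, and since KNN is distance-based, feature scaling usually helps as well. A minimal sketch using standard scikit-learn tools (this cell is not part of the original notebook):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Scale the features, then apply KNN; choose n_neighbors by 5-fold cross-validation
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': list(range(1, 51))}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # n_neighbors with the best mean CV accuracy
print(grid.best_score_)
print(grid.score(X_test, y_test))  # accuracy on the held-out test set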
In [98]:
# As a bonus, let's try KNN regression
from sklearn.neighbors import KNeighborsRegressor

# Load the data
data = load_breast_cancer()

# Split into training and test data
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=data.target)

# Vary the number of neighbors and look for the value that gives the best score
score_train = []
score_test = []

for n_neighbors in range(1,51):
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    
    score_train.append(model.score(X_train, y_train))
    score_test.append(model.score(X_test, y_test))
    
    
# Visualize the scores
plt.plot(range(1,51), score_train, label='Training')
plt.plot(range(1,51), score_test, label='Test')
plt.xlabel('n_neighbors')
plt.ylabel('score')  # for a regressor this is not accuracy (see below)
plt.grid(True)
plt.legend(loc='best')
Out[98]:
<matplotlib.legend.Legend at 0x1ad66374a58>
In [102]:
# What does this show?
# Why is the curve smoother than for classification?

# Look at the predicted values
print(model.predict(X_test))

"""
The predictions are not split into the two poles malignant/benign.
Since this is the result for n_neighbors = 50,
the predictions come back in steps of 1/50 (= 0.02).

That is, if 1 of the 50 neighbors is benign (label 1), the output is the fraction 0.02.
"""
[0.64 0.   0.02 1.   0.02 0.78 0.   0.   0.04 0.98 0.04 0.02 1.   0.86
 1.   1.   1.   0.38 0.82 0.04 0.6  0.   0.64 1.   1.   0.1  1.   1.
 1.   0.   1.   0.96 0.58 1.   0.98 1.   0.52 0.   0.   0.98 0.   0.04
 0.02 0.04 1.   1.   1.   0.16 0.6  1.   0.26 1.   0.96 0.98 0.98 0.74
 1.   0.92 0.66 0.86 0.6  0.94 0.96 0.   0.54 0.   0.   1.   0.98 0.
 1.   1.   0.98 1.   1.   0.86 0.66 0.72 0.9  0.7  1.   1.   0.04 0.86
 0.   0.98 0.7  0.98 0.92 1.   1.   0.96 0.1  0.5  0.44 0.04 1.   0.74
 1.   0.   0.   0.48 0.36 0.7  0.96 0.74 0.   0.02 1.   1.   1.   0.
 0.98 1.   1.   0.38 1.   0.98 1.   0.98 0.66 0.6  1.   1.   0.   0.48
 0.98 0.38 1.   1.   1.   0.   1.   0.8  1.   0.74 0.   0.9  0.02 0.98
 0.1  1.   0.  ]
Out[102]:
'\n\n'
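This fraction interpretation can be checked directly: with the default weights='uniform', KNeighborsRegressor predicts the mean of the neighbors' target values. A small check (not in the original run) using the kneighbors method:

# Indices of the 50 nearest training points for the first test sample
dist, ind = model.kneighbors(X_test[:1])

# The mean of their labels should equal the regressor's prediction
print(y_train[ind].mean())
print(model.predict(X_test[:1]))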
In [106]:
# So what does this score value mean?
score_test[-1]  # the last one, i.e. for n_neighbors = 50
Out[106]:
0.6841288888888889
In [108]:
# Presumably it is some value derived from the difference against the true labels.

# True labels
y_test
Out[108]:
array([1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0])
In [113]:
# Take the absolute difference
delta = y_test - model.predict(X_test)
delta_abs = np.abs(delta)
print(delta_abs)
[0.36 0.   0.02 0.   0.02 0.22 0.   0.   0.04 0.02 0.04 0.02 1.   0.14
 0.   0.   0.   0.38 0.82 0.04 0.4  0.   0.64 0.   0.   0.1  0.   0.
 0.   0.   0.   0.04 0.42 0.   0.02 0.   0.52 0.   0.   0.02 0.   0.04
 0.02 0.04 0.   0.   0.   0.84 0.6  0.   0.26 0.   0.04 0.02 0.02 0.26
 0.   0.08 0.34 0.14 0.6  0.06 0.04 0.   0.54 0.   0.   0.   0.02 0.
 0.   0.   0.02 0.   0.   0.86 0.34 0.28 0.1  0.7  0.   0.   0.04 0.86
 0.   0.02 0.3  0.02 0.08 0.   0.   0.96 0.1  0.5  0.44 0.04 0.   0.26
 0.   0.   0.   0.52 0.36 0.3  0.04 0.26 0.   0.02 0.   0.   0.   0.
 0.02 0.   0.   0.62 0.   0.02 0.   0.02 0.34 0.4  0.   0.   0.   0.52
 0.02 0.38 0.   0.   0.   0.   0.   0.2  0.   0.26 0.   0.1  0.02 0.02
 0.1  0.   0.  ]
In [115]:
# Take the mean
delta_abs.sum() / len(delta_abs)  # 0.13776223776223775
Out[115]:
0.13776223776223775
In [117]:
# If the average error is 0.13, then maybe the score is...
1 - 0.13  # not this
Out[117]:
0.87
In [125]:
# What about squared error?
delta_squared = (y_test - model.predict(X_test)) ** 2
np.sqrt(delta_squared.sum())
Out[125]:
3.2459821318054103
In [ ]:
# Not this either...

"""
I still don't understand what the KNN regression score means,
so let's revisit the theory another day.
"""