01_python

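Chapter 1 study notes 01 Commonly used packages

Python can solve many kinds of problems, and the solutions rely on a correspondingly large number of packages, which makes them hard to remember.
To help with that, here is a plain-language description of what each package does.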

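# The most fundamental package for scientific computing.
# At its core it is a multi-dimensional array plus functions that operate on arrays; most machine learning libraries build on it
import numpy as np

# Higher-level mathematical functions built on top of numpy
import scipy

# Stores and presents data in tabular form, and can consume numpy data directly.
# Both numpy and pandas hold collections of data, but numpy focuses on multi-dimensional arrays whose elements all share one data type,
# while pandas focuses on tables of data; the simplest mental model is an Excel sheet
import pandas as pd

# Plotting library; it can plot numpy data directly
import matplotlib

# Interactive Python.
# When we start python from cmd and enter the interpreter, there are no completion hints while typing.
# IPython provides them
import IPython

# ---------------
# A simple way to remember the rest: packages named xxxxlearn are for machine learning.
# The basic machine learning library. It ships with free sample datasets, and the example below uses one of them
import sklearn

# Plotting helpers for machine learning
import mglearn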
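
The difference between numpy and pandas described in the comments above can be seen in a few lines. This is only a small sketch; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing numbers and text
# turns every element into a string.
mixed = np.array([1, 2.5, "three"])
array_kind = mixed.dtype.kind  # 'U' stands for a unicode string dtype

# A DataFrame keeps one dtype per column, much like columns in a spreadsheet.
table = pd.DataFrame({"length_cm": [5.1, 4.9], "species": ["setosa", "setosa"]})
length_kind = table["length_cm"].dtype.kind  # 'f' stands for floating point

print(mixed)
print(table.dtypes)
```

The numeric column stays numeric even though the neighbouring column holds text, which is what makes the Excel comparison fit.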

02 Starting from an example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import mglearn
import matplotlib.pyplot as pt

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = \
    train_test_split(iris_dataset['data'],
                     iris_dataset['target'], random_state=0)
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
pt.show()

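If, like me, you find this completely baffling, then we are at the same level, so let's try to work out what this program does.
First, a brief summary of what it does, so that we can read the code with questions in mind.

What it does:

- Uses a sample dataset shipped with the machine learning package to create training data and test data
- Builds a data table (DataFrame) from the training data
- Checks whether the table lets us tell the classes apart (note that the snippet only plots the training split; the test split is not used yet)
- Displays the result as a plot

Next, let's study the program piece by piece: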

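02_001 Notes on the imports

# sklearn is the basic machine learning package; its datasets module provides some free sample datasets.
# load_iris loads the sample dataset used for modelling iris flowers
from sklearn.datasets import load_iris

# train_test_split splits the data into two parts.
# By default it splits one dataset into 75% training data and 25% test data
from sklearn.model_selection import train_test_split

# pandas provides plotting helpers for its data.
# Why not use matplotlib directly? Data usually needs some preparation before it is plotted, and putting pandas in between as a buffer
# handles most of those cases
import pandas as pd

# Although pandas can plot, plotting is not its main job, so it cannot cover elaborate colour schemes.
# mglearn is a helper package aimed at plotting, with ready-made colour schemes
# and data views; in the spirit of not reinventing the wheel, it is a convenient choice when plotting.
# Also note: mglearn only prepares what is to be plotted; the drawing itself is done by matplotlib
import mglearn

# The main plotting interface
import matplotlib.pyplot as pt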

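02_002 Using a sample dataset from the machine learning package to create training data and test data

Loading the sample dataset:

# Load the iris dataset; the returned object behaves much like a dict
iris_dataset = load_iris()

>>> print(type(iris_dataset))
<class 'sklearn.utils.Bunch'>

>>> iris_dataset.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

# What each key in this dataset is for:
# =='data'==
#  Holds the measurements; in this example it is a two-dimensional array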
>>> iris_dataset.data
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       ....
       ])

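# Shape of the two-dimensional array:
# the first dimension is the number of samples,
# the second dimension is the number of feature values per sample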
>>> iris_dataset.data.shape
(150, 4)

# =='feature_names'==
# Holds the feature names. Their count should equal the size of the second dimension of data.
>>> iris_dataset.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

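# =='target_names'==
# Taken together, the features of one sample lead to a conclusion (its class); target_names holds the possible conclusions.
# In this example all samples fall into three classes.
>>> iris_dataset.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')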

# =='target'==
# Holds the class that each sample's features point to. To keep it compact,
# it stores the position of the class in target_names rather than the name itself
>>> iris_dataset.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

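# =='DESCR'==
# A descriptive document about the dataset, in other words its readme; remembering that it abbreviates "description" makes it easy to understand

# =='filename'==
# Where the data is read from: the actual file on disk.
# You should be able to find this file on your own machine; it contains just the raw measurements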
>>> iris_dataset.filename
'iris.csv'


# =='data_module'==
# The module the data file comes from; think of it as a signature
>>> iris_dataset.data_module
'sklearn.datasets.data'
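
To make the relationship between target and target_names concrete, here is a small sketch using only numpy. The three class names are the ones shown above; the short label vector is a made-up stand-in for iris_dataset.target.

```python
import numpy as np

target_names = np.array(["setosa", "versicolor", "virginica"])
target = np.array([0, 0, 1, 2, 1])  # hypothetical labels, not the real 150 entries

# Indexing the names with the integer labels turns positions back into names.
labels = target_names[target]
print(labels)
```

The same indexing works on the real dataset, for example iris_dataset.target_names[iris_dataset.target].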

Splitting the sample dataset into training data and test data:

>>> X_train, X_test, y_train, y_test = \
    train_test_split(iris_dataset['data'],
                     iris_dataset['target'], random_state=0)
                     
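# Below is the definition of train_test_split. Both 'data' and 'target' are passed in as arrays;
# train_test_split accepts any number of them and returns the corresponding splits.
# random_state=0 deserves a note: it is the random seed for the shuffle. The value 0 has no special meaning; fixing any integer
# ensures that repeated runs produce exactly the same training data and test data.
# def train_test_split(
#     *arrays,
#     test_size=None,
#     train_size=None,
#     random_state=None,
#     shuffle=True,
#     stratify=None,
# ):

# What train_test_split does: it splits each array passed via *arrays into two parts,
# 75% and 25% by default.
# In the example above iris_dataset['data'] is split into X_train (75%) and X_test (25%);
# likewise iris_dataset['target'] is split into y_train (75%) and y_test (25%).
# The ratio can be changed with test_size, a float between 0 and 1;
# for example, test_size=0.5 splits the data into two equal halves.
# The 75/25 default is a common rule of thumb rather than a proven optimum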

# Size of the data before splitting
>>> iris_dataset.data.shape
(150, 4)
# Size of the data after splitting
>>> X_train.shape
(112, 4)
>>> X_test.shape
(38, 4)
>>> y_train.shape
(112,)
>>> y_test.shape
(38,)
# Check the split ratio
>>> 112 / 150
0.7466666666666667
>>> 38 / 150
0.25333333333333335
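
The effect of random_state and test_size can be checked with a short experiment. This is a sketch under stated assumptions: X and y are made-up arrays with the same shapes as the iris data and labels, and the rounding remark in the last comment is an observation that matches the numbers above, not a statement about how the library is implemented.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins with the same shapes as iris_dataset['data'] and ['target'].
X = np.arange(150 * 4).reshape(150, 4)
y = np.arange(150) % 3

# Two calls with the same seed produce identical splits.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, random_state=0)
same_split = np.array_equal(X_train_a, X_train_b) and np.array_equal(y_test_a, y_test_b)

# test_size overrides the default 25% test share.
X_train_half, X_test_half = train_test_split(X, test_size=0.5, random_state=0)

# 150 * 0.25 = 37.5; rounding up gives the 38 test rows observed above.
expected_test_rows = math.ceil(150 * 0.25)

print(X_train_a.shape, X_test_a.shape, same_split, X_test_half.shape)
```

This also explains why the measured ratio is 74.67% to 25.33% rather than exactly 75% to 25%: 150 rows cannot be divided in that proportion without a fractional row.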

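02_003 Building a data table from the training data

>>> iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# DataFrame accepts many input formats, too many to cover here, so let's stick to this example.
# First, the parameter description from the docstring: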

# Parameters
#   ----------
#    data: ndarray(structured or homogeneous), Iterable, dict, or DataFrame
#       Dict can contain Series, arrays, constants, dataclass or list-like objects. If
#        data is a dict, column order follows insertion-order. If a dict contains Series
#         which have an index defined, it is aligned by its index.

#         .. versionchanged:: 0.25.0
#          If data is a list of dicts, column order follows insertion-order.

#     index: Index or array-like
#       Index to use for resulting frame. Will default to RangeIndex if
#        no indexing information part of input data and no index provided.
#     columns: Index or array-like
#       Column labels to use for resulting frame when data does not have them,
#        defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
#         will perform column selection instead.
#     dtype: dtype, default None
#       Data type to force. Only a single dtype is allowed. If None, infer.
#     copy: bool or None, default None
#       Copy data from inputs.
#         For dict data, the default of None behaves like ``copy = True``.  For DataFrame
#         or 2d ndarray input, the default of None behaves like ``copy = False``.

#         .. versionchanged:: 1.3.0

#     See Also
#     --------
#     DataFrame.from_records: Constructor from tuples, also record arrays.
#     DataFrame.from_dict: From dicts of Series, arrays, or dicts.
#     read_csv: Read a comma-separated values(csv) file into DataFrame.
#     read_table: Read general delimited file into DataFrame.
#     read_clipboard: Read text from clipboard into DataFrame.


# ------------
>>> iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# Besides the data itself, the example above only passes one parameter: columns.
# columns gives the meaning of each column of every row in X_train
>>> X_train
array([[4.6, 3.1, 1.5, 0.2],
       [5.9, 3. , 5.1, 1.8],
       [5.1, 2.5, 3. , 1.1],
       ......
>>> iris_dataset.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> X_train.shape
(112, 4)

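# X_train holds 112 training samples, each with 4 features; by position they are
# 0:'sepal length (cm)', 1:'sepal width (cm)', 2:'petal length (cm)', 3:'petal width (cm)'
# Someone suggested picturing it as an Excel sheet, which I find very apt

The resulting table looks like a spreadsheet:
the first column is generated automatically and marks the position of each row (the index); everything else comes from the arguments passed to the constructor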
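
A minimal sketch of that table, built from the three X_train rows printed above so that it runs without loading the dataset:

```python
import numpy as np
import pandas as pd

feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]

# The first three rows of X_train as shown in the transcript above.
rows = np.array([[4.6, 3.1, 1.5, 0.2],
                 [5.9, 3.0, 5.1, 1.8],
                 [5.1, 2.5, 3.0, 1.1]])

frame = pd.DataFrame(rows, columns=feature_names)

# The leftmost column in the printout is the automatically generated index 0, 1, 2.
print(frame)
```

With the full X_train the index would run from 0 to 111.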

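02_004 Checking whether the data table lets us tell the classes apart

Each sample in the iris dataset has 4 features, but a flat plot can hardly show more than 2 or 3 features at once.
A bar chart, for example, only has a horizontal and a vertical axis, so it could show just 2 of the features,
which is clearly not what we want. The solution given in the book is a scatter plot.
Roughly speaking, every flower is drawn as a point for each pair of features, and points that belong to the same class share a colour.
That does solve the problem of displaying samples with more than two features.

To draw it we use the [pd.plotting.scatter_matrix] function, whose input must be a DataFrame,
which is why the earlier step contains [pd.DataFrame(X_train, columns=iris_dataset.feature_names)].
In any programming language each step prepares for the code that follows, so rather than fixating on a single
snippet it is sometimes quicker to read through the whole thing first.

The help text for scatter_matrix can be found in:

.venv\Lib\site-packages\pandas\plotting\_misc.py

The example in the book is written like this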

grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

When I ran it, the figure from the book did not appear. Stepping through in the debugger, this is what grr contained:

grr
array([[<AxesSubplot:xlabel='sepal length (cm)', ylabel='sepal length (cm)'>,
        <AxesSubplot:xlabel='sepal width (cm)', ylabel='sepal length (cm)'>,
        <AxesSubplot:xlabel='petal length (cm)', ylabel='sepal length (cm)'>,
        <AxesSubplot:xlabel='petal width (cm)', ylabel='sepal length (cm)'>],
       [<AxesSubplot:xlabel='sepal length (cm)', ylabel='sepal width (cm)'>,
        <AxesSubplot:xlabel='sepal width (cm)', ylabel='sepal width (cm)'>,
        <AxesSubplot:xlabel='petal length (cm)', ylabel='sepal width (cm)'>,
        <AxesSubplot:xlabel='petal width (cm)', ylabel='sepal width (cm)'>],
       [<AxesSubplot:xlabel='sepal length (cm)', ylabel='petal length (cm)'>,
        <AxesSubplot:xlabel='sepal width (cm)', ylabel='petal length (cm)'>,
        <AxesSubplot:xlabel='petal length (cm)', ylabel='petal length (cm)'>,
        <AxesSubplot:xlabel='petal width (cm)', ylabel='petal length (cm)'>],
       [<AxesSubplot:xlabel='sepal length (cm)', ylabel='petal width (cm)'>,
        <AxesSubplot:xlabel='sepal width (cm)', ylabel='petal width (cm)'>,
        <AxesSubplot:xlabel='petal length (cm)', ylabel='petal width (cm)'>,
        <AxesSubplot:xlabel='petal width (cm)', ylabel='petal width (cm)'>]],
      dtype=object)

The function help shows that it simply returns an array (of matplotlib Axes objects), which explains why no picture appeared!

Returns
-------
numpy.ndarray
    A matrix of scatter plots.
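
The return value can be inspected directly. The sketch below uses made-up random data instead of the iris table, and selects the non-interactive Agg backend (an assumption made so that it runs without a display).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import numpy as np
import pandas as pd

# Hypothetical data: 30 rows and 3 numeric columns.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])

axes = pd.plotting.scatter_matrix(frame, figsize=(6, 6))

# One Axes per pair of columns, arranged as a 3x3 grid.
grid_shape = axes.shape

# Every Axes belongs to the same Figure; that Figure is the actual picture.
figure = axes[0, 0].figure

print(type(axes), grid_shape)
```

So the array is a handle to the sub-plots, not the image itself; the image only appears once the figure is shown or saved.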

The signature of the scatter-matrix function

def scatter_matrix(
    frame,
    alpha=0.5,
    figsize=None,
    ax=None,
    grid=False,
    diagonal="hist",
    marker=".",
    density_kwds=None,
    hist_kwds=None,
    range_padding=0.05,
    **kwargs,
):

Since we are already here, a few quick notes for reference:

pandas scatter_matrix help documentation

grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
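# frame
#  The data table (DataFrame)
# alpha
#  Transparency of the points
# figsize
#  Size of the final figure
# marker
#  Style of the points in the scatter plots
# hist_kwds
#  A named parameter: keyword arguments for the histograms on the diagonal
# c, s, cmap
#  Extra arguments collected by **kwargs. At this point it is not yet clear what they do, but the help text says they are passed further down, so we set them aside for now
#    **kwds
#       Options to pass to matplotlib scatter plotting method.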
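
How Python decides which arguments are named parameters and which end up in **kwargs can be shown without pandas. scatter_matrix_like below is a hypothetical stand-in that copies part of the signature and only reports how its arguments were bound; it does not plot anything.

```python
def scatter_matrix_like(frame, alpha=0.5, marker=".", hist_kwds=None, **kwargs):
    """Hypothetical stand-in: return the named arguments and the leftovers."""
    named = {"alpha": alpha, "marker": marker, "hist_kwds": hist_kwds}
    return named, kwargs


named, forwarded = scatter_matrix_like(
    "frame placeholder",
    c=[0, 1, 2],
    marker="o",
    hist_kwds={"bins": 20},
    s=60,
    alpha=0.8,
    cmap="placeholder colormap",
)

print(named)      # arguments that matched a parameter name
print(forwarded)  # everything else, collected by **kwargs
```

Only c, s and cmap land in the leftover dictionary, which is the part that gets handed on to the scatter call.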

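02_005 Displaying the result as a plot

The data is ready, so the next step is to display it:

pt.show()

This one line displays the plot. Unlike a typical function call, show is called without any arguments, so how does it draw anything, and how does it find the data to plot?

02_005_001 Starting the investigation:

The display code is trivial and configures nothing, which means the figure must have been set up in the earlier steps.
Let's look back at the previous step, since that is where the investigation has to start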

import pandas as pd

grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

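Looking for the scatter_matrix function
I found the package, but saw nothing named 'scatter_matrix' there, which means the function is exposed in some other way.
In that situation the usual place to look is '__init__.py'.
.venv\Lib\site-packages\pandas\plotting

As expected, it is imported there, so 'scatter_matrix' is defined in '_misc'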

from pandas.plotting._misc import (
    andrews_curves,
    autocorrelation_plot,
    bootstrap_plot,
    deregister as deregister_matplotlib_converters,
    lag_plot,
    parallel_coordinates,
    plot_params,
    radviz,
    register as register_matplotlib_converters,
    scatter_matrix,
    table,
)

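_misc.scatter_matrix does the following:
it obtains a backend module, passes every argument it received straight to that module's function, and returns the result.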

from pandas.plotting._core import _get_plot_backend

def scatter_matrix(
    frame,
    alpha=0.5,
    figsize=None,
    ax=None,
    grid=False,
    diagonal="hist",
    marker=".",
    density_kwds=None,
    hist_kwds=None,
    range_padding=0.05,
    **kwargs,
):
    plot_backend = _get_plot_backend("matplotlib")
    return plot_backend.scatter_matrix(
        frame=frame,
        alpha=alpha,
        figsize=figsize,
        ax=ax,
        grid=grid,
        diagonal=diagonal,
        marker=marker,
        density_kwds=density_kwds,
        hist_kwds=hist_kwds,
        range_padding=range_padding,
        **kwargs,
    )

Checking what _get_plot_backend returns

_backends: dict[str, types.ModuleType] = {}

def _get_plot_backend(backend: str | None = None):
    backend = backend or get_option("plotting.backend")

    # _backends is an empty dict at first, so this branch is not taken on the first call
    if backend in _backends:
        return _backends[backend]

    # As the function definition below shows, _load_backend(backend) returns [pandas.plotting._matplotlib]
    module = _load_backend(backend)
    _backends[backend] = module
    return module

def _load_backend(backend: str) -> types.ModuleType:
    from importlib.metadata import entry_points

    if backend == "matplotlib":
        # Because matplotlib is an optional dependency and first-party backend,
        # we need to attempt an import here to raise an ImportError if needed.
        try:
            module = importlib.import_module("pandas.plotting._matplotlib")
        except ImportError:
            raise ImportError(
                "matplotlib is required for plotting when the "
                'default backend "matplotlib" is selected.'
            ) from None
        return module
    # (the rest of the function is omitted in this excerpt)

A summary so far:
the analysis shows that [_get_plot_backend("matplotlib")] resolves to
[pandas.plotting._matplotlib], so the call ends up in the [scatter_matrix] function of [pandas.plotting._matplotlib.misc]

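3 What scatter_matrix actually does
This part is kept brief because it is not our focus: a rough idea of what it does is enough, and a full analysis would amount to learning the whole package, which costs too much.
The function body contains this fragment:

ax.scatter(
    df[b][common], df[a][common], marker=marker, alpha=alpha, **kwds
)

This is a call to the scatter method of a matplotlib Axes, the same method that matplotlib.pyplot.scatter wraps; readers with energy to spare can dig into it themselves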

02_005_002 Conclusion of the investigation:

import pandas as pd

grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

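In effect the call hands its arguments on to matplotlib's scatter function (the same one exposed as matplotlib.pyplot.scatter), and matplotlib's main job is plotting.
The figure it draws on is created through pyplot, which is why pt.show() can display it without being given any arguments.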
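
The reason show needs no arguments is that pyplot keeps its own list of open figures. The sketch below makes that list visible; it uses the Agg backend and made-up points (both assumptions, chosen so that it runs without a display).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import matplotlib.pyplot as plt

plt.close("all")                 # start from a clean state
before = plt.get_fignums()       # figure numbers pyplot currently tracks

fig, ax = plt.subplots()         # creating a figure registers it with pyplot
ax.scatter([1, 2, 3], [3, 1, 2])

after = plt.get_fignums()
current = plt.gcf()              # the figure pyplot considers "current"

print(before, after)
# In an interactive session, plt.show() would now display every figure in `after`.
```

Any library that creates its figures through pyplot therefore leaves them where show can find them.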

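02_005_003 Investigating the plotting parameters:

Next, let's look at the parameters of scatter to understand what the example code does

scatter_matrix official documentation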

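# c
#   Short for colour. Here it is an array of plain integers, and each distinct integer gets its own colour.
#   y_train is the label part of the 75% training split [X_train] created earlier. Because the iris_dataframe we pass in
#   was built from X_train, the labels of those training samples must come from y_train
# cmap
#   Colour map, i.e. the colour scheme. Like themes that each come with a palette; the one to use is chosen here
#       >>> import mglearn
#       >>> mglearn.cm3
#
# marker
#   The symbol used for each point in the scatter plots
# s
#   Size of the symbol
# hist_kwds={'bins': 20}
#   bins is the number of bins of a histogram, so 20 means each histogram has 20 bars.
#   In the resulting figure the plots on the diagonal are histograms and all the others are scatter plots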
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
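
How c and cmap work together can be checked on a plain matplotlib scatter call. This is a sketch with assumptions: the labels and coordinates are made up, and the three-colour ListedColormap is only a stand-in for mglearn.cm3, whose actual colours are not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

labels = np.array([0, 0, 1, 2, 1, 2])  # hypothetical class indices, like y_train
x = np.arange(6.0)
y = x ** 2

# Stand-in colour map with three entries (not the real mglearn.cm3).
three_colours = ListedColormap(["tab:blue", "tab:red", "tab:green"])

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=labels, cmap=three_colours, marker="o", s=60, alpha=0.8)

# Ask the plotted collection which RGBA colour each label maps to.
rgba = points.to_rgba(labels)
distinct_colours = np.unique(rgba, axis=0)

print(len(distinct_colours))
```

Three distinct labels lead to three distinct colours, one per class.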

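The final figure:
My understanding:
on the diagonal the horizontal and vertical axes are the same feature, so a scatter plot would be pointless there and a histogram of that feature is drawn instead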
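
What bins=20 means for those histograms can be seen with numpy alone. The values below are made up; they merely imitate one feature column with 112 rows.

```python
import numpy as np

# Hypothetical feature column with as many rows as X_train.
rng = np.random.default_rng(0)
values = rng.normal(loc=5.8, scale=0.8, size=112)

# bins=20 divides the value range into 20 equal-width intervals.
counts, edges = np.histogram(values, bins=20)

print(len(counts), len(edges), counts.sum())
```

Each of the 20 counts becomes one bar, and together the bars account for all 112 samples.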

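How can we tell whether this figure is correct?
My understanding:
in the call we set [c=y_train], and y_train contains only three distinct classes, so if the scatter plots show exactly 3 colours the figure is OK.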

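03 Summary

Official help for the scatter-matrix function
Official help for the plotting function
In pandas, the **kwargs part of scatter_matrix is passed on to matplotlib.pyplot.scatter.

That's all.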


Original article: http://outofmemory.cn/langs/922493.html
