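The previous article, "A Collection of Common Methods for Exploratory Data Analysis (EDA)", covered data overviews and common preprocessing methods. This article focuses on visualization methods for analyzing data distributions and correlations. Everything here is practical, so consider bookmarking it.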
# Import common libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Render figures inline in Jupyter Notebook
%matplotlib inline
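The snippets that follow operate on a DataFrame named data that the article never constructs. To follow along, a stand-in can be built as in the minimal sketch below. The column names Glucose, BMI, and Outcome and all values are hypothetical placeholders, not the author's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's `data`; values are randomly generated.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n).round(),
    "BMI": rng.normal(32, 7, n).round(1),
    "Outcome": rng.integers(0, 2, n),  # binary label, 0 or 1
})

# Blank out a few cells so that missing-value checks have something to report.
data.loc[rng.choice(n, size=10, replace=False), "BMI"] = np.nan

print(data.shape)
print(data.dtypes)
```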
-
Use histograms to view the distribution of the numeric columns in data
data.hist(figsize = (20,20))
-
Another view: use missingno to chart how many values are present (non-missing) in each column
import missingno as msno
p=msno.bar(data)
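The missingno bar chart reports, per column, how many values are present. The same numbers can be read directly with pandas. A minimal sketch, assuming a tiny hypothetical frame named demo in place of data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for `data`.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, None],
    "c": [1, 2, 3, 4],
})

present = demo.notna().sum()        # count of non-missing values per column
missing_ratio = demo.isna().mean()  # fraction of missing values per column

summary = pd.DataFrame({"present": present, "missing_ratio": missing_ratio})
print(summary)
```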
-
View how the values of a single feature are distributed. In the example below, Outcome is the column name.
data.Outcome.value_counts().plot(kind="bar")
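When class balance matters more as a proportion than as a raw count, value_counts accepts normalize=True. A small sketch on a hypothetical Series named outcome:

```python
import pandas as pd

# Hypothetical binary labels standing in for data.Outcome.
outcome = pd.Series([0, 1, 0, 0, 1, 0, 0, 1], name="Outcome")

counts = outcome.value_counts()                # raw counts per class
shares = outcome.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

Either result can be plotted with .plot(kind="bar") in the same way as above.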
-
Draw a scatter plot of two variables
# figure size
plt.figure(figsize=(15,8))
# Simple scatterplot
ax = sns.scatterplot(x='Calories', y='LightActiveDistance', data=activity1)
ax.set_title('Scatterplot of Calories and LightActiveDistance')
-
Draw a line plot to see how the variables change over time
## plot the raw values
col_select = ['Calories','VeryActiveMinutes','FairlyActiveMinutes','LightlyActiveMinutes','SedentaryMinutes']
show_dt = data[col_select]
# figure size
plt.figure(figsize=(15,8))
# timeseries plot using lineplot
ax = sns.lineplot(data=show_dt)
ax.set_title('Un-normalized value of calories and different activities based on activity minutes')
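The plot title calls these values un-normalized. When columns on very different scales share one y-axis, the smaller series flatten out. One common remedy is min-max scaling each column to the range [0, 1] before plotting. The sketch below illustrates that idea and is not necessarily what the author did; the frame raw and its values are hypothetical.

```python
import pandas as pd

# Hypothetical values; only the column names come from the article.
raw = pd.DataFrame({
    "Calories": [1800, 2200, 2500, 2000],
    "VeryActiveMinutes": [10, 35, 60, 20],
})

# Min-max scaling: (x - min) / (max - min), computed column by column.
# A constant column would give a zero span and produce NaN, so check for that first.
span = raw.max() - raw.min()
scaled = (raw - raw.min()) / span

print(scaled)
```

The scaled frame can then be passed to sns.lineplot(data=scaled) in place of show_dt.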
-
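View the pairwise relationships between variables. Each diagonal cell pairs a variable with itself, so it is drawn as a histogram; every other cell is a scatter plot of the two variables.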
import seaborn as sns
sns.pairplot(data)
-
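View the distributions after grouping by a specific column
Outcome takes the values 0 and 1, so the data can be split on this field before plotting. Yellow points are rows where Outcome is 1, and blue points are rows where Outcome is 0.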
import seaborn as sns
# hue: color the points by the values of the given column
sns.pairplot(data, hue = 'Outcome')
-
Use a heatmap to view the correlation coefficients between variables
import seaborn as sns
plt.figure(figsize=(12,10))
p=sns.heatmap(diabetes_data.corr(), annot=True,cmap ='RdYlGn')
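A heatmap is easiest to read for a handful of columns. For wider tables it can help to list the strongest pairs directly. The sketch below uses a hypothetical frame named df. It passes numeric_only=True because DataFrame.corr in pandas 2.0 and later raises an error on non-numeric columns instead of silently dropping them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "label" is non-numeric on purpose.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "z": [5, 3, 4, 1, 2],
    "label": list("abcde"),  # excluded from the correlation
})

corr = df.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort by absolute correlation, strongest first.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```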
-
Draw a box plot to inspect the distribution and spread of the data and to check for outliers
plt.rcParams['figure.figsize'] = (15, 8)
ax = sns.boxplot(x = data_clubs['Club'], y = data_clubs['Overall'], palette = 'inferno')
ax.set_xlabel(xlabel = 'Some Popular Clubs', fontsize = 9)
ax.set_ylabel(ylabel = 'Overall Score', fontsize = 9)
ax.set_title(label = 'Distribution of Overall Score in Different popular Clubs', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()
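By default, seaborn and matplotlib extend box-plot whiskers to 1.5 times the interquartile range (IQR) and draw anything beyond them as individual outlier points. The same rule can be applied numerically to list the flagged values. A sketch with a hypothetical Series of scores:

```python
import pandas as pd

# Hypothetical scores standing in for one club's Overall column.
scores = pd.Series([70, 72, 74, 75, 76, 78, 80, 99], name="Overall")

q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers)
```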
-
A reusable helper that plots the distribution of every feature at once (supports both numeric and non-numeric columns)
import math

def plot_distribution(dataset, cols=5, width=20, height=30, hspace=0.2, wspace=0.5):
    plt.style.use('fivethirtyeight')  # apply a predefined matplotlib style
    fig = plt.figure(figsize=(width, height))  # create a new figure
    # spacing between the subplots
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    # draw one subplot per column
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        # np.object was removed in NumPy 1.24; compare against the builtin object instead
        if dataset.dtypes[column] == object:
            # object column: plot the count of each distinct value
            g = sns.countplot(y=column, data=dataset, ax=ax)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]  # truncate long labels
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            # numeric column: histogram with a density curve
            # (sns.distplot is deprecated; histplot with kde=True is its replacement)
            g = sns.histplot(dataset[column], kde=True, stat='density', ax=ax)
            plt.xticks(rotation=25)

# Usage
plot_distribution(Sample, cols=2, width=20, height=35, hspace=0.8, wspace=0.8)
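AI自研社 is a public account focused on artificial intelligence and machine learning. It has published a number of serialized articles that explain machine learning step by step, from the basics to more advanced topics, with many worked examples and reference code.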