Posted 2022-11-05 | Updated 2022-11-05 | Techs / Python | 6 minutes read (About 922 words)

Introduction

Pandas is an extension library for Python, used for data analysis.

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

This article processes and plots some data stored in data.csv, which looks like this:

Idx	Title of Book	Description	Authors	Rating	Price	Availability	Book Category
0	It’s Only the Himalayas	Wherever you go, whatever you do, just . . . don’t do anything stupid.	S. Bedford	2	45.17	19	Travel

Analysis

Read File

Use pd.read_csv to read the file, and pass index_col to choose the column that the data will use as its index from then on.

import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
dfcand=pd.read_csv("data.csv", sep=',',index_col=0)
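
To see what index_col does without the original file, here is a minimal sketch that reads a small, made-up CSV from an in-memory string. The sample rows and the name sample_csv are illustrative assumptions, not the real data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: made-up rows with a leading Idx column
sample_csv = io.StringIO(
    "Idx,Title of Book,Rating,Price,Book Category\n"
    "0,Book A,2,45.17,Travel\n"
    "1,Book B,4,12.50,Fiction\n"
)

# index_col=0 turns the first column (Idx) into the row index
df = pd.read_csv(sample_csv, sep=",", index_col=0)

print(df.index.tolist())    # row labels taken from the Idx column
print(df.columns.tolist())  # Idx is no longer listed as a regular column
```

Without index_col, pandas would keep Idx as an ordinary column and add its own 0..n-1 index on top.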

Group by

Calling .groupby() on a DataFrame groups its rows by the given key. The result is a pandas.core.groupby.generic.DataFrameGroupBy object.

.size()

Calling .size() on the grouped object returns the name and size (row count) of each group as a pandas.core.series.Series object.

dfcand.groupby('Book Category').size()

#Book Category
#Academic                1
#Add a comment          67
#Adult Fiction           1
#Art                     8
#Autobiography           9
#Biography               5
#...
#Young Adult            54
#dtype: int64
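
The same grouping can be tried on a tiny, made-up DataFrame. This is only a sketch: the four rows below are invented for illustration and merely reuse the Book Category column name from the article.

```python
import pandas as pd

# Made-up rows; only the column name matches the article's data
df = pd.DataFrame({
    "Title of Book": ["A", "B", "C", "D"],
    "Book Category": ["Travel", "Fiction", "Travel", "Art"],
})

grouped = df.groupby("Book Category")  # a DataFrameGroupBy object
sizes = grouped.size()                 # a Series: one row count per category

print(type(grouped).__name__)
print(sizes)
```

The index of the resulting Series holds the group names, sorted alphabetically by default.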

Series objects

.sort_values(ascending=False)

For a Series, .sort_values() sets the sort order; ascending=False sorts from largest to smallest.

piedata = dfcand.groupby('Book Category').size().sort_values(ascending=False)

#Book Category
#Default               152
#Nonfiction            110
#Sequential Art         75
#Add a comment          67
#Fiction                65
#Young Adult            54
#...
#Academic                1
#dtype: int64

In general, a Series can be filtered with comparison operators. For example, to keep only the values above that are greater than 7:

abovepiedata = piedata[piedata>7]

#Book Category
#Default               152
#Nonfiction            110
#...
#Autobiography           9
#dtype: int64
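
The filter works because the comparison itself produces a boolean Series, which is then used as a mask. A minimal sketch, using a small subset of the counts shown above as sample data (the names counts, mask, and above are assumptions for illustration):

```python
import pandas as pd

# A few of the category counts from the output above, in arbitrary order
counts = pd.Series({"Academic": 1, "Art": 8, "Default": 152, "Biography": 5})

ordered = counts.sort_values(ascending=False)  # largest count first
mask = ordered > 7                             # boolean Series, same index as ordered
above = ordered[mask]                          # keeps only the rows where mask is True

print(mask)
print(above)
```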

pd.concat()

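This function combines several Series objects into a single DataFrame.

With axis=1, each Series becomes a new column and rows are aligned on the index, so Series that share the same index merge cleanly into one DataFrame.

pd3d = pd.concat([ratingseries, availabilityseries, sizeseries], axis=1, ignore_index=False)

Columns that end up without a name after the merge can be accessed by position with .iloc, for example:

pd3d.iloc[:, 2]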
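
The article does not show how ratingseries, availabilityseries, and sizeseries were built. The sketch below assumes they come from grouping by Book Category (mean rating, mean availability, and group size), which is a guess based on the axis labels used later; the rows are made up.

```python
import pandas as pd

# Made-up rows; the construction of the three Series is an assumption
df = pd.DataFrame({
    "Book Category": ["Travel", "Travel", "Art"],
    "Rating": [2, 4, 5],
    "Availability": [19, 1, 7],
})

grouped = df.groupby("Book Category")
ratingseries = grouped["Rating"].mean()              # Series named "Rating"
availabilityseries = grouped["Availability"].mean()  # Series named "Availability"
sizeseries = grouped.size()                          # Series with no name

# axis=1: each Series becomes a column, rows are aligned on the shared index
pd3d = pd.concat([ratingseries, availabilityseries, sizeseries], axis=1)

print(pd3d.columns.tolist())  # the unnamed Series gets an automatic column label
print(pd3d.iloc[:, 2])        # positional access to that third column
```

Giving the size Series a name before concatenating (for example with .rename("Size")) would avoid the positional lookup altogether.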

Benford

In many real-world datasets, the leading digit 1 occurs more frequently than 2, 2 more frequently than 3, and so on. This observation is a simplified version of Benford's law. More precisely, the law predicts the frequency of each leading digit d using base-10 logarithms, P(d) = log10(1 + 1/d), so the predicted frequencies decrease as the digit increases from 1 to 9.

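# Running count of how often each leading digit occurs
frequency = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0,'9':0}
def benford(num):
    # The first character of the number's string form is its leading digit
    firstdigit = str(num)[0]
    # Skip values that start with '0' or '-' instead of raising KeyError
    if firstdigit in frequency:
        frequency[firstdigit] = frequency[firstdigit] + 1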
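
For comparison with the observed counts, the frequencies predicted by Benford's law can be computed directly from the logarithm formula. This is a minimal sketch using only the standard library; the helper first_digit and the sample prices list are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def first_digit(num):
    """Return the first non-zero digit of a number as an int, or None if it has none."""
    for ch in str(abs(num)):
        if ch in "123456789":
            return int(ch)
    return None


# Predicted share of each leading digit d: log10(1 + 1/d)
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Made-up sample values, not taken from data.csv
prices = [45.17, 12.5, 0.99, 101, 19.0]
observed = Counter(first_digit(p) for p in prices)

for d in range(1, 10):
    print(d, observed[d], f"{expected[d]:.3f}")
```

The nine predicted frequencies sum to 1, since the logarithms telescope to log10(10).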

Draw diagrams

General

plt.figure(figsize=(10, 10))  # set the size of the figure
plt.rcParams.update({'font.family': 'Times New Roman'})  # set the display font
plt.rcParams.update({'font.weight': 'normal'})  # set the font weight
plt.rcParams.update({'font.size': 15})  # set the font size

#Draw Diagram

plt.title("This is a figure")  # set the figure title
plt.show()  # display the figure

Pie

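For a Series, .values returns its values.

Use plt.pie() to draw a pie chart.

The first argument supplies the data.

labels = the labels for the corresponding data

startangle = the angle at which the first wedge starts

labeldistance = the distance from the center of the pie to the labels, relative to the radius

plt.pie(abovepiedata.values, labels=abovepiedata.index, startangle=32, labeldistance=1.12)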
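
A self-contained version of the pie chart, as a sketch only: the three counts are copied from the sorted output earlier in the article, while the Agg backend and the figure size are assumptions made so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Top three category counts from the sorted output shown earlier
counts = pd.Series({"Default": 152, "Nonfiction": 110, "Sequential Art": 75})

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(
    counts.values,              # wedge sizes
    labels=list(counts.index),  # one label per wedge
    startangle=32,              # rotate the start of the first wedge by 32 degrees
    labeldistance=1.12,         # place labels slightly outside the pie
)
ax.set_title("Books per category")

print(len(ax.patches))  # number of wedges drawn
```

In an interactive session, dropping the matplotlib.use("Agg") line and calling plt.show() displays the chart.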

3D Scatter

ax = plt.axes(projection='3d')  # create 3D axes for the scatter plot
ax.scatter3D(pd3d['Rating'], pd3d.iloc[:,2], pd3d['Availability'])  # supply the data for the scatter plot
plt.xticks(range(5))  # set the tick positions on the x-axis

plt.xlabel('Avg Rating')  # set the x-axis label
plt.ylabel('Size of Category', rotation=39)  # set the y-axis label
ax.set_zlabel('Avg stock')  # set the z-axis label

Bar & Plot

In plt.bar(), x = gives the positions of the bars (one per data entry).

plt.bar(x=range(len(benser)), height=benser, label='Data', tick_label = benser.index)

plt.xlabel("First Digit")
plt.ylabel("Data")

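# share the same x-axis as the bar chart above
ax2 = plt.twinx()

# set the value range of the second y-axis
ax2.set_ylim([0.04, 0.33])
plt.plot(range(len(benser)), y, marker='.', color='goldenrod', linewidth=1, label="Benford's Law")

# add the legend and the percentage labels along the line
plt.legend(loc="upper right")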
for a, b in zip(range(len(benser)), y):
    plt.text(a, b, str('{:.2f}'.format(b*100)) + '%', ha='center', va='bottom', fontsize=8)
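
The snippet above relies on benser and y, which the article does not define. The sketch below assumes benser is a Series of first-digit counts and y holds the frequencies predicted by Benford's law; the counts are made up, and the Agg backend is an assumption so the example runs without a display.

```python
import math

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up first-digit counts standing in for benser
benser = pd.Series([30, 18, 12, 10, 8, 7, 6, 5, 4],
                   index=[str(d) for d in range(1, 10)])
# Predicted Benford frequencies, assumed to be what y contains
y = [math.log10(1 + 1 / d) for d in range(1, 10)]

positions = range(len(benser))
fig, ax1 = plt.subplots()
ax1.bar(positions, benser.values, tick_label=list(benser.index), label="Data")
ax1.set_xlabel("First Digit")
ax1.set_ylabel("Data")

ax2 = ax1.twinx()  # second y-axis that shares the x-axis with the bars
ax2.set_ylim(0.04, 0.33)
ax2.plot(positions, y, marker=".", color="goldenrod", linewidth=1,
         label="Benford's Law")
ax2.legend(loc="upper right")

# Percentage label above each point of the line
for a, b in zip(positions, y):
    ax2.text(a, b, f"{b * 100:.2f}%", ha="center", va="bottom", fontsize=8)
```

Calling methods on ax1 and ax2 explicitly, rather than through plt, makes it clear which y-axis each artist belongs to.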