Simple data analysis and visualization with Pandas and Matplotlib

Introduction

Pandas is an extension library for the Python language, used for data analysis.

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

This article processes and plots some data stored in data.csv, which looks like this:

Idx | Title of Book | Description | Authors | Rating | Price | Availability | Book Category
0 | It’s Only the Himalayas | Wherever you go, whatever you do, just . . . don’t do anything stupid. | S. Bedford | 2 | 45.17 | 19 | Travel

Analysis

Read File

Use pd.read_csv to read the file. The index_col argument specifies which column serves as the index of the data from then on.

import matplotlib.pyplot as plt
import pandas as pd

pd.set_option('display.width', 500)        # widen the printed output
pd.set_option('display.max_columns', 100)  # show up to 100 columns
dfcand = pd.read_csv("data.csv", sep=',', index_col=0)  # first column becomes the index
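As a minimal sketch of what index_col does, the snippet below reads a tiny in-memory CSV instead of data.csv. The Description and Authors columns are left out for brevity, and the second row is invented purely for illustration.

```python
import io
import pandas as pd

# Two sample rows; the second one is made up for illustration only.
csv_text = (
    "Idx,Title of Book,Rating,Price,Availability,Book Category\n"
    "0,It's Only the Himalayas,2,45.17,19,Travel\n"
    "1,A Hypothetical Book,4,10.00,3,Fiction\n"
)

# index_col=0 turns the first column (Idx) into the row index
sample = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)

print(sample.index.tolist())
print(sample.columns.tolist())
```

Because Idx is consumed as the index, it no longer appears among the ordinary columns.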

Group by

Calling .groupby() on a DataFrame groups its rows by the given key. The result is a pandas.core.groupby.generic.DataFrameGroupBy object.

.size()

The .size() method returns the name of each group together with its size (row count), as a pandas.core.series.Series object.

dfcand.groupby('Book Category').size()

#Book Category
#Academic 1
#Add a comment 67
#Adult Fiction 1
#Art 8
#Autobiography 9
#Biography 5
#...
#Young Adult 54
#dtype: int64
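To make the two object types concrete, here is a small sketch on a made-up DataFrame (the books frame and its four rows are hypothetical stand-ins for dfcand).

```python
import pandas as pd

# Hypothetical mini data set standing in for dfcand
books = pd.DataFrame({
    "Title of Book": ["A", "B", "C", "D"],
    "Book Category": ["Travel", "Fiction", "Travel", "Art"],
})

grouped = books.groupby("Book Category")  # DataFrameGroupBy object
sizes = grouped.size()                    # Series: category -> number of rows

print(type(grouped).__name__)
print(sizes)
```

By default groupby sorts the group keys, which is why the printed categories come out in alphabetical order.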

Series objects

.sort_values(ascending=False)

For a Series object, .sort_values() controls the sort order.

piedata = dfcand.groupby('Book Category').size().sort_values(ascending=False)

#Book Category
#Default 152
#Nonfiction 110
#Sequential Art 75
#Add a comment 67
#Fiction 65
#Young Adult 54
#...
#Academic 1
#dtype: int64

In general, a Series can be filtered with comparison operators. For example, to keep only the values above that are greater than 7:

abovepiedata = piedata[piedata>7]

#Book Category
#Default 152
#Nonfiction 110
#...
#Autobiography 9
#dtype: int64
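The filter works because the comparison itself produces a boolean Series that is then used as a mask. A small sketch with made-up counts (the counts Series is a hypothetical stand-in for the groupby result):

```python
import pandas as pd

# Hypothetical category counts standing in for the groupby result
counts = pd.Series({"Art": 8, "Academic": 1, "Default": 152, "Biography": 5})

ordered = counts.sort_values(ascending=False)  # largest first
mask = ordered > 7                             # boolean Series with the same index
above = ordered[mask]                          # keep only entries where mask is True

print(mask)
print(above)
```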

pd.concat()

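This function can combine several Series objects into a DataFrame.

With axis=1, the Series are aligned on their index; when the indexes are identical, each Series becomes a new column of the resulting DataFrame.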

pd3d = pd.concat([ratingseries,availabilityseries,sizeseries],axis= 1,ignore_index= False)

An unnamed column that may appear after the merge can be selected by position with .iloc, for example:

pd3d.iloc[:,2]
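The article does not show how ratingseries, availabilityseries and sizeseries are built. The sketch below assumes they are the per-category mean rating, mean availability and group size; that assumption, and the three sample rows, are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for dfcand
books = pd.DataFrame({
    "Book Category": ["Travel", "Travel", "Art"],
    "Rating": [2, 4, 5],
    "Availability": [19, 1, 3],
})
grouped = books.groupby("Book Category")

# Assumed definitions of the three Series (not given in the article)
ratingseries = grouped["Rating"].mean()              # Series named "Rating"
availabilityseries = grouped["Availability"].mean()  # Series named "Availability"
sizeseries = grouped.size()                          # Series without a name

# All three share the category index, so each becomes one column
pd3d = pd.concat([ratingseries, availabilityseries, sizeseries], axis=1)

# The size column has no meaningful label, so select it by position
print(pd3d.iloc[:, 2])
```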

Benford

In many real-world data sets, the leading digit 1 occurs more frequently than 2, 2 more frequently than 3, and so on. This observation is a simplified version of Benford’s law. More precisely, the law predicts the frequency of each leading digit using base-10 logarithms, and the predicted frequencies decrease as the digit increases from 1 to 9.

frequency = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0,'9':0}

def benford(num):
    # tally the leading digit of num
    firstdigit = str(num)[0]
    if firstdigit in frequency:  # ignore a leading '0' or '-'
        frequency[firstdigit] = frequency[firstdigit] + 1
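Benford's law puts the expected share of leading digit d at log10(1 + 1/d). As a rough sketch, the snippet below computes those expected shares and compares them with the observed shares of a small list. The helper names and the sample prices are made up for illustration and are not taken from data.csv.

```python
import math
from collections import Counter

def expected_benford():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {str(d): math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(numbers):
    """Observed share of each leading digit; zeros are skipped."""
    digits = []
    for n in numbers:
        # drop the sign, leading zeros and the decimal point, e.g. 0.045 -> "45"
        text = str(abs(n)).lstrip("0.")
        if text and text[0].isdigit():
            digits.append(text[0])
    total = len(digits)
    if total == 0:
        return {str(d): 0.0 for d in range(1, 10)}
    counts = Counter(digits)
    return {str(d): counts.get(str(d), 0) / total for d in range(1, 10)}

prices = [45.17, 19.99, 1.5, 120, 0.85]  # made-up sample values
observed = leading_digit_shares(prices)
expected = expected_benford()

for digit in "123456789":
    print(digit, round(observed[digit], 3), round(expected[digit], 3))
```

The nine expected shares sum to 1, because the logarithms telescope to log10(10).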

Draw diagrams

General

plt.figure(figsize=(10, 10))  # set the size of the canvas
plt.rcParams.update({'font.family': 'Times New Roman'})  # set the display font
plt.rcParams.update({'font.weight': 'normal'})  # set the font weight
plt.rcParams.update({'font.size': 15})  # set the font size

# Draw Diagram

plt.title("This is a figure")  # set the figure title
plt.show()  # display the figure

Pie

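For a Series object, .values returns its values.

Use plt.pie() to draw a pie chart.

The first argument supplies the data.

labels = the labels for the corresponding data

startangle = the angle at which the first wedge starts, measured counterclockwise from the x-axis

labeldistance = the distance from the center of the pie to the labels, as a multiple of the radius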

plt.pie(abovepiedata.values,labels=abovepiedata.index,startangle= 32,labeldistance= 1.12)
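A self-contained sketch of the same call, assuming a made-up counts Series in place of abovepiedata. It selects the non-interactive Agg backend and renders into memory so that it runs without a display; both choices are only for the sake of the example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for abovepiedata
counts = pd.Series({"Default": 152, "Nonfiction": 110, "Art": 8})

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(
    counts.values,         # wedge sizes
    labels=counts.index,   # one label per wedge
    startangle=32,         # first wedge starts 32 degrees counterclockwise from the x-axis
    labeldistance=1.12,    # labels sit just outside the pie (1.0 is the edge)
)
ax.set_title("Books per category")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # render the chart into memory
```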

3D Scatter

ax = plt.axes(projection='3d')  # create 3D axes for the scatter plot
ax.scatter3D(pd3d['Rating'], pd3d.iloc[:, 2], pd3d['Availability'])  # supply the data for the scatter plot
plt.xticks(range(5))  # set the tick positions on the x-axis

plt.xlabel('Avg Rating')  # set the x-axis label
plt.ylabel('Size of Category', rotation=39)  # set the y-axis label
ax.set_zlabel('Avg stock')  # set the z-axis label


Bar & Plot

In plt.bar(), x = supplies the positions of the bars (one per data item).

# benser and y are not defined in this article; presumably benser is a Series of
# observed first-digit counts and y holds the expected Benford frequencies
plt.bar(x=range(len(benser)), height=benser, label='Data', tick_label=benser.index)

plt.xlabel("First Digit")
plt.ylabel("Data")

# share the same x-axis with the bar chart above
ax2 = plt.twinx()

# set the value range of the y-axis
ax2.set_ylim([0.04, 0.33])
plt.plot(range(len(benser)), y, marker='.', color='goldenrod', linewidth=1, label="Benford's Law")

# show the legend and label each point on the line with its percentage
plt.legend(loc="upper right")
for a, b in zip(range(len(benser)), y):
    plt.text(a, b, '{:.2f}%'.format(b * 100), ha='center', va='bottom', fontsize=8)


How to get the author of the book by title

Introduction

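The idea is simple: find a suitable API.

The API should accept a book title (Title) as a parameter, and the data it returns must contain an author (Author) field.

After some exploration, the Google Books APIs turned out to be a workable solution, with the Goodreads.com search box as a fallback option.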

Google Books APIs

Refer to this webpage to find more information: Google Books APIs Getting Started

Google Books has a vision to digitize the world’s books. You can use the Google Books API to search content, organize an authenticated user’s personal library and modify it as well.

Books concepts

To handle the returned JSON data correctly, it helps to understand the following four basic concepts:

  • Volume: A volume represents the data that Google Books hosts about a book or magazine. It is the primary resource in the Books API. All other resources in this API either contain or annotate a volume.

  • Bookshelf: A bookshelf is a collection of volumes. Google Books provides a set of predefined bookshelves for each user, some of which are completely managed by the user, some of which are automatically filled in based on user’s activity, and some of which are mixed. Users can create, modify or delete other bookshelves, which are always filled with volumes manually. Bookshelves can be made private or public by the user.

    Note: Creating and deleting bookshelves as well as modifying privacy settings on bookshelves can currently only be done through the Google Books site.

  • Review: A review of a volume is a combination of a star rating and/or text. A user can submit one review per volume. Reviews are also available from outside sources and are attributed appropriately.

  • Reading Position: A reading position indicates the last read position in a volume for a user. A user can only have one reading position per volume. If the user has not opened that volume before, then the reading position does not exist. The reading position can store detailed position information down to the resolution of a word. This information is always private to the user.

Working with volumes

You can perform a volumes search by sending an HTTP GET request to the following URI:

https://www.googleapis.com/books/v1/volumes?q=search+terms

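For example, a search for the book "It’s Only the Himalayas" returns, if everything is configured correctly, a JSON document containing all the search results. In general, the first result is assumed to be the book we searched for.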
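A rough sketch of building that request URL and reading the first result, using only the standard library and no network access. The helper names build_search_url and first_authors and the hand-written sample_response are illustrative assumptions; the items / volumeInfo / authors fields follow the layout used by the code later in this article.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/books/v1/volumes"

def build_search_url(title):
    """Return the volumes search URL with the title URL-encoded."""
    return BASE_URL + "?" + urlencode({"q": title})

def first_authors(response_text):
    """Authors of the first search result, or [] if none are listed."""
    data = json.loads(response_text)
    items = data.get("items", [])
    if not items:
        return []
    return items[0].get("volumeInfo", {}).get("authors", [])

# Hand-written stand-in for a response body; real responses carry many more fields
sample_response = json.dumps({
    "items": [
        {"volumeInfo": {"title": "It's Only the Himalayas", "authors": ["S. Bedford"]}}
    ]
})

print(build_search_url("It's Only the Himalayas"))
print(first_authors(sample_response))
```

Encoding the title matters because spaces and apostrophes are not valid inside a query string as-is.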

Goodreads.com

Refer to this webpage to find more information: Goodreads.com

Discover and share books you love on Goodreads, the world’s largest site for readers and book recommendations!

Example request: search for "Test"

For the search request https://www.goodreads.com/search?q=`keyword`&qid=

simply put the keyword after q=.
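Keywords that contain spaces or symbols should be escaped before they are placed after q=. A minimal sketch, where the function name goodreads_search_url is a hypothetical helper:

```python
from urllib.parse import quote_plus

def goodreads_search_url(keyword):
    """Fill the keyword into the q= parameter, escaping spaces and symbols."""
    return "https://www.goodreads.com/search?q=" + quote_plus(keyword) + "&qid="

print(goodreads_search_url("Test"))
print(goodreads_search_url("A Summer In Europe"))
```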

Scrape

Goodreads.com may have anti-scraping measures. Sending browser-like request headers can help in some cases:

send_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8"
}

# headers must be passed by keyword: the second positional argument of requests.get is params
result = requests.get(main_url, headers=send_headers, proxies=proxies)

In some cases Goodreads cannot provide the correct author either, so a try/except block is needed to catch these failures.

Python Code

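Notes

A few things need attention when using these two approaches in Python:

  1. A proxy may cause the error request eof occurred in violation of protocol (_ssl.c:997)

    This mostly happens when the proxy tool runs in global mode and the proxy is not configured correctly in the Python script. Try the following fix: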

    import requests
    proxies = {
        'http': 'http://your_server:your_port',
        'https': 'http://your_server:your_port',
    }

    # pass proxies=proxies only for the requests that need the proxy
    result = requests.get(Full_API_Link, proxies=proxies)
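  2. In the JSON returned by the Google Books APIs, the title in a result's title field does not always match the local title field exactly. Python's built-in difflib library is used to do the matching.

  3. Sustained requests may cause some requests to fail, so try: except: is used to catch the exceptions and keep the program running.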
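The fuzzy matching mentioned in note 2 can be sketched as follows. This version lowercases both titles and calls ratio(), whereas the article's code compares the raw strings with quick_ratio(), a faster upper bound on the same score; the normalization and the helper name title_similarity are assumptions made for this example.

```python
import difflib

def title_similarity(a, b):
    """Similarity in [0, 1] between two titles, ignoring case and outer spaces."""
    a, b = a.strip().lower(), b.strip().lower()
    return difflib.SequenceMatcher(None, a, b).ratio()

THRESHOLD = 0.85  # the same cut-off used by the article's code

local_title = "A Summer In Europe"
candidates = ["a summer in europe", "Test"]

# keep only the candidates that are close enough to the local title
matches = [c for c in candidates if title_similarity(c, local_title) > THRESHOLD]
print(matches)
```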

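Simple implementation

Assumptions:

  1. The proxy tool on this computer is CFW, with global proxy turned off and the default port.

  2. The book can be requested normally; the JSON is traversed to find the best-matching title.

  3. When the Google Books APIs cannot return the author, Goodreads is tried next.

Code: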

import json
import time
import difflib

import requests
from bs4 import BeautifulSoup

proxies = {
    'http': 'http://localhost:7890',
    'https': 'http://localhost:7890',
}

def UrlToSoupAdvanced(Url: str):
    main_url = Url
    print(main_url)
    send_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
        "Connection": "keep-alive",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8"
    }

    try:
        # headers must be passed by keyword: the second positional argument is params
        result = requests.get(main_url, headers=send_headers, proxies=proxies)
    except requests.RequestException:
        time.sleep(5)  # pause briefly after a failed request
        return None    # signal the failure to the caller

    return BeautifulSoup(result.text, 'html.parser')

def TwoStrMatch(string1: str, string2: str):
    # quick_ratio() is a fast upper bound on the similarity ratio in [0, 1]
    result = difflib.SequenceMatcher(None, string1, string2).quick_ratio()
    print(string1 + "====" + string2 + "====" + str(result))
    return result

def GetAuthorByTitleUsingGoogleBooksAPI(Title: str):
    Google_Book_APIs_Head = "https://www.googleapis.com/books/v1/volumes?q="
    API_KEY = "AIzaSyCdlpdS8EWgKIN6EW95fwjiLqDkkiIA8Pg"  # not used by the request below
    Search_Data = Title

    #https://www.googleapis.com/books/v1/volumes?q=A%20Summer%20In%20Europe+intitle:keyes&key=AIzaSyCdlpdS8EWgKIN6EW95fwjiLqDkkiIA8Pg
    #https://www.googleapis.com/books/v1/volumes?q=a summer in europe&printType=books&intitle:a summer in europe

    Full_API_Link = Google_Book_APIs_Head + Search_Data
    #Full_API_Link = Google_Book_APIs_Head + "It's_Only_the_Himalayas"

    print("The request link is " + Full_API_Link)
    try:
        result = requests.get(Full_API_Link, proxies=proxies)
    except requests.RequestException:
        time.sleep(5)
        return GetAuthorByTitleUsingGoodreadsSearch(Title)

    return_json_dict = json.loads(s=result.text)
    if 'items' in return_json_dict:
        return_items = return_json_dict['items']
    else:
        return GetAuthorByTitleUsingGoodreadsSearch(Title)

    # go through the returned search results
    for book in return_items:
        current_volumn = book['volumeInfo']
        current_title = current_volumn['title']

        # skip results that list no authors
        if 'authors' not in current_volumn:
            continue
        current_authors = current_volumn['authors']

        # if the title matches directly, return the authors
        if TwoStrMatch(current_title, Title) > 0.85:
            return ','.join(current_authors)

        # if the title plus the subtitle matches, return the authors
        if 'subtitle' in current_volumn:
            if TwoStrMatch(current_title + current_volumn['subtitle'], Title) > 0.85:
                return ','.join(current_authors)

    # no result matched well enough, fall back to Goodreads
    return GetAuthorByTitleUsingGoodreadsSearch(Title)

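def GetAuthorByTitleUsingGoodreadsSearch(Title: str):
    Full_Search_Link = "https://www.goodreads.com/search?q=" + Title + "&qid="

    try:
        soup = UrlToSoupAdvanced(Full_Search_Link)
    except Exception:
        return "[Unknown][Goodreads]Fail to request the link"

    # anything other than a parsed page means the request failed
    if not isinstance(soup, BeautifulSoup):
        return "[Unknown][Goodreads]Fail to request the link"

    try:
        author = soup.find('span', itemprop="author").div.a.span.string
    except AttributeError:
        # the expected tags are missing from the page
        return "[Unknown][Goodreads]No correct Author"

    if author is None:
        return "[Unknown][Goodreads]No correct Author"
    return author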

Install Anaconda & Jupyter Notebook

0. Intro

Anaconda offers the easiest way to perform Python/R data science and machine learning on a single machine.

Jupyter Notebook is the original web application for creating and sharing computational documents.

More information on this topic can be found at: Installing Python · cs109/content Wiki · GitHub

1. Install Anaconda

Official download link: Anaconda | Anaconda Distribution

For Windows users:

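  1. Click the green button to download. If the download is too slow, consider a mirror such as 清华大学开源软件镜像站 | Tsinghua Open Source Mirror, or right-click the link and download it with Xunlei.

  2. When the download finishes, open the installer and click Next through the default options. The installation takes about five to six minutes.

  3. After a successful installation, open the Start menu (Win+S), search for Anaconda, and open it.

  4. The first launch usually takes a while. If it prompts for an update, install the update.

  5. Launch Jupyter Notebook.

  6. Normally a browser window opens and shows the current directory.

  7. Add .ipynb files to that directory and click one to open it in the browser.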

Install Libraries

[Figure: screenshot of installing libraries]
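After installing, one quick way to confirm that the libraries are visible to Python is to look them up without importing them. This is only a sketch; the helper name missing_packages is hypothetical, and the list contains the import names used by the code in this article.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used by the code in this article
needed = ["pandas", "matplotlib", "requests", "bs4"]
print("Missing:", missing_packages(needed))
```

An empty list means every package was found; any name that is printed still needs to be installed.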

Change the working directory

If the Jupyter configuration file jupyter_notebook_config.py has not been generated yet, open Anaconda Prompt and enter:

jupyter notebook --generate-config

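The configuration file is generated under the corresponding path, in this case C:\Users\xiaonan\.jupyter

Open this configuration file and find the c.NotebookApp.notebook_dir setting. Put the target path on the right-hand side of the equals sign and remove the comment marker # at the start of the line (the r prefix keeps the backslashes literal):
c.NotebookApp.notebook_dir = r"D:\XJTLU\INT303\Labs"

Then restart Jupyter Notebook.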
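The configuration file is ordinary Python, so backslashes in a Windows path follow Python's string escape rules. A short sketch of the difference, using a made-up path D:\notes\temp chosen because \n and \t are real escape sequences:

```python
plain = "D:\notes\temp"    # \n and \t are read as newline and tab
raw = r"D:\notes\temp"     # a raw string keeps every backslash literally
forward = "D:/notes/temp"  # forward slashes avoid the escaping issue altogether

print(repr(plain))
print(repr(raw))
print(forward)
```

Writing the path as a raw string, or with forward slashes, keeps it from being silently altered by an escape sequence.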