How to get the author of a book by its title

Introduction

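The idea is simple: find an API.

The API takes the book title (Title) as a parameter, and the data it returns must include an author (Author) field.

After some exploration, the Google Books APIs turned out to be a workable solution, and the search box on Goodreads.com can serve as a fallback.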

Google Books APIs

Refer to this webpage to find more information: Google Books APIs Getting Started

Google Books has a vision to digitize the world’s books. You can use the Google Books API to search content, organize an authenticated user’s personal library and modify it as well.

Books concepts

To handle the JSON returned later correctly, you should understand the following four basic concepts:

  • Volume: A volume represents the data that Google Books hosts about a book or magazine. It is the primary resource in the Books API. All other resources in this API either contain or annotate a volume.

  • Bookshelf: A bookshelf is a collection of volumes. Google Books provides a set of predefined bookshelves for each user, some of which are completely managed by the user, some of which are automatically filled in based on user’s activity, and some of which are mixed. Users can create, modify or delete other bookshelves, which are always filled with volumes manually. Bookshelves can be made private or public by the user.

    Note: Creating and deleting bookshelves as well as modifying privacy settings on bookshelves can currently only be done through the Google Books site.

  • Review: A review of a volume is a combination of a star rating and/or text. A user can submit one review per volume. Reviews are also available from outside sources and are attributed appropriately.

  • Reading Position: A reading position indicates the last read position in a volume for a user. A user can only have one reading position per volume. If the user has not opened that volume before, then the reading position does not exist. The reading position can store detailed position information down to the resolution of a word. This information is always private to the user.

Working with volumes

You can perform a volumes search by sending an HTTP GET request to the following URI:

https://www.googleapis.com/books/v1/volumes?q=search+terms

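For example, search for the book “It’s Only the Himalayas”. If everything is configured correctly, you get back a JSON document containing all the search results. In general, the first result is taken to be the book we searched for.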
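As a rough sketch of the request above, the snippet below builds the search URI and reads the authors of the first result. The helper names `build_volumes_url` and `first_result_authors` are hypothetical, and `sample_response` is a hand-written stand-in that only mirrors the `items` / `volumeInfo` / `authors` layout used in the code later in this post. It is not real API output, and the author name in it is a placeholder.

```python
from urllib.parse import urlencode

VOLUMES_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def build_volumes_url(title: str) -> str:
    """Return the volumes search URI with the title URL-encoded."""
    return VOLUMES_ENDPOINT + "?" + urlencode({"q": title})


def first_result_authors(response: dict) -> list:
    """Return the authors of the first search result, or [] if absent."""
    items = response.get("items", [])
    if not items:
        return []
    return items[0].get("volumeInfo", {}).get("authors", [])


# Hand-written placeholder data, shaped like the fields this post relies on
sample_response = {
    "items": [
        {"volumeInfo": {"title": "It's Only the Himalayas",
                        "authors": ["Example Author"]}},
    ]
}

print(build_volumes_url("It's Only the Himalayas"))
print(first_result_authors(sample_response))
```

Encoding the title with `urlencode` keeps spaces and apostrophes from producing a malformed query string.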

Goodreads.com

Refer to this webpage to find more information: Goodreads.com

Discover and share books you love on Goodreads, the world’s largest site for readers and book recommendations!

Example request: search for “Test”

For the search request https://www.goodreads.com/search?q=`keyword`&qid=, simply put the keyword after q=.
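A minimal sketch of filling in the keyword, assuming the URL pattern above. The function name `build_goodreads_search_url` is hypothetical. `quote_plus` is used so that titles containing spaces or punctuation still form a valid URL.

```python
from urllib.parse import quote_plus


def build_goodreads_search_url(keyword: str) -> str:
    """Place the URL-encoded keyword after q= in the Goodreads search URL."""
    return "https://www.goodreads.com/search?q=" + quote_plus(keyword) + "&qid="


print(build_goodreads_search_url("Test"))
print(build_goodreads_search_url("A Summer in Europe"))
```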

Scrape

Goodreads.com may have anti-scraping measures. Sending browser-like request headers can help in some cases:

send_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8"
}

# headers must be passed by keyword: the second positional argument of requests.get is params
# main_url and proxies are defined elsewhere in the script
result = requests.get(main_url, headers=send_headers, proxies=proxies)

In some cases Goodreads cannot return the correct author either, so the lookup should be wrapped in try/except to catch the failure.
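The try/except pattern mentioned above can be sketched as follows. The wrapper `safe_author_lookup` and the two stand-in lookup functions are hypothetical and exist only to show the control flow without touching the network. The marker string is the one used in the code later in this post.

```python
from typing import Callable, Optional

UNKNOWN_AUTHOR = "[Unknown][Goodreads]No correct Author"


def safe_author_lookup(lookup: Callable[[str], Optional[str]], title: str) -> str:
    """Call lookup(title) and return a marker string instead of raising."""
    try:
        author = lookup(title)
    except Exception:
        # Network errors or an unexpected page layout both end up here
        return UNKNOWN_AUTHOR
    if not author:
        return UNKNOWN_AUTHOR
    return str(author)


def broken_lookup(title: str) -> Optional[str]:
    # Stand-in for a parse that fails because an expected HTML tag is missing
    raise AttributeError("'NoneType' object has no attribute 'div'")


def working_lookup(title: str) -> Optional[str]:
    # Stand-in for a successful lookup; the name is a placeholder
    return "Example Author"


print(safe_author_lookup(broken_lookup, "Test"))
print(safe_author_lookup(working_lookup, "Test"))
```

Returning a marker string rather than raising lets a long batch of titles keep running when a single lookup fails.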

Python Code

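Notes

When using these two approaches in Python, keep a few things in mind:

  1. A proxy may cause the error request eof occurred in violation of protocol (_ssl.c:997)

    This mostly happens when the proxy tool runs in global proxy mode and the proxy is not configured correctly in the Python script. Try the following fix: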

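    import requests

    proxies = {
        'http': 'http://your_server:your_port',
        'https': 'http://your_server:your_port',
    }

    # Pass proxies=proxies only for requests that need the proxy
    # Full_API_Link is the request URL built elsewhere in the script
    result = requests.get(Full_API_Link, proxies=proxies)

  2. In the JSON returned by the Google Books APIs, the title under a result’s title field does not always match the local title exactly. Python’s built-in difflib library can be used to do the matching.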

  3. Sending requests continuously may cause some of them to fail, so use try: except: to catch the exceptions and keep the program running.
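The title matching from note 2 can be sketched as follows. The helper names `title_similarity` and `best_match` are hypothetical. The 0.85 threshold is the one used in the code later in this post. Lowercasing both strings before comparing is an extra assumption made here, not something the original code does.

```python
import difflib
from typing import List, Optional


def title_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1], ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(local_title: str, candidates: List[str],
               threshold: float = 0.85) -> Optional[str]:
    """Return the candidate closest to local_title, or None if none is close enough."""
    best = max(candidates,
               key=lambda c: title_similarity(local_title, c),
               default=None)
    if best is None or title_similarity(local_title, best) < threshold:
        return None
    return best


print(best_match("A Summer in Europe", ["A Summer In Europe", "Winter in Asia"]))
print(best_match("A Summer in Europe", ["Completely Different"]))
```

Picking the best-scoring candidate, rather than the first one above the threshold, avoids returning an earlier near-match when an exact match appears further down the results.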

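A simple implementation

Assumptions:

  1. The proxy tool on this machine is CFW, with global proxy turned off and the default port.

  2. The book can be requested normally, and the JSON is traversed to find the best-matching title.

  3. When the Google Books APIs cannot provide the author, Goodreads is tried next.

Code: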

import difflib
import json
import time
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup

# Local proxy (CFW, default port); adjust or remove as needed
proxies = {
    'http': 'http://localhost:7890',
    'https': 'http://localhost:7890',
}

def UrlToSoupAdvanced(Url: str):
    """Fetch Url with browser-like headers; return a BeautifulSoup object, or None on failure."""
    main_url = Url
    print(main_url)
    send_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
        "Connection": "keep-alive",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8"
    }

    try:
        # headers must be passed by keyword; the second positional argument is params
        result = requests.get(main_url, headers=send_headers, proxies=proxies)
    except requests.RequestException:
        time.sleep(5)
        return None

    return BeautifulSoup(result.text, 'html.parser')

def TwoStrMatch(string1: str, string2: str):
    # quick_ratio() is a fast upper bound on ratio(), in the range [0, 1]
    result = difflib.SequenceMatcher(None, string1, string2).quick_ratio()
    print(string1 + "====" + string2 + "====" + str(result))
    return result

def GetAuthorByTitleUsingGoogleBooksAPI(Title: str):
    Google_Book_APIs_Head = "https://www.googleapis.com/books/v1/volumes?q="
    # Placeholder; the key is not used by the request below
    API_KEY = "YOUR_API_KEY"
    # URL-encode the title so spaces and punctuation do not break the query
    Search_Data = quote(Title)

    # Other query forms tried during exploration:
    # https://www.googleapis.com/books/v1/volumes?q=A%20Summer%20In%20Europe+intitle:keyes&key=YOUR_API_KEY
    # https://www.googleapis.com/books/v1/volumes?q=a summer in europe&printType=books&intitle:a summer in europe

    Full_API_Link = Google_Book_APIs_Head + Search_Data
    # Full_API_Link = Google_Book_APIs_Head + "It's_Only_the_Himalayas"

    print("The request link is " + Full_API_Link)
    try:
        result = requests.get(Full_API_Link, proxies=proxies)
        return_json_dict = json.loads(result.text)
    except (requests.RequestException, ValueError):
        # Request failed or the body was not valid JSON: fall back to Goodreads
        time.sleep(5)
        return GetAuthorByTitleUsingGoodreadsSearch(Title)

    if 'items' not in return_json_dict:
        return GetAuthorByTitleUsingGoodreadsSearch(Title)

    # For each returned search result
    for book in return_json_dict['items']:
        current_volumn = book['volumeInfo']

        # Skip results without an authors field
        if 'authors' not in current_volumn:
            continue
        current_title = current_volumn.get('title', '')
        current_authors = current_volumn['authors']

        # If the title matches directly, return the authors
        if TwoStrMatch(current_title, Title) > 0.85:
            return ','.join(current_authors)

        if 'subtitle' in current_volumn:
            # If the title plus the subtitle matches, return the authors
            if TwoStrMatch(current_title + current_volumn['subtitle'], Title) > 0.85:
                return ','.join(current_authors)

    # No result matched well enough: fall back to Goodreads
    return GetAuthorByTitleUsingGoodreadsSearch(Title)

def GetAuthorByTitleUsingGoodreadsSearch(Title: str):
    Full_Search_Link = "https://www.goodreads.com/search?q=" + quote(Title) + "&qid="

    soup = UrlToSoupAdvanced(Full_Search_Link)
    if soup is None:
        return "[Unknown][Goodreads]Fail to request the link"

    try:
        # Raises AttributeError if any tag in the chain is missing
        author = soup.find('span', itemprop="author").div.a.span.string
    except AttributeError:
        return "[Unknown][Goodreads]No correct Author"

    if author is None:
        return "[Unknown][Goodreads]No correct Author"
    return author

Author: TsingLoo
Posted on: 2022-10-17
Updated on: 2022-10-28
