To query an NCBI database effectively, you'll want to learn about certain E-utilities, define your search fields, and select your search parameters, which control how results are returned to your browser. In our case, we'll use Python to query the databases.
The four most useful E-utilities
There are nine E-utilities available from NCBI, and they are all implemented as server-side fast CGI programs. This means you can access them by creating URLs that end in .cgi and specify query parameters after a question mark, with parameters separated by ampersands. All of them, apart from EFetch, will give you either XML or JSON output.
ESearch
generates a list of ID numbers that meet your search query

The following E-utilities can be used with one or more ID numbers:

ESummary
journal, author list, grants, dates, references, publication type

EFetch
**XML ONLY** all of what ESummary provides, in addition to an abstract, the list of grants used in the research, institutions of authors, and MeSH keywords

ELink
provides a list of links to related citations using a computed similarity score, as well as a link to the published item [your gateway to the full text of the article]
The NCBI hosts 38 databases across their servers, covering a wide range of data that goes beyond literature citations. To get a complete list of current databases, you can use EInfo without search terms:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
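As a sketch of what that call returns, here is how you might build the EInfo URL and parse the database list. The JSON sample below is truncated and illustrative, not a real response; the live reply lists all 38 databases:

```python
import json
import urllib.parse

# Build an EInfo URL; retmode=json asks for JSON instead of the default XML.
base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi'
einfo_url = base + '?' + urllib.parse.urlencode({'retmode': 'json'})

# A truncated, illustrative example of the JSON shape EInfo returns.
sample_response = '{"einforesult": {"dblist": ["pubmed", "pmc", "protein"]}}'
db_list = json.loads(sample_response)['einforesult']['dblist']
```

To query the live service, you would fetch einfo_url with urllib.request.urlopen and parse the body the same way.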
Each database varies in how it can be accessed and the data it returns. For our purposes, we'll concentrate on the pubmed and pmc databases, because these are where scientific literature is searched and retrieved.
The two most important things to learn about searching NCBI are search fields and outputs. The search fields are numerous and depend on the database. The outputs are more straightforward, and learning how to use them is important, especially for large searches.
Search fields
You won't be able to fully harness the potential of the E-utilities without knowing about the available search fields. You can find a full list of these search fields on the NLM website along with a description of each, but for the most accurate list of search terms specific to a database, you'll want to parse your own XML list using this link:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed
with the db flag set to the database (we'll use pubmed for this article, but literature is also available through pmc).
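As a sketch of that parsing step, here is how you might extract the search-field names from the EInfo XML using the standard library's ElementTree. The XML sample is truncated and made up for illustration; the real FieldList carries many more Field entries and attributes:

```python
import xml.etree.ElementTree as ET

# A truncated, illustrative slice of the XML that einfo.fcgi?db=pubmed returns.
sample_xml = """<eInfoResult><DbInfo><FieldList>
<Field><Name>MESH</Name><FullName>MeSH Terms</FullName></Field>
<Field><Name>AUTH</Name><FullName>Author</FullName></Field>
</FieldList></DbInfo></eInfoResult>"""

root = ET.fromstring(sample_xml)
# Map each field's short name to its human-readable name
search_fields = {f.findtext('Name'): f.findtext('FullName') for f in root.iter('Field')}
```

The short names (MESH, AUTH, and so on) are what you place in square brackets after a search term, as in myoglobin[mesh].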
One especially useful search field is the Medical Subject Headings (MeSH).[3] Indexers, who are experts in the field, maintain the PubMed database and use MeSH terms to reflect the subject matter of journal articles as they are published. Each indexed publication is typically described by 10 to 12 carefully chosen MeSH terms. If no search fields are specified, queries will be executed against every search field available in the queried database.[4]
Query parameters
Each of the E-utilities accepts multiple query parameters through the URL, which you can use to control the type and amount of output returned from a query. This is where you can set the number of search results retrieved or the dates searched. Here is a list of the more important parameters:
Database parameter:
db
must be set to the database you are interested in searching (pubmed or pmc for scientific literature)
Date parameters: You can get finer control over dates by using search fields, such as [pdat] for the publication date, but date parameters provide a more convenient way to constrain results.

reldate
the number of days to be searched relative to the current date; set reldate=1 for the most recent day

mindate and maxdate
specify dates according to the format YYYY/MM/DD, YYYY, or YYYY/MM (a query must contain both mindate and maxdate parameters)

datetype
sets the type of date when you query by date; options are 'mdat' (modification date), 'pdat' (publication date), and 'edat' (Entrez date)
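As a small sketch, here is how those date parameters fit into an ESearch URL. The term and date range are made up for illustration, and the URL is only constructed, not requested:

```python
# Constrain a query to publication dates in the first half of 2022.
base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
date_url = (f'{base}?db=pubmed&term=myoglobin'
            f'&mindate=2022/01&maxdate=2022/06&datetype=pdat')
```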
Retrieval parameters:

rettype
the type of information to return (for literature searches, use the default setting)

retmode
format of the output (XML is the default, though all E-utilities except EFetch support JSON)

retmax
the maximum number of records to return; the default is 20 and the maximum value is 10,000 (ten thousand)

retstart
given a list of hits for a query, retstart specifies the index of the first record to return (useful when your search exceeds the ten thousand maximum)

cmd
only relevant to ELink; used to specify whether to return IDs of similar articles or URLs to full texts
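The interplay of retmax and retstart can be sketched as follows. The hit total of 25,000 is made up for illustration, and no request is sent; note that in practice the server may still cap how deep into a result set you can page:

```python
# Sketch of paging with retstart: each URL asks for the next block of retmax records.
base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
term = 'myoglobin[mesh]'
retmax = 10000
total = 25000  # suppose ESearch reported this many hits

page_urls = [f'{base}?db=pubmed&term={term}&retmax={retmax}&retstart={start}'
             for start in range(0, total, retmax)]
```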
Once we know about the E-utilities, have chosen our search fields, and decided on query parameters, we're ready to execute queries and store the results, even across multiple pages.
While you don't strictly need Python to use the E-utilities, it makes it much easier to parse, store, and analyze the results of your queries. Here's how to get started on your data science project.
Let's say you want to search MeSH terms for "myoglobin" between 2022 and 2023. You'll set retmax to 50 for now, but remember the maximum is 10,000 and you can query at a rate of 3 requests per second.
import urllib.request

search_url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
              '?db=pubmed'
              '&term=myoglobin[mesh]'
              '&mindate=2022'
              '&maxdate=2023'
              '&retmode=json'
              '&retmax=50')
link_list = urllib.request.urlopen(search_url).read().decode('utf-8')
link_list
The results are returned as a list of IDs, which can be used in a subsequent search within the database you queried. Note that "count" shows there are 154 results for this query, which you could use if you wanted a total count of publications for a certain set of search terms. If you wanted to return IDs for all of the publications, you'd set the retmax parameter to the count, or 154. In general, I set this to a very high number so I can retrieve all of the results and store them.
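Extracting that count can be sketched on a mock response. The JSON below is truncated and illustrative (the real reply carries additional fields such as retstart and translation information):

```python
import json

# An illustrative, truncated ESearch JSON response.
sample = '{"esearchresult": {"count": "154", "retmax": "50", "idlist": ["37047528", "37055458"]}}'
parsed = json.loads(sample)
total_count = int(parsed['esearchresult']['count'])
```

With total_count in hand, you could rerun the search with retmax set to that value to retrieve every ID.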
Boolean searching is easy with PubMed; it only requires adding +OR+, +NOT+, or +AND+ to the URL between search terms. For example:
http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=pubmed&term=CEO[cois]+OR+CTO[cois]+OR+CSO[cois]&mindate=2022&maxdate=2023&retmax=10000
These search strings can be constructed using Python. In the next steps, we'll parse the results using Python's json package to get the IDs for each of the publications returned. The IDs can then be used to create a string; this string of IDs can be used by the other E-utilities to return information about the publications.
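Constructing such a Boolean query in Python is a one-liner with str.join. The terms here match the example URL above:

```python
# Join field-qualified terms into a Boolean OR expression for the term parameter.
terms = ['CEO[cois]', 'CTO[cois]', 'CSO[cois]']
term_string = '+OR+'.join(terms)
boolean_url = (f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
               f'?db=pubmed&term={term_string}&mindate=2022&maxdate=2023&retmax=10000')
```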
Use ESummary to return details about publications
The purpose of ESummary is to return data that you might expect to see in a paper's citation (date of publication, page numbers, authors, etc.). Once you have a result in the form of a list of IDs from ESearch (in the step above), you can join this list into a long URL.
The limit for a URL is 2048 characters, and each publication's ID is 8 characters long, so to be safe, you should split your list of IDs into batches of 250 if you have more than 250 IDs. See my notebook at the bottom of the article for an example.
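That batching step can be sketched as a simple slicing helper; the IDs below are made up for illustration:

```python
# Split an ID list into batches of at most 250 so each joined URL stays short.
def batch_ids(id_list, batch_size=250):
    return [id_list[i:i + batch_size] for i in range(0, len(id_list), batch_size)]

fake_ids = [str(37000000 + i) for i in range(600)]  # 600 made-up 8-digit IDs
batches = batch_ids(fake_ids)
```

Each batch can then be joined with commas and substituted into the id parameter of an ESummary or EFetch URL.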
The results from ESummary are returned in JSON format and may include a link to the paper's full text:
import json

result = json.loads(link_list)
id_list = ','.join(result['esearchresult']['idlist'])
summary_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id={id_list}&retmode=json'
summary_list = urllib.request.urlopen(summary_url).read().decode('utf-8')
We can again use json to parse summary_list. When using the json package, you can browse the fields of each individual article by using summary['result'][id as string], as in the example below:
summary = json.loads( summary_list )
summary['result']['37047528']
We can create a dataframe to capture, for each article, the ID along with the name of the journal, the publication date, the title of the article, a URL for retrieving the full text, and the first and last author.
import re
import pandas as pd

uid = [x for x in summary['result'] if x != 'uids']
journals = [summary['result'][x]['fulljournalname'] for x in uid]
titles = [summary['result'][x]['title'] for x in uid]
first_authors = [summary['result'][x]['sortfirstauthor'] for x in uid]
last_authors = [summary['result'][x]['lastauthor'] for x in uid]
pubdates = [summary['result'][x]['pubdate'] for x in uid]
links = [summary['result'][x]['elocationid'] for x in uid]
links = [re.sub(r'doi:\s*', 'http://dx.doi.org/', x) for x in links]

results_df = pd.DataFrame({'ID': uid, 'Journal': journals, 'PublicationDate': pubdates,
                           'Title': titles, 'URL': links,
                           'FirstAuthor': first_authors, 'LastAuthor': last_authors})
Below is a list of all of the different fields that ESummary returns, so you can build your own database:
'uid','pubdate','epubdate','source','authors','lastauthor','title',
'sorttitle','volume','issue','pages','lang','nlmuniqueid','issn',
'essn','pubtype','recordstatus','pubstatus','articleids','history',
'references','attributes','pmcrefcount','fulljournalname','elocationid',
'doctype','srccontriblist','booktitle','medium','edition',
'publisherlocation','publishername','srcdate','reportnumber',
'availablefromurl','locationlabel','doccontriblist','docdate',
'bookname','chapter','sortpubdate','sortfirstauthor','vernaculartitle'
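A generic way to pull any subset of those fields is a dict comprehension with a default for fields a particular record lacks. The record below is mocked for illustration:

```python
# Pull a chosen subset of ESummary fields from one record, tolerating missing keys.
wanted = ['uid', 'pubdate', 'fulljournalname', 'title', 'sortfirstauthor', 'lastauthor']
record = {'uid': '37047528', 'title': 'An example title', 'fulljournalname': 'Example Journal'}
row = {field: record.get(field, '') for field in wanted}
```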
Use EFetch when you want abstracts, keywords, and other details (XML output only)
We can use EFetch to return similar fields to ESummary, with the caveat that the result is returned in XML only. There are several interesting additional fields in EFetch, including the abstract, author-selected keywords, the Medical Subject Headings (MeSH terms), grants that sponsored the research, conflict of interest statements, a list of chemicals used in the research, and a complete list of all of the references cited by the paper. Here's how you would use BeautifulSoup to obtain some of these items:
from bs4 import BeautifulSoup
import lxml
import pandas as pd

abstract_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={id_list}'
abstract_ = urllib.request.urlopen(abstract_url).read().decode('utf-8')
abstract_bs = BeautifulSoup(abstract_, features="xml")
articles_iterable = abstract_bs.find_all('PubmedArticle')

# Abstracts
abstract_texts = [x.find('AbstractText').text if x.find('AbstractText') is not None else '' for x in articles_iterable]

# Conflict of interest statements
coi_texts = [x.find('CoiStatement').text if x.find('CoiStatement') is not None else '' for x in articles_iterable]

# MeSH terms
meshheadings_all = list()
for article in articles_iterable:
    if article.find('MeshHeadingList') is not None:
        result = article.find('MeshHeadingList').find_all('MeshHeading')
        meshheadings_all.append([x.text for x in result])
    else:
        meshheadings_all.append([])

# Reference lists
references_all = list()
for article in articles_iterable:
    if article.find('ReferenceList') is not None:
        result = article.find('ReferenceList').find_all('Citation')
        references_all.append([x.text for x in result])
    else:
        references_all.append([])

results_table = pd.DataFrame({'COI': coi_texts, 'Abstract': abstract_texts,
                              'MeSH_Terms': meshheadings_all, 'References': references_all})
Now we can use this table to search abstracts and conflict of interest statements, or make visuals that connect different fields of research using MeSH headings and reference lists. There are of course many other tags returned by EFetch that you could explore; here's how you can see all of them using BeautifulSoup:
efetch_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={id_list}'
efetch_result = urllib.request.urlopen(efetch_url).read().decode('utf-8')
efetch_bs = BeautifulSoup(efetch_result, features="xml")

tags = efetch_bs.find_all()
for tag in tags:
    print(tag)
Use ELink to retrieve similar publications and full-text links
You may want to find articles similar to those returned by your search query. These articles are grouped according to a similarity score computed with a probabilistic topic-based model.[5] To retrieve the similarity scores for a given ID, you must pass cmd=neighbor_score in your call to ELink. Here's an example for one article:
import urllib.request
import json
import pandas as pd

id_ = '37055458'
elink_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=neighbor_score'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads(elinks)

ids_ = []
scores_ = []
all_links = elinks_json['linksets'][0]['linksetdbs'][0]['links']
for link in all_links:
    ids_.append(link['id'])
    scores_.append(link['score'])

pd.DataFrame({'id': ids_, 'score': scores_}).drop_duplicates(['id', 'score'])
The other function of ELink is to provide full-text links to an article based on its ID, which are returned if you pass cmd=prlinks to ELink instead.
If you wish to access only those full-text links that are free to the public, you'll want to use links that contain "pmc" (PubMed Central). Accessing articles behind a paywall may require a subscription through a university; before downloading a large corpus of full-text articles through a paywall, you should consult with your organization's librarians.
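That filtering step can be sketched as below. The objurls list mirrors the structure of the prlinks JSON used in the next snippets, but its values are made up for illustration:

```python
# Keep only full-text links that point to PubMed Central (free to read).
objurls = [
    {'url': {'value': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10000000/'}},
    {'url': {'value': 'https://doi.org/10.1000/example'}},
]
free_links = [x['url']['value'] for x in objurls if 'pmc' in x['url']['value'].lower()]
```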
Here's a code snippet showing how you could retrieve the links for a publication:
id_ = '37055458'
elink_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads(elinks)

[x['url']['value'] for x in elinks_json['linksets'][0]['idurllist'][0]['objurls']]
You can also retrieve links for multiple publications in a single call to ELink, as I show below:
id_list = '37055458,574140'
elink_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_list}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads(elinks)
elinks_json

urls_ = elinks_json['linksets'][0]['idurllist']
for url_ in urls_:
    [print(url_['id'], x['url']['value']) for x in url_['objurls']]
Occasionally, a scientific publication is authored by someone who is a CEO, CSO, or CTO of a company. With PubMed, we have the ability to analyze the latest life science industry trends. Conflict of interest statements, which were introduced as a search term in PubMed in 2017,[6] give a lens into which author-provided keywords appear in publications where an industry executive is disclosed as an author; in other words, the keywords chosen by the authors to describe their findings. To perform this analysis, simply include CEO[cois]+OR+CSO[cois]+OR+CTO[cois] as a search term in your URL, retrieve all of the results returned, and extract the keywords from the resulting XML output for each publication. Each publication contains between 4 and 8 keywords. Once the corpus is generated, you can quantify keyword frequency per year within the corpus as the number of publications in a year specifying a keyword, divided by the number of publications for that year.
For example, if 10 publications list the keyword "cancer" and there are 1,000 publications that year, the keyword frequency would be 0.01. Using the seaborn clustermap module with the keyword frequencies, you can generate a visualization where darker bands indicate a larger value of keyword frequency per year (I have dropped COVID-19 and SARS-COV-2 from the visualization, as they were each represented at frequencies far greater than 0.05, predictably).
From this visualization, several insights about the corpus of publications with C-suite authors become clear. First, one of the most distinct clusters (at the bottom) contains keywords that have been strongly represented in the corpus for the past five years: cancer, machine learning, biomarkers, and artificial intelligence, to name just a few. Clearly, industry is highly active and publishing in these areas. A second cluster, near the middle of the figure, shows keywords that disappeared from the corpus after 2018, including physical activity, public health, children, mass spectrometry, and mhealth (or mobile health). This is not to say that these areas aren't being developed in industry, just that the publication activity has slowed. Looking at the bottom right of the figure, you can extract terms that have appeared more recently in the corpus, including liquid biopsy and precision medicine, which are indeed two very "hot" areas of medicine at the moment. By examining the publications further, you could extract the names of the companies and other information of interest. Below is the code I wrote to generate this visual:
import pandas as pd
import time
from bs4 import BeautifulSoup
import seaborn as sns
from matplotlib import pyplot as plt
import itertools
from collections import Counter
from numpy import array_split
from urllib.request import urlopen

class Searcher:
    # Any instance of Searcher will search for the terms and return the number of results on a per-year basis
    def __init__(self, start_, end_, term_, **kwargs):
        self.name_ = 'searcher'
        self.description_ = 'searcher'
        self.duration_ = end_ - start_
        self.start_ = start_
        self.end_ = end_
        self.term_ = term_
        self.search_results = list()
        self.count_by_year = list()
        self.options = list()
        # Parse keyword arguments
        if 'count' in kwargs and kwargs['count'] == 1:
            self.options = 'rettype=count'
        if 'retmax' in kwargs:
            self.options = f'retmax={kwargs["retmax"]}'
        if 'run' in kwargs and kwargs['run'] == 1:
            self.do_search()
            self.parse_results()

    def do_search(self):
        datestr_ = [self.start_ + x for x in range(self.duration_)]
        options = "".join(self.options)
        for year in datestr_:
            this_url = (f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
                        f'?db=pubmed&term={self.term_}'
                        f'&mindate={year}&maxdate={year + 1}&{options}')
            print(this_url)
            self.search_results.append(
                urlopen(this_url).read().decode('utf-8'))
            time.sleep(.33)

    def parse_results(self):
        for result in self.search_results:
            xml_ = BeautifulSoup(result, features="xml")
            self.count_by_year.append(xml_.find('Count').text)
            self.ids = [id.text for id in xml_.find_all('Id')]

    def __repr__(self):
        return repr(f'Search PubMed from {self.start_} to {self.end_} with search terms {self.term_}')

    def __str__(self):
        return self.description_
# Create a list that will contain Searchers, which retrieve results for each of the search queries
searchers = list()
searchers.append(Searcher(2022, 2023, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2021, 2022, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2020, 2021, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2019, 2020, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2018, 2019, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))

# Create a dictionary to store keywords for all articles from a specific year
keywords_dict = dict()

# Each Searcher obtained results for a specific start and end year
# Iterate over the Searchers
for this_search in searchers:
    # Split the results from one search into batches for URL formatting
    chunk_size = 200
    batches = array_split(this_search.ids, len(this_search.ids) // chunk_size + 1)
    # Create a dict key for this Searcher object based on the years of coverage
    this_dict_key = f'{this_search.start_}to{this_search.end_}'
    # Each value in the dictionary will be a list that gets appended with the keywords for each article
    keywords_all = list()
    for this_batch in batches:
        ids_ = ','.join(this_batch)
        # Pull down the XML for all of the results in a batch
        abstract_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={ids_}'
        abstract_ = urlopen(abstract_url).read().decode('utf-8')
        abstract_bs = BeautifulSoup(abstract_, features="xml")
        articles_iterable = abstract_bs.find_all('PubmedArticle')
        # Iterate over all of the articles in the batch
        for article in articles_iterable:
            result = article.find_all('Keyword')
            if result is not None:
                keywords_all.append([x.text for x in result])
            else:
                keywords_all.append([])
        # Take a break between batches!
        time.sleep(1)
    # Once all of the keywords are assembled for a Searcher, add them to the dictionary
    keywords_dict[this_dict_key] = keywords_all
    # Print the key once its keywords have been stored
    print(this_dict_key)
# Limit to words that appeared approximately five times or more in any given year
mapping_ = {'2018to2019': 2018, '2019to2020': 2019, '2020to2021': 2020, '2021to2022': 2021, '2022to2023': 2022}
global_word_list = list()
for key_, value_ in keywords_dict.items():
    Ntitles = len(value_)
    flattened_list = list(itertools.chain(*value_))
    flattened_list = [x.lower() for x in flattened_list]
    counter_ = Counter(flattened_list)
    words_this_year = [(item, frequency / Ntitles, mapping_[key_]) for item, frequency in counter_.items() if frequency / Ntitles >= .005]
    global_word_list.extend(words_this_year)
# Plot the results as a clustermap
global_word_df = pd.DataFrame(global_word_list)
global_word_df.columns = ['word', 'frequency', 'year']
pivot_df = global_word_df.loc[:, ['word', 'year', 'frequency']].pivot(index="word", columns="year",
                                                                      values="frequency").fillna(0)
pivot_df.drop('covid-19', axis=0, inplace=True)
pivot_df.drop('sars-cov-2', axis=0, inplace=True)
sns.set(font_scale=0.7)
plt.figure(figsize=(22, 2))
res = sns.clustermap(pivot_df, col_cluster=False, yticklabels=True, cbar=True)
After reading this article, you should be ready to go from crafting highly tailored search queries of the scientific literature all the way to generating data visualizations for closer scrutiny. While there are other, more complex ways to access and store articles using additional features of the various E-utilities, I have tried to present the most straightforward set of operations that should apply to most use cases for a data scientist interested in scientific publishing trends. By familiarizing yourself with the E-utilities as I have presented here, you will go far toward understanding the trends and connections within the scientific literature. As mentioned, there are many items beyond publications that can be unlocked by mastering the E-utilities and how they operate within the larger universe of NCBI databases.