Mapping research hot-spots

  • geostatsci.zip: bibliometric data for geostatistics; four tables with bibliographic records: gs (4724) - Google Scholar; sc (1577) - SCOPUS; wos (3153) - Web of Science; authors - manually entered list of authors (sorted by WoS relevance); plus a mask map of the world at 20 arcminutes resolution (886 KB)


Any researcher or research organization can nowadays be evaluated using web services such as Web of Science, SCOPUS, Google Scholar or similar (Meho and Yang, 2007). Objective measures such as the Citation Rate (the number of citations an author or a library item receives on average per year) can be used to identify the most influential authors, publications, and research institutes or organizations in the world. If the library items are linked to a geographical location, such data can also be used to generate maps of scientific productivity and excellence. Commercial scientific indexing companies could enhance their services by assigning geographical locations to library items, which would allow spatial exploration and analysis of bibliometric indices.

In this paper, we used statistical and spatial data analysis methods to analyze bibliometric indices in the research field of geostatistics. We obtained publications and their citation statistics from the Web of Science (WoS), SCOPUS and Google Scholar (GS), and focused on the citation rates (CR). We then attached geographic coordinates to each article using the contact author's address and Google's geocoding API. This allowed us to produce global density maps of citations, which can be used to detect areas of scientific excellence for a given field.

Here you can access the input data sets used in the analysis and view the main outputs. For more information, please read the original article:



Data retrieval

Geostatistics can be defined as a branch of statistics that specializes in the analysis and interpretation of spatially (and temporally) referenced data, with a special focus on features that are inherently continuous (fields) (Gotway Crawford and Young, 2008). In bibliographic terms, the field of geostatistics is best defined by listing a number of keywords that are unique to the field and can be associated with only a limited number of authors. Once we have determined these keywords, we can run queries on various databases to obtain all references belonging to that group. In the case of WoS, the query was:

 topic=(kriging OR variogram OR "spatial statistic" OR "spatial interpolation"
OR "spatial predict" OR "spatial sampling" OR geostatistic*)

and in the case of SCOPUS:

TITLE-ABS-KEY(kriging OR variogram OR "spatial statistic" OR "spatial
interpolation" OR "spatial predict" OR "spatial sampling" OR geostatistic*)

Once we retrieve the results of a query, we can sort them by relevance (the number of times the specified words appear in the text) and then export, for example, the first 2,000 entries from the list. This way we can be confident that we are processing representative articles. Google Scholar does not allow sorting by relevance, so we instead searched for citations with ANY of the words kriging, interpolation, and sampling, and with ALL of the words spatial, statistic* and variogram. This can be run efficiently using the Publish or Perish software provided by Anne-Wil Harzing (Harzing and van der Wal, 2007).

These queries returned 6,393 publications from WoS, 10,491 from SCOPUS and 5,389 from GS (compare with the results of Zhou et al. 2007). The WoS publications were sorted by relevance and the first 4,000 entries were exported; SCOPUS also sorts items by relevance, but limits the number of items that can be exported to 2,000. The exported records were then filtered and reorganized to allow for further statistical analysis and processing. The GS database is noisy and requires filtering before it can be used: we often found duplicate or triplicate entries for the same publication, as well as many records with misspelled authors' names (special symbols). Most of these can be filtered out fairly easily, either by visually examining the results or by running operations in R (see the sketch below). In the case of GS, we also omitted publications that are more than four years old and have still not been cited.
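For example, duplicated GS records can be flagged in R by normalizing the publication titles and applying duplicated(). This is only a minimal sketch; the column name title is an assumption and the normalization rules can be extended as needed:

> gs <- read.delim("geostat_gs.txt")
# normalize the titles: lower case, strip punctuation and repeated spaces;
> title.key <- tolower(as.character(gs$title))
> title.key <- gsub("[[:punct:]]", "", title.key)
> title.key <- gsub("[[:space:]]+", " ", title.key)
# keep only the first occurrence of each normalized title:
> gs <- gs[!duplicated(title.key),]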

After all the preprocessing steps, we prepared four tables with the following structure:

  • geostat_wos (n=4000): (Web of Science) publication title, year, authors, journal/publisher, number of citations.
  • geostat_sc (n=2000): (SCOPUS) publication title, year, authors, journal/publisher, number of citations.
  • geostat_gs (n=4724): (Google Scholar) publication title, year, authors, journal/publisher, number of citations.
  • authors (n=200): author, institution/company, city, latitude, longitude, total publications, SCOPUS h-index for authors, WoS h-index for their geostatistics publications.
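The citation rate (CR) used in the analysis below is not stored in these tables, but can be derived from the year and citation columns. A minimal sketch, assuming the column names year and citations, and taking 2008 (the time of the data retrieval) as the reference year:

> wos <- read.delim("geostat_wos.txt")
# CR = total citations divided by the number of years since publication;
# pmax() avoids division by zero for articles published in 2008:
> wos$CR <- wos$citations / pmax(2008 - wos$year, 1)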

Geocoding addresses

In the following step, we need to attach geographic coordinates to the extracted articles using the address of the contact author (we will focus on the results from WoS only). Here we use Google's geocoding service, which returns geographic coordinates for a given street + city + country address (see also the coverage details of Google Maps). First, register your own Google API key. To geocode an address, you can then run in R:

> readLines(url("http://maps.google.com/maps/geo?q=1600+Amphitheatre+Parkway,+Mountain+View,+CA
&output=csv&key=abcdefg"), n=1, warn=FALSE)

which returns four numbers: 1. the HTTP status code, 2. the geocoding accuracy, 3. the latitude, and 4. the longitude. In the example above:

[1] 200.00000 8.00000 37.42197 -122.08414

the status code is 200 (meaning "No errors occurred; the address was successfully parsed and its geocode has been returned"; see also the status code table), the geocoding accuracy is 8 (meaning highly accurate; see also the accuracy constants), the latitude is 37.42197 and the longitude is -122.08414.

Note that the address of a location needs to be provided in the following format:

"StreetNumber+Street,+City,+Country"

We can now loop this operation over a vector of addresses (contact authors):

> library(spatstat)
> library(rgdal)
> library(maps)
> googlekey <- "abcd"  # please obtain your own Google API key!
>
> wos <- read.delim("geostat_wos.txt")
# create new columns to hold the coordinates:
> wos$lat <- rep(0, length(wos$address1))
> wos$lon <- rep(0, length(wos$address1))
> for (i in 1:length(wos$address1)) {
+    # replace spaces with "+" so the address can be passed in the URL:
+    googleaddress <- paste(unlist(strsplit(as.character(wos$address1[i]), " ")), collapse="+")
+    googleurl <- url(paste("http://maps.google.com/maps/geo?q=", googleaddress,
+       "&output=csv&key=", googlekey, sep=""))
+    # the reply has the form "status,accuracy,lat,lon":
+    googlell <- as.numeric(unlist(strsplit(readLines(googleurl, n=1, warn=FALSE), ",")))
+    wos$lat[i] <- googlell[3]
+    wos$lon[i] <- googlell[4]
+    close(googleurl)
+ }

Obtaining longitudes/latitudes from the Google API service can be problematic for slower internet connections and long lists of addresses. In fact, Google limits the number of geocoding requests to 15,000 in a 24-hour period. If the url connection breaks, it is a good idea to run the same loop one more time, wrapped in a while loop that looks for the still-missing coordinates:

> for (i in 1:length(wos$address1)) {
+   # repeat the request until valid coordinates have been received:
+   while(wos$lat[i]==0|wos$lon[i]==0) {
+    googleaddress <- paste(unlist(strsplit(as.character(wos$address1[i]), " ")), collapse="+")
+    googleurl <- url(paste("http://maps.google.com/maps/geo?q=", googleaddress,
+       "&output=csv&key=", googlekey, sep=""))
+    googlell <- as.numeric(unlist(strsplit(readLines(googleurl, n=1, warn=FALSE), ",")))
+    wos$lat[i] <- googlell[3]
+    wos$lon[i] <- googlell[4]
+    close(googleurl)
+ }}
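Because of the daily request limit, it can also help to slow the loops down. A simple option (base R) is to add the following lines at the end of each loop body, before the closing brace:

+    # pause briefly so the requests are spread out over time:
+    Sys.sleep(0.5)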

After we have obtained coordinates for each article, we can convert the table into a point map using:

> library(maptools)  # provides map2SpatialLines()
> wosmap <- subset(wos, !is.na(wos$lat))
# insert a small random location error for each point to separate duplicate addresses;
> wosmap$rlat <- round(wosmap$lat + rnorm(length(wosmap$lat), mean=0, sd=0.001), 4)
> wosmap$rlon <- round(wosmap$lon + rnorm(length(wosmap$lon), mean=0, sd=0.001), 4)
> coordinates(wosmap) <- ~rlon+rlat
> proj4string(wosmap) <- CRS("+proj=longlat +datum=WGS84")
> worldmap <- map2SpatialLines(map("world", fill=TRUE, col="transparent", plot=FALSE),
+    proj4string=CRS("+proj=longlat +datum=WGS84"))
> bubble(wosmap[!is.na(wosmap$CR),"CR"], sp.layout=list("sp.lines", worldmap,
+    col="grey"), maxsize=2)

which produces the following plot:

Figure: A bubble plot showing the citation rates in the field of geostatistics. Based on results in January 2008.

Spatial analysis

Once we have attached a geographic location to the selected articles, we can use an isotropic Gaussian kernel smoother (weighted by the CR) to map scientific excellence around the world. This can be run in, e.g., the spatstat package (Baddeley, 2008). First, we import a 20 arcminutes mask map of the world covering all land areas:

> worldmaps20 <- readGDAL("mask20.asc")
> names(worldmaps20) <- "mask"
> wowin <- as(worldmaps20, "owin")

Next, we can convert the point map to a point pattern (spatstat data format) and run an isotropic Gaussian kernel with a bandwidth of 0.5 arcdegrees:

> wosCR.ppp <- ppp(wosmap@coords[,1], wosmap@coords[,2], marks=wosmap$CR, window=wowin)
# kernel smoothing with sigma=0.5 arcdegrees, weighted by the citation rates:
> densCR <- density.ppp(wosCR.ppp, sigma=0.5, weights=wosmap$CR, edge=TRUE)
> plot(densCR)
# export the density surface to a GIS format (ILWIS):
> dens.CR <- as(densCR, "SpatialGridDataFrame")
> writeGDAL(dens.CR["v"], "../ilwis/densCR.mpr", "ILWIS")
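For reference, the surface computed by density.ppp is (up to the edge-correction factor $e(s)$) a sum of Gaussian kernels centred on the article locations $s_i$, weighted by their citation rates $w_i$:

\[ \hat{\lambda}(s) = e(s)\sum_{i=1}^{n} \frac{w_i}{2\pi\sigma^2}\, \exp\!\left(-\frac{\lVert s-s_i\rVert^2}{2\sigma^2}\right), \]

where the bandwidth $\sigma = 0.5$ arcdegrees is the standard deviation of the Gaussian kernel.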


Figure: World maps of bibliometric parameters for geostatistics estimated using a sample of 4000 articles: (a) density of published research articles generated using the isotropic Gaussian kernel with a standard deviation of 0.5 arcdegrees; (b) the same but weighted using the CRs for each article. Based on results in January 2008.

The figure above shows the locations of both high productivity and high CR, revealing clusters of scientific excellence around European locations such as Barcelona, London, Louvain, Norwich, Paris, Utrecht, Wageningen and Zürich; North American locations such as Stanford, Ann Arbor, Tucson, Corvallis, Seattle, Boulder, Montreal, Baltimore, Durham, Santa Barbara and Los Angeles; and also around Canberra, Melbourne, Sydney, Santiago, Taipei, and Beijing.

The figure below shows the results of mapping the CR-weighted density using a mask map for Europe (5 km resolution):

Figure: Density map (citation rates) in the field of geostatistics for Europe. Based on results in January 2008.

References


