YOKOFAKUN: article

21 May 2016

Playing with the @ORCID_Org / @ncbi_pubmed graph. My notebook.

"ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized. "

I've recently discovered that PubMed now integrates ORCID identifiers.

and so it begins ! :-D @ORCID_Org Orcid Identififiers in @ncbi_pubmed https://t.co/hEBPQOoYjH
YESSSS !!!!!!!!!!!!! pic.twitter.com/B0fWNU8V2A
— Pierre Lindenbaum (@yokofakun) May 19, 2016

There are several minor problems, though: I found some articles where the ORCID iD is malformed, or where different people use the same ORCID iD:

The dream is over: two authors sharing the same @ORCID_Org Orcid ID in pubmed https://t.co/NdSW87fV3Y pic.twitter.com/p7EnTEl8Mc
— Pierre Lindenbaum (@yokofakun) May 20, 2016

for now, I've found 45 papers in pubmed having a problem with their @ORCID_Org ID : https://t.co/4H20PLtvJe
— Pierre Lindenbaum (@yokofakun) May 20, 2016

"- I suggest we all use the same @ORCID_Org in the lab
- sounds legit" pic.twitter.com/m5yvi60DRL
— Pierre Lindenbaum (@yokofakun) May 20, 2016
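
A malformed identifier can often be caught mechanically, because an ORCID iD ends with a check character computed with ISO 7064 MOD 11-2. The following is a minimal sketch of such a check, not the code used for the tweets above; the class name `OrcidCheck` and its methods are hypothetical. It only tests the shape and the checksum, so it cannot detect two authors sharing one valid iD.

```java
import java.util.regex.Pattern;

/** Sketch: shape and checksum validation of an ORCID iD (ISO 7064 MOD 11-2). */
final class OrcidCheck {
    // Four dash-separated groups of four characters; only the last one may be 'X'.
    private static final Pattern SHAPE =
        Pattern.compile("\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]");

    /** Computes the check character from the 15 leading digits. */
    static char checkDigit(String baseDigits) {
        int total = 0;
        for (int i = 0; i < baseDigits.length(); i++) {
            int digit = Character.getNumericValue(baseDigits.charAt(i));
            total = (total + digit) * 2;
        }
        int result = (12 - total % 11) % 11;
        return result == 10 ? 'X' : (char) ('0' + result);
    }

    /** True if the string has the expected layout and a matching check character. */
    static boolean isValid(String orcid) {
        if (orcid == null || !SHAPE.matcher(orcid).matches()) return false;
        String compact = orcid.replace("-", "");
        return checkDigit(compact.substring(0, 15)) == compact.charAt(15);
    }
}
```

For instance, `OrcidCheck.isValid("0000-0002-4078-7413")` accepts the first iD of the sample XML shown further down, while changing its last digit makes the check fail.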

You can download the papers containing ORCID identifiers using the Entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID].
I used one of my tools, pubmeddump, to download the articles as XML, and I wrote PubmedOrcidGraph to extract the authors' ORCID iDs.

<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <!--Generated with PubmedOrcidGraph https://github.com/lindenb/jvarkit/wiki/PubmedOrcidGraph - Pierre Lindenbaum.-->
  <PubmedArticle pmid="27197243" doi="10.1101/gr.199760.115">
    <year>2016</year>
    <journal>Genome Res.</journal>
    <title>Improved definition of the mouse transcriptome via targeted RNA sequencing.</title>
    <Author orcid="0000-0002-4078-7413">
      <foreName>Giovanni</foreName>
      <lastName>Bussotti</lastName>
      <initials>G</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-4449-1863">
      <foreName>Tommaso</foreName>
      <lastName>Leonardi</lastName>
      <initials>T</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-6090-3100">
      <foreName>Anton J</foreName>
      <lastName>Enright</lastName>
      <initials>AJ</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
  </PubmedArticle>
  <PubmedArticle pmid="27197225" doi="10.1101/gr.204479.116">
    <year>2016</year>
    <journal>Genome Res.</journal>
(...)

Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into SQL statements.

<?xml version="1.0"?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
    exclude-result-prefixes="xalan str"
 >
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values(<xsl:variable name="o2" select="@orcid"/>
<xsl:choose>
 <xsl:when test="str:strcmp( $o1 , $o2) &lt; 0">'<xsl:value-of select='$o1'/>','<xsl:value-of select='$o2'/>'</xsl:when>
 <xsl:otherwise>'<xsl:value-of select='$o2'/>','<xsl:value-of select='$o1'/>'</xsl:otherwise>
</xsl:choose>);
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

This stylesheet uses an extension function, 'strcmp', for the XSLT processor xalan to compare two strings.
The extension only ensures that the field "orcid1" in the table "collab" is always lower than "orcid2", so that the same pair of authors is never stored twice in different orders.
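
The same ordering trick can be sketched in plain Java, using `String.compareTo` in place of the xalan extension. This is only an illustration: the class `CollabPair` and its methods are hypothetical, and the SQL is built by plain concatenation on the assumption that the identifiers have already been validated.

```java
/** Sketch: store an undirected co-authorship edge in one canonical order. */
final class CollabPair {
    /** Returns the two identifiers with the lexicographically smaller one first. */
    static String[] ordered(String a, String b) {
        return a.compareTo(b) <= 0 ? new String[] {a, b} : new String[] {b, a};
    }

    /** Builds the same kind of statement the stylesheet emits for the collab table. */
    static String toInsert(String a, String b) {
        String[] pair = ordered(a, b);
        return "insert or ignore into collab(orcid1,orcid2) values('"
            + pair[0] + "','" + pair[1] + "');";
    }
}
```

Because both `toInsert(x, y)` and `toInsert(y, x)` yield the same statement, the `unique(orcid1,orcid2)` constraint combined with `insert or ignore` keeps a single row per pair of co-authors.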

./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)

These SQL statements are then loaded into sqlite3:

./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml |\
 sqlite3 orcid.sqlite

The next step is to produce a gexf+xml file to play with the orcid graph in gephi.
I use the following bash script to convert the sqlite3 database to gexf+xml.

DB=orcid.sqlite

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">

<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF

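# columns are tab-separated, because names and affiliations contain spaces;
# the five XML special characters are escaped before the <node> elements are built
sqlite3 -separator $'\t' -noheader  ${DB} 'select orcid,name,affiliation from author' |\
 sed  -e 's/&/\&amp;/g' -e "s/</\&lt;/g" -e "s/>/\&gt;/g" -e "s/'/\&apos;/g"  -e 's/"/\&quot;/g' |\
 awk -F '\t' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'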

echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader  ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"
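
The escaping step matters because names and affiliations may contain characters such as & or quotes, which would break the GEXF file. As a rough Java equivalent of the escaping and formatting done in the script above (the class `GexfWriter` and its methods are hypothetical, and the element layout simply mirrors what the script prints):

```java
/** Sketch: escape text for XML attributes and build GEXF node/edge elements. */
final class GexfWriter {
    /** Replaces the five XML special characters with their predefined entities. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '\'': sb.append("&apos;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** One node per author; attribute "0" carries the affiliation, as in the script. */
    static String node(String orcid, String name, String affiliation) {
        return "<node id=\"" + escapeXml(orcid) + "\" label=\"" + escapeXml(name)
            + "\"><attvalue for=\"0\" value=\"" + escapeXml(affiliation) + "\"/></node>";
    }

    /** One undirected edge per pair of co-authors. */
    static String edge(String orcid1, String orcid2) {
        return "<edge source=\"" + escapeXml(orcid1) + "\" target=\""
            + escapeXml(orcid2) + "\"/>";
    }
}
```

Escaping each field separately, rather than the whole line, avoids any dependence on the column separator.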

If you want to play, I've uploaded the gephi+pubmed graph as gexf/gephi here: https://t.co/0nRRts7gXm (4Mb) pic.twitter.com/8RGuI7X3ZE
— Pierre Lindenbaum (@yokofakun) May 21, 2016

The output is saved and then loaded into gephi.

playing with the ORCID/pubmed graph in gephi pic.twitter.com/1Ao5OC7ywI
— Pierre Lindenbaum (@yokofakun) May 21, 2016

where is my lab @institut_thorax in the pubmed/orcid graph of co-authorships ? pic.twitter.com/3Krqk5K1o8
— Pierre Lindenbaum (@yokofakun) May 21, 2016

That's it,

Pierre

16 September 2010

Using MongoDB with Apache Tomcat, searching for Pubmed articles

The current post continues the previous one, titled "MongoDB and NCBI pubmed: Inserting, searching and updating. My notebook."

Here I've used the Java API for MongoDB/BSON in Apache Tomcat, the Java servlet container, to search and display, via a web interface, some PubMed papers stored in MongoDB. In the end, the result looks like this:

The servlet

The servlet uses the Java API for MongoDB.

An instance of Mongo is created and connects to the MongoDB server.
We obtain a DB object from the server for the database 'pubmed'.
We get a DBCollection object from the database.
If the user provided a valid pmid, a new query is created:
BasicDBObject query=new BasicDBObject("_id",Long.parseLong(pmid));
and it is used to search the collection:
DBObject article=collection.findOne(query);
If the article is found, it is stored as a request attribute and the request is forwarded to the JSP named 'article.jsp'.

File: src/WEB-INF/src/fr/inserm/umr915/biomongo/BioMongoServlet.java

package fr.inserm.umr915.biomongo;

import java.io.IOException;
import javax.servlet.ServletConfig;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.ServerAddress;

@SuppressWarnings("serial")
public class BioMongoServlet extends HttpServlet
{
/** mongodb host */
private String mongoHost=ServerAddress.defaultHost();
/** mongodb port */
private int mongoPort=ServerAddress.defaultPort();

@Override
public void init(ServletConfig config) throws ServletException {
super.init(config);
String param=config.getInitParameter("mongo.port");
if(param!=null) this.mongoPort=Integer.parseInt(param);
param=config.getInitParameter("mongo.host");
if(param!=null) this.mongoHost=param;
}

@Override
protected void service(HttpServletRequest req, HttpServletResponse res)
throws ServletException, IOException
{
Mongo mongo=null;
try {
//connect to mongo
mongo = new Mongo(this.mongoHost,this.mongoPort);
//get the database pubmed
DB database=mongo.getDB("pubmed");
if(database==null) throw new ServletException("cannot get database");
//get the collection 'articles'
DBCollection col = database.getCollection("articles");
if(col==null) throw new ServletException("cannot get collection");
//get the query parameter 'pmid'
String pmid=req.getParameter("pmid");
String jspPage="/WEB-INF/jsp/index.jsp";
//if pmid exist and looks like a number
if(pmid!=null && pmid.matches("[0-9]+"))
{
//find
DBObject article=col.findOne(new BasicDBObject("_id",Long.parseLong(pmid)));
if(article==null)
{
req.setAttribute("message", "Cannot find pmid:"+pmid);
}
else
{
jspPage="/WEB-INF/jsp/article.jsp";
req.setAttribute("article",article);
}
}
req.getRequestDispatcher(jspPage).forward(req, res);
}
catch(ServletException e)
{
throw e;
}
catch(IOException e)
{
throw e;
}
catch(Exception e)
{
throw new ServletException(e);
}
finally
{
//cleanup mongo
if(mongo!=null)
{
mongo.close();
}
mongo=null;
}
}
}

article.jsp

article.jsp displays the article. As this object is an instance of BasicDBObject (that is to say a Map and/or a List) it was very easy to use it with the JSP technology and the Java Standard Tag Library (JSTL).

File: ./src/WEB-INF/jsp/article.jsp

<%@page contentType="text/html" %>
<%@ taglib prefix="c" uri="http://java.sun.com/jstl/core" %>
<%@ taglib prefix="mongo" tagdir="/WEB-INF/tags" %>
<% out.clearBuffer(); %><html><head>
<title>PMID:<c:out value="${param['pmid']}" escapeXml="true"/></title>
</head>
<body>
<h1>PMID:<c:out value="${param['pmid']}" escapeXml="true"/></h1>
<h2><c:out value="${article.title}" escapeXml="true"/></h2>
<dl>
<dt>Date</dt><dd><mongo:dbobject object="${article.created}"/></dd>
<dt>Authors</dt><dd><ul><c:forEach var="author" items="${article.authors}">
<li>
<mongo:dbobject object="${author}"/>
</li>
</c:forEach></ul></dd>
<dt>Journal</dt><dd><mongo:dbobject object="${article.journal}"/></dd>
<dt>Mesh</dt><dd><ul><c:forEach var="term" items="${article.mesh}">
<li>
<c:out value="${term}" escapeXml="true"/>
</li>
</c:forEach></ul></dd>
</dl>

</body>
</html>

dbobject.tag

The previous JSP page calls a custom tag file, <mongo:dbobject>, which makes a 'pretty display' of a BSON object by serializing it with com.mongodb.util.JSON.serialize(Object o).

File: ./src/WEB-INF/tags/dbobject.tag

<%@ tag language="java" pageEncoding="UTF-8"%>
<%@ taglib prefix="c" uri="http://java.sun.com/jstl/core" %>
<%@ taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions" %>
<%@tag import="com.mongodb.util.JSON"%>
<%@tag import="com.mongodb.DBObject"%>
<%@attribute name="object" required="true" rtexprvalue="true" type="java.lang.Object"%>
<c:set var="bson"><%= JSON.serialize(this.object) %></c:set>

<div style="background-color:lightgray;"><c:out escapeXml="true" value="${bson}"/></div>

Other Files

index.jsp

A simple form for searching an article.

File: ./src/WEB-INF/jsp/index.jsp

<%@page contentType="text/html" %>
<%@ taglib prefix="c" uri="http://java.sun.com/jstl/core" %>
<% out.clearBuffer(); %><html><head>
<title>Search PMID</title>
</head>
<body>
<form method="GET" action="${pageContext.request.contextPath}/biomongo" >
<div style=" position:absolute;width:400px;height:100px;left:50%;top:50%;margin-left:-100px;margin-top:-50px;">
<c:if test="${not empty message}">
<div style="text-align:center; color:red;"><c:out value="${message}" escapeXml="true"/></div>
</c:if>
<input name="pmid"/><input type="submit"/>
</div>
</form></body>
</html>

web.xml

The deployment descriptor.

File: ./src/WEB-INF/web.xml

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
version="2.5">

<description>Biomongo Servlet</description>
<display-name>Biomongo</display-name>

<servlet>
<display-name>Biomongo Servlet</display-name>
<servlet-name>biomongo</servlet-name>
<servlet-class>fr.inserm.umr915.biomongo.BioMongoServlet</servlet-class>

<load-on-startup>1</load-on-startup>
</servlet>

<servlet-mapping>
<servlet-name>biomongo</servlet-name>
<url-pattern>/biomongo</url-pattern>
</servlet-mapping>

<jsp-config>
<taglib>
<taglib-uri>http://java.sun.com/jstl/fmt</taglib-uri>
<taglib-location>/WEB-INF/tld/fmt.tld</taglib-location>
</taglib>

<taglib>
<taglib-uri>http://java.sun.com/jstl/core</taglib-uri>
<taglib-location>/WEB-INF/tld/c.tld</taglib-location>
</taglib>

<taglib>
<taglib-uri>http://java.sun.com/jstl/sql</taglib-uri>
<taglib-location>/WEB-INF/tld/sql.tld</taglib-location>
</taglib>

<taglib>
<taglib-uri>http://java.sun.com/jstl/x</taglib-uri>
<taglib-location>/WEB-INF/tld/x.tld</taglib-location>
</taglib>

<taglib>
<taglib-uri>http://java.sun.com/jstl/functions</taglib-uri>
<taglib-location>/WEB-INF/tld/fn.tld</taglib-location>
</taglib>

</jsp-config>

</web-app>

An ANT file for building the project

File: build.xml

<?xml version="1.0" encoding="UTF-8"?>
<project default="biomongo">
<property file="build.properties"/>
<property name="root.dir" value="."/>

<property file="${root.dir}/build.properties"/>

<path id="mongo.path">
<pathelement location="${mongodb.lib}"/>
</path>

<path id="servlet.path">
<pathelement location="${tomcat.servlet.api}"/>
<pathelement location="${tomcat.jsp.api}"/>
</path>

<target name="biomongo">
<property name="base.dir" value="${root.dir}/biomongo"/>
<mkdir dir="${base.dir}/src/WEB-INF/lib"/>
<mkdir dir="${base.dir}/src/WEB-INF/classes"/>

<copy todir="${base.dir}/src/WEB-INF/lib" includeEmptyDirs="false">
<fileset file="${mongodb.lib}"/>
<fileset dir="${taglib.dir}/lib" includes="*.jar"/>
</copy>

<copy todir="${base.dir}/src/WEB-INF/tld" includeEmptyDirs="false">
<fileset dir="${taglib.dir}/tld" includes="*.tld"/>
</copy>

<javac srcdir="${base.dir}/src/WEB-INF/src"
destdir="${base.dir}/src/WEB-INF/classes"
debug="true"
source="1.6"
target="1.6">
<classpath>
<path refid="mongo.path"/>
<path refid="servlet.path"/>
</classpath>
<sourcepath>
<pathelement location="${base.dir}/src/WEB-INF/src"/>
</sourcepath>
<include name="**/BioMongoServlet.java"/>
</javac>

<jar destfile="${tomcat.dir}/webapps/biomongo.war"
basedir="${base.dir}/src">
</jar>

</target>

</project>

Result

That's it.

Pierre

01 December 2007

Women in Science 2

Maud came from Genethon, where she took part in the creation of the first genetic map of the human genome. After several years at the National Center of Genotyping, she became our engineer responsible for producing genotypes on the Illumina platform at Integragen (the same technology as the one used by 23andMe).

Pierre

Women in Science 1

As I said after Scifoo 2007, I should take some time to draw. So here it is...

A portrait of my former colleague Chistine, a biostatistician, who was in charge of the analysis of the genotypes produced at Integragen. She now works in an Inserm unit with Dr Florence Demenais on genetic epidemiology and multifactorial diseases.

Pierre

01 November 2007

Pubmed2Wikipedia

I've created a Java tool called pubmed2wikipedia: I wrote it to quickly create a new entry for Wikipedia.
First, the user selects a set of articles about a given subject from PubMed; the software then downloads, prepares and formats the data for a new Wikipedia page. For example, it creates the 'references' part and suggests the Categories: from the MeSH terms. I've also included a dictionary which recognizes some regex patterns to help create Wikipedia internal links.
I first tried to use my own tool to create an entry about NSP3, a viral protein I studied during my PhD, but with hundreds of articles I felt I was no longer an expert on this protein :-) so I created a small article about another protein: RoXaN.
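
To give an idea of the kind of wikitext involved, here is a minimal sketch, not the code of pubmed2wikipedia itself. The class `WikiDraft` and its methods are hypothetical, and mapping each MeSH term directly to a category of the same name is a naive assumption: a real category may not exist under that name.

```java
import java.util.List;

/** Sketch: turn article metadata into wikitext fragments. */
final class WikiDraft {
    /** One [[Category:...]] line per MeSH term (naive one-to-one mapping). */
    static String categories(List<String> meshTerms) {
        StringBuilder sb = new StringBuilder();
        for (String term : meshTerms) {
            sb.append("[[Category:").append(term.trim()).append("]]\n");
        }
        return sb.toString();
    }

    /** A <ref> element wrapping a {{cite journal}} template; '|' in values is not handled. */
    static String reference(String title, String journal, int year, String pmid) {
        return "<ref>{{cite journal |title=" + title + " |journal=" + journal
            + " |year=" + year + " |pmid=" + pmid + "}}</ref>";
    }
}
```

A call such as `WikiDraft.reference("Improved definition of the mouse transcriptome via targeted RNA sequencing.", "Genome Res.", 2016, "27197243")` produces one footnote for the page, and `WikiDraft.categories(...)` produces the lines to place at the bottom of the article.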

I hosted this tool on http://code.google.com. It is available at: http://lindenb.googlecode.com/files/pubmed2wikipedia.jar

Pierre

26 October 2007

“Getting Started In…”

In PLOS, today: This month, PLoS Computational Biology and the ISCB begin a series of short, practical articles for students and active researchers who want to learn more about new areas of computational biology and are unsure where or how to start. The aim of each article in the “Getting Started in…” series is to introduce the essentials: define the area and what it is about, highlight the debates and issues of relevance, and provide directions to the most relevant books, articles, or Web sites to find out more...

The first expert to inform, motivate, and inspire readers to consider a new direction is Dr. Xiaole Shirley Liu, who introduces tiling microarrays.

“Getting Started In…”: A Series Not to Miss: PLoS Computational Biology 3 (10), e224 (2007)
Getting Started in Tiling Microarray Analysis : PLoS Computational Biology 3 (10), e183 (2007)

10 June 2007

Mapping NCBI/PUBMED

In my previous post I showed how I used the <Affiliation> tag from the PubMed XML records to extract the e-mail addresses and names of the authors of a paper. I've slightly changed the source code of this program to find the country of origin of each paper. To retrieve the country I used:
1) the suffix of the mail (if any)
2) the name of the country (if any)
3) the name of the city (a few famous ones, such as Stanford, for the US or the UK)

My program takes a PubMed query as input and outputs the number of papers per year and per country. I put a few results on ManyEyes. As an example, for the query "Rotavirus" with 1000 records, I was able to retrieve a country for 887 of them.
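
The three rules can be sketched as follows. This is an illustration under stated assumptions rather than the original program: the class `CountryGuesser` is hypothetical, the lookup tables are tiny placeholders, and plain substring matching is deliberately naive.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: guess a country from a free-text affiliation string. */
final class CountryGuesser {
    // Captures the last label of the domain of the first e-mail address found.
    private static final Pattern MAIL_SUFFIX =
        Pattern.compile("[\\w.+-]+@[\\w.-]+\\.([a-z]{2,})\\b");

    // Placeholder tables; a real run would need far more entries.
    private static final Map<String, String> BY_TLD = Map.of(
        "fr", "France", "uk", "United Kingdom", "de", "Germany", "jp", "Japan");
    private static final Map<String, String> BY_COUNTRY_NAME = Map.of(
        "france", "France", "united kingdom", "United Kingdom",
        "germany", "Germany", "japan", "Japan", "united states", "United States");
    private static final Map<String, String> BY_CITY = Map.of(
        "stanford", "United States", "nantes", "France");

    /** Applies the rules in order: mail suffix, then country name, then city name. */
    static Optional<String> guess(String affiliation) {
        if (affiliation == null) return Optional.empty();
        String text = affiliation.toLowerCase(Locale.ROOT);
        Matcher m = MAIL_SUFFIX.matcher(text);
        if (m.find() && BY_TLD.containsKey(m.group(1))) {
            return Optional.of(BY_TLD.get(m.group(1)));
        }
        Optional<String> byName = firstMatch(text, BY_COUNTRY_NAME);
        return byName.isPresent() ? byName : firstMatch(text, BY_CITY);
    }

    private static Optional<String> firstMatch(String text, Map<String, String> table) {
        for (Map.Entry<String, String> e : table.entrySet()) {
            if (text.contains(e.getKey())) return Optional.of(e.getValue());
        }
        return Optional.empty();
    }
}
```

Generic suffixes such as .com carry no country information, so they fall through to the next rule; affiliations matching none of the rules stay unassigned, which is consistent with some records having no country.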

Publications in "Bioinformatics", "BMC Bioinformatics", "Plos Comp. Biol."

Publications about "Rotavirus"

Publications about malaria, anopheles, plasmodium, etc.

16 May 2007

Health Care, Life Sciences and the Semantic Web: Publication

From the W3C:

The Semantic Web Health Care and Life Sciences Interest Group (HCLSIG) has reached a significant milestone with their publication of the article "Advancing Translational Research with the Semantic Web." This joint work of the Interest Group was published in BMC Bioinformatics, a peer-reviewed open access journal that plays a central role in the bioinformatics community. The authors illustrate the value of Semantic Web technologies to neuroscience researchers and biomedicine and report on several projects by members of the Interest Group.
