Team:St Andrews/Human-practices


Scientific impact of iGEM

"Most influential synthetic biology competition" vs. "Just some kids playing"?

We investigated the scientific attention garnered by iGEM and the Registry of Standard Parts. A data-driven approach was chosen: we extracted data from the results of various queries (such as ("iGEM" OR "International Genetically Engineered Machine") AND ("synthetic biology" OR "genetic engineering")) run against several publication search engines. We tried Web of Knowledge, Scopus, PubMed and Google Scholar, and settled on Google Scholar because of the alternatives' various shortcomings.

We conclude with hypotheses to explain the results and discuss their implications for the iGEM competition.

TODO Put findings summary here too.

Why we used Google Scholar

We must admit iGEM is somewhat of a niche topic. WoK, Scopus and PubMed are strictly curated and limited in scope, so they missed many obviously relevant publications. We also found their search options unsuitable: many did not support full-text search (they looked at titles, keywords and abstracts only) or boolean operators. Both were requirements for us. iGEM cannot be expected to always be the main subject of a paper, hence full-text search; and there are many relevant terms floating about iGEM, hence boolean operators like OR, so that papers containing "International Genetically Engineered Machine" and papers containing only the acronym "iGEM" are treated equally. Google Scholar fulfilled both requirements.
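For concreteness, here is a small Python sketch of how a boolean query such as Query 1 can be composed, and what the corresponding Google Scholar URL would look like. The URL construction is only our illustration; in practice we ran the queries through Publish or Perish rather than fetching results from this URL ourselves.

    from urllib.parse import urlencode

    # Compose a boolean full-text query: (subject terms) AND (iGEM terms).
    subject_terms = ['"synthetic biology"', '"genetic engineering"']
    igem_terms = ['"iGEM"', '"International Genetically Engineered Machine"']
    query = "({}) AND ({})".format(" OR ".join(subject_terms),
                                   " OR ".join(igem_terms))

    # Illustrative only: the equivalent Google Scholar search URL.
    url = "https://scholar.google.com/scholar?" + urlencode({"q": query})
    print(query)
    print(url)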

What sort of difference in scope are we talking about? Here is an example: PubMed gave so few results (16 for iGEM genetic*) that we quickly discarded it. Manually merging the Web of Knowledge and Scopus results for the query iGEM AND genetic* (discarding obviously irrelevant results) gave 43 results. Then we queried Google Scholar: it gave us 770 for ("synthetic biology" OR "genetic engineering") AND ("iGEM" OR "International Genetically Engineered Machine"). Imagine our expressions.

Of course, Google Scholar too is but a bronze bullet: it brings its own drawbacks. It is engineered to pick up anything that merely looks like a scholarly article, and like Google's search results in general, its results are not curated by a human. This has been criticised in the literature (Péter, 2006), and we found the occasional hilarious total miss ourselves. Google Scholar is also known to somewhat overestimate citation counts (Iselid, 2006). However, from empirical manual examination of a random sample we concluded that the majority of the results are plausible and (most importantly) far greater in number than anything the curated databases return. We only want statistics to identify trends, and for that, large and coarse data suffice. We discourage using our method for obtaining precise values!

Browse the data

All our data are online in a nifty and very usable Google Docs folder.

An introduction is included in case you get lost or want more information.

On extraction tools

We made extensive use of Harzing Publish or Perish (Harzing, A.W., 2007.) to scrape Google Scholar results. The tool has many limitations. However, it is in our experience the best out there for managing the mess that scientific publication data scraping tends to become!

We did try other things: We quickly found manual methods too slow. Various Firefox browser plugins simply failed outright, were extremely awkward to use or produced clearly erroneous results. The Mac OS program Papers was easy to use and found huge numbers of papers (as it could access many sources), but had unacceptably high rates of error, problems with duplicates and could not export the results into a form we could easily process. Hence Publish or Perish.

Query summary

Here's a quick breakdown of what we queried for on Google Scholar and what sort of data was returned. (The ID matches the name of the data set in our data tables).

Dataset ID | Plain-English query | Query | Nº papers | Nº citations | h-index | g-index | Query date
5 | Papers mentioning iGEM | iGEM OR "International Genetically Engineered Machine" | 1000 | 9095 | 36 | 64 | 17/7/2012
6 | Papers mentioning iGEM and Registry | ("iGEM" OR "International Genetically Engineered Machine") AND ("Registry of Standard Biological Parts" OR "partsregistry.org" OR "parts.mit.edu") | 330 | 2208 | 23 | 42 | 17/7/2012
1 | Papers mentioning iGEM in context of synbio | ("synthetic biology" OR "genetic engineering") AND ("iGEM" OR "International Genetically Engineered Machine") | 770 | 3253 | 26 | 45 | 17/7/2012
2 | All synthetic biology | synthetic biology | 1000 | 68482 | 127 | 214 | 17/7/2012
3 | Papers mentioning Registry of Parts | "Registry of Standard Biological Parts" OR "partsregistry.org" OR "parts.mit.edu" | 751 | 6442 | 39 | 69 | 17/7/2012
4 | Papers citing a particular Part | partsregistry.org/Part: | 54 | 263 | 5 | 16 | 17/7/2012

Note: searches were capped at a maximum of 1000 results. Hence getting 1000 results for a query implies that more exist! Those first 1000 are only the ones the search engine judged most relevant.

There are many ways you could quantify the success of a paper. Here are a few we investigated:

Plain citation count

It's good scientific practice to cite (to mention relevant publications in one's own paper). High citation count can hence generally be taken as an indicator of a high-quality or high-impact paper. This is the most traditional method of ranking the influence of papers.

The main disadvantage of the citation-count method is its non-universality: papers in different scientific fields have different citation norms and typical counts. It is also significant that old papers have an edge over newer ones, as they have had more time to be cited.

h-index

The h-index is a single integer computed from a set of papers. It is used to measure the output and influence of a set of scientists: a greater h-index implies more productive and more influential authors. It was invented by the physicist J.E. Hirsch (2005) and has since been automatically calculated by many citation databases. Its definition: a set of papers has h-index h if h is the largest number such that h of the papers have each been cited at least h times. An image ("Ael 2" and "Vulpecula", 2012) clarifies the definition.
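Publish or Perish computed the h-index values in our tables for us, but as a minimal sketch of the definition, the calculation can be written in a few lines of Python:

    def h_index(citations):
        """Largest h such that h papers have each been cited at least h times."""
        ranked = sorted(citations, reverse=True)   # most-cited papers first
        h = 0
        for rank, cites in enumerate(ranked, start=1):
            if cites >= rank:   # the top `rank` papers all have >= rank citations
                h = rank
            else:
                break
        return h

    # Example: five papers cited 10, 8, 5, 4 and 3 times give an h-index of 4,
    # because four papers have at least four citations each (but not five with five).
    print(h_index([10, 8, 5, 4, 3]))  # -> 4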

g-index

The g-index is another citation index meant to quantify the influence of a set of papers. It was proposed by Leo Egghe (2006) as a variation on the h-index. It puts more emphasis on the most cited papers, and Egghe argues that it ranks highly cited authors more fairly. His definition: a set of papers has g-index g if g is the highest rank such that the top g papers have, together, at least g² citations. Here is a clarifying image by our Polish friend ("Ael 2", 2012) again.
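Again as a minimal sketch of the definition (the values in our tables come from Publish or Perish), the basic g-index calculation looks like this in Python:

    def g_index(citations):
        """Highest rank g such that the top g papers together have >= g*g citations."""
        ranked = sorted(citations, reverse=True)   # most-cited papers first
        running_total, g = 0, 0
        for rank, cites in enumerate(ranked, start=1):
            running_total += cites
            if running_total >= rank * rank:
                g = rank
        return g

    # Example: the same five papers (10, 8, 5, 4, 3 citations) have a g-index of 5,
    # since all five together have 30 citations and 30 >= 5*5.
    print(g_index([10, 8, 5, 4, 3]))  # -> 5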

Algorithmic methods

It's worth noting that there are many other ways of quantifying the productivity and impact of a set of papers or scientists. For example, Y.B. Zhou et al. (2012) propose a more complete method for "distinguishing prestige from popularity". In their algorithm, the weight a citation contributes to a paper's influence also depends on the (already calculated) influence of the citing papers and their authors. This requires running a recursive algorithm on a sufficiently complete bipartite network of papers and their authors.
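To give a flavour of what such a computation involves, here is a toy Python sketch of an iterative scoring scheme on a tiny paper-author network. It is emphatically not Zhou et al.'s algorithm: the example network, the update rule and the fixed number of rounds are all our own simplifications of the general idea.

    # Toy iterative scoring on a bipartite paper-author network. NOT Zhou et al.'s
    # algorithm, only an illustration of the idea that a paper's score depends on
    # its authors' scores and vice versa.
    papers_by_author = {              # hypothetical, tiny network
        "author_A": ["p1", "p2"],
        "author_B": ["p2", "p3"],
    }

    # Build the reverse mapping (paper -> authors).
    authors_by_paper = {}
    for author, papers in papers_by_author.items():
        for paper in papers:
            authors_by_paper.setdefault(paper, []).append(author)

    paper_score = {p: 1.0 for p in authors_by_paper}
    author_score = {a: 1.0 for a in papers_by_author}

    for _ in range(50):  # iterate a fixed number of rounds until scores settle
        # An author's score is the sum of their papers' scores...
        author_score = {a: sum(paper_score[p] for p in ps)
                        for a, ps in papers_by_author.items()}
        # ...and a paper's score is the sum of its authors' scores.
        paper_score = {p: sum(author_score[a] for a in authors)
                       for p, authors in authors_by_paper.items()}
        # Normalise so scores stay bounded between rounds.
        total = sum(paper_score.values())
        paper_score = {p: s / total for p, s in paper_score.items()}

    print(paper_score)  # relative "prestige" of p1, p2 and p3 in this toy network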

We've looked mainly at citation count, h-index and g-index. The algorithmic methods were beyond our reach due to unavailability of data: We would have had to find the names of everyone in every iGEM team and all papers they've written, filtering out large amounts of false matches. This was impracticable.

TODO Explain choices even better?

Is our data good?

Yes. Mostly...

It is rather meta-analytical a question to ask! But given our doubts over the accuracy of Google Scholar's data, we made it a priority to treat the query results with some caution. Data sets 5 and 2 turned out to have severe flaws and were rejected from further analysis, as discussed below.

Our check was empirical: random samples of each data set were quickly passed under a human eye. For most queries, this showed acceptably low levels of "background static" (i.e. results that Google Scholar had matched to the query but that were not actually relevant). Such results form only a drop of error in an ocean of relevant data.

...but only mostly.

Query 5 (iGEM OR "International Genetically Engineered Machine") was found to have an unacceptably large level of static. The reason was quickly identified: because the two terms in the query were joined only by a disjunction (OR), it matched anything at all that contained the bare acronym "IGEM". That let in acronyms from economics such as the "Inter-temporal General Equilibrium Model (IGEM)", the British "Institution of Gas Engineers & Managers (IGEM)", and various medical terms and chemical names. The entire data set with ID 5 was therefore dismissed from further analysis.

The lesson we took from this is not to search for short acronyms by themselves. Query 1 (("synthetic biology" OR "genetic engineering") AND ("iGEM" OR "International Genetically Engineered Machine")) can be thought of as the "version two" of the problematic Query 5. It searches for the same terms, but adds a conjunction (AND) with either synthetic biology or genetic engineering, which steers the results towards our iGEM and tunes the static down to an acceptable level.

Query 2 (synthetic biology) had a different problem: it was too big. The query was an attempt to capture statistics for the entire field of synthetic biology, so that we could estimate the relative influence of the iGEM competition. The composer of the query had forgotten the 1000-result cap imposed by Google Scholar: it is impossible to retrieve results beyond this 1000-result "event horizon". Google does not publish how the order of results is determined, so these first 1000 results (out of what are likely tens or hundreds of thousands of papers) are biased in some unknown way. Were more-cited papers favoured? Were more recently published papers favoured? No conclusions can be drawn from a biased and small subset of the full data, so we also discounted data set 2 from any further analysis.

Is iGEM getting attention?

What's "attention"? We will operate under the reasonable assumption that getting attention correlates very strongly with being mentioned. Hence the attention that some term is getting is quantifiable simply by searching for that term and summing up result counts.

Here is a chart summarising how many papers mention various terms floating about iGEM over time:

Chart 1: Summary of iGEM-related attention over time. Series plotted: number of teams; papers mentioning iGEM; papers mentioning the Parts Registry; parts submitted (in tens); papers mentioning specific Registry Parts.

Note: as the series label also states, the parts submitted are given in tens, i.e. units of ten! (The other values are unscaled.) We may scale values like this again further on in the document without repeating this notice, so pay attention!

The Answer

It seems things are looking good for iGEM! Since the first iGEM competition in 2003, more teams have taken part each year and more parts have been submitted. The numbers of papers mentioning iGEM and the Registry of Standard Parts have risen consistently, and these metrics grow roughly in the proportions one would expect.

Papers mentioning a specific Registry BioBrick have only begun to appear in recent years, but their numbers show growth. We theorise that this may be due to the Registry's contents only now beginning to reach the critical mass necessary for adoption by researchers. Do note that the sample size is low in comparison to the other statistics and hence is of course more susceptible to error.

Where this data comes from

Data about participating teams and the number of submitted BioBricks come from the iGEM Foundation (2012). The other data sets come from the results of the Google Scholar queries with IDs 1, 3 and 4 (see the query summary table above).

"Ael 2" and "Vulpecula", 2012. h-index (Hirsch). Wikipedia. [image online] Available at: <http://en.wikipedia.org/wiki/File:H-index-en.svg> [Accessed Jul 27, 2012].

"Ael 2", 2012. Illustrated example for the g-index proposed by Egghe. Wikipedia [image online] Available at: <http://en.wikipedia.org/wiki/File:Gindex1.jpg> [Accessed Jul 27, 2012].

Egghe, L., 2006. Theory and practise of the g-index. Scientometrics [online], Volume 69 (Issue 1), p.131-152. Available at: <www.springerlink.com/content/4119257t25h0852w/?MUD=MP> [Accessed Jun 7, 2012].

Harzing, A.W., 2007. Publish or Perish. [computer program] Available from <http://www.harzing.com/pop.htm>

Hirsch, J.E., 2005. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, Volume 102 (Issue 46). [online] Available at: <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1283832/?tool=pmcentrez> [Accessed Jul 5, 2012].

iGEM Foundation, 2012. Previous iGEM Competitions. [web page] Available at: <https://igem.org/Previous_iGEM_Competitions> [Accessed Jul 30, 2012]

Iselid, L., 2006. Research on citation search in Web of Science, Scopus and Google Scholar. One Entry to Research [blog] Available at: <http://oneentry.wordpress.com/2006/08/11/research-on-citation-search-in-web-of-science-scopus-and-google-scholar/> [Accessed Jun 20, 2012].

Péter J., 2006. Dubious hit counts and cuckoo's eggs. Online Information Review [online] Volume 30 (Issue 2) p.188-193. Available at: <http://www.emeraldinsight.com/journals.htm?articleid=1550726&show=abstract> [Accessed Jun 20, 2012].

Zhou Y.B., Lü L. and Li M., 2012. Quantifying the influence of scientists and their publications: distinguishing between prestige and popularity. New Journal of Physics, [online] Volume 14 (March 2012). Available at: <http://iopscience.iop.org/1367-2630/14/3/033033/> [Accessed Jun 7, 2012].


University of St Andrews, 2012.

Contact us: igem2012@st-andrews.ac.uk, Twitter, Facebook

This iGEM team has been funded by the MSD Scottish Life Sciences Fund. The opinions expressed by this iGEM team are those of the team members and do not necessarily represent those of Merck Sharp & Dohme Limited, nor its Affiliates.