Uncovering historical patterns in scientific publications

Using probabilistic topic modeling, researchers have developed a system called Bookworm-arXiv that can parse thousands of scientific manuscripts hosted on arXiv, giving historians of science powerful tools for exploring the literature. The same team helped develop Google's n-gram viewer, which provides similar in-text search capabilities for Google's collection of books.

The Cultural Observatory will soon inaugurate a browser that searches for such language changes in a large online repository of scientific papers known as arXiv (pronounced like "archive").


arXiv under load due to Perelman's Fields Medal (Photo credit: ktheory)

Users will be able to type in one or two words at the site, called Bookworm-arXiv, and immediately see a graph showing the ups and downs of the phrase's use in the archive.


The system will enable researchers to understand the history of scientific concepts and the diffusion of knowledge through the scientific community.

Users can then click on the graph and drill down to read the original papers in which the terms appear, tracing ideas back to their roots or to the points where they spread from one field to another.
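The counting behind such a viewer can be illustrated with a toy sketch. This is not the Bookworm-arXiv code (whose internals are not described here), just a minimal illustration of the idea: for each year, count how often a one- or two-word phrase appears relative to all n-grams of the same length in that year's texts.

```python
from collections import defaultdict

def ngram_freq_by_year(corpus, phrase):
    """Toy n-gram viewer: relative frequency of `phrase` per year.

    corpus: list of (year, text) pairs; phrase: one or two words.
    Returns {year: hits / total n-grams of that length in that year}.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits, totals = defaultdict(int), defaultdict(int)
    for year, text in corpus:
        words = text.lower().split()
        grams = list(zip(*(words[i:] for i in range(n))))  # sliding n-grams
        totals[year] += len(grams)
        hits[year] += sum(g == target for g in grams)
    return {y: hits[y] / totals[y] for y in totals if totals[y]}

# Made-up miniature "archive" for illustration only
corpus = [
    (1995, "quantum gravity and string theory"),
    (1995, "gauge theory on the lattice"),
    (2005, "dark energy and dark matter surveys"),
    (2005, "dark energy constraints from supernovae"),
]
freq = ngram_freq_by_year(corpus, "dark energy")
```

Plotting `freq` over years would give exactly the kind of rise-and-fall curve the article describes, here showing "dark energy" absent in 1995 and present in 2005.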


via Words by the Millions, Sorted by Software – NYTimes.com.

Correlation versus Causation (as per xkcd)

While Businessweek had an interesting infographic on the difference between correlation and causation, xkcd provides a different take on the logical fallacy of cum hoc ergo propter hoc (assuming that correlation proves causation).

Correlation versus Causation (cum hoc ergo propter hoc)

Image by xkcd

In other words, correlation enables prediction whereas causation enables explanation. Correlation denotes the strength of a relationship between two variables, while causation requires correlation between cause and effect, temporal precedence, and the rejection of alternative hypotheses.

via Correlation versus Causation (cum hoc ergo propter hoc) « A little bit of this, a little bit of that.


Edit: Don’t forget the mouse-over text.

Implementing heteroskedasticity-consistent standard errors in SPSS (and SAS)

Homoskedasticity (also spelled homoscedasticity), or constant variance of the regression error terms, is a key assumption of ordinary least squares (OLS) regression. When this assumption is violated, i.e. the errors are heteroskedastic (or heteroscedastic), the OLS coefficient estimator remains unbiased and consistent, but it is no longer efficient, and the usual standard-error estimates are biased. This leads to Type I error inflation or reduced statistical power for coefficient hypothesis tests.

Thus correcting for heteroskedasticity is necessary when conducting OLS. Methods for this include transforming the data, weighted least squares (WLS) regression, and generalized least squares (GLS) estimation. Another alternative is to use heteroskedasticity-consistent standard error (HCSE) estimators for the OLS parameter estimates (White, 1980).


Comparison of residuals between first order Heteroskedastic and Homoskedastic disturbances (Photo credit: Wikipedia)

HCSE estimators come in four variants, HC0 through HC3. Standard errors from HC0 (the most common implementation) are best used for large samples, as this estimator is downward biased in small samples. The HC1, HC2, and HC3 estimators are better suited to smaller samples.
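The corrections themselves are straightforward to compute by hand, which may help demystify what the SPSS and SAS macros do. Below is a minimal numpy sketch of White's "sandwich" estimator with the HC0, HC1, and HC3 variants; this is an illustration of the general approach, not the Hayes and Cai macro itself.

```python
import numpy as np

def hc_standard_errors(X, y):
    """OLS with classical, HC0, HC1, and HC3 standard errors (White, 1980).

    X: (n, p) predictor matrix WITHOUT an intercept column; one is added here.
    """
    n = X.shape[0]
    X = np.column_stack([np.ones(n), X])          # prepend intercept
    k = X.shape[1]

    XtX_inv = np.linalg.inv(X.T @ X)              # the "bread"
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                              # residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages (hat-matrix diagonal)

    # Classical standard errors, valid only under homoskedasticity
    sigma2 = e @ e / (n - k)
    se_ols = np.sqrt(np.diag(sigma2 * XtX_inv))

    def sandwich(w):
        # "meat" = sum_i w_i * x_i x_i'; cov = bread @ meat @ bread
        meat = X.T @ (w[:, None] * X)
        return np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

    se_hc0 = sandwich(e**2)                       # downward biased in small n
    se_hc1 = sandwich(e**2 * n / (n - k))         # simple degrees-of-freedom fix
    se_hc3 = sandwich(e**2 / (1 - h)**2)          # preferred for small samples
    return beta, se_ols, se_hc0, se_hc1, se_hc3
```

With heteroskedastic data the HC standard errors typically diverge from the classical ones, while the coefficient estimates themselves are unchanged, exactly as described above.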

Many researchers conduct their statistical analysis in Stata, which has built-in procedures for estimating standard errors using all of the HC methods. Others, however, use SPSS for its pairwise-deletion capability (versus listwise deletion in Stata) and suffer from its lack of heteroskedasticity-correction capabilities. This wonderful paper by Hayes and Cai provides a macro (in the Appendix) that implements HCSE estimators in SPSS, along with a similar macro for SAS.

Note that the macro has no error-handling procedures, so pre-screening of the data is required. Also, missing data is handled by listwise deletion (which might defeat the purpose of using SPSS for some users).

Another link to the paper is here.


Hayes, A.F. and Cai, L. (2007), "Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation", Behavior Research Methods, 39(4), 709-722, DOI: 10.3758/BF03192961.

White, H. (1980), “A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity,” Econometrica, 48, 817-838.

Is press coverage good for science?

Over the past few months, the press has reported many stories about grand scientific discoveries. For example, the discovery of the lair of the Kraken, a prehistoric monster that ate dinosaurs for fun (see Lair of Ancient 'Kraken' Sea Monster Possibly Discovered – Yahoo! News and The Giant, Prehistoric Squid That Ate Common Sense | Wired Science | Wired.com). There was also extensive coverage of a study claiming that the speed of light had been broken and hence that Einstein's theory of special relativity was flawed (read more at Speed of light may have been broken – Q&A – Telegraph).


Image via Wikipedia

In hindsight, the reporting of these scientific discoveries appears to have been a little premature. In the zest to get the next big story, non-reviewed working papers get cited, data get misrepresented, findings get misquoted, and scientific ethics get ignored. These problems are more prevalent in some parts of the world.

Everyone has an example of the scientific ignorance of the press, but researchers in Britain probably have more than most. With stories ranging from ludicrous (wind turbine attacked by aliens) to downright irresponsible (promoting the link between childhood vaccinations and autism), the fourth estate in the United Kingdom has hardly covered itself in glory when it comes to science and scientific issues. Other countries have similar grievances, of course — particularly the United States, where right-wing talk radio and cable television regularly air anti-science views on everything from global warming to creationism. Stem-cell scientists in Germany and transgenic-crop researchers in France have also been assailed by journalism out of step with the scientific evidence that it claims to examine.


In Britain, an inquiry into the standards and ethics of the press has asked the scientific community to provide support for the thesis that press coverage that does not apply the scientific method is harmful to science. While this is an interesting debate in itself, it raises the larger question: is press coverage good for science? It also raises the question of whether scientists should be responsible for, and trained in, science journalism, or whether journalists should be trained in the scientific method. While several points can be made for or against, many proponents on both sides may agree that good, responsible press coverage is critical for good science. If not, then why do so many researchers cite 'media coverage' on their resumes?

Read more at The press under pressure : Nature : Nature Publishing Group.

Correlation versus Causation (cum hoc ergo propter hoc)

A picture is worth a thousand words. This wonderful image by Businessweek conveys the difference between correlation and causation far more accessibly than standard research texts do.

Image via BusinessWeek


In other words, correlation enables prediction whereas causation enables explanation. Correlation denotes the strength of a relationship between two variables, while causation requires correlation between cause and effect, temporal precedence, and the rejection of alternative hypotheses.


Thus the claim that correlation proves causation, or cum hoc ergo propter hoc, is a logical fallacy.
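The fallacy is easy to demonstrate with a small simulation (a made-up toy example, not taken from the Businessweek graphic): a hidden confounder produces a strong correlation between two variables that have no causal link to each other.

```python
import random

random.seed(42)

# A hidden confounder Z drives both X and Y, so X and Y correlate
# strongly even though neither one causes the other.
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]   # X is caused by Z
y = [zi + random.gauss(0, 0.5) for zi in z]   # Y is also caused by Z

def pearson(a, b):
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / m
    va = sum((ai - ma) ** 2 for ai in a) / m
    vb = sum((bi - mb) ** 2 for bi in b) / m
    return cov / (va * vb) ** 0.5

print(pearson(x, y))  # strong correlation (around 0.8) with zero causation
```

X is an excellent predictor of Y here, yet intervening on X would do nothing to Y; only knowledge of the causal structure (Z) explains the relationship.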


See the original at Correlation or Causation? – Businessweek.

The ironic effects of packaging


Image via Wikipedia

A paper published ahead of print in the Journal of Marketing Research finds that signaling effectiveness is a double-edged sword. Termed the ironic effects of packaging, the experimental study found that while attractive packaging increases initial product sales, it also leads to less post-purchase use of the product.

“Consumers become so convinced of the power of a boldly packaged product that they judge they can use less of it,” lead researcher Meng Zhu says. “Conversely, they tend to use more of a product when the packaging lacks strong cues of effectiveness.”


Zhu, M., Billeter, D.M., and Inman, J.J. (2011), "The Double-Edged Sword of Signaling Effectiveness: When Salient Cues Curb Postpurchase Consumption", Journal of Marketing Research, ahead of print.

Read more at Marketing Press | The Double-Edged Sword of Signaling Effectiveness: When Salient Cues Curb Postpurchase Consumption and Futurity.org – Package irony: Buy quickly, use slowly.

Agent-based model predicts crowd movement

A team of researchers from the University of Maryland has developed an agent-based model to predict crowd behavior and movement. The model lets users test the effects of 30 different types of individual behavior on the behavior of the group as a whole. The visual model, which gives users an immersive experience, can help in planning for riots and disaster relief.

The video begins with a simulated riot where characters are static and where the emphasis is on social interaction, decision-making and the connection between geography and social agency.  But the crowd models that follow focus on how a group of people moves around.
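The core agent-based idea can be sketched in a few lines. This is a toy illustration with invented steering rules, not the Maryland researchers' model: each agent independently steers toward a goal while being repelled by close neighbors, and crowd-level motion emerges from those individual rules.

```python
import math
import random

random.seed(1)

class Agent:
    def __init__(self, x, y, goal):
        self.x, self.y = x, y
        self.goal = goal  # (gx, gy) point this agent steers toward

def step(agents, speed=0.1, min_dist=0.5):
    """One tick: each agent moves toward its goal, pushed away by
    any neighbor closer than min_dist."""
    for a in agents:
        gx, gy = a.goal
        dx, dy = gx - a.x, gy - a.y
        d = math.hypot(dx, dy) or 1e-9
        vx, vy = dx / d, dy / d                  # unit pull toward goal
        for b in agents:
            if b is a:
                continue
            rx, ry = a.x - b.x, a.y - b.y
            r = math.hypot(rx, ry)
            if 0 < r < min_dist:                 # short-range repulsion
                vx += rx / r**2
                vy += ry / r**2
        v = math.hypot(vx, vy) or 1e-9
        a.x += speed * vx / v                    # constant walking speed
        a.y += speed * vy / v

# 20 agents scattered near the origin, all heading for the same exit
agents = [Agent(random.uniform(0, 1), random.uniform(0, 1), (10, 10))
          for _ in range(20)]
for _ in range(200):
    step(agents)
```

After 200 ticks the crowd clusters near the exit without overlapping, and changing a single rule (the repulsion range, the speed, or per-agent goals) changes the group-level pattern, which is exactly the kind of what-if experiment such models enable.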

See the video here.

Read more at New Scientist TV: Virtual rioters predict how crowds move and here.

Cognizant’s successful acquisition strategy

Cognizant’s CFO says that the company’s acquisition strategy is to acquire specific technology and service skills, competencies, and capabilities through focused, small-sized acquisitions. The maximum size of these acquisitions is $200 million, about 3.3% of its approximately $6 billion in revenues. Cognizant’s acquisition of Zaffera is the most recent in a series of such strategic acquisitions aimed at gaining specific knowledge.

Our acquisition strategy is towards very targeted tuck-in small deals which are targeted towards geographic expansion, building industry expertise and lastly, to acquire technology and services capability. For example, our PIPC acquisition was a very targeted acquisition. PIPC has expertise in large program management capability. It also had geographic presence in the UK and Australia. It has helped us in getting large development projects. Our sweet spot is $20-80 million deals and at the upper end it is $200 million. It is far easier to integrate.

Image via Business Today

via Cognizant’s CFO on acquisition strategy – Business Today – Business News.

Factor Analysis in Stata

Conducting Exploratory Factor Analysis (EFA) in Stata is relatively straightforward: run the factor command, followed by the rotate command. There are also other post-estimation commands. For examples of running EFA in Stata, go here or here. Running a Confirmatory Factor Analysis in Stata is a little more complicated. A cfa module, maintained and updated by Stanislav Kolenikov, can be downloaded by running the following command in Stata:


net from http://web.missouri.edu/~kolenikovs/stata/


The command syntax, together with a description of the commands and illustrative examples, is in the accompanying paper: cfa-sj-kolenikov.
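For readers without Stata, the extraction step at the heart of EFA can be sketched in numpy. This is a principal-component extraction of unrotated loadings from the correlation matrix, a simplified illustration of one common extraction method, not a reproduction of Stata's factor command.

```python
import numpy as np

def efa_loadings(X, n_factors):
    """Unrotated factor loadings via principal-component extraction:
    take the top eigenvectors of the correlation matrix and scale them
    by the square roots of their eigenvalues."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)            # ascending order
    idx = np.argsort(eigvals)[::-1][:n_factors]     # largest first
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])

# Simulated data: six observed variables driven by two latent factors
rng = np.random.default_rng(0)
f = rng.normal(size=(500, 2))
X = np.column_stack(
    [f[:, 0] + rng.normal(0, 0.3, 500) for _ in range(3)]    # load on factor 1
    + [f[:, 1] + rng.normal(0, 0.7, 500) for _ in range(3)]  # load on factor 2
)
L = efa_loadings(X, 2)  # (6, 2) loading matrix
```

Inspecting `L` recovers the simulated structure: the first three variables load heavily on one factor and the last three on the other, which is what the factor and rotate output in Stata would show for the same data.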



Stata Annotated Output: Factor Analysis. UCLA: Academic Technology Services, Statistical Consulting Group. From http://www.ats.ucla.edu/stat/stata/output/fa_output.htm (accessed October 14, 2011).

Willett, J.B., "Conducting Exploratory Factor Analyses", in Selected Multivariate Data-Analytic Methods. From http://isites.harvard.edu/fs/docs/icb.topic863573.files/Section_III/Section%20III_3/III_3a_1.do (accessed October 14, 2011).

Kolenikov, S., "Confirmatory factor analysis using cfa". From http://web.missouri.edu/~kolenikovs/stata/cfa-sj-kolenikov.pdf (accessed October 14, 2011).

India leads in Engineering R&D

India has a 22% share of the global engineering research and development (ER&D) outsourcing market, with current revenues of $10 billion expected to grow to $40 billion by the end of the decade. Hopefully, this export-oriented R&D will have spillover effects for local industry and help improve manufacturing innovation at Indian firms, especially in fast-growing sectors.


Image via Wikipedia

With over 400 service providers employing nearly two lakh (200,000) people and revenue of $9-10 billion, ER&D currently contributes 15 per cent of the $60 billion strong Indian IT-BPO export industry. During FY 2011, the cost savings by India-based ER&D Centres was over $20 billion.

Elaborating on sectors expected to have a bright future in India, Pandit said, “The Indian market is booming and as a country, we are no.1 when it comes to ER&D outsourcing. Sectors such as consumer electronics, automotive, energy, telecom and medical electronics have a great future.”

Read more in ‘Future bright for consumer electronics, automotive sectors’ – Indian Express and Business Line : Industry & Economy / Info-tech : Nasscom pegs engineering design market at $40 b by 2020.