New model reveals forgotten influencers and ‘sleeping beauties’ of science


The influence of ‘forgotten’ scientific papers has been demonstrated in a new study led by a researcher from Goldsmiths, University of London.

A team from Goldsmiths, the University of Chicago, Google, the University of Maryland and Columbia University developed a model that tracks ‘discursive influence’: recurring words and phrases in historical texts that measure how scholars actually talk about a field, rather than just whom they cite. To determine a particular scientific paper’s influence, the researchers can statistically remove it from history and see how scientific discourse would have unfolded without it.

Aaron Gerow, Lecturer in Computing at Goldsmiths, who led the study, said: “Citations are one kind of impact, and discursive influence is a different kind. Neither one is the complete story, but they work together to give a better picture of what’s influencing science.”

The researchers report in the journal PNAS how they trained the model on massive text collections from computational linguistics, physics, and across science and scholarship (JSTOR), and then traced distinct patterns of influence. They found that scientists who persistently published in a single field were more likely to be ‘canonised’ in a way that compelled others to cite them out of proportion to their papers’ discursive contributions. On the other hand, discoveries that crossed disciplinary boundaries were more likely to have outsized discursive impact but fewer citations, likely because the ‘owner’ of the idea and her allies remain socially and institutionally distant from the citing author.

The model also sheds light on so-called ‘sleeping beauties’: papers that went relatively unacknowledged for years or even decades before experiencing a late burst of citations. For example, a 1947 paper on graphene remained obscure and forgotten until the 1990s with a resurgence of research interest in the material and an eventual Nobel Prize.

Study co-author James Evans, director of Knowledge Lab and professor of sociology at the University of Chicago, said: “Papers have a news cycle, when lots of people chat about them and cite them, and then they’re no longer new news. Our model shows that some papers have much more influence than citations will typically demonstrate, such as these ‘sleeping beauties,’ which didn’t have much influence early but come to be appreciated and important later.”

The study used a computational method known as ‘topic modeling’ that was invented by co-author David Blei of Columbia University. The authors said the same model can also be used to trace influence in other areas, such as literature and music. Text from poems or song lyrics, and even extra-textual characteristics such as stanza structure or chord progressions, could feed into the model to find under-credited influencers and map the spread of new concepts and innovations.

A report of the research, ‘Measuring discursive influence across scholarship’ by Aaron Gerow, Yuening Hu, Jordan Boyd-Graber, David M. Blei and James A. Evans, is published in Proceedings of the National Academy of Sciences.

This article is based on an original story written by Rob Mitchum for University of Chicago News, which was then adapted by Peter Wilton for Goldsmiths News.

US elections: Goldsmiths data science research links voting habits with sickness & death


A new dissertation by MSc Data Science student Caroline Butler highlights the relationship between health and politics in the USA.

MSc Data Science student Caroline Butler has been investigating whether there is a relationship between mortality among middle-aged white Americans, social and economic well-being, and the 2016 presidential primary election outcomes at county-level.

Her research suggests that middle-aged white Americans living in counties with higher death rates are more cautious voters. That is, they are more likely to vote for a safe bet over a wildcard such as Trump.

After analysing data from the United States Centers for Disease Control and Prevention’s WONDER tool, the United States Census Bureau’s County QuickFacts, and the ‘2016 US Election’ data on the Kaggle forum, Caroline discovered a pattern connecting death rates to voting.

Contrary to expectations, a one-unit increase in the all-cause mortality rate increased the log odds of Hillary Clinton winning that county’s Democratic presidential primary, rather than Bernie Sanders, by 1.5693. However, this result could have been skewed by Bernie Sanders’ younger fan base.

To Caroline’s surprise, a one-unit increase in the all-cause mortality rate decreased the log odds of Donald Trump winning his primary in a county by 1.4371.
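Log odds are easier to read once converted to odds ratios. The short sketch below does that conversion for the two figures quoted above. It assumes they are coefficients from a logistic regression, which the article implies but does not state, and the function and variable names are illustrative rather than taken from the dissertation. The article does not say what a ‘one unit’ change in the mortality rate is, so the ratios are only meaningful relative to whatever unit the study used.

```python
import math


def odds_ratio(log_odds_change: float) -> float:
    """Convert a change in log odds into a multiplicative change in odds."""
    return math.exp(log_odds_change)


# Figures quoted in the article (assumed to be logistic regression coefficients).
clinton_ratio = odds_ratio(1.5693)   # Clinton vs Sanders, Democratic primary
trump_ratio = odds_ratio(-1.4371)    # Trump winning the Republican primary

print(f"Clinton odds multiplied by {clinton_ratio:.2f} per unit of mortality")
print(f"Trump odds multiplied by {trump_ratio:.2f} per unit of mortality")
```

Under these assumptions, each one-unit rise in mortality multiplies the odds of a Clinton win by roughly 4.8, and multiplies the odds of a Trump win by roughly 0.24, that is, cuts them to about a quarter.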

The project was inspired by recent evidence that drug and alcohol poisoning, suicide and chronic liver diseases have caused the mortality rate among middle-aged white people in the United States to increase. At the same time, anti-establishment candidates, such as Donald Trump and Bernie Sanders, have achieved unexpected success.

In a follow-up investigation to her project, Caroline ran her data on mortality, socio-economic status of a county, and which state the counties were in through the CHAID machine learning algorithm, and found that with 85-89% accuracy, you could predict who would win the primary for each political party.

Her results suggest that, for both white people and all races combined, the social and economic well-being of a county is as closely related to the outcomes of the 2016 primary election as the mortality rates of middle-aged Americans are.

“Understanding whether mortality data for middle-aged white Americans is associated with political viewpoints is important not only from a political perspective, but also for purposes of developing appropriate public health directives,” Caroline explains.

“I was surprised to find that in areas with higher mortality rates, people were more likely to vote for Clinton over Sanders in the primaries – but I’d suggest this could be because Sanders had a high number of young, so generally more healthy, voters.

“A similar study should definitely be done for the United States Presidential Election so we can compare the voting patterns from the Democratic Party to the votes from the Republican Party.”

Adapted from a Goldsmiths news article by Sarah Cox

Jobs at Goldsmiths Computing

Goldsmiths Department of Computing is recruiting three post-doctoral research assistants in areas including computer science, mathematics and statistics.

Type of Contract: Three years fixed-term, full time
Salary: £31,462 to £34,110 (incl London Weighting)
Closing date for applications: 30 April 2015
Interview date: w/c 11 May 2015

The Role
The recently founded Goldsmiths Centre for Intelligent Data Analytics is seeking postdocs to join a new industrial research project, working with our partners at a large city financial institution to develop a fully functional, state-of-the-art spend analytics system.

We are particularly interested in applicants with specialisms in machine learning and/or statistics. It is also essential that applicants have strong programming abilities.

We are looking to extend our team, which already has strong expertise in many aspects of mathematics, computer science and artificial intelligence. We have already had considerable success in working together as a team developing new research ideas and deliverables to customers.

BIG DATA and algorithmic abstractions

‘The era of ubiquitous computing and big data is now firmly established, with more and more aspects of our everyday lives being mediated, augmented, produced and regulated by digital devices and networked systems powered by software. Software is fundamentally composed of algorithms — sets of defined steps structured to process data to produce an output. And yet, to date, there has been little critical reflection on algorithms, nor empirical research into their nature and work’ – Rob Kitchin

On December 11th 2014 Rob Kitchin will present his paper ‘Thinking critically about and researching algorithms’ in the RHB Cinema at Goldsmiths from 11:00am – 1:00pm.

His paper will begin by introducing what constitutes an ‘algorithm’ and how algorithms function, and will outline the numerous tasks that they now perform in our society. He will address the shortcomings of our understanding of algorithms, both in their formulaic structure and in their operations in the world, and how they are affected by interactions with other algorithms and users.

Critiquing the way in which scientists and technologists usually present algorithms as ‘purely formal beings of reason’, Rob will discuss how they can transform into ‘abstract entities’ whose work is often ‘out of control’.

‘…they are: often ‘black boxed’; heterogeneous, often contingent on hundreds of other algorithms, and are embedded in complex socio-technical assemblages; ontogenetic and performative…’

Often the work of many different hands and processes, and dispersed across vast networks, algorithms become difficult to decode, and their point of origin becomes difficult to find. They could be considered ‘emergent and constantly unfolding’.

Governing their nature and work, although difficult, should be considered urgent, and calls for greater certainty about how ‘algorithms exercise their power over us’.

The lecture will address these concerns and suggest how we may approach researching algorithms through several different access points, including examining source code, reverse engineering, unpacking the wider socio-technical assemblages, and examining how algorithms do work in the world.

Modelling a Community’s Health and Mobility Patterns with Mobile Phone Data


This Thursday at 3pm (16th October 2014), Kate Farrahi, Lecturer in Computing at Goldsmiths, University of London, will be giving a talk on ‘mobility patterns and interactions sensed by mobile phones’ at Cambridge University.

Mobile-sensed data of this kind provides a new source for many applications in both research and industry. In this talk, she will discuss two applications driven by mobile-sensed data, one based on mobility patterns and the other based on interaction patterns.

Human interactions sensed ubiquitously by cellphones can benefit many domains, particularly the monitoring of the spread of disease. The flu patterns of a community of 72 have been collected simultaneously with their interactions, as sensed by mobile phone Bluetooth logs. The focus of this work is to determine the accuracy of incorporating interaction data into dynamic epidemiology models for infection prediction.

Kate (Katayoun) Farrahi is a lecturer at Goldsmiths, University of London. Her research focuses on large-scale human behaviour modelling and mining, with special interest in data science, computational social sciences, mobile phone sensor data, and machine learning. Farrahi received her Ph.D. in Computer Science from the Swiss Federal Institute of Technology (EPFL) Lausanne, and the Idiap Research Institute, Switzerland. She has spent time as an intern at MIT and is a recipient of the Google Anita Borg scholarship and the Idiap research award.

This talk is part of the Computer Laboratory Systems Research Group Seminar series.