MA/MSc Digital Journalism & MA/MSc Creating Social Media

Digital Research Methods:
Statistical Data Mining (SDM)

Module lecturer: Dr Daniel Stamate, Department of Computing

Module Description

Weekly session material:

23/01/12

Lecture 1 Introduction to Statistics: histograms, normal distribution, skewness, kurtosis, mean, mode, range, quartiles, interquartile range.

Lecture 2 Samples, statistical models, deviation from mean, variance, standard deviation, confidence intervals. Applications and exercises.

30/01/12

Finish Lecture 2

Lecture 3 Presentation of the IBM SPSS Statistics environment + demo with the software.

06/02/12

Lecture 4 Visual Data Exploration + demo with IBM SPSS Statistics.

Lab/Seminar 1 Descriptive statistics analysis. Visual data exploration with SPSS. Exercise on statistical estimation with 95% confidence intervals.
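For anyone who wants to check the estimation exercise outside SPSS, the calculation can be sketched in Python. This is an illustration only: the sample values are invented, the function name is hypothetical, and it uses a normal approximation rather than the t distribution that statistical packages typically apply to small samples, so the interval it gives will be slightly narrower.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Return (lower, upper) bounds of a normal-approximation
    confidence interval for the population mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    centre = mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value (about 1.96 when level is 0.95)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * standard_error
    return centre - margin, centre + margin


# Invented sample, used only to illustrate the calculation
scores = [52, 61, 58, 49, 66, 57, 60, 55, 63, 59]
low, high = confidence_interval(scores)
print(f"mean = {mean(scores):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Raising the level (for example to 0.99) widens the interval, because a larger critical value multiplies the same standard error.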

13/02/12

Finish tasks from last week's Lab/Seminar 1.

20/02/12

Lecture 5 Correlation and multiple linear regression analysis + demo with IBM SPSS Statistics.

Lecture 6 Data Mining: classification with decision trees and rules. Demo on building Decision Trees with IBM SPSS Statistics.

Lab 2 Looking for correlation in data. Building and evaluating linear regression models using the 2003 world development indicators dataset.
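The two steps of Lab 2 (looking for correlation, then fitting and evaluating a regression line) can be sketched in Python for a single predictor. This is a simplified illustration, not the lab procedure: the figures below are invented rather than taken from the 2003 world development indicators dataset, the function names are hypothetical, and only simple (one-predictor) regression is shown, not the multiple regression covered in Lecture 5.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """Proportion of variance in ys explained by the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented values: literacy rate (%) and infant mortality (per 1,000 births)
literacy = [30, 45, 60, 75, 90]
mortality = [95, 80, 55, 35, 15]

r = pearson_r(literacy, mortality)
slope, intercept = fit_line(literacy, mortality)
fit = r_squared(literacy, mortality, slope, intercept)
print(f"r = {r:.3f}, slope = {slope:.3f}, "
      f"intercept = {intercept:.3f}, R^2 = {fit:.3f}")
```

With one predictor, R squared equals the square of the correlation coefficient, which gives a quick consistency check on the fitted model.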

Video Tutorials – Applications with IBM SPSS Statistics software:

Tutorial 1 Tutorial 2 Tutorial 3 Tutorial 4

Optional complementary session (21/03/12; material not to be assessed)
Introducing additional leading quantitative analysis software with demonstrations on:

  1. Data analysis on the world development indicators dataset: charts, variable correlations, and building linear regressions with women's literacy rate, medical expenditure and doctor rate as predictors and infant mortality rate as the dependent variable.

  2. Data mining by building decision trees on US census data to understand the profile of people earning less or more than a threshold of USD 50K.

  3. Text and web mining applied to sentiment analysis by classifying film reviews as positive or negative (other possible similar applications: automatically classifying online reader feedback on articles into given categories). Clustering UK Computer Science departments' webpages (other possible similar applications: grouping online news or media output into categories based on content similarity).

Homework (not compulsory):

Homework 1

Homework 2: read "How Obama's data-crunching prowess may get him re-elected"

Software to use in the labs:

IBM SPSS Statistics version 19; you are entitled to use a College-licensed copy (obtainable from here) on your home computer as well, for academic purposes.

Reading list:

1. Data Mining: A Tutorial-Based Primer, by R. Roiger et al., Addison Wesley, 2002

2. Discovering Statistics using SPSS, 3rd edition, by A. Field, Sage, 2009

3. Statistical Data Mining module website, by D. Stamate, 2012 (online teaching resources)

Further reading – see module description

© Daniel Stamate 2012