MA/MSc Digital Sociology

Digital Social Research Methods:
Computational Statistics and Data Mining (CSDM)

Module lecturer: Dr Daniel Stamate, Department of Computing

Module Description

Weekly session material:

10/02/12

Lecture 1 Introduction to Statistics: histograms, normal distribution, skewness, kurtosis, mean, mode, range, quartiles, interquartile range.

Lecture 2 Samples, statistical models, deviation from mean, variance, standard deviation, confidence intervals + applications.

Introductory demo with IBM SPSS Statistics.

16/02/12

Finishing Lecture 2

Lecture 3 Exploring data with graphs.

Demo with IBM SPSS Statistics

17/02/12

Lab/Seminar 1 Applications on real datasets with software, and exercises on estimating proportions with 95% and 99% confidence intervals.
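
As an illustration of the kind of exercise covered in this lab, the following is a minimal Python sketch (not part of the module's SPSS material) of a confidence interval for a proportion. It assumes the normal-approximation (Wald) formula, p ± z·sqrt(p(1 − p)/n), with z ≈ 1.96 for 95% and z ≈ 2.576 for 99%. The function name `proportion_ci` and the survey figures are hypothetical.

```python
import math

# Two-sided critical values of the standard normal distribution
Z = {0.95: 1.96, 0.99: 2.576}

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n                      # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of p_hat
    margin = Z[level] * se                     # half-width of the interval
    # Clamp to [0, 1], since a proportion cannot fall outside that range
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical survey: 120 of 400 respondents answer "yes"
for level in (0.95, 0.99):
    low, high = proportion_ci(120, 400, level)
    print(f"{level:.0%} CI: [{low:.3f}, {high:.3f}]")
```

The 99% interval is wider than the 95% interval for the same data: higher confidence is paid for with less precision.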

08/03/12

Lecture 4 Measuring association in data: covariance, correlation coefficient. Statistical models: linear regression.

Lecture 5 Introduction to Data Mining. Decision trees: rules, classification, and evaluating the model accuracy. Clustering.

Demo on SPSS clustering algorithms with census data.

09/03/12

Lab 2 Correlation analysis and linear regression.
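
For readers who want to see the quantities behind this lab outside SPSS, here is a minimal Python sketch of sample covariance, the Pearson correlation coefficient and a one-predictor least-squares regression. The function names and the small dataset are made up for illustration; the lab itself uses SPSS on real datasets.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation coefficient r, between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    slope = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return intercept, slope

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
intercept, slope = linear_regression(x, y)
print(f"r = {r:.3f}, y = {intercept:.2f} + {slope:.2f} * x")
```

The slope is the covariance of x and y scaled by the variance of x, which is why a correlation analysis and a simple regression on the same two variables always agree on the direction of the relationship.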

Demo on SPSS decision tree algorithms, using census data for training and a new dataset for scoring/classification (see Tutorial 4).

Video Tutorials - Applications with IBM SPSS Statistics software:

Tutorial 1 Tutorial 2 Tutorial 3 Tutorial 4

Optional complementary session (21/03/12; material not to be assessed)
Introducing additional leading quantitative analysis software with demonstrations on:

  1. Data analysis on a world development indicators dataset: charts, variable correlations, and building linear regressions with women's literacy rate, medical expenditure and doctor rate as predictors and infant mortality rate as the dependent variable.

  2. Data mining by building decision trees on US census data to understand the profile of people earning less or more than a threshold of USD 50K.

  3. Text and web mining applied to sentiment analysis, classifying film reviews as positive or negative (other possible similar applications: automatically classifying readers' online feedback on articles into given categories). Clustering UK Computer Science departments' webpages (other possible similar applications: grouping online news or media output into categories based on content similarity).

Homework (not compulsory): available here.

Software to use in the labs:

IBM SPSS Statistics version 19; you are entitled to use a College-licensed copy (obtainable from here) on your home computer as well, for academic purposes.

Reading list:

1. Discovering Statistics using SPSS, 3rd edition, by A. Field, Sage, 2009

2. Data Mining: A Tutorial-Based Primer, by R. Roiger et al., Addison Wesley, 2002

3. Computational Statistics and Data Mining module website, by D. Stamate, 2012 (online teaching resources)

Further reading – see module description

© Daniel Stamate 2012