CIS338: Data Mining 

Lecturer: Dr Daniel Stamate

Announcement Board

Course Description

Assignment term 2

Re-sit/Summer Assignment

Lectures, labs (marked "Lab:") and optional homework:

Week 1 (11 Jan - )

Course presentation and Data Mining applications

Data Mining: Introduction pdf
Read Chapter 1 of Roiger's DM book (supplemented by Han's DM book) - see course presentation

Optional Homework: Search for the entry Data Mining in Wikipedia, and read about its notable uses.

Week 2 (18 Jan - )

Data Mining Strategies pdf
Read Chapter 2
 

Lab: Diagnosing patients using Data Mining

Optional Homework: Explore KDnuggets, one of the most popular websites with various information and resources related to Data Mining (see in particular DM software and suites). 

Week 3 (25 Jan - )

Data Mining with Weka: a tutorial 

Lab: Practice Data Mining with Weka - classification with Decision Trees and the Nearest Neighbour. Application in medical diagnoses and research

Optional Homework: Explore profiles of jobs in Data Mining on KDnuggets website.

Week 4 (1 Feb - )

Data Mining Techniques pdf [+ demo with Weka on clustering and association analysis using datasets CCP_associations.csv and  kmeans.arff]
Read Chapter 3 (Section 3.4: Genetic Learning, is optional)

Lab: Data Mining application in Customer Analytics

Optional Homework: Read about Text Mining.

Week 5 (8 Feb - )

Data Mining Techniques (continued)

Lab: Continue work from previous week

Optional Homework: Find a dataset in the UCI KDD dataset repository that you would be interested in mining. (The datasets are classified by the practical problems they were created to solve.)

Week 6 (15 Feb - )

Reading week

Week 7 (22 Feb - )

Data Mining Techniques (continued)

Lab: Association analysis

Optional Homework: Read about IBM SPSS Modeler (formerly Clementine, see additional resources entry below)

Week 8 (1 Mar - )

Knowledge Discovery in Databases  pdf
Read Chapter 5

Lab: Work on the assignment

Optional Homework: Read about SAS Enterprise Miner (see additional resources entry below)

Week 9 (8 Mar - )

Data Mining with Neural Networks pdf [+ demo] 
Read Chapter 8 (Section 8.5 is optional)

Lab: Clustering numeric datasets

Optional Homework: Read about RapidMiner (see additional resources entry below)

Week 10 (15 Mar - )

Statistical Techniques pdf [bring laptops with you for practice in class on:
- Linear Regression: Office buildings dataset (output: value) and Portuguese wine dataset (output: quality)
- Estimation/prediction using Regression Trees: Deer hunter dataset (output: Yes)
- Logistic regression: Credit card promotions dataset (output: LIPromotion) and Breast cancer dataset (output: Class)
- Naive Bayesian classification: Credit card promotions dataset (output attribute: Sex)
- Clustering using the EM technique: Iris plants dataset
- Conceptual clustering: Iris plants dataset]
Read Chapter 10

Lab: Apply the Knowledge Discovery in Data process with Weka's KnowledgeFlow Environment: Predictive Analytics for the Customer Retention problem

Optional Homework: Read about Music Data Mining. See in this paper how Data Mining, in particular Weka with the C4.5 decision tree building algorithm (J48), could be used in the Automatic Music Classification problem.

Week 11 (22 Mar - )

Data Warehousing  pdf
Read Chapter 6 (Sections 6.1 and 6.4 are optional)

Lab:  Finish work from the previous week

Optional Homework: Explore this website with useful/practical information about Data Warehousing

Past exams: Recent exam papers are available here; see the student intranet for previous papers.

Revision week: Production rules and classifier evaluation

Note: Where a reference to a book (chapter, section, exercise, etc.) does not give the title explicitly, assume it is Roiger's book. See the essential titles below (Reading list).

Lab software to be used

- Java coding: javacis338.zip (library for handling datasets) and Java online tutorial
- Data Mining/Machine Learning software:
  Weka: lab working software (free download & documentation website); you are advised to install the Weka book version 3.4.14 on your laptops for working at home and/or running demos in the lectures (this ensures full compatibility with the lab material). A software presentation can be found here.

Optional Java coding tasks: (you may try one of these, in particular if you have finished the lab work in a session)
T1 Code the 1R algorithm in Java, with dataset.
T2 Implement a Bayesian classifier.
T3 Implement the C4.5 algorithm.
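
As a possible starting point for T1, here is a minimal sketch of 1R for nominal attributes. It is an illustration only, not the course's reference solution: the class name OneR, its method names and the small weather-style toy data are all assumptions made for this example, and it does not use the javacis338 library. The class label is assumed to be the last column of each row.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal 1R (One Rule) sketch; all names and data here are hypothetical. */
public class OneR {

    /** For one attribute, map each of its values to the most frequent class. */
    public static Map<String, String> ruleFor(String[][] rows, int attr) {
        int cls = rows[0].length - 1; // assumption: class label is the last column
        // attribute value -> (class label -> count)
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] row : rows) {
            counts.computeIfAbsent(row[attr], k -> new HashMap<>())
                  .merge(row[cls], 1, Integer::sum);
        }
        Map<String, String> rule = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            // Ties between classes are broken arbitrarily.
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            rule.put(e.getKey(), best);
        }
        return rule;
    }

    /** Number of training rows that the given rule misclassifies. */
    public static int errors(String[][] rows, int attr, Map<String, String> rule) {
        int cls = rows[0].length - 1;
        int wrong = 0;
        for (String[] row : rows) {
            if (!rule.get(row[attr]).equals(row[cls])) {
                wrong++;
            }
        }
        return wrong;
    }

    /** Index of the attribute whose rule makes the fewest training errors. */
    public static int bestAttribute(String[][] rows) {
        int best = 0;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < rows[0].length - 1; a++) {
            int e = errors(rows, a, ruleFor(rows, a));
            if (e < bestErrors) {
                best = a;
                bestErrors = e;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented toy data: outlook, windy, play (class).
        String[][] data = {
            {"sunny", "false", "no"},
            {"sunny", "true", "no"},
            {"overcast", "false", "yes"},
            {"overcast", "true", "yes"},
            {"rainy", "false", "yes"},
            {"rainy", "true", "no"},
        };
        int a = bestAttribute(data);
        System.out.println("Best attribute index: " + a);
        System.out.println("Rule: " + ruleFor(data, a));
    }
}
```

The sketch only handles nominal values and evaluates rules on the training data; numeric attributes would need discretising first, which is left out here.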

Reading list
    1. [Lecture] Richard Roiger and Michael Geatz "Data Mining, a tutorial-based primer", Addison Wesley, 2003
    2. [Lecture] Jiawei Han and Micheline Kamber "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2006 (2000 edition can also be used)
    3. [Lab] Ian Witten and Eibe Frank "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, 2005
    + additional titles in Course Description

Additional resources:
    Datasets for mining
        the ARFF format used by Weka
        various dataset archives
        UCI KDD dataset repository
        other dataset sources
    Connect Weka to databases
        download instructions and files
    Other Data Mining software:
        RapidMiner: free download & documentation website
        IBM SPSS Modeler (formerly SPSS Clementine) plus demo
        SAS Enterprise Miner plus documentation
    Various links
        KDnuggets (Data Mining, Knowledge Discovery, Genomic Mining, Web Mining)
        KDNet (Information on data mining and knowledge discovery)

Site maintained by Daniel Stamate. Updated frequently.