Project

This project is to be completed individually.

Data Acquisition

Core Data Science Query

Code Development Task:

Develop a Python program that solves your query. The program (which may span multiple files) should include the following components:

Task 1: Data Engineering
  • Carry out descriptive analysis and data reduction tasks over the dataset. You are responsible for identifying the analyses relevant to your selected dataset and query. For example, for text data, analyses such as word count, sentence count, and sentiment analysis are relevant; for numerical data, statistical analysis is relevant.
  • The raw data is definitely larger than what you need. You will need to trim it down to create a smaller dataset and/or combine multiple data sources.
  • Write a narration describing how you implemented this task. The narration should not be as simple as "I did this, I did that …". Rather, it should include extensive justification, especially regarding data manipulation activities. Why did you select the specific implementation that you did? How did the dataset (size, attributes, quirks, etc.) influence your technical choices? It should also report the outcome of your data engineering activities (e.g., describe the resulting intermediate dataset). This narration will make up the Data Engineering sections of your Technical Report.
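As an illustration of the kind of descriptive analysis and data reduction described in Task 1, the following is a minimal sketch in plain Python. The records, column layout, and function names are hypothetical and stand in for whatever dataset you select; the sentence count is a crude approximation that only counts periods.

```python
import statistics

# Hypothetical raw records: (user_id, item_id, rating, review_text).
# These values are made up purely for illustration.
RAW = [
    ("u1", "i1", 5.0, "Great product. Works well."),
    ("u1", "i2", 3.0, "Okay."),
    ("u2", "i1", 4.0, "Good value. Would buy again."),
    ("u3", "i3", None, "No rating given."),
    ("u2", "i2", 1.0, "Broke after a day."),
]


def reduce_records(records):
    """Data reduction: drop rows with a missing rating and keep only
    the three columns the (hypothetical) query needs."""
    return [(u, i, r) for (u, i, r, _text) in records if r is not None]


def describe_ratings(reduced):
    """Descriptive statistics for the numerical rating column."""
    ratings = [r for (_u, _i, r) in reduced]
    return {
        "count": len(ratings),
        "mean": statistics.mean(ratings),
        "median": statistics.median(ratings),
        "stdev": statistics.stdev(ratings),
    }


def text_stats(records):
    """Simple text statistics: whitespace-separated word count and a
    rough sentence count based on the number of periods."""
    texts = [text for (_u, _i, _r, text) in records]
    return {
        "words": sum(len(t.split()) for t in texts),
        "sentences": sum(t.count(".") for t in texts),
    }


reduced = reduce_records(RAW)
rating_summary = describe_ratings(reduced)
text_summary = text_stats(RAW)
```

Figures such as the number of rows before and after reduction, or the rating summary, are the kind of outcome the narration is asked to report when describing the intermediate dataset.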
Task 2: Data Analytic
  • Implement the necessary Spark-based code to resolve your query. You should use one or more of the techniques that we have covered in class this semester (HITS, frequent itemsets, similarity/LSH, clustering, recommendation system).
  • Write a narration describing how you implemented this task. The narration should not be as simple as "I did this, I did that …". Rather, it should include extensive justification, especially regarding data manipulation activities. Why did you select the specific implementation that you did? How did the dataset (size, attributes, quirks, etc.) influence your technical choices? It should also describe the final outcome of your query and the non-technical insights drawn from the added value achieved through this query. This narration will make up the Data Analytic sections of your Technical Report.
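To make one of the listed techniques concrete, here is a minimal local sketch of frequent-itemset mining for pairs, following the two-pass A-Priori idea. It is plain Python rather than Spark code, intended only as a way to prototype the logic on a small sample before porting it; the baskets, the support threshold, and the function name are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, made up for illustration only.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_pairs(baskets, min_support):
    """Two-pass A-Priori for item pairs.

    Pass 1 counts single items and keeps those meeting min_support.
    Pass 2 counts only pairs whose members are both frequent, which
    is the pruning step that makes A-Priori cheaper than counting
    every possible pair.
    """
    # Pass 1: count each item across all baskets.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {
        item for item, count in item_counts.items() if count >= min_support
    }

    # Pass 2: count candidate pairs built from frequent items only.
    # Sorting gives each pair a canonical (a, b) order with a < b.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(basket & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {
        pair: count
        for pair, count in pair_counts.items()
        if count >= min_support
    }


result = frequent_pairs(BASKETS, min_support=3)
```

In a Spark version, each pass would typically be expressed as a map-and-aggregate step over the distributed baskets (for example `flatMap` followed by `reduceByKey` on an RDD), with the set of frequent items shared between passes; whether that layout suits your dataset is part of the justification the narration asks for.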

Carry out a search on Google Scholar and identify prior work that either analyzes the same datasets or solves the same query type. Write about your results using proper IEEE citation standards. This will form the Related Work section of your Technical Report. Feel free to adapt a previously known approach in your query.

Technical report:

The general requirements for the technical report are as follows:

Submission requirements

Failure to adhere to details in this project description will result in points taken off.