Final Project
These are specifications and guidelines for the final Data Science Project and the Project report, the assessment for the Data Science Tools (see Syllabus): How to write it, when and how to deliver it, and how it will be graded.
1 How to write a project report in a nutshell
- You find a team,
- pick a dataset,
- register your project,
- do some interesting question-driven data analysis with it,
- present a draft version at the end of the semester, and
- write up a well-structured and nicely formatted report about the analysis in your team’s repository.
2 Work in Teams and Register your Project
Projects should be done in teams of 2-3 students. This serves two additional learning goals:
- Learning to pursue and coordinate data science workflows collaboratively.
- Learning to work collaboratively on code with
gitand GitHub.
Single teams and teams with more the 3 members are possible upon request when there are reasons for it.
All members receive the same grade for it. However, the module grade can be different because of the bonus for fully complete homework. A dysfunctional team can be split upon request by individual members.
How to form a team and register your project:
- Form a team on your own with fellow students.
- Go through this guidelines together.
- Find a suitable dataset (see below).
- Come up with a few initial questions.
- All team members should write to the instructors.
One member should write to the Jan Lorenz and Armin Müller and provide
- all names of the team members,
- information about the dataset,
- the initial questions, and
- a short working title for the repository name.
All other members should write to the instructors for confirmation.
3 Advice for data science team work (with git)
- Select your language R or python!
- Put the data you need into a folder
data. For large data sets better not to commit and push it, but to document where the data can be retrieved and where it should be put to load it with your code.
- All team members need to clone the repository to their local computers.
- Try both:
- Working together at the same time with realtime communction.
- Working together remotely based on a plan who should do what. Feel free to use GitHub issues in your repository for this!
- Every team member should finally have pushed some work to GitHub!
- Commit early and often! Then we can help you when you have questions or need directions.
- Before you start to work as a team member: Always do
git pullfirst to receive potentially new content! - Try to finish an individual work session with comments on next steps and open questions. Push the files in a state such that it renders.
- Conflicts in git: Git conflicts will probably occur when you work in teams while pulling and pushing. These can be solved. If you do not manage, ask for help.
- Conflicts in the team. Please communicate about tensions in your team and try to solve them. However, you are not obliged to finish your project as a dysfunctional team. To avoid being bound to the same grade in a dysfunctional team, every team member has the right to leave the team and continue independently. This can only be realized upon prior communication with team members and consequently with the instructors!
4 Delivery
- Delivery is via GitHub in a private repository (the same way as Homework Projects).
- The repository for the final project will be created in the CU-F24-MDSSB-01-Concepts-Tools by the instructors upon team and project approval.
- The source file as well as the rendered html-file should be included.
- The assessment will start after the deadline with the latest commit.
- Assessment will be done based on reading the HTML-file. Additionally and if necessary, the Quarto Markdown File and other provided files may be looked at. Ideally, the HTML-Report is sufficient to assess the project! (To test if the HTML file looks good for the instructors do the following: Clone the repository anew to a temporaty directory on your computer, open the HTML file from this newly cloned directory. Check if it looks good. This is what instructors will do.)
5 Deadline
See Schedule
6 Grading
- The reports will be graded jointly by the instructor of Data Science Tools and Data Science Concepts.
- The presentation in class is informal and will not be graded.
- All team members receive the same grade for the project report. In case of a dysfunctional team, members can leave or split the team upon prior communication with the team and the instructors.
Assessment rubric
Project reports can be quite different depending on the data used and the type of question(s) (Descriptive, Exploratory, Inferential, Predictive, Causal, Mechanistic). Therefore, there will not be a fixed rubric. We will communicate back the final percentage grade (compare Constructor University’s Grading Table. Additionally, we aim to give short textual feedback about strong and weak points.
Nevertheless, the assessment goes along some criteria with some approximate weights:
- Introduction/Questions (10%)
- Data Handling/Preparation (10%)
- Modeling (20%)
- Comprehensiveness/Visualizations (20%)
- Results/Conclusion (20%)
- Programming style (10%)
- Document Structure and Layout (10%)
7 How to define a project?
Part of the work is to define a good project for question-driven data analysis.
Essentially there are two approaches to coming to a project setup:
- Find and select data that interests you, do some question-driven data exploration, and focus and narrow the question.
- Start with a question and search data to answer it.
You can build on topics and datasets of the Homework.
8 Guidelines for drafting your project
- Find a manageable dataset: at least 50 cases (rows), good is 10-20 variables (columns) with a mix of categorical and numeric. Deviations are allowed based on your interest and capability.
- Go through the visualizations, statistics, and models we had in lectures and homework and think if similar things would be interesting.
- The goal is not an exhaustive data analysis with every method! Do not calculate every statistic and procedure you have learned for every variable, but rather show that you can ask meaningful questions and answer them with results of data analysis and proficient interpretation and presentation of the results.
- Do NOT blindly do all visualization and all statistics on all variables in the data set! You may do that in your data exploration. But not all exploratory data analysis goes into your report. At some point, you have to focus on what is needed to answer your questions. Your report is not a documentation of all your exploratory steps. Throw out what is not needed anymore, improve the remaining visualizations, and explain well in the text what you want to communicate.
- A single high-quality visualization that shows a point clearly will receive a much higher appreciation than a large number of poor-quality visualizations without an explanation of what they should communicate.
9 Notes on Data Sources
There are various sources of data. Nevertheless, finding good data for a projects is always a challenge and what good data is depends also on what you want to do. Typical issues are confidentiality of business data and data privacy issues for user data. Here are some notes on some sources.
9.1 kaggle - Use with caution!
kaggle is a website for the machine learning and artificial intelligence community to practice and compete on new methods. They provide several data sets ready-made to test and compete on typical prediction tasks. When you browse kaggle and find something interesting, scrutinize it.
Find out the original source of the dataset! Many dataset on kaggle are artificial/synthetic but do show this only when you dig deep in the documentation or not at all. Sometimes there is neither a link to the original source, nor information how it was produced. Using such datasets for domain specific work is of no value unless you go into how such synthetic data can tell us something about the real world and that is way beyond the scope of the course. Such data may be OK for a purely methods-oriented but this is also not the main goal of the course.
10 Making your project public
By default, project repositories are private and only visible to team members and instructors. The repositories can be made public such that students can use it as part of their portfolio, for example for referring to it in applications for internships and student jobs. Please let the instructors know if you want to make the repository public.
Additionally, the HTML file can be made accessible directly as a webpage with a few tweaks from quarto using the service of GitHub Pages. See the documentation.
11 Frequently Asked Questions (may be extended)
Do we need to stick to methods and visualizations treated in lectures?
No, you are invited to use other data analysis and visualization methods (from other courses or packages which you self-learn)! We are happy to give advice if we can.
Are we only allowed to use one dataset?
No, you can also merge data from different sources. This is a more challenging project because the data wrangling work would be a bit more. If your question calls for it we encourage you to use another data source. The additional effort will be recognized.