This project uses two open datasets from Kaggle. Our main objectives are data acquisition, data cleaning, and storage of the cleaned data in .sqlite format via a structured data pipeline. Along the way we performed in-depth exploratory data analysis (EDA). Before starting, make sure Python is installed, as all work is done in Jupyter Notebooks.
- Data Pipeline Creation: We first built a robust data pipeline that extracts the Kaggle data, transforms it, and stores it in .sqlite format.
- Test File Creation: To verify that the pipeline works, we wrote a test file that checks that the cleaned data is successfully created and stored as a .sqlite file in the output directory.
- Final Report: Our final report covers several steps:
  - Data Cleaning: We examined the datasets for missing or empty values (NaN) and removed them to ensure data integrity. We also identified and fixed data type inconsistencies.
  - Exploratory Data Analysis (EDA): After cleaning, we explored the dataset using statistical analysis and descriptive statistics to understand its underlying patterns and trends.
  - Data Visualization: We used plots, graphs, and charts to highlight significant trends, correlations, and anomalies in the data.
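The extract-transform-load flow described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes the Kaggle CSV has already been downloaded locally, and the function name, paths, and table name are placeholders.

```python
import sqlite3

import pandas as pd


def run_pipeline(csv_path: str, db_path: str, table: str) -> None:
    # Extract: read the locally downloaded Kaggle CSV.
    df = pd.read_csv(csv_path)
    # Transform: drop rows with missing/NaN values to keep the data consistent.
    df = df.dropna()
    # Load: write the cleaned table into a .sqlite database file.
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)
```

In the real pipeline, the extract step would first fetch the datasets from Kaggle (for example with the kaggle CLI) before reading the files, and the transform step would include the dataset-specific type fixes.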
-
We emphasize that each phase of this project was carried out individually, with dedicated attention to detail and thorough exploration of the dataset. Throughout, we aimed for accuracy, best practices in data analysis, and a cohesive, structured project execution.
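The EDA and visualization steps mentioned above can be sketched as below; the helper name and column layout are invented for illustration, and matplotlib is assumed to be available in the notebook environment.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside Jupyter
import matplotlib.pyplot as plt


def explore(df: pd.DataFrame, plot_path: str):
    # Descriptive statistics (count, mean, std, quartiles) per numeric column.
    summary = df.describe()
    # Pairwise correlations hint at linear relationships between columns.
    corr = df.corr(numeric_only=True)
    # Histograms of each numeric column highlight skew and outliers.
    df.hist(figsize=(8, 4))
    plt.tight_layout()
    plt.savefig(plot_path)
    plt.close("all")
    return summary, corr
```

In the notebooks, the same pattern extends to scatter plots and heatmaps for inspecting the correlations visually.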
During the semester you will need to complete exercises, sometimes using Python, sometimes using Jayvee. You must place your submissions in the exercises folder of your repository and name them after their number, from one to five: exercise<number from 1-5>.<jv or py>.
At regular intervals, exercises will be assigned as homework to complete during the semester. We will divide you into two groups: one completes an exercise in Jayvee, the other in Python, switching with each exercise. Details and deadlines will be discussed in the lecture; also see the course schedule. At the end of the semester, you will therefore have the following files in your repository:
./exercises/exercise1.jv or ./exercises/exercise1.py
./exercises/exercise2.jv or ./exercises/exercise2.py
./exercises/exercise3.jv or ./exercises/exercise3.py
./exercises/exercise4.jv or ./exercises/exercise4.py
./exercises/exercise5.jv or ./exercises/exercise5.py
We provide automated exercise feedback using a GitHub action (that is defined in .github/workflows/exercise-feedback.yml).
To view your exercise feedback, navigate to Actions -> Exercise Feedback in your repository.
The exercise feedback runs whenever you change files in the exercises folder and push your local changes to the repository on GitHub. To see the feedback, open the latest GitHub Action run, then the exercise-feedback job, then the Exercise Feedback step. You should see command line output like this:
Found exercises/exercise1.jv, executing model...
Found output file airports.sqlite, grading...
Grading Exercise 1
Overall points 17 of 17
---
By category:
Shape: 4 of 4
Types: 13 of 13