Diving into the data

Nov 15, 2020

As part of a group project, I used machine learning models to analyze how well students learned during the Covid-19 quarantine period. I looked at what students did to learn effectively and what they could do to improve their learning experience. In this blog post, I will go into depth on the data behind our results.

The dataset used in this project came from a survey given to Vietnamese students. It included 40 features describing the students’ environments, resources, engagement time, progress, and more. After the dataset was cleaned of outliers, duplicates, and missing data, it included over 400 observations (each one a surveyed student), which helped with our machine learning models’ accuracy.
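A minimal Python sketch of a cleaning pass like the one described above, offered as an illustration and not as the project's actual code. The function name `clean_rows`, the column names, the z-score rule for outliers, and the cutoff value are all assumptions; the post does not say how outliers were defined.

```python
from statistics import mean, stdev


def clean_rows(rows, numeric_key, z_cutoff=3.0):
    """Drop rows with missing values, exact duplicates, and z-score outliers.

    `rows` is a list of dicts (one per survey response). The z-score rule
    and `z_cutoff` are assumptions made for this sketch.
    """
    # 1. Drop rows that contain any missing value (None or empty string).
    complete = [
        r for r in rows
        if all(v is not None and v != "" for v in r.values())
    ]

    # 2. Drop exact duplicates, keeping the first occurrence in order.
    seen, unique = set(), []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Drop outliers on one numeric column using a z-score cutoff.
    values = [r[numeric_key] for r in unique]
    if len(values) < 2:
        return unique
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return unique
    return [r for r in unique if abs(r[numeric_key] - mu) / sigma <= z_cutoff]


# Made-up toy responses; "hours" is a hypothetical engagement-time column.
sample = [
    {"id": 1, "hours": 2},
    {"id": 1, "hours": 2},     # duplicate
    {"id": 2, "hours": None},  # missing value
    {"id": 3, "hours": 3},
]
print(clean_rows(sample, "hours"))
```

On a real survey, the order of these steps and the outlier rule would depend on the questions asked, so treat the defaults here as placeholders.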

In our exploratory data analysis, the initial 40 features were reduced to the 11 features that correlated most strongly with the dependent variable, assured learning progress. We also created a binary assured progress variable for clearer data. This means that the responses were divided into either agree or disagree instead of ranging from strongly disagree to strongly agree.
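The two steps above, ranking features by correlation with the target and collapsing the responses into agree or disagree, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the helper names are hypothetical, Pearson correlation is assumed as the measure, and the responses are assumed to be coded 1 (strongly disagree) to 5 (strongly agree) with neutral answers counted as disagree.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def top_k_features(columns, target, k):
    """Return the k feature names with the largest |correlation| to target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def binarize_likert(score, threshold=4):
    """Map a 1-5 response to 1 (agree) or 0 (disagree).

    The 1-5 coding and the threshold of 4 are assumptions for this sketch.
    """
    return 1 if score >= threshold else 0
```

With the survey loaded as one list per column, something like `top_k_features(columns, target, k=11)` would return the eleven most correlated feature names, and `binarize_likert` would be applied to each value of the progress column.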

Visualization is a key step in creating an effective machine learning model. Using ggplot, a helpful tool for creating graphs in R, we were able to create several graphs that helped us explore our data.

**Distribution of the Dependent Variable**
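The numbers behind a distribution plot like this are simply the count and share of each response. As a rough Python sketch (the project's graphs were made with ggplot in R, and the labels and data below are made up for illustration), the tally and a plain-text bar chart could look like this:

```python
from collections import Counter


def distribution(labels):
    """Return {label: (count, share)} for a list of categorical responses."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.items()}


def text_bar_chart(labels, width=40):
    """Render the distribution as one text bar per label, sorted by label."""
    lines = []
    for label, (count, share) in sorted(distribution(labels).items()):
        bar = "#" * round(share * width)
        lines.append(f"{label:<10}{bar} {count} ({share:.0%})")
    return "\n".join(lines)


# Hypothetical responses for the binary assured progress variable.
responses = ["agree", "agree", "disagree", "agree"]
print(text_bar_chart(responses))
```

Checking this tally first is a quick way to see whether the two classes are balanced before training a model on them.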

In my next post, I will talk about the machine learning models we used in this project to explore how students learn best during Covid-19.

Ryan Tietjen is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.

Written by Ryan Tietjen
