Student Performance Analysis

Section 1: Description

As a senior at the University of Maryland, College Park, who is currently studying for midterms, I am interested in analyzing data on student attributes and exam performance. Unfortunately, I was unable to find a data set specific to my university; however, I did find a data set on Kaggle covering the performance of high school students.

The data set comprises 1000 rows and 8 columns: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, and writing score. The columns hold 2 types of data: objects and integers. I initially assume that the features for future testing are gender, race/ethnicity, parental level of education, lunch, and test preparation course, all of which are categorical. The target variables are the math, reading, and writing scores, all of which are quantitative.

Section 2: Initial Plan for EDA

Section 3: Feature Engineering

The feature engineering for this data consisted of creating and applying a function that classifies the test scores into letter grades. No handling of null values was required because there were none. In a future analysis that builds a prediction model with linear regression or another applicable ML algorithm, the categorical variables will need to be encoded.
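A minimal sketch of such a grade-classifying function is shown here. The 90/80/70/60 cutoffs, the function name `to_letter_grade`, and the sample scores are assumptions for illustration; the cutoffs actually used in this analysis are not stated in this write-up.

```python
def to_letter_grade(score: int) -> str:
    """Map a 0-100 test score to a letter grade.

    The 90/80/70/60 cutoffs are an assumption, not taken from the analysis.
    """
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


# Hypothetical sample scores, for illustration only.
math_scores = [72, 69, 90, 47, 76]
math_grades = [to_letter_grade(s) for s in math_scores]
print(math_grades)
```

The same function would be applied to each of the three score columns to produce a math, reading, and writing letter grade per student.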

Checking for null values

There are no null values in the data.
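One way such a check could be done is sketched here with the standard library only. The helper name `count_missing` and the sample rows are hypothetical; the rows are shaped like the dictionaries that `csv.DictReader` yields, and an empty string or `None` is treated as missing.

```python
def count_missing(rows, columns):
    """Count missing entries (None or empty string) per column."""
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                missing[col] += 1
    return missing


# Hypothetical rows, shaped like csv.DictReader output.
rows = [
    {"gender": "female", "lunch": "standard", "math score": "72"},
    {"gender": "male", "lunch": "free/reduced", "math score": "47"},
]
missing = count_missing(rows, ["gender", "lunch", "math score"])
print(missing)
```

A count of zero for every column corresponds to the result reported above. In pandas, `df.isnull().sum()` gives the same per-column count.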

Section 4: Key Findings and Insight

Find the number of unique values per feature

Gender, lunch, and test preparation course each have 2 unique values. Race/ethnicity has 5 and parental level of education has 6.
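The count of unique values per categorical column can be sketched as follows. The helper name `unique_counts` and the three sample rows are hypothetical and far smaller than the real data set.

```python
def unique_counts(rows, columns):
    """Return the number of distinct values found in each column."""
    return {col: len({row[col] for row in rows}) for col in columns}


# Hypothetical rows, for illustration only.
rows = [
    {"gender": "female", "lunch": "standard"},
    {"gender": "male", "lunch": "standard"},
    {"gender": "female", "lunch": "free/reduced"},
]
counts = unique_counts(rows, ["gender", "lunch"])
print(counts)
```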

Unique values per categorical column

Data Visualization

Gender

According to the bar chart, there are more females (518) than males (482) in this data.

Race/ethnicity

According to the bar chart, the two largest race/ethnicity groups in the data are group C (319) and group D (262). Group A (89) is the smallest.

Parental Level of Education

According to the bar chart, the two largest groups for parental level of education are some college (226) and associate's degree (222). High school (196) and some high school (179) are sizable as well, while bachelor's degree (118) and master's degree (59) are much smaller.

Lunch

Most students have standard lunch (645), but a substantial share have free/reduced lunch (355).

Test Preparation Course

Most students have not taken a test preparation course (642).

Determining the mean, median, quantiles, and range for each test score
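The summary statistics named in this heading can be sketched with Python's `statistics` module. The helper name `summarize` and the sample scores are hypothetical; the real analysis would pass in each of the three score columns.

```python
import statistics


def summarize(scores):
    """Return mean, median, quartiles, and range for a list of scores."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "q1": q1,
        "q3": q3,
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
    }


# Hypothetical sample scores, for illustration only.
scores = [72, 69, 90, 47, 76, 71, 88, 40]
summary = summarize(scores)
print(summary)
```

Quartile conventions differ between tools, so the quartile values from this sketch may differ slightly from those reported by other software on the same data.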

Math Scores

Reading Scores

Writing Scores

Letter Grades vs Gender

For math grades hued by gender, the distributions of the two genders are similar, and both are negatively skewed. Females have noticeably more Fs than males; however, this may be partly because there are more females in the data set.
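One way to account for the difference in group size is to compare the share of each grade within a group rather than the raw count. A minimal sketch, with a hypothetical helper `grade_shares` and made-up data chosen so that one group has more Fs in absolute terms but the same share:

```python
from collections import Counter


def grade_shares(pairs):
    """Given a list of (group, grade) pairs, return {group: {grade: share}}."""
    group_sizes = Counter(group for group, _ in pairs)
    pair_counts = Counter(pairs)
    shares = {}
    for (group, grade), count in pair_counts.items():
        shares.setdefault(group, {})[grade] = count / group_sizes[group]
    return shares


# Hypothetical data: 3 of 6 females and 2 of 4 males received an F.
pairs = (
    [("female", "F")] * 3 + [("female", "C")] * 3
    + [("male", "F")] * 2 + [("male", "C")] * 2
)
shares = grade_shares(pairs)
print(shares["female"]["F"], shares["male"]["F"])
```

In this made-up example the female group has more Fs by count, yet the share of Fs is identical in both groups, which is the situation the caveat above describes.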

For reading grades hued by gender, the distributions of the two genders differ markedly. The male distribution is negatively skewed, and males have considerably more Fs. The female distribution appears roughly normal. Thus, females have a higher mean reading grade.

As with the reading grades, the distributions of writing grades hued by gender differ markedly. The male distribution is negatively skewed, and males have considerably more Fs. The female distribution appears roughly normal. Thus, females have a higher mean writing grade.

Letter Grades and Race/ethnicity

For math grades hued by race/ethnicity, the distributions for groups A, B, C, and D are negatively skewed, while group E has a roughly normal distribution. Thus, the mean math grade for group E is higher than for the other groups.

As with the math grades, the distributions of reading grades for groups A, B, C, and D are negatively skewed. Although still skewed, groups C and D appear closer to normal than they do for math. Group E has a roughly normal distribution. Thus, the mean reading grade for group E is higher than for the other groups.

For writing grades hued by race/ethnicity, groups A, B, and C are negatively skewed. Group D has a slight negative skew but appears nearly normal. Group E has a roughly normal distribution. Thus, the mean writing grade for group E is higher than for the other groups.
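The skew judgments above are read off the charts. As a numeric cross-check on the underlying scores, a moment-based skewness coefficient could be computed per group. This is a sketch only: the helper name `sample_skewness` and the sample values are hypothetical, and it applies to numeric scores rather than letter grades.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5.

    Negative values indicate a longer left tail, positive values a longer
    right tail, and values near zero a roughly symmetric distribution.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical scores: one low outlier produces a long left tail.
left_tailed = [40, 70, 75, 80, 85]
symmetric = [60, 70, 80]
print(sample_skewness(left_tailed))
print(sample_skewness(symmetric))
```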

Letter Grades and Parental level of Education

For math grades hued by parental level of education, all levels of education appear negatively skewed.

For reading grades hued by parental level of education, the high school and some high school levels are negatively skewed. For associate's degree and some college, the distribution is negatively skewed but closer to normal than for high school and some high school. For master's degree and bachelor's degree, the distributions appear normal. Thus, the mean reading grade for master's and bachelor's is higher than for the other groups.

As with the reading grades, for writing grades hued by parental level of education, the high school and some high school levels are negatively skewed. For associate's degree and some college, the distribution is negatively skewed but closer to normal than for high school and some high school. For master's degree and bachelor's degree, the distributions appear normal. Thus, the mean writing grade for master's and bachelor's is higher than for the other groups.
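The claims about group means are inferred here from the shape of the distributions. They could also be checked directly by averaging the numeric scores per group. A minimal sketch, with a hypothetical helper `mean_by_group` and made-up rows:

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_col, score_col):
    """Average score_col within each distinct value of group_col."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_col]].append(row[score_col])
    return {group: mean(scores) for group, scores in grouped.items()}


# Hypothetical rows, for illustration only.
rows = [
    {"parental level of education": "master's degree", "writing score": 80},
    {"parental level of education": "master's degree", "writing score": 70},
    {"parental level of education": "high school", "writing score": 60},
    {"parental level of education": "high school", "writing score": 50},
]
means = mean_by_group(rows, "parental level of education", "writing score")
print(means)
```

The same helper works for any of the categorical features (gender, race/ethnicity, lunch, test preparation course) paired with any of the three scores.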

Lunch vs Test Scores

For math grades hued by lunch type, both standard and free/reduced are negatively skewed, with free/reduced skewed more strongly.

For reading grades hued by lunch type, free/reduced is negatively skewed, while standard lunch appears roughly normal. Thus, the mean reading grade for standard lunch is higher than for free/reduced.

As with the reading grades, for writing grades hued by lunch type, free/reduced is negatively skewed, while standard lunch appears roughly normal. Thus, the mean writing grade for standard lunch is higher than for free/reduced.

Test Preparation Course vs Test Scores

For math grades hued by test preparation course, both the none and completed groups appear negatively skewed, with the none group skewed more strongly.

For reading grades hued by test preparation course, the none group is negatively skewed, while the completed group appears normal. Thus, the mean reading grade for students who completed a preparation course is higher than for those who did not.

As with the reading grades, for writing grades hued by test preparation course, the none group is negatively skewed, while the completed group appears normal. Thus, the mean writing grade for students who completed a preparation course is higher than for those who did not.

Section 5: Formulating Hypotheses

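Hypothesis 1: Males and females differ in math test performance (null hypothesis: no difference).

Hypothesis 2: Students who completed a test preparation course and students who did not differ in math test performance (null hypothesis: no difference).

Hypothesis 3: Students with standard lunch and students with free/reduced lunch differ in math test performance (null hypothesis: no difference).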

Section 6: Conducting Test for Significance

Hypothesis 1:

alpha > p-value. Testing the difference in math performance between males and females at the 5% significance level, we reject the null hypothesis: males and females perform significantly differently on math tests.

Hypothesis 2:

alpha > p-value. Testing the difference in math performance between students who completed a preparation course and students who did not at the 5% significance level, we reject the null hypothesis: the two groups differ significantly in performance on math tests.

Hypothesis 3:

alpha > p-value. Testing the difference in math performance between students with standard lunch and students with free/reduced lunch at the 5% significance level, we reject the null hypothesis: the two groups differ significantly in performance on math tests.
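The specific test used is not named in this write-up. As an illustration only, the decision rule above (reject the null hypothesis when the p-value is below alpha) can be sketched with Welch's two-sample t statistic. This sketch assumes a two-sample comparison of means and uses a normal approximation for the two-sided p-value, which is only reasonable for large groups such as those in this data set. The function name `welch_test` and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test(a, b):
    """Return (t, p) for Welch's two-sample comparison of means.

    The two-sided p-value uses a normal approximation to the t
    distribution, which is an assumption suited to large samples only.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


ALPHA = 0.05  # 5% significance level, as in the tests above

# Hypothetical math scores for two groups of 100 students each.
group_a = [60, 62, 64, 66, 68] * 20
group_b = [70, 72, 74, 76, 78] * 20

t_stat, p_value = welch_test(group_a, group_b)
reject_null = p_value < ALPHA
print(t_stat, p_value, reject_null)
```

With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs Welch's test with an exact t-distribution p-value and would be the more usual choice.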

Section 7: Future Analysis

Future analysis of this data would include building a prediction model using linear regression (to predict test scores) or logistic regression (to predict the probability of a student scoring within a certain interval of test scores). The categorical features would need to be encoded first.
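One common encoding for categorical features is one-hot encoding, sketched below with the standard library. The choice of one-hot encoding, the helper name `one_hot_encode`, and the sample rows are assumptions for illustration; the analysis does not specify an encoding scheme.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical rows, for illustration only.
rows = [
    {"lunch": "standard", "math score": 72},
    {"lunch": "free/reduced", "math score": 47},
]
encoded = one_hot_encode(rows, "lunch")
print(encoded)
```

For linear regression, one indicator per feature is usually dropped so that the indicators are not perfectly collinear with the intercept. In pandas, `pandas.get_dummies(df, columns=[...], drop_first=True)` does both steps.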

Section 8: Summary

In conclusion, this analysis of student test performance and student attributes uncovered several key findings. First, in the comparison of letter grade frequency hued by gender, females showed a higher mean letter grade than males in reading and writing, while in math the two distributions were roughly the same. Second, in the comparison of letter grade frequency hued by race/ethnicity, group E displayed a higher mean letter grade than all other groups on all three tests. However, we must consider the large differences in group size within the race/ethnicity column. Third, in the comparison of letter grades and parental level of education, students whose parents have a master's or bachelor's degree have higher averages in reading and writing, while in math the distributions were roughly the same across education levels. Fourth, in the comparison of letter grades and type of lunch, students with standard lunch have higher mean letter grades on the reading and writing tests. Fifth, in the comparison of letter grades and whether a student completed a preparation course, students who completed the course have higher mean letter grades in writing and reading. Lastly, for my inferential testing, because I noticed that the math grade distributions were consistently negatively skewed across all the features, I tested whether there was a significant difference in math test performance by gender, test preparation course, and lunch type. I concluded for all three tests that there is a significant difference in performance. For future analysis, I would request data on student GPA, tardiness, hours studied, and extracurriculars.