Examining the Validity of ETS’ Educational Proficiency Profile and AAC&U’s VALUE rubrics to assess GenEd Skills of UD undergraduates

A Comparison of Assessing General Education Goals with Freshmen and Senior Students by implementing the Abbreviated Educational Proficiency Profile compared to the Adapted American Association of Colleges and Universities Valid Assessment of Learning in Undergraduate Education (VALUE) Rubrics.
FALL 2011
by:
Kathleen Langan Pusecker M.S.
Director of the Office of Educational Assessment
Manuel Roberto Torres PhD.
Research Analyst, Office of Educational Assessment
Iain Crawford PhD., Associate Professor, English Interim Chairperson
Delphis Levia PhD., Associate Professor, Geography
Donald Lehman EdD., Associate Professor, Medical Technology
Gordana Copic MS., Graduate Student

Assessment of student learning can influence approaches to teaching and learning in higher education. Valid and reliable assessments of expected learning outcomes should reflect the academic standards set forth by the faculty and the department.
As accountability has entered the discussion of assessment, alongside learning improvement, institutions of higher education are increasingly required to report their measures of student learning for accreditation and accountability purposes. The University of Delaware (UD), like any institution of higher education receiving federal monies, is required to measure student learning not only for the purpose of learning improvement but also to satisfy accountability and accreditation bodies such as the Middle States Commission on Higher Education and the Voluntary System of Accountability (VSA).
The VSA was initiated in 2007 by public 4-year universities to supply comparable information on the undergraduate student experience. Participants are required to use one of three standardized tests to assess the core General Education (GenEd) skills of critical thinking, reading, writing, and mathematics (quantitative reasoning).[1] The three standardized tests accepted by the VSA are the Educational Testing Service’s (ETS) Educational Proficiency Profile (EPP); the Collegiate Learning Assessment (CLA), a product of the Council for Aid to Education (CAE); and the Collegiate Assessment of Academic Proficiency (CAAP), produced by ACT. The UD Office of Educational Assessment (OEA), in consultation with the Associate Provost of Institutional Effectiveness, selected ETS’ EPP as the standardized test to evaluate UD students’ GenEd competencies. Advocates of standardized tests suggest that the best way to assess student learning is through universal, unbiased measures of student and school performance. However, critics of standardized testing argue that these tests fail to accurately assess students’ knowledge and school performance. Furthermore, standardized tests may introduce a bias against underserved populations and do not reliably measure school quality (Beaupre', Nathan, & Kaplan, 2002).
The purpose of this study was to examine the process and results of assessing student learning with the VSA-advocated EPP standardized test, and to compare them with the results of assessing students’ work using the Association of American Colleges and Universities (AAC&U) Valid Assessment of Learning in Undergraduate Education (VALUE) rubrics. The VALUE rubrics were developed in coordination with AAC&U members at institutions required to document the quality of student learning. The purpose of the VALUE project was to define core learning goals and create assessment tools that institutions could modify to obtain useful information about the quality of student learning on the essential outcomes of undergraduate education that are most challenging to evaluate with standardized tests. Questions of cost and validity are also addressed.
Costs
The costs of implementing the ETS EPP to examine the GenEd competencies of UD undergraduates were far-reaching.[2] ETS charged $7,000 for the Scantron tests and test booklets alone and was responsible for processing the tests and reporting the scores. The scores that ETS reported did not include a comparison between freshman and senior scores, so OEA conducted a statistical analysis of the output that ETS provided. Table 1 lists the costs incurred for the EPP, and Table 2 lists the costs involved in using the VALUE rubrics to assess the GenEd competencies of UD undergraduates; some costs carry dollar values and some do not.
Table 1: EPP Costs

Table 2: VALUE Rubric Costs

Recruiting students to take the EPP cost OEA staff time, and the results obtained with this standardized test possess little validity. Motivating students to perform their very best on a standardized test presents unique challenges, and students who volunteer to complete standardized tests introduce a self-selection bias into the assessment process. Despite these challenges, OEA implemented the standardized test by recruiting freshman students in a variety of First Year Seminar courses and senior students in a variety of major-related Capstone experience courses. In addition, to recruit more seniors, open advertisements were placed in large classroom buildings around campus requesting senior students to participate in the assessment. Candy was the only incentive offered to students after the test. The artifacts used for the GenEd study with the VALUE rubrics came out of the EPP administration on campus: students who completed the EPP test were asked to upload work that they felt showed they had attained the skills of critical thinking, writing, and quantitative reasoning, so that they could be entered into a competition to win an iPad2. Asking students to supply their best work that had previously been graded in courses allowed OEA to evaluate authentic evidence of students’ acquisition of GenEd skills.
Validity
In addition to assessing the GenEd goals, OEA wanted to examine whether the EPP was valid, so that results could be used to inform curricular and resource decisions to improve UD students’ GenEd competency levels. A widely accepted definition of validity is the extent to which a test measures what it claims to measure. If a test is valid, its results can be interpreted and applied. Because of the challenges of implementing the standardized EPP, and because the EPP provides little information about the quality of students’ competencies on the core GenEd goals, the OEA engaged three Faculty Assessment Scholars to evaluate authentic work samples that the students themselves considered good evidence that they had attained the specific GenEd goals of writing, quantitative reasoning, and critical thinking.
Implementation
The EPP, which assesses four core skill areas (critical thinking, reading, writing, and mathematics), was given to 196 freshmen and 121 seniors at the University of Delaware (UD) in fall 2010 by the OEA. The completed EPPs were sent to ETS to be scored, and the results were provided to the OEA. For each student who took the test, a scaled score was provided for each of the four core skill areas, as well as an aggregated individual score. To test whether the differences between the two groups were significant, OEA conducted independent-samples t-tests on these scaled scores. An independent-samples t-test determines whether the difference between the means of two samples made up of different people is statistically significant.
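To make the comparison concrete, the following is a minimal sketch of how the two common forms of the independent-samples t statistic are computed. It is not the OEA's analysis, which was run in SPSS; the function names and the score lists are hypothetical and used for illustration only. Turning a t statistic into a p value also requires the t distribution, which this sketch does not cover.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Student's t statistic and degrees of freedom, assuming equal variances."""
    na, nb = len(a), len(b)
    # Pooled sample variance, weighted by each group's degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical total scaled scores, for illustration only (not UD data).
freshmen = [440, 445, 450, 455, 460]
seniors = [450, 458, 462, 470, 475]

t_pooled, df_pooled = pooled_t(seniors, freshmen)
t_welch, df_welch = welch_t(seniors, freshmen)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances, as can happen with 196 freshmen and 121 seniors, the two forms can give different answers.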
ETS also supplied criterion-referenced scores (proficiency classifications), which measure the level of proficiency obtained on a certain skill set. The ETS Proficiency Profile provides nine criterion-referenced scores: Mathematics (Level 1, Level 2, Level 3), Writing (Level 1, Level 2, Level 3), and Reading (Level 1, Level 2, Level 3/Critical Thinking).
The scores indicate the percentages of students at the Proficient, Marginal, and Non-proficient score levels. The Marginal level is not helpful, in that it indicates only that ETS could not determine whether the student scored at the proficient or the non-proficient level.
When compared with the average proficiency levels of the 16,878 students in the ETS 2010 sample from Research I doctoral institutions, UD seniors scored higher on all proficiency levels. It is important to note that ETS provided mean scores only for the long version of the test, which it refers to as the “Proficiency Profile Total Test,” whereas UD implemented the abbreviated version of the EPP, for which ETS did not provide mean scores. The comparisons between UD students and Research I students are therefore made on the results of a slightly different test, although ETS states that the abbreviated test provides results on the core skills equivalent to those of the longer version.[3]
UD Student Scores Compared to the National Average of Doctoral Research Universities
Figure I: Seniors EPP Scores UD v. Research I

UD freshmen who took the abbreviated EPP, when compared with the average proficiency levels of freshmen in the ETS 2010 sample from Research I doctoral institutions who took the long version of the EPP, also scored higher on the Reading, Writing, and Critical Thinking Level 1 proficiency levels, the Writing and Math Level 2 proficiency levels, and the Writing Level 3 proficiency level.
Figure II: Freshmen EPP Scores UD v. Research I

T-tests were performed using SPSS statistical software on the total scaled score, critical thinking scaled score, reading scaled score, math scaled score, and writing scaled score for these two groups. The total scaled scores range in value from 400 to 500, while the individual scaled scores (critical thinking, reading, math, and writing) range in value from 100 to 130. A variable for year was created (freshmen coded as 1.00 and seniors coded as 4.00) to link scaled scores to students by their year in school, so that OEA could compare scores on the EPP by group. It must be noted that UD seniors who took the EPP had higher scaled scores than the national average for seniors at other Research I institutions who took the EPP (Figure I).
The independent-samples t-test was conducted to examine whether the differences in means were significant or possibly the result of random error. The mean of each scaled score was higher for seniors than for freshmen (Table 5). When examining the results of the t-test, we first refer to “Levene’s Test for Equality of Variance” (an inferential statistic), which tests the homogeneity of variance assumption (that variance is similar across samples) by verifying whether the two groups have equal variance on the dependent variable. If Levene’s test is significant (a significance value of .05 or smaller), the two variances are significantly different. If it is not significant (a significance value above .05), equal variance can be assumed. Essentially, Levene’s test examines whether the spread of scores differs between the two groups.
The t-test output is split across two rows, so there are two possible t-tests to evaluate. If the significance level for Levene’s test is .05 or below, the “equal variances not assumed” test (the bottom row) is used to determine whether the difference in means between freshmen and seniors is statistically significant. If the value is above .05, the t-test on the top row (“equal variances assumed”) is examined instead. Of the five scores examined with Levene’s test, only the reading and writing scaled scores were not significant (Table 4); therefore, for those two scores one should examine the top row, “equal variances assumed.” Levene’s test was significant for the total scaled score, the critical thinking scaled score, and the mathematics scaled score; thus, for those scores we look at the “equal variances not assumed” row of the t-test. Examining the two-tailed significance level for each scaled score along its respective row, each was significant at the .05 level (each p value was reported as .000 in Table 3, that is, p < .001).
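The procedure described above can be sketched in two parts: computing Levene's statistic, and then choosing which row of the t-test output to read. This is a minimal illustration, assuming the mean-centered form of Levene's test; the function names, score lists, and p values below are hypothetical and are not taken from Tables 3 to 5. Converting the W statistic into a p value requires the F distribution with (k - 1, N - k) degrees of freedom, which the sketch leaves to statistical software.

```python
from statistics import mean


def levene_w(*groups):
    """Mean-centered Levene W statistic for two or more groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each score from its own group mean.
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # Variation of the deviations between groups and within groups.
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


def t_test_row(levene_p, alpha=0.05):
    """Return which row of the two-row t-test output should be read."""
    if levene_p <= alpha:
        return "equal variances not assumed"
    return "equal variances assumed"


# Hypothetical scores: same spread in the first pair, different in the second.
w_same = levene_w([1, 2, 3], [11, 12, 13])
w_diff = levene_w([1, 2, 3], [0, 10, 20])
print(w_same, w_diff)

# Hypothetical Levene p values, for illustration only.
for score, p in {"reading": 0.40, "mathematics": 0.01}.items():
    print(score, "->", t_test_row(p))
```

A larger W means the groups differ more in spread. The decision function only encodes the .05 cutoff described in the text; it does not replace the significance test itself.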
For each scaled score, the difference in means between UD freshmen and seniors who took the EPP was significant. Seniors had higher scores in the four core areas assessed by the EPP and in their overall scaled score.
VALUE Rubrics Assessment
In summer 2011, three University of Delaware faculty members were hired by the OEA to evaluate the GenEd competencies of critical thinking and quantitative reasoning. These “Faculty Assessment Scholars” (FASs) participated in a comparative study (freshmen and seniors) and used two AAC&U VALUE[4] rubrics adapted by the OEA, one for assessing critical thinking (CT) skills (the AAC&U Inquiry and Analysis Rubric) and one for assessing quantitative reasoning (QR) skills. The FASs examined artifacts voluntarily submitted by the same freshman and senior undergraduate students who had completed the EPP at the University of Delaware during fall 2010.
A smaller group of the EPP freshmen and seniors provided artifacts to demonstrate their GenEd competencies, and the OEA selected artifacts to be assessed based on whether each artifact contained enough information to be evaluated with the VALUE rubric. Students who completed the EPP and then uploaded work samples demonstrating the areas of interest to OEA (critical thinking and quantitative reasoning) were entered into a drawing to win a new iPad2.[5]
A total of 10 freshman CT artifacts, 10 senior CT artifacts, and 10 senior QR artifacts were evaluated by the FASs. It is important to note that only 3 QR artifacts were submitted by freshmen, and none of those contained information that could be evaluated using the QR rubric. This may be because the freshmen did not understand what QR is, or had not had enough opportunity by the end of their fall semester to select a QR artifact. The rubrics were replicated in the Qualtrics Survey System so that the FASs could score students’ work and the captured data could then be examined by the OEA. In Qualtrics, 79 surveys were started and 79 completed for the Critical Thinking AAC&U VALUE rubric adapted by OEA, and 31 were started and 31 completed for the Quantitative Reasoning rubric adapted by OEA.
The FASs utilized the QR rubric adapted by OEA to assess ten artifacts from undergraduate seniors; the tables below illustrate the ratings given according to each criterion in the rubric. The data in the charts below reflect the output obtained in Qualtrics from developing a report for each rubric.
Table 6 illustrates the ratings given on each criterion of this rubric.[6] The 10 QR artifacts were rated on a scale of 1 to 4 on the six criteria in this rubric. The table shows the percentage of senior artifacts that received each possible rating from 1 to 4. The mean rating for seniors on each QR criterion is also provided.
The UD assessment scholars used the Inquiry and Analysis VALUE rubric adapted by OEA to assess the GenEd competency of Critical Thinking (CT). FASs reviewed 20 artifacts from freshmen and seniors as part of a comparative analysis of General Education competencies among a sample of university undergraduates. The 20 artifacts were rated on a scale of 1 to 4 on the six criteria in this rubric. Table 7 shows the percentage of freshman and senior artifacts that received each possible rating from 1 to 4 on each criterion of the Inquiry and Analysis AAC&U VALUE rubric adapted by OEA. The mean rating of freshman and senior artifacts on each criterion is provided, as well as the gains achieved among seniors.
Inter-rater Reliability Testing
The inter-rater reliability amongst the three FASs was examined utilizing the intraclass correlation coefficient.[7] This test is a “general measurement of agreement or consensus…the Coefficients represents agreements between two or more raters or evaluation methods on the same set of subjects” (stattools.net, n.d.). The intraclass correlation coefficient allows for the examination of agreement of ratings on the same population, or the “homogeneity of…elements within clusters” (Kish 1965, p. 170), those elements being ratings of student artifacts.
Methods
When employing intraclass correlation coefficients to examine inter-rater reliability, there are a variety of models one can use, and the choice among them is determined by how the ratings were given and whether the raters and those being rated are drawn from a random sample. Since all raters rated every student, and the three raters were not drawn from a random sample of possible raters, the raters are a fixed effect, “while the target ratings are a random effect” (Yaffee 1998). Thus the most appropriate model for estimating inter-rater reliability is a two way mixed model.[8] When running a two way mixed model to obtain an intraclass correlation coefficient, one can define agreement in terms of “consistency” or “absolute agreement.” If variability amongst raters is considered irrelevant to the analysis, the “consistency” measure is selected. If variability or systematic differences between raters are considered relevant, “absolute agreement” is selected. With “absolute agreement,” all differences in measurement are “considered, no matter what the reason for the difference…consistency rating excludes situations such as one rater being consistently higher than the others” (Spirtos et al. 2011). Thus, for the current analysis, “absolute agreement” was selected, which essentially tests whether each rater gave each student exactly the same score.
The intraclass correlation coefficient, as a measure of inter-rater reliability, ranges from -1, indicating extreme heterogeneity, to 1, indicating complete homogeneity (Kish 1965, p. 171). Coefficients closer to 1.0 indicate less variation amongst the three FASs in evaluating student artifacts. With the two way mixed model there are two forms of reliability, “single measures” and “average measures.” Because this study has more than one rater, the average of the raters is reported and used to examine inter-rater reliability.
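The distinction between “consistency” and “absolute agreement,” and between “single measures” and “average measures,” can be illustrated with a short sketch. It assumes the standard two-way ANOVA mean-square formulas for intraclass correlations and is not the SPSS procedure used in this study; the function name, the dictionary keys, and the ratings are hypothetical. In the made-up data, the second rater scores every artifact exactly one point higher than the other two.

```python
def icc_two_way(ratings):
    """Two-way ICCs from a table with one row per artifact, one column per rater."""
    n = len(ratings)        # number of artifacts
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for artifacts (rows), raters (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return {
        # Ignores systematic differences between raters.
        "consistency_average": (msr - mse) / msr,
        # Penalizes systematic differences between raters.
        "absolute_average": (msr - mse) / (msr + (msc - mse) / n),
        "absolute_single": (msr - mse)
        / (msr + (k - 1) * mse + k * (msc - mse) / n),
    }


# Hypothetical ratings on the 1-to-4 scale: four artifacts, three raters.
# The second rater is always one point higher than the other two.
ratings = [
    [1, 2, 1],
    [2, 3, 2],
    [3, 4, 3],
    [2, 3, 2],
]
result = icc_two_way(ratings)
print(result)
```

Because the raters rank the artifacts identically, the consistency coefficient is essentially 1, while both absolute-agreement coefficients are lower, since one rater is systematically higher. The average-measures value is also higher than the single-measures value, which is the usual pattern when several raters are averaged.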
Each criterion on the three separate rubric applications (freshman critical thinking, senior critical thinking, and senior quantitative reasoning) assessed by the FASs was tested for inter-rater reliability. For each criterion, the ratings by all three FASs for all students were examined in a two way mixed model to obtain intraclass correlation coefficients. Since there were six criteria and three rubric applications, 18 two way models were produced. Of the 18 models, six were statistically significant at the .01 level and one was significant at the .05 level (see Table 8). Looking at the “average measures” for the seven criteria found to be significant, four had a moderate correlation (.40 - .69) and three had a strong correlation (.70 - .89).
Discussion
Implementing the abbreviated version of the EPP was time consuming and expensive, and it yielded results in which the UD Faculty Assessment Scholars and the OEA have little confidence. Because of the bias introduced by students self-selecting to take the standardized test, it is not recommended that the results be generalized to the student population at UD. Additional problems arise because the EPP is a low-stakes test and the abbreviated version provides a less intensive examination of skill levels than the more expensive and much longer standard version (40 minutes for the abbreviated version versus 120 minutes for the standard EPP). For example, only 9 questions on the abbreviated version test competency in mathematics. To obtain volunteers to take the test, the OEA recruited First Year Experience faculty from across campus to allow OEA to offer students the opportunity to volunteer during class time. These faculty members were asked to participate because, together, they provided a more thorough representation of the disciplines across campus. This was not a randomized or even a truly purposeful sample of freshmen.
With the senior sample, it was even more challenging to obtain participants. Signs were put up around campus asking for senior volunteers to take the EPP at open sessions (not during any of their classes), faculty teaching Capstone courses were asked for access to their students for an in-class solicitation, and the chance to win an iPad 2 worth $499.00 was offered as an incentive for volunteering, yet these efforts were still not enough to obtain the recommended 200 senior participants. This reliance on senior volunteers and the lack of purposeful sampling introduced bias into the senior responses. In addition, the senior and freshman samples were obtained by different methods.
Consequently, the OEA cannot infer what levels of General Education competencies are possessed by the freshmen and senior UD undergraduate population with the results of the EPP. The methodological issues in administering the EPP to UD undergraduates impact the validity of its results.
An assessment of students’ artifacts can be implemented with minimal interruption to class time, and a more thorough representation of student work could be obtained through a different strategy, such as ePortfolios in the First Year Experience or assessment requirements implemented by the Provost’s office. For example, because the First Year Experience course is required for all UD students, the faculty in charge of these courses could require all freshmen to upload artifacts of their work so that a baseline measure could be obtained. If the faculty assigned this task to freshmen, artifact collection could occur with minimal interruption to in-class learning. The same process could be implemented in senior capstone experience courses. Almost all UD majors currently require a capstone experience; therefore, artifacts would be collected from the widest breadth of majors and would allow the OEA to compare the GenEd competency levels of freshmen and seniors.
OEA found that hiring FASs to evaluate student work, using a faculty-endorsed rubric matched to the operationalized definition of the UD GenEd goal, was a more efficient and effective way to evaluate student work, and the results can inform decision making and curricular reform. FASs had excellent inter-rater reliability with minimal training and could complete their evaluations at a location convenient for them. Because the rubric isolates the criteria within each goal, it provided more useful and actionable information than simple levels of proficiency on large, nebulous skills such as critical thinking. When the UD FASs evaluated critical thinking, they observed that students’ ability to make assumptions and draw conclusions was their weakest skill. With a result that specific, the OEA can advise programs to give students further opportunities to practice this skill and so improve their performance.
References
Beaupre', D., Nathan, S. B., & Kaplan, A. (2002). Testing our schools: A guide for parents. http://www.pbs.org/wgbh/pages/frontline/shows/schools/etc/guide.html (accessed September 15, 2011)
Kish, Leslie. 1965. Survey Sampling. New York: John Wiley & Sons, Inc.
Spirtos, Michelle, Paul O’Mahony and Jeni Malone. 2011. “Interrater reliability of the Melbourne Assessment of Unilateral Upper Limb Function for Children with Hemiplegic Cerebral Palsy” American Journal of Occupational Therapy, Vol. 65, No. 4, 378-383 http://ajot.aotapress.net/content/65/4/378.full (accessed August 15, 2011)
Stattools.net n.d. “Intraclass correlation for parametric data: Introduction and Explanation” http://www.stattools.net/ICC_Exp.php (accessed August 15, 2011)
Yaffee, Robert A. 1998. “Enhancement of Reliability Analysis: Application of Intraclass Correlations with SPSS/Windows v. 8” http://www.nyu.edu/its/statistics/Docs/intracls.html (accessed August 15, 2011)
[4] Association of American Colleges and Universities Valid Assessment of Learning in Undergraduate Education (VALUE): VALUE meta-rubrics reflect faculty expectations for essential learning across the nation regardless of type of institution. UD participated in the development and review of the fifteen rubrics. http://www.aacu.org/value/rubrics/index_p.cfm?CFID=32492691&CFTOKEN=71150862
Table 3: T-tests for Equality of Means

Highlighted area indicates which significance test was utilized for each score.
Table 4: Levene’s Test for equal variance

Table 5: Group Statistics
Table 6: Quantitative Reasoning assessment of Seniors

Table 7: Critical Thinking Analysis assessment of Freshmen and Seniors

Table 8: Intraclass Correlation Coefficient Reliability Test


A Comparison of Assessing General Education Goals with Freshmen and Senior Students by implementing the Abbreviated Educational Proficiency Profile compared to the Adapted American Association of Colleges and Universities Valid Assessment of Learning in Undergraduate Education (VALUE) Rubrics, by Kathleen Langan Pusecker MS.; Manuel Roberto Torres PhD.; Iain Crawford PhD.; Delphis Levia PhD.; Donald Lehman EdD. and Gordana Copic MS., is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
