{"id":1639,"date":"2023-02-20T15:32:39","date_gmt":"2023-02-20T19:32:39","guid":{"rendered":"https:\/\/jeffq.com\/blog\/?p=1639"},"modified":"2023-02-20T15:32:40","modified_gmt":"2023-02-20T19:32:40","slug":"language-models-vs-the-sat-reading-test","status":"publish","type":"post","link":"https:\/\/jeffq.com\/blog\/language-models-vs-the-sat-reading-test\/","title":{"rendered":"Language Models vs. The SAT Reading Test"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote\">\n<p>tl;dr FLAN-T5 (11B) scored identically to GPT-3.5 (text-davinci-003) across the ten publicly available SAT Reading Tests. A finetuned 3B model scored within 7 percentage points of GPT-3.5 on held-out tests with 98% fewer parameters while maintaining generalization.<\/p>\n\n\n\n<p>Models: <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-base-sat-reading-comprehension\" target=\"_blank\">base<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-large-sat-reading-comprehension\" target=\"_blank\">large<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-xl-sat-reading-comprehension\" target=\"_blank\">xl<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-xxl-sat-reading-comprehension\" target=\"_blank\">xxl<\/a> Dataset: <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/datasets\/emozilla\/sat-reading\" target=\"_blank\">HuggingFace<\/a> Code: <a href=\"https:\/\/github.com\/jquesnelle\/sat-reading\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a><\/p>\n<\/blockquote>\n\n\n\n<p>After working on <a href=\"https:\/\/jeffq.com\/blog\/literai-ai-generated-open-source-visual-podcasts\/\">literAI<\/a> I&#8217;ve been interested in further exploring language models from a narrative\/literary perspective. 
One question I had was &#8220;how well do these models actually &#8216;understand&#8217; longer prose?&#8221;<\/p>\n\n\n\n<p>Now, it just so happens that there&#8217;s a test we make teenagers take every year to determine this very fact! That is, the <a href=\"https:\/\/satsuite.collegeboard.org\/sat\">SAT<\/a> (specifically, the Reading part). <\/p>\n\n\n\n<p>The SAT Reading Test, despite its name, is multimodal. There is always one section that includes a combination of charts, tables, and graphs. However, the questions are clearly delineated &#8212; typically only three questions on the test reference the data. For the purposes of evaluation I excluded these questions. First, the results.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/SAT-Reading-Scores-Pre-trained-models-1.png\"><img decoding=\"async\" loading=\"lazy\" width=\"769\" height=\"430\" src=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/SAT-Reading-Scores-Pre-trained-models-1.png\" alt=\"\" class=\"wp-image-1642\"\/><\/a><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/github.com\/jquesnelle\/reading-comprehension\/blob\/master\/results\/test-results.json\" target=\"_blank\" rel=\"noreferrer noopener\">Data<\/a><\/figcaption><\/figure>\n\n\n\n<p>FLAN-T5 11B scored identically to GPT-3.5, despite being less than 1\/10th the size! It can also be run on a consumer GPU (&lt;= 24 GB) when loaded in 8-bit inference mode! This offers further evidence that Google did the open-source, local-compute LM community a great service when it released FLAN-T5.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>One interesting aspect of the SAT Reading Test is that 30% of the questions reference specific lines within the passage under consideration.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p>Which choice best supports the conclusion that<br>Mr. 
Peters wants to attract attention?<\/p>\n\n\n\n<p>A) Lines 80-81 (\u201cApparently\u2026 change\u201d)<br>B) Lines 81-85 (\u201cHe straightened\u2026 hand\u201d)<br>C) Lines 90-91 (\u201cThe young . . . Mr. Peters\u201d)<br>D) Lines 91-93 (\u201cHe was\u2026 forty-five\u201d)<\/p>\n<cite>SAT Practice Test #5 Question #9<\/cite><\/blockquote>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p>As used in line 93, \u201cbecoming\u201d most nearly means<\/p>\n\n\n\n<p>A) emerging.<br>B) fitting.<br>C) developing.<br>D) happening.<\/p>\n<cite>SAT Practice Test #5 Question #10<\/cite><\/blockquote>\n\n\n\n<p>This means that to properly answer the question the LM needs to be able to count lines in the presented passage and reason about them explicitly in the context of the passage itself. The <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/datasets\/emozilla\/sat-reading\" target=\"_blank\">dataset<\/a> I created faithfully represents the line breaks as they appear on the test. What it doesn&#8217;t contain is the extra line-count helper column that appears next to the passage. For example, here is a snippet of what a passage on the actual test looks like:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/sat-snippet.png\"><img decoding=\"async\" loading=\"lazy\" width=\"519\" height=\"463\" src=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/sat-snippet.png\" alt=\"\" class=\"wp-image-1644\"\/><\/a><figcaption class=\"wp-element-caption\">SAT Practice Test #5 Passage #1<\/figcaption><\/figure>\n\n\n\n<p>Note the italicized <em>Line<\/em> and the counter, which appears every five lines. Even the regular passages are multimodal! While it&#8217;s certainly just text, communicating it requires more than presenting it merely as a sequence of characters. 
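To make that concrete, here is a minimal sketch (a hypothetical helper, not part of the released dataset or repo) of what reproducing the printed test's margin counter would look like: it labels the first line and then every fifth line, the way the snippet above does.

```python
def number_passage(passage: str) -> str:
    """Render a passage with an SAT-style margin: the word 'Line' beside
    the first line, then a numeral beside every fifth line.
    (Hypothetical illustration -- the dataset itself omits this column.)"""
    numbered = []
    for i, line in enumerate(passage.splitlines(), start=1):
        if i == 1:
            prefix = "Line "          # the italicized 'Line' label
        elif i % 5 == 0:
            prefix = f"{i:>4} "       # counter every five lines
        else:
            prefix = "     "          # blank margin otherwise
        numbered.append(prefix + line)
    return "\n".join(numbered)
```

A model given only the raw character sequence has to recover this counting implicitly, which is exactly what the "line number" questions probe.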
To see how the models performed on these types of questions I took a look at how the best open-source model (FLAN-T5) scored on the two question classes.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/FLAN-T5-Difference-in-Scores-Between-Question-Types.png\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/FLAN-T5-Difference-in-Scores-Between-Question-Types.png\" alt=\"\" class=\"wp-image-1646\" width=\"600\" height=\"371\"\/><\/a><\/figure>\n\n\n\n<p>FLAN-T5 scored between 5% and 13% worse on the &#8220;line number&#8221; questions than it did on the other questions on the test. Could the model just need a little help counting?<\/p>\n\n\n\n<p>To test this theory I finetuned each of the FLAN-T5 models on eight of the ten practice tests, leaving the remaining two tests for validation. An especially huge thanks is due to <a rel=\"noreferrer noopener\" href=\"https:\/\/twitter.com\/_philschmid\" target=\"_blank\">Philipp Schmid<\/a> for his excellent <a rel=\"noreferrer noopener\" href=\"https:\/\/www.philschmid.de\/fine-tune-flan-t5-deepspeed\" target=\"_blank\">blog<\/a> <a rel=\"noreferrer noopener\" href=\"https:\/\/www.philschmid.de\/fine-tune-flan-t5\" target=\"_blank\">posts<\/a> on finetuning FLAN-T5. 
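The shape of a finetuning example can be sketched as follows. This is a hypothetical template for turning one SAT Reading question into a seq2seq (input, target) pair; the exact prompt format used is in the linked GitHub repo.

```python
def build_training_pair(passage: str, question: str,
                        choices: list[str], answer: str) -> dict:
    """Build one seq2seq training pair from an SAT Reading question.
    (Hypothetical template -- see the linked repo for the real format.)"""
    # Letter the four answer choices A) through D)
    lettered = "\n".join(f"{letter}) {choice}"
                         for letter, choice in zip("ABCD", choices))
    prompt = (
        "Read the passage, then answer the question.\n\n"
        f"{passage}\n\n"
        f"Question: {question}\n{lettered}\n\nAnswer:"
    )
    # The target is just the answer letter, which suits T5-style training
    return {"input": prompt, "target": answer}
```

Keeping the target to a single letter makes scoring trivial: the generated text either matches the key or it doesn't.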
<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/FLAN-T5-Finetuned-vs-Pre-trained-on-Held-Out-Tests.png\"><img decoding=\"async\" loading=\"lazy\" width=\"792\" height=\"490\" src=\"https:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/FLAN-T5-Finetuned-vs-Pre-trained-on-Held-Out-Tests.png\" alt=\"\" class=\"wp-image-1649\"\/><\/a><\/figure>\n\n\n\n<p>The models themselves are available here: <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-base-sat-reading-comprehension\" target=\"_blank\">base<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-large-sat-reading-comprehension\" target=\"_blank\">large<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-xl-sat-reading-comprehension\" target=\"_blank\">xl<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/emozilla\/flan-t5-xxl-sat-reading-comprehension\" target=\"_blank\">xxl<\/a>. Three of the four finetuned models outscored the original models, with the XL model showing the largest gain: it scores within seven percentage points of GPT-3.5 while having 98% (!!!) fewer parameters (3B vs. 175B). <\/p>\n\n\n\n<p>One problem with aggressive finetuning on small datasets is overfitting or loss of generalization. Do the finetuned models still perform as well as the original models on unseen tasks? 
To test this I ran the finetuned models on a subset of the <a rel=\"noreferrer noopener\" href=\"https:\/\/super.gluebenchmark.com\/\" target=\"_blank\">SuperGLUE<\/a> metrics.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes has-small-font-size\"><table><thead><tr><th> <\/th><th>XXL PT<\/th><th>XXL FT<\/th><th>XL PT<\/th><th>XL FT<\/th><th>Large PT<\/th><th>Large FT<\/th><th>Base PT<\/th><th>Base FT<\/th><\/tr><\/thead><tbody><tr><td>cb gpt<\/td><td><strong>0.87<\/strong><\/td><td>0.83<\/td><td>0.83<\/td><td>0.83<\/td><td><strong>0.76<\/strong><\/td><td>0.71<\/td><td>0.82<\/td><td><strong>0.82<\/strong><\/td><\/tr><tr><td>copa c1\/c2<\/td><td><strong>0.95<\/strong><\/td><td>0.91<\/td><td><strong>0.95<\/strong><\/td><td>0.90<\/td><td><strong>0.83<\/strong><\/td><td>0.82<\/td><td><strong>0.57<\/strong><\/td><td>0.55<\/td><\/tr><tr><td>rte gpt<\/td><td>0.89<\/td><td><strong>0.90<\/strong><\/td><td>0.85<\/td><td><strong>0.87<\/strong><\/td><td><strong>0.87<\/strong><\/td><td>0.84<\/td><td>0.79<\/td><td><strong>0.80<\/strong><\/td><\/tr><tr><td>wic gpt<\/td><td>0.68<\/td><td>0.68<\/td><td>0.71<\/td><td><strong>0.72<\/strong><\/td><td><strong>0.62<\/strong><\/td><td>0.61<\/td><td>0.48<\/td><td>0.48<\/td><\/tr><tr><td>wsc gpt<\/td><td>0.76<\/td><td><strong>0.77<\/strong><\/td><td>0.73<\/td><td><strong>0.75<\/strong><\/td><td><strong>0.66<\/strong><\/td><td>0.61<\/td><td>0.45<\/td><td><strong>0.46<\/strong><\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/github.com\/jquesnelle\/reading-comprehension\/tree\/master\/results\/finetuned-evaluations\" target=\"_blank\" rel=\"noreferrer noopener\">Data<\/a><\/figcaption><\/figure>\n\n\n\n<p>The above table represents only a few of the hundreds of metrics run &#8212; see the <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/jquesnelle\/reading-comprehension\/tree\/master\/results\/finetuned-evaluations\" target=\"_blank\">data<\/a> for full results. 
They are, however, representative; the finetuned (FT) models maintain the same generalization capabilities as the pre-trained (PT) versions! It may be that the finetuned models are (by this limited measure) &#8220;better&#8221; than the originals, since they score higher on the SAT Reading Test while maintaining zero-shot unseen-task performance.<\/p>\n\n\n\n<p>In conclusion, FLAN-T5 continues to show itself as a powerful model, both in its raw reasoning capabilities relative to closed-source models and in its ability to quickly learn new skills through finetuning &#8212; not to mention its accessibility on consumer-grade hardware. ty google<\/p>\n","protected":false},"excerpt":{"rendered":"<p>FLAN-T5 has parity with GPT-3.5 (text-davinci-003) on the SAT Reading Test, and finetuning leads to even better scores. The 3B (XL) finetuned model scores within 7 percentage points of GPT-3.5 on held-out tests with 98% fewer parameters while maintaining generalization.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[44],"_links":{"self":[{"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts\/1639"}],"collection":[{"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/comments?post=1639"}],"version-history":[{"count":5,"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts\/1639\/revisions"}],"predecessor-version":[{"id":1651,"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts\/1639\/revisions\/1651"}],"wp:attachment":[{"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/media?parent=1639"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/je
ffq.com\/blog\/wp-json\/wp\/v2\/categories?post=1639"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/tags?post=1639"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}