Ryan Shaw
Applied Natural Language Processing
Assignment #2
September 29, 2004

I wanted to be able to correctly chunk noun phrases such as "rising labor costs" in sentences like "Rising labor costs are forcing manufacturers to make some difficult decisions." I wrote a parser (based on the one you gave us) that would make a first pass at identifying noun phrases, and then wrote a second parser that, among other things, would look for present participles ("rising") preceding noun phrases ("labor costs"). The problem with this was that my rule also chunked present participles followed by their direct objects, like "forcing manufacturers." I thought I could avoid this by looking for a preceding VBP (like "are") and not chunking in that case. Looking through the Python regex documentation, I saw that I could do this with a "negative lookbehind assertion," which would match a VBG followed by a NP only if it were *not* preceded by a match for VBP. Unfortunately, the syntax for negative lookbehind assertions (?<!...) [...]

[...] tag: this was just a lame ad hoc attempt to get better results. I noticed that some of my VP chunking rules were breaking because of <-NONE-> tags seemingly randomly sprinkled throughout the treebank corpus. I couldn't find any info about these tags, what they meant or why they were in there. So I decided to allow optional <-NONE-> tags in verb phrases. I also noticed that the construction "$X a share" (for stock prices) always had an extra <-NONE-> tag between the dollar value and the determiner "a," so I took advantage of this fact to write a rule for correctly chunking those constructions as NPs, since none of my other rules handled those correctly.

These are my chunking parsers. They are meant to be run in the order presented.

-- Parser 1 -------------------------------------------------------

This parser operates on unchunked word tokens and produces NP chunks.

rules = [
    ChunkRule(r'<\$><-NONE->', 'Share prices (ex: $18 a share)'),
    ChunkRule(r'?(|||)*+', 'Chunk nouns and their modifiers'),
    ChunkRule(r'??(|)++', 'Chunk gerunds and their modifiers'),
    ChunkRule(r'<\$>+', 'Chunk currency values'),
    ChunkRule(r'+', 'Chunk numbers (like years)'),
    ChunkRule(r'', 'Chunk personal pronouns')]

-- Parser 2 -------------------------------------------------------

This parser operates on NP chunks and produces higher-level NP chunks. May be run more than once.

rules = [
    HackedChunkRule(r'(?)', 'Chunk present participles modifying NPs'),
    ChunkRule(r'((<,>||<,>))*', 'Chunk NPs joined by conjunctions'),
    ChunkRule(r'', 'Chunk possessives'),
    ChunkRule(r'', 'Chunk possessive pronouns')]

-- Parser 3 -------------------------------------------------------

This parser operates on NP chunks and produces PP chunks.

rules = [
    ChunkRule(r'(|)+', 'Chunk prepositions followed by NPs')]

-- Parser 4 -------------------------------------------------------

This parser operates on NP and PP chunks and produces higher-level NP chunks like "the first man on the moon."

rules = [
    ChunkRule(r'', 'Chunk NP-PP combos into NPs')]

-- Parser 5 -------------------------------------------------------

This parser operates on NP and PP chunks and produces VP chunks.

rules = [
    ChunkRule(r'', 'Chunk verb-adjective combos (ex: rising higher)'),
    ChunkRule(r'(|)??()+<-NONE->???(|)*?', 'Chunk VPs with all their optional attachments'),
    ChunkRule(r'', 'Chunk lone verbs')]

-- Parser 6 -------------------------------------------------------

This parser operates on VP chunks and produces higher-level VP chunks like "rose quickly and levelled off."

rules = [
    ChunkRule(r'', 'Chunk VPs joined by conjunctions')]

Highlights of chunking performance:

-- Sentence 1 -----------------------------------------------------

Shorter maturities are considered a sign of rising rates because portfolio managers can capture higher rates sooner.
-- Original chunking rules ----------------------------------------

(S: (NP: ) (VP: (NP: )) (VP: (NP: ) (PP: (NP: ))) (NP: ) <./.>)

-- Improved chunking rules ----------------------------------------

(S: (NP: ) (VP: (NP: (NP: ) (PP: (NP: (NP: )))) (PP: (NP: ))) (VP: (NP: ) ) <./.>)

-- Comments -------------------------------------------------------

The original chunker incorrectly tagged "rising rates because portfolio managers" as a VP. My chunker correctly identified "rising rates" as a NP.

-- Sentence 2 -----------------------------------------------------

Japan's domestic sales of cars, trucks and buses in October rose 18% from a year earlier to 500,004 units, a record for the month, the Japan Automobile Dealers' Association said.

-- Original chunking rules ----------------------------------------

(S: (NP: ) <'s/POS> (NP: ) (PP: (NP: )) <,/,> (NP: ) (NP: ) (PP: (NP: )) (VP: (NP: <18/CD> <%/NN>) (PP: (NP: ))) (NP: <500,004/CD> ) <,/,> (NP: ) (PP: (NP: )) <,/,> (NP: ) <'/POS> (NP: ) <./.>)

-- Improved chunking rules ----------------------------------------

(S: (NP: (NP: (NP: ) <'s/POS> (NP: )) (PP: (NP: (NP: ) <,/,> (NP: ) (NP: )))) (PP: (NP: )) (VP: (NP: (NP: <18/CD> <%/NN>) (PP: (NP: )))) (PP: (NP: <500,004/CD> )) <,/,> (NP: (NP: ) (PP: (NP: ))) <,/,> (NP: (NP: ) <'/POS> (NP: )) (VP: ) <./.>)

-- Comments -------------------------------------------------------

My chunker correctly identified "Japan's domestic sales of cars, trucks and buses" as a single NP, whereas the original chunker did not.

-- Sentence 3 -----------------------------------------------------

In the 1990s, spurred by rising labor costs and the strong yen, these companies will increasingly turn themselves into multinationals with plants around the world.
-- Original chunking rules ----------------------------------------

(S: (PP: (NP: <1990s/NNS>)) <,/,> (VP: (NP: )) (NP: ) <,/,> (NP: ) (PP: (NP: )) (PP: (NP: )) (PP: (NP: )) <./.>)

-- Improved chunking rules ----------------------------------------

(S: (PP: (NP: <1990s/NNS>)) <,/,> (VP: (PP: (NP: (NP: (NP: )) (NP: )))) <,/,> (NP: ) (VP: (NP: (NP: ) (PP: (NP: ))) (PP: (NP: )) (PP: (NP: ))) <./.>)

-- Comments -------------------------------------------------------

The original chunker did not correctly identify either of the two VPs in this sentence, and falsely identified a NP as a VP. My chunker handled all three phrases (pretty much) correctly.

-- Sentence 4 -----------------------------------------------------

The index of the 100 largest Nasdaq financial stocks rose modestly as well, gaining 1.28 to 449.04.

-- Original chunking rules ----------------------------------------

(S: (NP: ) <100/CD> (NP: ) (NP: ) <,/,> <1.28/CD> <449.04/CD> <./.>)

-- Improved chunking rules ----------------------------------------

(S: (NP: (NP: ) (PP: (NP: <100/CD> ))) (VP: ) <,/,> (NP: (NP: (NP: <1.28/CD>)) (PP: (NP: <449.04/CD>))) <./.>)

-- Comments -------------------------------------------------------

My chunker correctly tagged "The index of the 100 largest Nasdaq financial stocks" as a single NP, where the original chunker only tagged "The index." The original chunker also missed the VP "rose modestly;" mine did not.

-- Verb context analysis ------------------------------------------

I selected the verb "to rise," various forms of which appeared 30 times in 29 sentences in the subset of the Penn Treebank that I used. Given that most of the articles in this corpus came from the Wall Street Journal, it is perhaps unsurprising that the contexts in which it appeared were fairly similar.

Subjects of "rise" were always some quantitative value like "sales," "prices," "stock," "ratings," etc. In about 2/3 of the sentences this value was a NNS (plural common noun).
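A tally like the NNS proportion above could be gathered with something along the following lines. This is only a rough sketch, not the procedure actually used for this report: it assumes sentences are available as lists of (word, tag) pairs, the list of "rise" forms is a guess, the sample sentences are invented for illustration, and "nearest preceding noun" is a crude stand-in for the subject.

```python
from collections import Counter

# Assumed inflected forms of "rise"; the report does not list the forms it matched.
RISE_FORMS = {"rise", "rises", "rose", "risen", "rising"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def subject_tag_counts(tagged_sentences):
    """Tally the tag of the nearest noun preceding each form of "rise"."""
    counts = Counter()
    for sentence in tagged_sentences:
        for i, (word, _tag) in enumerate(sentence):
            if word.lower() not in RISE_FORMS:
                continue
            # Walk left from the verb to the closest noun (a crude subject guess).
            for _prev_word, prev_tag in reversed(sentence[:i]):
                if prev_tag in NOUN_TAGS:
                    counts[prev_tag] += 1
                    break
    return counts


# Invented sample data, not taken from the treebank.
sample = [
    [("Sales", "NNS"), ("rose", "VBD"), ("18", "CD"), ("%", "NN"), (".", ".")],
    [("The", "DT"), ("index", "NN"), ("rose", "VBD"), ("modestly", "RB"), (".", ".")],
    [("Prices", "NNS"), ("are", "VBP"), ("rising", "VBG"), ("sharply", "RB"), (".", ".")],
]
print(subject_tag_counts(sample))
```

Note that this would also count adjectival uses such as "rising rates," so the results would still need to be checked by hand.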
As "rise" is an intransitive verb, it did not take any direct objects, but there were some common constructions following it. In the majority of sentences it was followed by a dollar amount or a percentage, a PP beginning with "to" and sometimes a PP beginning with "from," as the following template shows:

rise [AMOUNT] [to AMOUNT] [from AMOUNT]

It was also common to see the VP include temporal specifiers such as "in September" or "last year," and to have adverbs such as "quickly" or "sharply" modifying the verb.

-- Similar verbs --------------------------------------------------

Though I did not have time to do an analysis to find verbs similar to "rise," I am fairly confident that an analysis of synonyms such as "increase" or antonyms such as "fall," using the same corpus, would find patterns similar to the ones described above.
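One way to check that guess would be to turn the template given for "rise" into a regular expression and try it with other verbs. The following is a minimal sketch over raw text rather than tagged tokens. The names AMOUNT, template_for and match_template are made up for illustration, and the definition of an "amount" (a dollar figure, a percentage or a bare number) is an assumption.

```python
import re

# Assumption: an amount is a dollar figure, a percentage, or a bare number.
AMOUNT = r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?"


def template_for(verb_forms):
    """Build a regex for: VERB [AMOUNT] [to AMOUNT] [from AMOUNT]."""
    verbs = "|".join(re.escape(v) for v in verb_forms)
    return re.compile(
        rf"\b(?P<verb>{verbs})\b"
        rf"(?:\s+(?P<amount>{AMOUNT}))?"
        rf"(?:\s+to\s+(?P<to_amount>{AMOUNT}))?"
        rf"(?:\s+from\s+(?P<from_amount>{AMOUNT}))?",
        re.IGNORECASE,
    )


def match_template(pattern, sentence):
    """Return the template slots for the first verb match, or None."""
    m = pattern.search(sentence)
    return m.groupdict() if m else None


rise = template_for(["rise", "rises", "rose", "risen", "rising"])
fall = template_for(["fall", "falls", "fell", "fallen", "falling"])

# Invented example sentences, not taken from the treebank.
print(match_template(rise, "Sales rose 18% to $500 from $424 a year earlier."))
print(match_template(fall, "Shares fell $1.50 to $20."))
```

The sketch follows the slot order of the template, so a sentence that puts the "from" phrase before the "to" phrase would only fill some of the slots. Adverbs and temporal specifiers between the verb and the amount are not handled either.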