Bots searching for linguistic gems on Twitter.
PangramTweets
By Ben Zimmer
The Twitter API, beyond its great utility for corpus linguistics (see “On the front lines of Twitter linguistics,” “The he’s and she’s of Twitter”), has made possible a lot of fun automated text-mining projects. One fertile area is algorithmic found poetry: there have been Twitter bots designed to find accidental haikus, and even more impressively, a bot named @Pentametron that finds rhyming tweets in iambic pentameter and fashions sonnets out of them.
And then there is found wordplay, which is its own kind of found poetry. I’m a big fan of @Anagramatron, which discovers paired tweets that form serendipitous anagrams of each other. (Example: “Last time I do anything” ⇔ “That’s it. I’m dying alone.”) Now, courtesy of Jesse Sheidlower, comes @PangramTweets, in which each tweet contains every letter of the alphabet at least once.
Jesse explains the project on his site:
PangramTweets is a bot (a computer program that runs on its own) that searches Twitter for, and then retweets, pangrams—texts that contain every letter of the alphabet. A famous pangram, sometimes used as a typing test, is “The quick brown fox jumps over the lazy dog.” […]
You may find the results interesting, or dull. I make no judgment on this. The bot is entirely automated; I do not curate the results.
I strip out user names and URLs from the results, but hashtags are included. I also do some very basic filtering to try to ensure that the results are in English, and not in another language or complete gibberish (random letters), though earlier versions of the bot did retweet nonsense or foreign-language pangrams.
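The core of that description (strip user names and URLs, keep hashtags, then look for all 26 letters) can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the bot's actual code: the regular expression and the names `clean_tweet` and `is_pangram` are hypothetical, and the English-language filtering is left out entirely.

```python
import re
import string

# Assumed patterns for @mentions and http(s) links; the real bot may differ.
MENTION_OR_URL = re.compile(r"@\w+|https?://\S+")

ALPHABET = set(string.ascii_lowercase)


def clean_tweet(text):
    """Blank out user names and URLs; hashtags are left in place."""
    return MENTION_OR_URL.sub(" ", text)


def is_pangram(text):
    """True if the cleaned text contains every letter a-z at least once."""
    return ALPHABET <= set(clean_tweet(text).lower())


if __name__ == "__main__":
    print(is_pangram("The quick brown fox jumps over the lazy dog."))  # True
    print(is_pangram("@abcdefghijklmnopqrstuvwxyz hello"))  # False
```

Because user names are removed before the check, a mention that happens to supply a rare letter does not count toward the pangram, while a hashtag does.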
The bot originally did not filter out known pangrams of the “quick brown fox” variety, but by popular demand Jesse put a filter in place for that as well. The results are not as rich as Anagramatron, but that’s to be expected given the constraints: Jesse says he gets “one real pangram in every few million tweets scanned.” Here’s a sampling of what has turned up so far.
I’ve just (with the help of google) realized I wrote about the wrong experiment in my 12 mark psychology question
oops
— s (@bricktop___) May 13, 2014
It’s official: Arthur Sulzberger names Dean Baquet executive editor of The New York Times, replacing Jill Abramson.
— Vindu Goel (@vindugoel) May 14, 2014
Looking for a new job is exhausting. Every one I want requires a bazillion years of experience I don’t have. FML.
— Ryan Stephens (@Integrity1stziB) May 16, 2014
Thanks JMM for boosting my boxing prediction confidence again. The Mayweather card did a number on a lot of boxing fans. #MarquezAlvarado
— E.J.O. (@ElioOrtiz11) May 18, 2014
SHUT THE FUCK UP ABOUT THE “FRIENDZONE”. MAYBE YOU SHOULD JUST VALUE A WOMAN’S FRIENDSHIP AND QUIT EXPECTING THEM TO FUCK YOU. JESUS FUCK.
— ・。。・゜☆゜・。。・ (@chrstnmchd) May 19, 2014
Juan Manuel Marquez boxes Alvarado on weekday to line up fifth fight alongside Pacquiao @SportsMomentz http://t.co/e5CyDwDXFd
— Rinaldo Jonathan (@testeronline12) May 19, 2014
Maybe Joe needs to take some advice from Iceland and arrest the rich people who are stealing from the rest of us tax paying citizens. #qanda
— Toby Owens (@TehMegaWiz) May 19, 2014
It will be interesting to see if the bot turns up a naturally occurring “pangrammatic window” that beats the current record-holder of 42 letters, from Piers Anthony’s Cube Route:
“We are all from Xanth,” Cube said quickly. “Just visiting Phaze…”
Sean Irvine announced the discovery of this pangrammatic window in Word Ways in 2012. It beat out Eric Chaikin’s 47-letter find, which he discovered by Googling for “Joaquin Phoenix”:
“JoBlo’s movie review of The Yards: Mark Wahlberg, Joaquin Phoenix, Charlize Theron…”
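For anyone who wants to hunt for windows themselves: a pangrammatic window is a stretch of text containing all 26 letters, with its length counted in letters only (spaces and punctuation are ignored). A minimal sliding-window sketch in Python follows. The function name `shortest_pangrammatic_window` is my own, and I am not claiming this is how either record was actually found.

```python
import string
from collections import Counter


def shortest_pangrammatic_window(text):
    """Return (length, letters) for the shortest run of letters in `text`
    that contains all of a-z, or None if the text is not a pangram."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter()  # letter counts inside the current window
    best = None
    left = 0
    for right, ch in enumerate(letters):
        counts[ch] += 1
        # Shrink from the left for as long as the window stays complete.
        while len(counts) == 26:
            length = right - left + 1
            if best is None or length < best[0]:
                best = (length, "".join(letters[left:right + 1]))
            first = letters[left]
            counts[first] -= 1
            if counts[first] == 0:
                del counts[first]
            left += 1
    return best


if __name__ == "__main__":
    snippet = ("JoBlo's movie review of The Yards: Mark Wahlberg, "
               "Joaquin Phoenix, Charlize Theron")
    length, window = shortest_pangrammatic_window(snippet)
    print(length)  # 47
    print(window)
```

Run on the JoBlo snippet as quoted here, it reports 47 letters, running from the “v” of “review” to the “z” of “Charlize.”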
Of course, determining if a pangram is “naturally occurring” may be difficult, since it’s always possible to game the system! But with half a billion tweeters tweeting, maybe someday one of them will authentically produce a winner like “Mr. Jock, TV quiz PhD, bags few lynx.”
Update: Jesse is attempting to filter out non-English tweets, but Indonesian tweets keep seeping through. Since I’ve done research on colloquial varieties of Indonesian, I find these tweets fascinating. I was initially surprised that the Indonesian Twittersphere would be generating pangrams, considering that the letters Q, V, X, and Z appear only in loanwords. But Indonesian participants on Twitter are using quite a lot of Anglicisms, along with a plethora of txtspk-style abbreviations of Indonesian words. An example that just popped up:
@PutriAZSYA EXCITED BGT GRGR 1D MW K INDO. LBH EXCITED LG KLO JOIN LITTLEQUIZ @1D_CrazyLovers DAN BCA JG FFNY.PASTI LO MKIN EXCITED.CEK FAV6
— winda (@windaameliasar1) May 20, 2014
The loanwords here are EXCITED, JOIN, and LITTLEQUIZ, and 1D refers to the band One Direction. Here’s a key to the abbreviation-heavy Indonesian items:
BGT = banget ‘very’
GRGR = gara-gara ‘just because’
MW = mau ‘will’
K = ke ‘(come) to’
INDO = Indonesia
LBH = lebih ‘more’
LG = lagi ‘(even) more’
KLO = kalau ‘if’
DAN = dan ‘and’
BCA = baca ‘read’
JG = juga ‘also’
PASTI = pasti ‘definitely’
LO = (e)lo ‘you’
MKIN = makin ‘more and more’
CEK = cek ‘check’
So that would work out to: “@PutriAZSYA Very excited just because One Direction is coming to Indonesia. You’ll be even more excited if you join LittleQuiz @1D_CrazyLovers, and also read FFNY. You’ll definitely get more and more excited. Check Fav6.”
May 19, 2014 at 5:07PM
via Language Log http://ift.tt/1h1wAaW