It's like the South Park episode where the robot was really Cartman in a cardboard box.
There was something on Bluesky or maybe even Mastodon where someone said, paraphrasing, that this was a deadly secret but now that it's been over ten years they can share it: QuickBooks had a magic-seeming feature where you put in a receipt and it categorized the expense automatically, but sometimes it took longer & you had to check back later. The longer waits were when the staff in the Philippines actually doing the work were off.
Pretty sure you're mistaken. That's definitely the plotline of the Donny and Marie Osmond Little Golden Book twenty years earlier.
I wonder if the guys in India had to wait for someone to fall asleep before they could use the bathroom.
SpinVox!
https://x.com/yorksranter/status/1547221184609239045
https://www.theregister.com/2009/07/29/spinvox_mechanical_turk/
The Mechanical Turk!
https://en.wikipedia.org/wiki/Mechanical_Turk
Not artificial, but otherwise unconnected (in fact opposed to the avalanche of deception and error we live in): the first sketches of the Little Prince:
https://www.sothebys.com/en/buy/auction/2025/livres-et-manuscrits-pf2503/le-petit-prince-sur-sa-planete-lecharpe-au-vent
I mean, this is also the big reveal in Wizard of Oz. What a tired trope!
This was also famously what was happening with Amazon's "Just Walk Out" "technology."
The joke is AI stands for "actually Indians."
I do vaguely wonder how much of this is pure knowing grift and cash grab, and how much is covering for "well, it doesn't work *yet*". Is there a true believer in there somewhere?
What's interesting about Alex's post is that it's from sixteen years ago. Now, real machine transcription is incredibly good, and either cheap or free, but no one thinks about how you can now get near-perfect subtitles for free on any video you want (for example) and says "aha! A triumph for AI!" To 2, I bet that an LLM could categorise receipts pretty accurately as well now.
I don't know enough about LLMs' coding ability, but would anyone be prepared to bet that Builder.ai won't be real by 2030?
but no one thinks about how you can now get near-perfect subtitles for free on any video you want (for example) and says "aha! A triumph for AI!"
Paging E. Messily.
Step 1: Cheat to make AI look good and drive out human competition
Step 2: Stop cheating and don't cut prices because competition is gone
Step 3: Profit and/or a boot stepping on humanity's face
Meanwhile I saw a TV show last night with closed captions of incredibly poor quality - "lower" when it was "lore", many thought-free errors like that. I also see such errors all the time in the short video world where captions are often added by the creator.
It may not quite be Mechanical Turks *all* the way down, but it sure seems pretty close.
real machine transcription is incredibly good, and either cheap or free
It's cheap or free when you don't have a lot of volume to transcribe and aren't too worried about evaluating and correcting the results. It's not cheap or free when you have multiple years' worth of audio and the quality matters at a higher level of precision. Relative to having people transcribe from scratch, it's definitely cheaper. It's likely that the vast majority of transcription services that are not 100% machine transcription now run machine transcription before people fix up the transcripts. It would be foolish not to.
I've been using an open source tool and a GPU on my own computer to get good enough captions for a class I'm taking. The class videos are captioned but I couldn't find a way to download the captions from the class website. I can download the videos and prefer to watch them outside the class website, but I want captions. The captions I'm getting are basically fine and the cost of the GPU was about $350.* But if I was a university captioning every video for every class, and I needed to comply with accessibility requirements, I would be looking at a high enough cost that no one would call it cheap.
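For anyone curious, the workflow is roughly this: a minimal sketch, assuming the open source tool is openai-whisper running on a CUDA GPU, with the file name and model size as placeholders rather than what I actually used.

```python
# Minimal local-captioning sketch: transcribe a downloaded class video with
# openai-whisper (pip install openai-whisper; needs ffmpeg) and write an .srt
# file that any video player can load alongside the video.
import whisper

def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("medium", device="cuda")  # model size is a quality/VRAM tradeoff
result = model.transcribe("class_video.mp4")         # hypothetical file name

with open("class_video.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(f"{seg['text'].strip()}\n\n")
```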
no one thinks about how you can now get near-perfect subtitles for free on any video you want (for example) and says "aha! A triumph for AI!"
I've run into many people who think this. I do find it annoying when people talk about the advances of "AI" and they're promoting chat-based generative AI, but then all the "success" examples they have are things like handwriting or speech transcription.
*And I was going to buy it anyway, not just for transcription.
I've been testing machine generated transcripts for a dataset project I am working on. There are already existing human transcriptions, and I am generating machine transcriptions (and also adding machine generated structured annotations). The whole lot will go through manual review.
Comparing the actual human transcripts with the AI ones (nothing fancy, just a large Whisper model), I'd say that it's about the same in terms of accuracy. Personal names and other proper nouns are the biggest source of errors in both. The AI model has better general knowledge of some common place and brand names than whoever did the human transcripts. Generally everything else is pretty impressively close to 100% accurate. At some point, I'll have actual metrics, but these are currently anecdotal.
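For the eventual metrics, the standard computation is word error rate; a minimal sketch, assuming the jiwer package, with the normalization pipeline and example strings as illustrations rather than project specifics:

```python
# Word error rate (WER) of a machine transcript against a human reference,
# using the jiwer package (pip install jiwer). The normalization choices are
# assumptions; they materially affect the number you report.
import jiwer

normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

human = "Their names are over there, by the door."    # reference (hypothetical)
machine = "They're in names over there by the door."  # machine output (hypothetical)

# WER = (substitutions + insertions + deletions) / words in the reference
print(f"WER: {jiwer.wer(normalize(human), normalize(machine)):.1%}")
```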
In reality I do think auto captions (*of English*) are a triumph for AI.
There's a never ending cycle of the same technology helping and then hurting accessibility. Currently we're in a moment where hearing people think the technology is so good that they shouldn't have to pay for actual captions or interpreters, which is driving down the net benefits. But it is very useful to have the auto option, in many situations, and it's significantly better now in SO many ways than even 5 years ago. (I still wouldn't call it "near perfect" though)
20- I'm very interested in what kind of content is being transcribed and who the humans were that generated the human-generated ones! From my perspective, it is still very obvious very quickly when anything uses uncorrected auto captions, both from the number of errors and the specific type of errors. The auto version is awesome as a starting point and speeds up the process a lot, but I'd never publish anything without having had a human edit it and auto captions are definitely not high quality enough to be usable for work, meetings, dr appointments, etc., even if you can assume a quiet background and a standard accent.
23: that's my feeling as well. AI can beat cheap humans, but auto generated captions still aren't as good as I'd expect from high production value content where people pay for good captioning.
Saying these are near perfect is like saying Tesla FSD is near perfect.
Out of curiosity I just turned on the Zoom captioning on a big weekly work webinar, and it was better than I'm used to ML being. I didn't catch any errors, but I also didn't watch it that long.
I think Zoom has improved but may still be failing to meet guidelines around line length and timing that make a big difference in how readable the captions end up being. It really stands out when you have 2-3 lines at once that cover practically the full width of the screen.
I spent a lot of time looking at auto-generated transcripts last fall. The best I've seen out of Whisper has been formal lectures (1 speaker, no Q and A) and oral histories (2 speakers, no overlapping speech). I listened to a few lectures while reading the transcripts where I would have made little or no correction - except maybe to the timing of what appears on screen.
Beyond those two categories, I've seen many problems that you wouldn't want to leave uncorrected, especially text inserted over periods of silence or instrumental music, or problems with non-English languages and especially recordings where more than one language is spoken. The training data presumably included formal transcripts that carry an attribution ("transcription provided by [Person/Organization]"), and it's not uncommon to see a Whisper transcription throw an attribution line into the start or end of a transcript when obviously none exists in the spoken words. (See this discussion, for example.)
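A minimal sketch of one common mitigation, assuming openai-whisper: the per-segment confidence fields are part of Whisper's standard output, but the thresholds below are guesses to tune against your own material, not canonical values.

```python
# Drop Whisper segments the model itself flags as likely non-speech, which
# catches many of the hallucinations over silence and instrumental music.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("oral_history.wav")  # hypothetical file name

kept = [
    seg for seg in result["segments"]
    if seg["no_speech_prob"] < 0.6   # model's own "this is silence/music" score
    and seg["avg_logprob"] > -1.0    # very low confidence often marks hallucinated text
]
print("\n".join(seg["text"].strip() for seg in kept))
```

The idea is that text hallucinated over silence tends to come with a high no_speech_prob or a very low avg_logprob, so filtering on those removes much of it without touching ordinary speech.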
"OMG THEY'RE NOT NEAR PERFECT"
Sorry, yes, they are. I produce multiple videos a week, they all have to have subtitles, I use YouTube auto captioning to produce them in draft, and the result is near perfect. Maybe you just use crap software. I would say YouTube needs about one word in two hundred correcting, plus punctuation. That is near perfect.
Are you all getting really really cross because you thought "near perfect" meant the same thing as "perfect"? Is that what's happening here?
Sorry, no, they aren't. I use auto captioning as a viewer every time any content or meeting does not have either live captioning or interpretation and the quality is nowhere near good enough to be understandable without concentrated effort in the vast majority of scenarios. Maybe you just use remarkably well produced videos with crystal clear audio and for some reason assume that deaf people don't have a good grasp on which types of captioning software we should be using.
28: yes, ajay, exactly. You caught me!
I also use auto captions as a first step in produced videos and while the quality of captions for an English speaker with a standard accent and no background noise is quite high for some content areas, it's not good enough to go without at least one pass through by a human. And again from the consumer perspective, the vast majority of content (recorded and live) does not have sound quality or articulation/accent such that the results are even close to that standard.
25- Zoom is the best out of the conference options I've used, with enough of a jump on Google that I'll request to switch platforms even if there is an interpreter (I like to cross reference when other people are talking and quality check the interpreter when I'm signing).
Google knows and doesn't know very random things and makes weird guesses and also censors anything it thinks is profanity which is very annoying in informal situations. Also Google is extremely bad at understanding any of my brother's Montana-hick-accented speech, but that's more funny than annoying.
23, 31: yes, exactly. Automated transcription gets almost every word right. It isn't perfect - it still needs a human to look it over. It isn't usable. I didn't say it was usable. But it gets nearly every word right. It's near perfect.
I don't think "near perfect" necessarily means "good enough" or even "usable". Usable, in this case, means perfect.
Are you all getting really really cross
No?
15 was a dick move, by the way. Why do it? Why deliberately invite a pointless argument? It's not like you get money from engagement here.
I don't think "near perfect" necessarily means "good enough" or even "usable"
... well, I think I've identified the source of the disagreement, then.
38. How dare you invite the input of someone with deep experience and expertise, heebie.
You guys aren't paying her for each comment you make?
38: to be fair, I didn't literally backchannel E. and tell her she's just gotta get over here and tear into you. It was more of a Candyman invocation, I guess.
But also, she's a literal expert! like jms says!
But also, I thought you loved irritated bickering!
35: no, I can only do three at a time. One mouth, two hands.
Do I have to explain the rules of Three Noises yet again?
A comment thread is hardly the place to comment on comments.
Why deliberately invite a pointless argument?
E. Messily explicitly said, "In reality I do think auto captions (*of English*) are a triumph for AI." Why not take that as starting from a point of agreement (and then try to add some additional precision to the ways in which it falls short of "perfect") rather than thinking of it as a pointless argument?
43.last, 49.last: I don't recall seeing either irritated bickering or pointless argument in this clip
I'm not sure either is really a possibility.
I attended a high school graduation a couple of weeks ago that involved a large screen and captions. I was with someone who doesn't hear well, and I LIKE captions. The speaker asked the audience to "join me in thanking our troops" and the closed captions read "join me in spanking our troops." I nudged my neighbor, and we both stifled laughter and our inner 12-year-olds. That HAD to be on purpose, right? I assumed the speakers turned in scripts to the venue in advance, but I have no idea.
Obviously, the speaker submitted a speech saying "spanking our troops" but went off script.
I can only do three at a time. One mouth, two hands.
That's not really giving it your all, E.
27: State-of-the-art word error rate is well above 1% even on clean audio like audiobooks. When people claim better results than that, the claims fall apart if you present a clean audio source the model isn't specialized for. For harder-to-understand audio sources, state-of-the-art models can have 50% WER. YouTube frequently has a higher error rate than state-of-the-art models because YouTube needs to keep COGS down. Either your memory is playing tricks on you or you're cherry-picking unrepresentative results.
I pulled up the last video I watched on YouTube and turned captions on. There are 3 errors in the first 50 words not including punctuation, which is missing. One of the errors is tricky and I had to listen to the phrase twice to catch what the person said so let's call that 2 out of 50 to be generous. The person speaks with a neutral accent with clear audio that doesn't have significant background noise but they're not in a treated room and you can hear some echo if you're someone who pays close attention to audio quality. Most people would probably call the audio perfect.
I recently looked at a transcript of an interview a journalist recorded on their phone in someone's living room. WER was about 30% with a state-of-the-art model. Entire sentences were wrong, and even when a sentence was near perfect with only one incorrect word, the meaning was frequently substantially different because of the error.
re: 23
The material is a mixture of two types of things. One: roving interviews gathered at a series of events, e.g. interviews of CEOs attending a business conference.* Two: live broadcasts by journalists on the scene at a specific type of newsworthy event. The journalists are doing a mixture of reporting--probably from scripted text--and interviewing, which is unscripted.
In both cases, the audio quality is decent, and the original transcripts were hand-created (they were created some years ago so no machines were used). The typical errors are the kinds of things you'd expect and are familiar with: homophones, and issues with word boundaries, esp. with proper nouns. The human transcripts tend to be cleaned up more--missing bits of phatic speech, vocal fillers, etc.-- and the machine transcripts are closer to verbatim, at the slight cost of readability.
In the relatively small sample set where I've done a detailed manual review, the AI transcripts tend to make fewer mistakes with proper nouns. On the other hand, the AI trips over near homophones: e.g. "Their X is over there ...", where X is a word beginning with /n/, comes out as "They're in X over there".
I don't have WER numbers for either (although I will after the data is manually reviewed in total), but I'd put the WER at approximately 1% for both the human transcripts and the AI ones. I've checked a couple of approx. 500-word transcripts and in both cases there's a low single-figure number of errors.
* that's not what they actually are, as I have to be vague about who specifically is being interviewed, but the type of material, the environment it was gathered in, etc., is like that.
Important caveat re: 55: I was using a large-ish model (low billions of parameters) and it wasn't real-time transcription. Also, the numbers I am getting are better than the published WERs for that model for English, so presumably the type of speech and recording quality I am working with are sufficient to give better-than-normal performance. In the case of the broadcasts by journalists, these are professional speakers using high quality recording technology.*
I think the WER for Whisper-Large v3 on standard English-language corpora is more like 5-9% depending, so that's basically the same as the anecdotal YouTube numbers in 54 (3 errors in the first 50 words is 6%).
* the material I've personally audited is old -- 1950s or earlier -- but was recorded on the state of the art field recording equipment for the time and then carefully preserved and digitised using high quality workflows.
54: Either your memory is playing tricks on you or you're cherry picking unrepresentative results.
Well, thank you for your concern but I am in fact neither demented nor lying.
I just checked the most recent video I produced. The YouTube autogenerated captions were 1,057 words. There were five errors in the transcription. One was a mistranscription of the word "polos" (plural of "polo", short for "polo shirt") as "pose". One was a misspelling of a personal name that has several accepted spellings. The other three were mistranscriptions of a particular place name - the same one each time, a fairly obscure place that I wouldn't expect to occur very often in training sets. Even if you count each of those as a separate error, that is still well below 1% (5/1,057 ≈ 0.5%). My instinctive guess of "less than 1%" turns out to have been entirely accurate for this case.
Most of my job cleaning this up was removing repeated words - because we, we do repeat things when speaking, but it looks odd on screen - and adding punctuation, which as you note YouTube doesn't even attempt to do.
Is this the right eclectic web magazine for an argument?
58: thanks for the giggles
Hawaii passed her drivers test! Hooray!
Yet another domain where things have become orders of magnitude more insane than when I got my license. Not the driving test exactly - although I've heard California's actual test is bonkers - but the paperwork. Like two forms of verification that the student is currently enrolled in school, and just a wild proliferation of very exact, very obscure documents. I'm kind of surprised so many people manage to navigate it. Also you have to book months in advance because everything is so overbooked and understaffed. It's a very stressful process.
Does Texas still require you take the test with an open Shiner Bock can between your legs and not spill any?
Jammies likes to take coffee in a coffee mug into the car, and then berate himself when it sloshes. It's like he imprinted on the License to Drive scene with the dad from Family Matters.
I don't really see art house movies.
60: practical driving tests over here have also become bonkers because of a massive backlog from COVID. A friend of mine is considering waiting and doing hers in Australia because she can't book one anywhere in the UK. (She's going there anyway for a few months; it's not like she'd be flying out there specially.)
Like two forms of verification that the student is currently enrolled in school
This seems odd. If you're school-age but not in school, you aren't allowed to take a driving test?
I have no idea - if you took the GED or were in alternative schooling I'm sure there's a pathway - but this may be a mechanism to shit on kids who still manage to drop out.
And yep, I think it's still the covid backlog that they can't catch up on.
Is it now the Real ID backlog? That's the problem here.
Found it in Texas Transportation Code 521.204. If you're under 18 and not in school (and didn't get your diploma or GED already), proof of being enrolled in a GED prep program can substitute.
Home school also counts as school.
But if you've just dropped out & have nothing else going on - yes, the law seems to prohibit a DL until you turn 18.
Home school is great until you realize the theme for prom is always incest.
As opposed to only about 20% of the time in regular-school prom, depending on the state.
The loophole is that documenting home schooling requires an old piece of toilet paper from your shoe, or any piece of fabric or paper or gum wrapper you may be able to locate.
I think our theme was "These are the times to remember."
re: 64.1
The real reason is that they allow unscrupulous agencies to bulk-buy test slots and then sell them on at a massive markup, I believe.
74: that would increase price (to the detriment of the public and the benefit of the agencies) but wouldn't increase the backlog, because the agencies are not hoarding these slots, they are selling them all on. Whatever's happening with people bulk booking and reselling and so on, there is a limited number of driving test slots every day and those are (presumably) all getting filled by people who want driving tests. The length of a queue doesn't change because you allow some people to buy their way to the front of it.
If anything, if the agencies are driving up the price, there'll be some marginal people who will just decide they don't want to learn to drive at all if a test costs that much, and so the backlog should fall.
Shouldn't it?
"I don't know enough about LLMs' coding ability, but would anyone be prepared to bet that Builder.ai won't be real by 2030?"
It's real now, lol. You can use Codex or Claude Code and get functional applications with very little effort.
Another issue seems to be that, until very recently, there were just fewer driving examiners around than there used to be (the numbers are now increasing very slowly as of last year).
And the ones there are seem to be working less hard. 15% fewer tests were provided in the first four months of this year than in the first four months of 2024. And, unsurprisingly, the backlog went up substantially this year...
Rich people buying slots on speculation and then leaving them empty if they get busy or realize they are not ready yet?
57: That would be cherry picking unrepresentative results.
What do you get out of being so belligerent, as in 28, 38, and 57?
When he's not bickering or being belligerent, he's actually quite charming.
"Rich people buying slots on speculation and then leaving them empty if they get busy or realize they are not ready yet?"
This is happening- not just rich people either, it's so difficult to get a slot and you have to book so far in advance that people book tests before they start lessons and just hope they'll be ready. But you can cancel up to three days before without penalty. Or resell your slot of course. So I don't know if there are that many slots actually left vacant...
79: no, that was picking an example at random, which is the opposite of cherry picking. I assure you it was representative of my experience.
69: my understanding (which certainly could be wrong) was that this is a deliberate choice to disincentivize dropping out. Here it's not legal until 18 anyway.
When I spent almost a decade as a transcriptionist, now almost that long ago, our process involved a pair: the first person created the transcript and the second edited it. There was a huge range of abilities and outcomes within our office, and we were better than the other, bigger office, which was miles better than trying to get this done cheaper in the Philippines. But I was laid off when the company decided to mostly send the work to the Philippines and have a few American teams do high-profile stuff if clients would pay more for that. (And then I got a series of concussions and long covid, and I don't have the memory to transcribe even if I wanted to, when before it was practically a superpower and I could do it while reading or writing something else.)