VoiceTextBlender introduces a novel approach to augmenting LLMs with speech capabilities through single-stage joint speech-text supervised fine-tuning. The researchers from Carnegie Mellon and NVIDIA have developed a more efficient way to create models that can handle both speech and text without compromising performance in either modality.

The team's 3B parameter model demonstrates superior performance compared to previous 7B and 13B SpeechLMs across various speech benchmarks whilst preserving the original text-only capabilities, addressing the critical challenge of catastrophic forgetting that has plagued earlier attempts.

Their technical approach employs LoRA adaptation of the LLM backbone, combining text-only SFT data with three distinct types of speech-related data: multilingual ASR/AST, speech-based question answering, and an innovative mixed-modal interleaving dataset created by applying TTS to randomly selected sentences from text SFT data.

What's particularly impressive is the model's emergent ability to handle multi-turn, mixed-modal conversations despite being trained only on single-turn speech interactions. The system can process user input in pure speech, pure text, or any combination, showing impressive generalisation to unseen prompts and tasks.

The researchers have committed to publicly releasing their data generation scripts, training code, and pre-trained model weights, which should significantly advance research in this rapidly evolving field of speech language models.

Paper: https://lnkd.in/dutRcaAA
Authors: Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg
#SpeechLM #MultimodalAI #SpeechAI
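The post describes the recipe only at a high level; the sketch below illustrates the general pattern of LoRA-adapting a frozen LLM backbone and mixing text-only SFT data with the three speech-related data types. It is a minimal sketch, not the authors' code: the backbone name, LoRA hyperparameters, mixing weights, and `sample_batch` helper are all illustrative assumptions.

```python
# Minimal sketch (not the VoiceTextBlender release): LoRA-adapting an LLM
# backbone for joint speech-text SFT. Model name, hyperparameters, and the
# data-mixing ratio are illustrative assumptions.
import random
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder 3B backbone
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Only the low-rank adapters (plus whatever speech encoder/projector feeds
# audio features into the LLM) are trained; the backbone stays frozen,
# which is what limits catastrophic forgetting of text-only skills.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

def sample_batch(text_sft, asr_ast, speech_qa, mixed_modal,
                 weights=(0.4, 0.2, 0.2, 0.2)):
    """Draw one training example from the four data sources.
    The 40/20/20/20 split is an illustrative guess, not the paper's recipe."""
    pool = random.choices([text_sft, asr_ast, speech_qa, mixed_modal],
                          weights=weights)[0]
    return random.choice(pool)
```

Training only the adapters while sampling from all four data streams in one stage is the core idea the post highlights: the 3B backbone keeps its text-only behaviour while picking up the new speech tasks.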
Speech Recognition Innovations
Explore top LinkedIn content from expert professionals.
Summary
Speech recognition innovations refer to the latest breakthroughs that enable computers and devices to accurately interpret and understand spoken language, going beyond simply converting speech to text. These advancements are making technology more responsive, accessible, and natural for real-time communication and personalized interaction.
- Explore real-time solutions: Try out new speech recognition tools that can process spoken commands instantly, allowing for smoother and faster conversations with technology.
- Consider accessibility tools: Look into non-invasive wearables and AI-powered devices designed to help people with speech impairments communicate more freely and restore their natural voice.
- Embrace intent-based AI: Experiment with systems that focus on understanding the meaning behind what you say, not just the words, for more intuitive and human-like interactions.
-
Breakthrough: BCI + AI = instant mind-to-speech conversion. A new device can detect words and turn them into speech within three seconds.

📍 The researchers used deep-learning RNN-T models to achieve fluent, large-vocabulary speech synthesis, with neural decoding in 80-ms increments.

In the study, Ann, a participant who lost her ability to speak after a stroke 18 years ago, had a paper-thin rectangle containing 253 electrodes placed on the surface of her brain's speech sensorimotor cortex to record the activity of thousands of neurons.

The researchers even personalized the synthetic voice: they applied AI to recordings from her wedding video, so the synthetic voice sounds like Ann's own voice from before her injury.

❗ The result:
Before: a single sentence took >20 seconds.
Now: 47-90 words per minute.

"Our framework also successfully generalized to other silent-speech interfaces, including single-unit recordings and electromyography. Our findings introduce a speech-neuroprosthetic paradigm to restore naturalistic spoken communication to people with paralysis."

Huge congratulations to the authors of this work! Just WOW.
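For readers curious what "neural decoding in 80-ms increments" looks like in practice, here is a minimal sketch of the streaming pattern: a stateful causal model consumes short chunks of multichannel neural features and emits output for each chunk instead of waiting for the whole sentence. The GRU stand-in, feature rate, and output units are assumptions, not the study's RNN-T implementation.

```python
# Minimal sketch (not the study's code): streaming neural features to a causal
# decoder in 80-ms increments. Shapes, sampling rate, and the model itself are
# illustrative placeholders.
import torch
import torch.nn as nn

SAMPLE_RATE_HZ = 200                  # assumed feature rate of the recordings
CHANNELS = 253                        # ECoG electrodes reported in the study
STEP = int(0.080 * SAMPLE_RATE_HZ)    # 80-ms chunk -> 16 feature frames

class CausalDecoder(nn.Module):
    """Stand-in for the RNN-T decoder: the GRU keeps state across chunks so
    output can be emitted incrementally rather than after the full sentence."""
    def __init__(self, n_units=256, n_out=40):   # n_out ~ acoustic/phoneme units
        super().__init__()
        self.rnn = nn.GRU(CHANNELS, n_units, batch_first=True)
        self.head = nn.Linear(n_units, n_out)

    def forward(self, chunk, state=None):
        hidden, state = self.rnn(chunk, state)
        return self.head(hidden), state

decoder, state = CausalDecoder(), None
stream = torch.randn(1, 10 * STEP, CHANNELS)     # fake 800 ms of neural features
for t in range(0, stream.shape[1], STEP):
    logits, state = decoder(stream[:, t:t + STEP], state)
    # logits for this 80-ms increment would feed a vocoder / speech synthesizer
```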
-
376,000 ALS patients type 10 words per minute. MIT just gave them normal speech speed.

No sound. No surgery. Just seven sensors reading jaw signals.

Arnav Kapur and his team built AlterEgo with one mission: empower people with ALS and oral cancer, not replace them. Their wearable reads signals your brain sends to silent muscles, with 92% accuracy and half-second response.

The cost breakthrough that matters:
↳ Neuralink surgery: $30,000-$100,000+
↳ Brain implants: Infection risks, select trials only
↳ Current ALS devices: $1,500-$8,000 robotic voices
↳ AlterEgo target: Same price, your actual voice

Think about that. No drilling into skulls like Synchron or UC Davis implants. No $100,000 medical bills. Just electrodes on your jaw detecting the same signals you use to read silently.

Traditional Assistive Reality:
↳ Eye-tracking: 10 exhausting words per minute
↳ Brain surgery: $100,000+ with infection risks
↳ Robotic voices destroying identity
↳ Most patients priced out entirely

AlterEgo Reality:
↳ Think naturally, speak instantly
↳ Non-invasive wearable design
↳ Your voice preserved digitally
↳ First responders using it for silent comms

But here's what stopped me cold: the same device restoring voices to ALS patients is being tested for secure translation, silent note-taking, and emergency response teams. One innovation serving different needs with high impact.

Consumer EEG headsets cost $100-$1,000 but can't handle real speech. Medical BCIs require brain surgery. AlterEgo sits between: medical-grade accuracy without medical risks.

The Multiplication Effect:
1 voice preserved = independence restored
100 patients reconnected = isolation broken
1,000 using AlterEgo = new communication standard
At scale = surgery becomes obsolete

From MIT lab to human trials. From $100,000 brain implants to accessible wearables. From "I need surgery to speak" to "I just need to think."

Kapur's team chose technology that empowers rather than replaces human ability. Because 376,000 people with ALS and oral cancer deserve their own voice, not a robot's.

Follow me, Dr. Martha Boeckenfeld for innovations that restore human dignity without invasion.
♻️ Share if everyone deserves to keep their voice.
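As a rough illustration of the signal path the post describes (a handful of jaw/face sensors, a classifier over short windows, no audio and no surgery), here is a minimal sketch. The channel count matches the seven sensors mentioned above, but the network, window length, and vocabulary size are purely illustrative assumptions, not AlterEgo's actual design.

```python
# Minimal sketch (not AlterEgo's implementation): classifying silently
# articulated words from a few channels of jaw/face neuromuscular signals.
import torch
import torch.nn as nn

N_CHANNELS = 7        # sensors along the jaw and face (per the post)
WINDOW = 250          # samples per decision window (e.g. 0.5 s at 500 Hz, assumed)
VOCAB = 100           # number of silently articulated words/commands (assumed)

class SilentSpeechNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(N_CHANNELS, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, VOCAB)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.fc(self.conv(x).squeeze(-1))

model = SilentSpeechNet()
window = torch.randn(1, N_CHANNELS, WINDOW)        # one half-second of signal
predicted_word_id = model(window).argmax(dim=-1)   # index into the vocabulary
```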
-
Voice agents are having their moment in 2025: an open-source breakthrough just redefined real-time multimodal AI by slashing interaction latency to 1.5 seconds, challenging the recently released proprietary real-time APIs from OpenAI and Google.

VITA-1.5, the latest iteration of the open-source interactive omni-multimodal LLM, brings three major improvements that push the boundaries of multimodal AI:
(1) Speed transformation: reduced end-to-end speech interaction latency from 4 seconds to 1.5 seconds, enabling true real-time conversations
(2) Speech processing leap: decreased Word Error Rate from 18.4 to 7.5, rivaling specialized speech models
(3) Multimodal excellence: boosted performance across MME, MMBench, and MathVista from 59.8 to 70.8 while maintaining robust vision-language capabilities

One novel method from the paper is VITA's progressive training strategy, which allows speech integration without compromising other multimodal capabilities, a persistent challenge in the field. Image understanding performance drops by only 0.5 points while the model gains an entirely new modality.

As we move towards agentic AI systems that need to process and respond to multiple input streams in real time, VITA-1.5's achievement in reducing latency while maintaining high accuracy across modalities sets a new standard for what's possible in open-source AI. This release signals a shift in the multimodal AI landscape, demonstrating that open-source alternatives can compete with proprietary solutions in the race for real-time, multi-sensory AI interactions.

VITA-1.5 https://lnkd.in/gj7pd77P
More tools, open-source models, and APIs for building voice agents in my recent AI Tidbits post https://lnkd.in/g9ebbfX3
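The progressive training strategy is the interesting technical bit: new modalities are added in later stages while previously trained components stay frozen, which is why the vision-language scores barely move. Below is a minimal sketch of that staging idea; the module names and stage split are assumptions, not VITA-1.5's actual architecture or recipe.

```python
# Minimal sketch (not the VITA-1.5 recipe): progressive multimodal training.
# A later stage trains only the speech pathway on top of frozen
# vision-language weights, so the new modality cannot erase earlier skills.
import torch.nn as nn

class OmniModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(1024, 4096)   # placeholder components
        self.speech_encoder = nn.Linear(80, 4096)
        self.llm = nn.Linear(4096, 4096)

def set_stage(model: OmniModel, stage: int) -> None:
    """Stage 1: train vision-language alignment.
    Stage 2: train only the speech pathway; everything else stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = [model.vision_encoder, model.llm] if stage == 1 else [model.speech_encoder]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True

model = OmniModel()
set_stage(model, stage=2)   # later stage: only speech weights receive gradients
```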
-
Machines used to hear us. Now they start to understand us.

Google's new model, Speech-to-Retrieval (S2R), skips transcription. It listens for meaning, not words.

✦ Old loop: Voice → Text → Search → Error → Frustration
✦ New loop: Voice → Intention → Retrieval → Result

No more "Scream" becoming "screen." No more brittle text layers between thought and answer.

⧉ This is more than an upgrade. It's a paradigm shift from speech as input to speech as understanding. Humans speak in nuance. Machines finally start to respond in kind.

What this means for us:
› Search becomes semantic.
› Interfaces become invisible.
› Conversation becomes computation.

When the system no longer asks what you said but starts inferring what you meant, the interface dissolves and intent becomes the command.

ツ The next frontier is not recognition. It's understanding. What happens when your system listens and truly knows what you mean?
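Mechanically, "speech as understanding" usually means a dual-encoder retrieval setup: the spoken query is embedded directly and matched against precomputed result embeddings, so no transcript ever exists to be wrong. Here is a minimal sketch of that idea; the encoders, dimensions, and corpus are placeholders and assumptions, not details of Google's S2R system.

```python
# Minimal sketch (not Google's S2R): dual-encoder speech-to-retrieval.
# Embed the spoken query directly and rank documents by cosine similarity,
# with no intermediate transcription step.
import numpy as np

EMB_DIM = 512

def audio_encoder(waveform: np.ndarray) -> np.ndarray:
    """Placeholder for a trained speech-query encoder."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2**32))
    v = rng.standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)

# Offline: documents embedded once by a (placeholder) document encoder.
doc_embeddings = np.random.randn(10_000, EMB_DIM)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def retrieve(waveform: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents closest to the spoken query.
    'Scream' vs 'screen' can no longer be confused by a transcription
    step, because that step never runs."""
    query = audio_encoder(waveform)
    scores = doc_embeddings @ query
    return np.argsort(-scores)[:k]

top_docs = retrieve(np.random.randn(16_000))   # ~1 s of 16 kHz audio
```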
-
[LA-RAG] 🔥 Struggling with 43.94% ASR error rates in dialects? LA-RAG can bring that down to 30.39%.

Automatic Speech Recognition (ASR) has seen great advancements, but it still struggles with diverse accents and dialects. Traditional ASR systems often fall short in high-variation acoustic environments, impacting accuracy.

Why LA-RAG: Enter LA-RAG (Retrieval-Augmented Generation), a revolutionary solution that enhances ASR systems by leveraging token-level speech retrieval. It utilizes in-context learning to drastically reduce errors, especially in dialect-heavy regions.

How it works (step by step):
1. Speech Tokenization: Extracts fine-grained speech tokens from audio using a pre-trained ASR model.
2. Datastore Creation: Builds a datastore with speech tokens and their correct text matches for retrieval.
3. Speech Retrieval: Queries the datastore to find similar speech sequences for better transcription.
4. Pruning: Filters out low-error tokens, focusing on the hardest transcription challenges.
5. LLM Prompting: Feeds retrieved speech examples and N-best transcriptions into the LLM to improve accuracy.
6. Adaptation: Aligns speech and text spaces for seamless integration, improving performance on accents.

Performance analysis:
🔹 LA-RAG cuts error rates from 43.94% to 30.39% on Chinese dialects
🔹 Mandarin ASR accuracy improves by 20%
🔹 Reduces error rates across dialects by up to 2.1%
🔹 Particularly effective for accents and regional variations.

P.S. Imagine more precise communication across languages and accents, without retraining entire models. LA-RAG is setting a new standard for ASR, making speech recognition more adaptable and reliable than ever.

#LLMs #SpeechRecognition #DataScience
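A minimal sketch of steps 2, 3, and 5 above, i.e. the retrieve-then-prompt loop: look up acoustically similar token sequences in a datastore, then hand the matched transcripts plus the N-best hypotheses to an LLM. The datastore format, similarity score, and prompt wording are assumptions, not the LA-RAG implementation.

```python
# Minimal sketch (not the LA-RAG code): token-level speech retrieval feeding
# an LLM correction prompt. Embedding shapes and scoring are placeholders.
import numpy as np

# Step 2: datastore of (speech-token embedding sequence, reference text) pairs.
datastore = [
    (np.random.randn(50, 256), "reference transcript one"),
    (np.random.randn(50, 256), "reference transcript two"),
]

def retrieve_examples(query_tokens: np.ndarray, k: int = 1):
    """Step 3: rank datastore entries by mean frame-level similarity."""
    def score(entry):
        emb, _ = entry
        n = min(len(emb), len(query_tokens))
        return float((emb[:n] * query_tokens[:n]).sum(axis=1).mean())
    return sorted(datastore, key=score, reverse=True)[:k]

def build_prompt(nbest: list[str], examples) -> str:
    """Step 5: give the LLM retrieved examples and N-best hypotheses and ask
    for the corrected transcription."""
    shots = "\n".join(f"Similar utterance transcript: {text}" for _, text in examples)
    hyps = "\n".join(f"Hypothesis {i+1}: {h}" for i, h in enumerate(nbest))
    return f"{shots}\n{hyps}\nCorrected transcription:"

query = np.random.randn(50, 256)                      # step 1: speech tokens
prompt = build_prompt(["hypothesis a", "hypothesis b"], retrieve_examples(query))
```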
-
A few days ago, we began our year-end series highlighting the top 5 news stories our readers found the most interesting. This is number 1: speaking without vocal cords, thanks to a new AI-assisted wearable device.

Bioengineers invented a thin (weighing only 7 grams), flexible device that adheres to the neck and translates the muscle movements of the larynx, with the assistance of machine-learning technology and nearly 95% accuracy, into audible speech! What an amazing technological support for people who have lost the ability to speak due to vocal cord problems.

"The tiny new patch-like device is made up of two components. One, a self-powered sensing component, detects and converts signals generated by muscle movements into high-fidelity, analyzable electrical signals; these electrical signals are then translated into speech signals using a machine-learning algorithm. The other, an actuation component, turns those speech signals into the desired voice expression."

Next: clinical trials, as well as enlarging the vocabulary of the device through machine learning.
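The quoted description maps onto a simple pipeline: a sensing stage that turns laryngeal muscle movements into analyzable features, a machine-learning stage that maps those features to an intended phrase, and an actuation stage that voices it. The sketch below is a toy illustration of that flow; the feature extraction, classifier, and phrase set are assumptions, not the published device's software.

```python
# Minimal sketch (not the device's software): sensing -> ML translation ->
# actuation, as described in the quote above. Everything here is a toy stand-in.
import numpy as np

PHRASES = ["hello", "I need help", "thank you", "yes", "no"]   # assumed vocabulary

def sensing_component(raw_signal: np.ndarray) -> np.ndarray:
    """Stand-in for the self-powered sensor: muscle movement -> analyzable
    electrical features (here, simple per-channel summary statistics)."""
    return np.concatenate([raw_signal.mean(axis=1), raw_signal.std(axis=1)])

def ml_translation(features: np.ndarray, weights: np.ndarray) -> str:
    """Stand-in for the machine-learning step: features -> intended phrase."""
    scores = weights @ features
    return PHRASES[int(np.argmax(scores))]

def actuation_component(phrase: str) -> None:
    """Stand-in for the actuation/TTS stage that voices the phrase."""
    print(f"speaking: {phrase}")

raw = np.random.randn(4, 500)                    # 4 channels, one utterance
weights = np.random.randn(len(PHRASES), 8)       # pretend-trained classifier
actuation_component(ml_translation(sensing_component(raw), weights))
```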
-
A new generation of customer-service voice bots is here, spurred by advances in artificial intelligence and a flood of cash, Belle L. reports.

Insurance marketplace eHealth, Inc. uses AI voice agents to handle its initial screening for potential customers when its human staff can't keep up with call volume, as well as after hours. The company slowly became more comfortable with using AI voice agents as the underlying technology improved, said Ketan Babaria, chief digital officer at eHealth. "Suddenly, we noticed these agents become very humanlike," Babaria said. "It's getting to a point where our customers are not able to differentiate between the two."

The transition is happening faster than many expected. "You have AI voice agents that you can interrupt, that proactively make logical suggestions, and there's very little or no latency in the conversation. That's a change that I thought was going to happen a year and a half or two years from now," said Tom Coshow, an analyst at market research and information-technology consulting firm Gartner.

Venture capital investment in voice AI startups increased from $315 million in 2022 to $2.1 billion in 2024, according to data from CB Insights. Some leading AI models for voice applications come from AI labs like OpenAI and Anthropic, startup founders and venture capitalists say, as well as smaller players like Deepgram and AssemblyAI, which have improved their speech-to-text or text-to-speech models over the past few years. For instance, OpenAI's Whisper model is a dedicated speech-to-text model, and its GPT-4o model can interact with people by voice in real time.