In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision. == History == The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and the Hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences. == Description == A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. Let X t {\displaystyle X_{t}} be a discrete hidden random variable with N {\displaystyle N} possible values (i.e. We assume there are N {\displaystyle N} states in total). We assume the P ( X t ∣ X t − 1 ) {\displaystyle P(X_{t}\mid X_{t-1})} is independent of time t {\displaystyle t} , which leads to the definition of the time-independent stochastic transition matrix A = { a i j } = P ( X t = j ∣ X t − 1 = i ) . {\displaystyle A=\{a_{ij}\}=P(X_{t}=j\mid X_{t-1}=i).} The initial state distribution (i.e. when t = 1 {\displaystyle t=1} ) is given by π i = P ( X 1 = i ) . {\displaystyle \pi _{i}=P(X_{1}=i).} The observation variables Y t {\displaystyle Y_{t}} can take one of K {\displaystyle K} possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation y i {\displaystyle y_{i}} at time t {\displaystyle t} for state X t = j {\displaystyle X_{t}=j} is given by b j ( y i ) = P ( Y t = y i ∣ X t = j ) . {\displaystyle b_{j}(y_{i})=P(Y_{t}=y_{i}\mid X_{t}=j).} Taking into account all the possible values of Y t {\displaystyle Y_{t}} and X t {\displaystyle X_{t}} , we obtain the N × K {\displaystyle N\times K} matrix B = { b j ( y i ) } {\displaystyle B=\{b_{j}(y_{i})\}} where b j {\displaystyle b_{j}} belongs to all the possible states and y i {\displaystyle y_{i}} belongs to all the observations. An observation sequence is given by Y = ( Y 1 = y 1 , Y 2 = y 2 , … , Y T = y T ) {\displaystyle Y=(Y_{1}=y_{1},Y_{2}=y_{2},\ldots ,Y_{T}=y_{T})} . Thus we can describe a hidden Markov chain by θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} . The Baum–Welch algorithm finds a local maximum for θ ∗ = a r g m a x θ P ( Y ∣ θ ) {\displaystyle \theta ^{}=\operatorname {arg\,max} _{\theta }P(Y\mid \theta )} (i.e. the HMM parameters θ {\displaystyle \theta } that maximize the probability of the observation). === Algorithm === Set θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum. ==== Forward procedure ==== Let α i ( t ) = P ( Y 1 = y 1 , … , Y t = y t , X t = i ∣ θ ) {\displaystyle \alpha _{i}(t)=P(Y_{1}=y_{1},\ldots ,Y_{t}=y_{t},X_{t}=i\mid \theta )} , the probability of seeing the observations y 1 , y 2 , … , y t {\displaystyle y_{1},y_{2},\ldots ,y_{t}} and being in state i {\displaystyle i} at time t {\displaystyle t} . This is found recursively: α i ( 1 ) = π i b i ( y 1 ) , {\displaystyle \alpha _{i}(1)=\pi _{i}b_{i}(y_{1}),} α i ( t + 1 ) = b i ( y t + 1 ) ∑ j = 1 N α j ( t ) a j i . {\displaystyle \alpha _{i}(t+1)=b_{i}(y_{t+1})\sum _{j=1}^{N}\alpha _{j}(t)a_{ji}.} Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling α {\displaystyle \alpha } in the forward and β {\displaystyle \beta } in the backward procedure below. ==== Backward procedure ==== Let β i ( t ) = P ( Y t + 1 = y t + 1 , … , Y T = y T ∣ X t = i , θ ) {\displaystyle \beta _{i}(t)=P(Y_{t+1}=y_{t+1},\ldots ,Y_{T}=y_{T}\mid X_{t}=i,\theta )} that is the probability of the ending partial sequence y t + 1 , … , y T {\displaystyle y_{t+1},\ldots ,y_{T}} given starting state i {\displaystyle i} at time t {\displaystyle t} . We calculate β i ( t ) {\displaystyle \beta _{i}(t)} as, β i ( T ) = 1 , {\displaystyle \beta _{i}(T)=1,} β i ( t ) = ∑ j = 1 N β j ( t + 1 ) a i j b j ( y t + 1 ) . {\displaystyle \beta _{i}(t)=\sum _{j=1}^{N}\beta _{j}(t+1)a_{ij}b_{j}(y_{t+1}).} ==== Update ==== We can now calculate the temporary variables, according to Bayes' theorem: γ i ( t ) = P ( X t = i ∣ Y , θ ) = P ( X t = i , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) β i ( t ) ∑ j = 1 N α j ( t ) β j ( t ) , {\displaystyle \gamma _{i}(t)=P(X_{t}=i\mid Y,\theta )={\frac {P(X_{t}=i,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)\beta _{i}(t)}{\sum _{j=1}^{N}\alpha _{j}(t)\beta _{j}(t)}},} which is the probability of being in state i {\displaystyle i} at time t {\displaystyle t} given the observed sequence Y {\displaystyle Y} and the parameters θ {\displaystyle \theta } ξ i j ( t ) = P ( X t = i , X t + 1 = j ∣ Y , θ ) = P ( X t = i , X t + 1 = j , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) a i j β j ( t + 1 ) b j ( y t + 1 ) ∑ k = 1 N ∑ w = 1 N α k ( t ) a k w β w ( t + 1 ) b w ( y t + 1 ) , {\displaystyle \xi _{ij}(t)=P(X_{t}=i,X_{t+1}=j\mid Y,\theta )={\frac {P(X_{t}=i,X_{t+1}=j,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)a_{ij}\beta _{j}(t+1)b_{j}(y_{t+1})}{\sum _{k=1}^{N}\sum _{w=1}^{N}\alpha _{k}(t)a_{kw}\beta _{w}(t+1)b_{w}(y_{t+1})}},} which is the probability of being in state i {\displaystyle i} and j {\displaystyle j} at times t {\displaystyle t} and t + 1 {\displaystyle t+1} respectively given the observed sequence Y {\displaystyle Y} and parameters θ {\displaystyle \theta } . The denominators of γ i ( t ) {\displaystyle \gamma _{i}(t)} and ξ i j ( t ) {\displaystyle \xi _{ij}(t)} are the same ; they represent the probability of making the observation Y {\displaystyle Y} given the parameters θ {\displaystyle \theta } . The parameters of the hidden Markov model θ {\displaystyle \theta } can now be updated: π i ∗ = γ i ( 1 ) , {\displaystyle \pi _{i}^{}=\gamma _{i}(1),} which is the expected frequency spent in state i {\displaystyle i} at time 1 {\displaystyle 1} . a i j ∗ = ∑ t = 1 T − 1 ξ i j ( t ) ∑ t = 1 T − 1 γ i ( t ) , {\displaystyle a_{ij}^{}={\frac {\sum _{t=1}^{T-1}\xi _{ij}(t)}{\sum _{t=1}^{T-1}\gamma _{i}(t)}},} which is the expected number of transitions from state i to state j compared to the expected total number of transitions starting in state i, including from state i to itself. The number of transitions starting in state i is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1. b i ∗ ( v k ) = ∑ t = 1 T 1 y t = v k γ i ( t ) ∑ t = 1 T γ i ( t ) , {\displaystyle b_{i}^{}(v_{k})={\frac {\sum _{t=1}^{T}1_{y_{t}=v_{k}}\gamma _{i}(t)}{\sum _{t=1}^{T}\gamma _{i}(t)}},} where 1 y t = v k = { 1 if y t = v k , 0 otherwise {\displaystyle 1_{y_{t}=v_{k}}={\begin{cases}1&{\text{if }}y_{t}=v_{k},\\0&{\text{otherwise}}\end{cases}}} is an indicator function, and b i ∗ ( v k ) {\displaystyle b_{i}^{}(v_{k})} is the expected number of times the output observations have been equal to v k {\displaystyle v_{k}} while in state i {\displaystyle i} over the expected total number of times in state i {\displaystyle i} . These steps are now repeated iteratively until a desired level of convergence. Note: It is possible to over-fit a particular data set. That is, P ( Y ∣ θ final ) > P ( Y ∣ θ true ) {\displaystyle P(Y\mid \theta _{\text{final}})>P(Y\mid \theta _{\text{true}})} . The algorithm also does not guarantee a global maximum. ==== Multiple sequences ==== The algorithm described thus far assumes a single observed sequence Y = y 1 , … , y T {\displaystyle Y=y_{1},\ldots ,y_{T}} . However, in many situations, there are several sequences observed: Y 1 ,
Cyber attribution
In the area of computer security, cyber attribution is an attribution of cybercrime, i.e., finding who perpetrated a cyberattack. Uncovering a perpetrator may give insights into various security issues, such as infiltration methods, communication channels, etc., and may help in enacting specific countermeasures. Cyber attribution is a costly endeavor requiring considerable resources and expertise in cyber forensic analysis. For governments and other major players dealing with cybercrime would require not only technical solutions, but legal and political ones as well, and for the latter ones cyber attribution is crucial. Attributing a cyberattack is difficult, and of limited interest to companies that are targeted by cyberattacks. In contrast, secret services often have a compelling interest in finding out whether a state is behind the attack. A further challenge in attribution of cyberattacks is the possibility of a false flag attack, where the actual perpetrator makes it appear that someone else caused the attack. Every stage of the attack may leave artifacts, such as entries in log files, that can be used to help determine the attacker's goals and identity. In the aftermath of an attack, investigators often begin by saving as many artifacts as they can find, and then try to determine the attacker.
Deep learning speech synthesis
Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks are trained using large amounts of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text. == Formulation == Given an input text or some sequence of linguistic units Y {\displaystyle Y} , the target speech X {\displaystyle X} can be derived by X = arg max P ( X | Y , θ ) {\displaystyle X=\arg \max P(X|Y,\theta )} where θ {\displaystyle \theta } is the set of model parameters. Typically, the input text will first be passed to an acoustic feature generator, then the acoustic features are passed to the neural vocoder. For the acoustic feature generator, the loss function is typically L1 loss (Mean Absolute Error, MAE) or L2 loss (Mean Square Error, MSE). These loss functions impose a constraint that the output acoustic feature distributions must be Gaussian or Laplacian. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function will be designed to have more penalty on this range: l o s s = α loss human + ( 1 − α ) loss other {\displaystyle loss=\alpha {\text{loss}}_{\text{human}}+(1-\alpha ){\text{loss}}_{\text{other}}} where loss human {\displaystyle {\text{loss}}_{\text{human}}} is the loss from human voice band and α {\displaystyle \alpha } is a scalar, typically around 0.5. The acoustic feature is typically a spectrogram or Mel scale. These features capture the time-frequency relation of the speech signal, and thus are sufficient to generate intelligent outputs. The Mel-frequency cepstrum feature used in the speech recognition task is not suitable for speech synthesis, as it reduces too much information. == History == In September 2016, DeepMind released WaveNet, which demonstrated that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features like spectrograms or mel-spectrograms. Although WaveNet was initially considered to be computationally expensive and slow to be used in consumer products at the time, a year after its release, DeepMind unveiled a modified version of WaveNet known as "Parallel WaveNet," a production model 1,000 faster than the original. This was followed by Google AI's Tacotron 2 in 2018, which demonstrated that neural networks could produce highly natural speech synthesis but required substantial training data—typically tens of hours of audio—to achieve acceptable quality. Tacotron 2 used an autoencoder architecture with attention mechanisms to convert input text into mel-spectrograms, which were then converted to waveforms using a separate neural vocoder. When trained on smaller datasets, such as 2 hours of speech, the output quality degraded while still being able to maintain intelligible speech, and with just 24 minutes of training data, Tacotron 2 failed to produce intelligible speech. In 2019, Microsoft Research introduced FastSpeech, which addressed speed limitations in autoregressive models like Tacotron 2. FastSpeech utilized a non-autoregressive architecture that enabled parallel sequence generation, significantly reducing inference time while maintaining audio quality. Its feedforward transformer network with length regulation allowed for one-shot prediction of the full mel-spectrogram sequence, avoiding the sequential dependencies that bottlenecked previous approaches. The same year saw the release of HiFi-GAN, a generative adversarial network (GAN)-based vocoder that improved the efficiency of waveform generation while producing high-fidelity speech. In 2020, the release of Glow-TTS introduced a flow-based approach that allowed for fast inference and voice style transfer capabilities. In March 2020, the free text-to-speech website 15.ai was launched. 15.ai gained widespread international attention in early 2021 for its ability to synthesize emotionally expressive speech of fictional characters from popular media with minimal amount of data. The creator of 15.ai (known pseudonymously as 15) stated that 15 seconds of training data is sufficient to perfectly clone a person's voice (hence its name, "15.ai"), a significant reduction from the previously known data requirement of tens of hours. 15.ai is credited as the first platform to popularize AI voice cloning in memes and content creation. 15.ai used a multi-speaker model that enabled simultaneous training of multiple voices and emotions, implemented sentiment analysis using DeepMoji, and supported precise pronunciation control via ARPABET. The 15-second data efficiency benchmark was later corroborated by OpenAI in 2024. == Semi-supervised learning == Currently, self-supervised learning has gained much attention through better use of unlabelled data. Research has shown that, with the aid of self-supervised loss, the need for paired data decreases. == Zero-shot speaker adaptation == Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristic. In June 2018, Google proposed to use pre-trained speaker verification models as speaker encoders to extract speaker embeddings. The speaker encoders then become part of the neural text-to-speech models, so that it can determine the style and characteristics of the output speech. This procedure has shown the community that it is possible to use only a single model to generate speech with multiple styles. == Neural vocoder == In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent performance on speech quality. Wavenet factorised the joint probability of a waveform x = { x 1 , . . . , x T } {\displaystyle \mathbf {x} =\{x_{1},...,x_{T}\}} as a product of conditional probabilities as follows p θ ( x ) = ∏ t = 1 T p ( x t | x 1 , . . . , x t − 1 ) {\displaystyle p_{\theta }(\mathbf {x} )=\prod _{t=1}^{T}p(x_{t}|x_{1},...,x_{t-1})} where θ {\displaystyle \theta } is the model parameter including many dilated convolution layers. Thus, each audio sample x t {\displaystyle x_{t}} is conditioned on the samples at all previous timesteps. However, the auto-regressive nature of WaveNet makes the inference process dramatically slow. To solve this problem, Parallel WaveNet was proposed. Parallel WaveNet is an inverse autoregressive flow-based model which is trained by knowledge distillation with a pre-trained teacher WaveNet model. Since such inverse autoregressive flow-based models are non-auto-regressive when performing inference, the inference speed is faster than real-time. Meanwhile, Nvidia proposed a flow-based WaveGlow model, which can also generate speech faster than real-time. However, despite the high inference speed, parallel WaveNet has the limitation of needing a pre-trained WaveNet model, so that WaveGlow takes many weeks to converge with limited computing devices. This issue has been solved by Parallel WaveGAN, which learns to produce speech through multi-resolution spectral loss and GAN learning strategies.
Trevor Paglen
Trevor Paglen (born 1974) is an American artist, geographer, and author whose work covers mass surveillance and data collection. In 2016, Paglen won the Deutsche Börse Photography Foundation Prize and he has also won The Cultural Award from the German Society for Photography. In 2017, he was a recipient of a MacArthur Fellowship. On March 17, 2026, Paglen was awarded the 2026 LG Guggenheim Award (a collaboration between LG and Guggenheim New York). == Early life and education == Paglen earned a B.A. degree in religious studies in 1998 from the University of California at Berkeley, a M.F.A. degree in 2002 from the School of the Art Institute of Chicago, and a Ph.D. in Geography in 2008 from the University of California at Berkeley. While at UC Berkeley, Paglen lived in the Berkeley Student Cooperative, residing in Chateau, Fenwick, and Rochdale co-ops. == Work == Sean O'Hagan, writing in The Guardian in 2015, said that Paglen, whose "ongoing grand project [is] the murky world of global state surveillance and the ethics of drone warfare", "is one of the most conceptually adventurous political artists working today, and has collaborated with scientists and human rights activists on his always ambitious multimedia projects." His visual work such as his "Limit Telephotography" and "The Other Night Sky" series have received widespread attention for both his technical innovations and for his conceptual project that involves simultaneously making and negating documentary-style truth-claims. Paglen’s work relies on contemporary technology in two meaningful ways. Firstly, the views he photographs would be impossible to shoot without media tech, that includes the cameras, the microscopes, and even helicopters. But interestingly enough, the shots would not be possible if not for the existence of the subject. The contrasts between secrecy and revelation, evidence and abstraction distinguish Paglen's work. With that the artist presents not so much "evidence" as admonitions to awareness. He was an Eyebeam Commissioned Artist in 2007. In 2008 the Berkeley Art Museum devoted a comprehensive solo exhibition to his work. In the next year, Paglen took part in the Istanbul Biennial, and in 2010 he exhibited at the Vienna Secession. Autonomy Cube was a project by Paglen and Jacob Appelbaum that placed relays for the anonymous communication network Tor in traditional art museums. He contributed to the Oscar-winning documentary film Citizenfour (2014), directed by Laura Poitras. Paglen features in the nerd-culture documentary Traceroute (2016). Orbital Reflector was a reflective, mylar sculpture by Paglen intended to be the first "purely artistic" object in space. The temporary satellite, containing an inflatable mylar balloon with reflective surface, launched into space 3 December 2018. A mid-career survey in 2018–2019, Trevor Paglen: Sites Unseen, was a traveling exhibition shown at the Smithsonian American Art Museum in Washington DC and the Museum of Contemporary Art San Diego. In September 2020, Pace Gallery in London held an exhibition of Paglen's work, exploring "the weird, partial ways computers look back at us". His work is included in the permanent collections of the San Francisco Museum of Modern Art, the Columbus Museum of Art, and the Metropolitan Museum. === Experimental Geography === Paglen is credited with coining the term "Experimental Geography" to describe practices coupling experimental cultural production and art-making with ideas from critical human geography about the production of space, materialism, and praxis. The 2009 book Experimental Geography: Radical Approaches to Landscape, Cartography, and Urbanism is largely inspired by Paglen's work. == Publications == Paglen has published a number of books. Torture Taxi (2006) (co-authored with investigative journalist A. C. Thompson) was the first book to comprehensively describe the CIA's extraordinary rendition program. I Could Tell You But Then You Would Have to be Destroyed by Me (2007), is a look at the world of black projects through unit patches and memorabilia created for top-secret programs. Blank Spots on the Map: The Dark Geography of the Pentagon's Secret World (2009) is a broader look at secrecy in the United States. The Last Pictures (2012) is a collection of 100 images to be placed on permanent media and launched into space on EchoStar XVI, as a repository available for future civilizations (alien or human) to find. === Publications by Paglen === I Could Tell You But Then You Would Have to be Destroyed by Me. Brooklyn, NY: Melville House, 2007. ISBN 1-933633-32-8. Blank Spots on the Map: The Dark Geography of the Pentagon's Secret World. New York: Dutton, 2009. ISBN 9781101011492. Invisible: Covert Operations and Classified Landscapes, Photographs by Trevor Paglen. New York: Aperture, 2010. ISBN 9781597111300. With an essay by Rebecca Solnit. The Last Pictures. Oakland, CA: University of California, 2012. ISBN 9780520275003. Trevor Paglen. London: Phaidon, 2018. ISBN 0714873446. With essays by Laren Cornell, Julia Bryan-Wilson, Omar Kholeif. === Publications co-authored === Torture Taxi. Co-authored with A. C. Thompson. Brooklyn, NY: Melville House Publishing, 2006. ISBN 1-933633-09-3. Icon, 2007. ISBN 9781840468304. === Publications with contributions by Paglen === Experimental Geography: Radical Approaches to Landscape, Cartography, and Urbanism. Brooklyn, NY: Melville House, 2009. ISBN 978-0091636586. Edited by Nato Thompson. With essays by Paglen, Thompson, and Jeffrey Kastner. Trevor Paglen and Jacob Appelbaum – Autonomy Cube. Revolver, 2016. ISBN 978-3957633026. Essays by Luke Skrebowski and Keller Easterling on Autonomy Cube, a piece of sculpture by Paglen and Jacob Appelbaum. In English and German. == Exhibitions == Bellwether Gallery, New York, November–December 2006 The Other Night Sky, Berkeley Art Museum, 2008 A Compendium of Secrets, Cologne Still Revolution: Suspended in Time, Museum of Contemporary Canadian Art, Toronto, May–June 2009. Group exhibition with Paglen, Barbara Astman, Walead Beshty, Mat Collishaw, Stan Douglas, Idris Khan, Martha Rosler, and Mikhael Subotzky A Hidden Landscape, Aksioma, Ljubljana, Slowenia Geographies of Seeing, Lighthouse, Brighton, England, October–November 2012 The Last Pictures, New York, 2012–13 Trevor Paglen, Altman Siegel gallery, San Francisco, CA, March–May 2015 The Octopus, Frankfurter Kunstverein, Frankfurt am Main, 2015 Autonomy Cube, Edith-Russ-Haus, Oldenburg, Germany, October 2015 – January 2016. Sculpture by Paglen and Jacob Appelbaum. Deutsche Börse Photography Foundation Prize 2016, The Photographers' Gallery, London, April–July 2016. Deutsche Börse Photography Prize shortlist with Paglen, Erik Kessels, Laura El-Tantawy, and Tobias Zielony. Radical Landscapes, di Rosa, Napa, February–April 2016 L’Image volée, Americas II, Bahamas Internet Cable System (BICS-1) and Globenet, Fondazione Prada, Milan (group exhibition), 2016 A Study of Invisible Images, Metro Pictures, New York, September–October 2017 == Awards == 2014: Pioneer Award from the Electronic Frontier Foundation. 2015: The Cultural Award from the German Society for Photography (DGPh) 2015: Academy Award as cameraman and director for the documentary film Citzenfour. 2016: Deutsche Börse Photography Foundation Prize 2017: MacArthur Fellowship, John D. and Catherine T. MacArthur Foundation, Chicago, IL 2018: Nam June Paik Art Center Prize == Films about Paglen == Unseen Skies (2021) == Works ==
Whisper (speech recognition system)
Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022. It is capable of transcribing speech in English and multiple other languages, and can translate several non-English languages into English. Whisper is a weakly-supervised deep learning acoustic model, made using an encoder-decoder transformer architecture. OpenAI claims that the combination of different training data and post-training filtering used in its development has led to improved recognition of accents, background noise, and jargon compared to previous approaches. While the model does not outperform larger, more specialized models and still experiences AI hallucination, it has been showed to be useful for general sound recognition and has many applications across different industries. == Background == Speech recognition has had a long history in research; the first approaches made use of statistical methods, such as dynamic time warping, and later hidden Markov models. At around the 2010s, deep neural network approaches became more common for speech recognition models, which were enabled by the availability of large datasets ("big data") and increased computational performance. Early approaches to deep learning in speech recognition included convolutional neural networks, which were limited due to their inability to capture sequential data, which later led to developments of Seq2seq approaches, which include recurrent neural networks, which made use of long short-term memory. Transformers, introduced in 2017 by Google, displaced many prior state-of-the-art approaches across a wide range in machine learning, and started becoming the core neural architecture in fields such as language modeling and computer vision. Weakly-supervised approaches to training acoustic models were recognized in the early 2020s as promising for speech recognition approaches using deep neural networks. According to a NYT report, in 2021 OpenAI believed they exhausted sources of higher-quality data to train their large language models and decided to complement scraped web text with transcriptions of YouTube videos and podcasts, and developed Whisper to solve this task. Whisper Large V2 was released on December 8, 2022, followed by Whisper Large V3 being released in November 2023, during the OpenAI Dev Day. In March 2025, OpenAI released new transcription models based on GPT-4o and GPT-4o mini, both of which have lower error rates than Whisper. == Architecture == The Whisper architecture is based on an encoder-decoder transformer. Input audio is resampled to 16,000 Hertz (Hz) and converted to an 80-channel Log-magnitude Mel spectrogram using 25 ms windows with a 10 ms stride. The spectrogram is then normalized to a [-1, 1] range with near-zero mean. The encoder takes this Mel spectrogram as input and processes it. It first passes through two convolutional layers. Sinusoidal positional embeddings are added. It is then processed by a series of Transformer encoder blocks (with pre-activation residual connections). The encoder's output is layer normalized. The decoder is a standard transformer decoder. It has the same width and Transformer blocks as the encoder. It uses learned positional embeddings and tied input-output token representations (using the same weight matrix for both the input and output embeddings). It uses a byte-pair encoding tokenizer, of the same kind as used in GPT-2. English-only models use the GPT-2 vocabulary, while multilingual models employ a re-trained multilingual vocabulary with the same number of words. Special tokens are used to allow the decoder to perform multiple tasks: Tokens that denote language (one unique token per language). Tokens that specify task (<|transcribe|> or <|translate|>). Tokens that specify if no timestamps are present (<|notimestamps|>). If the token is not present, then the decoder predicts timestamps relative to the segment, and quantized to 20 ms intervals. <|nospeech|> for voice activity detection. <|startoftranscript|>, and <|endoftranscript|> . Any text that appears before <|startoftranscript|> is not generated by the decoder, but given to the decoder as context. Loss is only computed over non-contextual parts of the sequence, i.e. tokens between these two special tokens. == Training data == The training dataset consists of 680,000 hours of labeled audio-transcript pairs sourced from the internet using semi-supervised learning. This includes 117,000 hours in 96 non-English languages and 125,000 hours of X→English translation data, where X stands for any non-English language. Preprocessing involved standardization of transcripts, filtering to remove machine-generated transcripts using heuristics (e.g., punctuation, capitalization), language identification and matching with transcripts, fuzzy deduplication, and deduplication with evaluation datasets to avoid data contamination. Speechless segments were also included to allow voice activity detection training. For the files still remaining after the filtering process, audio files were then broken into 30-second segments paired with the subset of the transcript that occurs within that time. If this predicted spoken language differed from the language of the text transcript associated with the audio, that audio-transcript pair was not used for training the speech recognition models, but instead for training translation. The model was trained using the AdamW optimizer with gradient norm clipping and a linear learning rate decay with warmup, with batch size 256 segments. Training proceeded for 1 million updates (approximately 2-3 epochs). No data augmentation or regularization, except for the Large V2 model, which used SpecAugment, Stochastic Depth, and BPE Dropout. The training used data parallelism with float16, dynamic loss scaling, and activation checkpointing. === Post-training filtering === After training the first model, researchers ran it on different subsets of the training data, each representing a distinct source. Data sources were ranked by a combination of their error rate and size. Manual inspection of the top-ranked sources (high error, large size) helped determine if the source was low quality (e.g., partial transcriptions, inaccurate alignment). After training, it was fine-tuned to suppress the prediction of speaker names and low-quality sources were then removed. == Capacity == While Whisper does not outperform models which specialize in the LibriSpeech dataset, when tested across many datasets, it is more robust and makes 55.2% fewer errors than other models. Whisper has a differing error rate with respect to transcribing different languages, with a higher word error rate in languages not well-represented in the training data. The authors found that multi-task learning improved overall performance compared to models specialized to one task. They conjectured that the best Whisper model trained is still underfitting the dataset, and larger models and longer training can result in better models. Third-party evaluations have found varying levels of AI hallucination. A study of transcripts of public meetings found hallucinations in eight out of every 10 transcripts, while an engineer discovered hallucinations in "about half" of 100 hours of transcriptions and a developer identified them in "nearly every one" of 26,000 transcripts. A study of 13,140 short audio segments (averaging 10 seconds) found 187 hallucinations (1.4%), 38% of which generated text that could be harmful because it inserted false references to things like race, non-existent medications, or violent events that were not in the audio. == Applications == The model has been used as the base for many applications, such as a unified model for speech recognition and more general sound recognition. Whisper has also been integrated into the workflow of biomedical research. In 2025, a study on Alzheimer's disease detection used the model to transcribe spontaneous speech recordings. The transcripts that were generated by the model were combined with LLM vector embeddings and traditional classifiers to help classify the patients' health. Another application is when OVALYTICS incorporated Whisper to transcribe YouTube videos and automate content moderation systems, which improved its detection of offensive content. The model has also been used in academic libraries and cultral heritage institutions to generate transcripts and captions for their digitized audiovisual collections. In a 2025 case study, Emory University Libraries found that Whisper reduced the labor used in transcription by around 30-35%, shifting work from text creation to text correction. However, human review is still necessary to make sure accuracy, formatting, and accessibility are all standard.
C-RAN
C-RAN (Cloud-RAN), also referred to as Centralized-RAN, is an architecture for cellular networks. C-RAN is a centralized, cloud computing-based architecture for radio access networks that supports 2G, 3G, 4G, 5G and future wireless communication standards. Its name comes from the four 'C's in the main characteristics of C-RAN system, "Clean, Centralized processing, Collaborative radio, and a real-time Cloud Radio Access Network". == Background == Traditional cellular, or Radio Access Networks (RAN), consist of many stand-alone base stations (BTS). Each BTS covers a small area, whereas a group BTS provides coverage over a continuous area. Each BTS processes and transmits its own signal to and from the mobile terminal, and forwards the data payload to and from the mobile terminal and out to the core network via the backhaul. Each BTS has its own cooling, back haul transportation, backup battery, monitoring system, and so on. Because of limited spectral resources, network operators 'reuse' the frequency among different base stations, which can cause interference between neighboring cells. There are several limitations in the traditional cellular architecture. First, each BTS is costly to build and operate. Moore's law helps reduce the size and power of an electrical system, but the supporting facilities of the BTS are not improved quite as well. Second, when more BTS are added to a system to improve its capacity, interference among BTS is more severe as BTS are closer to each other and more of them are using the same frequency. Third, because users are mobile, the traffic of each BTS fluctuates (called 'tide effect'), and as a result, the average utilization rate of individual BTS is pretty low. However, these processing resources cannot be shared with other BTS. Therefore, all BTS are designed to handle the maximum traffic, not average traffic, resulting in a waste of processing resources and power at idle times. == Evolution of base station architecture == === All-in-one macro base station === In the 1G and 2G cellular networks, base stations had an all-in-one architecture. Analog, digital, and power functions were housed in a single cabinet as large as a refrigerator. Usually the base station cabinet was placed in a dedicated room along with all necessary supporting facilitates such as power, backup battery, air conditioning, environment surveillance, and backhaul transmission equipment. The RF signal is generated by the base station RF unit and propagates through pairs of RF cables up to the antennas on the top of a base station tower or other mounting points. This all-in-one architecture was mostly found in macro cell deployments. === Distributed base station === For 3G, a distributed base station architecture was introduced by Ericsson, Nokia, Huawei, and other leading telecom equipment vendors. In this architecture the radio function unit, also known as the remote radio head (RRH), is separated from the digital function unit, or baseband unit (BBU) by fiber. Digital baseband signals are carried over fiber, using the Open Base Station Architecture Initiative (OBSAI) or Common Public Radio Interface (CPRI) standard. The RRH can be installed on the top of tower close to the antenna, reducing the loss compared to the traditional base station where the RF signal has to travel through a long cable from the base station cabinet to the antenna at the top of the tower. The fiber link between RRH and BBU also allows more flexibility in network planning and deployment as they can be placed a few hundred meters or a few kilometers away. Most modern base stations now use this decoupled architecture. === C-RAN/Cloud-RAN === C-RAN may be viewed as an architectural evolution of the above distributed base station system. It takes advantage of many technological advances in wireless, optical and IT communications systems. For example, it uses the latest CPRI standard, low cost Coarse or Dense Wavelength Division Multiplexing (CWDM/ DWDM) technology, and mmWave to allow transmission of baseband signal over long distance thus achieving large scale centralised base station deployment. It applies recent Data Centre Network technology to allow a low cost, high reliability, low latency and high bandwidth interconnect network in the BBU pool. It utilizes open platforms and real-time virtualization technology rooted in cloud computing to achieve dynamic shared resource allocation and support multi-vendor, multi-technology environments. == Architecture overview == C-RAN architecture has the following characteristics that are distinct from other cellular architectures: Large scale centralized deployment: Allows many RRHs to connect to a centralized BBU pool. The maximum distance can be 20km in fiber link for 4G (LTE/LTE-A) systems, and even longer distances (40~80km) for 3G (WCDMA/TD-SCDMA) and 2G (GSM/CDMA) systems. Native support to Collaborative Radio technologies: Any BBU can talk with any other BBU within the BBU pool with very high bandwidth (10 Gbit/s and above) and low latency (10 μs level). This is enabled by the interconnection of BBUs in the pool. This is one major difference from BBU Hotelling, or base station Hotelling; in the latter case, the BBUs of different base stations are simply stacked together and have no direct link between them to allow physical layer co-ordination. Real-time virtualization capability based on open platform: This is different from traditional base stations built on proprietary hardware, where the software and hardware are close-sourced and provided by single vendors. In contrast, a C-RAN BBU pool is built on open hardware, like x86/ARM CPU based servers, and interface cards that handle fiber links to RRHs and inter-connections in the pool. Real-time virtualization ensures that resources in the pool can be allocated dynamically to base station software stacks, say 4G/3G/2G function modules from different vendors, according to network load. However, to satisfy the strict timing requirements of wireless communication systems, the real-time performance for C-RAN is at the level of tens of microseconds, which is two orders of magnitude better than the millisecond level 'real-time' performance usually seen in Cloud Computing environments. == Similar architecture and systems == KT, a telecom operator in the Republic of Korea, introduced a Cloud Computing Center (CCC) system in their 3G (WCDMA/HSPA) and 4G (LTE/LTE-A) network in 2011 and 2012. The concept of CCC is basically the same as C-RAN. SK Telecom has also deployed Smart Cloud Access Network (SCAN) and Advanced-SCAN in their 4G (LTE/LTE-A) network in Korea no later than 2012. In 2014, Airvana (now CommScope) introduced OneCell, a C-RAN-based small cell system designed for enterprises and public spaces. == Competing architectures in cellular network evolution == === All-in-one BTS === One major alternative solution that is addressing similar challenges of RAN, is the small size, all-in-one outdoor BTS. Thanks to the achievements in the semiconductor industry, all the functionality of a BTS, including RF, baseband processing, MAC processing and package level processing, can now be implemented in a volume of <50 liters. This makes the system small and weatherproof, reduces the difficulty of BTS site choice and construction, eliminates the air conditioning requirement, and thus reduces operational costs. However, because each BTS is still working on its own, it cannot readily make use of the collaboration algorithms to reduce the interference between neighboring BTSs. It is also relatively hard to upgrade or repair because the all-in-one BTS units are usually mounted near the antenna. More processing units in less-protected environments also implies a higher failure rate compared to C-RAN, which only has the RRU deployed outdoors. The advantage of Cloud RAN lies in its ability to implement LTE-Advanced features such as Coordinated MultiPoint (CoMP) with very low latency between multiple radio heads. However, the economic benefit of improvements such as CoMP can be negated by the higher backhaul costs for some operators. === Small cell === The main competition between small cell and C-RAN occurs in two deployment scenarios: outdoor hotspot coverage and indoor coverage. == Academic research and publications == As one of the promising evolution paths for future cellular network architecture, C-RAN has attracted high academic research interest. Meanwhile, because the native support of cooperative radio capability built into the C-RAN architecture, it also enables many advanced algorithms that were hard to implement in cellular networks, including Cooperative Multi-Point Transmission/Receiving, Network Coding, etc. In October 2011, Wireless World Research Forum 27 was hosted in Germany, when China Mobile was invited to give a C-RAN presentation. In August 2012, IEEE C-RAN 2012 workshop was hosted in Kunming, China. CRC Press published a book, "Green Communications: Theore
GITEX AI Europe
GITEX AI Europe is an annual technology trade show and conference held in Berlin, Germany, as part of GITEX GLOBAL. The event focuses on the European technology market, specifically in the sectors of artificial intelligence (AI), cybersecurity, quantum computing, and digital infrastructure. The event is organized by Kaoun International GmbH, the international arm of the Dubai World Trade Centre (DWTC), in partnership with Messe Berlin. == History == The establishment of GITEX AI Europe was announced in 2023 as part of a strategic move to bring the GITEX brand to the European market. The inaugural edition took place from May 21 to 23, 2025, at the Messe Berlin exhibition grounds. The launch was supported by the Berlin Senate and the German Federal Ministry for Economic Affairs and Climate Action. The first edition of GITEX AI Europe in 2025 featured 21,650 attendees, 1,434 exhibiting companies, and 755 startups, with 513 speakers representing 125 countries. The next edition is scheduled for June 30 – July 1, 2026 in Berlin. == Program == The event consists of an exhibition floor for corporate displays, several conference stages for keynote speeches, and specialized sub-events. The conference program includes tracks such as "AI Stack Sovereignty," "Cyber Regulation & Trust Convergence," and "Institutional Growth Capital." GITEX AI Europe incorporates brands under its umbrella: AI Everything Europe: Focused on the development and application of generative AI and machine learning. North Star Europe: A dedicated program for startups and venture capital, featuring the "Supernova Challenge" pitch competition. GISEC Europe: A cybersecurity forum discussing regulation and infrastructure defense. GITEX Quantum Expo: Focused on the commercialization of quantum computing. Institutional partners for the event include the German Federal Ministry for Economic Affairs and Climate Action, the European Innovation Council (EIC), the International Telecommunication Union (ITU), Bitkom, and Digital Dubai.