ReactiveX (Rx, also known as Reactive Extensions) is a software library originally created by Microsoft that allows imperative programming languages to operate on sequences of data regardless of whether the data is synchronous or asynchronous. It provides a set of sequence operators that operate on each item in the sequence. It is an implementation of reactive programming and provides a blueprint for the tools to be implemented in multiple programming languages. == Overview == ReactiveX is an API for asynchronous programming with observable streams. Asynchronous programming allows programmers to call functions and then have the functions "callback" when they are done, usually by giving the function the address of another function to execute when it is done. Programs designed in this way often avoid the overhead of having many threads constantly starting and stopping. Observable streams (i.e. streams that can be observed) in the context of Reactive Extensions are like event emitters that emit three events: next, error, and complete. An observable emits next events until it either emits an error event or a complete event. However, at that point it will not emit any more events, unless it is subscribed to again. The examples below use the RxJS implementation of Reactive Extensions for the JavaScript programming language. === Motivation === For sequences of data, it combines the advantages of iterators with the flexibility of event-based asynchronous programming. It also works as a simple promise, eliminating the pyramid of doom that results from multiple layers of callbacks. === Observables and observers === ReactiveX is a combination of ideas from the observer and the iterator patterns and from functional programming. An observer subscribes to an observable sequence. The sequence then sends the items to the observer one at a time, usually by calling the provided callback function. The observer handles each one before processing the next one. If many events come in asynchronously, they must be stored in a queue or dropped. In ReactiveX, an observer will never be called with an item out of order or (in a multi-threaded context) called before the callback has returned for the previous item. Asynchronous calls remain asynchronous and may be handled by returning an observable. It is similar to the iterators pattern in that if a fatal error occurs, it notifies the observer separately (by calling a second function). When all the items have been sent, it completes (and notifies the observer by calling a third function). The Reactive Extensions API also borrows many of its operators from iterator operators in other programming languages. Reactive Extensions is different from functional reactive programming as the Introduction to Reactive Extensions explains: It is sometimes called "functional reactive programming" but this is a misnomer. ReactiveX may be functional, and it may be reactive, but "functional reactive programming" is a different animal. One main point of difference is that functional reactive programming operates on values that change continuously over time, while ReactiveX operates on discrete values that are emitted over time. (See Conal Elliott's work for more-precise information on functional reactive programming.) === Reactive operators === An operator is a function that takes one observable (the source) as its first argument and returns another observable (the destination, or outer observable). Then for every item that the source observable emits, it will apply a function to that item, and then emit it on the destination Observable. It can even emit another Observable on the destination observable. This is called an inner observable. An operator that emits inner observables can be followed by another operator that in some way combines the items emitted by all the inner observables and emits the item on its outer observable. Examples include: switchAll – subscribes to each new inner observable as soon as it is emitted and unsubscribes from the previous one. mergeAll – subscribes to all inner observables as they are emitted and outputs their values in whatever order it receives them. concatAll – subscribes to each inner observable in order and waits for it to complete before subscribing to the next observable. Operators can be chained together to create complex data flows that filter events based on certain criteria. Multiple operators can be applied to the same observable. Some of the operators that can be used in Reactive Extensions may be familiar to programmers who use functional programming language, such as map, reduce, group, and zip. There are many other operators available in Reactive Extensions, though the operators available in a particular implementation for a programming language may vary. ==== Reactive operator examples ==== Here is an example of using the map and reduce operators. We create an observable from a list of numbers. The map operator will then multiply each number by two and return an observable. The reduce operator will then sum up all the numbers provided to it (the value of 0 is the starting point). Calling subscribe will register an observer that will observe the values from the observable produced by the chain of operators. With the subscribe method, we are able to pass in an error-handling function, called whenever an error is emitted in the observable, and a completion function when the observable has finished emitting items. ==== Usage in stream-oriented programming ==== Certain RxJS primitives such as BehaviorSubject make it possible to create pure stateful streams to track application state of arbitrary complexity in simple terms. The button below will feed an event to the stream, which in turn will re-emit the next natural number every time, back into the tag that follows and displays the count of clicks detected. Libraries such as Rimmel.js, designed around RxJS Observables, enable integration between reactive streams and the HTML DOM: == History == Reactive Extensions was created by the Cloud Programmability Team at Microsoft around 2011, as a byproduct of a larger effort called Volta. It was originally intended to provide an abstraction for events across different tiers in an application to support tier splitting in Volta. The project's logo represents an electric eel, which is a reference to Volta. The extensions suffix in the name is a reference to the Parallel Extensions technology which was invented around the same time; the two are considered complementary. The initial implementation of Rx was for .NET Framework and was released on June 21, 2011. Later, the team started the implementation of Rx for other platforms, including JavaScript and C++. The technology was released as open source in late 2012, initially on CodePlex. Later, the code moved to GitHub and has been ported to several other languages, including Go, Java, Kotlin, PHP and Rust.
Jais (language model)
Jais is an open-source large language model launched in August 2023. Developed as a collaboration between Emirati AI company G42, the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), and US-based Cerebras Systems, Jais was designed to produce high-quality Arabic text and was also trained on English data. The model's creation was motivated by the underrepresentation of the Arabic language in the field of generative artificial intelligence. It aims to provide a more culturally and linguistically accurate model for the world's 400 million Arabic speakers. Its name is a reference to Jebel Jais, the highest mountain in the UAE. == Background and development == Jais was developed in response to the limited availability of advanced generative artificial intelligence models for the Arabic language, despite it being spoken by over 400 million people. Existing models were often trained on limited or low-quality Arabic web content, resulting in poor performance. The project represents a significant investment by the United Arab Emirates in the field of AI as part of its national strategy. The model was created through a partnership between Inception (now Core42), a subsidiary of the Abu Dhabi-based AI company G42; the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); and Cerebras Systems, a US company specializing in AI hardware. The model is named after Jebel Jais, the highest peak in the UAE. == Training == The initial version of Jais released in August 2023 had 13 billion parameters. In November 2023, Core42 released Jais 30B, an improved version with 30 billion parameters. Both models were trained on a subset of the Cerebras Condor Galaxy 1 supercomputer. The training dataset consisted of a mix of Arabic, English, and computer code. According to Timothy Baldwin, a professor of natural language processing at MBZUAI, training the model on a diverse Arabic dataset allows it to switch between dialects. == Features == Jais is designed to generate text in both English and Arabic. The project has also released instruction-tuned "Chat" variants for both the 13B and 30B models, which are specifically optimized for conversational applications. Additional functionality for working with images, graphs, and tabular data is planned for future releases.
IBM alignment models
The IBM alignment models are a sequence of increasingly complex models used in statistical machine translation to train a translation model and an alignment model, starting with lexical translation probabilities and moving to reordering and word duplication. They underpinned the majority of statistical machine translation systems for almost twenty years starting in the early 1990s, until neural machine translation began to dominate. These models offer principled probabilistic formulation and (mostly) tractable inference. The IBM alignment models were published in parts in 1988 and 1990, and the entire series is published in 1993. Every author of the 1993 paper subsequently went to the hedge fund Renaissance Technologies. The original work on statistical machine translation at IBM proposed five models, and a model 6 was proposed later. The sequence of the six models can be summarized as: Model 1: lexical translation Model 2: additional absolute alignment model Model 3: extra fertility model Model 4: added relative alignment model Model 5: fixed deficiency problem. Model 6: Model 4 combined with a HMM alignment model in a log linear way == Mathematical setup == The IBM alignment models translation as a conditional probability model. For each source-language ("foreign") sentence f {\displaystyle f} , we generate both a target-language ("English") sentence e {\displaystyle e} and an alignment a {\displaystyle a} . The problem then is to find a good statistical model for p ( e , a | f ) {\displaystyle p(e,a|f)} , the probability that we would generate English language sentence e {\displaystyle e} and an alignment a {\displaystyle a} given a foreign sentence f {\displaystyle f} . The meaning of an alignment grows increasingly complicated as the model version number grew. See Model 1 for the most simple and understandable version. == Model 1 == === Word alignment === Given any foreign-English sentence pair ( e , f ) {\displaystyle (e,f)} , an alignment for the sentence pair is a function of type { 1 , . , . . . , l e } → { 0 , 1 , . , . . . , l f } {\displaystyle \{1,.,...,l_{e}\}\to \{0,1,.,...,l_{f}\}} . That is, we assume that the English word at location i {\displaystyle i} is "explained" by the foreign word at location a ( i ) {\displaystyle a(i)} . For example, consider the following pair of sentences It will surely rain tomorrow -- 明日 は きっと 雨 だWe can align some English words to corresponding Japanese words, but not everyone:it -> ? will -> ? surely -> きっと rain -> 雨 tomorrow -> 明日This in general happens due to the different grammar and conventions of speech in different languages. English sentences require a subject, and when there is no subject available, it uses a dummy pronoun it. Japanese verbs do not have different forms for future and present tense, and the future tense is implied by the noun 明日 (tomorrow). Conversely, the topic-marker は and the grammar word だ (roughly "to be") do not correspond to any word in the English sentence. So, we can write the alignment as 1-> 0; 2 -> 0; 3 -> 3; 4 -> 4; 5 -> 1where 0 means that there is no corresponding alignment. Thus, we see that the alignment function is in general a function of type { 1 , . , . . . , l e } → { 0 , 1 , . , . . . , l f } {\displaystyle \{1,.,...,l_{e}\}\to \{0,1,.,...,l_{f}\}} . Future models will allow one English world to be aligned with multiple foreign words. === Statistical model === Given the above definition of alignment, we can define the statistical model used by Model 1: Start with a "dictionary". Its entries are of form t ( e i | f j ) {\displaystyle t(e_{i}|f_{j})} , which can be interpreted as saying "the foreign word f j {\displaystyle f_{j}} is translated to the English word e i {\displaystyle e_{i}} with probability t ( e i | f j ) {\displaystyle t(e_{i}|f_{j})} ". After being given a foreign sentence f {\displaystyle f} with length l f {\displaystyle l_{f}} , we first generate an English sentence length l e {\displaystyle l_{e}} uniformly in a range U n i f o r m [ 1 , 2 , . . . , N ] {\displaystyle Uniform[1,2,...,N]} . In particular, it does not depend on f {\displaystyle f} or l f {\displaystyle l_{f}} . Then, we generate an alignment uniformly in the set of all possible alignment functions { 1 , . , . . . , l e } → { 0 , 1 , . , . . . , l f } {\displaystyle \{1,.,...,l_{e}\}\to \{0,1,.,...,l_{f}\}} . Finally, for each English word e 1 , e 2 , . . . e l e {\displaystyle e_{1},e_{2},...e_{l_{e}}} , generate each one independently of every other English word. For the word e i {\displaystyle e_{i}} , generate it according to t ( e i | f a ( i ) ) {\displaystyle t(e_{i}|f_{a(i)})} . Together, we have the probability p ( e , a | f ) = 1 / N ( 1 + l f ) l e ∏ i = 1 l e t ( e i | f a ( i ) ) {\displaystyle p(e,a|f)={\frac {1/N}{(1+l_{f})^{l_{e}}}}\prod _{i=1}^{l_{e}}t(e_{i}|f_{a(i)})} IBM Model 1 uses very simplistic assumptions on the statistical model, in order to allow the following algorithm to have closed-form solution. === Learning from a corpus === If a dictionary is not provided at the start, but we have a corpus of English-foreign language pairs { ( e ( k ) , f ( k ) ) } k {\displaystyle \{(e^{(k)},f^{(k)})\}_{k}} (without alignment information), then the model can be cast into the following form: fixed parameters: the foreign sentences { f ( k ) } k {\displaystyle \{f^{(k)}\}_{k}} . learnable parameters: the entries of the dictionary t ( e i | f j ) {\displaystyle t(e_{i}|f_{j})} . observable variables: the English sentences { e ( k ) } k {\displaystyle \{e^{(k)}\}_{k}} . latent variables: the alignments { a ( k ) } k {\displaystyle \{a^{(k)}\}_{k}} In this form, this is exactly the kind of problem solved by expectation–maximization algorithm. Due to the simplistic assumptions, the algorithm has a closed-form, efficiently computable solution, which is the solution to the following equations: { max t ′ ∑ k ∑ i ∑ a ( k ) t ( a ( k ) | e ( k ) , f ( k ) ) ln t ( e i ( k ) | f a ( k ) ( i ) ( k ) ) ∑ x t ′ ( e x | f y ) = 1 ∀ y {\displaystyle {\begin{cases}\max _{t'}\sum _{k}\sum _{i}\sum _{a^{(k)}}t(a^{(k)}|e^{(k)},f^{(k)})\ln t(e_{i}^{(k)}|f_{a^{(k)}(i)}^{(k)})\\\sum _{x}t'(e_{x}|f_{y})=1\quad \forall y\end{cases}}} This can be solved by Lagrangian multipliers, then simplified. For a detailed derivation of the algorithm, see chapter 4 and. In short, the EM algorithm goes as follows:INPUT. a corpus of English-foreign sentence pairs { ( e ( k ) , f ( k ) ) } k {\displaystyle \{(e^{(k)},f^{(k)})\}_{k}} INITIALIZE. matrix of translations probabilities t ( e x | f y ) {\displaystyle t(e_{x}|f_{y})} .This could either be uniform or random. It is only required that every entry is positive, and for each y {\displaystyle y} , the probability sums to one: ∑ x t ( e x | f y ) = 1 {\displaystyle \sum _{x}t(e_{x}|f_{y})=1} . LOOP. until t ( e x | f y ) {\displaystyle t(e_{x}|f_{y})} converges: t ( e x | f y ) ← t ( e x | f y ) λ y ∑ k , i , j δ ( e x , e i ( k ) ) δ ( f y , f j ( k ) ) ∑ j ′ t ( e i ( k ) | f j ′ ( k ) ) {\displaystyle t(e_{x}|f_{y})\leftarrow {\frac {t(e_{x}|f_{y})}{\lambda _{y}}}\sum _{k,i,j}{\frac {\delta (e_{x},e_{i}^{(k)})\delta (f_{y},f_{j}^{(k)})}{\sum _{j'}t(e_{i}^{(k)}|f_{j'}^{(k)})}}} where each λ y {\displaystyle \lambda _{y}} is a normalization constant that makes sure each ∑ x t ( e x | f y ) = 1 {\displaystyle \sum _{x}t(e_{x}|f_{y})=1} .RETURN. t ( e x | f y ) {\displaystyle t(e_{x}|f_{y})} .In the above formula, δ {\displaystyle \delta } is the Dirac delta function -- it equals 1 if the two entries are equal, and 0 otherwise. The index notation is as follows: k {\displaystyle k} ranges over English-foreign sentence pairs in corpus; i {\displaystyle i} ranges over words in English sentences; j {\displaystyle j} ranges over words in foreign language sentences; x {\displaystyle x} ranges over the entire vocabulary of English words in the corpus; y {\displaystyle y} ranges over the entire vocabulary of foreign words in the corpus. === Limitations === There are several limitations to the IBM model 1. No fluency: Given any sentence pair ( e , f ) {\displaystyle (e,f)} , any permutation of the English sentence is equally likely: p ( e | f ) = p ( e ′ | f ) {\displaystyle p(e|f)=p(e'|f)} for any permutation of the English sentence e {\displaystyle e} into e ′ {\displaystyle e'} . No length preference: The probability of each length of translation is equal: ∑ e has length l p ( e | f ) = 1 N {\displaystyle \sum _{e{\text{ has length }}l}p(e|f)={\frac {1}{N}}} for any l ∈ { 1 , 2 , . . . , N } {\displaystyle l\in \{1,2,...,N\}} . Does not explicitly model fertility: some foreign words tend to produce a fixed number of English words. For example, for German-to-English translation, ja is usually omitted, and zum is usually translated to one of to the, for the, to a, for a. == Model 2 == Model 2 allows alignment to be conditional on sentence lengths. That is, we have a probability distribution p a ( j | i , l e , l f ) {\displaystyle
Myhill–Nerode theorem
In the theory of formal languages, the Myhill–Nerode theorem provides a necessary and sufficient condition for a language to be regular. The theorem is named for John Myhill and Anil Nerode, who proved it at the University of Chicago in 1957 (Nerode & Sauer 1957, p. ii). == Statement == Given a language L {\displaystyle L} , and a pair of strings x {\displaystyle x} and y {\displaystyle y} , define a distinguishing extension to be a string z {\displaystyle z} such that exactly one of the two strings x z {\displaystyle xz} and y z {\displaystyle yz} belongs to L {\displaystyle L} . Define a relation ∼ L {\displaystyle \sim _{L}} on strings as x ∼ L y {\displaystyle x\;\sim _{L}\ y} if there is no distinguishing extension for x {\displaystyle x} and y {\displaystyle y} . It is easy to show that ∼ L {\displaystyle \sim _{L}} is an equivalence relation on strings, and thus it divides the set of all strings into equivalence classes. The Myhill–Nerode theorem states that a language L {\displaystyle L} is regular if and only if ∼ L {\displaystyle \sim _{L}} has a finite number of equivalence classes, and moreover, that this number is equal to the number of states in the minimal deterministic finite automaton (DFA) accepting L {\displaystyle L} . Furthermore, every minimal DFA for the language is isomorphic to the canonical one (Hopcroft & Ullman 1979). Generally, for any language, the constructed automaton is a state automaton acceptor. However, it does not necessarily have finitely many states. The Myhill–Nerode theorem shows that finiteness is necessary and sufficient for language regularity. Some authors refer to the ∼ L {\displaystyle \sim _{L}} relation as Nerode congruence, in honor of Anil Nerode. == Use and consequences == The Myhill–Nerode theorem may be used to show that a language L {\displaystyle L} is regular by proving that the number of equivalence classes of ∼ L {\displaystyle \sim _{L}} is finite. This may be done by an exhaustive case analysis in which, beginning from the empty string, distinguishing extensions are used to find additional equivalence classes until no more can be found. For example, the language consisting of binary representations of numbers that can be divided by 3 is regular. Given two binary strings x , y {\displaystyle x,y} , extending them by one digit gives 2 x + b , 2 y + b {\displaystyle 2x+b,2y+b} , so 2 x + b ≡ 2 y + b mod 3 {\displaystyle 2x+b\equiv 2y+b\mod 3} iff x ≡ y mod 3 {\displaystyle x\equiv y\mod 3} . Thus, 00 {\displaystyle 00} (or 11 {\displaystyle 11} ), 01 {\displaystyle 01} , and 10 {\displaystyle 10} are the only distinguishing extensions, resulting in the 3 classes. The minimal automaton accepting our language would have three states corresponding to these three equivalence classes. Another immediate corollary of the theorem is that if for a language L {\displaystyle L} the relation ∼ L {\displaystyle \sim _{L}} has infinitely many equivalence classes, it is not regular. It is this corollary that is frequently used to prove that a language is not regular. == Generalizations == The Myhill–Nerode theorem can be generalized to tree automata.
Project Bergamot
Project Bergamot is a joint project between several European universities and Mozilla for the development of machine translation software based on artificial neural networks, which is intended for local execution on end-user devices. The software library that was created and the associated language models were made available to the general public as Free Software. Execution requires a x86 CPU with SSE4.1 instruction set extensions. In 2022, Devin Coldewey of TechCrunch judged the translation quality to be "more than adequate", but considered Firefox Translations to be not yet fully mature. == Usage == Mozilla used the Bergamot Translator to expand its web browser Firefox with a feature for translating web pages, which was previously considered an important gap in Firefox' feature set. It is often compared to the much older corresponding feature in Google Chrome, which utilizes a cloud-based background service. In contrast, Firefox Translations does not require any data to leave the user's computer, resulting in advantages in terms of data protection, availability and possibly response times. There is just the installation of a new language model that needs to take place the first time a new language is encountered. Greater independence from large technology companies and their interests is also mentioned as an important advantage. Mozilla thus strengthened its position as an alternative software vendor with a particular focus on data protection and security. Mozilla followed up with the similar feature of speech recognition for spoken user input, based on whisperfile. On the other hand, slow translation times have been observed, especially on older devices. Also, Firefox Translations initially supported far fewer language pairs than other major translation services and is only gradually adding new models. On that matter, the training pipeline is also made available to interested parties to enable the creation of missing language models. TranslateLocally is a Firefox-independent translation software based on the Bergamot Translator. It is also available as an (Electron-based) standalone application or as an extension for Chromium-based web browsers. == History == Mozilla had already tried to get a (cloud-based) web content translation feature into Firefox a few years before Project Bergamot, but had failed because of the financial challenge. Microsoft had already delivered offline capabilities for its translation software in 2018. Google soon followed suit, Apple two years later. The software is based on the free translation framework Marian, which the University of Edinburgh had previously developed in cooperation with Microsoft, and is itself based on the Nematus toolkit that was presented in 2017. Under the leadership of the University of Edinburgh, a development consortium was formed with the Mozilla Corporation and the additional European universities of Prague, Sheffield and Tartu. In 2018, it was able to get 3 million euros of funding from the EU's Horizon 2020 programme. Firefox Translations was initially provided as an add-on. A first functional demonstration prototype was presented in October 2019. Beta version 117 had the feature integrated directly into the browser, the official release was in version 118 from September 2023. Both the add-on module and as part of Firefox, the code and the models are subject to the version 2 of the Mozilla Public License. Since 2022, the EU-funded HPLT project creates new language models. It involves additional partners, including the universities of Helsinki, Turku, Oslo and other partners from Spain, Norway and the Czech Republic.
Public computer
A public computer (or public access computer) is any of various computers available in public areas. Some places where public computers may be available are libraries, schools, or dedicated facilities run by government. Public computers share similar hardware and software components to personal computers, however, the role and function of a public access computer is entirely different. A public access computer is used by many different untrusted individuals throughout the course of the day. The computer must be locked down and secure against both intentional and unintentional abuse. Users typically do not have authority to install software or change settings. A personal computer, in contrast, is typically used by a single responsible user, who can customize the machine's behavior to their preferences. Public access computers are often provided with tools such as a PC reservation system to regulate access. The world's first public access computer center was the Marin Computer Center in California, co-founded by David and Annie Fox in 1977. == Kiosks == A kiosk is a special type of public computer using software and hardware modifications to provide services only about the place the kiosk is in. For example, a movie ticket kiosk can be found at a movie theater. These kiosks are usually in a secure browser with zero access to the desktop. Many of these kiosks may run Linux, however, ATMs, a kiosk designed for depositing money, often run Windows XP. == Public computers in the United States == === Library computers === In the United States and Canada, almost all public libraries have computers available for the use of patrons, though some libraries will impose a time limit on users to ensure others will get a turn and keep the library less busy. Users are often allowed to print documents that they have created using these computers, though sometimes for a small fee. ==== Privacy ==== Privacy is an important part of the public library institution, since the libraries entitle the public to intellectual freedom. Use of any computer or network may create records of users' activities that can jeopardize their privacy. It is possible for a patron to jeopardize their privacy if they do not delete cache, clear cookies, or documents from the public computer. In order for a member of the public to remain private on a computer, the American Library Association (ALA) has guidelines. These give patrons an idea of the right way to keep using public library computers. In their provision of services to library users, librarians have an ethical responsibility, expressed in the ALA Code of Ethics, to preserve users' right to privacy. A librarian is also responsible for giving users an understanding of private patron use and access. Libraries must ensure that users have the following rights when browsing on public computers: the computer automatically will clear a users history; libraries should display privacy screens so users do not see another patron's screen; updating software for effective safety measures; restoration data software to clear documents that users may have left on their computers and to combat possible malware; security practices; and making users aware of any possible monitoring of their browsing activities. Users can also view the Library Privacy Checklist for Public Access Computers and Networks to better understand what libraries strive for when protecting privacy. === School computers === The U.S. government has given money to many school boards to purchase computers for educational applications. Schools may have multiple computer labs, which contain these computers for students to use. There is usually Internet access on these machines, but some schools will put up a blocking service to limit the websites that students are able to access to only include educational resources, such as Google. In addition to controlling the content students are viewing, putting up these blocks can also help to keep the computers safe by preventing students from downloading malware and other threats. However, the effectiveness of such content filtering systems is questionable since it can easily be circumvented by using proxy websites, Virtual Private Networks, and for some weak security systems, merely knowing the IP address of the intended website is enough to bypass the filter. School computers often have advanced operating system security to prevent tech-savvy students from inflicting damage (i.e. the Windows Registry Editor and Task Manager, etc.) are disabled on Microsoft Windows machines. Schools with very advanced tech services may also install a locked down BIOS/firmware or make kernel-level changes to the operating system, precluding the possibility of unauthorized activity.
Wasserstein GAN
The Wasserstein Generative Adversarial Network (WGAN) is a variant of generative adversarial network (GAN) proposed in 2017 that aims to "improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches". Compared with the original GAN discriminator, the Wasserstein GAN discriminator provides a better learning signal to the generator. This allows the training to be more stable when generator is learning distributions in very high dimensional spaces. == Motivation == === The GAN game === The original GAN method is based on the GAN game, a zero-sum game with 2 players: generator and discriminator. The game is defined over a probability space ( Ω , B , μ r e f ) {\displaystyle (\Omega ,{\mathcal {B}},\mu _{ref})} , The generator's strategy set is the set of all probability measures μ G {\displaystyle \mu _{G}} on ( Ω , B ) {\displaystyle (\Omega ,{\mathcal {B}})} , and the discriminator's strategy set is the set of measurable functions D : Ω → [ 0 , 1 ] {\displaystyle D:\Omega \to [0,1]} . The objective of the game is L ( μ G , D ) := E x ∼ μ r e f [ ln D ( x ) ] + E x ∼ μ G [ ln ( 1 − D ( x ) ) ] . {\displaystyle L(\mu _{G},D):=\mathbb {E} _{x\sim \mu _{ref}}[\ln D(x)]+\mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))].} The generator aims to minimize it, and the discriminator aims to maximize it. A basic theorem of the GAN game states that Repeat the GAN game many times, each time with the generator moving first, and the discriminator moving second. Each time the generator μ G {\displaystyle \mu _{G}} changes, the discriminator must adapt by approaching the ideal D ∗ ( x ) = d μ r e f d ( μ r e f + μ G ) . {\displaystyle D^{}(x)={\frac {d\mu _{ref}}{d(\mu _{ref}+\mu _{G})}}.} Since we are really interested in μ r e f {\displaystyle \mu _{ref}} , the discriminator function D {\displaystyle D} is by itself rather uninteresting. It merely keeps track of the likelihood ratio between the generator distribution and the reference distribution. At equilibrium, the discriminator is just outputting 1 2 {\displaystyle {\frac {1}{2}}} constantly, having given up trying to perceive any difference. Concretely, in the GAN game, let us fix a generator μ G {\displaystyle \mu _{G}} , and improve the discriminator step-by-step, with μ D , t {\displaystyle \mu _{D,t}} being the discriminator at step t {\displaystyle t} . Then we (ideally) have L ( μ G , μ D , 1 ) ≤ L ( μ G , μ D , 2 ) ≤ ⋯ ≤ max μ D L ( μ G , μ D ) = 2 D J S ( μ r e f ‖ μ G ) − 2 ln 2 , {\displaystyle L(\mu _{G},\mu _{D,1})\leq L(\mu _{G},\mu _{D,2})\leq \cdots \leq \max _{\mu _{D}}L(\mu _{G},\mu _{D})=2D_{JS}(\mu _{ref}\|\mu _{G})-2\ln 2,} so we see that the discriminator is actually lower-bounding D J S ( μ r e f ‖ μ G ) {\displaystyle D_{JS}(\mu _{ref}\|\mu _{G})} . === Wasserstein distance === Thus, we see that the point of the discriminator is mainly as a critic to provide feedback for the generator, about "how far it is from perfection", where "far" is defined as Jensen–Shannon divergence. Naturally, this brings the possibility of using a different criteria of farness. There are many possible divergences to choose from, such as the f-divergence family, which would give the f-GAN. The Wasserstein GAN is obtained by using the Wasserstein metric, which satisfies a "dual representation theorem" that renders it highly efficient to compute: A proof can be found in the main page on Wasserstein metric. == Definition == By the Kantorovich-Rubenstein duality, the definition of Wasserstein GAN is clear:A Wasserstein GAN game is defined by a probability space ( Ω , B , μ r e f ) {\displaystyle (\Omega ,{\mathcal {B}},\mu _{ref})} , where Ω {\displaystyle \Omega } is a metric space, and a constant K > 0 {\displaystyle K>0} . There are 2 players: generator and discriminator (also called "critic"). The generator's strategy set is the set of all probability measures μ G {\displaystyle \mu _{G}} on ( Ω , B ) {\displaystyle (\Omega ,{\mathcal {B}})} . The discriminator's strategy set is the set of measurable functions of type D : Ω → R {\displaystyle D:\Omega \to \mathbb {R} } with bounded Lipschitz-norm: ‖ D ‖ L ≤ K {\displaystyle \|D\|_{L}\leq K} . The Wasserstein GAN game is a zero-sum game, with objective function L W G A N ( μ G , D ) := E x ∼ μ G [ D ( x ) ] − E x ∼ μ r e f [ D ( x ) ] . {\displaystyle L_{WGAN}(\mu _{G},D):=\mathbb {E} _{x\sim \mu _{G}}[D(x)]-\mathbb {E} _{x\sim \mu _{ref}}[D(x)].} The generator goes first, and the discriminator goes second. The generator aims to minimize the objective, and the discriminator aims to maximize the objective: min μ G max D L W G A N ( μ G , D ) . {\displaystyle \min _{\mu _{G}}\max _{D}L_{WGAN}(\mu _{G},D).} By the Kantorovich-Rubenstein duality, for any generator strategy μ G {\displaystyle \mu _{G}} , the optimal reply by the discriminator is D ∗ {\displaystyle D^{}} , such that L W G A N ( μ G , D ∗ ) = K ⋅ W 1 ( μ G , μ r e f ) . {\displaystyle L_{WGAN}(\mu _{G},D^{})=K\cdot W_{1}(\mu _{G},\mu _{ref}).} Consequently, if the discriminator is good, the generator would be constantly pushed to minimize W 1 ( μ G , μ r e f ) {\displaystyle W_{1}(\mu _{G},\mu _{ref})} , and the optimal strategy for the generator is just μ G = μ r e f {\displaystyle \mu _{G}=\mu _{ref}} , as it should. == Comparison with GAN == In the Wasserstein GAN game, the discriminator provides a better gradient than in the GAN game. Consider for example a game on the real line where both μ G {\displaystyle \mu _{G}} and μ r e f {\displaystyle \mu _{ref}} are Gaussian. Then the optimal Wasserstein critic D W G A N {\displaystyle D_{WGAN}} and the optimal GAN discriminator D {\displaystyle D} are plotted as below: For fixed discriminator, the generator needs to minimize the following objectives: For GAN, E x ∼ μ G [ ln ( 1 − D ( x ) ) ] {\displaystyle \mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))]} . For Wasserstein GAN, E x ∼ μ G [ D W G A N ( x ) ] {\displaystyle \mathbb {E} _{x\sim \mu _{G}}[D_{WGAN}(x)]} . Let μ G {\displaystyle \mu _{G}} be parametrized by θ {\displaystyle \theta } , then we can perform stochastic gradient descent by using two unbiased estimators of the gradient: ∇ θ E x ∼ μ G [ ln ( 1 − D ( x ) ) ] = E x ∼ μ G [ ln ( 1 − D ( x ) ) ⋅ ∇ θ ln ρ μ G ( x ) ] {\displaystyle \nabla _{\theta }\mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))]=\mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))\cdot \nabla _{\theta }\ln \rho _{\mu _{G}}(x)]} ∇ θ E x ∼ μ G [ D W G A N ( x ) ] = E x ∼ μ G [ D W G A N ( x ) ⋅ ∇ θ ln ρ μ G ( x ) ] {\displaystyle \nabla _{\theta }\mathbb {E} _{x\sim \mu _{G}}[D_{WGAN}(x)]=\mathbb {E} _{x\sim \mu _{G}}[D_{WGAN}(x)\cdot \nabla _{\theta }\ln \rho _{\mu _{G}}(x)]} where we used the reparameterization trick. As shown, the generator in GAN is motivated to let its μ G {\displaystyle \mu _{G}} "slide down the peak" of ln ( 1 − D ( x ) ) {\displaystyle \ln(1-D(x))} . Similarly for the generator in Wasserstein GAN. For Wasserstein GAN, D W G A N {\displaystyle D_{WGAN}} has gradient 1 almost everywhere, while for GAN, ln ( 1 − D ) {\displaystyle \ln(1-D)} has flat gradient in the middle, and steep gradient elsewhere. As a result, the variance for the estimator in GAN is usually much larger than that in Wasserstein GAN. See also Figure 3 of. The problem with D J S {\displaystyle D_{JS}} is much more severe in actual machine learning situations. Consider training a GAN to generate ImageNet, a collection of photos of size 256-by-256. The space of all such photos is R 256 2 {\displaystyle \mathbb {R} ^{256^{2}}} , and the distribution of ImageNet pictures, μ r e f {\displaystyle \mu _{ref}} , concentrates on a manifold of much lower dimension in it. Consequently, any generator strategy μ G {\displaystyle \mu _{G}} would almost surely be entirely disjoint from μ r e f {\displaystyle \mu _{ref}} , making D J S ( μ G ‖ μ r e f ) = + ∞ {\displaystyle D_{JS}(\mu _{G}\|\mu _{ref})=+\infty } . Thus, a good discriminator can almost perfectly distinguish μ r e f {\displaystyle \mu _{ref}} from μ G {\displaystyle \mu _{G}} , as well as any μ G ′ {\displaystyle \mu _{G}'} close to μ G {\displaystyle \mu _{G}} . Thus, the gradient ∇ μ G L ( μ G , D ) ≈ 0 {\displaystyle \nabla _{\mu _{G}}L(\mu _{G},D)\approx 0} , creating no learning signal for the generator. Detailed theorems can be found in. == Training Wasserstein GANs == Training the generator in Wasserstein GAN is just gradient descent, the same as in GAN (or most deep learning methods), but training the discriminator is different, as the discriminator is now restricted to have bounded Lipschitz norm. There are several methods for this. === Upper-bounding the Lipschitz norm === Let the discriminator function D {\displaystyle D} to be implemented by a multilayer perceptron: D = D n ∘ D n − 1 ∘ ⋯ ∘ D 1 {\displaystyle D=D_{n}\circ D_{n-1}\circ \cdots \circ D_{1}} where D i ( x ) = h ( W i x ) {\displaystyle D_{i}(x)=h(W_