AI Essay Viral

AI Essay Viral — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Exposure Notification

    Exposure Notification

    The (Google/Apple) Exposure Notification System (GAEN) is a framework and protocol specification developed by Apple Inc. and Google to facilitate digital contact tracing during the COVID-19 pandemic. When used by health authorities, it augments more traditional contact tracing techniques by automatically logging close approaches among notification system users using Android or iOS smartphones. Exposure Notification is a decentralized reporting protocol built on a combination of Bluetooth Low Energy technology and privacy-preserving cryptography. It is an opt-in feature within COVID-19 apps developed and published by authorized health authorities. Unveiled on April 10, 2020, it was made available on iOS on May 20, 2020, as part of the iOS 13.5 update and on December 14, 2020, as part of the iOS 12.5 update for older iPhones. On Android, it was added to devices via a Google Play Services update, supporting all versions since Android Marshmallow. The Apple/Google protocol is similar to the Decentralized Privacy-Preserving Proximity Tracing (DP-3T) protocol created by the European DP-3T consortium and the Temporary Contact Number (TCN) protocol by Covid Watch, but is implemented at the operating system level, which allows for more efficient operation as a background process. Since May 2020, a variant of the DP-3T protocol is supported by the Exposure Notification Interface. Other protocols are constrained in operation because they are not privileged over normal apps. This leads to issues, particularly on iOS devices where digital contact tracing apps running in the background experience significantly degraded performance. The joint approach is also designed to maintain interoperability between Android and iOS devices, which constitute nearly all of the market. The ACLU stated the approach "appears to mitigate the worst privacy and centralization risks, but there is still room for improvement". In late April, Google and Apple shifted the emphasis of the naming of the system, describing it as an "exposure notification service", rather than "contact tracing" system. == Technical specification == Digital contact tracing protocols typically have two major responsibilities: encounter logging and infection reporting. Exposure Notification only involves encounter logging which is a decentralized architecture. The majority of infection reporting is centralized in individual app implementations. To handle encounter logging, the system uses Bluetooth Low Energy to send tracking messages to nearby devices running the protocol to discover encounters with other people. The tracking messages contain unique identifiers that are encrypted with a secret daily key held by the sending device. These identifiers change every 15–20 minutes as well as Bluetooth MAC address in order to prevent tracking of clients by malicious third parties through observing static identifiers over time. The sender's daily encryption keys are generated using a random number generator. Devices record received messages, retaining them locally for 14 days. If a user tests positive for infection, the last 14 days of their daily encryption keys can be uploaded to a central server, where it is then broadcast to all devices on the network. The method through which daily encryption keys are transmitted to the central server and broadcast is defined by individual app developers. The Google-developed reference implementation calls for a health official to request a one-time verification code (VC) from a verification server, which the user enters into the encounter logging app. This causes the app to obtain a cryptographically signed certificate, which is used to authorize the submission of keys to the central reporting server. The received keys are then provided to the protocol, where each client individually searches for matches in their local encounter history. If a match meeting certain risk parameters is found, the app notifies the user of potential exposure to the infection. Google and Apple intend to use the received signal strength (RSSI) of the beacon messages as a source to infer proximity. RSSI and other signal metadata will also be encrypted to resist deanonymization attacks. === Version 1.0 === To generate encounter identifiers, first a persistent 32-byte private Tracing Key ( t k {\displaystyle tk} ) is generated by a client. From this a 16 byte Daily Tracing Key is derived using the algorithm d t k i = H K D F ( t k , N U L L , 'CT-DTK' | | D i , 16 ) {\displaystyle dtk_{i}=HKDF(tk,NULL,{\text{'CT-DTK'}}||D_{i},16)} , where H K D F ( Key, Salt, Data, OutputLength ) {\displaystyle HKDF({\text{Key, Salt, Data, OutputLength}})} is a HKDF function using SHA-256, and D i {\displaystyle D_{i}} is the day number for the 24-hour window the broadcast is in starting from Unix Epoch Time. These generated keys are later sent to the central reporting server should a user become infected. From the daily tracing key a 16-byte temporary Rolling Proximity Identifier is generated every 10 minutes with the algorithm R P I i , j = Truncate ( H M A C ( d t k i , 'CT-RPI' | | T I N j ) , 16 ) {\displaystyle RPI_{i,j}={\text{Truncate}}(HMAC(dtk_{i},{\text{'CT-RPI'}}||TIN_{j}),16)} , where H M A C ( Key, Data ) {\displaystyle HMAC({\text{Key, Data}})} is a HMAC function using SHA-256, and T I N j {\displaystyle TIN_{j}} is the time interval number, representing a unique index for every 10 minute period in a 24-hour day. The Truncate function returns the first 16 bytes of the HMAC value. When two clients come within proximity of each other they exchange and locally store the current R P I i , j {\displaystyle RPI_{i,j}} as the encounter identifier. Once a registered health authority has confirmed the infection of a user, the user's Daily Tracing Key for the past 14 days is uploaded to the central reporting server. Clients then download this report and individually recalculate every Rolling Proximity Identifier used in the report period, matching it against the user's local encounter log. If a matching entry is found, then contact has been established and the app presents a notification to the user warning them of potential infection. === Version 1.1 === Unlike version 1.0 of the protocol, version 1.1 does not use a persistent tracing key, rather every day a new random 16-byte Temporary Exposure Key ( t e k i {\displaystyle tek_{i}} ) is generated. This is analogous to the daily tracing key from version 1.0. Here i {\displaystyle i} denotes the time is discretized in 10 minute intervals starting from Unix Epoch Time. From this two 128-bit keys are calculated, the Rolling Proximity Identifier Key ( R P I K i {\displaystyle RPIK_{i}} ) and the Associated Encrypted Metadata Key ( A E M K i {\displaystyle AEMK_{i}} ). R P I K i {\displaystyle RPIK_{i}} is calculated with the algorithm R P I K i = H K D F ( t e k i , N U L L , 'EN-RPIK' , 16 ) {\displaystyle RPIK_{i}=HKDF(tek_{i},NULL,{\text{'EN-RPIK'}},16)} , and A E M K i {\displaystyle AEMK_{i}} using the algorithm A E M K i = H K D F ( t e k i , N U L L , 'EN-AEMK' , 16 ) {\displaystyle AEMK_{i}=HKDF(tek_{i},NULL,{\text{'EN-AEMK'}},16)} . From these values a temporary Rolling Proximity Identifier ( R P I i , j {\displaystyle RPI_{i,j}} ) is generated every time the BLE MAC address changes, roughly every 15–20 minutes. The following algorithm is used: R P I i , j = A E S 128 ( R P I K i , 'EN-RPI' | | 0 x 000000000000 | | E N I N j ) {\displaystyle RPI_{i,j}=AES128(RPIK_{i},{\text{'EN-RPI'}}||{\mathtt {0x000000000000}}||ENIN_{j})} , where A E S 128 ( Key, Data ) {\displaystyle AES128({\text{Key, Data}})} is an AES cryptography function with a 128-bit key, the data is one 16-byte block, j {\displaystyle j} denotes the Unix Epoch Time at the moment the roll occurs, and E N I N j {\displaystyle ENIN_{j}} is the corresponding 10-minute interval number. Next, additional Associated Encrypted Metadata is encrypted. What the metadata represents is not specified, likely to allow the later expansion of the protocol. The following algorithm is used: Associated Encrypted Metadata i , j = A E S 128 _ C T R ( A E M K i , R P I i , j , Metadata ) {\displaystyle {\text{Associated Encrypted Metadata}}_{i,j}=AES128\_CTR(AEMK_{i},RPI_{i,j},{\text{Metadata}})} , where A E S 128 _ C T R ( Key, IV, Data ) {\displaystyle AES128\_CTR({\text{Key, IV, Data}})} denotes AES encryption with a 128-bit key in CTR mode. The Rolling Proximity Identifier and the Associated Encrypted Metadata are then combined and broadcast using BLE. Clients exchange and log these payloads. Once a registered health authority has confirmed the infection of a user, the user's Temporary Exposure Keys t e k i {\displaystyle tek_{i}} and their respective interval numbers i {\displaystyle i} for the past 14 days are uploaded to the central reporting server. Clients then download this report and individually recalculate every Rolling Proximity Identifier starting from interval number i {\displaystyle i} ,

    Read more →
  • Spiking neural network

    Spiking neural network

    Spiking neural networks (SNNs) are artificial neural networks (ANN) that mimic natural neural networks. These models leverage timing of discrete spikes as the main information carrier. In addition to neuronal and synaptic state, SNNs incorporate the concept of time into their operating model. The idea is that neurons in the SNN do not transmit information at each propagation cycle (as it happens with typical multi-layer perceptron networks), but rather transmit information only when a membrane potential—an intrinsic quality of the neuron related to its membrane electrical charge—reaches a specific value, called the threshold. When the membrane potential reaches the threshold, the neuron fires, and generates a signal that travels to other neurons which, in turn, increase or decrease their potentials in response to this signal. A neuron model that fires at the moment of threshold crossing is also called a spiking neuron model. While spike rates can be considered the analogue of the variable output of a traditional ANN, neurobiology research indicated that high speed processing cannot be performed solely through a rate-based scheme. For example humans can perform an image recognition task requiring no more than 10ms of processing time per neuron through the successive layers (going from the retina to the temporal lobe). This time window is too short for rate-based encoding. The precise spike timings in a small set of spiking neurons also has a higher information coding capacity compared with a rate-based approach. The most prominent spiking neuron model is the leaky integrate-and-fire model. In that model, the momentary activation level (modeled as a differential equation) is normally considered to be the neuron's state, with incoming spikes pushing this value higher or lower, until the state eventually either decays or—if the firing threshold is reached—the neuron fires. After firing, the state variable is reset to a lower value. Various decoding methods exist for interpreting the outgoing spike train as a real-value number, relying on either the frequency of spikes (rate-code), the time-to-first-spike after stimulation, or the interval between spikes. == History == Many multi-layer artificial neural networks are fully connected, receiving input from every neuron in the previous layer and signalling every neuron in the subsequent layer. Although these networks have achieved breakthroughs, they do not match biological networks and do not mimic neurons. The biology-inspired Hodgkin–Huxley model of a spiking neuron was proposed in 1952. This model described how action potentials are initiated and propagated. Communication between neurons, which requires the exchange of chemical neurotransmitters in the synaptic gap, is described in models such as the integrate-and-fire model, FitzHugh–Nagumo model (1961–1962), and Hindmarsh–Rose model (1984). The leaky integrate-and-fire model (or a derivative) is commonly used as it is easier to compute than Hodgkin–Huxley. While the notion of an artificial spiking neural network became popular only in the twenty-first century, studies between 1980 and 1995 supported the concept. The first models of this type of ANN appeared to simulate non-algorithmic intelligent information processing systems. However, the notion of the spiking neural network as a mathematical model was first worked on in the early 1970s. As of 2019 SNNs lagged behind ANNs in accuracy, but the gap is decreasing, and has vanished on some tasks. == Underpinnings == Information in the brain is represented as action potentials (neuron spikes), which may group into spike trains or coordinated waves. A fundamental question of neuroscience is to determine whether neurons communicate by a rate or temporal code. Temporal coding implies that a single spiking neuron can replace hundreds of hidden units on a conventional neural net. SNNs define a neuron's current state as its potential (possibly modeled as a differential equation). An input pulse causes the potential to rise and then gradually decline. Encoding schemes can interpret these pulse sequences as a number, considering pulse frequency and pulse interval. Using the precise time of pulse occurrence, a neural network can consider more information and offer better computing properties. SNNs compute in the continuous domain. Such neurons test for activation only when their potentials reach a certain value. When a neuron is activated, it produces a signal that is passed to connected neurons, accordingly raising or lowering their potentials. The SNN approach produces a continuous output instead of the binary output of traditional ANNs. Pulse trains are not easily interpretable, hence the need for encoding schemes. However, a pulse train representation may be more suited for processing spatiotemporal data (or real-world sensory data classification). SNNs connect neurons only to nearby neurons so that they process input blocks separately (similar to CNN using filters). They consider time by encoding information as pulse trains so as not to lose information. This avoids the complexity of a recurrent neural network (RNN). Impulse neurons are more powerful computational units than traditional artificial neurons. SNNs are theoretically more powerful than so called "second-generation networks" defined as ANNs "based on computational units that apply activation function with a continuous set of possible output values to a weighted sum (or polynomial) of the inputs"; however, SNN training issues and hardware requirements limit their use. Although unsupervised biologically inspired learning methods are available such as Hebbian learning and STDP, no effective supervised training method is suitable for SNNs that can provide better performance than second-generation networks. Spike-based activation of SNNs is not differentiable, thus gradient descent-based backpropagation (BP) is not available. SNNs have much larger computational costs for simulating realistic neural models than traditional ANNs. Pulse-coupled neural networks (PCNN) are often confused with SNNs. A PCNN can be seen as a kind of SNN. Researchers are actively working on various topics. The first concerns differentiability. The expressions for both the forward- and backward-learning methods contain the derivative of the neural activation function which is not differentiable because a neuron's output is either 1 when it spikes, and 0 otherwise. This all-or-nothing behavior disrupts gradients and makes these neurons unsuitable for gradient-based optimization. Approaches to resolving it include: resorting to entirely biologically inspired local learning rules for the hidden units translating conventionally trained "rate-based" NNs to SNNs smoothing the network model to be continuously differentiable defining an SG (Surrogate Gradient) as a continuous relaxation of the real gradients The second concerns the optimization algorithm. Standard BP can be expensive in terms of computation, memory, and communication and may be poorly suited to the hardware that implements it (e.g., a computer, brain, or neuromorphic device). Incorporating additional neuron dynamics such as Spike Frequency Adaptation (SFA) is a notable advance, enhancing efficiency and computational power. These neurons sit between biological complexity and computational complexity. Originating from biological insights, SFA offers significant computational benefits by reducing power usage, especially in cases of repetitive or intense stimuli. This adaptation improves signal/noise clarity and introduces an elementary short-term memory at the neuron level, which in turn, improves accuracy and efficiency. This was mostly achieved using compartmental neuron models. The simpler versions are of neuron models with adaptive thresholds, are an indirect way of achieving SFA. It equips SNNs with improved learning capabilities, even with constrained synaptic plasticity, and elevates computational efficiency. This feature lessens the demand on network layers by decreasing the need for spike processing, thus lowering computational load and memory access time—essential aspects of neural computation. Moreover, SNNs utilizing neurons capable of SFA achieve levels of accuracy that rival those of conventional ANNs, while also requiring fewer neurons for comparable tasks. This efficiency streamlines the computational workflow and conserves space and energy, while maintaining technical integrity. High-performance deep spiking neural networks can operate with 0.3 spikes per neuron. == Applications == SNNs can in principle be applied to the same applications as traditional ANNs. In addition, SNNs can model the central nervous system of biological organisms, such as an insect seeking food without prior knowledge of the environment. Due to their relative realism, they can be used to study biological neural circuits. Starting with a hypothesis about the topology of a biological neuronal circuit and its functi

    Read more →
  • Margin-infused relaxed algorithm

    Margin-infused relaxed algorithm

    Margin-infused relaxed algorithm (MIRA) is a machine learning and online algorithm for multiclass classification problems. It is designed to learn a set of parameters (vector or matrix) by processing all the given training examples one-by-one and updating the parameters according to each training example, so that the current training example is classified correctly with a margin against incorrect classifications at least as large as their loss. The change of the parameters is kept as small as possible. A two-class version called binary MIRA simplifies the algorithm by not requiring the solution of a quadratic programming problem (see below). When used in a one-vs-all configuration, binary MIRA can be extended to a multiclass learner that approximates full MIRA, but may be faster to train. The flow of the algorithm looks as follows: The update step is then formalized as a quadratic programming problem: Find m i n ‖ w ( i + 1 ) − w ( i ) ‖ {\displaystyle min\|w^{(i+1)}-w^{(i)}\|} , so that s c o r e ( x t , y t ) − s c o r e ( x t , y ′ ) ≥ L ( y t , y ′ ) ∀ y ′ {\displaystyle score(x_{t},y_{t})-score(x_{t},y')\geq L(y_{t},y')\ \forall y'} , i.e. the score of the current correct training y {\displaystyle y} must be greater than the score of any other possible y ′ {\displaystyle y'} by at least the loss (number of errors) of that y ′ {\displaystyle y'} in comparison to y {\displaystyle y} .

    Read more →
  • IDistance

    IDistance

    In pattern recognition, iDistance is an indexing and query processing technique for k-nearest neighbor queries on point data in multi-dimensional metric spaces. The kNN query is one of the hardest problems on multi-dimensional data, especially when the dimensionality of the data is high. iDistance is designed to process kNN queries in high-dimensional spaces efficiently and performs extremely well for skewed data distributions, which usually occur in real-life data sets. iDistance employs a two-phase search strategy involving an initial filtering of candidate regions and a subsequent refinement of results, an approach aligned with the Filter and Refine Principle (FRP). This means that the index first prunes the search space to eliminate unlikely candidates, then verifies the true nearest neighbors in a refinement step, following the general FRP paradigm used in database search algorithms. The iDistance index can also be augmented with machine learning models to learn data distributions for improved searching and storage of multi-dimensional data. == Indexing == Building the iDistance index has two steps: A number of reference points in the data space are chosen. There are various ways of choosing reference points. Using cluster centers as reference points is the most efficient way. The data points are partitioned into Voronoi cells based on well-chosen reference points. The distance between a data point and its closest reference point is calculated. This distance plus a scaling value is called the point's iDistance. By this means, points in a multi-dimensional space are mapped to one-dimensional values, and then a B+-tree can be adopted to index the points using the iDistance as the key. The figure on the right shows an example where three reference points (O1, O2, O3) are chosen. The data points are then mapped to a one-dimensional space and indexed in a B+-tree. Various extensions have been proposed to make the selection of reference points for effective query performance, including employing machine learning to learn the identification of reference points. == Query processing == To process a kNN query, the query is mapped to a number of one-dimensional range queries, which can be processed efficiently on a B+-tree. In the above figure, the query Q is mapped to a value in the B+-tree while the kNN search ``sphere" is mapped to a range in the B+-tree. The search sphere expands gradually until the k NNs are found. This corresponds to gradually expanding range searches in the B+-tree. The iDistance technique can be viewed as a way of accelerating the sequential scan. Instead of scanning records from the beginning to the end of the data file, the iDistance starts the scan from spots where the nearest neighbors can be obtained early with a very high probability. == Applications == The iDistance has been used in many applications including Image retrieval Video indexing Similarity search in P2P systems Mobile computing Recommender system == Historical background == The iDistance was first proposed by Cui Yu, Beng Chin Ooi, Kian-Lee Tan and H. V. Jagadish in 2001. Later, together with Rui Zhang, they improved the technique and performed a more comprehensive study on it in 2005.

    Read more →
  • Trazzler

    Trazzler

    Trazzler is a travel destination app that specializes in unique and local destinations. The initial concept was developed by Adam Rugel and Biz Stone in 2006 at Twitter's original offices under the name "71 miles". More than 10,000 writers and photographers have contributed and more than $350,000 in freelance contracts have been issued as a result of Trazzeler's weekly writing and photography contests. Investors in the company include SV Angel, AOL Founder Steve Case, and the Twitter founders, Evan Williams, Jack Dorsey, and Biz Stone. The company's partners are the City of Chicago, Hawaii Tourism Authority, Fairmont Hotels & Resorts, Salon.com, and Air New Zealand. Trazzler is designed for use on the iOS, Android, and Facebook.

    Read more →
  • Representer theorem

    Representer theorem

    For computer science, in statistical learning theory, a representer theorem is any of several related results stating that a minimizer f ∗ {\displaystyle f^{}} of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data. == Formal statement == The following Representer Theorem and its proof are due to Schölkopf, Herbrich, and Smola: Theorem: Consider a positive-definite real-valued kernel k : X × X → R {\displaystyle k:{\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R} } on a non-empty set X {\displaystyle {\mathcal {X}}} with a corresponding reproducing kernel Hilbert space H k {\displaystyle H_{k}} . Let there be given a training sample ( x 1 , y 1 ) , … , ( x n , y n ) ∈ X × R {\displaystyle (x_{1},y_{1}),\dotsc ,(x_{n},y_{n})\in {\mathcal {X}}\times \mathbb {R} } , a strictly increasing real-valued function g : [ 0 , ∞ ) → R {\displaystyle g\colon [0,\infty )\to \mathbb {R} } , and an arbitrary error function E : ( X × R 2 ) n → R ∪ { ∞ } {\displaystyle E\colon ({\mathcal {X}}\times \mathbb {R} ^{2})^{n}\to \mathbb {R} \cup \lbrace \infty \rbrace } , which together define the following regularized empirical risk functional on H k {\displaystyle H_{k}} : f ↦ E ( ( x 1 , y 1 , f ( x 1 ) ) , … , ( x n , y n , f ( x n ) ) ) + g ( ‖ f ‖ ) . {\displaystyle f\mapsto E\left((x_{1},y_{1},f(x_{1})),\ldots ,(x_{n},y_{n},f(x_{n}))\right)+g\left(\lVert f\rVert \right).} Then, any minimizer of the empirical risk f ∗ = argmin f ∈ H k { E ( ( x 1 , y 1 , f ( x 1 ) ) , … , ( x n , y n , f ( x n ) ) ) + g ( ‖ f ‖ ) } , ( ∗ ) {\displaystyle f^{}={\underset {f\in H_{k}}{\operatorname {argmin} }}\left\lbrace E\left((x_{1},y_{1},f(x_{1})),\ldots ,(x_{n},y_{n},f(x_{n}))\right)+g\left(\lVert f\rVert \right)\right\rbrace ,\quad ()} admits a representation of the form: f ∗ ( ⋅ ) = ∑ i = 1 n α i k ( ⋅ , x i ) , {\displaystyle f^{}(\cdot )=\sum _{i=1}^{n}\alpha _{i}k(\cdot ,x_{i}),} where α i ∈ R {\displaystyle \alpha _{i}\in \mathbb {R} } for all 1 ≤ i ≤ n {\displaystyle 1\leq i\leq n} . Proof: Define a mapping φ : X → H k φ ( x ) = k ( ⋅ , x ) {\displaystyle {\begin{aligned}\varphi \colon {\mathcal {X}}&\to H_{k}\\\varphi (x)&=k(\cdot ,x)\end{aligned}}} (so that φ ( x ) = k ( ⋅ , x ) {\displaystyle \varphi (x)=k(\cdot ,x)} is itself a map X → R {\displaystyle {\mathcal {X}}\to \mathbb {R} } ). Since k {\displaystyle k} is a reproducing kernel, then φ ( x ) ( x ′ ) = k ( x ′ , x ) = ⟨ φ ( x ′ ) , φ ( x ) ⟩ , {\displaystyle \varphi (x)(x')=k(x',x)=\langle \varphi (x'),\varphi (x)\rangle ,} where ⟨ ⋅ , ⋅ ⟩ {\displaystyle \langle \cdot ,\cdot \rangle } is the inner product on H k {\displaystyle H_{k}} . Given any x 1 , … , x n {\displaystyle x_{1},\ldots ,x_{n}} , one can use orthogonal projection to decompose any f ∈ H k {\displaystyle f\in H_{k}} into a sum of two functions, one lying in span ⁡ { φ ( x 1 ) , … , φ ( x n ) } {\displaystyle \operatorname {span} \left\lbrace \varphi (x_{1}),\ldots ,\varphi (x_{n})\right\rbrace } , and the other lying in the orthogonal complement: f = ∑ i = 1 n α i φ ( x i ) + v , {\displaystyle f=\sum _{i=1}^{n}\alpha _{i}\varphi (x_{i})+v,} where ⟨ v , φ ( x i ) ⟩ = 0 {\displaystyle \langle v,\varphi (x_{i})\rangle =0} for all i {\displaystyle i} . The above orthogonal decomposition and the reproducing property together show that applying f {\displaystyle f} to any training point x j {\displaystyle x_{j}} produces f ( x j ) = ⟨ ∑ i = 1 n α i φ ( x i ) + v , φ ( x j ) ⟩ = ∑ i = 1 n α i ⟨ φ ( x i ) , φ ( x j ) ⟩ , {\displaystyle f(x_{j})=\left\langle \sum _{i=1}^{n}\alpha _{i}\varphi (x_{i})+v,\varphi (x_{j})\right\rangle =\sum _{i=1}^{n}\alpha _{i}\langle \varphi (x_{i}),\varphi (x_{j})\rangle ,} which we observe is independent of v {\displaystyle v} . Consequently, the value of the error function E {\displaystyle E} in () is likewise independent of v {\displaystyle v} . For the second term (the regularization term), since v {\displaystyle v} is orthogonal to ∑ i = 1 n α i φ ( x i ) {\displaystyle \sum _{i=1}^{n}\alpha _{i}\varphi (x_{i})} and g {\displaystyle g} is strictly monotonic, we have g ( ‖ f ‖ ) = g ( ‖ ∑ i = 1 n α i φ ( x i ) + v ‖ ) = g ( ‖ ∑ i = 1 n α i φ ( x i ) ‖ 2 + ‖ v ‖ 2 ) ≥ g ( ‖ ∑ i = 1 n α i φ ( x i ) ‖ ) . {\displaystyle {\begin{aligned}g\left(\lVert f\rVert \right)&=g\left(\lVert \sum _{i=1}^{n}\alpha _{i}\varphi (x_{i})+v\rVert \right)\\&=g\left({\sqrt {\lVert \sum _{i=1}^{n}\alpha _{i}\varphi (x_{i})\rVert ^{2}+\lVert v\rVert ^{2}}}\right)\\&\geq g\left(\lVert \sum _{i=1}^{n}\alpha _{i}\varphi (x_{i})\rVert \right).\end{aligned}}} Therefore, setting v = 0 {\displaystyle v=0} does not affect the first term of (), while it strictly decreases the second term. Consequently, any minimizer f ∗ {\displaystyle f^{}} in () must have v = 0 {\displaystyle v=0} , i.e., it must be of the form f ∗ ( ⋅ ) = ∑ i = 1 n α i φ ( x i ) = ∑ i = 1 n α i k ( ⋅ , x i ) , {\displaystyle f^{}(\cdot )=\sum _{i=1}^{n}\alpha _{i}\varphi (x_{i})=\sum _{i=1}^{n}\alpha _{i}k(\cdot ,x_{i}),} which is the desired result. == Generalizations == The Theorem stated above is a particular example of a family of results that are collectively referred to as "representer theorems"; here we describe several such. The first statement of a representer theorem was due to Kimeldorf and Wahba for the special case in which E ( ( x 1 , y 1 , f ( x 1 ) ) , … , ( x n , y n , f ( x n ) ) ) = 1 n ∑ i = 1 n ( f ( x i ) − y i ) 2 , g ( ‖ f ‖ ) = λ ‖ f ‖ 2 {\displaystyle {\begin{aligned}E\left((x_{1},y_{1},f(x_{1})),\ldots ,(x_{n},y_{n},f(x_{n}))\right)&={\frac {1}{n}}\sum _{i=1}^{n}(f(x_{i})-y_{i})^{2},\\g(\lVert f\rVert )&=\lambda \lVert f\rVert ^{2}\end{aligned}}} for λ > 0 {\displaystyle \lambda >0} . Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function g ( ⋅ ) {\displaystyle g(\cdot )} of the Hilbert space norm. It is possible to generalize further by augmenting the regularized empirical risk functional through the addition of unpenalized offset terms. For example, Schölkopf, Herbrich, and Smola also consider the minimization f ~ ∗ = argmin ⁡ { E ( ( x 1 , y 1 , f ~ ( x 1 ) ) , … , ( x n , y n , f ~ ( x n ) ) ) + g ( ‖ f ‖ ) ∣ f ~ = f + h ∈ H k ⊕ span ⁡ { ψ p ∣ 1 ≤ p ≤ M } } , ( † ) {\displaystyle {\tilde {f}}^{}=\operatorname {argmin} \left\lbrace E\left((x_{1},y_{1},{\tilde {f}}(x_{1})),\ldots ,(x_{n},y_{n},{\tilde {f}}(x_{n}))\right)+g\left(\lVert f\rVert \right)\mid {\tilde {f}}=f+h\in H_{k}\oplus \operatorname {span} \lbrace \psi _{p}\mid 1\leq p\leq M\rbrace \right\rbrace ,\quad (\dagger )} i.e., we consider functions of the form f ~ = f + h {\displaystyle {\tilde {f}}=f+h} , where f ∈ H k {\displaystyle f\in H_{k}} and h {\displaystyle h} is an unpenalized function lying in the span of a finite set of real-valued functions { ψ p : X → R ∣ 1 ≤ p ≤ M } {\displaystyle \lbrace \psi _{p}\colon {\mathcal {X}}\to \mathbb {R} \mid 1\leq p\leq M\rbrace } . Under the assumption that the n × M {\displaystyle n\times M} matrix ( ψ p ( x i ) ) i p {\displaystyle \left(\psi _{p}(x_{i})\right)_{ip}} has rank M {\displaystyle M} , they show that the minimizer f ~ ∗ {\displaystyle {\tilde {f}}^{}} in ( † ) {\displaystyle (\dagger )} admits a representation of the form f ~ ∗ ( ⋅ ) = ∑ i = 1 n α i k ( ⋅ , x i ) + ∑ p = 1 M β p ψ p ( ⋅ ) {\displaystyle {\tilde {f}}^{}(\cdot )=\sum _{i=1}^{n}\alpha _{i}k(\cdot ,x_{i})+\sum _{p=1}^{M}\beta _{p}\psi _{p}(\cdot )} where α i , β p ∈ R {\displaystyle \alpha _{i},\beta _{p}\in \mathbb {R} } and the β p {\displaystyle \beta _{p}} are all uniquely determined. The conditions under which a representer theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following: Theorem: Let X {\displaystyle {\mathcal {X}}} be a nonempty set, k {\displaystyle k} a positive-definite real-valued kernel on X × X {\displaystyle {\mathcal {X}}\times {\mathcal {X}}} with corresponding reproducing kernel Hilbert space H k {\displaystyle H_{k}} , and let R : H k → R {\displaystyle R\colon H_{k}\to \mathbb {R} } be a differentiable regularization function. Then given a training sample ( x 1 , y 1 ) , … , ( x n , y n ) ∈ X × R {\displaystyle (x_{1},y_{1}),\ldots ,(x_{n},y_{n})\in {\mathcal {X}}\times \mathbb {R} } and an arbitrary error function E : ( X × R 2 ) m → R ∪ { ∞ } {\displaystyle E\colon ({\mathcal {X}}\times \mathbb {R} ^{2})^{m}\to \mathbb {R} \cup \lbrace \infty \rbrace } , a minimizer f ∗ = argmin f ∈ H k { E ( ( x 1 , y 1 , f ( x 1 ) ) , … , ( x n , y n , f ( x n ) ) ) + R ( f ) } ( ‡ ) {\displaystyle f^{}={\underset {f\in H_{k}}{\operatorname {argmin} }}\left\lbrace E\left((x_{1},y_{1},f(x_{1})),\ldots ,(x_{n},y_{n},f(x_{n}))\right)+R(f)\right\rbrace \quad (\ddagger )} of the regularized empirical risk admits a repr

    Read more →
  • Gaussian process emulator

    Gaussian process emulator

    In statistics, Gaussian process emulator is one name for a general type of statistical model that has been used in contexts where the problem is to make maximum use of the outputs of a complicated (often non-random) computer-based simulation model. Each run of the simulation model is computationally expensive and each run is based on many different controlling inputs. The variation of the outputs of the simulation model is expected to vary reasonably smoothly with the inputs, but in an unknown way. The overall analysis involves two models: the simulation model, or "simulator", and the statistical model, or "emulator", which notionally emulates the unknown outputs from the simulator. The Gaussian process emulator model treats the problem from the viewpoint of Bayesian statistics. In this approach, even though the output of the simulation model is fixed for any given set of inputs, the actual outputs are unknown unless the computer model is run and hence can be made the subject of a Bayesian analysis. The main element of the Gaussian process emulator model is that it models the outputs as a Gaussian process on a space that is defined by the model inputs. The model includes a description of the correlation or covariance of the outputs, which enables the model to encompass the idea that differences in the output will be small if there are only small differences in the inputs.

    Read more →
  • Homogeneity blockmodeling

    Homogeneity blockmodeling

    In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling, which is best suited for a preliminary or main approach to valued networks, when a prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogenous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea, "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Similar approach to the homogeneity blockmodeling, dealing with direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

    Read more →
  • Tessellation (computer graphics)

    Tessellation (computer graphics)

    In computer graphics, tessellation is the dividing of datasets of polygons (sometimes called vertex sets) presenting objects in a scene into suitable structures for rendering. Especially for real-time rendering, data is tessellated into triangles, for example in OpenGL 4.0 and Direct3D 11. == In graphics rendering == A key advantage of tessellation for realtime graphics is that it allows detail to be dynamically added and subtracted from a 3D polygon mesh and its silhouette edges based on control parameters (often camera distance). In previously leading realtime techniques such as parallax mapping and bump mapping, surface details could be simulated at the pixel level, but silhouette edge detail was fundamentally limited by the quality of the original dataset. In Direct3D 11 pipeline (a part of DirectX 11), the graphics primitive is the patch. The tessellator generates a triangle-based tessellation of the patch according to tessellation parameters such as the TessFactor, which controls the degree of fineness of the mesh. The tessellation, along with shaders such as a Phong shader, allows for producing smoother surfaces than would be generated by the original mesh. By offloading the tessellation process onto the GPU hardware, smoothing can be performed in real time. Tessellation can also be used for implementing subdivision surfaces, level of detail scaling and fine displacement mapping. OpenGL 4.0 uses a similar pipeline, where tessellation into triangles is controlled by the Tessellation Control Shader and a set of four tessellation parameters. == In computer-aided design == In computer-aided design the constructed design is represented by a boundary representation topological model, where analytical 3D surfaces and curves, limited to faces, edges, and vertices, constitute a continuous boundary of a 3D body. Arbitrary 3D bodies are often too complicated to analyze directly. So they are approximated (tessellated) with a mesh of small, easy-to-analyze pieces of 3D volume—usually either irregular tetrahedra, or irregular hexahedra. The mesh is used for finite element analysis. The mesh of a surface is usually generated per individual faces and edges (approximated to polylines) so that original limit vertices are included into mesh. To ensure that approximation of the original surface suits the needs of further processing, three basic parameters are usually defined for the surface mesh generator: The maximum allowed distance between the planar approximation polygon and the surface (known as "sag"). This parameter ensures that mesh is similar enough to the original analytical surface (or the polyline is similar to the original curve). The maximum allowed size of the approximation polygon (for triangulations it can be maximum allowed length of triangle sides). This parameter ensures enough detail for further analysis. The maximum allowed angle between two adjacent approximation polygons (on the same face). This parameter ensures that even very small humps or hollows that can have significant effect to analysis will not disappear in mesh. An algorithm generating a mesh is typically controlled by the above three and other parameters. Some types of computer analysis of a constructed design require an adaptive mesh refinement, which is a mesh made finer (using stronger parameters) in regions where the analysis needs more detail.

    Read more →
  • Latent Dirichlet allocation

    Latent Dirichlet allocation

    In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that explains how a collection of text documents can be described by a set of unobserved "topics." For example, given a set of news articles, LDA might discover that one topic is characterized by words like "president", "government", and "election", while another is characterized by "team", "game", and "score". It is one of the most common topic models. The LDA model was first presented as a graphical model for population genetics by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. The model was subsequently applied to machine learning by David Blei, Andrew Ng, and Michael I. Jordan in 2003. Although its most frequent application is in modeling text corpora, it has also been used for other problems, such as in clinical psychology, social science, and computational musicology. The core assumption of LDA is that documents are represented as a random mixture of latent topics, and each topic is characterized by a probability distribution over words. The model is a generalization of probabilistic latent semantic analysis (pLSA), differing primarily in that LDA treats the topic mixture as a Dirichlet prior, leading to more reasonable mixtures and less susceptibility to overfitting. Learning the latent topics and their associated probabilities from a corpus is typically done using Bayesian inference, often with methods like Gibbs sampling or variational Bayes. == History == In the context of population genetics, LDA was proposed by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. LDA was applied in machine learning by David Blei, Andrew Ng and Michael I. Jordan in 2003. == Overview == === Population genetics === In population genetics, the model is used to detect the presence of structured genetic variation in a group of individuals. The model assumes that alleles carried by individuals under study have origin in various extant or past populations. The model and various inference algorithms allow scientists to estimate the allele frequencies in those source populations and the origin of alleles carried by individuals under study. The source populations can be interpreted ex-post in terms of various evolutionary scenarios. In association studies, detecting the presence of genetic structure is considered a necessary preliminary step to avoid confounding. === Clinical psychology, mental health, and social science === In clinical psychology research, LDA has been used to identify common themes of self-images experienced by young people in social situations. Other social scientists have used LDA to examine large sets of topical data from discussions on social media (e.g., tweets about prescription drugs). Additionally, supervised Latent Dirichlet Allocation with covariates (SLDAX) has been specifically developed to combine latent topics identified in texts with other manifest variables. This approach allows for the integration of text data as predictors in statistical regression analyses, improving the accuracy of mental health predictions. One of the main advantages of SLDAX over traditional two-stage approaches is its ability to avoid biased estimates and incorrect standard errors, allowing for a more accurate analysis of psychological texts. In the field of social sciences, LDA has proven to be useful for analyzing large datasets, such as social media discussions. For instance, researchers have used LDA to investigate tweets discussing socially relevant topics, like the use of prescription drugs and cultural differences in China. By analyzing these large text corpora, it is possible to uncover patterns and themes that might otherwise go unnoticed, offering valuable insights into public discourse and perception in real time. === Musicology === In the context of computational musicology, LDA has been used to discover tonal structures in different corpora. === Machine learning === One application of LDA in machine learning – specifically, topic discovery, a subproblem in natural language processing – is to discover topics in a collection of documents, and then automatically classify any individual document within the collection in terms of how "relevant" it is to each of the discovered topics. A topic is considered to be a set of terms (i.e., individual words or phrases) that, taken together, suggest a shared theme. For example, in a document collection related to pet animals, the terms dog, spaniel, beagle, golden retriever, puppy, bark, and woof would suggest a DOG_related theme, while the terms cat, siamese, Maine coon, tabby, manx, meow, purr, and kitten would suggest a CAT_related theme. There may be many more topics in the collection – e.g., related to diet, grooming, healthcare, behavior, etc. that we do not discuss for simplicity's sake. (Very common, so called stop words in a language – e.g., "the", "an", "that", "are", "is", etc., – would not discriminate between topics and are usually filtered out by pre-processing before LDA is performed. Pre-processing also converts terms to their "root" lexical forms – e.g., "barks", "barking", and "barked" would be converted to "bark".) If the document collection is sufficiently large, LDA will discover such sets of terms (i.e., topics) based upon the co-occurrence of individual terms, though the task of assigning a meaningful label to an individual topic (i.e., that all the terms are DOG_related) is up to the user, and often requires specialized knowledge (e.g., for collection of technical documents). The LDA approach assumes that: The semantic content of a document is composed by combining one or more terms from one or more topics. Certain terms are ambiguous, belonging to more than one topic, with different probability. (For example, the term training can apply to both dogs and cats, but are more likely to refer to dogs, which are used as work animals or participate in obedience or skill competitions.) However, in a document, the accompanying presence of specific neighboring terms (which belong to only one topic) will disambiguate their usage. Most documents will contain only a relatively small number of topics. In the collection, e.g., individual topics will occur with differing frequencies. That is, they have a probability distribution, so that a given document is more likely to contain some topics than others. Within a topic, certain terms will be used much more frequently than others. In other words, the terms within a topic will also have their own probability distribution. When LDA machine learning is employed, both sets of probabilities are computed during the training phase, using Bayesian methods and an expectation–maximization algorithm. LDA is a generalization of older approach of probabilistic latent semantic analysis (pLSA), The pLSA model is equivalent to LDA under a uniform Dirichlet prior distribution. pLSA relies on only the first two assumptions above and does not care about the remainder. While both methods are similar in principle and require the user to specify the number of topics to be discovered before the start of training (as with k-means clustering) LDA has the following advantages over pLSA: LDA yields better disambiguation of words and a more precise assignment of documents to topics. Computing probabilities allows a "generative" process by which a collection of new "synthetic documents" can be generated that would closely reflect the statistical characteristics of the original collection. Unlike LDA, pLSA is vulnerable to overfitting especially when the size of corpus increases. The LDA algorithm is more readily amenable to scaling up for large data sets using the MapReduce approach on a computing cluster. == Model == With plate notation, which is often used to represent probabilistic graphical models (PGMs), the dependencies among the many variables can be captured concisely. The boxes are "plates" representing replicates, which are repeated entities. The outer plate represents documents, while the inner plate represents the repeated word positions in a given document; each position is associated with a choice of topic and word. The variable names are defined as follows: M denotes the number of documents N is number of words in a given document (document i has N i {\displaystyle N_{i}} words) α is the parameter of the Dirichlet prior on the per-document topic distributions β is the parameter of the Dirichlet prior on the per-topic word distribution θ i {\displaystyle \theta _{i}} is the topic distribution for document i φ k {\displaystyle \varphi _{k}} is the word distribution for topic k z i j {\displaystyle z_{ij}} is the topic for the j-th word in document i w i j {\displaystyle w_{ij}} is the specific word. The fact that W is grayed out means that words w i j {\displaystyle w_{ij}} are the only observable variables, and the other variables are latent variables. As proposed in the original paper, a sparse Dirichlet prior can be used to model the to

    Read more →
  • UIMA

    UIMA

    UIMA ( yoo-EE-mə), short for Unstructured Information Management Architecture, is an OASIS standard for content analytics, originally developed at IBM. It provides a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and integration with search technologies. == Structure == The UIMA architecture can be thought of in four dimensions: It specifies component interfaces in an analytics pipeline. It describes a set of design patterns. It suggests two data representations: an in-memory representation of annotations for high-performance analytics and an XML representation of annotations for integration with remote web services. It suggests development roles allowing tools to be used by users with diverse skills. == Implementations and uses == Apache UIMA, a reference implementation of UIMA, is maintained by the Apache Software Foundation. UIMA is used in a number of software projects: IBM Research's Watson uses UIMA for analyzing unstructured data. The Clinical Text Analysis and Knowledge Extraction System (Apache cTAKES) is a UIMA-based system for information extraction from medical records. DKPro Core is a collection of reusable UIMA components for general-purpose natural language processing.

    Read more →
  • Medoid

    Medoid

    Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression (where while the data is sparse the medoid need not be). These are also of interest while wanting to find a representative using some distance other than squared euclidean distance (for instance in movie-ratings). For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. This algorithm basically works as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized via an iterative process. Note that a medoid is not equivalent to a median, a geometric median, or centroid. A median is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm (such as the Manhattan distance or Euclidean distance). A geometric median is defined in any dimension, but unlike a medoid, it is not necessarily a point from within the original dataset. == Definition == Let X := { x 1 , x 2 , … , x n } {\textstyle {\mathcal {X}}:=\{x_{1},x_{2},\dots ,x_{n}\}} be a set of n {\textstyle n} points in a space with a distance function d. Medoid is defined as x medoid = arg ⁡ min y ∈ X ∑ i = 1 n d ( y , x i ) . {\displaystyle x_{\text{medoid}}=\arg \min _{y\in {\mathcal {X}}}\sum _{i=1}^{n}d(y,x_{i}).} == Clustering with medoids == Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a metric (as the medoid does not require the triangle inequality). When partitioning the data set into clusters, the medoid of each cluster can be used as a representative of each cluster. Clustering algorithms based on the idea of medoids include: Partitioning Around Medoids (PAM), the standard k-medoids algorithm Hierarchical Clustering Around Medoids (HACAM), which uses medoids in hierarchical clustering == Algorithms to compute the medoid of a set == From the definition above, it is clear that the medoid of a set X {\displaystyle {\mathcal {X}}} can be computed after computing all pairwise distances between points in the ensemble. This would take O ( n 2 ) {\textstyle O(n^{2})} distance evaluations (with n = | X | {\displaystyle n=|{\mathcal {X}}|} ). In the worst case, one can not compute the medoid with fewer distance evaluations. However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models. If the points lie on the real line, computing the medoid reduces to computing the median which can be done in O ( n ) {\textstyle O(n)} by Quick-select algorithm of Hoare. However, in higher dimensional real spaces, no linear-time algorithm is known. RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of O ( n log ⁡ n ϵ 2 ) {\textstyle O\left({\frac {n\log n}{\epsilon ^{2}}}\right)} distance computations to approximate the medoid within a factor of ( 1 + ϵ Δ ) {\textstyle (1+\epsilon \Delta )} with high probability, where Δ {\textstyle \Delta } is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover Δ {\textstyle \Delta } may not be known apriori. RAND was leveraged by TOPRANK which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points exactly, and picks the minimum of those. TOPRANK needs O ( n 5 3 log 4 3 ⁡ n ) {\textstyle O(n^{\frac {5}{3}}\log ^{\frac {4}{3}}n)} distance computations to find the exact medoid with high probability under a distributional assumption on the average distances. trimed presents an algorithm to find the medoid with O ( n 3 2 2 Θ ( d ) ) {\textstyle O(n^{\frac {3}{2}}2^{\Theta (d)})} distance evaluations under a distributional assumption on the points. The algorithm uses the triangle inequality to cut down the search space. Meddit leverages a connection of the medoid computation with multi-armed bandits and uses an upper-Confidence-bound type of algorithm to get an algorithm which takes O ( n log ⁡ n ) {\textstyle O(n\log n)} distance evaluations under statistical assumptions on the points. Correlated Sequential Halving also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm is able to provably yield drastic improvement (usually around 1-2 orders of magnitude) in both number of distance computations needed and wall clock time. == Implementations == An implementation of RAND, TOPRANK, and trimed can be found here. An implementation of Meddit can be found here and here. An implementation of Correlated Sequential Halving can be found here. == Medoids in text and natural language processing (NLP) == Medoids can be applied to various text and NLP tasks to improve the efficiency and accuracy of analyses. By clustering text data based on similarity, medoids can help identify representative examples within the dataset, leading to better understanding and interpretation of the data. === Text clustering === Text clustering is the process of grouping similar text or documents together based on their content. Medoid-based clustering algorithms can be employed to partition large amounts of text into clusters, with each cluster represented by a medoid document. This technique helps in organizing, summarizing, and retrieving information from large collections of documents, such as in search engines, social media analytics and recommendation systems. === Text summarization === Text summarization aims to produce a concise and coherent summary of a larger text by extracting the most important and relevant information. Medoid-based clustering can be used to identify the most representative sentences in a document or a group of documents, which can then be combined to create a summary. This approach is especially useful for extractive summarization tasks, where the goal is to generate a summary by selecting the most relevant sentences from the original text. === Sentiment analysis === Sentiment analysis involves determining the sentiment or emotion expressed in a piece of text, such as positive, negative, or neutral. Medoid-based clustering can be applied to group text data based on similar sentiment patterns. By analyzing the medoid of each cluster, researchers can gain insights into the predominant sentiment of the cluster, helping in tasks such as opinion mining, customer feedback analysis, and social media monitoring. === Topic modeling === Topic modeling is a technique used to discover abstract topics that occur in a collection of documents. Medoid-based clustering can be applied to group documents with similar themes or topics. By analyzing the medoids of these clusters, researchers can gain an understanding of the underlying topics in the text corpus, facilitating tasks such as document categorization, trend analysis, and content recommendation. === Techniques for measuring text similarity in medoid-based clustering === When applying medoid-based clustering to text data, it is essential to choose an appropriate similarity measure to compare documents effectively. Each technique has its advantages and limitations, and the choice of the similarity measure should be based on the specific requirements and characteristics of the text data being analyzed. The following are common techniques for measuring text similarity in medoid-based clustering: ==== Cosine similarity ==== Cosine similarity is a widely used measure to compare the similarity between two pieces of text. It calculates the cosine of the angle between two document vectors in a high-dimensional space. Cosine similarity ranges between -1 and 1, where a value closer to 1 indicates higher similarity, and a value closer to -1 indicates lower similarity. By visualizing two lines originating from the origin and extending to the respective points of interest, and then measuring the angle between these lines, one can determine the similarity between the associated points. Cosine similarity is less affected by document length, so it may be better at producing medoids that are representative of the content of a cluster instead of the lengt

    Read more →
  • JDoodle

    JDoodle

    JDoodle is a cloud-based online integrated development environment and compiler platform that supports execution of source code in 70+ programming languages including Java, Python, C/C++, PHP, Ruby, Perl, HTML, and more. It provides zero‑setup code for compilation, execution, and sharing via a web browser interface. == Features == Provides real‑time collaboration and code embedding via shareable URLs and APIs Offers an integrated terminal interface supporting database engines such as MySQL and MongoDB. JDroid — AI‑assistant to generate code snippets, optimize code, and assist debugging. == Languages and frameworks supported ==

    Read more →
  • Unique negative dimension

    Unique negative dimension

    Unique negative dimension (UND) is a complexity measure for the model of learning from positive examples. The unique negative dimension of a class C {\displaystyle C} of concepts is the size of the maximum subclass D ⊆ C {\displaystyle D\subseteq C} such that for every concept c ∈ D {\displaystyle c\in D} , we have ∩ ( D ∖ { c } ) ∖ c {\displaystyle \cap (D\setminus \{c\})\setminus c} is nonempty. This concept was originally proposed by M. Gereb-Graus in "Complexity of learning from one-side examples", Technical Report TR-20-89, Harvard University Division of Engineering and Applied Science, 1989.

    Read more →
  • Jpred

    Jpred

    Jpred v.4 is the latest version of the JPred Protein Secondary Structure Prediction Server which provides predictions by the JNet algorithm, one of the most accurate methods for secondary structure prediction, that has existed since 1998 in different versions. In addition to protein secondary structure, JPred also makes predictions of solvent accessibility and coiled-coil regions. The JPred service runs up to 134 000 jobs per month and has carried out over 2 million predictions in total for users in 179 countries. == JPred 2 == The static HTML pages of JPred 2 are still available for reference. == JPred 3 == The JPred v3 followed on from previous versions of JPred developed and maintained by James Cuff and Jonathan Barber (see JPred References). This release added new functionality and fixed many bugs. The highlights are: New, friendlier user interface Retrained and optimised version of Jnet (v2) - mean secondary structure prediction accuracy of >81% Batch submission of jobs Better error checking of input sequences/alignments Predictions now (optionally) returned via e-mail Users may provide their own query names for each submission JPred now makes a prediction even when there are no PSI-BLAST hits to the query PS/PDF output now incorporates all the predictions == JPred 4 == The current version of JPred (v4) has the following improvements and updates incorporated: Retrained on the latest UniRef90 and SCOPe/ASTRAL version of Jnet (v2.3.1) - mean secondary structure prediction accuracy of >82%. Upgraded the Web Server to the latest technologies (Bootstrap framework, JavaScript) and updating the web pages – improving the design and usability through implementing responsive technologies. Added RESTful API and mass-submission and results retrieval scripts - resulting in peak throughput above 20,000 predictions per day. Added prediction jobs monitoring tools. Upgraded the results reporting – both, on the web-site, and through the optional email summary reports: improved batch submission, added results summary preview through Jalview results visualization summary in SVG and adding full multiple sequence alignments into the reports. Improved help-pages, incorporating tool-tips, and adding one-page step-by-step tutorials. Sequence residues are categorised or assigned to one of the secondary structure elements, such as alpha-helix, beta-sheet and coiled-coil. Jnet uses two neural networks for its prediction. The first network is fed with a window of 17 residues over each amino acid in the alignment plus a conservation number. It uses a hidden layer of nine nodes and has three output nodes, one for each secondary structure element. The second network is fed with a window of 19 residues (the result of first network) plus the conservation number. It has a hidden layer with nine nodes and has three output nodes.

    Read more →