Francesco Fusco - LLM and ML Systems Researcher

US12411878B2 Determining specificity of text terms in application contexts

Inventors: Francesco Fusco, Diego Matteo Antognini
Assignee: International Business Machines Corp
Status: Grant

A computer-implemented method, a computer program product and a computer system are provided to enrich downstream learning tasks. A processor stores selected text terms from a corpus of text. A processor determines an initial set of specificity scores for the selected text terms to produce a set of training samples, where each of the training samples comprises a selected text term and an initial specificity score for the selected text term. A processor trains a character-based regression model with the set of training samples. A processor retrieves an Automated Term Extraction (ATE) training data set. A processor determines specificity scores for text terms included in the ATE training data set. A processor, responsive to a respective specificity score for a text term in the ATE training data set being below a threshold value, masks the text term from being used in the ATE training data set.
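
The final masking step can be sketched in a few lines. This is only an illustration: the scores below are hard-coded stand-ins for the output of the trained character-based regression model, and the `MASK_TOKEN` placeholder, the function name and the threshold value are assumptions that do not come from the patent.

```python
from typing import Dict, List

MASK_TOKEN = "[MASK]"  # hypothetical placeholder, not taken from the patent


def mask_low_specificity(terms: List[str], scores: Dict[str, float], threshold: float) -> List[str]:
    """Replace every term whose specificity score is below the threshold."""
    return [MASK_TOKEN if scores[term] < threshold else term for term in terms]


# Hard-coded scores standing in for the regression model's predictions.
terms = ["system", "bitmap index", "thing", "subword tokenizer"]
scores = {"system": 0.2, "bitmap index": 0.9, "thing": 0.05, "subword tokenizer": 0.8}

masked = mask_low_specificity(terms, scores, threshold=0.5)
print(masked)
```

Generic terms such as "system" fall below the threshold and are masked, so they no longer act as positive examples in the ATE training data.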

Domain-specificity prediction for natural language processing

Inventors: Diego Matteo Antognini, Francesco Fusco
Assignee: International Business Machines Corp
Status: Application

A method, computer program product and computer system are provided to determine domain-specificity of a text term. A processor receives a plurality of domain-specific text corpora, wherein each of the plurality of domain-specific text corpora comprises a plurality of text documents of a respective domain. A processor trains a set of subword-unit tokenizers with at least two different vocabulary sizes of the respective domain-specific text corpus. A processor receives the text term. A processor determines a domain-specificity fingerprint of the text term, wherein the domain-specificity fingerprint comprises for each subword-unit tokenizer a number of subword-units required to represent the text term. A processor provides the domain-specificity fingerprint for determining the domain-specificity of the text term.
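
A rough sketch of the fingerprint idea follows. It assumes tiny hand-written vocabularies in place of tokenizers actually trained on domain corpora, and it uses greedy longest-match segmentation as a simple stand-in for a real subword tokenizer; the vocabularies and function names are hypothetical.

```python
from typing import List, Set


def count_subwords(term: str, vocab: Set[str]) -> int:
    """Greedy longest-match segmentation; an unknown character counts as one unit."""
    count, i = 0, 0
    while i < len(term):
        # Try the longest candidate first and shrink until a vocabulary hit,
        # falling back to a single character.
        for j in range(len(term), i, -1):
            if term[i:j] in vocab or j == i + 1:
                i = j
                count += 1
                break
    return count


def fingerprint(term: str, vocabs: List[Set[str]]) -> List[int]:
    """One subword count per tokenizer (here: per vocabulary)."""
    return [count_subwords(term, vocab) for vocab in vocabs]


# Hypothetical vocabularies: two sizes for each of two domains.
medical_small = {"card", "io", "logy"}
medical_large = {"cardio", "logy", "cardiology"}
legal_small = {"car", "log"}
legal_large = {"car", "log", "logy"}

fp = fingerprint("cardiology", [medical_small, medical_large, legal_small, legal_large])
print(fp)  # [3, 1, 6, 5]
```

The term needs few subword units under the medical vocabularies and many under the legal ones, which is the kind of signal a downstream classifier could use to judge domain-specificity.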

Self-supervised term encoding with confidence estimation

Inventors: Francesco Fusco, Diego Matteo Antognini
Assignee: International Business Machines Corp
Status: Application

According to one embodiment, a method and computer program product for generating a model including a term encoder is provided. The embodiment may include training the model on a training dataset that associates training terms with first embeddings of the training terms. The training includes generating, with the term encoder, second embeddings from numerical representations of word subunits of the training terms with an objective of minimizing distances between the first embeddings and the second embeddings. The word subunits form part of a predetermined set of word subunits. The training includes predicting confidence scores based on the minimized distances. The embodiment may include deploying the model as part of an executable algorithm to allow a user to infer third embeddings and corresponding confidence scores from any input terms written based on word subunits of the predetermined set.

US12339884B2 Updating window representations of sliding window of text using rolling scheme

Inventors: Francesco Fusco, Diego Matteo Antognini
Assignee: International Business Machines Corp
Status: Grant

An example system includes a processor to compute a token-level fingerprint for each of a number of tokens in a received window of text. The processor can compute a window representation for a window of text based on the token-level fingerprints. The processor can also update the window representation in a rolling scheme when sliding the window of text.
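
The rolling scheme can be illustrated with a toy example. The sketch below assumes the window representation is the element-wise sum of token fingerprints and that a fingerprint is a few bits of a CRC32 hash; both choices, along with `DIM` and the function names, are assumptions made for illustration and are not taken from the patent.

```python
import zlib
from collections import deque
from typing import Deque, Iterable, Iterator, List

DIM = 8  # hypothetical fingerprint width


def token_fingerprint(token: str) -> List[int]:
    """Deterministic toy fingerprint: DIM bits taken from a CRC32 of the token."""
    h = zlib.crc32(token.encode("utf-8"))
    return [(h >> k) & 1 for k in range(DIM)]


def rolling_windows(tokens: Iterable[str], size: int) -> Iterator[List[int]]:
    """Yield the element-wise sum of fingerprints for each full window."""
    window: Deque[List[int]] = deque()
    rep = [0] * DIM
    for tok in tokens:
        fp = token_fingerprint(tok)
        window.append(fp)
        rep = [r + f for r, f in zip(rep, fp)]  # add the incoming token
        if len(window) > size:
            old = window.popleft()
            rep = [r - o for r, o in zip(rep, old)]  # remove the outgoing token
        if len(window) == size:
            yield list(rep)


tokens = "the quick brown fox jumps over the lazy dog".split()
reps = list(rolling_windows(tokens, size=3))
print(len(reps))  # 7
```

Each slide costs one addition and one subtraction per dimension, independent of the window size, instead of recomputing the representation from all tokens in the window.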

US12210827B2 Specificity ranking of text elements and applications thereof

Inventors: Francesco Fusco, Cesar Berrospi Ramis, Peter Willem Jan Staar
Assignee: International Business Machines Corp
Status: Grant

Ranking a plurality of text elements, each comprising at least one word, by specificity. For each text element to be ranked, such a method includes computing an embedding vector that locates a text element in an embedding space, and selecting a set of text fragments from reference text. Each of these text fragments contains the text element to be ranked and further text elements. For each text fragment, the method calculates respective distances in the embedding space between the further text elements. The method further includes calculating a specificity score for the text element to be ranked and storing the specificity score. After ranking the plurality of text elements, a text data structure may be processed using the specificity scores for text elements to extract data having a desired specificity from the data structure.

Bootstrapping of text classifiers

Inventors: Francesco Fusco, Mattia Atzeni, Abderrahim Labbi
Assignee: International Business Machines Corp
Status: Application

Computer-implemented methods and systems are provided for generating training datasets for bootstrapping text classifiers. Such a method includes providing a word embedding matrix. This matrix is generated from a text corpus by encoding words in the text as respective tokens such that selected compound keywords in the text are encoded as single tokens. The method includes receiving, via a user interface, a user-selected set of the keywords for each text class. For each keyword-set, a nearest neighbor search of the embedding space is performed for each keyword in the set to identify neighboring keywords, and a plurality of the neighboring keywords are added to the keyword-set. The method further comprises, for a corpus of documents, string-matching keywords in the keyword-sets to text in each document to identify, based on results of the string-matching, documents associated with each text class. The documents identified for each text class are stored as the training dataset for the classifier.
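
The keyword expansion and string-matching steps can be sketched as follows. The two-dimensional embedding values, the class names and the helper functions are all hypothetical; a real embedding matrix would be learned from a corpus in which compound keywords are encoded as single tokens.

```python
import math
from typing import Dict, List, Set

# Toy embedding matrix (hypothetical values); compound keywords are single tokens.
EMBEDDINGS: Dict[str, List[float]] = {
    "machine_learning": [0.9, 0.1],
    "neural_network": [0.8, 0.2],
    "interest_rate": [0.1, 0.9],
    "central_bank": [0.2, 0.8],
}


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def expand(seed_keywords: Set[str], k: int = 1) -> Set[str]:
    """Add the k nearest neighbors of every seed keyword to the keyword set."""
    expanded = set(seed_keywords)
    for kw in seed_keywords:
        others = [w for w in EMBEDDINGS if w != kw]
        others.sort(key=lambda w: cosine(EMBEDDINGS[kw], EMBEDDINGS[w]), reverse=True)
        expanded.update(others[:k])
    return expanded


def label_documents(docs: List[str], keyword_sets: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Assign a document to every class whose keywords string-match its text."""
    labeled: Dict[str, List[str]] = {cls: [] for cls in keyword_sets}
    for doc in docs:
        text = doc.lower()
        for cls, keywords in keyword_sets.items():
            if any(kw.replace("_", " ") in text for kw in keywords):
                labeled[cls].append(doc)
    return labeled


keyword_sets = {"tech": expand({"machine_learning"}), "finance": expand({"interest_rate"})}
docs = [
    "A neural network was trained overnight.",
    "The central bank raised rates.",
    "Lunch was served at noon.",
]
dataset = label_documents(docs, keyword_sets)
print(dataset)
```

Neither labeled document contains its seed keyword; both are found only through the neighbor added during expansion, which is what lets a handful of user-selected keywords bootstrap a larger training set.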

US11663407B2 Management of text-item recognition systems

Inventors: Francesco Fusco, Abderrahim Labbi, Peter Willem Jan Staar
Assignee: International Business Machines Corp
Status: Grant

A tool for managing text-item recognition systems such as NER (Named Entity Recognition) systems. The tool applies the system to a text corpus containing instances of text items, such as named entities, to be recognized by the system, and selects from the text corpus a set of instances of text items which the system recognized. The tool tokenizes the text corpus such that each instance in the aforementioned set is encoded as a single token and processes the tokenized text via a word embedding scheme to generate a word embedding matrix. The tool, responsive to selecting a seed token corresponding to an instance in the aforementioned set, performs a nearest-neighbor search of the embedding space to identify a set of neighboring tokens for the seed token, and identifies the text corresponding to each neighboring token as a potential instance of a text item to be annotated.

US11361571B1 Term extraction in highly technical domains

Inventors: Francesco Fusco, Peter Willem Jan Staar
Assignee: International Business Machines Corp
Status: Grant

A language model is fine-tuned by extracting terminology terms from a text document. The method comprises identifying a text snippet, identifying candidate multi-word expressions using part of speech tags, and determining a specificity score value for each of the candidate multi-word expressions. Moreover, the method comprises determining a topic similarity score value for each of the candidate multi-word expressions, selecting remaining expressions from the candidate multi-word expressions using a function of a specificity value and a topic similarity value of each of the candidate multi-word expressions, adding a noun comprised in the text snippet to the remaining expressions depending on a correlation function, labeling the remaining multi-word expressions, and fine-tuning an existing pre-trained transformer-based language model using as training data the identified text snippet marked with the labeled remaining expressions.

US11507601B2 Matching a first collection of strings with a second collection of strings

Inventors: Francesco Fusco, Yves G. Ineichen, Michel F. Speiser
Assignee: International Business Machines Corp
Status: Grant

A method for matching first elements with second elements. Each of the first elements and second elements is a character string. The method comprises: calculating a first integer hash value for each of the first elements using a string hash function, wherein the first integer hash value is an output integer calculated from using each of the first elements as an input character string of the function; calculating second integer hash values for each of the second elements using the function; grouping each of the first elements into at least one group of a set of blocking groups using its first integer hash value; grouping each of the second elements into at least one group of the set of blocking groups using its second integer hash value; and matching first elements with second elements within each group of the set of blocking groups using a string comparison function.
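
A minimal sketch of hash-based blocking is shown below. It assumes a deliberately simple string hash (CRC32 of the first three alphanumeric characters, lowercased) and uses `difflib.SequenceMatcher` as the string comparison function; the patent does not prescribe either choice, and `NUM_BLOCKS`, `block_key` and `match` are illustrative names.

```python
import zlib
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

NUM_BLOCKS = 64  # hypothetical number of blocking groups


def block_key(s: str) -> int:
    """Integer hash of a normalized prefix, so near-duplicates tend to share a group."""
    prefix = "".join(ch for ch in s.lower() if ch.isalnum())[:3]
    return zlib.crc32(prefix.encode("utf-8")) % NUM_BLOCKS


def match(first: List[str], second: List[str], min_ratio: float = 0.8) -> List[Tuple[str, str]]:
    """Compare strings only within the same blocking group."""
    blocks: Dict[int, Tuple[List[str], List[str]]] = defaultdict(lambda: ([], []))
    for s in first:
        blocks[block_key(s)][0].append(s)
    for s in second:
        blocks[block_key(s)][1].append(s)

    pairs = []
    for left, right in blocks.values():
        for a in left:
            for b in right:
                if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= min_ratio:
                    pairs.append((a, b))
    return pairs


first = ["International Business Machines", "Red Hat Israel"]
second = ["international business machines corp", "Red Hat Israel Ltd", "Redmond Hat Co"]
pairs = match(first, second)
print(pairs)
```

Blocking avoids comparing every first element with every second element: the expensive comparison runs only inside each group, at the cost of missing pairs whose hashes differ.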

US10164892B2 Overhead management for virtual machines

Inventors: Francesco Fusco, Thomas Graf, Michael Tsirkin
Assignee: Red Hat Israel Ltd
Status: Grant

A method includes loading a guest virtual machine onto a host system, determining, with the host system, an encapsulation method to be used in association with a virtual network associated with the guest virtual machine, determining an overhead value based on the encapsulation method, determining an adjusted maximum transmission unit (MTU) value based on the overhead value, and passing information related to the adjusted MTU value from the host system to the guest virtual machine.
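
The arithmetic behind the adjusted MTU is straightforward. The sketch below assumes an IPv4 underlay without options and uses commonly quoted per-packet overheads; the table and the function name are illustrative, and real deployments may differ (IPv6 underlay, VLAN tags, optional GRE fields).

```python
# Per-packet encapsulation overhead in bytes, assuming an IPv4 underlay without options.
# Illustrative values only.
ENCAP_OVERHEAD = {
    "none": 0,
    "gre": 24,    # 20 outer IPv4 + 4 GRE
    "vxlan": 50,  # 20 outer IPv4 + 8 UDP + 8 VXLAN + 14 inner Ethernet
}


def adjusted_mtu(link_mtu: int, encapsulation: str) -> int:
    """MTU the guest should use so encapsulated packets still fit the physical link."""
    return link_mtu - ENCAP_OVERHEAD[encapsulation]


print(adjusted_mtu(1500, "vxlan"))  # 1450
print(adjusted_mtu(1500, "gre"))    # 1476
```

Passing the adjusted value to the guest lets it size packets correctly up front, rather than relying on fragmentation or path MTU discovery after encapsulation has been added by the host.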

US9940344B2 Fractal approach for probabilistic flow cache maintenance

Inventors: Francesco Fusco, Daniel Borkmann, Thomas Graf
Assignee: Red Hat Israel Ltd
Status: Grant

An apparatus sets a layer counter to point to a first layer of a data structure. The apparatus determines the layer counter to reference an overflowing cell. The apparatus increments the layer counter to point to a second layer of the data structure. The apparatus determines the incremented layer counter to reference a non-overflowing cell. The apparatus increments a value of the non-overflowing cell, wherein the first layer is stored in a first cache and the second layer is stored in a second cache, and wherein the first cache differs from the second cache with respect to one or more of speed or size.
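
The layer walk described above can be sketched as a small data structure. This is an interpretation under stated assumptions: cells are selected by a CRC32 hash of a flow key, each layer has its own cell capacity, and plain Python lists stand in for the faster and slower caches. The class name, widths and limits are hypothetical.

```python
import zlib
from typing import List, Sequence


class LayeredCounters:
    """Counter cells spread over layers; a full cell spills into the next layer."""

    def __init__(self, widths: Sequence[int], limits: Sequence[int]) -> None:
        self.limits = list(limits)  # maximum value a cell can hold, per layer
        self.layers: List[List[int]] = [[0] * w for w in widths]

    def _cell(self, layer: int, key: str) -> int:
        return zlib.crc32(key.encode("utf-8")) % len(self.layers[layer])

    def increment(self, key: str) -> int:
        """Increment the first non-overflowing cell and return its layer index."""
        layer = 0  # the layer counter starts at the first layer
        while layer < len(self.layers):
            cell = self._cell(layer, key)
            if self.layers[layer][cell] < self.limits[layer]:
                self.layers[layer][cell] += 1
                return layer
            layer += 1  # the cell is overflowing: move to the next layer
        raise OverflowError("all layers are full for this key")

    def estimate(self, key: str) -> int:
        """Sum of the key's cells across layers (approximate when keys collide)."""
        return sum(self.layers[i][self._cell(i, key)] for i in range(len(self.layers)))


# A small first layer with narrow cells and a larger second layer with wide cells.
counters = LayeredCounters(widths=[4, 16], limits=[3, 255])
hit_layers = [counters.increment("flow-a") for _ in range(5)]
print(hit_layers)                   # [0, 0, 0, 1, 1]
print(counters.estimate("flow-a"))  # 5
```

Most flows are short, so their counts stay in the small first layer; only heavy flows spill into the larger layer, which is why placing the layers in caches of different speed and size can pay off.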

US10528578B2 Method and device for data mining on compressed data vector

Inventors: Nikolaos Freris, Francesco Fusco, Michail Vlachos
Assignee: International Business Machines Corporation
Status: Grant

A method for data mining on compressed data vectors by a certain metric being expressible as a function of the Euclidean distance is suggested. In a first step, for each compressed data vector, positions and values of such coefficients having the largest energy in the compressed data vector are stored. In a second step, for each compressed data vector, the coefficients not having the largest energy in the compressed data vector are discarded. In a third step, for each compressed data vector, a compression error is determined in dependence on the discarded coefficients in the compressed data vector. In a fourth step, at least one of an upper and a lower bound for the certain metric is retrieved in dependence on the stored positions and the stored values of the coefficients having the largest energy and the determined compression errors.
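
The four steps can be illustrated with simple triangle-inequality bounds. The sketch assumes the coefficients come from an orthonormal transform, so that the Euclidean distance between coefficient vectors equals the distance between the original vectors. The bounds computed here are valid but generic; they are not claimed to be the bounds derived in the patent, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Compressed = Tuple[Dict[int, float], float]  # (kept position -> value, compression error)


def compress(coeffs: List[float], k: int) -> Compressed:
    """Keep the k largest-energy coefficients and record the norm of the discarded ones."""
    order = sorted(range(len(coeffs)), key=lambda i: coeffs[i] ** 2, reverse=True)
    kept = {i: coeffs[i] for i in order[:k]}
    error = math.sqrt(sum(coeffs[i] ** 2 for i in order[k:]))
    return kept, error


def distance_bounds(a: Compressed, b: Compressed) -> Tuple[float, float]:
    """Lower and upper bound on the Euclidean distance via the triangle inequality."""
    kept_a, err_a = a
    kept_b, err_b = b
    positions = set(kept_a) | set(kept_b)
    d = math.sqrt(sum((kept_a.get(i, 0.0) - kept_b.get(i, 0.0)) ** 2 for i in positions))
    return max(0.0, d - err_a - err_b), d + err_a + err_b


x = [4.0, 0.5, -3.0, 0.2]
y = [3.5, -0.4, 0.1, 2.0]
lower, upper = distance_bounds(compress(x, 2), compress(y, 2))
true_distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))
print(lower, true_distance, upper)
```

With such bounds a search can discard candidates whose lower bound already exceeds the best distance found so far, without decompressing them.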

US9286333B2 Stream compression and decompression

Inventors: Harold Douglas Dykeman, Francesco Fusco, Thomas R. Locher
Assignee: International Business Machines Corporation
Status: Grant

A method for compressing a sequence of records, each record comprising a sequence of fields, comprises steps of buffering a record in a line of a matrix, reordering the lines of the matrix according to locality sensitive hash values of the buffered records such that records with similar contents in corresponding fields are placed in proximity, and consolidating fields in columns of the matrix into a block of codes. In this, consolidating yields codes of one of a first type comprising a sequence of individual fields and a second type comprising a sequence of fields with at least one repetition. The second type of code comprises a presence field indicating repeated fields and an iteration field indicating a number of respective repetitions. Decompression of the records from the block codes compressed above is also described.

US8688655B2 Network analysis

Inventors: Francesco Fusco, Andreas Kind, Marc P Stoecklin, Michail Vlachos
Assignee: International Business Machines Corporation
Status: Grant

A method for providing a compressed index for a stream of binary data records comprises steps of indexing a field from each record in a bitmap index, compressing stored bits in each column of the bitmap index by replacing a group of successive bits with a code and outputting the code. There is provided at least one of a first code for replacing a sequence of a first filling, a literal and a second filling, and a second code for replacing a sequence of a first literal, a filling and a second literal. In this context, a filling is a sequence of bits with the same value and a literal is a sequence of bits with different values.

US8782012B2 Network analysis

Inventors: Francesco Fusco, Marc P Stoecklin, Michail Vlachos
Assignee: International Business Machines Corporation
Status: Grant

Methods and a device for providing a compressed index of binary records. A method includes: sorting the records by content of a predetermined field of the record, indexing the field from one of the records in a line of a bitmap index, compressing bits in a column of the bitmap index by replacing a group of successive bits with a code, where the sorting includes the steps of assigning, for each record, a hash bucket of a hash table on a basis of a locality sensitive hash function on the contents of the predetermined field, so that the probability for two of the records to be assigned to the same hash bucket increases with the similarity of the contents of the predetermined field between the records, and where at least one step of the computer-implemented method is executed on a computer device.
