Final: Better Together: Text + Context

IT

JSALT 2023

It is standard practice to represent documents as embeddings. We will do this in multiple ways: embeddings based on deep nets (BERT) capture text, while embeddings based on node2vec and GNNs (graph neural nets) capture citation graphs. Each of N ≈ 200M documents is encoded as a vector of K ≈ 768 hidden dimensions, and the cosine of two vectors measures the similarity of the two documents. We will evaluate these embeddings on standard benchmarks of downstream tasks and show that combinations of text and citations are better than either by itself.
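
To make the cosine comparison concrete, here is a minimal sketch in Python. The abstract does not say how text and citation embeddings are combined, so the `combine` function below (concatenating unit-normalized vectors) is only one plausible assumption, not the team's method. The toy 3-dimensional vectors are invented for illustration; real embeddings would have K ≈ 768 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def combine(text_vec, graph_vec):
    """Hypothetical combination: concatenate the unit-normalized
    text and citation vectors (an assumption, not the source's method)."""
    return normalize(text_vec) + normalize(graph_vec)


# Toy vectors, invented for illustration only.
text_a, graph_a = [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]
text_b, graph_b = [0.0, 1.0, 0.0], [0.0, 5.0, 0.0]

text_sim = cosine(text_a, text_b)      # the two texts share no direction
graph_sim = cosine(graph_a, graph_b)   # the citation vectors are parallel
combined_sim = cosine(combine(text_a, graph_a), combine(text_b, graph_b))

print(text_sim, graph_sim, combined_sim)
```

Under this particular assumption, the combined cosine is the average of the text cosine and the citation cosine, so two documents that look unrelated by text alone can still score as similar when their citation contexts agree.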

As deliverables, we will make embeddings available to the community so they can be used in a range of applications: ranked retrieval, recommender systems and routing papers to reviewers. Our interdisciplinary team will have expertise in machine learning, artificial intelligence, information retrieval, bibliometrics, NLP and systems.

Standard embeddings are time-invariant: the representation of a document does not change after it is published. But citation graphs evolve over time. The representation of a document should combine time-invariant contributions from the authors with constantly evolving responses from the audience, as on social media.

Tags: deep nets ia information retrieval informatique jsalt linear algebra nlp workshop

Added by: Gregor Dupuy
Additional owner(s):
- Emmanuelle Billard
Updated on: Sept. 26, 2023, 11:18 a.m.
Channel:
- IT
Type: Conference
Main language: English
Audience: Other
Discipline(s):
- Computer science
