Introduction

Humans do great with natural language and struggle with numbers; computers do great with numbers and struggle with natural language. How can we use computers for tasks that require understanding of natural language, then? Given a Neo4j database with movies, how can we query it to know what movies best match your interest, if that is expressed in words? For example, what movies are about an exploration of italian mafias in america? What movies relate to a criminal is changed through love? What movies are similar to one another?

Neo4j’s vector indexes allow you to retrieve nodes and relationships that are similar to other nodes and relationships, or that relate to a given text prompt. Vector indexes don’t work on texts expressed in natural language though: they need the texts to be turned into lists of numbers. These lists are called embeddings, and they attempt to encode the meaning of a text numerically. With a vector index, the database can quickly match graph entities basing on their embeddings.

For example, the picture below shows the title and plot for the movie Despicable Me and its corresponding embedding.

despicable embedding

This tutorial uses the recommendations dataset, which contains movies and people who have directed or acted in those movies.

recommendations model

You will see how to setup the environment, create embeddings on the movie nodes (both with free and open source libraries, and with external AI providers), create a vector index, and query the database for movies given a loose description or a similar movie.

This guide has the following requirements:

  • A running instance of Neo4j >= 5.13 — If you don’t have one, install Neo4j locally or sign up for an Aura cloud instance.

  • Some familiarity with Cypher — If you are new to it, check out Getting started → Cypher®.

  • python and some familiarity with it.

  • (Optional) An API key to one of the external AI providers, if you intend to generate embeddings with them.

You should always use the same model to generate embeddings for a dataset: pick one and use it to generate all embeddings throughout this tutorial.

Glossary

Aura

Aura is Neo4j’s fully managed cloud service. It comes with both free and paid plans.

Cypher

Cypher is Neo4j’s graph query language that lets you retrieve data from the database. It is like SQL, but for graphs.