Join us on YugabyteDB Community Slack
Star us on
Get Started
Slack
GitHub
Get Started
v2.7 (latest) v2.4 (stable) v2.2 (earlier version) v2.1 (earlier version) v2.0 (earlier version) v1.3 (earlier version)
  • YUGABYTEDB CORE
    • Quick start
      • 1. Install YugabyteDB
      • 2. Create a local cluster
      • 3. Explore distributed SQL
      • 4. Build an application
        • Java
        • NodeJS
        • Go
        • Python
        • Ruby
        • C#
        • PHP
        • C++
        • C
        • Scala
    • Explore features
      • YSQL vs PostgreSQL
        • Schemas and Tables
        • Data Types
        • Data Manipulation
        • Queries and Joins
        • Expressions and Operators
        • Cursors
        • Stored Procedures
        • Triggers
        • Table Partitioning
        • Tablespaces
        • Views
      • Fault tolerance
      • Horizontal Scalability
        • Scaling Transactions
        • Sharding Data
      • Transactions
        • Distributed Transactions
        • Isolation Levels
        • Explicit Locking
      • JSON Support
      • Multi-Region Deployments
        • Sync replication (3+ regions)
        • Async Replication (2+ regions)
        • Row-Level Geo-Partitioning
      • Query Tuning
        • Analyzing Queries with EXPLAIN
        • Viewing live queries with pg_stat_activity
        • Optimizing YSQL queries using pg_hint_plan
      • Follower reads
      • Colocated tables
      • Change data capture (CDC)
      • Extensions
      • Observability
        • Prometheus Integration
      • Security
    • Develop
      • Learn app development
        • 1. SQL vs NoSQL
        • 2. Data modeling
        • 3. Data types
        • 4. ACID transactions
        • 5. Aggregations
        • 6. Batch operations
        • 7. Date and time
        • 8. Strings and text
        • 9. TTL for data expiration
      • Ecosystem integrations
        • Apache Kafka
        • Spring Framework
        • Apache Spark
        • JanusGraph
        • KairosDB
        • Hasura
        • Presto
        • Metabase
      • Build GraphQL apps
        • Hasura
        • Prisma
      • Real-world examples
        • E-Commerce app
        • IoT fleet management
        • Retail Analytics
      • Explore sample apps
      • Best practices
    • Migrate
      • Migration process overview
      • Migrate from PostgreSQL
        • Convert a PostgreSQL schema
        • Migrate a PostgreSQL application
        • Export PostgreSQL data
        • Prepare a cluster
        • Import PostgreSQL data
        • Verify Migration
    • Deploy
      • Deployment checklist
      • Manual deployment
        • 1. System configuration
        • 2. Install software
        • 3. Start YB-Masters
        • 4. Start YB-TServers
        • 5. Verify deployment
      • Kubernetes
        • Single-zone
          • Open Source
          • Amazon EKS
          • Google Kubernetes Engine
          • Azure Kubernetes Service
        • Multi-zone
          • Amazon EKS
          • Google Kubernetes Engine
        • Multi-cluster
          • Google Kubernetes Engine
        • Best practices
        • Connect Clients
      • Docker
      • Public clouds
        • Amazon Web Services
        • Google Cloud Platform
        • Microsoft Azure
      • Multi-DC deployments
        • Three+ data center (3DC)
        • Two data center (2DC)
        • Read replica clusters
      • Change data capture (CDC)
        • CDC to Kafka
    • Benchmark
      • TPC-C
      • sysbench
      • YCSB
      • Key-value workload
      • Large datasets
      • Scalability
        • Scaling queries
      • Resilience
        • Jepsen testing
      • Performance Troubleshooting
    • Secure
      • Security checklist
      • Enable Authentication
        • Enable User Authentication
        • Configure ysql_hba_conf_csv
      • Authentication Methods
        • Password Authentication
        • LDAP Authentication
        • Host-Based Authentication
        • Trust Authentication
      • Role-Based Access Control
        • Overview
        • Manage Users and Roles
        • Grant Privileges
        • Row-Level Security (RLS)
        • Column-Level Security
      • Encryption in Transit
        • Create server certificates
        • Enable server-to-server encryption
        • Enable client-to-server encryption
        • Connect to Clusters
      • Encryption at rest
      • Column-Level Encryption
      • Audit Logging
        • Configure Audit Logging
        • Session-Level Audit Logging
        • Object-Level Audit Logging
      • Vulnerability disclosure policy
    • Manage
      • Back up and restore
        • Back up data
        • Restore data
        • Point-in-time restore
        • Snapshot and restore data
      • Migrate data
        • Bulk import
        • Bulk export
      • Change cluster configuration
      • Diagnostics reporting
      • Upgrade a deployment
      • Grow cluster
    • Troubleshoot
      • Troubleshooting
      • Common error messages
      • Cluster level issues
        • YCQL connection issues
        • YEDIS connection Issues
        • Recover tserver/master
        • Replace a failed YB-TServer
        • Replace a failed YB-Master
        • Manual remote bootstrap when a majority of peers fail
      • Node level issues
        • Check servers
        • Inspect logs
        • System statistics
        • Disk failure
    • Contribute
      • Core database
        • Contribution checklist
        • Build the source
        • Configure a CLion project
        • Run the tests
  • YUGABYTE PLATFORM
    • Yugabyte Platform
      • Overview
        • Install
        • Configure
      • Install Yugabyte Platform
        • Prerequisites
        • Prepare the environment
        • Install software
        • Prepare nodes (on-prem)
        • Uninstall software
      • Configure Yugabyte Platform
        • Create admin user
        • Configure the cloud provider
        • Configure the backup target
        • Configure alerts and health checking
        • Create and edit instance tags
      • Create deployments
        • Multi-zone universe
        • Multi-region universe
        • Read replica cluster
      • Manage deployments
        • Start and stop processes
        • Add a node
        • Enable high availability
        • Remove a node
        • Edit a universe
        • Edit configuration flags
        • Upgrade the YugabyteDB software
        • Delete a universe
        • Migrate to Helm 3
      • Back up and restore universes
        • Configure backup storage
        • Back up universe data
        • Restore universe data
        • Schedule data backups
      • Security
        • Security checklist
        • Customize ports
        • Authorization platform
        • Create a KMS configuration
        • Enable encryption at rest
        • Enable encryption in transit (TLS)
        • Network security
      • Alerts and monitoring
        • Live Queries dashboard
        • Slow Queries dashboard
      • Troubleshoot
        • Install and upgrade issues
        • Universe issues
      • Administer Yugabyte Platform
        • Back Up and Restore Yugabyte Platform
  • YUGABYTE CLOUD
    • Yugabyte Cloud
      • Free tier
      • Create clusters
      • Monitor clusters
      • Create databases
      • Manage database access
      • Connect to clusters
  • REFERENCE
    • Reference
    • Architecture
      • Design goals
      • Key concepts
        • Universe
        • YB-TServer Service
        • YB-Master Service
      • Core functions
        • Universe creation
        • Table creation
        • Write IO path
        • Read IO path
        • High availability
      • Layered architecture
      • Query layer
        • Overview
      • DocDB transactions layer
        • Transactions overview
        • Transaction isolation levels
        • Explicit locking
        • Single-row transactions
        • Distributed transactions
        • Transactional IO path
      • DocDB sharding layer
        • Hash & range sharding
        • Tablet splitting
        • Colocated tables
      • DocDB replication layer
        • Replication
        • xCluster replication
        • Read replicas
        • Change data capture (CDC)
      • DocDB storage layer
        • Persistence
        • Performance
    • APIs
      • YSQL
        • The SQL language
          • SQL statements
            • ABORT
            • ALTER DATABASE
            • ALTER DEFAULT PRIVILEGES
            • ALTER DOMAIN
            • ALTER GROUP
            • ALTER POLICY
            • ALTER ROLE
            • ALTER SEQUENCE
            • ALTER TABLE
            • ALTER USER
            • BEGIN
            • CALL
            • COMMENT
            • COMMIT
            • COPY
            • CREATE AGGREGATE
            • CREATE CAST
            • CREATE DATABASE
            • CREATE DOMAIN
            • CREATE EXTENSION
            • CREATE FUNCTION
            • CREATE GROUP
            • CREATE INDEX
            • CREATE OPERATOR
            • CREATE OPERATOR CLASS
            • CREATE POLICY
            • CREATE PROCEDURE
            • CREATE ROLE
            • CREATE RULE
            • CREATE SCHEMA
            • CREATE SEQUENCE
            • CREATE TABLE
            • CREATE TABLE AS
            • CREATE TRIGGER
            • CREATE TYPE
            • CREATE USER
            • CREATE VIEW
            • DEALLOCATE
            • DELETE
            • DO
            • DROP AGGREGATE
            • DROP CAST
            • DROP DATABASE
            • DROP DOMAIN
            • DROP EXTENSION
            • DROP FUNCTION
            • DROP GROUP
            • DROP OPERATOR
            • DROP OPERATOR CLASS
            • DROP OWNED
            • DROP POLICY
            • DROP PROCEDURE
            • DROP ROLE
            • DROP RULE
            • DROP SEQUENCE
            • DROP TABLE
            • DROP TRIGGER
            • DROP TYPE
            • DROP USER
            • END
            • EXECUTE
            • EXPLAIN
            • GRANT
            • INSERT
            • LOCK
            • PREPARE
            • REASSIGN OWNED
            • RESET
            • REVOKE
            • ROLLBACK
            • SELECT
            • SET
            • SET CONSTRAINTS
            • SET ROLE
            • SET SESSION AUTHORIZATION
            • SET TRANSACTION
            • SHOW
            • SHOW TRANSACTION
            • TRUNCATE
            • UPDATE
            • VALUES
          • WITH clause
            • WITH clause—SQL syntax and semantics
            • recursive CTE
            • case study—traversing an employee hierarchy
            • traversing general graphs
              • graph representation
              • common code
              • undirected cyclic graph
              • directed cyclic graph
              • directed acyclic graph
              • rooted tree
              • Unique containing paths
              • Stress testing find_paths()
            • case study—Bacon Numbers from IMDb
              • Bacon numbers for synthetic data
              • Bacon numbers for IMDb data
        • Data types
          • Array
            • array[] constructor
            • Literals
              • Text typecasting and literals
              • Array of primitive values
              • Row
              • Array of rows
            • FOREACH loop (PL/pgSQL)
            • array of DOMAINs
            • Functions and operators
              • ANY and ALL
              • Array comparison
              • Array slice operator
              • Array concatenation
              • Array properties
              • array_agg(), unnest(), generate_subscripts()
              • array_fill()
              • array_position(), array_positions()
              • array_remove()
              • array_replace() / set value
              • array_to_string()
              • string_to_array()
          • Binary
          • Boolean
          • Character
          • Date and time
          • JSON
            • JSON literals
            • Primitive and compound data types
            • Code example conventions
            • Indexes and check constraints
            • Functions & operators
              • ::jsonb, ::json, ::text (typecast)
              • ->, ->>, #>, #>> (JSON subvalues)
              • - and #- (remove)
              • || (concatenation)
              • = (equality)
              • @> and <@ (containment)
              • ? and ?| and ?& (key or value existence)
              • array_to_json()
              • jsonb_agg()
              • jsonb_array_elements()
              • jsonb_array_elements_text()
              • jsonb_array_length()
              • jsonb_build_object()
              • jsonb_build_array()
              • jsonb_each()
              • jsonb_each_text()
              • jsonb_extract_path()
              • jsonb_extract_path_text() and json_extract_path_text()
              • jsonb_object()
              • jsonb_object_agg()
              • jsonb_object_keys()
              • jsonb_populate_record()
              • jsonb_populate_recordset()
              • jsonb_pretty()
              • jsonb_set() and jsonb_insert()
              • jsonb_strip_nulls()
              • jsonb_to_record()
              • jsonb_to_recordset()
              • jsonb_typeof()
              • row_to_json()
              • to_jsonb()
          • Money
          • Numeric
          • Range
          • Serial
          • UUID
        • Functions and operators
          • Aggregate functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • grouping sets, rollup, cube
            • Per function signature and purpose
              • avg(), count(), max(), min(), sum()
              • array_agg(), string_agg(), jsonb_agg(), jsonb_object_agg()
              • bit_and(), bit_or(), bool_and(), bool_or()
              • variance(), var_pop(), var_samp(), stddev(), stddev_pop(), stddev_samp()
              • linear regression
                • covar_pop(), covar_samp(), corr()
                • regr_%()
              • mode(), percentile_disc(), percentile_cont()
              • rank(), dense_rank(), percent_rank(), cume_dist()
            • case study—percentile_cont() and the "68–95–99.7" rule
            • case study—linear regression on COVID data
              • Download the COVIDcast data
              • Ingest the COVIDcast data
                • Inspect the COVIDcast data
                • Copy the .csv files to staging tables
                • Check staged data conforms to the rules
                • Join the staged data into a single table
                • SQL scripts
                  • Create cr_staging_tables()
                  • Create cr_copy_from_scripts()
                  • Create assert_assumptions_ok()
                  • Create xform_to_covidcast_fb_survey_results()
                  • ingest-the-data.sql
              • Analyze the COVIDcast data
                • symptoms vs mask-wearing by day
                • Data for scatter-plot for 21-Oct-2020
                • Scatter-plot for 21-Oct-2020
                • SQL scripts
                  • analysis-queries.sql
                  • synthetic-data.sql
          • currval()
          • lastval()
          • nextval()
          • Window functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • Per function signature and purpose
              • row_number(), rank() and dense_rank()
              • percent_rank(), cume_dist() and ntile()
              • first_value(), nth_value(), last_value()
              • lag(), lead()
              • Tables for the code examples
                • table t1
                • table t2
                • table t3
                • table t4
            • case study—analyzing a normal distribution
              • Bucket allocation scheme
              • do_clean_start.sql
              • cr_show_t4.sql
              • cr_dp_views.sql
              • cr_int_views.sql
              • cr_pr_cd_equality_report.sql
              • cr_bucket_using_width_bucket.sql
              • cr_bucket_dedicated_code.sql
              • do_assert_bucket_ok
              • cr_histogram.sql
              • cr_do_ntile.sql
              • cr_do_percent_rank.sql
              • cr_do_cume_dist.sql
              • do_populate_results.sql
              • do_report_results.sql
              • do_compare_dp_results.sql
              • do_demo.sql
              • Reports
                • Histogram report
                • dp-results
                • compare-dp-results
                • int-results
        • Extensions
        • Keywords
        • Reserved names
      • YCQL
        • ALTER KEYSPACE
        • ALTER ROLE
        • ALTER TABLE
        • CREATE INDEX
        • CREATE KEYSPACE
        • CREATE ROLE
        • CREATE TABLE
        • CREATE TYPE
        • DROP INDEX
        • DROP KEYSPACE
        • DROP ROLE
        • DROP TABLE
        • DROP TYPE
        • GRANT PERMISSION
        • GRANT ROLE
        • REVOKE PERMISSION
        • REVOKE ROLE
        • USE
        • INSERT
        • SELECT
        • EXPLAIN
        • UPDATE
        • DELETE
        • TRANSACTION
        • TRUNCATE
        • Simple expressions
        • Subscripted expressions
        • Function call
        • Operators
        • BLOB
        • BOOLEAN
        • Collection
        • FROZEN
        • INET
        • Integer and counter
        • Non-integer
        • TEXT
        • DATE, TIME, and TIMESTAMP
        • UUID and TIMEUUID
        • JSONB
        • Date and time
        • BATCH
    • CLIs
      • yb-ctl
      • yb-docker-ctl
      • ysqlsh
      • ycqlsh
      • yb-admin
      • yb-ts-cli
      • ysql_dump
      • ysql_dumpall
    • Configuration
      • yb-tserver
      • yb-master
      • yugabyted
      • Default ports
    • Drivers
      • Client drivers for YSQL API
      • YugabyteDB JDBC Driver
      • Client drivers for YCQL
      • Spring Data YugabyteDB
    • Connectors
      • Kafka Connect YugabyteDB
    • Third party tools
      • DBeaver
      • DbSchema
      • pgAdmin
      • SQL Workbench/J
      • TablePlus
      • Visual Studio Code
    • Sample datasets
      • Chinook
      • Northwind
      • PgExercises
      • SportsDB
  • RELEASES
    • Releases
    • Releases overview
    • Release versioning
    • What's new
      • v2.7 (latest)
      • v2.4 (stable)
    • Earlier releases
      • v2.5 series
      • v2.3.3
      • v2.3.2
      • v2.3.1
      • v2.3.0
      • v2.2.0 series
      • v2.1.8
      • v2.1.6
      • v2.1.5
      • v2.1.4
      • v2.1.3
      • v2.1.2
      • v2.1.1
      • v2.1.0
      • v2.0.11
      • v2.0.10
      • v2.0.9
      • v2.0.8
      • v2.0.7
      • v2.0.6
      • v2.0.5
      • v2.0.3
      • v2.0.1
      • v2.0.0
      • v1.3.1
      • v1.3.0
      • v1.2.12
      • v1.2.11
      • v1.2.10
      • v1.2.9
      • v1.2.8
      • v1.2.6
      • v1.2.5
      • v1.2.4
  • FAQ
    • Comparisons
      • Amazon Aurora
      • Google Cloud Spanner
      • CockroachDB
      • TiDB
      • Vitess
      • MongoDB
      • FoundationDB
      • Amazon DynamoDB
      • Azure Cosmos DB
      • Apache Cassandra
      • PostgreSQL
      • Redis in-memory store
      • Apache HBase
    • FAQs
      • General FAQ
      • Operations FAQ
      • API compatibility FAQ
      • Yugabyte Platform FAQ
  • MISC
    • YEDIS
      • Quick start
      • Develop
        • Build an application
        • C#
        • C++
        • Go
        • Java
        • NodeJS
        • Python
      • API reference
        • APPEND
        • AUTH
        • CONFIG
        • CREATEDB
        • DELETEDB
        • LISTDB
        • SELECT
        • DEL
        • ECHO
        • EXISTS
        • EXPIRE
        • EXPIREAT
        • FLUSHALL
        • FLUSHDB
        • GET
        • GETRANGE
        • GETSET
        • HDEL
        • HEXISTS
        • HGET
        • HGETALL
        • HINCRBY
        • HKEYS
        • HLEN
        • HMGET
        • HMSET
        • HSET
        • HSTRLEN
        • HVALS
        • INCR
        • INCRBY
        • KEYS
        • MONITOR
        • PEXPIRE
        • PEXPIREAT
        • PTTL
        • ROLE
        • SADD
        • SCARD
        • RENAME
        • SET
        • SETEX
        • PSETEX
        • SETRANGE
        • SISMEMBER
        • SMEMBERS
        • SREM
        • STRLEN
        • ZRANGE
        • TSADD
        • TSCARD
        • TSGET
        • TSLASTN
        • TSRANGEBYTIME
        • TSREM
        • TSREVRANGEBYTIME
        • TTL
        • ZADD
        • ZCARD
        • ZRANGEBYSCORE
        • ZREM
        • ZREVRANGE
        • ZSCORE
        • PUBSUB
        • PUBLISH
        • SUBSCRIBE
        • UNSUBSCRIBE
        • PSUBSCRIBE
        • PUNSUBSCRIBE
    • Legal
      • Third party software
Case study—using a recursive CTE to compute Bacon Numbers on IMDb data
> APIs > YSQL > The SQL language > WITH clause >

Case study—using a recursive CTE to compute Bacon Numbers for actors listed in the IMDb

The Bacon Numbers problem, sometimes referred to as "The Six_Degrees of Kevin Bacon" (see this Wikipedia article), is a specific formulation of the general problem of tracing paths in an undirected cyclic graph. It is a well-known set-piece exercise in graph analysis and is a popular assignment task in computer science courses. Most frequently, solutions are implemented in an "if-then-else" language like Java. Interestingly, solutions can be implemented in SQL and, as this section will show, the amount of SQL needed is remarkably small.

Representing actors and movies data

The Bacon Numbers problem is conventionally formulated in the context of the data represented in the IMDb—an acronym for Internet Movie Database. See this Wikipedia article. The data are freely available from IMDb but it's better, for the purposes of this section's pedagogy, to use sufficient subsets of the total IMDb content. These subsets restrict the population to the movies in which Kevin Bacon has acted and project the facts about the actors and movies to just their names. This entity relationship diagram (a.k.a. ERD) depicts the sufficient subset of the IMDb:

imdb-erd

  • each actor must act in at least one movie
  • each movie's cast must list at least one actor

The actors are the nodes in an undirected cyclic graph. There is an edge between two actors when they both have acted together in at least one movie.

The ERD implies the conventional three-table representation with an "actors table, a "movies_table", and a "cast_members" intersection table. Create them with this script:

cr-actors-movies-cast-members-tables.sql
drop table if exists actors cascade;
drop table if exists movies cascade;
drop table if exists cast_members cascade;

create table actors(
  actor            text primary key);

create table movies(movie text primary key);

create table cast_members(
  actor text not null,
  movie text not null,

  constraint cast_members_pk primary key(actor, movie),

  constraint cast_members_fk1 foreign key(actor)
    references actors(actor)
    match full
    on delete cascade
    on update restrict,

  constraint cast_members_fk2 foreign key(movie)
    references movies(movie)
    match full
    on delete cascade
    on update restrict
  );

Of course, the IMDb has facts like date of birth, nationality, and so on for the actors and like release date, language and so on for the movies. The information would doubtless allow the "cast_members" table to have columns like "character_name". The data that this case study uses happen to include the movie release date, in parentheses, after the movie name in a single text field. The pedagogy is sufficiently served without parsing out these two facts into separate columns in the "movies" table.

Notice that the notion of a graph is so far only implied. A derived "edges" table makes the graph explicit. An edge exists between a pair of actors if they are both on the cast list of the same one or more movies. The SQL needed to populate the "edges" table from the "cast_members" table is straightforward.

When the paths have been found, it's useful to be able to annotate each edge with the list of movies that are responsible for its existence. The annotation code could, of course, derive this information dynamically. But it simplifies the overall coding scheme if a denormalization is adopted to annotate the paths at the time that they are discovered. Another departure from strict purity simplifies the overall coding scheme further. If the row for the edge between a particular pair of actors records the list of movies that brought it (rather than recording many edges, each with a single-valued "movie" attribute), then the path-tracing code that the section Using a recursive CTE to traverse graphs of all kinds presented can be used "as is". To this end, the columns that represent the actor pair in the "edges" table are called "node_1" and "node_2" rather than the more natural "actor_1" and "actor_2".

Note: The previous paragraph was stated as something of a sketch. In fact, each edge between a pair of actors is recorded twice—once in each direction, as is described in the section Graph traversal using the denormalized "edges" table design. Each of the edges in such a pair is annotated with the same list of movies.

This code creates the "edges" table and the procedure that populates it.

cr-actors-movies-edges-table-and-proc.sql
drop table if exists edges cascade;

create table edges(
  node_1 text,
  node_2 text,
  movies text[],
  constraint edges_pk primary key(node_1, node_2),
  constraint edges_fk_1 foreign key(node_1) references actors(actor),
  constraint edges_fk_2 foreign key(node_2) references actors(actor));

drop procedure if exists insert_edges() cascade;

create or replace procedure insert_edges()
  language plpgsql
as $body$
begin
  delete from edges;

  with
    v1(node_1, movie) as (
      select actor, movie from cast_members),

    v2(node_2, movie) as (
      select actor, movie from cast_members)

  insert into edges(node_1, node_2, movies)
  select node_1, node_2, array_agg(movie order by movie)
  from v1 inner join v2 using (movie)
  where node_1 < node_2
  group by node_1, node_2;

  insert into edges(node_1, node_2, movies)
  select node_2 as node_1, node_1 as node_2, movies
  from edges;
end;
$body$;

Notice the second INSERT statement that re-inserts all the discovered directed edges in the reverse direction. The value of this denormalization is explained in the section Finding the paths in a general undirected cyclic graph.

Create a stored procedure to decorate path edges with the list of movies that brought each edge

The stored procedure (actually a table function) will annotate each successive edge along each path in the specified table with the list of movies that brought that edge.

  • When there are relatively few paths in all, as there are with the synthetic data that the section Computing Bacon Numbers for a small set of synthetic actors and movies data uses, it's convenient simply to show all the decorated paths.

  • However, with a data set as big as the IMDb (even the imdb.small.txt subset has 160 shortest paths), it's useful to be able to name a candidate actor and to annotate just the shortest path (more carefully stated, one of the shortest paths) from Kevin Bacon to the candidate. The site The Oracle of Bacon exposes this functionality.

The first formal parameter of the function "decorated_paths_report()" is mandatory and specifies the table in which the paths are represented. The second optional formal parameter, "terminal", lets you specify the last node along a path. If you omit it, the meaning is "report all the paths"; and it you supply it, the meaning is "report the path to the specified actor".

Dynamic SQL is therefore needed for two reasons, each of which alone is a sufficient reason:

  • The table name isn't known until run-time.

  • There may, or may not, be a WHERE clause.

cr-decorated-paths-report.sql
drop function if exists decorated_paths_report(text, text) cascade;

-- This procedure is more elaborate than you'd expect because of GitHub Issue 3286.
-- It says this in the report:
--
--   Commit 9d66392 added support for cursor. Our next releases will have this work.
--   However, there are a few pending issues.
--
-- Meanwhile, this code works around the issue by using a single-row SELECT... INTO.
-- This is made possible by using array_agg(). But you cannot aggregate arrays of
-- different cardinalities. So a second-level workaround is used. Each array in
-- the result set is cast to "text" for aggregation and then cast back to the array
-- that it represents in the body of the FOREACH loop that steps through the text
-- values that have been aggregated.
--
-- When a "stable" release supports the use of a cursor variable, this implementation
-- will be replaced by a more straightforward version.

create function decorated_paths_report(tab in text, terminal in text default null)
  returns table(t text)
  language plpgsql
as $body$
<<b>>declare
  indent constant int  := 3;
  q      constant text := '''';

  stmt_start constant text := 'select array_agg((path::text) '||
                              'order by cardinality(path), terminal(path), path) '||
                              'from ?';

  where_ constant text := ' where terminal(path) = $1';

  all_terminals_stmt  constant text := replace(stmt_start, '?', tab);
  one_terminal_stmt   constant text := replace(stmt_start, '?', tab)||where_;

  paths text[] not null := '{}';
  p     text   not null := '';
  path  text[] not null := '{}';

  distance         int     not null := -1;
  match            text    not null := '';
  prev_match       text    not null := '';
  movies           text[]  not null := '{}';
  movie            text    not null := '';
  pad              int     not null := 0;
begin
  case terminal is null
    when true then
      execute all_terminals_stmt into paths;
    else
      execute one_terminal_stmt into paths using terminal;
  end case;

  foreach p in array paths loop
    path := p::text[];
    distance := cardinality(path) - 1;
    match := terminal(path);

    -- Rule off before each new match.
    case match = prev_match
      when false then
        t := rpad('-', 50, '-');                                 return next;
    end case;
    prev_match := match;

    pad := 0;
    t := rpad(' ', pad)||path[1];                                return next;
    <<step_loop>>
    for j in 2..cardinality(path) loop
      select e.movies
      into strict b.movies
      from edges e
      where e.node_1 = path[j - 1] and e.node_2 = path[j];

      pad := pad + indent;
      <<movies_loop>>
      foreach movie in array movies loop
        t := rpad(' ', pad)||movie::text;                        return next;
      end loop movies_loop;

      pad := pad + indent;
      t := rpad(' ', pad)||path[j];                              return next;
    end loop step_loop;
  end loop;
  t := rpad('-', 50, '-');                                       return next;
end b;
$body$;

Computing Bacon Numbers for synthetic data and the real IMDb data

The section Computing Bacon Numbers for a small set of synthetic actors and movies data demonstrates the approach using a small data set.

The section Computing Bacon Numbers for real IMDb data shows how to ingest the raw imdb.small.txt file into the same representation that was used for the synthetic data. (The subsection Download and ingest some IMDb data explains how to download the IMDb subset that this case study uses.)

While a straightforward use of a recursive CTE can be used to produce the solution for the small synthetic data set quickly, it fails to complete before crashing (see the section Stress testing different find_paths() implementations on maximally connected graphs) when it's applied to the ingested imdb.small.txt data. The approach described in the How to implement early path pruning section comes to the rescue.

Ask our community
  • Slack
  • Github
  • Forum
  • StackOverflow
Yugabyte
Contact us

Copyright © 2017-2021 Yugabyte, Inc. All rights reserved.