IMDb Dataset

The IMDb dataset is a comprehensive collection of information about movies, TV shows, and the entertainment industry. It includes metadata such as titles, genres, release dates, ratings, cast and crew details, and user reviews.

The dataset is distributed as tab-separated values (TSV) files:

name.basics.tsv with information about individuals in the industry
title.akas.tsv with alternative titles for movies and TV shows
title.basics.tsv with fundamental details about titles
title.crew.tsv with lists of directors and writers for each title
title.episode.tsv with details for episodes of TV series
title.principals.tsv with principal cast and crew for each title
title.ratings.tsv with user ratings for titles
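
As a minimal sketch of how one of these files can be read, the snippet below parses tab-separated rows with Python's standard `csv` module and maps the `\N` marker, which the IMDb files use for missing values, to `None`. The `read_tsv` helper and the sample rows are made up for illustration; they are not part of the dataset or the tool.

```python
import csv
import io

# IMDb TSV files mark missing values with a literal backslash followed by N.
NULL_MARKER = r"\N"


def read_tsv(handle):
    # Hypothetical helper: yield one dict per row, turning the null marker into None.
    # QUOTE_NONE keeps quote characters inside titles as plain text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == NULL_MARKER else value) for key, value in row.items()}


# Invented sample in the shape of title.ratings.tsv, not real ratings.
sample = io.StringIO(
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2000\n"
    "tt0000002\t\\N\t\\N\n"
)
rows = list(read_tsv(sample))
```

With a real file, the same helper could be given a handle from `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.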

Initial Dataset Specifications

Entity	Data Link	Mapping
Name.basics	Data Link	Mapping `_: { nconst: 28, primaryName: 29, birthYear: 30, deathYear: 31, knownForTitles: -35 { _index: 36, _value: 37 }, primaryProfession: -32 { _index: 33, _value: 34 } }`
Title.akas	Data Link	Mapping `_: { ordering: 4, title: 5, region: 6, language: 7, attributes: 11, isOriginalTitle: 12, tconst: 54.13, types: -8 { _index: 9, _value: 10 } }`
Title.basics	Data Link	Mapping `_: { tconst: 13, titleType: 14, primaryTitle: 15, originalTitle: 16, isAdult: 17, startYear: 18, endYear: 19, runtimeMinutes: 20, genres: -21 { _index: 22, _value: 23 } }`
Title.crew	Data Link	Mapping `_: { tconst: 58.13, Array: -50 { _index: 51, name.basics.tsv: 52 }, Array: -47 { _index: 48, name.basics.tsv: 49 } }`
Title.episode	Data Link	Mapping `_: { parentTconst: 25, seasonNumber: 26, episodeNumber: 27, tconst: 57.13 }`
Title.principals	Data Link	Mapping `_: { ordering: 39, category: 41, job: 42, tconst: 55.13, nconst: 56.28, characters: -43 { _index: 44, _value: 45 } }`
Title.ratings	Data Link	Mapping `_: { averageRating: 1, numVotes: 2, tconst: 53.13 }`

Generated Dataset Specifications

Case A: Transforming Title Data into Embedded JSON

The title.basics table, which contains information about movies, TV shows, and other titles, is enriched by embedding additional data from the title.akas table. The title.akas table provides alternative titles for the same content, often specific to regions or languages. By embedding this data directly into each title’s JSON representation, the result is a unified, self-contained dataset where each title includes its own set of alternative titles.

Transforming the original TSV files into JSON with embedded data offers several benefits. First, embedding the alternative titles directly into each title from title.basics makes the resulting JSON structure self-contained: all information about a title is available in one place. Second, JSON is human-readable and easier to work with than TSV, especially for hierarchical or nested data. Finally, the original TSV files keep title.basics and title.akas in separate datasets, so every analysis that needs both must first join them. JSON embedding removes that repeated step by consolidating the data at the storage stage.

Entity	Output Mapping
Title.basics	Output Mapping `_: { tconst: 13, titleType: 14, primaryTitle: 15, originalTitle: 16, isAdult: 17, startYear: 18, endYear: 19, runtimeMinutes: 20, genres: -21.23, title.akas.tsv: -54 { title: 5, language: 7 } }`
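
The embedding described in Case A can be sketched in plain Python. This is not the tool's implementation, only an illustration under stated assumptions: the `embed_akas` function is hypothetical, the embedded field is named `title.akas.tsv` after the key in the output mapping, only `title` and `language` are kept for each alternative title, and `genres` is split on commas. The raw IMDb file names the join column `titleId`, whereas the mapping above labels it `tconst`, so the column name is a parameter. The sample rows are invented.

```python
import csv
import io
import json
from collections import defaultdict

NULL_MARKER = r"\N"  # IMDb's marker for a missing value


def clean(value):
    return None if value == NULL_MARKER else value


def embed_akas(basics_handle, akas_handle, akas_key="titleId"):
    # Group the alternative titles by the title they belong to.
    # This holds all of title.akas in memory, which is fine for a small sample only.
    akas_by_title = defaultdict(list)
    for row in csv.DictReader(akas_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        akas_by_title[row[akas_key]].append(
            {"title": clean(row["title"]), "language": clean(row["language"])}
        )

    # Emit one self-contained document per title.
    for row in csv.DictReader(basics_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        doc = {key: clean(value) for key, value in row.items()}
        doc["genres"] = doc["genres"].split(",") if doc["genres"] else []
        doc["title.akas.tsv"] = akas_by_title.get(row["tconst"], [])
        yield doc


# Invented sample rows, for illustration only.
BASICS = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tExample Title\tExample Title\t0\t1900\t\\N\t5\tDocumentary,Short\n"
)
AKAS = (
    "titleId\tordering\ttitle\tregion\tlanguage\ttypes\tattributes\tisOriginalTitle\n"
    "tt0000001\t1\tBeispieltitel\tDE\tde\t\\N\t\\N\t0\n"
    "tt0000001\t2\tExample Title\tUS\t\\N\t\\N\t\\N\t1\n"
)

docs = list(embed_akas(io.StringIO(BASICS), io.StringIO(AKAS)))
print(json.dumps(docs[0], indent=2))
```

Grouping title.akas first and then streaming title.basics keeps the join to a single pass over each file, at the cost of holding the grouped alternative titles in memory.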

Generated File:

File Link

Case B: Splitting Title Crew Data into Separate CSV Files for Directors and Writers

The title.crew data, which originally combined both directors and writers into a single file, is split into two separate CSV files. This separation ensures that each CSV file serves a more focused and specific purpose.

Having separate files allows users to work directly with the subset of data they need, reducing the preprocessing steps required to extract relevant information.

Currently, the tool can perform this splitting operation, but the process is somewhat inefficient. The tool duplicates the entire title.crew data and then removes directors from one file and writers from the other. While this method achieves the desired result, it is not optimal in terms of processing efficiency. To address this limitation, we plan to implement conditional mapping, a feature that will allow selective processing of data based on specific conditions. This feature is part of our planned future work and will make operations like this significantly faster and more efficient.

Entity	Output Mapping
Title.crew.writers	Output Mapping `_: { writers: -50.52, tconst: 57.13 }`
Title.crew.directors	Output Mapping `_: { directors: -47.49, tconst: 57.13 }`
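
For comparison with the duplicate-then-remove approach described above, the split can also be done in a single pass over title.crew. The sketch below is not how the tool works and does not implement conditional mapping; it is a hand-written illustration. The `split_crew` function, the column order of the output files, and the choice to keep one row per title with the identifiers as one comma-separated field are all assumptions, and the sample rows are invented.

```python
import csv
import io

NULL_MARKER = r"\N"  # IMDb's marker for a missing value


def split_crew(crew_handle, writers_out, directors_out):
    # Read title.crew once and write each row's two lists to separate CSV files.
    reader = csv.DictReader(crew_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    writers_csv = csv.writer(writers_out)
    directors_csv = csv.writer(directors_out)
    writers_csv.writerow(["tconst", "writers"])
    directors_csv.writerow(["tconst", "directors"])

    for row in reader:
        # A missing list becomes an empty field rather than the raw marker.
        writers = "" if row["writers"] == NULL_MARKER else row["writers"]
        directors = "" if row["directors"] == NULL_MARKER else row["directors"]
        writers_csv.writerow([row["tconst"], writers])
        directors_csv.writerow([row["tconst"], directors])


# Invented sample rows, for illustration only.
CREW = (
    "tconst\tdirectors\twriters\n"
    "tt0000001\tnm0000001\tnm0000002,nm0000003\n"
    "tt0000002\tnm0000004\t\\N\n"
)

writers_out = io.StringIO()
directors_out = io.StringIO()
split_crew(io.StringIO(CREW), writers_out, directors_out)
```

With real files, the two `StringIO` objects would be replaced by handles opened with `open(path, "w", encoding="utf-8", newline="")`.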

Generated Files:

Writers File Link

Directors File Link