A Simple Recommender System is a recommender system that uses nothing more than a ranking as the basis of its calculation. For example, to recommend the best movies we could rank them by the most votes, the highest rating, the highest box-office sales, or any other criterion. In this walkthrough we will build a movie recommender that combines the average rating and the number of votes into a new metric derived from these existing ones, and then sort that metric from highest to lowest.
Simple Recommender using Weighted Rating
A Simple Recommender offers general recommendations to every user based on movie popularity and, sometimes, genre. The basic idea behind this kind of recommender system is as follows.
- More popular movies are more likely to be enjoyed by the average viewer as well.
- The model does not give personalized recommendations for each type of user.
- Implementing the model is also fairly easy: all we need to do is sort the movies by rating and popularity and show the top titles from that list.
Additionally, we can add a genre filter to get the top movies for that specific genre.
The IMDB Weighted Rating Formula
$\text{Weighted Rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$
where:
- v: the number of votes for the movie
- m: the minimum number of votes required for the movie to be listed in the chart
- R: the average rating of the movie
- C: the mean rating across the entire movie dataset
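To make the formula concrete, here is a minimal worked example in plain Python; the numbers (v = 459, R = 7.1, m = 229, C ≈ 6.83) are only illustrative values taken from later in this walkthrough.
# Illustrative check of the weighted rating formula
v, R = 459, 7.1        # votes and average rating of one movie
m, C = 229.0, 6.8296   # vote threshold and mean rating of the dataset
weighted_rating = (v / (v + m)) * R + (m / (v + m)) * C
print(round(weighted_rating, 2))  # roughly 7.01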
Dataset
The datasets used in this walkthrough are:
title.basics.tsv
which contains general information about the movies.
https://dqlab-dataset.s3-ap-southeast-1.amazonaws.com/title.basics.tsv
title.ratings.tsv
which contains the rating and the number of votes for each movie.
https://dqlab-dataset.s3-ap-southeast-1.amazonaws.com/title.ratings.tsv
Library
The libraries needed for this walkthrough are:
- numpy for array and matrix computation
- pandas for data manipulation and analysis
# Load library
import numpy as np
import pandas as pd
File Loading
Read the files title_basics.tsv and title_ratings.tsv into dataframes.
# Load file into dataframe
movie_df = pd.read_csv('data/title_basics.tsv', sep='\t')
rating_df = pd.read_csv('data/title_ratings.tsv', sep='\t')
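As an optional shortcut (not part of the original flow), pandas can already treat the \N markers as missing values at read time via the na_values argument; movie_df_alt below is just an illustrative name.
# Optional alternative: convert '\N' to NaN while reading the file
movie_df_alt = pd.read_csv('data/title_basics.tsv', sep='\t', na_values='\\N')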
Data Cleaning
The cleaning steps are:
- Preview the raw data
- Inspect the data info
- Check and handle empty data or missing values
- Check and handle incorrectly formatted data
Movies Table
# Print first five rows
movie_df.head()
index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|---|
0 | tt0221078 | short | Circle Dance, Ute Indians | Circle Dance, Ute Indians | 0 | 1898 | \N | \N | Documentary,Short |
1 | tt8862466 | tvEpisode | ¡El #TeamOsos va con todo al "Reality del amor"! | ¡El #TeamOsos va con todo al "Reality del amor"! | 0 | 2018 | \N | \N | Comedy,Drama |
2 | tt7157720 | tvEpisode | Episode #3.41 | Episode #3.41 | 0 | 2016 | \N | 29 | Comedy,Game-Show |
3 | tt2974998 | tvEpisode | Episode dated 16 May 1987 | Episode dated 16 May 1987 | 0 | 1987 | \N | \N | News |
4 | tt2903620 | tvEpisode | Frances Bavier: Aunt Bee Retires | Frances Bavier: Aunt Bee Retires | 0 | 1973 | \N | \N | Documentary |
Notice that some columns contain the value \N, which most likely marks data that could not be read properly (missing values).
# View info data
movie_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9025 entries, 0 to 9024
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 9025 non-null object
1 titleType 9025 non-null object
2 primaryTitle 9011 non-null object
3 originalTitle 9011 non-null object
4 isAdult 9025 non-null int64
5 startYear 9025 non-null object
6 endYear 9025 non-null object
7 runtimeMinutes 9025 non-null object
8 genres 9014 non-null object
dtypes: int64(1), object(8)
memory usage: 634.7+ KB
Next, check the number of empty values or missing values in each column.
# Check missing values
movie_df.isnull().sum()
tconst 0
titleType 0
primaryTitle 14
originalTitle 14
isAdult 0
startYear 0
endYear 0
runtimeMinutes 0
genres 11
dtype: int64
From the check above we know that the primaryTitle, originalTitle, and genres columns contain missing values. To handle this, we will drop the rows with missing data, since there are only a few of them.
# Get rows with missing values
movies_missing = movie_df.loc[(movie_df['primaryTitle'].isnull()) | (movie_df['originalTitle'].isnull()) |
(movie_df['genres'].isnull())]
print('Number of movie rows with missing values:', len(movies_missing))
movies_missing
Number of movie rows with missing values: 25
index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|---|
9000 | tt10790040 | tvEpisode | NaN | NaN | 0 | 2019 | \N | \N | \N |
9001 | tt10891902 | tvEpisode | NaN | NaN | 0 | 2020 | \N | \N | Crime |
9002 | tt11737860 | tvEpisode | NaN | NaN | 0 | 2020 | \N | \N | Comedy,Drama,Romance |
9003 | tt11737862 | tvEpisode | NaN | NaN | 0 | 2020 | \N | \N | Comedy,Drama,Romance |
9004 | tt11737866 | tvEpisode | NaN | NaN | 0 | 2020 | \N | \N | Comedy,Drama,Romance |
9005 | tt11737872 | tvEpisode | NaN | NaN | 0 | 2020 | \N | \N | \N |
9006 | tt11737874 | tvEpisode | NaN | NaN | 0 | 2020 | \N | \N | Comedy,Drama,Romance |
9007 | tt1971246 | tvEpisode | NaN | NaN | 0 | 2011 | \N | \N | Biography |
9008 | tt2067043 | tvEpisode | NaN | NaN | 0 | 1965 | \N | \N | Music |
9009 | tt4404732 | tvEpisode | NaN | NaN | 0 | 2015 | \N | \N | Comedy |
9010 | tt5773048 | tvEpisode | NaN | NaN | 0 | 2015 | \N | \N | Talk-Show |
9011 | tt8473688 | tvEpisode | NaN | NaN | 0 | 1987 | \N | \N | Drama |
9012 | tt8541336 | tvEpisode | NaN | NaN | 0 | 2018 | \N | \N | Reality-TV,Romance |
9013 | tt9824302 | tvEpisode | NaN | NaN | 0 | 2016 | \N | \N | Documentary |
9014 | tt10233364 | tvEpisode | Rolling in the Deep Dish\tRolling in the Deep ... | 0 | 2019 | \N | \N | Reality-TV | NaN |
9015 | tt10925142 | tvEpisode | The IMDb Show on Location: Star Wars Galaxy's ... | 0 | 2019 | \N | \N | Talk-Show | NaN |
9016 | tt10970874 | tvEpisode | Die Bauhaus-Stadt Tel Aviv - Vorbild für die M... | 0 | 2019 | \N | \N | \N | NaN |
9017 | tt11670006 | tvEpisode | ...ein angenehmer Unbequemer...\t...ein angene... | 0 | 1981 | \N | \N | Documentary | NaN |
9018 | tt11868642 | tvEpisode | GGN Heavyweight Championship Lungs With Mike T... | 0 | 2020 | \N | \N | Talk-Show | NaN |
9019 | tt2347742 | tvEpisode | No sufras por la alergia esta primavera\tNo su... | 0 | 2004 | \N | \N | \N | NaN |
9020 | tt3984412 | tvEpisode | I'm Not Going to Come Last, I'm Just Going to ... | 0 | 2014 | \N | \N | Reality-TV | NaN |
9021 | tt8740950 | tvEpisode | Weight Loss Resolution Restart - Ins & Outs of... | 0 | 2015 | \N | \N | Reality-TV | NaN |
9022 | tt9822816 | tvEpisode | Zwischen Vertuschung und Aufklärung - Missbrau... | 0 | 2019 | \N | \N | \N | NaN |
9023 | tt9900062 | tvEpisode | The Direction of Yuu's Love: Hings Aren't Goin... | 0 | 1994 | \N | \N | Animation,Comedy,Drama | NaN |
9024 | tt9909210 | tvEpisode | Politik und/oder Moral - Wie weit geht das Ver... | 0 | 2005 | \N | \N | \N | NaN |
# Get data without missing values
movie_df = movie_df.loc[(movie_df['primaryTitle'].notnull()) & (movie_df['originalTitle'].notnull()) &
(movie_df['genres'].notnull())]
# Print number of rows
print('Number of movie rows without missing values:', len(movie_df))
Number of movie rows without missing values: 9000
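For reference, the same filtering can be written more compactly with dropna; the line below is just an equivalent sketch of the step we already performed, so running it here changes nothing.
# Equivalent alternative: drop rows with missing titles or genres
movie_df = movie_df.dropna(subset=['primaryTitle', 'originalTitle', 'genres'])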
If we look at the startYear, endYear, runtimeMinutes, and genres columns, there are values of \N, which stands for NULL (a wrong format). The next step is to replace these \N values with np.nan and convert the startYear, endYear, and runtimeMinutes columns to the float64 data type.
# Replace '\\N' values in column startYear
movie_df['startYear'] = movie_df['startYear'].replace('\\N', np.nan)
movie_df['startYear'] = movie_df['startYear'].astype('float64')
# Replace '\\N' values in column endYear
movie_df['endYear'] = movie_df['endYear'].replace('\\N', np.nan)
movie_df['endYear'] = movie_df['endYear'].astype('float64')
# Replace '\\N' values in column runtimeMinutes
movie_df['runtimeMinutes'] = movie_df['runtimeMinutes'].replace('\\N', np.nan)
movie_df['runtimeMinutes'] = movie_df['runtimeMinutes'].astype('float64')
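The three replace-and-cast pairs above can also be collapsed into a single step; the snippet below is only a compact equivalent, with num_cols being an illustrative helper name.
# Compact equivalent: clean and cast the three numeric columns at once
num_cols = ['startYear', 'endYear', 'runtimeMinutes']
movie_df[num_cols] = movie_df[num_cols].replace('\\N', np.nan).astype('float64')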
# View new first five rows
movie_df.head()
index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|---|
0 | tt0221078 | short | Circle Dance, Ute Indians | Circle Dance, Ute Indians | 0 | 1898.0 | NaN | NaN | Documentary,Short |
1 | tt8862466 | tvEpisode | ¡El #TeamOsos va con todo al "Reality del amor"! | ¡El #TeamOsos va con todo al "Reality del amor"! | 0 | 2018.0 | NaN | NaN | Comedy,Drama |
2 | tt7157720 | tvEpisode | Episode #3.41 | Episode #3.41 | 0 | 2016.0 | NaN | 29.0 | Comedy,Game-Show |
3 | tt2974998 | tvEpisode | Episode dated 16 May 1987 | Episode dated 16 May 1987 | 0 | 1987.0 | NaN | NaN | News |
4 | tt2903620 | tvEpisode | Frances Bavier: Aunt Bee Retires | Frances Bavier: Aunt Bee Retires | 0 | 1973.0 | NaN | NaN | Documentary |
Next, we will create a function named transform_to_list to convert the genres values into lists.
def transform_to_list(x):
    if x == '\\N':
        # Return an empty list for missing genres
        return []
    elif ',' in x:
        # Split the comma-separated genres into a list
        return x.split(',')
    else:
        # Wrap a single genre in a list
        return [x]
movie_df['genres'] = movie_df['genres'].apply(lambda x: transform_to_list(x))
movie_df['genres'].head()
0 [Documentary, Short]
1 [Comedy, Drama]
2 [Comedy, Game-Show]
3 [News]
4 [Documentary]
Name: genres, dtype: object
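If you want to double-check the transformation, one optional, purely illustrative check is to explode the lists and count how often each genre appears:
# Optional check: frequency of each genre after the transformation
movie_df['genres'].explode().value_counts().head()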
Ratings Table
# Print first five rows
rating_df.head()
index | tconst | averageRating | numVotes |
---|---|---|---|
0 | tt0000001 | 5.6 | 1608 |
1 | tt0000002 | 6.0 | 197 |
2 | tt0000003 | 6.5 | 1285 |
3 | tt0000004 | 6.1 | 121 |
4 | tt0000005 | 6.1 | 2050 |
# View info data
rating_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030009 entries, 0 to 1030008
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 1030009 non-null object
1 averageRating 1030009 non-null float64
2 numVotes 1030009 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 23.6+ MB
The rating data is already clean: no missing values and no wrongly formatted columns were found.
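As an optional confirmation of that claim, a quick null check can be run as well:
# Optional check: confirm there are no missing values in the rating table
rating_df.isnull().sum()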
Merging the Movies and Ratings Tables
Join the data from the two tables on the tconst column using an inner join.
# Merge table movie and rating
movie_rating_df = pd.merge(movie_df, rating_df, on='tconst', how='inner')
# Print first five rows
movie_rating_df.head()
index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | tt0043745 | short | Lion Down | Lion Down | 0 | 1951.0 | NaN | 7.0 | [Animation, Comedy, Family] | 7.1 | 459 |
1 | tt0167491 | video | Wicked Covergirls | Wicked Covergirls | 1 | 1998.0 | NaN | 85.0 | [Adult] | 5.7 | 7 |
2 | tt6574096 | tvEpisode | Shadow Play - Part 2 | Shadow Play - Part 2 | 0 | 2017.0 | NaN | 22.0 | [Adventure, Animation, Comedy] | 8.5 | 240 |
3 | tt6941700 | tvEpisode | RuPaul Roast | RuPaul Roast | 0 | 2017.0 | NaN | NaN | [Reality-TV] | 8.0 | 11 |
4 | tt7305674 | video | UCLA Track & Field Promo | UCLA Track & Field Promo | 0 | 2017.0 | NaN | NaN | [Short, Sport] | 9.7 | 7 |
# View info data
print(movie_rating_df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1376 entries, 0 to 1375
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 1376 non-null object
1 titleType 1376 non-null object
2 primaryTitle 1376 non-null object
3 originalTitle 1376 non-null object
4 isAdult 1376 non-null int64
5 startYear 1376 non-null float64
6 endYear 26 non-null float64
7 runtimeMinutes 1004 non-null float64
8 genres 1376 non-null object
9 averageRating 1376 non-null float64
10 numVotes 1376 non-null int64
dtypes: float64(4), int64(2), object(5)
memory usage: 129.0+ KB
None
There are still missing values in the endYear and runtimeMinutes columns. Here we will only drop the rows that have no duration (runtimeMinutes).
# Remove missing data
movie_rating_df = movie_rating_df.dropna(subset=['runtimeMinutes'])
# Print total data
print('New number of rows:', len(movie_rating_df))
New number of rows: 1004
Building the Simple Recommender System
Following the Weighted Rating formula, we will compute the following values:
$\text{Weighted Rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$
The Value of C
The first value to compute is C, the mean of averageRating.
C = movie_rating_df['averageRating'].mean()
print(C)
6.829581673306767
The Value of m
Let's only consider movies whose numVotes is above the 80th percentile of the population, so we keep only the top 20% of movies by vote count.
m = movie_rating_df['numVotes'].quantile(0.8)
print(m)
229.0
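The 0.8 cutoff is a design choice; as an illustrative aside, you can compare how the threshold m would shift at other percentiles before settling on one.
# Illustrative: candidate vote thresholds at several percentiles
movie_rating_df['numVotes'].quantile([0.7, 0.8, 0.9])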
Next, we create a function named imdb_weighted_rating based on the Weighted Rating formula.
# Function Weighted Rating
def imdb_weighted_rating(df, var=0.8):
    # Variables for the IMDB score
v = df['numVotes']
R = df['averageRating']
C = df['averageRating'].mean()
m = df['numVotes'].quantile(var)
    # IMDB weighted rating formula
df['score'] = (v/(m+v))*R + (m/(m+v))*C
return df['score']
imdb_weighted_rating(movie_rating_df)
# View data with IMDB score
movie_rating_df.head()
index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | tt0043745 | short | Lion Down | Lion Down | 0 | 1951.0 | NaN | 7.0 | [Animation, Comedy, Family] | 7.1 | 459 | 7.009992 |
1 | tt0167491 | video | Wicked Covergirls | Wicked Covergirls | 1 | 1998.0 | NaN | 85.0 | [Adult] | 5.7 | 7 | 6.796077 |
2 | tt6574096 | tvEpisode | Shadow Play - Part 2 | Shadow Play - Part 2 | 0 | 2017.0 | NaN | 22.0 | [Adventure, Animation, Comedy] | 8.5 | 240 | 7.684380 |
5 | tt2262289 | movie | The Pin | The Pin | 0 | 2013.0 | NaN | 85.0 | [Drama] | 7.7 | 27 | 6.921384 |
6 | tt0874027 | tvEpisode | Episode #32.9 | Episode #32.9 | 0 | 2006.0 | NaN | 29.0 | [Comedy, Game-Show, News] | 8.0 | 8 | 6.869089 |
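As a sanity check (illustrative only), plugging the first row into the formula by hand reproduces its score value:
# Manual check for the first row: v = 459, R = 7.1, m = 229, C ≈ 6.8296
print((459 / (459 + 229)) * 7.1 + (229 / (459 + 229)) * 6.829581673306767)
# prints roughly 7.009992, matching the 'score' column above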
From the previous step we now have an additional score column. First we will filter for movies whose numVotes is at least m, then sort the score from highest to lowest and take the top entries.
# Create function recommender system
def simple_recommender(df, top=100):
# Filtering and sorting
df = df.loc[df['numVotes'] >= m]
df = df.sort_values(by='score', ascending=False)
    # Keep only the top rows
df = df[:top]
return df
# Get top 25 data movies
simple_recommender(movie_rating_df, top=25)
index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
68 | tt4110822 | tvEpisode | S.O.S. Part 2 | S.O.S. Part 2 | 0 | 2015.0 | NaN | 43.0 | [Action, Adventure, Drama] | 9.4 | 3820 | 9.254624 |
236 | tt2200252 | video | Attack of the Clones Review | Attack of the Clones Review | 0 | 2010.0 | NaN | 86.0 | [Comedy] | 9.3 | 1411 | 8.955045 |
1181 | tt7697962 | tvEpisode | Chapter Seventeen: The Missionaries | Chapter Seventeen: The Missionaries | 0 | 2019.0 | NaN | 54.0 | [Drama, Fantasy, Horror] | 9.2 | 1536 | 8.892450 |
326 | tt7124590 | tvEpisode | Chapter Thirty-Four: Judgment Night | Chapter Thirty-Four: Judgment Night | 0 | 2018.0 | NaN | 42.0 | [Crime, Drama, Mystery] | 9.1 | 1859 | 8.850993 |
1045 | tt0533506 | tvEpisode | The Prom | The Prom | 0 | 1999.0 | NaN | 60.0 | [Action, Drama, Fantasy] | 8.9 | 2740 | 8.740308 |
71 | tt8399426 | tvEpisode | Savages | Savages | 0 | 2018.0 | NaN | 58.0 | [Drama, Fantasy, Romance] | 9.0 | 1428 | 8.700045 |
1234 | tt2843830 | tvEpisode | VIII. | VIII. | 0 | 2014.0 | NaN | 57.0 | [Adventure, Drama] | 8.9 | 1753 | 8.660784 |
1087 | tt4295140 | tvSeries | Chef's Table | Chef's Table | 0 | 2015.0 | NaN | 50.0 | [Documentary] | 8.6 | 12056 | 8.566998 |
1054 | tt2503932 | tvEpisode | Trial and Error | Trial and Error | 0 | 2013.0 | NaN | 43.0 | [Drama, Fantasy, Horror] | 8.6 | 2495 | 8.451165 |
448 | tt0337566 | video | AC/DC: Live at Donington | AC/DC: Live at Donington | 0 | 1992.0 | NaN | 120.0 | [Documentary, Music] | 8.5 | 1343 | 8.256663 |
624 | tt0620159 | tvEpisode | Strike Out | Strike Out | 0 | 2000.0 | NaN | 30.0 | [Comedy] | 8.7 | 401 | 8.020118 |
1281 | tt3166390 | tvEpisode | Looking for a Plus-One | Looking for a Plus-One | 0 | 2014.0 | NaN | 28.0 | [Comedy, Drama, Romance] | 8.7 | 396 | 8.014679 |
314 | tt0954759 | tvEpisode | Ben Franklin | Ben Franklin | 0 | 2007.0 | NaN | 21.0 | [Comedy] | 8.1 | 2766 | 8.002863 |
189 | tt5661506 | video | Florence + the Machine: The Odyssey | Florence + the Machine: The Odyssey | 0 | 2016.0 | NaN | 49.0 | [Music] | 8.8 | 330 | 7.992798 |
151 | tt3954426 | tvEpisode | Bleeding Kansas | Bleeding Kansas | 0 | 2014.0 | NaN | 42.0 | [Drama, Western] | 8.6 | 437 | 7.991253 |
1344 | tt6644294 | tvEpisode | The Hostile Hospital: Part Two | The Hostile Hospital: Part Two | 0 | 2018.0 | NaN | 40.0 | [Adventure, Comedy, Drama] | 8.3 | 812 | 7.976536 |
1242 | tt3677742 | tvSpecial | Saturday Night Live: 40th Anniversary Special | Saturday Night Live: 40th Anniversary Special | 0 | 2015.0 | NaN | 106.0 | [Comedy] | 8.1 | 1931 | 7.965312 |
1217 | tt3642464 | tvEpisode | Giant Woman | Giant Woman | 0 | 2014.0 | NaN | 11.0 | [Adventure, Animation, Comedy] | 8.4 | 566 | 7.947641 |
544 | tt0734655 | tvEpisode | The Little People | The Little People | 0 | 1962.0 | NaN | 25.0 | [Drama, Fantasy, Horror] | 8.1 | 1559 | 7.937290 |
49 | tt9119838 | tvEpisode | Parisian Legend Has It... | Parisian Legend Has It... | 0 | 2019.0 | NaN | 42.0 | [Drama] | 8.9 | 263 | 7.936330 |
357 | tt4084774 | tvEpisode | Trial and Punishment | Trial and Punishment | 0 | 2015.0 | NaN | 56.0 | [Adventure, Drama] | 8.8 | 289 | 7.928908 |
1111 | tt4174072 | tvEpisode | Immortal Emerges from Cave | Immortal Emerges from Cave | 0 | 2017.0 | NaN | 53.0 | [Action, Adventure, Crime] | 8.0 | 2898 | 7.914287 |
790 | tt4279086 | tvEpisode | And Santa's Midnight Run | And Santa's Midnight Run | 0 | 2014.0 | NaN | 42.0 | [Action, Adventure, Comedy] | 8.2 | 823 | 7.901687 |
972 | tt0048028 | movie | East of Eden | East of Eden | 0 | 1955.0 | NaN | 118.0 | [Drama] | 7.9 | 38543 | 7.893678 |
819 | tt0032156 | movie | The Story of the Last Chrysanthemum | Zangiku monogatari | 0 | 1939.0 | NaN | 143.0 | [Drama, Romance] | 7.9 | 2974 | 7.823470 |
From the results above we can see that:
- The list of movies is now sorted by score from highest to lowest. A movie with a high averageRating does not always rank above a movie with a lower averageRating, because the number of votes is also taken into account.
- This recommender system can still be improved by adding specific filters on titleType, startYear, or other attributes.
Next, we will create a function that filters by isAdult, startYear, and genres, and look at the movie recommendations it produces.
# Copy dataframe
new_df = movie_rating_df.copy()
# Create recommender system with filtering
def user_prefer_recommender(df, ask_adult, ask_start_year, ask_genre, top):
# Ask_adult = yes/no
if ask_adult.lower() == 'yes':
df = df.loc[df['isAdult'] == 1]
elif ask_adult.lower() == 'no':
df = df.loc[df['isAdult'] == 0]
# Ask_start_year (numeric)
df = df.loc[df['startYear'] >= int(ask_start_year)]
# Ask_genre = 'all' or other genres
if ask_genre.lower() == 'all':
df = df
else:
def filter_genre(x):
if ask_genre.lower() in str(x).lower():
return True
else:
return False
df = df.loc[df['genres'].apply(lambda x: filter_genre(x))]
# Get rows with greater than or equal m numVotes
df = df.loc[df['numVotes'] >= m]
df = df.sort_values(by='score', ascending=False)
# Get top movies
df = df[:top]
return df
# Result movies recommendation
user_prefer_recommender(new_df, ask_adult = 'no', ask_start_year = 2000, ask_genre = 'drama', top=10)
index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
68 | tt4110822 | tvEpisode | S.O.S. Part 2 | S.O.S. Part 2 | 0 | 2015.0 | NaN | 43.0 | [Action, Adventure, Drama] | 9.4 | 3820 | 9.254624 |
1181 | tt7697962 | tvEpisode | Chapter Seventeen: The Missionaries | Chapter Seventeen: The Missionaries | 0 | 2019.0 | NaN | 54.0 | [Drama, Fantasy, Horror] | 9.2 | 1536 | 8.892450 |
326 | tt7124590 | tvEpisode | Chapter Thirty-Four: Judgment Night | Chapter Thirty-Four: Judgment Night | 0 | 2018.0 | NaN | 42.0 | [Crime, Drama, Mystery] | 9.1 | 1859 | 8.850993 |
71 | tt8399426 | tvEpisode | Savages | Savages | 0 | 2018.0 | NaN | 58.0 | [Drama, Fantasy, Romance] | 9.0 | 1428 | 8.700045 |
1234 | tt2843830 | tvEpisode | VIII. | VIII. | 0 | 2014.0 | NaN | 57.0 | [Adventure, Drama] | 8.9 | 1753 | 8.660784 |
1054 | tt2503932 | tvEpisode | Trial and Error | Trial and Error | 0 | 2013.0 | NaN | 43.0 | [Drama, Fantasy, Horror] | 8.6 | 2495 | 8.451165 |
1281 | tt3166390 | tvEpisode | Looking for a Plus-One | Looking for a Plus-One | 0 | 2014.0 | NaN | 28.0 | [Comedy, Drama, Romance] | 8.7 | 396 | 8.014679 |
151 | tt3954426 | tvEpisode | Bleeding Kansas | Bleeding Kansas | 0 | 2014.0 | NaN | 42.0 | [Drama, Western] | 8.6 | 437 | 7.991253 |
1344 | tt6644294 | tvEpisode | The Hostile Hospital: Part Two | The Hostile Hospital: Part Two | 0 | 2018.0 | NaN | 40.0 | [Adventure, Comedy, Drama] | 8.3 | 812 | 7.976536 |
49 | tt9119838 | tvEpisode | Parisian Legend Has It... | Parisian Legend Has It... | 0 | 2019.0 | NaN | 42.0 | [Drama] | 8.9 | 263 | 7.936330 |
The result above is a recommendation of the 10 best drama movies from the year 2000 onward, excluding adult-only titles.
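As one more illustrative call (the parameter values here are arbitrary), asking for the top titles of any genre from 2010 onward would look like this:
# Illustrative: top 5 non-adult titles of any genre from 2010 onward
user_prefer_recommender(new_df, ask_adult='no', ask_start_year=2010, ask_genre='all', top=5)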