Netflix Natural Language Understanding

In this report I will characterize the most popular shows and movies on Netflix. Feel free to make your way over to https://github.com/BrandonSmith710/netflix_nlp_analysis if you would like the full dataset and notebook.

I will begin with some introductory analysis and visualize the ratio of shows to movies.
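A minimal sketch of how that ratio could be computed, assuming a `type` column that holds `SHOW` or `MOVIE` for each title. The column name and the toy rows are assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the titles table; the real data lives in the linked repo.
df_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE", "SHOW"],  # assumed column name
})

# Share of each title type, expressed as fractions that sum to 1.
type_share = df_titles["type"].value_counts(normalize=True)
print(type_share)
```

Passing `normalize=True` returns proportions rather than raw counts, which is usually what a ratio plot needs.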

Next, I’ll check the distributions of age certifications on Netflix.

Take a look at the average IMDB scores for television, and then movies.

An important next step is to locate the extreme outliers in IMDB score, since this score will be the primary metric for gauging the overall success of a movie or show. Here, a row counts as an outlier when its IMDB score lies more than three standard deviations from the mean. Notably, the box plots show that shows have a higher average IMDB score than movies.
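The three-standard-deviation rule can be sketched as a small helper that returns the outlying rows themselves. The helper name `imdb_outliers` and the toy scores are hypothetical; only the `imdb_score` column name comes from this report.

```python
import pandas as pd

def imdb_outliers(df, column="imdb_score", n_std=3):
    """Return rows whose score lies more than n_std standard deviations from the mean."""
    mean, std = df[column].mean(), df[column].std()
    distance = (df[column] - mean).abs()
    return df[distance > n_std * std]

# Toy data: 29 unremarkable scores and one very low one.
scores = [7.0] * 15 + [6.5] * 14 + [1.0]
df_demo = pd.DataFrame({
    "title": [f"t{i}" for i in range(len(scores))],
    "imdb_score": scores,
})

outliers = imdb_outliers(df_demo)
print(outliers)
```

Only the title scored 1.0 is returned here, since it is the only row more than three standard deviations from the mean of the toy data.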

TV Show Outliers:

Movie Outliers:

Now I can remove the noisy titles, as the descriptions of these plotlines will not help with natural language analysis.

# Keep only titles whose IMDB score lies within three standard deviations of the mean.
tv_std, tv_mean = df_tv.imdb_score.std(), df_tv.imdb_score.mean()
mv_std, mv_mean = df_mv.imdb_score.std(), df_mv.imdb_score.mean()
df_tv = df_tv[df_tv.imdb_score.between(tv_mean - 3 * tv_std, tv_mean + 3 * tv_std)]
df_mv = df_mv[df_mv.imdb_score.between(mv_mean - 3 * mv_std, mv_mean + 3 * mv_std)]

Popular TV: Drama is the most popular television genre according to our dataset.

Popular Movies: The people love drama, and comedy is not far behind.

In order to visualize the impact of genre, a categorical variable, I’ve encoded it two ways. The first will support inspection of the genre attribute as a whole, and the second will scrutinize each of the 19 genres.
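The notebook holds the actual encoding; the sketch below shows one plausible way to build the two views. It assumes each title stores its genres as a list, and the column names `genres` and `genre_code` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["drama", "crime"], ["comedy"], ["drama", "crime"]],  # assumed layout
})

# Normalize each genre list into one sorted, pipe-separated string.
combo = df_demo["genres"].apply(lambda g: "|".join(sorted(g)))

# Encoding 1: one integer code per distinct genre combination (genre as a whole).
df_demo["genre_code"] = combo.astype("category").cat.codes

# Encoding 2: one 0/1 indicator column per individual genre.
indicators = combo.str.get_dummies(sep="|")
df_demo = df_demo.join(indicators)
print(df_demo)
```

The integer code treats a combination such as crime plus drama as a single category, while the indicator columns let each genre be correlated with IMDB score on its own.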

TV presents a minimal relationship between genre and IMDB score.

Movies show virtually no relationship between genre and IMDB score.

Drama has a higher correlation with IMDB score than any other TV genre. Also visible is the relationship between age certification and number of seasons, which is another indicator of popularity (or at least financial success).

Family oriented shows average the most seasons.

Documentary movies have the highest correlation with IMDB score, with a coefficient of roughly 0.3.
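Per-genre coefficients like this one can be obtained by correlating each indicator column with the score. The sketch below uses invented scores and hypothetical genre column names, so its numbers bear no relation to the 0.3 reported above.

```python
import pandas as pd

# Toy data: 0/1 genre indicators alongside an IMDB score (all values invented).
df_demo = pd.DataFrame({
    "imdb_score":  [8.1, 7.9, 6.0, 5.5, 7.5, 6.2],
    "documentary": [1, 1, 0, 0, 1, 0],
    "comedy":      [0, 0, 1, 1, 0, 0],
    "drama":       [0, 1, 0, 0, 0, 1],
})
genre_cols = ["documentary", "comedy", "drama"]

# Pearson correlation of each genre indicator with IMDB score, strongest first.
genre_corr = (
    df_demo[genre_cols]
    .corrwith(df_demo["imdb_score"])
    .sort_values(ascending=False)
)
print(genre_corr)
```

Because each genre column is binary, the Pearson coefficient here is a point-biserial correlation: it is positive when titles carrying the genre score above the rest, and negative when they score below.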

What can be learned from the language used to describe each movie or show? Next I’ll dissect the descriptions and find out which words and topics are associated with top scores. After fitting a gensim LDA model to the descriptions, movies and shows can each be represented by 15 topics.
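One plausible way to reduce a fitted model's output to a single topic per title is to keep the most probable topic. The sketch below assumes each description has already been scored as a list of `(topic_id, probability)` pairs, which is the shape gensim's `LdaModel.get_document_topics` returns. The helper name `dominant_topic` and the numbers are made up, and the notebook may assign topics differently.

```python
def dominant_topic(topic_probs):
    """Return the id of the most probable topic, or None when the list is empty."""
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])[0]

# Hypothetical per-title topic distributions as (topic_id, probability) pairs.
doc_topics = {
    "title_a": [(2, 0.10), (8, 0.75), (10, 0.15)],
    "title_b": [(0, 0.55), (8, 0.45)],
    "title_c": [],  # e.g. a title whose description was empty
}

topics = {title: dominant_topic(probs) for title, probs in doc_topics.items()}
print(topics)
```

Titles with no scored topics map to `None`, which becomes a null value once the result is stored in a dataframe column.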

Shows concerning topic 8 have the highest correlation with number of seasons, and several other topics bear a similar relationship.

Movies focused on topics 8 and 10 have the highest correlations to IMDB score.

Now that topic and genre data are available in a numeric format, it will be easy to comb through other features, like production countries, and see which topics and genres are most popular in each country.

# Collect every distinct country code found in the production_countries strings.
country_codes = set()
for entry in df_titles['production_countries']:
    for raw in entry.split(','):
        # Keep letters only, dropping brackets, quotes, and spaces.
        code = ''.join(filter(str.isalpha, raw))
        if code:
            country_codes.add(code)
countries = list(country_codes)
countries_dict = {}
for country in countries:
    tv_max_genre, tv_min_genre = '', ''
    mv_max_genre, mv_min_genre = '', ''
    tv_max, tv_min = 0, 0
    mv_max, mv_min = 0, 0
    df_xtv = df_tv[df_tv['production_countries'].apply(
        lambda x: country in x)]
    df_xtv = df_xtv[df_xtv['production_countries'] != '[]']
    df_xmv = df_mv[df_mv['production_countries'].apply(
        lambda x: country in x)]
    df_xmv = df_xmv[df_xmv['production_countries'] != '[]']
    max_top_tv, min_top_tv = df_xtv['topic'].max(), df_xtv['topic'].min()
    max_top_mv, min_top_mv = df_xmv['topic'].max(), df_xmv['topic'].min()
    for genre in df_xtv.columns[14: -3]:
        stv = df_xtv[genre].sum()
        smv = df_xmv[genre].sum()
        if stv > tv_max:
            tv_max, tv_max_genre = stv, genre
        elif stv < tv_min:
            tv_min, tv_min_genre = stv, genre
        if smv > mv_max:
            mv_max, mv_max_genre = smv, genre
        elif smv < mv_min:
            mv_min, mv_min_genre = smv, genre
    countries_dict[country] = [max_top_tv, min_top_tv, tv_max_genre,
                               tv_min_genre, max_top_mv, min_top_mv,
                               mv_max_genre, mv_min_genre]
country_df = pd.DataFrame(countries_dict)
country_df.index = ['max_topic_tv', 'min_topic_tv', 'max_genre_tv',
                    'min_genre_tv', 'max_topic_mv', 'min_topic_mv',
                    'max_genre_mv', 'min_genre_mv']

The country dataframe holds null values because a given title occasionally has only one topic or genre, or none at all. Also, numerous countries have only one title recorded. For these reasons, and because the loop above starts each running minimum at 0, a value that genre totals do not fall below, the minimum genres for TV and movies carry no information and can be dropped. The remaining null values can be safely ignored.

country_df = country_df.drop(index=['min_genre_tv', 'min_genre_mv'])

The resulting country dataframe covers 107 countries and shows which topics and genres each one favors.
