In this report I will characterize the most popular shows and movies on Netflix. Feel free to make your way over to https://github.com/BrandonSmith710/netflix_nlp_analysis if you would like the full dataset and notebook.

I will begin with some introductory analysis and visualize the ratio of shows to movies.

image

Next, I’ll check the distributions of age certifications on Netflix.

image

image

Take a look at the average IMDB scores for television, and then movies.

image

image

An important next step is to locate the extreme outliers for IMDB score, since this score will be the primary metric for gauging the overall success of a movie or show. Here, a title is treated as an outlier if its IMDB score lies more than three standard deviations from the mean. Notably, the box plots above show that shows have a higher average IMDB score than movies.
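The three-standard-deviation rule can be illustrated on a small list of scores. This is a minimal sketch using only the standard library; the `scores` values and the helper name `three_sigma_outliers` are made up for illustration and are not taken from the dataset or the notebook.

```python
from statistics import mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (ddof=1), like pandas' default .std()
    return [v for v in values if abs(v - mu) > k * sigma]

# Hypothetical scores: twenty titles near 7.0 and one extreme value.
scores = [7.0, 7.1, 6.9, 7.2, 6.8] * 4 + [1.0]
outliers = three_sigma_outliers(scores)  # only the extreme value is flagged
```

Note that with very few titles a single extreme score inflates the standard deviation enough that it may never cross the threshold, so the rule is more meaningful on larger groups.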

TV Show Outliers:

image

Movie Outliers:

image

Now I can remove these outlying titles, since their descriptions would only add noise to the natural language analysis.

# Keep only titles whose IMDB score lies within three standard deviations of the mean.
tv_std, tv_mean = df_tv.imdb_score.std(), df_tv.imdb_score.mean()
mv_std, mv_mean = df_mv.imdb_score.std(), df_mv.imdb_score.mean()
# Series.between is inclusive on both ends by default.
df_tv = df_tv[df_tv.imdb_score.between(tv_mean - 3 * tv_std, tv_mean + 3 * tv_std)]
df_mv = df_mv[df_mv.imdb_score.between(mv_mean - 3 * mv_std, mv_mean + 3 * mv_std)]

Popular TV: Drama is the most popular television genre according to our dataset.

image

Popular Movies: The people love drama, and comedy is not far behind.

image

In order to visualize the impact of genre, a categorical variable, I’ve encoded it two ways. The first will support inspection of the genre attribute as a whole, and the second will scrutinize each of the 19 genres.
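As a rough illustration of the two encodings, the sketch below assumes each title's genres are available as a list of strings. The first function assigns one integer code per distinct genre combination, so the attribute can be inspected as a whole; the second builds one 0/1 indicator per genre. The sample titles and function names are hypothetical, and the notebook may encode genres differently.

```python
def encode_whole(genre_lists):
    """Assign one integer code to each distinct genre combination."""
    codes = {}
    encoded = []
    for genres in genre_lists:
        key = tuple(sorted(genres))  # order-insensitive combination
        if key not in codes:
            codes[key] = len(codes)
        encoded.append(codes[key])
    return encoded

def encode_per_genre(genre_lists):
    """Build one 0/1 indicator column per genre (multi-hot encoding)."""
    all_genres = sorted({g for genres in genre_lists for g in genres})
    return {g: [int(g in genres) for genres in genre_lists] for g in all_genres}

# Hypothetical sample of three titles.
sample = [['drama', 'comedy'], ['drama'], ['comedy', 'drama']]
whole = encode_whole(sample)          # one code per title
per_genre = encode_per_genre(sample)  # one indicator list per genre
```

The per-genre form is what makes it possible to correlate each individual genre with IMDB score later on.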

TV presents a minimal relationship between genre and IMDB score.

image

Movies show virtually no relationship between genre and IMDB score.

image

Drama has a higher correlation with IMDB score than any other TV genre. Also visible is the relationship between age certification and number of seasons; season count is another indicator of popularity (or at least financial success).

image

Family-oriented shows average the most seasons.

image

Documentary movies have the highest correlation with IMDB score, with a coefficient of roughly 0.3.
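To make the coefficient concrete, here is a small sketch of a Pearson correlation between a 0/1 genre indicator and IMDB score. It assumes the coefficients in the plots are Pearson correlations (the default method of pandas' `DataFrame.corr`); the data and the `pearson` helper are invented for illustration and will not reproduce the figure above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: 1 marks a documentary, 0 any other genre.
is_documentary = [1, 1, 0, 0, 0, 1]
imdb_score = [7.5, 8.0, 6.0, 6.5, 7.0, 7.0]
r = pearson(is_documentary, imdb_score)
```

A positive value means titles carrying the genre flag tend to score higher; a value near zero means the flag says little about the score.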

image

What can be learned from the language used to describe each movie or show? Next I’ll dissect the descriptions and find out which words and topics are associated with top scores. After fitting a gensim LDA model to the descriptions, movies and shows can each be represented by 15 topics.

Shows concerning topic 8 have the highest correlation with number of seasons, and several other topics bear a similar relationship.

image

Movies focused on topics 8 and 10 have the highest correlations to IMDB score.

image

Now that topic and genre data is available in a numeric format, it will be easy to comb through other features, like production countries, and see which topics and genres are most popular in each country.

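# Collect every distinct country code that appears in production_countries.
is_alpha = lambda x: x.isalpha()
lst = list()
for countries in df_titles['production_countries']:
    # Strip brackets, quotes and spaces from each comma-separated entry.
    codes = [''.join(filter(is_alpha, c)) for c in countries.split(',')]
    for country in codes:
        lst.append(country)
# Deduplicate and drop the empty strings left by titles with no country.
countries = [c for c in list(set(lst)) if c]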
countries_dict = {}
for country in countries:
    tv_max_genre, tv_min_genre = '', ''
    mv_max_genre, mv_min_genre = '', ''
    tv_max, tv_min = 0, 0
    mv_max, mv_min = 0, 0
    df_xtv = df_tv[df_tv['production_countries'].apply(
        lambda x: country in x)]
    df_xtv = df_xtv[df_xtv['production_countries'] != '[]']
    df_xmv = df_mv[df_mv['production_countries'].apply(
        lambda x: country in x)]
    df_xmv = df_xmv[df_xmv['production_countries'] != '[]']
    max_top_tv, min_top_tv = df_xtv['topic'].max(), df_xtv['topic'].min()
    max_top_mv, min_top_mv = df_xmv['topic'].max(), df_xmv['topic'].min()
    for genre in df_xtv.columns[14: -3]:
        stv = df_xtv[genre].sum()
        smv = df_xmv[genre].sum()
        if stv > tv_max:
            tv_max, tv_max_genre = stv, genre
        elif stv < tv_min:
            tv_min, tv_min_genre = stv, genre
        if smv > mv_max:
            mv_max, mv_max_genre = smv, genre
        elif smv < mv_min:
            mv_min, mv_min_genre = smv, genre
    countries_dict[country] = [max_top_tv, min_top_tv, tv_max_genre,
                               tv_min_genre, max_top_mv, min_top_mv,
                               mv_max_genre, mv_min_genre]
country_df = pd.DataFrame(countries_dict)
country_df.index = ['max_topic_tv', 'min_topic_tv', 'max_genre_tv',
                    'min_genre_tv', 'max_topic_mv', 'min_topic_mv',
                    'max_genre_mv', 'min_genre_mv']

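The country dataframe contains null values because the number of topics and genres for a given title is occasionally one or zero. Numerous countries also have only one recorded title. In addition, the minimum counters in the loop start at 0 and a genre sum is only recorded as a minimum when it falls below the current one, so for non-negative genre sums min_genre_tv and min_genre_mv keep their empty-string defaults. For these reasons the minimum genres for TV and movies can be dropped, and the remaining null values can be safely ignored.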

country_df.drop(['min_genre_tv', 'min_genre_mv'], axis=0, inplace=True)
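For comparison, the per-country genre tally can be sketched without the encoded columns. The example assumes `production_countries` and `genres` are stored as strings that look like Python lists (for example "['US', 'GB']"), which the comparison against '[]' above suggests but does not confirm; the rows and the function name are hypothetical.

```python
import ast
from collections import Counter, defaultdict

def top_genre_by_country(rows):
    """Map each country code to its most frequent genre.

    rows: iterable of (production_countries, genres) pairs, each a string
    holding a Python-style list literal.
    """
    counts = defaultdict(Counter)
    for countries_str, genres_str in rows:
        countries = ast.literal_eval(countries_str)
        genres = ast.literal_eval(genres_str)
        for country in countries:
            counts[country].update(genres)
    # most_common(1) yields the genre with the highest count for that country.
    return {c: counter.most_common(1)[0][0] for c, counter in counts.items() if counter}

# Hypothetical rows, not taken from the dataset.
rows = [
    ("['US']", "['drama', 'comedy']"),
    ("['US', 'GB']", "['drama']"),
    ("['GB']", "['documentary']"),
    ("['GB']", "['documentary', 'crime']"),
    ("[]", "['comedy']"),  # titles with no country contribute nothing
]
top = top_genre_by_country(rows)
```

Parsing the list literal avoids substring matches between country codes, at the cost of assuming every entry is a well-formed list.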

With this new data, the country dataframe shows which topics and genres are most popular in each of 107 countries.

Screenshot (52)