Bollywood and Hollywood are two of the most famous film industries today. A Westerner might assume
that Bollywood has a long way to go before matching the American industry in terms of
production. Yet the numbers tell another story: Indian cinema is
extremely popular in India, a country of more than 1.4 billion people (more than 1/8 of the population
on Earth! (source)). The Indian film industry’s exports to other countries are also growing, notably to
China, and its productions have appeared more and more in Western cinemas in recent years.
It is therefore interesting to deconstruct this bias and ask ourselves: what assumptions
do we hold as Westerners?
For example, what comes to your mind when I talk about Indian cinema? Drama? Singing? Dancing? Romance? Foreign?
Unlike Hollywood, Bollywood is not a real place but a blend
of Hollywood, the place of reference for American cinema, and Bombay
(now Mumbai), the historical base of Indian cinema. The association
between these two names may lead one to believe that Indian cinema is strongly influenced by
American cinema. In this project, we therefore look at the features of their films and analyze the
similarities and differences between these two world-famous film factories.
Are Indian films heavily influenced by American production or do they have their own completely different identity?
To answer these questions, we use data from the CMU Movie Summary Corpus, which contains a
wealth of information on different films from around the world, as well as their summaries.
We take the American and Indian productions and extract the most interesting
features, such as movie genres, the topics covered in the summaries, and the characteristics of the
actors, and then compare them to each other.
Getting familiar with the data
Movie genres
As the plot shows, "drama" is the most prominent movie genre in both American and
Indian movies.
American movies seem to be spread more evenly across genres than
Indian movies, which are more focused on "drama".
Some questions arise here: is American drama the same as Indian drama? What if these notions are completely different in the two cultures? We will investigate these questions further towards the end of our analysis.
Furthermore, both Indian and American movies have a large number of movie
genres, so it is useful to select a restricted group of genres
that allows more precise results in our further analysis.
For example, a genre like "world cinema" does not seem specific enough to
categorize the movies. Indeed, the term "world cinema" refers to films produced
outside of the United States and Europe, or to films that are made in a
particular country or region but are intended for international audiences.
World cinema includes a wide variety of film genres, such as action, drama,
comedy, romance, horror, and more. It encompasses films from many different
countries and cultures, each with their own unique traditions and styles of
storytelling.
It is rather reductive to put every genre produced outside of the West under a
global term like "world cinema". We hence lose a lot of information about the
spectrum of differences and themes of movies produced around the
world. This could be another bias of our dataset:
Indian movies are not represented in the same way as American movies.
This could explain why there are more movie genres attributed to
American movies than to Indian movies, although we have to be careful because
we have fewer Indian movies than American movies.
Finally, some genres are similar to each other, and some of them combine
other genres (e.g. "action/adventure"). We split these compound
genres and counted the movie under each of the single genres.
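To make this splitting step concrete, here is a minimal sketch in Python. It assumes that compound labels use "/" as a separator and that each movie carries a list of genre labels; the function name `split_genres` and the toy data are hypothetical and are not taken from our notebook.

```python
from collections import Counter

def split_genres(genres, sep="/"):
    """Split compound labels such as "action/adventure" into single genres."""
    singles = []
    for label in genres:
        for part in label.split(sep):
            part = part.strip().lower()
            # Avoid counting the same single genre twice for one movie.
            if part and part not in singles:
                singles.append(part)
    return singles

# Hypothetical toy data: one list of genre labels per movie.
movies = [
    ["Action/Adventure", "Drama"],
    ["Drama"],
    ["Romance", "Action"],
]

# Count each movie once under every single genre it belongs to.
genre_counts = Counter(g for labels in movies for g in split_genres(labels))
print(genre_counts.most_common())
```

With this approach, a movie labelled "Action/Adventure" contributes to both "action" and "adventure", which keeps the genre counts comparable between the two industries.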
Movie runtime
We notice that Indian movies tend to be longer than American movies
(by ~48 minutes).
There could be a bias in our dataset, because early Indian cinema might be
under-represented compared to early American cinema.
We can add that, in general, Indian movies tend to have more
elaborate plots and subplots, and they often include songs and dance sequences,
which can contribute to their longer runtime. This is often a characteristic of
Bollywood movies, in particular, which are known for their elaborate storylines
and large casts. American movies, on the other hand, may be more focused on
action and special effects, which may not require as much screen time.
Lexicon
Indian movies
American movies
These word clouds represent the prevalence of words in the movie plots for Indian and
American movies respectively. The size of a word is proportional to its frequency in the
movie summaries.
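As an illustration of how word sizes can be derived from frequencies, here is a small sketch using only the Python standard library. The stop-word list, the helper names and the two example summaries are assumptions made for this example; the actual word clouds were produced with a dedicated plotting tool.

```python
import re
from collections import Counter

# Hypothetical minimal stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "his", "her", "is", "with"}

def word_frequencies(summaries):
    """Count how often each non-stop-word appears across all summaries."""
    counts = Counter()
    for text in summaries:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

def relative_sizes(counts, top_n=3):
    """Scale the top words to (0, 1] so that size is proportional to frequency."""
    top = counts.most_common(top_n)
    largest = top[0][1]
    return {word: n / largest for word, n in top}

# Hypothetical summaries, for illustration only.
summaries = [
    "A father opposes the love of his daughter.",
    "Love and family collide when the father returns.",
]
sizes = relative_sizes(word_frequencies(summaries))
print(sizes)
```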
We notice that the words family, love and father occur much more
often in Indian movies than in American ones. This might reflect underlying cultural
differences, but it would be premature to conclude anything at this level of the analysis.
It is also surprising to see that the word woman
is more prominent than the word girl in the US, whereas the opposite holds in India.
This can perhaps be put in perspective with what we observed in the previous section:
the mean actress age in the American-dominated cluster of romantic movies is 33 years,
whereas it is 27 years in Indian-dominated clusters (a 6-year gap!). Moreover,
the topics of family and marriage are also much more prominent in Indian romance than
in American romance.
Although this representation gives us a good idea of the differences between the
subjects covered by the two industries, it stays shallow and gives us no information
about the context or the type of movie those words appear in.
Actor data
Age and gender
We observe a significant difference between the ages of actresses in Indian and
American movies (we ran a t-test between the two samples and obtained a
significant p-value, cf. notebook). Actresses are younger in Indian movies
than in American ones.
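For readers who want to see what such a test computes, the sketch below implements the statistic of Welch's t-test (the variant that does not assume equal variances) with the standard library. The choice of Welch's variant is an assumption made here; the notebook may use another one. The ages are made-up values, not data from the corpus.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a**2 / (n_a - 1) + se_b**2 / (n_b - 1))
    return t, df

# Hypothetical ages, for illustration only (not taken from the dataset).
indian_ages = [22, 25, 27, 24, 30, 26]
american_ages = [31, 35, 29, 38, 33, 34]
t_stat, dof = welch_t(indian_ages, american_ages)
print(t_stat, dof)
```

A negative statistic means that the first sample has the lower mean. The p-value is then read from a Student t distribution with `dof` degrees of freedom; in practice a call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.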
It might mainly be driven by the sample size, but we can propose some possible
factors to explain why Indian actresses might be younger than the American
actresses:
Cultural preferences: In some cultures, including Indian culture, youth and
beauty are often prized and valued highly. This may lead to a preference for
younger actresses in the film industry.
The demands of the film industry: Acting in movies can be physically
demanding, and younger actresses may be better able to handle the
demands of long shooting schedules and demanding roles.
Physical demands of acting: In some cases, certain roles may require
actresses to have a certain appearance or physicality, which may be more
common in younger actresses.
It is also worth noting that there is a wide range of ages among actresses
in Indian movies, and many actresses in India continue to work in the
film industry well into their 40s, 50s, and beyond.
Number of films per actor
Our dataset is probably biased in the quantity of information it
holds about American films.
We are far more likely to have the names of actors that make a single
brief appearance in an American film.
However, at the other extreme, we can see that there are many more
ultra-prolific Indian film actors than American film actors.
In both film industries we see however that most of the hyper-prolific
actors are males, which is probably linked to the difference in career
prospects with age.
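The counts behind this kind of plot can be sketched as follows. The `credits` list of (movie, actor) pairs is hypothetical and only stands in for the character metadata of the corpus.

```python
from collections import Counter

# Hypothetical (movie, actor) credits; names are placeholders.
credits = [
    ("movie_1", "actor_a"), ("movie_2", "actor_a"), ("movie_3", "actor_a"),
    ("movie_1", "actor_b"),
    ("movie_2", "actor_c"), ("movie_3", "actor_c"),
]

# Deduplicate first so an actor credited twice in one movie counts once.
films_per_actor = Counter(actor for _, actor in set(credits))

# Distribution: how many actors appear in exactly k films.
distribution = Counter(films_per_actor.values())

# Share of actors with a single appearance, the group most affected by
# differences in how completely casts are recorded.
single_appearance_share = distribution[1] / len(films_per_actor)
print(films_per_actor, distribution, single_appearance_share)
```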
Also, there is some evidence to suggest that actresses tend to
have fewer opportunities to play leading roles in movies than male actors
in both Hollywood and Bollywood. Indeed, in 2015, the Center for the
Study of Women in Television and Film at San Diego State University
published a report titled "It's a Man's (Celluloid) World," which found
that women made up just 12% of protagonists in the top 100 grossing
films of 2014.
This phenomenon, known as the "gender gap" or the "female movie deficit,"
refers to the underrepresentation of women in the film industry, both in
front of and behind the camera.
Other factors may also contribute to the gender gap in the
film industry, including societal and cultural biases, the lack of diverse
and complex female characters in film scripts, and the limited number of
female directors and producers.
This is once again an angle that could be included in further analysis
regarding trends and film success (ratings) prediction.
Did you know?
India is a country where more than a dozen languages are spoken! Silent films brought
all audiences together! But when Indian cinema
entered the sound era, the use of music and dance became a way to homogenise
the national market across linguistic divides!
Topic detection and further analysis
Extracting topics using Latent Dirichlet Allocation
Navigating the visualization
Topic Bubble
The representation shows the distribution of topics in a 2-dimensional space (left side panel). The topics are represented as bubbles.
The larger the bubble, the more frequent the topic is in the documents.
Since we have a low number of topics (10), we have big non-overlapping bubbles, scattered throughout the chart.
The distance between topics approximates the semantic relationship between them.
Topics that share common words overlap, whereas topics with distinct vocabularies stay apart.
Horizontal Bar Graph
The bar graph shows the frequency distribution of the words in the documents (in blue color).
The red area describes the frequency of each word in a given topic.
When selecting a topic (clicking on a topic bubble), the top 30 most relevant terms for the topic are shown.
When you hover over a specific word (in the right panel), the bubbles of the topics that contain this word grow or shrink. In this scenario, the size of a bubble describes the weight of the word in that topic: the higher the weight of the selected word, the larger the bubble.
Relevance Metric
You can re-rank the words of a topic by varying the relevance metric parameter lambda (top right slide bar), which goes from 0 to 1.
Decreasing lambda increases the weight of the ratio (frequency of the word given the topic / overall frequency of the word in the documents), and therefore favors words that are more specific to the topic. Words that are important for the given topic then move upward.
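The effect of lambda can be illustrated with the relevance formula commonly used by this kind of visualization: relevance = lambda * log p(word | topic) + (1 - lambda) * log(p(word | topic) / p(word)). The sketch below assumes this formula; the two words and their probabilities are invented for the example.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical probabilities: (p(word | topic), p(word) in the whole corpus).
words = {
    "love": (0.050, 0.040),     # frequent everywhere
    "vampire": (0.020, 0.002),  # rarer, but specific to the topic
}

def rank(lam):
    """Order the words from most to least relevant for a given lambda."""
    return sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)

print(rank(1.0))  # ranking by in-topic frequency only
print(rank(0.0))  # ranking by specificity (the ratio) only
```

With lambda = 1 the frequent word comes first, while with lambda = 0 the word that is specific to the topic takes the lead.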
Interpretation
This topic analysis was really successful in detecting and segregating topics. We find a diversity of topics that span a wide range of lexicons.
However a limitation to this analysis is that we are not able to detect the semantic relationship between the words in a given topic, nor can we infer the context in which they appear.
This plot, while being very informative, doesn't inform us about the use of said topics in each movie industry.
We extracted those topics from a dataset containing both movie industries, and we are not able to see
how they are used in each industry.
This is why our next step is to explore the differences between the topics in each movie industry.
Labeling topics
Let us identify the topics and label them. To do that, we inspected
every topic bubble and identified the keywords that made the most sense
to describe the topic.
This enables us to label each topic in a specific manner rather than
to assign a general genre, since we already have that in our features.
It was difficult to pinpoint a specific theme since movie stories
can be very broad and unique, so we decided to choose as keywords
the most relevant terms to describe the topics.
If at any point you need a refresher as to what each topic is about, you can click on the blue square to the bottom right of your screen.
We therefore came up with a short description for each topic:
Topic 1: We chose "experiment, scientist, power, creature" as relevant keywords. This is probably related to science fiction, with stories about experiments and the creation of supernatural creatures. It seems to be linked to control and malevolent forces (keywords power, world, use, destroy).
Topic 2: We chose "body, vampire, child, night" as the relevant keywords. Even though the theme of family is present (keywords child, home, mother), this topic seems to belong to the supernatural register, with a horror component (keywords run, dead, death, attack, body, night). The effect of a gloomy revelation also seems to be captured by this topic through strong verbs (keywords reveal, discover, appear, realize).
Topic 3: "money, steal, prisoner, bank, drug" made the most sense to depict the topic. It seems pretty straightforward that the movies in this topic deal with crimes involving money, such as theft, bank break-ins and drug gangs. The consequences of crime are also captured by the LDA (keywords prison, guard, prisoner, police, catch). There is also a supernatural component in this topic, with the keywords dragon, stone and spell.
Topic 4: "team, game, coach, player" were the keywords chosen for this topic. It is also straightforward that the topic revolves around the world of sport, as several sports are named in the keywords (basketball, football, hockey). The vocabulary of games is also there (keywords win, lose, score, play, first, start, decide).
Topic 5: The keywords "ship, alien, earth, attack" also give a pretty straightforward idea of the topic. It seems to revolve around sci-fi movies, probably involving space adventures (keywords ship, alien, earth) with conflict and violence (keywords attack, destroy, death, shoot, fire, use).
Topic 6: "murder, police, gang, fight, crime". It is also pretty straightforward that the subject revolves around killing (keywords murder, crime, gang, killer, death, shoot). The crime probably involves family (keywords father, daughter, wife, brother, family, mother), so it would be closer to a psychological thriller than topic 3, which involves organizations and money.
Topic 7: We chose "agent, shoot, military, bomb" as relevant keywords to depict the topic. This topic seems to revolve around the use of military forces (keywords military, soldier, bomb, fire, shoot), perhaps to protect an important entity (keywords president, plane, plan, decide, order) from a group or person that wants to harm it (keywords group, attempt, bomb).
Topic 8: The keywords "band, family, show, music, dream" seem to describe stories that revolve around music and the pursuit of a dream career in music. The family theme is present (keywords family, mother, home), along with romance (keywords girlfriend, relationship, together). We also find the theme of desire, probably the desire to succeed in music (keywords dream, want, career, start).
Topic 9: The keywords "family, father, mother, daughter, wedding" are also very straightforward: movies in this topic revolve around the family, especially around romance and a relationship between a man and a woman that leads to marriage and starting a family together (keywords husband, wife, marry, wedding, pregnant, child). The professional world is also depicted (keywords work, money, business).
Topic 10: "home, school, night, parent, party, decide" were the keywords that we chose to describe this topic. Although it is less straightforward than some of the previous topics, it seems to revolve around family (keywords mother, father, parent, home, family), probably with a teenager (keyword school), and to involve conflicts or discussions (keywords party, ask, want, talk).
Topic 11: The keywords "escape, camp, attack, truck" were chosen for this topic. It also depicts the use of force and violence, so this topic will be found mostly in action movies. The vocabulary of protection is also present (keywords fight, rescue, town, monster).
Topic 12: This last topic's keywords are "movie, woman, play, character, role". It seems to be about life in the movie business (keywords role, character, play, movie, lead, story). The characters involved seem to be mostly female (keywords woman, girl, wife).
This plot shows the distribution of the mean normalized topic prevalence in each movie industry. We replaced the topic numbers with
the words that are the most representative of each topic.
We can see that the topics are not distributed uniformly across the movie industries.
For example, the topic containing love, marriage, wedding is significantly more prevalent in the Indian movie industry than in the American one.
We can also see that the topic containing home, school, work, party is more prevalent in the American movie industry than in the Indian one.
Those differences could be due to underlying cultural differences between the two countries, but also could
be due to a bias in our dataset.
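One way to compute such a per-industry mean is sketched below. It assumes that each movie comes with a vector of non-negative topic weights and that "normalized" means rescaling each vector to sum to 1; both points are assumptions, and the data are invented.

```python
from collections import defaultdict

def normalize(weights):
    """Rescale one movie's topic weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def mean_prevalence(movies):
    """Average the normalized topic weights of the movies of each industry."""
    grouped = defaultdict(list)
    for industry, weights in movies:
        grouped[industry].append(normalize(weights))
    return {
        industry: [sum(col) / len(rows) for col in zip(*rows)]
        for industry, rows in grouped.items()
    }

# Hypothetical weights for two topics, in the order ("wedding", "school").
movies = [
    ("India", [3.0, 1.0]),
    ("India", [1.0, 1.0]),
    ("USA", [1.0, 3.0]),
]
prevalence = mean_prevalence(movies)
print(prevalence)
```

Averaging within each industry, rather than over the pooled dataset, keeps the comparison meaningful even though the dataset contains more American than Indian movies.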
Topics indirectly inform us about the content of the movies through their summaries, but so do their genres. How does the movie genre relate
to the topics covered in the movie? Can we identify differences between the movie industries
and refine our understanding of them through the intersection of the topics and the genre?
Putting it all together
In this part, we offer you an in-depth study of 4 movie genres that we have selected to show you that the
notion of movie genre is not uniform across the globe. This part of the analysis can be explored in many
ways and we encourage you to play with the different visualizations and the different genres. We also include
an analysis of one genre below the plot to help you understand how to navigate through the visualizations.
Navigating the visualizations
Changing movie genre
To switch the focus from one movie genre to another, simply click on the corresponding button just below;
it will update the left and right panels accordingly.
Left panel
The left panel contains a plot of the t-SNE
(t-Distributed Stochastic Neighbor Embedding) latent space for Indian and American movies.
This plot has been created using standardized actor data and the prevalence of various topics
in the movies. The t-SNE latent space is a way of visualizing high-dimensional data in a
lower-dimensional space.
There are two visualization options:
Clicking on Country will differentiate the movies by country.
Sticking with the color code we previously used, Indian movies
will be colored in orange and American movies in blue.
Clicking on K-means will differentiate the movies by the
cluster they belong to. The clusters are computed using the K-means algorithm on the t-SNE latent space.
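To show what the clustering step does, here is a pure-Python sketch of K-means (Lloyd's algorithm) applied to 2-D points. It is only an illustration: the six coordinates are made-up stand-ins for t-SNE output, and an actual analysis would typically rely on a library implementation rather than this hand-written loop.

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm on a list of 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical 2-D embedding coordinates (stand-ins for t-SNE output).
embedding = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(embedding, k=2)
print(labels)
```

Because t-SNE distorts global distances, clusters found in the embedded space are best read as a visual aid rather than as a definitive grouping of the movies.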
Right panel
On the right panel, you can switch between 3 cluster representations. You can relate each
cluster to the data that describe it via its color.
There are three visualization options:
Clicking on the Topic option will show you the mean
standardized prevalence of each topic of each cluster. Hovering the cursor over each cluster
will show you the mean prevalence of each topic in the cluster, ordered from most (top) to least
(bottom) prevalent topic in the selected cluster. Negative values indicate that this topic is
under-represented in the cluster compared to its overall prevalence in our dataset.
Clicking on the Country option will show you the country
representation in each cluster. As we have an unbalanced dataset with more American than Indian movies,
we have normalized those values by dividing them respectively by the total proportion of Indian and
American movies in our dataset.
Clicking on the Actor option will show you the raw mean of each cluster
for different statistics about actors.
These include (from left to right):
The mean age of actors in the films, per cluster
The mean age of actresses in the films, per cluster
The mean number of films in which the actors and actresses present in the dataset played
The mean percentage of actresses per cast, per cluster
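The two normalizations used in the Topic and Country options can be sketched as follows. The helper names and all the numbers are hypothetical; the sketch only illustrates why a negative standardized value signals under-representation and why country shares are divided by the dataset proportions.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: subtract the dataset mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalized_country_share(cluster_countries, dataset_share):
    """Share of each country in a cluster, divided by its share in the dataset."""
    n = len(cluster_countries)
    return {
        country: (cluster_countries.count(country) / n) / share
        for country, share in dataset_share.items()
    }

# Hypothetical prevalence of one topic in four movies; the first two movies
# belong to the cluster of interest.
topic_prevalence = [0.1, 0.2, 0.6, 0.7]
z = standardize(topic_prevalence)
cluster_mean = mean(z[:2])  # negative: topic under-represented in the cluster

# Hypothetical dataset with 80% American and 20% Indian movies.
shares = normalized_country_share(
    ["USA", "USA", "India", "India"], {"USA": 0.8, "India": 0.2}
)
print(cluster_mean, shares)
```

In this toy cluster, half of the movies are Indian although only 20% of the dataset is, so the normalized value for India is well above 1 while the one for the USA falls below 1.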