In 2019, American movies generated ... dollars (source), and the Indian movie industry produced ... films (source).

Bollywood and Hollywood are two of the most famous film industries today. A Westerner might assume that Bollywood has a long way to go before matching the American industry in terms of production. The numbers tell a different story: the Indian movie industry is extremely popular in India, a country of more than 1.4 billion people (more than 1/8 of the population on Earth! (source)). Indian film exports are also growing, notably to China, and Indian productions have appeared more and more often in Western cinemas in recent years. It is therefore worth deconstructing this bias and asking ourselves: what assumptions do we hold as Westerners?

For example, what comes to your mind when I talk about Indian cinema?
Drama? Singing? Dancing? Romantic? Foreign?

Unlike Hollywood, Bollywood is not a real place. The name is a portmanteau of Hollywood, the reference point for American cinema, and Bombay (now Mumbai), the city that is the historical base of Indian cinema. The association between these two names may lead one to believe that Indian cinema is strongly influenced by American cinema. In this project, we therefore look at the features of their films and analyze the similarities and differences between these two world-famous film factories.

Are Indian films heavily influenced by American production or do they have their own completely different identity?

To answer this question, we use data from the CMU Movie Summary Corpus, which contains a wealth of information on films from around the world, as well as their summaries. We take the American and Indian films and extract the most interesting features, such as movie genre, the topics covered in the summaries and the characteristics of the actors, and then compare the two industries on each of them.

Getting familiar with the data

Movie genres

As the plot shows, "drama" is the prominent movie genre in both American and Indian movies.

American movies seem to be spread more evenly across genres than Indian movies, which are more concentrated on "drama".

Some questions arise here:
Is American drama the same as Indian drama?
What if these notions are completely different in the two cultures?
We will investigate these questions further towards the end of our analysis.

Furthermore, both Indian and American movies carry a large number of genre labels, so it is worth selecting a restricted group of genres that allows more precise results in the rest of our analysis.

For example, a genre like "world cinema" does not seem specific enough to categorize movies. Indeed, the term "world cinema" refers to films produced outside of the United States and Europe, or to films that are made in a particular country or region but are intended for international audiences. World cinema includes a wide variety of film genres, such as action, drama, comedy, romance, horror, and more. It encompasses films from many different countries and cultures, each with their own unique traditions and styles of storytelling.

It is rather reductive to put every genre produced outside of the West under a global term like "world cinema". We hence lose a lot of information about the spectrum of differences and themes in movies produced around the world. This could be another bias in our dataset: Indian movies are not represented in the same way as American movies.

This could explain why more movie genres are attributed to American movies than to Indian movies, although we have to be careful, because we also have fewer Indian movies than American movies.

Finally, some genres are similar to each other, and some labels combine several genres (e.g. "action/adventure"). We split these compound labels and counted the movie in each of the single genres.
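The re-classification of compound genre labels can be illustrated with a minimal sketch. It assumes that compound labels use "/" as a separator and that each movie comes with a list of genre labels; the function names and the toy data are hypothetical and are not taken from our notebook.

```python
from collections import Counter

def split_compound_genre(label):
    """Split a compound label such as "Action/Adventure" into single genres.

    Assumption: compound labels are separated by "/".
    """
    parts = [part.strip().lower() for part in label.split("/")]
    return [part for part in parts if part]

def count_single_genres(movie_genres):
    """Count each single genre at most once per movie."""
    counts = Counter()
    for labels in movie_genres:
        singles = set()
        for label in labels:
            singles.update(split_compound_genre(label))
        counts.update(singles)
    return counts

# Hypothetical toy input: one list of genre labels per movie.
movies = [
    ["Action/Adventure", "Drama"],
    ["Adventure", "Comedy"],
    ["Drama"],
]
genre_counts = count_single_genres(movies)
print(genre_counts.most_common())
```

Using a set per movie avoids counting a genre twice when a movie carries both the compound label and one of its components.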

Movie runtime

We notice that Indian movies tend to be longer than American movies (by ~48 minutes).

There could be a bias in our dataset, because early Indian cinema might be under-represented compared to early American cinema.

We can add that, in general, Indian movies tend to have more elaborate plots and subplots, and they often include song and dance sequences, which can contribute to their longer runtime. This is characteristic of Bollywood movies in particular, which are known for their elaborate storylines and large casts. American movies, on the other hand, may be more focused on action and special effects, which may not require as much screen time.


Indian movies

American movies

These word clouds represent the prevalence of words in the movie plots for Indian and American movies respectively. The size of a word is proportional to its frequency in the movie summaries.
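The frequencies behind a word cloud can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer is a simple regular expression, the stop-word list is a deliberately tiny hypothetical one, and the two summaries are invented; the preprocessing we actually used may differ.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "his", "her", "in", "of", "the", "to"}

def word_frequencies(summaries):
    """Return the relative frequency of each non-stop-word across all summaries."""
    counts = Counter()
    for text in summaries:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

# Invented summaries, for illustration only.
summaries = [
    "A father searches for his family.",
    "The family celebrates love and the wedding.",
]
freqs = word_frequencies(summaries)
for word, share in sorted(freqs.items(), key=lambda item: -item[1])[:3]:
    print(f"{word}: {share:.2f}")
```

In a word cloud, each word would then be drawn with a size proportional to its value in `freqs`.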

We notice that the words family, love and father occur much more often in Indian movies than in American ones. This might reflect underlying cultural differences, but it would be premature to conclude anything at this stage of the analysis. It is also surprising to see that the word woman is more prominent than the word girl in the US, whereas the opposite holds in India.

This can perhaps be put in perspective with what we observed in the previous section: the mean age of actresses in the American-dominated cluster of romantic movies is 33 years, whereas it is 27 years in the Indian-dominated clusters (a 6-year gap!). Moreover, the topics of family and marriage are also much more important in Indian romance than in American romance.

Although this representation gives us a good idea of the differences between the subjects covered by the two industries, it stays shallow and doesn't give us information about the context or the type of movie those words appear in.

Actor data

Age and gender

We can observe a significant difference between the ages of actresses in Indian and in American movies (we ran a t-test between the two samples and obtained a significant p-value, cf. notebook). Actresses are younger in Indian movies than in American ones.
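As an illustration of this kind of comparison, here is a minimal sketch of a two-sample t statistic. It assumes Welch's unequal-variance variant, which may not be the exact test run in the notebook, and the ages below are invented toy values, not our data.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Invented toy ages, for illustration only.
indian_actress_ages = [24, 26, 27, 29, 31]
american_actress_ages = [28, 31, 33, 35, 38]

t_stat, dof = welch_t(indian_actress_ages, american_actress_ages)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A negative t statistic means that the first sample has the lower mean. Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both values.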

It might mainly be driven by the sample size, but we can propose some possible factors to explain why Indian actresses might be younger than the American actresses:

  • Cultural preferences: In some cultures, including Indian culture, youth and beauty are often prized and valued highly. This may lead to a preference for younger actresses in the film industry.
  • Physical demands of acting: Acting in movies can be physically demanding, and younger actresses may be better able to handle long shooting schedules and demanding roles.
  • Demands of the film industry: In some cases, roles may require actresses to have a certain appearance or physicality, which may be more common in younger actresses.

It is also worth noting that there is a wide range of ages among actresses in Indian movies, and many actresses in India continue to work in the film industry well into their 40s, 50s, and beyond.

Number of films per actor

Our dataset is probably biased in the quantity of information it holds about American films: we are far more likely to have the names of actors who make a single brief appearance in an American film. At the other extreme, however, we can see that there are many more ultra-prolific Indian film actors than American ones.

In both film industries, however, most of the hyper-prolific actors are male, which is probably linked to the difference in career prospects with age.

There is also some evidence that actresses tend to have fewer opportunities to play leading roles than male actors, in both Hollywood and Bollywood. Indeed, in 2015, the Center for the Study of Women in Television and Film at San Diego State University published a report titled "It's a Man's (Celluloid) World," which found that women made up just 12% of protagonists in the top 100 grossing films of 2014.

This phenomenon, known as the "gender gap" or the "female movie deficit," refers to the underrepresentation of women in the film industry, both in front of and behind the camera.

There are also more factors that may contribute to the gender gap in the film industry, including societal and cultural biases, the lack of diverse and complex female characters in film scripts, and the limited number of female directors and producers.

This is once again an angle that could be included in further analysis regarding trends and film success (ratings) prediction.

Did you know?

India is a country where more than a dozen languages are spoken! Silent films brought all audiences together! But when Indian cinema entered the sound era, the use of music and dance became a way to homogenise the national market across linguistic divides!

Topic detection and further analysis

Extracting topics using Latent Dirichlet Allocation

Navigating the visualization

Topic Bubble
  • The left panel shows the distribution of the topics in a 2-dimensional space. Each topic is represented as a bubble.
  • The larger the bubble, the more frequent the topic is in the documents.
  • Since we have a low number of topics (12), we have big non-overlapping bubbles, scattered throughout the chart.
  • The distance between two topics approximates the semantic relationship between them.
  • Topics that share common words will overlap, in contrast to topics that do not.
Horizontal Bar Graph
  • The bar graph shows the frequency distribution of the words in the documents (in blue).
  • The red area describes the frequency of each word in a given topic.
  • When selecting a topic (clicking on a topic bubble), the top 30 most relevant terms for the topic are shown.
  • Hovering over a word in the right panel resizes the bubbles of the topics that contain it. The size of a bubble then describes the weight of the word in that topic: the higher the weight of the selected word, the larger the bubble.
Relevance Metric
  • The words of a topic can be re-ranked by varying the lambda parameter of the relevance metric (slider at the top right), which goes from 0 to 1.
  • Decreasing lambda increases the weight of the ratio (frequency of the word given the topic / overall frequency of the word in the documents). Lower values of lambda therefore favour words that are more specific to the topic, and those words move up the ranking.
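The effect of lambda can be made concrete with a small sketch. It uses the relevance definition from Sievert and Shirley's LDAvis, relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)); that our visualization uses exactly this formula is an assumption, and the probabilities below are invented for illustration.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

def rank_terms(topic_probs, corpus_probs, lam):
    """Return the words of one topic, most relevant first."""
    scores = {
        word: relevance(p, corpus_probs[word], lam)
        for word, p in topic_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Invented probabilities: "find" is frequent everywhere,
# "vampire" is rarer overall but concentrated in this topic.
topic_probs = {"find": 0.05, "vampire": 0.02}
corpus_probs = {"find": 0.04, "vampire": 0.002}

print(rank_terms(topic_probs, corpus_probs, lam=1.0))  # ranks by in-topic frequency
print(rank_terms(topic_probs, corpus_probs, lam=0.0))  # ranks by specificity (lift)
```

With lambda = 1 the generic but frequent word comes first; with lambda = 0 the topic-specific word overtakes it.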


This topic analysis was successful in detecting and separating topics. We find a diversity of topics that span a wide range of vocabulary. However, a limitation of this analysis is that we are not able to detect the semantic relationship between the words in a given topic, nor can we infer the context in which they appear.

This plot, while being very informative, doesn't inform us about the use of said topics in each movie industry. We extracted those topics from a dataset containing both movie industries, and we are not able to see how they are used in each industry.

This is why our next step is to explore the differences between the topics in each movie industry.

Labeling topics

Let us identify the topics and label them. To do that, we have observed every topic bubble and identified the keywords that made the most sense to describe the topic.

This enables us to label each topic specifically rather than with a general genre, since we already have genres in our features. It was difficult to pinpoint a specific theme, since movie stories can be very broad and unique, so we decided to choose the most relevant terms as keywords to describe the topics.

If at any point you need a refresher as to what each topic is about, you can click on the blue square to the bottom right of your screen

We therefore came up with a short description for each topic:

  • Topic 1: We chose "experiment, scientist, power, creature" as relevant keywords. This is probably related to science fiction, with stories about experiments and the creation of supernatural creatures. It seems to be linked to control and malevolent forces (keywords power, world, use, destroy).
  • Topic 2: We chose "body, vampire, child, night" as the relevant keywords. Even though the topic of the family is present (keywords child, home, mother), it seems to be a topic in the supernatural register, with a horror component (keywords run, dead, death, attack, body, night). A sense of looming revelation also seems to be captured by this topic, through strong verbs (keywords reveal, discover, appear, realize).
  • Topic 3: "money, steal, prisoner, bank, drug" made the most sense to depict the topic. It seems pretty straightforward that the movies in this topic are about crimes involving money, such as theft, bank break-ins and drug gangs. The consequences of crime are also captured by the LDA (keywords prison, guard, prisoner, police, catch). There is also a supernatural component in this topic, with the keywords dragon, stone and spell.
  • Topic 4: "team, game, coach, player" were the keywords chosen for this topic. It is also straightforward that the topic revolves around the world of sport, as several sports are named in the keywords (basketball, football, hockey). The vocabulary of a game is also there (keywords win, lose, score, play, first, start, decide).
  • Topic 5: The keywords "ship, alien, earth, attack" also make it pretty straightforward to get an idea of the topic. It seems to revolve around sci-fi movies, probably involving space adventures (keywords ship, alien, earth), with conflict and violence as well (keywords attack, destroy, death, shoot, fire, use).
  • Topic 6: "murder, police, gang, fight, crime". It is also pretty straightforward that the subject revolves around killing (keywords murder, crime, gang, killer, death, shoot). The crime probably involves the family (keywords father, daughter, wife, brother, family, mother), so this topic would be closer to a psychological thriller than topic 3, which is more about organizations and money.
  • Topic 7: We chose "agent, shoot, military, bomb" as relevant keywords to depict the topic. This topic seems to revolve around the use of military force (keywords military, soldier, bomb, fire, shoot), perhaps to protect an important entity (keywords president, plane, plan, decide, order) from a group or person that wants to harm it (keywords group, attempt, bomb).
  • Topic 8: The keywords "band, family, show, music, dream" seem to describe stories that revolve around music and the pursuit of a dream career in that field. The family theme is present (keywords family, mother, home), along with romance (keywords girlfriend, relationship, together). We also have the theme of desire, probably the desire to succeed in music (keywords dream, want, career, start).
  • Topic 9: The "family, father, mother, daughter, wedding" keywords are also very straightforward: movies in this topic revolve around the family, especially around romance and a relationship between a man and a woman that leads to marriage and starting a family together (keywords husband, wife, marry, wedding, pregnant, child). The professional world is also depicted (keywords work, money, business).
  • Topic 10: "home, school, night, parent, party, decide" were the keywords that we chose to describe this topic. Although it is less straightforward than some of the previous topics, it seems to revolve around family (keywords mother, father, parent, home, family), probably with a teenager (keyword school), and to involve some conflicts or discussions around that (keywords party, ask, want, talk).
  • Topic 11: The keywords "escape, camp, attack, truck" were chosen for this topic. It also depicts the use of force and violence, so this topic will be found more often in action movies. The vocabulary of protecting something is also present (keywords fight, rescue, town, monster).
  • Topic 12: This last topic's keywords are "movie, woman, play, character, role". It seems to be about life in the movie business (keywords role, character, play, movie, lead, story). The characters involved seem to be mostly female (keywords woman, girl, wife).

Topics distribution between movie industries

This plot shows the distribution of the mean normalized topic prevalence in each movie industry. We replaced the topic numbers with the words that are most representative of each topic.
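One way to compute such a value is sketched below. It assumes that "normalized" means a z-score standardization of each topic's prevalence across all movies, followed by a mean per industry; this interpretation, the function names and the toy numbers are assumptions and are not taken from our notebook.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score a list of values (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]

def mean_standardized_by_group(prevalence, groups):
    """Mean z-scored prevalence of one topic within each group."""
    by_group = {}
    for score, group in zip(standardize(prevalence), groups):
        by_group.setdefault(group, []).append(score)
    return {group: mean(scores) for group, scores in by_group.items()}

# Invented prevalence of a "family, wedding" topic in six movies.
prevalence = [0.50, 0.40, 0.10, 0.05, 0.10, 0.05]
industry = ["India", "India", "USA", "USA", "USA", "USA"]

result = mean_standardized_by_group(prevalence, industry)
print(result)
```

A positive mean indicates that the topic is more prevalent in that group than in the dataset as a whole, and a negative mean indicates the opposite.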

We can see that the topics are not distributed uniformly across the movie industries. For example, the topic containing love, marriage, wedding is significantly more prevalent in the Indian movie industry than in the American one. We can also see that the topic containing home, school, work, party is more prevalent in the American movie industry than in the Indian one.

Those differences could be due to underlying cultural differences between the two countries, but also could be due to a bias in our dataset.

Topics indirectly inform us about the content of the movies through their summaries, but so do genres. How does the movie genre relate to the topics covered in the movie? Can we identify differences between the two movie industries and refine our understanding of them through the intersection of topics and genre?

Putting it all together

In this part, we offer you an in-depth study of 4 movie genres that we have selected to show you that the notion of movie genre is not uniform across the globe. This part of the analysis can be explored in many ways and we encourage you to play with the different visualizations and the different genres. We also include an analysis of one genre below the plot to help you understand how to navigate through the visualizations.

Navigating the visualizations

Changing movie genre

To switch from one movie genre to another, simply click on the corresponding button just below. The left and right panels will update accordingly.

Left panel

The left panel contains a plot of the t-SNE (t-Distributed Stochastic Neighbor Embedding) latent space for Indian and American movies. This plot was created using standardized actor data and the prevalence of the various topics in the movies. t-SNE is a way of visualizing high-dimensional data in a lower-dimensional space.

There are two visualization options:

  • Clicking on Country will differentiate the movies by country. Sticking with the color code we previously used, Indian movies will be colored in orange and American movies in blue.
  • Clicking on K-means will differentiate the movies by the cluster they belong to. The clusters are computed using the K-means algorithm on the t-SNE latent space.
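To make the clustering step concrete, here is a minimal sketch of the K-means (Lloyd's) algorithm on 2-D points. The toy coordinates stand in for t-SNE coordinates and are invented; in practice a library implementation such as scikit-learn's would normally be used, and nothing below is taken from our notebook.

```python
import random

def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize on k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Invented stand-ins for 2-D t-SNE coordinates: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster ids themselves are arbitrary: only the grouping of the points is meaningful, which is why the clusters are identified by color in the plots.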

Right panel

On the right panel, you can switch between 3 cluster representations. You can relate each cluster to the data that describe it via its color.

There are three visualization options

  • Clicking on the Topic option will show you the mean standardized prevalence of each topic in each cluster. Hovering the cursor over a cluster will show you the mean prevalence of each topic in that cluster, ordered from the most (top) to the least (bottom) prevalent topic. Negative values indicate that the topic is under-represented in the cluster compared to its overall prevalence in our dataset.
  • Clicking on the Country option will show you the country representation in each cluster. As we have an unbalanced dataset with more American than Indian movies, we have normalized those values by dividing them respectively by the total proportion of Indian and American movies in our dataset.
  • Clicking on the Actor option will show you the raw mean of each cluster for different statistics about actors.
    These include (from left to right):
    • The mean age of actors in the films, per cluster
    • The mean age of actresses in the films, per cluster
    • The mean number of films in the dataset that the actors and actresses played in, per cluster
    • The mean percentage of actresses in the cast, per cluster
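The normalization used for the Country option can be sketched as follows. The sketch assumes that, for each cluster, the share of a country within the cluster is divided by that country's share in the whole dataset; the function name and the toy data are hypothetical.

```python
from collections import Counter

def normalized_representation(countries, clusters):
    """For each cluster: share of each country divided by its overall share."""
    overall = Counter(countries)
    n_total = len(countries)
    per_cluster = {}
    for country, cluster in zip(countries, clusters):
        per_cluster.setdefault(cluster, Counter())[country] += 1
    result = {}
    for cluster, counts in per_cluster.items():
        size = sum(counts.values())
        result[cluster] = {
            country: (counts[country] / size) / (overall[country] / n_total)
            for country in overall
        }
    return result

# Invented toy data: 6 American and 2 Indian movies in 2 clusters.
countries = ["USA"] * 6 + ["India"] * 2
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

representation = normalized_representation(countries, clusters)
print(representation)
```

A value above 1 means that the country is over-represented in the cluster relative to its share of the dataset, and a value below 1 means that it is under-represented. In the toy data, Indian movies make up half of cluster 1 but only a quarter of the dataset, which gives a value of 2.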

t-SNE latent space, with k-means clustering

Actor data, country-wise distributions and topic repartition in each cluster

You will find many examples by navigating through this plot. We detail one of them here.

Let's take the genre 'comedy'. The K-means algorithm was set to 4 clusters. By selecting Country on the right panel, we can observe that Indian movies are largely over-represented in cluster 1. In cluster 2, on the other hand, it is the American movies that are largely over-represented.

Now let's focus on the standardized topic prevalence.
(You can access topic differences by clicking on the box with the downward arrow on the top left of the right-side plot and selecting)

The most important topic in cluster 1 is the one centered on family and daily life, whereas in cluster 2 this topic doesn't seem to be covered more than average. Moreover, we see that the most important topic in cluster 2 is about ships, aliens and discoveries, while in cluster 1 this topic is much less present than average (negative mean normalized topic prevalence).

We can deduce from this analysis that in India, more comedies are made around a family's daily life, and they very rarely add a sci-fi/adventure element, whereas in the US, comedies mixed with some action seem to be very frequent. However, it is also important to notice that other clusters, or sub-genres, such as cluster 0 or cluster 3, are represented equally in both countries.

We can also notice, when comparing cluster 1 and cluster 2 in the Actor visualization, that women represent a smaller proportion of the cast in cluster 2 than in cluster 1.

Final words

Our study suggests that “Bollywood” and “Hollywood” films are indeed different. A first analysis of different characteristics shows that they differ notably in their genres, actor features and runtimes. The topics covered in the movies have similarities, but some subjects, such as family or weddings, are more present in Indian movies, while American movies have more stories about school or parties. This analysis shows that there is a cultural difference between American and Indian cinema, even when movies are labeled as belonging to the same genre.

We have seen that drama is the most common genre in both America and India. However, in the last part of our analysis, we see that these movies share topic keywords in some clusters but differ in others. Indeed, two clusters stand out, one for each industry. We have observed that one type of Indian drama is more focused on love, romance and relationships, whereas one type of American drama revolves more around action, with crime and violence (cf. t-SNE results and country proportion/topic repartition on drama genre, cluster id 0 and 3). Furthermore, it is interesting to see that drama contains many subgenres and a whole spectrum of different stories.

This result supports the idea that the movie industry cultures of two radically different countries are rather difficult to compare and quite complex. Much care needs to be taken when trying to capture the subtle differences that lie between the two cultures.

It is hence reductive to try to label each movie (which is, to some extent, a representation of its country's culture: cf. « Movies and Culture: The Role of Films in Shaping Societal Norms and Values » by Jennifer A. Fritsche, published in the Journal of Social and Political Psychology in 2016.) and expect a meaningful comparison. It was important for us here to observe the differences instead, to deconstruct our biases towards a different culture and to try to take a step back. It was extremely fulfilling to take a critical view of how heavily Hollywood, and American cinema more broadly, influences Western culture (cf. "The Americanization of European Cinemas: Hollywood's Influence on Local Film Industries" by Mark Jancovich, published in the journal Screen in 2002.).

In addition, we have to keep in mind the bias in our dataset. Indeed, there are many more American films than Indian ones in our data, although the Indian industry is known for the large number of films it produces. The films in our data are potentially just the tip of the iceberg. We may also have only the famous films, which are not representative of the entire Indian film culture. In support of the idea that the selection of Indian films in the dataset was shaped by the Western view of Indian cinema, we noted the overwhelming presence of the (reductive) genre 'world cinema' among the labels of the Indian movies.

Finally, we started this analysis with the assumption that Bollywood, by the very nature of its name, was inevitably influenced and shaped by Hollywood. Going through the analysis and learning more and more about the Indian movie industry was enlightening, and we realized how much these two industries can be considered as entities with their own identities.

Indian cinema, now a $2 billion industry, has much to offer. A treasure trove of stories and songs, its films are an art form that has grown to encompass every facet of the nation's culture. From the epic historical films to the glossy masala movies, from the arthouse parallel cinema to the song-and-dance spectaculars, Indian cinema has something for everyone. In fact, it is the largest and most diverse film industry in the world, and it is now finding an increasingly global audience.

Shah Rukh Khan, Indian actor and film producer