How to Analyse YouTube Data Using MapReduce? | A Hadoop MapReduce Use Case
Hello, and welcome to ACADGILD! In this tutorial, you will learn to analyze data from YouTube using Hadoop MapReduce.
Did you know that 300 hours of video are uploaded to YouTube every single minute? And that these videos are made available to more than 1 billion YouTube users across 75 countries in 61 languages?
Just imagine the volume of data that is generated on YouTube every minute, and how you can make use of this information to measure how well your video marketing efforts are faring.
The YouTube data is publicly available and can therefore act as a powerful tool for video marketers who wish to analyze their competitors’ videos as well. However, video uploaders may disable the option that makes their statistics available to the public.
Now, let us see how data from YouTube can be analyzed using Hadoop.
The Data Set
Column 1 gives the 11-character video ID.
Column 2 gives information about the uploader of the video.
Column 3 gives the number of days between when YouTube was established and the date when the video was uploaded.
Column 4 gives information about the category of the video.
Column 5 gives information about the length of the video.
Column 6 states the number of views for the video.
Column 7 gives the rating for the video.
Column 8 gives the number of ratings received for the video.
Column 9 gives the number of comments received for the video.
Column 10 gives related video IDs for the video uploaded.
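To make the schema concrete, a single tab-separated record might look like the line below. Every value in it is made up purely to illustrate the layout; it is not taken from the actual data set.

AbC1dEf2GhI	someUploader	1254	Music	213	54321	4.7	120	45	XyZ3aBc4DeF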
Problem Statement 1
Here, we will find the top 5 categories in which the maximum number of videos have been uploaded.
The Mapper Code
From the mapper, we need to emit the video category as the key and the constant int value ‘1’ as the value. These pairs are passed to the shuffle and sort phase and then on to the reducer phase, where the aggregation of the values is performed. A sketch of the full mapper code follows this walkthrough.
In line 1, we are taking a class by the name Top5_categories.
In line 2, we are extending the Mapper default class that has the arguments:
KeyIn as LongWritable
ValueIn as Text
KeyOut as Text and
ValueOut as IntWritable.
In line 3, we are declaring a private Text variable ‘category’ which will store the category of videos on YouTube.
In line 4, we are declaring a private final static IntWritable variable ‘one’, which will be constant for every value.
MapReduce deals with key and value pairs. Here, we set the key as the video category and the value as the constant ‘one’.
In line 5, we are overriding the map method, which will run once for every line of the input file.
In line 7, we are storing the line in a string variable ‘line’.
In line 8, we are splitting the line using the tab (‘\t’) delimiter and storing the values in a String array so that all the columns in a row are stored in it.
In line 9, we are taking an if condition: if the string array has a length greater than 6, meaning the row has at least 7 columns, the code inside the if block is executed. This check eliminates the ArrayIndexOutOfBoundsException.
In line 10, we are storing the category, which is in the 4th column (array index 3).
In line 12, we are writing the key and value to the context, which will act as the output of the map method.
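The mapper source itself is not reproduced in this transcript, so here is a minimal sketch that matches the walkthrough above. The class name Top5_categories, the two fields, the tab split, the length check, and the category column come from the description; the inner class name ‘Map’ and the exact layout are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Top5_categories {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Output key: the video category
        private Text category = new Text();
        // Output value: a constant 1 for every record
        private final static IntWritable one = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // The data set is tab-delimited
            String[] str = line.split("\t");
            // Skip rows with fewer than 7 columns to avoid an
            // ArrayIndexOutOfBoundsException
            if (str.length > 6) {
                category.set(str[3]); // the category is the 4th column
                context.write(category, one);
            }
        }
    }
}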
The Reducer Code
Line 1 extends the default Reducer class with arguments KeyIn as Text and ValueIn as IntWritable, which are the same as the outputs of the mapper class. KeyOut as Text and ValueOut as IntWritable will be the final outputs of our MapReduce program. A sketch of the reducer follows this walkthrough.
In line 2, we are overriding the reduce method, which will run once for every key.
In line 3, we are declaring an integer variable sum which will store
the sum of all the values for every key.
In line 4, a for-each loop is used; it runs once for each of the values in the Iterable of values that comes from the shuffle and sort phase after the mapper phase.
In line 5, we are calculating the running sum of these values.
Line 7 writes the respective key and the obtained sum as the value to the context.
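Again, the reducer source is not shown in this transcript; the sketch below follows the description above and is assumed to sit inside the same Top5_categories class, alongside the Map class. The inner class name ‘Reduce’ is an assumption.

    // Placed inside the Top5_categories class shown earlier.
    // Also requires: import org.apache.hadoop.mapreduce.Reducer;
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // Add up the 1s emitted by the mapper for this category
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Emit the category and its total number of uploads
            context.write(key, new IntWritable(sum));
        }
    }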
The Configuration Code
Two configuration statements are included in the main (driver) class to declare the output key type and the output value type of the mapper.
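The driver code is not reproduced in the transcript either. The following is a minimal sketch of a main method containing the two mapper output type declarations described above; the job name and the remaining job settings are assumptions added only to make the sketch runnable.

    // Driver (main) method, also inside Top5_categories.
    // Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
    // org.apache.hadoop.mapreduce.Job,
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat and
    // org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "top5_categories"); // job name is an assumption
        job.setJarByClass(Top5_categories.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // The two configuration statements described above: they declare the
        // mapper's output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final output types of the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }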
How to Execute?
Here, ‘hadoop’ specifies that we are running a Hadoop command, ‘jar’ specifies the type of application we are running, and top5.jar is the jar file that has been built from the above source code.
The input file path in our case is the root directory of HDFS, denoted by /YouTubedata.txt, and the location to store the output has been given as /top5_out.
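Putting those pieces together, the full command would look roughly like the line below. The jar, class, and path names follow the assumptions used earlier in this tutorial.

hadoop jar top5.jar Top5_categories /YouTubedata.txt /top5_out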
Just watch the video completely to get all the other outputs!
Learn Hadoop today and stay ahead of your competitors!
For more updates on courses and tips follow us on:
Facebook: https://www.facebook.com/acadgild
Twitter: https://twitter.com/acadgild
LinkedIn: https://www.linkedin.com/company/acadgild