‘Zooming in’ to Videos – Behind the scenes


It is an age of media. We live a lifestyle in which we seldom think about the technology behind it all. While watching a video on YouTube, during a video chat, while casting our phone’s screen to the television, or while recording our precious moments, few of us think about the innumerable steps involved in their creation. An engineer may understand some of the underlying steps, but a layman need not worry about the entire flow. Still, at times it is a joy to know about and appreciate these technological advancements.

One such technology that we use on a day-to-day basis is video. Let’s dig into it a bit. A few statistics about our favourite, YouTube:

  • Around 500 hours of video are uploaded to it every minute.
  • Around 6 billion hours of video are watched on YouTube every month, on average.

These are just a few statistics from one of the many video streaming websites. There exists a plethora of streaming services like Netflix, Hooq, Google Movies, Hotstar, Voot, etc. Furthermore, there are chat applications supporting video calls, like Skype, Hangouts, etc. Apart from these, the television plays videos from a satellite connection; people are moving towards connected / surveillance cameras for better security; and Virtual Reality headsets now rely on 360-degree videos. The number of use cases is increasing by the day.

Just for some insight, let us look at the basic technology behind a video. A video is nothing but a set of static pictures played back-to-back at a very high speed, resulting in the appearance of continuous motion to our eyes. Human persistence of vision is around 64 ms (milliseconds), i.e., it takes about 64 ms for an image to register in the brain. If we replace a picture with the next one within that duration, it creates a perception of continuous motion. To achieve this effect, early video cameras recorded at about 16 frames per second (fps) (~62.5 ms per frame). With advancements in technology we moved to 24 fps and then to 30 fps. Currently, we are on the edge of moving towards 60 fps (almost there) and 120 fps (expected by 2017) (about 16.66 ms and 8.33 ms per frame).
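The arithmetic above can be sketched in a few lines – computing the per-frame interval for each frame rate and checking it against the ~64 ms persistence-of-vision threshold the post uses:

```python
# Frame interval (ms) at each frame rate mentioned above.
# Any interval at or below ~64 ms is perceived as continuous motion.
PERSISTENCE_MS = 64

for fps in (16, 24, 30, 60, 120):
    interval_ms = 1000 / fps
    smooth = interval_ms <= PERSISTENCE_MS
    print(f"{fps:>3} fps -> {interval_ms:6.2f} ms per frame, "
          f"continuous motion: {smooth}")
```

At 16 fps the interval is 62.5 ms, just inside the threshold, which is why early cameras could get away with it; higher frame rates leave progressively more headroom.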

Apart from the fps improvements, there have been improvements in picture sizes as well. There was a time when a picture size of 640 x 480 was considered advanced. Nowadays, however, 1920 x 1080 video is the norm on a smartphone, and we are moving towards 2560 x 1440 and 3840 x 2160 content. By 2017, even 7680 x 4320 videos will be in demand for TVs and larger devices.

Coming to the internet speed needed for videos – we call it internet bandwidth. The smallest element of a picture is called a pixel (short for picture element); basically, the smallest visible dot on a picture. It requires 3 bytes of memory to store the Red, Green and Blue (RGB) colours of a pixel. Mathematically, a 1080p picture (1920 x 1080) contains a total of 2,073,600 such pixels. So, the total network speed needed to stream such content at the standard 30 fps would be 2073600 x 30 x 3 x 8 = 1,492,992,000 bits per second ≈ 1.5 Gbps. No ISP worldwide provides such a high network speed, and this calculation is for a mere 1080p picture; the streaming speed needed for UHD (3840 x 2160) content would be ~6 Gbps (4 times that of 1080p). There is one more issue with storing just the raw format: a single 1080p frame takes about 6 MB of disk space. If we store videos uncompressed, a 1-minute video takes about 10 GB of space on the hard disk. Not feasible, right! So, how did we manage to view videos despite these obstacles of space and bandwidth? The answer is compression. The following paragraphs will be technically informative; one may continue reading to get an overview of video codecs – the technology behind videos.
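The raw-video maths above can be reproduced step by step – pixels per frame, bytes per frame, streaming bitrate, and storage for one minute:

```python
# Cost of raw (uncompressed) 1080p video at 30 fps, as derived above.
width, height = 1920, 1080
fps = 30
bytes_per_pixel = 3                        # one byte each for R, G, B

pixels = width * height                    # 2,073,600 pixels per frame
frame_bytes = pixels * bytes_per_pixel     # ~6 MB per frame
bitrate_bps = frame_bytes * 8 * fps        # ~1.49 Gbps to stream
one_minute_bytes = frame_bytes * fps * 60  # ~10.8 GB for one minute

print(f"pixels per frame : {pixels:,}")
print(f"bytes per frame  : {frame_bytes / 1e6:.1f} MB")
print(f"raw bitrate      : {bitrate_bps / 1e9:.2f} Gbps")
print(f"one minute, raw  : {one_minute_bytes / 1e9:.1f} GB")
```

Scaling `width` and `height` to 3840 x 2160 quadruples every figure, which is where the ~6 Gbps UHD estimate comes from.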

Why video codecs, when photography codecs like JPEG, PNG, etc., already exist? Essentially, JPEG and PNG do something similar to what is done for videos, but they work on only one picture at a time. Compression in photography formats considers only the spatial redundancy within a picture, i.e., the more similar the pixels in the picture, the smaller the size for the same quality. For videos, however, the gap between two consecutive frames – about 33 ms at 30 fps – is also taken into consideration, something photography codecs ignore. The motion between two such frames is so small that we can use this fact to further reduce the size of the video. Thus, the first frame of a video is essentially compressed only spatially, while the rest of the frames have both spatial and temporal compression applied. To bring things into perspective, for a visually good-quality 1080p picture, JPEG typically gives a compression of about 15 times, which brings the size of the image down to around 6 MB / 15 = 400 KB. Whereas a typical 1080p 30 fps video of good visual quality runs at about 20 Mbps, which translates to about 83 KB per frame.
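The per-frame comparison above is easy to check: a JPEG at roughly 15x spatial-only compression versus an average frame from a 20 Mbps stream that also exploits temporal redundancy (the 15x and 20 Mbps figures are the post's ballpark numbers, not exact constants):

```python
# Per-frame size: spatial-only (JPEG) vs spatial + temporal (video codec).
raw_frame_bytes = 1920 * 1080 * 3               # ~6.2 MB uncompressed
jpeg_frame_bytes = raw_frame_bytes / 15         # ~415 KB, spatial only

video_bitrate_bps = 20_000_000                  # 20 Mbps stream
fps = 30
video_frame_bytes = video_bitrate_bps / 8 / fps  # ~83 KB on average

print(f"raw frame   : {raw_frame_bytes / 1e3:.0f} KB")
print(f"jpeg frame  : {jpeg_frame_bytes / 1e3:.0f} KB  (spatial only)")
print(f"video frame : {video_frame_bytes / 1e3:.0f} KB  (spatial + temporal)")
```

Exploiting the similarity between consecutive frames buys roughly another 5x on top of what spatial compression alone achieves.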

The process by which a video is compressed is called encoding. Encoding is done by a device or software that wants to store or transmit a video. The reverse process is decoding, where an encoded bit stream is uncompressed back to raw RGB format, which can then be viewed by the audience. The component that does encoding is called an encoder; similarly, the component that does decoding is called a decoder. An encoder and a decoder work in tandem to deliver videos to users, and the two words are combined into Encoder-Decoder, or codec.

There are various companies working on their own encoders and decoders. Companies like Apple, Microsoft and Google have come up with their own codec formats to support videos. This led to a variety of different video codecs, adding to the confusion. To standardise everything worldwide and have mutual consent between the various encoding and decoding groups, standardisation committees were established and multiple formats created. A few of the famous formats used worldwide are MPEG-2, MPEG-4, H.264 / AVC, H.265 / HEVC, VP8 and VP9. Each of them was developed with a different use case and a separate problem in mind. The HEVC and VP9 standards are the latest, and are still being adopted by the major providers of decoding solutions. H.264, the most widely adopted format, can be seen in various places like set-top boxes and TVs, video chat applications, and online streaming services like YouTube, Netflix, etc.

Summing up, the journey of video codecs has been an intense one. This is one service that we use on a daily basis and seldom think about all the effort going on behind the scenes. The same is true of other technologies, but at times it is fascinating to know more about some of them. This was just an appreciation of one of the many facets of day-to-day technology.


Super geeky and always abreast with latest technologies and happenings in the mobile world, Mohit works in developing kick-ass video codecs for mobile chipsets during the day and sports the cape of content author at BoredHomosapien during the night.

I comment. You comment. We comment.