The first thing you must realize is that none of these pictures are to scale. Next you have to know that these pictures are a vague representation of the layout of a Matroska file as a full representation would be just as complex as the specs themselves.
The first picture is a simple representation of a Matroska file.
The Header contains information saying what EBML version this file was created with, and what type of EBML file this is. In our case it is a Matroska file.
The Metaseek section contains an index of where all of the other groups in the file are located, such as the Track information, Chapters, Tags, Cues, Attachments, and so on. This element isn't technically required, but without it you would have to search the entire file to find all of the other Level 1 elements. This is because any of the items can occur in any order. For instance, you could have the Chapters section in the middle of the Clusters. This is part of the flexibility of EBML and Matroska.
The Segment Information section contains basic information relating to the whole file. This includes the title for the file, a unique ID so that the file can be identified around the world, and if it is part of a series of files, the ID of the next file.
The Track section has basic information about each of the tracks. For instance, is it a video, audio or subtitle track? What resolution is the video? What sample rate is the audio? The Track section also says what codec to use to view the track, and has the codec's private data for the track.
The Chapters section lists all of the Chapters. Chapters are a way to set predefined points to jump to in video or audio.
The Clusters section has all of the Clusters. These contain all of the video frames and audio for each track.
The Cueing Data section contains all of the cues. Cues are the index for each of the tracks. They are a lot like the Metaseek, but they are used for seeking to a specific time when playing back the file. Without them it is possible to seek, but it is much more difficult because the player has to 'hunt and peck' through the file looking for the correct timecode.
The Attachment section is for attaching any type of file you want to a Matroska file. You could attach anything, pictures, webpages, programs, even the codec needed to play back the file. What you attach is up to you. (Someone might even want to attach an Ogg, or maybe another Matroska file some day?!?) In the future we want to come up with a standard way to label things like an album cover of a CD.
The Tagging section contains all of the Tags that relate to the file and each of the tracks. These tags are just like the ID3 tags found in MP3s. They hold information such as the singer or writer of a song, the actors that were in the video, or who made the video.
While EBML allows elements of the same level to be in no particular order, for better use in streaming contexts (and with no drawback for local playback) we have introduced a few guidelines on the order of certain elements.
Here is a more complex representation of a Matroska file. This one lists some of the elements as examples. Each of these elements is described in the specs.
The Header must occur at the beginning of the file. This is how the library knows whether or not it can read the file. The design of EBML is pretty straightforward and is its own project on SourceForge. EBML is not just for Matroska; there are many different potential applications for it. As such, there is the possibility of there being new versions, such as a 2.0 design. The EBMLVersion element lets the parser know first whether it can read this file at all. If the EBMLVersion is set to 2.0, and the library is only able to read up to 1.2, then it knows it shouldn't even attempt to read this file.
The DocType tells us that this is a Matroska file. If the DocType says that this is a "Bob's Container Format", then any parser designed for Matroska will know right away that even if it can parse the EBML, it's not going to know what to do with the data inside of this file.
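The two checks described above can be sketched in a few lines. This is only an illustration, not code from any Matroska library: the `EbmlHeader` class, the `can_read` function, and the two constants are hypothetical names, and the version is modelled as a (major, minor) pair purely to mirror the "2.0" versus "1.2" example in the text.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical limits of an imaginary parser (assumptions for this sketch).
MAX_SUPPORTED_EBML_VERSION = (1, 2)
SUPPORTED_DOCTYPES = {"matroska"}


@dataclass
class EbmlHeader:
    ebml_version: Tuple[int, int]  # value read from EBMLVersion, as (major, minor)
    doc_type: str                  # value read from DocType


def can_read(header: EbmlHeader) -> bool:
    """Return True only if both the EBML version and the DocType are understood."""
    if header.ebml_version > MAX_SUPPORTED_EBML_VERSION:
        # The EBML structure itself is newer than this parser: do not even try.
        return False
    # The EBML can be parsed, but the data is only meaningful for a known DocType.
    return header.doc_type in SUPPORTED_DOCTYPES


print(can_read(EbmlHeader((1, 0), "matroska")))                # readable
print(can_read(EbmlHeader((2, 0), "matroska")))                # EBML too new
print(can_read(EbmlHeader((1, 0), "Bob's Container Format")))  # unknown DocType
```

Checking the version first matches the order in the text: a parser that cannot read the EBML at all never gets as far as looking at the DocType.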
The Meta Seek section lets the parser know where the other major parts of the file are. The design is pretty simple. You should normally have just one SeekHead in a file. It holds a number of Seek entries, one for each seek point. The SeekID contains the "Class-ID" of a Level 1 element. For example, the Tracks element has a Class-ID of "[16][54][AE][6B]". You would put that in the SeekID, and then the byte position of that particular element in SeekPosition. The Meta Seek section is usually only used when the file is opened, so that the parser can get information about the file. Any seeking that happens when playing back the file uses the Cues.
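As a rough sketch of how a parser might use the Seek entries, the SeekHead can be treated as a lookup table from SeekID to SeekPosition. The function names and the byte positions below are made up for illustration. The IDs are written as hex strings; the Tracks value used here is the four-byte ID 0x1654AE6B from the Matroska specs, and the Cues value is 0x1C53BB6B.

```python
from typing import Dict, List, Optional, Tuple

# One Seek entry: (SeekID as a hex string, SeekPosition as a byte position).
SeekEntry = Tuple[str, int]

TRACKS_ID = "1654AE6B"
CUES_ID = "1C53BB6B"


def build_seek_index(entries: List[SeekEntry]) -> Dict[str, int]:
    """Turn the list of Seek entries into a SeekID -> SeekPosition table."""
    index: Dict[str, int] = {}
    for seek_id, seek_position in entries:
        # Keep the first position listed for each ID (a choice made for this sketch).
        index.setdefault(seek_id, seek_position)
    return index


def locate(index: Dict[str, int], seek_id: str) -> Optional[int]:
    """Return the stored position, or None if the element is not listed."""
    return index.get(seek_id)


# Illustrative positions only; they do not come from a real file.
seek_head = [(TRACKS_ID, 4200), (CUES_ID, 981234)]
index = build_seek_index(seek_head)
print(locate(index, TRACKS_ID))
print(locate(index, "DEADBEEF"))
```

When `locate` returns `None`, the parser is back in the situation described earlier: it has to walk the file to find that Level 1 element.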
The Segment Information portion gives us information that is vital to identifying the file. This includes the Title of the file and a SegmentUID that is used to identify the file. The ID is a randomly generated number. It also has the ID of any file that should be associated with it.
The Track portion tells us the technical side of what is in each track. The name of the track goes in Name. The track's number goes into the TrackNumber element. The TrackType tells us what the track contains, such as audio, video, subtitles, etc. There are also settings to tell us what language it is in, and what codec to use for playback of the track. Each Track has a unique ID called TrackUID, much like the ID for the whole file. This can be used when you are editing files and have several different versions, as it makes it easy to see which files have that specific track. The TrackUID is also used in the Tagging system.
I am, unfortunately, unable to give a more detailed description of Chapters at this time. I will describe these better when possible. Look at the specs for more information.
In a given Matroska file, there are usually many Clusters. The Clusters help to break up the Blocks and help with seeking and with error protection. There is no set limit to how much data a Cluster can contain, or how much time it can span, but so far developers seem to like to place the limit at 5 seconds or 5 megabytes. At the beginning of every Cluster is a timecode. This timecode is usually the timecode at which the first Block in the Cluster should be played back, but it doesn't have to be. Then there are one or more (usually many more) BlockGroups in each Cluster. A BlockGroup can contain a Block of data, and any information relating directly to that Block. For a more detailed description of the Block structure, see picture 3.
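The "5 seconds or 5 megabytes" habit could be expressed roughly as follows. This is a sketch of one possible muxer policy, not a rule from the specs: the constants, the `(timecode, size)` representation of a Block, and the `split_into_clusters` function are all assumptions made for the example.

```python
from typing import List, Tuple

# Limits taken from the rule of thumb in the text; they are conventions, not requirements.
MAX_CLUSTER_SECONDS = 5.0
MAX_CLUSTER_BYTES = 5 * 1024 * 1024

# A Block reduced to (timecode in seconds, payload size in bytes).
Block = Tuple[float, int]


def split_into_clusters(blocks: List[Block]) -> List[List[Block]]:
    """Group Blocks so that no Cluster spans 5 seconds or exceeds 5 megabytes."""
    clusters: List[List[Block]] = []
    current: List[Block] = []
    current_bytes = 0
    for timecode, size in blocks:
        if current:
            span = timecode - current[0][0]
            too_long = span >= MAX_CLUSTER_SECONDS
            too_big = current_bytes + size > MAX_CLUSTER_BYTES
            if too_long or too_big:
                # Close the current Cluster and start a new one with this Block.
                clusters.append(current)
                current = []
                current_bytes = 0
        current.append((timecode, size))
        current_bytes += size
    if current:
        clusters.append(current)
    return clusters


# Twelve small Blocks, one per second.
blocks = [(float(t), 1000) for t in range(12)]
print([len(c) for c in split_into_clusters(blocks)])
```

A single Block larger than the byte limit still gets a Cluster of its own here, since a Block is never split.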
The ReferenceBlock shown above, in the BlockGroup, is what we use instead of the basic "P-frame"/"B-frame" description. Instead of simply saying that this Block depends on the Block directly before, or directly afterwards, we put the timecode of the needed Block. And because you can have as many ReferenceBlock elements as you want for a Block, it allows for some extremely complex referencing.
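To make the referencing idea concrete, here is a small sketch that follows ReferenceBlock entries to find every Block needed before a given Block can be decoded. It models each Block only by its timecode, as the text does, and the `blocks_needed` function and the sample timecodes are hypothetical.

```python
from typing import Dict, List, Set

# Maps a Block's timecode to the timecodes listed in its ReferenceBlock elements.
References = Dict[int, List[int]]


def blocks_needed(references: References, timecode: int) -> Set[int]:
    """Return the timecodes of all Blocks that the given Block depends on,
    directly or through other referenced Blocks."""
    needed: Set[int] = set()
    stack = list(references.get(timecode, []))
    while stack:
        ref = stack.pop()
        if ref not in needed:
            needed.add(ref)
            # A referenced Block may itself reference other Blocks.
            stack.extend(references.get(ref, []))
    return needed


# Illustrative layout: the Block at 0 stands alone, 120 depends on 0,
# and 40 depends on both its neighbours.
refs = {0: [], 120: [0], 40: [0, 120]}
print(sorted(blocks_needed(refs, 40)))
print(sorted(blocks_needed(refs, 0)))
```

A Block with an empty reference list needs nothing else, which is the role a keyframe plays in the older "P-frame"/"B-frame" description.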
The Cues are what is used to seek when playing back a file. They form what is commonly known as an 'index'. In a single CuePoint, you have the timecode stored in CueTime, and then a listing of the exact position in the file for each of the tracks at that timecode. The Cues are pretty flexible about what exactly you want to index. For instance, you can index every single timecode of every Block in every track if you like, but you don't really need to. If you have a video file, you really just need to index the keyframes of the video track.
The Attachments section has a pretty simple design. You have an AttachedFile element. Inside of this you have the file's name stored in FileName, and the file itself is stored in FileData. You can also list a more readable name and the MIME-type.
And the Tags. These are possibly the most complex part of Matroska. Under the Tags element, you can have many Tag elements. Each Tag element contains all of the information pertaining to specific Track(s) and/or Chapter(s). Each Track or Chapter that those tags apply to has its UID listed in the tags. The Tags contain all extra information about the file: script writer, singer, actors, directors, titles, edition, price, dates, genre, comments, etc. They also allow you to enter many of these (title, edition, comments, etc.) in different languages.
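Seeking with such an index amounts to finding the last CuePoint at or before the requested time. The sketch below assumes the cues are already sorted by CueTime and, for simplicity, keeps a single byte position per CuePoint instead of one per track. The `find_cue` function and the sample values are illustrative only.

```python
import bisect
from typing import List, Optional, Tuple

# A CuePoint reduced to (CueTime, byte position in the file).
CuePoint = Tuple[int, int]


def find_cue(cues: List[CuePoint], target_time: int) -> Optional[CuePoint]:
    """Return the last cue at or before target_time, or None if there is none.

    Assumes `cues` is sorted by CueTime.
    """
    times = [cue_time for cue_time, _ in cues]
    i = bisect.bisect_right(times, target_time)
    if i == 0:
        return None  # the target lies before the first indexed point
    return cues[i - 1]


# Illustrative index of three keyframes (times in milliseconds, made-up positions).
cues = [(0, 1000), (5000, 90000), (10000, 180000)]
print(find_cue(cues, 7200))
```

The player would jump to the returned position and play forward from there, instead of hunting through the file for the right timecode.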
Here is a representation of the Block structure. There is an in-depth discussion of it in the specs. I will add some descriptions here when I have time.
One thing that I do want to mention, however, to avoid confusion, is the Timecode. The quick eye will notice that there is one Timecode shown per Cluster, and then another within the Block structure itself. The way this works is that the Timecode in the Cluster is relative to the whole file. It is usually the Timecode at which the first Block in the Cluster needs to be played. The Timecode in the Block itself is relative to the Timecode in the Cluster. For example, let's say that the Timecode in the Cluster is set to 10 seconds, and you have a Block in that Cluster that is supposed to be played 12 seconds into the clip. This means that the Timecode in the Block would be set to 2 seconds.
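The arithmetic in that example is small enough to write out. The two helper functions below are hypothetical names, and the values are in whole seconds only to match the example; they are not the units used in a real file.

```python
def absolute_timecode(cluster_timecode: int, block_timecode: int) -> int:
    """Playback time of a Block, relative to the whole file."""
    return cluster_timecode + block_timecode


def relative_timecode(cluster_timecode: int, absolute: int) -> int:
    """Value to store in the Block, relative to its Cluster's Timecode."""
    return absolute - cluster_timecode


# Cluster Timecode is 10 seconds and the Block should play at 12 seconds.
stored = relative_timecode(10, 12)
print(stored)                         # 2
print(absolute_timecode(10, stored))  # 12
```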