Oracle AI & Data Science Blog
Learn AI, ML, and data science best practices

Parallel Decoders with Parameter Sharing

Auto composing - making a computer mimic a human to generate text - has been an active and appealing research area in the past few years. Accordingly, many researchers have turned to deep neural networks to design models for this difficult composing task.

OpenAI released its pre-trained language model called GPT (Generative Pre-trained Transformer) in 2018, which applied the Transformer architecture to language modeling and was a big success. OpenAI followed up with stronger versions, GPT-2 in 2019 and GPT-3 in 2020, accompanied by impressive generated samples, which further demonstrated the effectiveness of the Transformer architecture in language model design.

However, models in the GPT family contain a huge number of parameters, and they are often too large to put into production for storage-limited applications. Also, they stack all the attention layers on top of each other, so inference must pass through every layer sequentially, which further reduces inference speed.

In this article, we experiment with a parallel transformer architecture that uses a parameter-sharing scheme for auto composing text. Concretely, we modify the transformer decoder to design a relatively lightweight model that contains far fewer parameters than classical transformer-based language models; additionally, two parallel transformer decoders are deployed instead of a single deep stack.


A Review of the Transformer Decoder Architecture

A classical transformer decoder usually contains several layers; the layers share the same structure but have their own parameters. For a given layer, the two key parts are the multi-head attention block and the masked multi-head attention block. The masked multi-head attention block takes as input the token embeddings of a raw sentence plus position information, while the multi-head attention block takes its inputs from both the encoder output and the output of the masked multi-head attention (after dropout and layer normalization).

Transformer-based language models, like OpenAI's GPT, however, use a modified decoder, which keeps only the masked multi-head self-attention in each layer.

That is, the modified decoder chops off the multi-head attention block that attends to the encoder and relies only on the masked multi-head attention. A simple visual comparison between the classical transformer decoder and the modified transformer decoder is given in Fig 1.


Figure 1.  A comparison between the classical decoder and the modified decoder. (Left) classical decoder; (Right) modified decoder. N usually equals 12 in the basic configuration.
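To make the "masked" part concrete, here is a minimal single-head sketch of causal self-attention in NumPy. The weight matrices are random stand-ins, not trained parameters; a real implementation would use multiple heads plus dropout and layer normalization.

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention: each position may only
    attend to itself and earlier positions."""
    seq_len, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: entries for future positions (j > i) get -inf
    # so that the softmax assigns them zero weight.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))              # 5 tokens, embedding size 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = masked_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Because of the mask, the first position can attend only to itself, so its output is exactly its own value vector.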

Compared to the classical decoder, the modified decoder contains fewer parameters, since each layer drops one attention block. However, since each layer still has its own parameters, the total parameter count remains large. A simple calculation shows that, with the configuration below, the full 12 layers of a modified decoder contain around 85 million parameters.

word embedding size: 768
num of attention heads: 12
size per head: 64
hidden size: 768
feedforward network size: 3072
num of layers: 12
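The 85 million figure can be reproduced with a quick back-of-the-envelope calculation. Biases, layer-norm parameters, and the token-embedding table are omitted for simplicity:

```python
# Rough weight count for the modified decoder (ignoring biases,
# layer norms, and the embedding table).
embed = 768        # word embedding / hidden size
ffn = 3072         # feedforward network size
layers = 12

attention = 4 * embed * embed        # Q, K, V, and output projections
feedforward = 2 * embed * ffn        # the two linear maps in the FFN
per_layer = attention + feedforward  # ~7.1M per layer
total = layers * per_layer

print(f"{total / 1e6:.1f}M parameters")  # ~84.9M
```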


Parallel Decoder Architecture and Parameter Sharing

From the section above, we can see that a modified decoder with the basic configuration has a huge number of parameters to learn. This means not only that we need to spend more effort to train the parameters well, but also that the resulting model takes a lot of space to store, which is a serious obstacle for applications where only limited storage is available.

In this section, we design a parallel, parameter-sharing decoder architecture and explore its capability as a language model. For convenience, we name it PDPS, for Parallel Decoders with Parameter Sharing.

Concretely, PDPS consists of two smaller modified transformer decoders. Each decoder has its own set of parameters for its masked multi-head attention and feedforward network, but those parameters are shared across all the layers within that decoder. The two decoders are tied together by concatenating their outputs; another mapping then scales the combined output back to the embedding size. Standard layer normalization, as used in each of the attention layers, is applied before the final output.

Fig 2 provides a visual illustration of the PDPS architecture; the parts that share parameters are shaded. The masked multi-head attention and feedforward network of the two smaller decoders are drawn in different colors (green and pink) to indicate that they have different sets of parameters.


Figure 2.  A visual illustration of PDPS
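A shape-level sketch of the PDPS forward pass may help. This is a minimal NumPy mock: a single matrix `W_a` (or `W_b`) stands in for one decoder's entire shared parameter set, and layer normalization is omitted for brevity.

```python
import numpy as np

def decoder_layer(x, W):
    """Stand-in for one masked-attention + FFN layer; one weight
    matrix represents that layer's whole parameter set."""
    return np.tanh(x @ W)

def pdps_forward(x, W_a, W_b, W_out, n_layers=6):
    """Two parallel decoders; within each decoder, the SAME weights
    are reused by every layer (parameter sharing)."""
    a, b = x, x
    for _ in range(n_layers):
        a = decoder_layer(a, W_a)   # decoder A reuses W_a at each layer
        b = decoder_layer(b, W_b)   # decoder B reuses W_b at each layer
    combined = np.concatenate([a, b], axis=-1)   # (seq, 2 * embed)
    return combined @ W_out                      # map back to embed size

rng = np.random.default_rng(0)
embed, seq = 16, 5
x = rng.standard_normal((seq, embed))
W_a = rng.standard_normal((embed, embed))
W_b = rng.standard_normal((embed, embed))
W_out = rng.standard_normal((2 * embed, embed))  # concat -> embed mapping
y = pdps_forward(x, W_a, W_b, W_out)
print(y.shape)  # (5, 16)
```

Note how only three weight matrices are stored no matter how many layers are run, which is the source of the parameter savings discussed next.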

In PDPS, if we set N equal to 6, we still have 12 layers in total, but because of parameter sharing, the total number of parameters drops considerably. A rough calculation shows that, using the same configuration as in the preceding section, the 12 layers of PDPS contain around 14.2 million parameters, which is about 17% of the original decoder's 85 million.

With the same configuration, PDPS does introduce about 1.18 million additional parameters in the mapping step after the two outputs are concatenated, but the overall parameter reduction is still considerable.
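The same back-of-the-envelope arithmetic applied to PDPS (weights only, as before) recovers both numbers:

```python
# PDPS weight count: each decoder stores ONE layer's worth of
# parameters and reuses it across all its layers.
embed, ffn = 768, 3072
per_layer = 4 * embed * embed + 2 * embed * ffn  # one shared layer
shared = 2 * per_layer              # one parameter set per decoder
projection = (2 * embed) * embed    # concat (2*embed) -> embed mapping
total = shared + projection

print(f"{shared / 1e6:.1f}M shared + {projection / 1e6:.2f}M projection "
      f"= {total / 1e6:.1f}M")  # ~14.2M + ~1.18M
```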

The optimization objective of PDPS is to minimize the average cross entropy between the expected outputs and the actual outputs of each batch: 

Loss = -(1/n) * Σ_i Σ_j Σ_k y_ijk * log(p_ijk)

Here n is the batch size; i indexes the samples in the batch; j indexes the tokens in each sample; and k indexes the tokens in the dictionary. y is the expected output, and p is the actual (predicted) output.
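The loss can be written directly in NumPy. In this illustrative helper, `y` is the one-hot expected output and `p` the model's softmax output, with shapes batch × tokens × vocabulary:

```python
import numpy as np

def batch_cross_entropy(y, p, eps=1e-12):
    """Average cross entropy over a batch of n samples:
    -(1/n) * sum_i sum_j sum_k y_ijk * log(p_ijk)."""
    n = y.shape[0]
    return -np.sum(y * np.log(p + eps)) / n  # eps guards against log(0)

# Tiny example: 1 sample, 2 token positions, vocabulary of 3.
y = np.array([[[1, 0, 0],
               [0, 1, 0]]], dtype=float)
p = np.array([[[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1]]])
print(batch_cross_entropy(y, p))  # -(ln 0.7 + ln 0.8) ≈ 0.5798
```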


Experiments: Compositions Generated by PDPS

The training data comes from Wikipedia articles with minimal filtering, such as ignoring sentences shorter than 10 words and removing HTML tags. I use two 12-layer modified decoders. The number of training steps is set to 5,000,000, with a batch size of 8. The concrete configuration is listed below:

word embedding size: 1200
num of attention heads: 12
size per head: 100
hidden size: 1200
feedforward network size: 4800
num of layers (each decoder): 12
maximum sequence length: 128

The trained model is about 300 MB. (It would be about 1,730 MB if we stacked the decoders without sharing parameters.)

Here, we showcase several compositions generated by PDPS. For each composition, we give PDPS a start prompt and let it finish the whole story. (In example 5, we leave the start prompt empty and let PDPS generate the story completely from scratch.)

We aim for generated stories of around 150 tokens and generate each token autoregressively. One trick used here: if a sampled token has already been generated, we resample it up to 3 times. All characters are transformed to lowercase.
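The resampling trick can be sketched as follows. This is a minimal mock: `next_token_probs` stands in for the trained model, and the exact resampling rule in the original code may differ in detail.

```python
import random

def generate(next_token_probs, max_len=150, max_retries=3, seed=0):
    """Autoregressive sampling; if a sampled token already appears in
    the output, resample up to `max_retries` times before accepting."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        vocab, probs = next_token_probs(tokens)
        token = rng.choices(vocab, weights=probs)[0]
        for _ in range(max_retries):
            if token not in tokens:
                break
            token = rng.choices(vocab, weights=probs)[0]  # resample repeat
        tokens.append(token)  # accept after max_retries regardless
    return tokens

# Mock "model": uniform distribution over a toy vocabulary.
def mock_probs(tokens):
    vocab = ["the", "cave", "was", "found", "deep"]
    return vocab, [1.0] * len(vocab)

story = generate(mock_probs, max_len=10)
print(" ".join(story))
```

The retry loop discourages, but does not forbid, repeats, so the model can still legitimately reuse common words like "the".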


Example 1:

Start prompt: after reaching outer space

Compositions generated by PDPS:

after reaching outer space , " discovery " was sent to the iss and arrived at its destination on 1 july . " discovery " was the first of four iss - built spacecraft used to conduct scientific research into solar physics ; the others were " atlantis " ( from may to september 2006 ) , " voyager " ( october to december 2003 ) , and " endeavour " ( april 2005 to march 2006 ) . all three spacecraft had successfully completed orbiters1 , which included a probe into the earth ' s atmosphere on 24 august 2007 . they also carried out experiments on a large number of other instruments including the x - ray observatory , which was launched to study the sun ' s formation . these observations revealed that a solar wind that has been driven through the atmosphere by the sun may have caused the planet ' s mass to rise significantly . 


Example 2:

Start prompt: the scientists finally found the cave

Compositions generated by PDPS:

the scientists finally found the cave and discovered that it contained an ancient stone age rock shelter . however , the archaeologists were unable to find any artifacts of the stone age or human occupation from which they could trace their history . this discovery was not immediately confirmed by subsequent archaeological findings , and there was a debate over whether the cave was actually a prehistoric site or a prehistoric settlement . some researchers suggest that the caves were inhabited before 500 bc , and that the evidence suggests they were constructed during the late bronze age and early iron age periods . other scholars believe that the cave ' s construction began after 400 – 600 ad , when humans were driven from the area of present - day southern spain by an unknown group of people from the nearby town of palencia ( now part of valencia ) .


Example 3:

Start prompt: the video game super mario

Compositions generated by PDPS:

the video game super mario was released in japan on november 16 , 2009 . it is based in the " mario karts " series of games . this game contains three levels : " the legend of the golden donkey " , " the legend of the dragon queen " and " the legend of the king of thieves " , which are set in different places and have different rules to them . each level includes several new items such as coins , gems , and armor and some other accessories that can be used by the player as part of their quest . these items are all available in the " mario karts " and " mario karts 2 " collections . there are also special cards and minigames for certain stages .


Example 4:

Start prompt: renaissance emerged from the 14th to the 17th centuries

Compositions generated by PDPS:

renaissance emerged from the 14th to the 17th centuries , as a result of the spread of enlightenment in europe . it was first introduced into germany around 1525 , where its popularity led to the publication of numerous works by artists such as johannes vermeer , albrecht durer , and hans holbein . however , the spread was interrupted by the protestant reformation in 1536 , which brought the development of modern painting closer to the new age . this resulted in more elaborate forms of paintings than the earlier ones and , in particular , the " madonna and child with angels " series in the late 16th century was the basis for later developments in art .


Example 5:

Start prompt: 

Compositions generated by PDPS:

the station was opened in 1906 by the north eastern railway and was located near the town of new romney . the station closed on 4 december 1964 when it was demolished . its goods yard was then re - built as part of the london midland and scottish region ' s network rail operation between 1966 and 1971 , and the platforms were also used for freight trains . after the passing of the railways act 1972 , all remaining stations in the area reverted to their original names . there are now three signal boxes in the village : one near the west entrance ( which is open only during weekday peak hours ) and one next to the east , and another near the north end .



From the examples above, we see that PDPS can complete a story with understandable and reasonable sentences. However, our random sampling strategy introduces some uncertainty, which can make the generated compositions look odd. I observe the following phenomena:

1. In some cases, the generated compositions are not in line with facts, as in example 3 and the following paragraph:

“deep ocean fishes are mysterious creatures that live in the sea, including the giant "big horn" and various large species of birds such as those from north america.”

Obviously, birds are not fish and cannot live in the deep ocean.

2. Sometimes, it may generate repeated chunks, as in the following paragraph:

“the most popular foods available at the factory include chicken soup, chicken soup, and beef meat.”

3. Sometimes, it may repeatedly generate similar chunks, as in the following paragraph:

after graduated from the college of engineering , he joined the faculty of cornell university as a lecturer . his tenure there lasted until 1969 where he was appointed professor emeritus at the same institution . during that time he published several books including " principles for management , organization and control of business " ( 1973 ) , " management and control of industry " ( 1977 ) , " management and control of enterprise : a guide to management " ( 1980 ) , " management and control of business " ( 1982 ) , " marketing and sales management , strategy and sales operations , business and marketing " ( 1984 ) , " management and management of businesses " ( 1985 ) , " management and control of companies " ( 1986 ) , " management and operations management " ( 1987 ) , " marketing and distribution " ( 1988 ) , " management and control " ( 1989 ) , " management " ( 1990 ) .

4. In some cases, two consecutive sentences may not be coherent.

Overall, it may take several tries to generate one high-quality paragraph.



In this article, we investigated the capability of parallel transformer decoders with a parameter-sharing scheme. Instead of stacking all the attention layers together, we run two decoders in parallel and share parameters across all the layers within each of them. The resulting model uses far fewer parameters while maintaining the ability to compose understandable and reasonable text.

Future studies will aim to exploit the parallel design to further speed up inference. It is also worth trying a hierarchical decoder array to explore the limits of the model.


To learn more about AI and machine learning, visit the Oracle AI page, and follow us on Twitter @OracleAI
