DALL·E 2 pre-training mitigations

We observed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default and not simply “stitch together” pieces of existing images. Moreover, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if people’s photos were present in training data).

To better understand the problem of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset, and sorted the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50k total prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push this rate down to 0 for the reasons stated above.
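To make the ranking step concrete, here is a minimal sketch of how samples could be sorted by perceptual similarity to their corresponding training images. This is an illustration rather than the actual pipeline: `embed` is a hypothetical perceptual feature extractor (e.g., features from a pretrained image encoder), and the names and shapes are placeholders.

```python
import numpy as np

def rank_by_similarity(generated_images, training_images, embed):
    """Sort prompt indices from most to least similar (sample, training image) pair."""
    gen = np.stack([embed(img) for img in generated_images])  # (N, D) features
    ref = np.stack([embed(img) for img in training_images])   # (N, D) features
    gen /= np.linalg.norm(gen, axis=1, keepdims=True)
    ref /= np.linalg.norm(ref, axis=1, keepdims=True)
    sims = np.sum(gen * ref, axis=1)   # cosine similarity per prompt
    order = np.argsort(-sims)          # highest similarity first
    return order, sims[order]
```

The top of this ranking is what would then be inspected by hand for true duplicates.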

When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize due to their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic which looks like a clock showing the time 1 o’clock, but then we would find a training sample containing the same clock showing 2 o’clock, and then 3 o’clock, and so on. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other works have observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.

The above finding suggested that, if we deduplicated our dataset, we might solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.[^footnote-2]
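As an illustration of the grouping step, here is a small union-find sketch, under the assumption that a neural network has already produced a list of duplicate pairs of image indices; the function and variable names are hypothetical.

```python
def deduplicate(num_images, duplicate_pairs):
    """Given duplicate (i, j) index pairs, return the set of image indices to keep."""
    parent = list(range(num_images))

    def find(x):
        # Path-compressed union-find lookup.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two groups of every duplicate pair.
    for i, j in duplicate_pairs:
        parent[find(i)] = find(j)

    # Keep exactly one image (the group root) from each duplicate group.
    return {find(i) for i in range(num_images)}
```

Every image outside the returned set would then be dropped from the training data.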

However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of quadrillions of image pairs to find all of the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost.

Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most of the duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of the cluster, while only missing a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every pair of images.[^footnote-3]
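Below is a rough sketch of that idea, assuming `embeddings` is an array of L2-normalized image features small enough to hold in memory; `n_clusters` and the 0.97 similarity threshold are illustrative placeholders rather than the values used in practice.

```python
import numpy as np
from sklearn.cluster import KMeans

def find_duplicate_pairs(embeddings, n_clusters, threshold=0.97):
    """Find near-duplicate pairs by only comparing images within the same cluster."""
    labels = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
    pairs = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # Pairwise cosine similarities within this cluster only: the cost scales
        # with the sum of squared cluster sizes, not the square of the dataset size.
        sims = embeddings[idx] @ embeddings[idx].T
        ii, jj = np.where(np.triu(sims, k=1) > threshold)
        pairs.extend(zip(idx[ii].tolist(), idx[jj].tolist()))
    return pairs
```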

When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs.

To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary for one clustering of the data, the same pair might fall within a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on using five clusterings, which means that we search for duplicates of each image in the union of five different clusters. On a subset of our data, this found 97% of all duplicate pairs.
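One way to realize this multi-clustering trick is sketched below, under the assumption that each clustering is fit on a different random subset of the embeddings and then used to assign every image to a cluster; the subset fraction and number of clusterings are illustrative parameters.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_neighbors(embeddings, n_clusters, n_clusterings=5, subset_frac=0.1):
    """For each image, collect the union of its cluster members across several clusterings."""
    n = len(embeddings)
    candidates = {i: set() for i in range(n)}
    for seed in range(n_clusterings):
        rng = np.random.default_rng(seed)
        # Fit each clustering on a different random subset so the boundaries differ.
        subset = rng.choice(n, size=max(n_clusters, int(n * subset_frac)), replace=False)
        km = KMeans(n_clusters=n_clusters, random_state=seed).fit(embeddings[subset])
        labels = km.predict(embeddings)  # assign every image to a cluster
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            for i in members:
                candidates[i].update(members.tolist())
    return candidates
```

The duplicate search for each image is then restricted to its candidate set, so the extra clusterings add roughly linear cost while recovering most of the duplicate pairs that cross any single clustering's boundaries.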

Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock’s appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might have hurt the model’s performance.

To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.

Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over 50k prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for the image from the training dataset. To take this test another step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50k generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.
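A simplified version of this final check could look like the following, assuming `train_emb` and `gen_emb` are L2-normalized perceptual embeddings of the training set and the 50k generated samples; the 0.97 similarity threshold is only a placeholder for whatever cutoff is used to flag candidates for manual inspection.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def regurgitation_check(train_emb, gen_emb, threshold=0.97):
    """Flag generated samples whose nearest training image is suspiciously similar."""
    nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(train_emb)
    dist, idx = nn.kneighbors(gen_emb)       # nearest training image for each sample
    sims = 1.0 - dist[:, 0]                  # convert cosine distance to similarity
    flagged = np.where(sims > threshold)[0]  # candidates to inspect by hand
    return flagged, idx[flagged, 0]
```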
