Technology is now an integral part of our daily lives, and devices routinely collect data to enhance the user experience. Technologies such as Artificial Intelligence and Deep Learning are proliferating across many areas. A great fortune for the research community working on Machine Learning and Artificial Intelligence is the ready availability of large volumes of open-source data. In this era of Artificial Intelligence, generating captions for images is crucial for conglomerates like Google, which enhance the search experience by allowing users to search by image. Social media platforms such as Facebook, Twitter, Instagram, and Snapchat also employ image captioning to curate users' feeds. Detecting individual objects in an image is not very difficult; however, it is arduous for machines to identify the salient features of a scene, such as children playing at a playground. The captions generated should be syntactically and semantically correct. Free access to massive datasets such as ImageNet, Flickr8k, Flickr30k, and Microsoft COCO: Common Objects in Context (MS COCO) has made research in this area more robust. Moreover, the encoder-decoder framework built on neural networks, i.e., Convolutional Neural Networks and Recurrent Neural Networks, has added value to research in this area.
1. Dataset Used
The MS-COCO (Microsoft Common Objects in Context) dataset [5] was used, which contains 82,000 unique images, each annotated with five captions. Of these, 30,000 captions corresponding to 6,000 images were used to develop this model. The images are preprocessed using the TensorFlow [7] preprocessing library.
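The image side of this preprocessing can be sketched as follows. This is a minimal NumPy illustration, assuming an Inception-style backbone that expects 299×299 inputs with pixel values scaled to [-1, 1]; the exact target size and normalization depend on the CNN chosen, which the text does not fix, and the function names are hypothetical (in practice the TensorFlow image utilities would do the resizing and scaling):

```python
import numpy as np

def resize_nearest(img, size=(299, 299)):
    """Nearest-neighbour resize of an (H, W, C) image array.
    Stands in for a library resize call; 299x299 is the input
    size assumed for an Inception-style encoder."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return img[rows][:, cols]

def preprocess_image(img):
    """Scale pixel values from [0, 255] to [-1, 1], the scaling
    used by Inception-style preprocessing (an assumption here)."""
    return img.astype(np.float32) / 127.5 - 1.0
```

A batch of such arrays, stacked along a new leading axis, is what the CNN encoder would consume.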
2. Pre-processing of Data
The dataset consists of images and five captions corresponding to each image. Hence, two kinds of preprocessing are needed: one for the images and another for the text (captions). In the first part, i.e., image preprocessing, the images are reshaped into an input format compatible with the input of the CNN encoder [1]. The captions are then mapped to their corresponding image names. Next, the captions are broken down word by word to form a dictionary of unique words, and each word in the dictionary is mapped to a numeric vector; this process is called tokenization. After tokenization, the resulting sequences are padded to a uniform length so that all tokenized captions have the same size. During training, the caption vectors and image feature vectors are paired with each other and the model is trained on these pairs.
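The tokenization and padding steps above can be illustrated with a small, self-contained sketch. In practice a library tokenizer (such as the one shipped with TensorFlow/Keras) would be used; the special-token ids and function names here are hypothetical:

```python
def build_vocab(captions):
    """Build a word-to-id dictionary from a list of caption strings.
    Ids 0-2 are reserved for padding and sequence delimiters
    (a common convention, assumed here)."""
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2}
    for cap in captions:
        for word in cap.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize_and_pad(captions, vocab, max_len):
    """Convert each caption to a fixed-length id sequence:
    wrap with <start>/<end>, truncate to max_len, pad with <pad>."""
    sequences = []
    for cap in captions:
        ids = [vocab["<start>"]]
        ids += [vocab[w] for w in cap.lower().split()]
        ids.append(vocab["<end>"])
        ids = ids[:max_len]                       # truncate long captions
        ids += [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones
        sequences.append(ids)
    return sequences
```

For example, with the captions "a dog runs" and "a cat", both sequences come out with the same length, ready to be batched against the image feature vectors.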
3. Methodology
The model consists of two components. The first converts the image into a feature representation, and the second converts those features into a meaningful English sentence (the caption). The first part is called the encoder, as it encodes the image information into a feature vector; the second part is the decoder, which converts the features into a caption.
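The encoder-decoder interaction can be sketched schematically. The toy NumPy snippet below uses randomly initialised weights in place of the trained CNN encoder and LSTM decoder (a simplified recurrent cell stands in for the LSTM), so the output is not a real caption; it only illustrates the decode loop, where the image feature vector initialises the decoder state and each predicted word is fed back as the next input. All names and sizes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, FEAT, HID = 10, 8, 16   # toy vocabulary, feature, and hidden sizes

# Hypothetical weights standing in for trained parameters:
W_feat = rng.normal(size=(FEAT, HID))   # projects image features to the initial state
W_hh   = rng.normal(size=(HID, HID))    # recurrent transition (simplified cell)
W_emb  = rng.normal(size=(VOCAB, HID))  # word embeddings
W_out  = rng.normal(size=(HID, VOCAB))  # hidden state -> vocabulary scores

def greedy_decode(image_features, start_id=1, end_id=2, max_len=10):
    """Greedy decoding: the encoder's feature vector initialises the
    decoder state; at each step the highest-scoring word is emitted
    and fed back in as the next input, until <end> or max_len."""
    h = np.tanh(image_features @ W_feat)   # encoder output seeds the state
    word = start_id
    caption = []
    for _ in range(max_len):
        h = np.tanh(h @ W_hh + W_emb[word])    # update state with previous word
        word = int(np.argmax(h @ W_out))       # pick the most probable next word
        if word == end_id:
            break
        caption.append(word)
    return caption
```

In the actual model, `image_features` would come from the CNN, the cell would be an LSTM, and decoding could also use beam search instead of the greedy choice shown here.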