Notes 4: Extracting EXIF information from pictures with Python

I. Main idea

(1) Find all image tags on the target web page
Fetch the HTML content from the given URL, then parse it with
BeautifulSoup into a tree of HTML elements.
Find all the image (img) tags in that tree.
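As a dependency-free sketch of step (1), the snippet below collects the src value of every img tag from an HTML string. It assumes Python 3 and uses the standard library's HTMLParser as a stand-in for BeautifulSoup; the names ImgCollector and find_image_sources and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def find_image_sources(html):
    """Return the src values of all img tags found in an HTML string."""
    collector = ImgCollector()
    collector.feed(html)
    return collector.sources


sample = ('<html><body><img src="/a.jpg"><p>text</p>'
          '<img src="http://example.com/b.png"></body></html>')
print(find_image_sources(sample))
```

Tags without a src attribute are skipped, which avoids the missing-key failure that can occur when indexing a tag with ['src'] directly.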

(2) Download the images
Read the src attribute of each tag to get the picture's address, then download the picture.
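A src value is often relative to the page, so it may need resolving before it can be downloaded. The following is a minimal sketch of that step, assuming the page URL is known; image_target is a hypothetical helper name, and the example.com URLs are placeholders.

```python
import os.path

try:
    from urllib.parse import urljoin, urlsplit  # Python 3
except ImportError:
    from urlparse import urljoin, urlsplit  # Python 2, as used in this note


def image_target(page_url, src):
    """Return (absolute_url, local_file_name) for an img src value."""
    # urljoin resolves relative paths and //host/ forms against the page URL
    absolute = urljoin(page_url, src)
    # take the file name from the path only, ignoring any query string
    name = os.path.basename(urlsplit(absolute).path)
    return absolute, name


page = 'http://example.com/news/index.html'
print(image_target(page, 'img/a.jpg'))
print(image_target(page, '/static/b.png?v=2'))
```

Taking the name from the path component keeps query strings such as ?v=2 out of the local file name.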

(3) Extract the EXIF metadata
The picture's EXIF information is extracted with the appropriate library, and the entries are traversed and stored in a dictionary variable.
The program then checks whether the picture has any EXIF information at all (for some pictures nothing can be extracted) and whether it has GPSInfo
(which may be lost during compression or may never have existed). If either check fails, the picture is deleted.
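The two checks in step (3) can be sketched as a small pure function. It assumes the EXIF data has already been turned into a dictionary keyed by tag name; exif_verdict and its return labels are hypothetical names chosen for this illustration.

```python
def exif_verdict(exif):
    """Classify a tag-name -> value dictionary (hypothetical helper).

    Returns 'no-exif', 'no-gps', or 'keep'.
    """
    if not exif:                 # None or empty: nothing could be extracted
        return 'no-exif'
    if not exif.get('GPSInfo'):  # key missing or value empty
        return 'no-gps'
    return 'keep'


print(exif_verdict(None))
print(exif_verdict({'Make': 'Canon'}))
print(exif_verdict({'Make': 'Canon', 'GPSInfo': {1: 'N'}}))
```

Using get() rather than indexing means a picture without GPS data is reported as 'no-gps' instead of raising KeyError.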

(4) Delete the picture
Use the remove function of the os module. As long as the file's path is supplied, the file can be deleted.
In fact, the os module can be used to automate many tasks on both Windows and Linux.
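A minimal sketch of step (4) follows. It works on a throwaway temporary file so nothing real is touched; remove_if_exists is a hypothetical helper name.

```python
import os
import tempfile


def remove_if_exists(path):
    """Delete path and report whether a file was actually removed."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False


# create a throwaway file so the example touches nothing real
fd, tmp_path = tempfile.mkstemp(suffix='.jpg')
os.close(fd)

print(remove_if_exists(tmp_path))  # the file existed and is removed
print(remove_if_exists(tmp_path))  # second call finds nothing to remove
```

Checking with os.path.isfile first avoids the exception that os.remove raises when the path does not exist.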

3. Summary of modules and methods

urlparse module
This module defines a standard interface for breaking a Uniform Resource Locator (URL) string into its components (addressing scheme, network location, path, etc.), for combining the components back into a URL string, and for converting a "relative URL" to an absolute URL given a "base URL". In Python 3 the same functions live in urllib.parse.

urlsplit, a function similar to urlparse
urlparse parses a URL into six components and returns a 6-tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment. urlsplit works the same way but does not split the parameters from the path, so it returns a 5-tuple (scheme, netloc, path, query, fragment). Each tuple item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters shown above are not part of the result, except for a leading slash in the path component, which is retained if present.
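To make the difference between the two functions concrete, here is a short sketch that parses one made-up URL both ways. The import fallback covers the Python 2 module name used in this note; the prints assume Python 3.

```python
try:
    from urllib.parse import urlparse, urlsplit  # Python 3
except ImportError:
    from urlparse import urlparse, urlsplit  # Python 2, as used in this note

url = 'http://www.example.com/a/b/pic.jpg;type=a?size=big#top'

six = urlparse(url)   # scheme, netloc, path, params, query, fragment
five = urlsplit(url)  # scheme, netloc, path, query, fragment

print(len(six), len(five))
print(six.path, six.params)  # params are split off the path
print(five.path)             # params stay attached to the path
print(five[1])               # index 1 is the network location
```

The code in this note indexes the result by position: [1] is the network location and [2] is the path.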

os.path.basename(path)
Returns the base name of the pathname path. Note that the result differs from the Unix basename program: where basename for '/foo/bar/' returns 'bar', the basename() function returns an empty string ('').
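The trailing-slash behaviour matters when the file name is derived from a URL path, because a path ending in '/' yields an empty name. A small sketch, where last_component is a hypothetical helper:

```python
import os.path


def last_component(path):
    """Base name that ignores trailing slashes (hypothetical helper)."""
    return os.path.basename(path.rstrip('/'))


print(repr(os.path.basename('/foo/bar')))   # last path component
print(repr(os.path.basename('/foo/bar/')))  # trailing slash gives ''
print(repr(last_component('/foo/bar/')))    # stripped first, so 'bar'
```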

Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with your favorite parser, it provides the usual ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

PIL library
PIL is commonly used for image processing. This program uses the image object's _getexif() method to extract EXIF, but only jpg and jpeg images are handled this way,
and the suffix check is case-sensitive, so upper-case suffixes are not recognized.
Official documentation: http://effbot.org/imagingbook/
PIL.ExifTags.TAGS (TagName = TAGS.get(tag, tag))
TAGS is a dictionary. As such, you can get the value for a given key by using TAGS.get(key). If that key does not exist, you can have it return a default value by passing in a second argument: TAGS.get(key, val).
Source: http://www.tutorialspoint.com/python/dictionary_get.htm
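The get-with-default idiom can be tried without PIL installed. The sketch below uses a three-entry stand-in for PIL.ExifTags.TAGS (the real table is much larger) and made-up raw data; tag id 59999 is invented to show the fallback.

```python
# A three-entry stand-in for PIL.ExifTags.TAGS, which maps numeric tag ids
# to names (0x010F Make, 0x0110 Model, 0x8825 GPSInfo).
TAGS = {271: 'Make', 272: 'Model', 34853: 'GPSInfo'}

# Hypothetical raw EXIF data keyed by numeric tag id; 59999 is a made-up id.
raw = {271: 'Canon', 34853: {1: 'N'}, 59999: 'vendor data'}

named = {}
for tag, value in raw.items():
    # known ids become names; unknown ids fall back to the id itself
    named[TAGS.get(tag, tag)] = value

print(named)
```

Passing the tag itself as the default keeps unknown entries in the result under their numeric id instead of dropping them.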

exifread (pip install exifread)
This program processes the EXIF information of png images with the
exifread.process_file(imageFile) method.
Calling tags = exifread.process_file(fd) reads the image's EXIF information, which has the following format:
{'Image ImageLength': (0x0101) Short=3024 @ 42,
 ......
 'Image GPSInfo': (0x8825) Long=792 @ 114,
 'Thumbnail JPEGInterchangeFormat': (0x0201) Long=928 @ 808,
 ......
}
4. Errors and solutions
'str' object has no attribute 'read'
The argument passed was a string, but the function requires a file object opened in binary mode.
Only the file name was passed, rather than the file object obtained by opening that name with open().
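The error can be reproduced without exifread. In this sketch read_magic is a hypothetical stand-in for any function that, like exifread.process_file, calls read() on its argument; the temporary file and its contents are made up.

```python
import os
import tempfile


def read_magic(fh):
    """Stand-in for a function that expects an open binary file object."""
    return fh.read(2)


# throwaway file that starts with the JPEG start-of-image marker
fd, path = tempfile.mkstemp(suffix='.jpg')
os.write(fd, b'\xff\xd8rest-of-file')
os.close(fd)

message = None
try:
    read_magic(path)          # wrong: path is a str, not a file object
except AttributeError as err:
    message = str(err)
print(message)

with open(path, 'rb') as fh:  # right: open the name, pass the file object
    magic = read_magic(fh)
print(magic == b'\xff\xd8')

os.remove(path)
```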
'PngImageFile' object has no attribute '_getexif'
This error occurs because _getexif cannot be used on .png files.
Solution:
Read such files with the exifread module instead; in this program that module is used only for .png files.
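One way to avoid calling _getexif on the wrong file type is to choose the extractor from the suffix, compared case-insensitively. This is a sketch only: pick_extractor is a hypothetical helper, and the strings it returns are illustrative labels rather than library identifiers.

```python
import os.path


def pick_extractor(file_name):
    """Return which library to try for a file, or None if unsupported."""
    suffix = os.path.splitext(file_name)[1].lower()  # '.JPG' becomes '.jpg'
    if suffix in ('.jpg', '.jpeg'):
        return 'PIL'
    if suffix == '.png':
        return 'exifread'
    return None  # gif, svg, no suffix, and so on


for name in ('a.JPG', 'b.png', 'c.gif', 'noext'):
    print(name, pick_extractor(name))
```

Returning None for everything else gives the caller an explicit signal to skip or delete files in formats the program does not handle.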

Other bugs remain unresolved, but they do not affect basic use. It has to be said that
part of the original author's code is really poor and does not work on current websites.

5. Summary and reflections
(1) In the end the program still did not produce any EXIF information. The data may be encrypted, or
some of the pictures may simply not store EXIF information.
(2) The pictures come in several formats (gif, jpg, svg, etc.), and the program does not filter them strictly.
(3) Some sites have their own anti-crawling mechanisms, so their pictures cannot be crawled, for example www.qq.com.

II. Code

#!/usr/bin/python
# coding: utf-8

import os
import exifread
import urllib2
import optparse
from urlparse import urlsplit
from os.path import basename
from bs4 import BeautifulSoup
from PIL import Image
from PIL.ExifTags import TAGS



def findImages(url):  # find all image tags on the page
    print '[+] Finding images of '+str(urlsplit(url)[1])
    resp = urllib2.urlopen(url).read()
    soup = BeautifulSoup(resp,"lxml")
    imgTags = soup.findAll('img')
    return imgTags


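def downloadImage(imgTag):  # download the image referenced by the tag
    try:
        print '[+] Downloading image...'
        imgSrc = imgTag['src']
        imgContent = urllib2.urlopen(imgSrc).read()
        imgName = basename(urlsplit(imgSrc)[2])
        # write in binary mode; the with block closes the file even on error
        with open(imgName, 'wb') as f:
            f.write(imgContent)
        return imgName
    except Exception:
        # missing src, relative or unreachable URL, empty file name, etc.
        return ''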

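def delFile(imgName):  # delete a downloaded file from the working directory
    # downloadImage saves into the current directory, so remove the file
    # from there rather than from a hard-coded path
    if imgName and os.path.isfile(imgName):
        os.remove(imgName)
        print "[+] Del File"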

def exifImage(imageName):  # extract EXIF info; delete the file if there is none
    if not imageName:  # download failed, nothing to inspect
        return
    Info = None  # stays None for unsupported suffixes
    try:
        suffix = imageName.split('.')[-1].lower()  # also accept .JPG, .PNG
        if suffix == 'png':
            # exifread needs a file object opened in binary mode
            with open(imageName, 'rb') as imageFile:
                Info = exifread.process_file(imageFile)
        elif suffix in ('jpg', 'jpeg'):
            Info = Image.open(imageName)._getexif()
        exifData = {}
        if Info:
            for (tag, value) in Info.items():
                TagName = TAGS.get(tag, tag)
                exifData[TagName] = value
            # PIL names the tag 'GPSInfo'; exifread names it 'Image GPSInfo'
            exifGPS = exifData.get('GPSInfo') or exifData.get('Image GPSInfo')
            if exifGPS:
                print '[+] GPS: ' + str(exifGPS)
            else:
                print '[-] No GPS information'
                delFile(imageName)
        else:
            print '[-] Can\'t detect exif'
            delFile(imageName)
    except Exception, e:
        print e
        delFile(imageName)


def main():
    parser = optparse.OptionParser('-u <target url>')
    parser.add_option('-u',dest='url',type='string',help='specify the target url')
    (options,args) = parser.parse_args()
    url = options.url

    if url is None:
        print parser.usage
        exit(0)

    imgTags = findImages(url)
    for imgTag in imgTags:
        imgFile = downloadImage(imgTag)
        if imgFile:  # skip images that failed to download
            exifImage(imgFile)


if __name__ == '__main__':
    main()


Origin www.cnblogs.com/qianxinggz/p/11402602.html