A Complete Guide to Marshmallow

marshmallow is a library for converting complex ORM objects to and from native Python data types. In short, it handles object -> dict, objects -> list, string -> dict, and string -> list.

To use marshmallow, you first need a class to serialize and deserialize:

import datetime as dt

class User(object):
    def __init__(self, name, email):
        self.name = name
        self.email = email
        self.created_at = dt.datetime.now()

    def __repr__(self):
        return '<User(name={self.name!r})>'.format(self=self)

##Schema
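To convert between a class and JSON data (that is, to serialize and deserialize; serializing means turning data into a form that can be stored or transmitted), you need an intermediary, and that intermediary is the Schema.
Besides conversion, a Schema can also validate data. Every class you want to convert needs a corresponding Schema: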

from marshmallow import Schema, fields

class UserSchema(Schema):
    name = fields.Str()
    email = fields.Email()
    created_at = fields.DateTime()

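##Serializing
Serialization uses the schema's dump() or dumps() method: dump() converts obj -> dict, and dumps() converts obj -> string. Flask can serialize a dict directly (with jsonify), and you will usually want to process the dict further, so there is no need to convert to a string at this point. When Flask and Marshmallow are used together, dump() is normally all you need. In the marshmallow 2.x API used throughout these examples, dump() returns a result object whose .data attribute holds the dict: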

from marshmallow import pprint

user = User(name="Monty", email="[email protected]")
schema = UserSchema()
result = schema.dump(user)
pprint(result.data)
# {"name": "Monty",
#  "email": "[email protected]",
#  "created_at": "2014-08-17T14:54:16.049594+00:00"}

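##Filtering Output
You do not need to output every field of an object each time. Use the only parameter to name the fields you want, which is very common in practice.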

summary_schema = UserSchema(only=('name', 'email'))
summary_schema.dump(user).data
# {"name": "Monty", "email": "[email protected]"}

You can also use the exclude parameter to leave out fields you do not want in the output.
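##Deserializing
The counterpart of dump() is load(). It validates an input dict and deserializes it into application-level data structures: by default a dict of Python values, and with a post_load hook an ORM object: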

from pprint import pprint

user_data = {
	'created_at':'2014-08-11T05:26:03.869245',
	'email': u'[email protected]',
	'name': u'Ken'
	}
schema = UserSchema()
result = schema.load(user_data)
pprint(result.data)
# {'name': 'Ken',
#  'email': '[email protected]',
#  'created_at': datetime.datetime(2014, 8, 11, 5, 26, 3, 869245)}

For deserialization, turning the incoming dict into an object is usually more useful. In Marshmallow you implement the dict -> object step yourself and mark that method with the post_load decorator:

from marshmallow import Schema, fields, post_load

class UserSchema(Schema):
    name = fields.Str()
    email = fields.Email()
    created_at = fields.DateTime()

    @post_load
    def make_user(self, data):
        return User(**data)

Now every call to load() runs make_user and returns a User instance:

user_data = {
	'name': 'Ronnie',
	'email': '[email protected]'
}
schema = UserSchema()
result = schema.load(user_data)
result.data  # => <User(name='Ronnie')>

Tip: loads() is the counterpart of dumps(). It handles string -> object and is handy in some simple cases.

##Objects <-> List
The serialization and deserialization above handle a single object. To handle a collection of objects, just pass many=True, either to the schema constructor or to the call:

user1 = User(name="Mick", email="[email protected]")
user2 = User(name="Keith", email="[email protected]")
users = [user1, user2]

# Option 1:
schema = UserSchema(many=True)
result = schema.dump(users)

# Option 2:
schema = UserSchema()
result = schema.dump(users, many=True)
result.data

# [{'name': u'Mick',
#   'email': u'[email protected]',
#   'created_at': '2014-08-17T14:58:57.600623+00:00'},
#  {'name': u'Keith',
#   'email': u'[email protected]',
#   'created_at': '2014-08-17T14:58:57.600623+00:00'}]

##Validation
Schema.load() and loads() include a dictionary of validation errors in their return value. Some fields, such as Email and URL, come with built-in validators.

result = UserSchema().load({'email': 'foo'})
result.errors  # => {'email': ['"foo" is not a valid email address.']}

When validating a collection, the errors dictionary is keyed by the index of each invalid item, with that item's error messages as the value:

class BandMemberSchema(Schema):
    name = fields.String(required=True)
    email = fields.Email()

user_data = [
    {'email': '[email protected]', 'name': 'Mick'},
    {'email': 'invalid', 'name': 'Invalid'},  # invalid email
    {'email': '[email protected]', 'name': 'Keith'},
    {'email': '[email protected]'},  # missing "name"
]

result = BandMemberSchema(many=True).load(user_data)
result.errors
# {1: {'email': ['"invalid" is not a valid email address.']},
#  3: {'name': ['Missing data for required field.']}}

To customize validation, pass a validate argument to a built-in field. Its value can be a function, a lambda, or any object that defines __call__:

class ValidatedUserSchema(UserSchema):
    # NOTE: This is a contrived example.
    # You could use marshmallow.validate.Range instead of an anonymous function here
    age = fields.Number(validate=lambda n: 18 <= n <= 40)

in_data = {'name': 'Mick', 'email': '[email protected]', 'age': 71}
result = ValidatedUserSchema().load(in_data)
result.errors  # => {'age': ['Validator <lambda>(71.0) is False']}

If the validator you pass raises a ValidationError, its message is stored when the error is triggered:

from marshmallow import Schema, fields, ValidationError

def validate_quantity(n):
    if n < 0:
        raise ValidationError('Quantity must be greater than 0.')
    if n > 30:
        raise ValidationError('Quantity must not be greater than 30.')

class ItemSchema(Schema):
    quantity = fields.Integer(validate=validate_quantity)

in_data = {'quantity': 31}
result, errors = ItemSchema().load(in_data)
errors  # => {'quantity': ['Quantity must not be greater than 30.']}

Note 1:
To run several validators on one field, pass a collection (list, tuple, or generator) of callable validators.

Note 2:
Schema.dump() also returns an errors dictionary, which contains any ValidationErrors raised during serialization. However, required, allow_none, validate, @validates, and @validates_schema apply only to deserialization, that is, to Schema.load().

##Field Validators as Methods
Writing validators as methods is often more convenient. Use the validates decorator to register a method as the validator for a field:

from marshmallow import fields, Schema, validates, ValidationError
class ItemSchema(Schema):
    quantity = fields.Integer()

    @validates('quantity')
    def validate_quantity(self, value):
        if value < 0:
            raise ValidationError('Quantity must be greater than 0.')
        if value > 30:
            raise ValidationError('Quantity must not be greater than 30.')

##strict Mode
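If you pass strict=True to the Schema constructor or set it in the class's Meta options, invalid data raises an exception instead of being reported through the returned errors. The dictionary of validation errors is available as ValidationError.messages.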

from marshmallow import fields, Schema, ValidationError, validates_schema

class ItemSchema(Schema):
    quantity = fields.Integer()
    class Meta:
        strict = True

    @validates_schema()
    def validate_quantity(self, data):
        if data['quantity'] < 0:
            raise ValidationError('Quantity must be greater than 0.')
        if data['quantity'] > 30:
            raise ValidationError('Quantity must not be greater than 30.')

schema = ItemSchema()
d = {'quantity': 31}
loaded = schema.load(d)
print(loaded)
# Raises: marshmallow.exceptions.ValidationError: {u'_schema': ['Quantity must not be greater than 30.']}

##Required Fields
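Pass required=True to a field to make it mandatory. If the input to Schema.load() is missing that field, an error is recorded.
To customize the error message for required fields, pass an error_messages argument: a dict that maps the key required to your message.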

from marshmallow import fields, Schema

# Option 1: replace the default messages for whole field classes
fields.Field.default_error_messages = {
    'required': u'缺少必填数据.',
    'type': u'数据类型不合法.',
    'null': u'数据不能为空.',
    'validator_failed': u'非法数据.'
}
fields.Str.default_error_messages = {
    'invalid': '不是合法文本.'
}
fields.Int.default_error_messages = {
    'invalid': u'不是合法整数.'
}
fields.Number.default_error_messages = {
    'invalid': u'不是合法数字.'
}
fields.Boolean.default_error_messages = {
    'invalid': u'不是合法布尔值.'
}
# Option 2: set the message on a single field
class ItemSchema(Schema):
    quantity = fields.Int(required=True, error_messages={'required': 'quantity is required.'})

schema = ItemSchema()
print(schema.load({'quantity': '12a'}))  # invalid value: message from option 1
# UnmarshalResult(data={}, errors={'quantity': [u'不是合法整数.']})
print(schema.load({}))  # missing value: message from option 2
# UnmarshalResult(data={}, errors={'quantity': ['quantity is required.']})

##Partial Loading
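In a RESTful design, data is updated with the HTTP PUT or PATCH method. PUT requires the complete resource to be sent to the server, while PATCH sends only the parts that changed. With PATCH, fields declared as required may therefore be absent, and the request would fail Marshmallow's validation. Partial Loading avoids this.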

To enable Partial Loading, pass a partial argument to the schema constructor or to load(), naming the required fields that may be omitted:

class UserSchema(Schema):
	name = fields.String(required=True)
	age = fields.Integer(required=True)

data, errors = UserSchema().load({'age':12}, partial=('name',))
# OR UserSchema(partial=('name',)).load({'age': 12})
data, errors  # => ({'age': 12}, {})

##Schema.validate
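If you only want to validate data with a Schema, without creating objects, use Schema.validate().
As the example shows, schema.validate() checks the data and returns any errors it finds, so an empty result means the data passed validation.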

class ItemSchema(Schema):
    name = fields.Str(required=True)
    country = fields.Str()
    quantity = fields.Int()
    @validates('country')
    def validate_country(self, country):
        if country != 'china':
            raise ValidationError('Country only is china')

schema = ItemSchema()
d = {'country': 'china1', 'quantity': '12a'}
# The messages below come from the default_error_messages overrides in the Required Fields section
loaded = schema.load(d)
print(loaded)
# UnmarshalResult(data={}, errors={'country': ['Country only is china'], 'name': [u'缺少必填数据.'], 'quantity': [u'不是合法整数.']})
errors = ItemSchema().validate(d)
print(errors)
# {'country': ['Country only is china'], 'name': [u'缺少必填数据.'], 'quantity': [u'不是合法整数.']}

##Specifying Attribute Names
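By default, a Schema serializes the attributes of the input object that have the same names as its declared fields. Sometimes the field name and the attribute name need to differ. In that case, state explicitly which attribute the field reads its value from: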

class UserSchema(Schema):
    name = fields.String()
    email_addr = fields.String(attribute="email")
    date_created = fields.DateTime(attribute="created_at")

user = User('Keith', email='[email protected]')
ser = UserSchema()
result, errors = ser.dump(user)
pprint(result)
# {'name': 'Keith',
#  'email_addr': '[email protected]',
#  'date_created': '2014-08-17T14:58:57.600623+00:00'}

##Specifying Deserialization Keys
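By default, a Schema deserializes an input dict using the same keys as its field names, the same keys that appear in its dumped output. If the incoming data uses a different key, pass load_from to name an additional key to load from. The field's own name still works and takes precedence: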

class UserSchema(Schema):
    name = fields.String()
    email = fields.Email(load_from='emailAddress')

data = {
    'name': 'Mike',
    'emailAddress': '[email protected]'
}
s = UserSchema()
result, errors = s.load(data)
#{'name': u'Mike',
# 'email': '[email protected]'}

##“Read-only” and “Write-only” Fields
You can restrict a field to dump() only or to load() only:

class UserSchema(Schema):
	name = fields.Str()
	# password is "write-only"
	password = fields.Str(load_only=True)
	# created_at is "read-only"
	created_at = fields.DateTime(dump_only=True)

##Nesting Schemas
When a model has a foreign key, how should the referenced object be declared in the schema?
For example, a Blog has a User object as its foreign key:

import datetime as dt
class User(object):
	def __init__(self, name, email):
		self.name = name
		self.email = email
		self.created_at = dt.datetime.now()
		self.friends = []
		self.employer = None

class Blog(object):
	def __init__(self, title, author):
		self.title = title
		self.author = author  # A User object

Use a Nested field to represent the foreign-key object:

from marshmallow import Schema, fields, pprint

class UserSchema(Schema):
	name = fields.String()
	email = fields.Email()
	created_at = fields.DateTime()
	
class BlogSchema(Schema):
	title = fields.Str()
	author = fields.Nested(UserSchema)

Serializing a blog now includes the user's data:

user = User(name="Monty", email="[email protected]")
blog = Blog(title="Something Completely Different", author=user)
result, errors = BlogSchema().dump(blog)
pprint(result)
# {'title': u'Something Completely Different',
#  'author': {'name': u'Monty',
#             'email': u'[email protected]',
#             'created_at': '2014-08-17T14:58:57.600623+00:00'}}

If the field holds a collection of objects, pass many=True when declaring it:

collaborators = fields.Nested(UserSchema, many=True)

If the foreign key refers back to the same schema (a self-reference), pass 'self' as the first argument to Nested.
##Specifying Which Fields to Nest
To keep only a few fields of the nested object when serializing, use the only parameter:

class BlogSchema2(Schema):
    title = fields.String()
    author = fields.Nested(UserSchema, only=["email"])

schema = BlogSchema2()
result, errors = schema.dump(blog)
pprint(result)
# {
#     'title': u'Something Completely Different',
#     'author': {'email': u'[email protected]'}
# }

If the field you want sits several levels deep in nested objects, use the "." operator to give its path:

class Site(object):
    def __init__(self, blog):
        self.blog = blog

class SiteSchema(Schema):
    blog = fields.Nested(BlogSchema2)

# 'xxx' values are placeholders; User.__init__ sets created_at itself
user = User(name='xxx', email='xxx')
blog = Blog(title='xxx', author=user)
site = Site(blog=blog)
schema = SiteSchema(only=['blog.author.email'])
result, errors = schema.dump(site)
pprint(result)
# {
#     'blog': {
#         'author': {'email': u'[email protected]'}
#     }
# }

##Note
If a Nested field holds a list of objects, passing only as a single field name returns just that field's value for each item in the list.

class User(object):
    def __init__(self, name, email):
        self.name = name
        self.email = email
        self.friends = []

class UserSchema(Schema):
    name = fields.Str()
    email = fields.Email()
    friends = fields.Nested('self', only='name', many=True)  # many=True here means friends is an iterable

user1 = User('1a', '[email protected]')
user2 = User('2b', '[email protected]')
user3 = User('3c', '[email protected]')
user1.friends = [user2, user3]
user2.friends = [user1, user3]
user3.friends = [user1, user2]
users = [user1, user2, user3]
# many=True here means the object passed to dump() is an iterable
dumped1 = UserSchema(many=True).dump(users)
print(dumped1)
#MarshalResult(data=[{u'friends': [u'2b', u'3c'], u'name': u'1a', u'email': u'[email protected]'}, {u'friends': [u'1a', u'3c'], u'name': u'2b', u'email': u'[email protected]'}, {u'friends': [u'1a', u'2b'], u'name': u'3c', u'email': u'[email protected]'}], errors={})
dumped = UserSchema().dump(user1)
print(dumped)
#MarshalResult(data={u'friends': [u'2b', u'3c'], u'name': u'1a', u'email': u'[email protected]'}, errors={})
In this case you can also use exclude to drop the fields you do not need, and the "." operator works here as well.

##Two-way Nesting
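If two objects need to reference each other, pass the nested schema's class name as a string instead of the class itself. This lets you nest a schema that has not been defined yet: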

class AuthorSchema(Schema):
    # Make sure to use the 'only' or 'exclude' params
    # to avoid infinite recursion
    books = fields.Nested('BookSchema', many=True, exclude=('author', ))
    class Meta:
        fields = ('id', 'name', 'books')

class BookSchema(Schema):
    author = fields.Nested(AuthorSchema, only=('id', 'name'))
    class Meta:
        fields = ('id', 'title', 'author')

For example, an Author has many books, and each Book has a many-to-one relationship with its Author.

from marshmallow import pprint
from mymodels import Author, Book

author = Author(name='William Faulkner')
book = Book(title='As I Lay Dying', author=author)
book_result, errors = BookSchema().dump(book)
pprint(book_result, indent=2)
# {
#   "id": 124,
#   "title": "As I Lay Dying",
#   "author": {
#     "id": 8,
#     "name": "William Faulkner"
#   }
# }
author.books = [book]
author_result, errors = AuthorSchema().dump(author)
pprint(author_result, indent=2)
# {
#   "id": 8,
#   "name": "William Faulkner",
#   "books": [
#     {
#       "id": 124,
#       "title": "As I Lay Dying"
#     }
#   ]
# }

##Nesting A Schema Within Itself

To nest a schema within itself, pass "self" (including the quotes) to the Nested constructor:

class User(object):
    def __init__(self, name, email):
        self.name= name
        self.email = email
        self.friends = []
        self.employer = None

class UserSchema(Schema):
    name = fields.Str()
    email = fields.Email()
    friends = fields.Nested('self', many=True)
    # Nesting a schema in itself (or two schemas in each other) can cause infinite recursion, so use exclude/only to avoid it
    employer = fields.Nested('self', exclude=('employer',), default=None)

user = User('Steve', '[email protected]')
user.friends.append(User('Mike', '[email protected]'))
user.friends.append(User('Joe', '[email protected]'))
user.employer = User('Dirk', '[email protected]')
result = UserSchema().dump(user)
pprint(result.data)
# {
#     "name": "Steve",
#     "email": "[email protected]",
#     "friends": [
#         {
#             "name": "Mike",
#             "email": "[email protected]",
#             "friends": [],
#             "employer": null
#         },
#         {
#             "name": "Joe",
#             "email": "[email protected]",
#             "friends": [],
#             "employer": null
#         }
#     ],
#     "employer": {
#         "name": "Dirk",
#         "email": "[email protected]",
#         "friends": []
#     }
# }

##Specifying Default Serialization/Deserialization Values

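A field can be given default values for serialization and deserialization.
missing is used during deserialization when the field is not found in the input data. Likewise, default is used during serialization when the input value is missing.
Example: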

import datetime as dt
import uuid

class UserSchema(Schema):
    id = fields.UUID(missing=uuid.uuid1)
    birthdate = fields.DateTime(default=dt.datetime(2017, 9, 29))

UserSchema().load({}).data
# {'id': UUID('337d946c-32cd-11e8-b475-0022192ed31b')}
UserSchema().dump({}).data
# {'birthdate': '2017-09-29T00:00:00+00:00'}