Recently, I have been collecting and processing a large amount of data, which usually involves converting XML/text format data into relational, structured data for easy export and direct analysis, especially formats that can be directly analyzed by pandas (csv/json). Therefore, I need to use a mysql processing library in Python. Since I was looking for a processing library, I decided to go all out and choose a library that supports ORM. So, I chose SQLAlchemy. After using it for a while, I decided to write a record to remember the commonly used things during this period.
SQLAlchemy is an open-source software under the Python programming language. It provides an SQL toolkit and Object-Relational Mapping (ORM) tools, mainly to meet the following needs:
- Provide common operations on databases: add, delete, modify, and query
- Provide ORM functionality, which can turn the operating unit into an object, so there is no need to write sql statements for hard coding/decoding parsing data
Database Connection
SQLalchemy cannot support database operations, so additional database drivers need to be installed. For different databases and drivers, there are different configuration URIs, the overall format is dbms://user:pwd@host/dbname
. Here I use Mysql+mysqlconnector, the specific connection code is as follows
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
sql_connection = 'mysql+mysqlconnector://root:pwd@localhost:3306/database'
engine = create_engine(sql_connection)
# Transactions need to be executed using sessions
DBsession = sessionmaker(bind=engine)
session = DBsession()
# Close the session before ending
session.close()
Structure Definition
In SQLAlchemy, ORM binds the database model by defining objects.
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relation, sessionmaker
Base = declarative_base()
class Movie(Base):
__tablename__ = 'movies'
id = Column(Integer, primary_key=True)
title = Column(String(255), nullable=False)
year = Column(Integer)
directed_by = Column(Integer)
def __init__(self, title=None, year=None):
self.title = title
self.year = year
- The ORM model class is defined by inheriting from declarative_base and is bound to the corresponding table through
__tablename__
- Each Column attribute is a column in the table, the type needs to be set, and the type must correspond to the type of the column in the database. Common types are Integer, Float, String (corresponding to varchar), Text
- One attribute/column must be set as
primary_key
- You can create a table using
Base.metadata.create_all(engine)
Data Operation
Data is operated using ORM, and the basic unit of operation is an object (you can also directly use sql statements, but there is no need for common operations). Except for queries, operations that modify the database need to commit transactions.
Insert
m1 = Movie("Star Trek", 2009)
session.add(m1)
session.commit()
- Create the object to be inserted and add the corresponding data attributes
- Add it to the session
- Submit the session and commit the transaction
Query
movies = session.query(Movie).all() # Get all data
movie = session.query(Movie).get(movie_id) # Get a single record
filter_movie = session.query(Movie).filter(Movie.id == 1).one()
- The parameter of the query in the query is the table/object to be returned
- Filter is equivalent to the where statement, which can further filter the query results. One returns a unique row, and all returns all rows.
Update
The data that is queried out, after modifying the properties, re-submit the transaction.
movie = session.query(Movie).get(id)
movie.year = 1999
session.commit()
Delete
movie = session.query(Movie).get(id)
session.delete()
session.commit()
Some Problems
Large Batch Data Insertion
During the use process, I encountered a situation where I needed to insert more than 300,000 pieces of data into the database at once. Before this, I always commit after inserting the data (a few thousand/a few hundred), because each commit needs to be written to the database, which will be slower (limited by disk IO). But this time the number is too much, an error occurred when submitting, indicating that the mysql connection was lost. I guess it is because the writing time is too much and the time is too long, which exceeds its limit and disconnects the connection. So I chose to commit in batches, commit once after adding a certain amount of data (5000/1000/500/100/10), but it was not successful. After a certain amount of records were inserted, it indicated that a certain column of a certain inserted record was too long (even though that column of mine was Text with no length limit), which was obviously problematic. I searched for methods of batch data insertion on the Internet, but the content is basically the same, all of which are reproduced/translated from a answer on StackOverflow. Two methods were given:
- bulk_save_objects(objects)
- add_all()
Neither worked, and in the end, I adopted the method of adding and submitting one by one. Although it was a bit slow, it could work.
The reason for writing this is just to complain about the serious homogenization of content when using Google to search for some specific technical problems. Everyone is copying each other and turning around. Like this search, the original source is a question on StackOverflow, and then a website translates it into Chinese and publishes it (even if it is pioneering, it has spread to various Chinese websites. When you search, all the webpages with the same content are really disgusting.