Querying a SQLite database with Python

This project is part of a series of articles around football analytics. The work is hosted on GitHub: check it out!

GitHub link

Introduction

I have retrieved European football data from a Kaggle challenge. It was made available as a SQLite database, which is an embedded, serverless relational database management system.
In this post, I will show you how to connect to a SQLite database and query data with Python.

Approach

1 - IDE, database and Python DB API standards

In order to query data from a SQLite database (but also PostgreSQL or MySQL databases) with Python, several steps are required:

  • First, a connection is set up between the IDE and the DB, with a connect() method
  • Second, a cursor is created. It is a Python object that acts as an intermediary between a DB connection and a SQL query
  • Third, a SQL statement is passed through the cursor object
  • Fourth, data is retrieved (via a fetch method).

After a few internet searches, I could not find a clear, universal definition of what a cursor object is and how useful it can be. For now, remember that it is the intermediary between Python and the database, through which you send SQL statements and from which you fetch results. This approach (using cursors) is common to many databases (PostgreSQL, SQLite, MySQL...), so it is good to be familiar with it.
Here is a schema, found online, that summarizes things well.

Relationship between a Python IDE and a MySQL database with Python DB API
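
Here is a minimal sketch of the four steps above (the database file name "database.sqlite" is an assumption; adapt it to your own file):

import sqlite3

# 1 - set up a connection to the SQLite database (file name is an assumption)
conn = sqlite3.connect("database.sqlite")

# 2 - create a cursor object
cursor = conn.cursor()

# 3 - pass a SQL statement through the cursor
cursor.execute("SELECT sqlite_version()")

# 4 - fetch the result
print(cursor.fetchone())

# close the connection when done
conn.close()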

2 - Connection, profiling and test queries

Connection

Usually, when connecting to a DB with Python, you build a connection string, which specifies the target data source and how to connect to it (host, port, service name, username, password). Here, with SQLite, there is no such thing: the database is a single local file, so the connection is extremely straightforward.
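
For comparison, here is a sketch of both styles (the PostgreSQL parameters are placeholders, and psycopg2 is just one possible driver):

import sqlite3

# PostgreSQL-style connection: every parameter must be provided (placeholder values)
# import psycopg2
# pg_conn = psycopg2.connect(
#     host="localhost", port=5432, dbname="football",
#     user="analyst", password="secret"
# )

# SQLite: the whole "connection string" is just the path to the database file
conn = sqlite3.connect("database.sqlite")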

Data profiling

Working with a database tool such as DBeaver or Toad, you usually have access to a nice graphical interface that shows you the DB schema, its tables, its columns and a preview of the data.
As a data analyst, it is a good practice to always start by profiling your data, i.e. analyzing its content, structure and metadata. Put simply: getting to know what you are dealing with.

When profiling a database, here are a few things to look for:

  • the list of available tables
  • the number of rows in tables
  • the list of columns in tables
  • the relationships between tables
  • the column data types

Below are some useful SQLite queries and a Python code snippet to output a dataframe with column headers.

# import packages
import sqlite3
import pandas as pd

# connect to DB (cfg.db_file holds the path to the .sqlite file, defined in the project config)
conn = sqlite3.connect(cfg.db_file)

# create a cursor object
cursor = conn.cursor()

# list tables
list_tables = "SELECT name FROM sqlite_master WHERE type='table'"
tables = cursor.execute(list_tables).fetchall()

# list the columns of a table: position, name, type, not-null flag, default value, primary key indicator
# (replace table_name with an actual table)
list_columns_from_table = "PRAGMA table_info(table_name)"

# preview a few rows of data (replace table_name with an actual table)
preview_table_sample = "SELECT * FROM table_name LIMIT 10"

# output the preview as a dataframe, with column headers
# with a SQLite DB, column names are not returned with the rows:
# they must be read from cursor.description
query_output = cursor.execute(preview_table_sample).fetchall()
column_names = [col[0] for col in cursor.description]
df = pd.DataFrame(data=query_output, columns=column_names)
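
The checklist above also mentions row counts. Here is a quick sketch to get them, reusing the conn and cursor objects from the snippet above:

# count rows in every table of the database
for (table,) in cursor.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall():
    # table names cannot be bound as query parameters, hence the f-string;
    # the names come from sqlite_master, so this is safe here
    n_rows = cursor.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f"{table}: {n_rows} rows")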

3 - Querying data with pandas

Once you are more familiar with the data you are working with and have a clearer idea of what you want to extract from your SQL database, you can proceed to the next step. The pandas library works well with SQL databases, including SQLite. It is fairly easy to query data and output it as a dataframe, using the pd.read_sql_query function.
When writing SQL queries, here are a few good practices to have in mind:

  • Avoid SELECT * statements: it saves network and disk resources and is easier to maintain (in case of added columns, renamed columns or column reordering)
  • Comment your queries: they will be easier for you to maintain over time and simpler for your colleagues if they ever need to reuse them
  • Indent your code: if you are lazy, there are free tools to help you, such as Code Beautify
  • Use aliases: when querying multiple tables, aliases make queries shorter and easier to read

# retrieving Real Madrid matches
get_rma_matches = """

    -- team_api_id = 8633 is the Real Madrid identifier
    -- home and away matches involving Real Madrid will be queried

    SELECT
        m.id,
        m.country_id,
        m.league_id,
        m.season,
        m.stage,
        m.date,
        m.match_api_id,
        m.home_team_api_id,
        ht.team_long_name AS home_team_name,
        ht.team_short_name AS home_team_acronym,
        m.away_team_api_id,
        at.team_long_name AS away_team_name,
        at.team_short_name AS away_team_acronym,
        m.home_team_goal,
        m.away_team_goal,
        m.goal,
        m.shoton,
        m.shotoff,
        m.foulcommit,
        m.card,
        m."cross",  -- cross is a SQL keyword, hence the quotes
        m.corner,
        m.possession
    FROM Match m
    LEFT JOIN Team AS ht
        ON ht.team_api_id = m.home_team_api_id
    LEFT JOIN Team AS at
        ON at.team_api_id = m.away_team_api_id
    WHERE (
        m.home_team_api_id = 8633
        OR m.away_team_api_id = 8633
    )
"""

rma_matches_df = pd.read_sql_query(get_rma_matches, conn)
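A quick sanity check of the output (a sketch; the exact row count depends on the dataset version):

# dimensions and first rows of the resulting dataframe
print(rma_matches_df.shape)
print(rma_matches_df[["season", "home_team_name", "away_team_name"]].head())
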

Conclusion

SQL and Python are very powerful, especially when combined. For a future data analyst, it is important to be comfortable with both. Here is a great resource to get hands-on experience with SQL: SQLZoo