Here in this post, I am going to show you on how to import IMDb plain text data files available at this page into your MySQL database server. I was using this data set to perform various benchmark tests around a moderately-large database size. If you want to have a small MySQL database sample data, you can try to use Sakila or World database available at MySQL Documentation Page.
I will be using CentOS 6.3 64bit as the OS platform and presume that MySQL has already installed and running. Make sure that you are having at least 4 GB of free space in the partition that MySQL @@datadir reside to allow this long-running process (3 to 5 hours – depending on your hardware specs + server workload) to successfully complete.
We will be using a Python-based package called IMDbPy. This package requires you to have Python with some development libraries installed as well as SQLObject. Then, we will need to download all data files from IMDb mirror site: ftp://ftp.fu-berlin.de/pub/misc/movies/database/, create the database and start the import process.
Installing IMDbPy and SQLObject
1. Install required packages using yum:
$ yum install -y gcc python python-devel openssl-devel libxml2-devel libxslt-devel zlib-devel MySQL-python python-setuptools python-pip
2. Install SQLObject using Python EasyInstall:
$ easy_install -U SQLObject
3. Download IMDbPy from this page into the MySQL server, extract it and start the installation process:
$ wget http://prdownloads.sourceforge.net/imdbpy/IMDbPY-4.9.tar.gz $ tar -xzf IMDbPY-4.9.tar.gz $ cd IMDbPY-* $ python setup.py install
1. Create a directory to dump all the data files that we will download:
$ mkdir /root/data $ cd /root/data
2. Download only .gz file from the IMDb mirror site to /root/data :
$ wget -r --accept="*.gz" --no-directories --no-host-directories --level 1 ftp://ftp.fu-berlin.de/pub/misc/movies/database/
3. Create a database in MySQL called ‘imdb’, with user ‘imdb’ and password ‘imdb’. We will then GRANT the user to the designated database:
mysql> CREATE DATABASE imdb; mysql> GRANT ALL PRIVILEGES ON imdb.* TO 'imdb'@'localhost' IDENTIFIED BY 'imdb'; mysql> FLUSH PRIVILEGES;
3. Start the import process with -u and -d flag:
$ imdbpy2sql.py -d /root/data/ -u 'mysql://imdb:[email protected]/imdb'
Take note that -d is the directory of the .gz dump files are located and -u is the connection string for our MySQL database server. You can change the connection string to any of SQLObject’s supported database such as PostgreSQL, SQLite, Firebird and MAX DB. Please refer to this documentation for details.
You will see similar output as below which indicates the importing process has started:
SCANNING movies: Last Sunset (2006) (movieID: 2130001) SCANNING movies: Legend of Hell (2012) (movieID: 2140001) SCANNING movies: Lifestyles of Squirrels (2011) (movieID: 2150001) SCANNING movies: Los signos del tiempo (1983) (movieID: 2160001) SCANNING movies: Madame T (2012) (movieID: 2170001) SCANNING movies: Marijji ringu (2007) (movieID: 2180001) SCANNING movies: Menculik miyabi (2010) (movieID: 2190001) * FLUSHING MoviesCache...
Wait up until it finish and you will have large sample data to play around in your MySQL server!