Commit afa9481

IMDB: Update import script, source files and dump (#12)
* Update script and source files
  - Removed duplicate edges
  - Removed empty fields ("")
  - Removed a dummy edge
  - Converted releaseDate to number, corrected timezone (was apparently assuming Europe/Berlin instead of UTC)
  - Fixed 13 releaseDates manually
  - Make the script work with ArangoDB 3.7/3.8
  - Don't drop edge _key anymore (duplicates removed in source file)
  - Add type: "Genre" to genre vertices
  - Disable creation of fulltext indexes
* Update README
* Change source file extensions from .json to .jsonl
* Update dump
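
The kind of source-file cleanup listed above (empty fields removed, duplicate edges removed, one JSON document per line in the `.jsonl` files) could look roughly like the following. This is a minimal sketch, not the repository's actual script: the helper names `parseJsonl`, `stripEmptyFields`, `fingerprint` and `dedupe` are hypothetical, the sample edges are made up, and treating two edges as duplicates when they match in every attribute except `_key` is an assumption.

```javascript
// Parse JSON Lines text: one JSON document per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Drop attributes whose value is the empty string "".
function stripEmptyFields(doc) {
  return Object.fromEntries(
    Object.entries(doc).filter(([, value]) => value !== "")
  );
}

// Assumption: two edges are duplicates if all attributes except _key match.
// Keys are sorted so that attribute order does not affect the comparison.
function fingerprint(doc) {
  const { _key, ...rest } = doc;
  return JSON.stringify(Object.keys(rest).sort().map((k) => [k, rest[k]]));
}

// Keep only the first document for each fingerprint.
function dedupe(docs) {
  const seen = new Set();
  return docs.filter((doc) => {
    const fp = fingerprint(doc);
    if (seen.has(fp)) return false;
    seen.add(fp);
    return true;
  });
}

// Made-up sample edges, not taken from the dataset.
const sampleText = [
  '{"_key":"1","_from":"imdb_vertices/a","_to":"imdb_vertices/b","note":""}',
  '{"_key":"2","_from":"imdb_vertices/a","_to":"imdb_vertices/b"}',
  '{"_key":"3","_from":"imdb_vertices/a","_to":"imdb_vertices/c"}',
].join("\n");

const cleaned = dedupe(parseJsonl(sampleText).map(stripEmptyFields));
console.log(cleaned.map((doc) => doc._key)); // [ '1', '3' ]
```

Empty fields are stripped before deduplication so that two edges differing only by an empty attribute are recognised as the same edge.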
1 parent d3711aa commit afa9481

15 files changed, +350413 -563841 lines changed

Graphs/IMDB/README.md

+48 -17
@@ -1,28 +1,59 @@
-This dataset is taken from the IMDB service http://www.imdb.com.
+# IMDB Movies and Actors Graph Dataset
 
-Before you start please make sure you do not have collections named:
-**imdb_vertices**, **imdb_edges**
-and you do not have a graph called
-**imdb**
-These will be overwritten by the importer.
+This dataset is taken from the Internet Movie Database <http://www.imdb.com>.
 
+The dataset has two collections:
+- **imdb_vertices**:
+Containing all Movies, Actors, Directors etc. and Genres of movies that can
+be used for traversal.
+- **imdb_edges**:
+Containing the relations between the vertices, who-acts-where,
+movie-has-genre etc.
 
-To import the data execute the following on your bash:
+Furthermore, it comes with a named graph **imdb** using both collections.
 
-```Bash
+## Remarks
+
+The attributes `birthday` and `lastModified` are Unix timestamps in milliseconds,
+but they are stored as strings. `releaseDate` is stored as number.
+
+## Restore from dump
+
+To restore the data from the dump into a new database `IMDB`, execute the
+following on your bash:
+
+```bash
+unix> arangorestore --server.endpoint tcp://<host>:<port> --server.database IMDB --create-database --include-system-collections
+```
+
+To restore the dump with default settings you may run the following instead:
+
+```bash
 unix> ./import.sh
 ```
 
-You will then have the following two collections:
-* **imdb_vertices** Containing all Movies, Actors, Directores etc. and Genres of movies that can be used for traversal.
-* **imdb_edges** Containing the relations between the vertices, who-acts-where, movie-has-genre etc.
+## Import from source files
+
+Before you start, please make sure you do not have **collections** named:
+- **imdb_vertices**
+- **imdb_edges**
+and that you do not have a **graph** called:
+- **imdb**
+
+These will be overwritten by the importer!
+
+To import the data from the `nodes.json` and `edges.json` source files into the
+`_system` collection, run this on your bash:
+
+```bash
+unix> arangosh --server.endpoint tcp://<host>:<port> --javascript.execute ./import.js
+```
 
-Furthermore you have the graph **imdb** using both collections.
+## Create new dump
 
-The above import uses dumps of the collections. In order to recreate the collections from
-the source file `hero-comic-network.csv`, you can use
+To create a new dump of the imported data including the two collections and
+the graph, run:
 
-```Bash
-unix> cat import.js | arangosh
-unix> arangodump --collection imdb_edges --collection imdb_vertices
+```bash
+unix> arangodump --server.endpoint tcp://<host>:<port> --server.database <database> --include-system-collections --collection _graphs --collection imdb_vertices --collection imdb_edges --compress-output false --envelope true
 ```
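
The Remarks section of the new README says that `birthday` and `lastModified` are millisecond Unix timestamps stored as strings, while `releaseDate` is a number. A consumer of the data might normalise both forms with a small helper like the one below. This is a sketch under stated assumptions: `toIsoDate` is a hypothetical name, and the sample document is made up rather than taken from the dataset.

```javascript
// Hypothetical helper: convert a millisecond Unix timestamp, given either as
// a numeric string (birthday, lastModified) or as a number (releaseDate),
// into an ISO 8601 UTC string. Returns null for anything unparseable.
function toIsoDate(value) {
  if (typeof value === "string") {
    if (value.trim() === "") return null; // Number("") would silently be 0
    value = Number(value);
  }
  if (typeof value !== "number" || !Number.isFinite(value)) return null;
  const date = new Date(value);
  if (Number.isNaN(date.getTime())) return null; // outside the valid Date range
  return date.toISOString();
}

// Made-up sample document, not taken from the dataset.
const sample = { birthday: "0", lastModified: "1000", releaseDate: 86400000 };

console.log(toIsoDate(sample.birthday));    // 1970-01-01T00:00:00.000Z
console.log(toIsoDate(sample.releaseDate)); // 1970-01-02T00:00:00.000Z
```

Formatting in UTC avoids the kind of timezone shift that the commit message mentions for `releaseDate`.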

Graphs/IMDB/dump/ENCRYPTION

+1
@@ -0,0 +1 @@
+none

@@ -0,0 +1 @@
+{"type":2300,"data":{"_key":"imdb","_id":"_graphs/imdb","_rev":"_cPlZM_e---","edgeDefinitions":[{"collection":"imdb_edges","from":["imdb_vertices"],"to":["imdb_vertices"]}],"orphanCollections":[]}}

@@ -0,0 +1 @@
+{"indexes":[],"parameters":{"allowUserKeys":true,"cacheEnabled":false,"cid":"163325","deleted":false,"globallyUniqueId":"_graphs","id":"163325","isDisjoint":false,"isSmart":false,"isSmartChild":false,"isSystem":true,"keyOptions":{"allowUserKeys":true,"type":"traditional","lastValue":0},"minReplicationFactor":1,"minRevision":"0","name":"_graphs","numberOfShards":1,"planId":"163325","replicationFactor":1,"schema":null,"shardKeys":["_key"],"shards":{},"status":3,"syncByRevision":true,"type":2,"usesRevisionsAsDocumentIds":true,"version":9,"waitForSync":false,"writeConcern":1}}

Graphs/IMDB/dump/dump.json

+1
@@ -0,0 +1 @@
+{"database":"IMDB","lastTickAtDumpStart":"356559","useEnvelope":true,"properties":{"id":"163323","name":"IMDB","isSystem":false}}

Graphs/IMDB/dump/imdb_edges.data.json

-225,060
This file was deleted.

Graphs/IMDB/dump/imdb_edges.structure.json

-1
This file was deleted.

Graphs/IMDB/dump/imdb_vertices.data.json renamed to Graphs/IMDB/dump/imdb_edges_f4353381bd511b59d5cccbd7f7db840b.data.json

+118,325 -63,027
Large diffs are not rendered by default.

@@ -0,0 +1 @@
+{"indexes":[],"parameters":{"allowUserKeys":true,"cacheEnabled":false,"cid":"163369","deleted":false,"globallyUniqueId":"hABC5EBAFC11E/163369","id":"163369","isDisjoint":false,"isSmart":false,"isSmartChild":false,"isSystem":false,"keyOptions":{"allowUserKeys":true,"type":"traditional","lastValue":249839},"minReplicationFactor":1,"minRevision":"_cPlZM-u---","name":"imdb_edges","numberOfShards":1,"planId":"163369","replicationFactor":1,"schema":null,"shardKeys":["_key"],"shards":{},"status":3,"syncByRevision":true,"type":3,"usesRevisionsAsDocumentIds":true,"version":9,"waitForSync":false,"writeConcern":1}}

Graphs/IMDB/dump/imdb_vertices.structure.json

-1
This file was deleted.
