# Extracting Data From Elasticsearch With Python (Scan API)
## Executive Summary
Sometimes you need an easy way to save the full contents of an index out to disk. The Elasticsearch Python client ships a helper API that makes this really easy.
## helpers.scan
The code below illustrates how to leverage this capability. At a high level the steps are:

* Import the required packages
* Set up some environment variables
* Create the scan iterator
* Write all the data from the iterator to disk
```python
## Load in libraries
import json

from elasticsearch import Elasticsearch, helpers

## Set variables
elasticProtocol = 'http'
elastichost = 'localhost'
elasticPrefix = 'elasticsearch'
elasticport = '9200'
elasticUser = 'user'
elasticPassword = 'password'
elasticIndex = 'my-index'
actions = []
fileRecordCount = 160000
fileCounter = 0

## Generate an RFC-1738 formatted URL
elasticURL = '%s://%s:%s@%s:%s/%s' % (elasticProtocol, elasticUser, elasticPassword, elastichost, elasticport, elasticPrefix)

## Create a connection to Elasticsearch
es = Elasticsearch([elasticURL], verify_certs=True)

## Create the scan iterator
output = helpers.scan(es,
    index=elasticIndex,
    size=1000,  ## batch size per scroll request; this can be increased
    query={"query": {"match_all": {}}},
)

## Write everything out to disk, fileRecordCount records per file
for record in output:
    actions.append(record['_source'])
    if len(actions) >= fileRecordCount:
        with open(elasticIndex + '-extract-' + str(fileCounter) + '.json', 'w') as f:
            json.dump(actions, f, ensure_ascii=False, indent=4, sort_keys=True)
        actions = []
        print('file ' + str(fileCounter) + ' written')
        fileCounter = fileCounter + 1

## Flush any remaining records to a final file
if len(actions) > 0:
    with open(elasticIndex + '-extract-' + str(fileCounter) + '.json', 'w') as f:
        json.dump(actions, f, ensure_ascii=False, indent=4, sort_keys=True)
    print('file ' + str(fileCounter) + ' written')
```