Kansas Injection Well Data

In [77]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import os

inj_datadir = 'c:\MyDocs/sandbox/data/datasets/FracFocus/KS_injection_wells/'
outdir = 'c:\MyDocs/sandbox/data/datasets/FracFocus/output/'

Using previous FracTracker formats as a template

Quantity reported is "FLUID_INJECTED" from the original Kansas data files. There are ~500 "GAS_INJECTED" records that are not incorportated here....

In [48]:
def fetch_raw_KS_injection(indir=inj_datadir):
    # it is assumed that all files in the directory are
    # to be processed.
    flist = os.listdir(indir)
    if len(flist)<1 :
        raise Exception('No files to process!')
    for f in flist:
        if f[-4:] != '.txt':
            raise Exception('Some non-txt files present in input directory')
    dlst = []
    for fn in flist:
        print(f'Processing: {fn}')
        d = pd.read_csv(indir+fn,low_memory=False)
        d = d.filter(['API','KGS_ID','YEAR','MONTH','FLUID_INJECTED']) # drop most columns
        dlst.append(d)
        print(f'    length of data from this file: {len(d)}') 
    df_inj = pd.concat(dlst,sort=True)
    print(f'Total length of dataframe: {len(df_inj)}')
    return df_inj
In [49]:
d = fetch_raw_KS_injection()
Processing: KS_UIC_archive.txt
    length of data from this file: 597856
Processing: KS_UIC_archive2016.txt
    length of data from this file: 197674
Processing: KS_UIC_archive2017.txt
    length of data from this file: 194857
Total length of dataframe: 990387
In [50]:
d.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 990387 entries, 0 to 194856
Data columns (total 5 columns):
API               990307 non-null object
FLUID_INJECTED    990387 non-null int64
KGS_ID            990387 non-null int64
MONTH             990387 non-null int64
YEAR              990387 non-null int64
dtypes: int64(4), object(1)
memory usage: 45.3+ MB

Create string to label the period

In [51]:
d['YearMo'] = 'Y' + d.YEAR.astype('str').str[:] + 'M' + d.MONTH.astype('str').str.zfill(2)[:]
d.head()
Out[51]:
API FLUID_INJECTED KGS_ID MONTH YEAR YearMo
0 15-003-21733-0001 1300 1042538729 1 2016 Y2016M01
1 15-003-21733-0001 1375 1042538729 2 2016 Y2016M02
2 15-003-21733-0001 1410 1042538729 3 2016 Y2016M03
3 15-003-21733-0001 1350 1042538729 4 2016 Y2016M04
4 15-003-21733-0001 1295 1042538729 5 2016 Y2016M05
In [52]:
print(len(d.KGS_ID.unique()))
print(len(d.API.unique()))
25424
25387

NOTE: There are more KGS_IDs than API numbers. So some wells are labeled with 1 API# but more than one KGS_ID. When compiling these data, we will keep only the first KGS_ID.

In [53]:
kgs = pd.DataFrame(d.groupby(['API'],as_index=False)['KGS_ID'].first())
kgs.head()
Out[53]:
API KGS_ID
0 15-001-01023-0001 1041771184
1 15-001-01024-0001 1041771204
2 15-001-01025-0001 1041103386
3 15-001-01027-0001 1041771629
4 15-001-01332 1005454437

Although in the groupby below, there should only be a single number (one measure for any Year/Mo combination), some wells apparently have more than a single record:

In [60]:
test = d.groupby(['YearMo','API'],as_index=False)['FLUID_INJECTED'].count()
test.FLUID_INJECTED.unique()
Out[60]:
array([1, 2, 3, 4, 5, 6, 8], dtype=int64)

So we will treat them as partial measures and add all records for a given period.

In [72]:
gb = d.groupby(['API','YearMo'],as_index=False)['FLUID_INJECTED'].sum()
gb.head()
Pivot =gb.pivot(index='API',columns='YearMo')
Pivot.columns = Pivot.columns.get_level_values(1) # snags YearMo as the column name

Pivot.head()
Out[72]:
YearMo Y1910M00 Y1912M00 Y1918M00 Y1921M00 Y1922M00 Y1928M00 Y1977M00 Y1981M00 Y1982M00 Y1983M00 ... Y2017M03 Y2017M04 Y2017M05 Y2017M06 Y2017M07 Y2017M08 Y2017M09 Y2017M10 Y2017M11 Y2017M12
API
15-001-01023-0001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15-001-01024-0001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15-001-01025-0001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15-001-01027-0001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 83256.0 81254.0 30982.0 75894.0 39412.0 73293.0 87850.0 41367.0 102965.0 111937.0
15-001-01332 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 128 columns

Get the KGS_ID back in there...

In [75]:
mg = pd.merge(kgs,Pivot,on='API',validate='1:m',how='right')
mg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25386 entries, 0 to 25385
Columns: 130 entries, API to Y2017M12
dtypes: float64(128), int64(1), object(1)
memory usage: 25.4+ MB

save the file

In [79]:
last = d.YearMo.max()
fn = outdir+'KS_Injection_thru_'+last+'.csv'
mg.to_csv(fn)
print(fn)
c:\MyDocs/sandbox/data/datasets/FracFocus/output/KS_Injection_thru_Y2017M12.csv

NOTES:

  • Columns with a month = 0 seem to be annual records; monthly ones start in 2011.
  • Even though there are monthly records after that, there is a remarkable regularity in the numbers both within a well and across wells that suggests that there is after-the-fact splitting of numbers to spread across the months and wells.
  • Of the 25000 wells in this set, many wells have no injections (at least of Fluid...): over 10000 wells. The same was true for the 2016 report.
  • The cells for which no report was made have been kept blank and not made '0' to allow inspection of the cells that were intentionally made zero.
In [83]:
t = pd.DataFrame(d.groupby('API')['FLUID_INJECTED'].sum())
t[t.FLUID_INJECTED==0].count()
Out[83]:
FLUID_INJECTED    10026
dtype: int64
In [85]:
old = pd.read_csv(outdir+'KansasInject_old_from_Ted.csv')